Re: [PR] Scrollable python notebook table rendering [datafusion-python]

2025-03-13 Thread via GitHub


timsaucer commented on PR #1036:
URL: 
https://github.com/apache/datafusion-python/pull/1036#issuecomment-2721032760

   I am testing my updated consolidated code now. When running on a 10 GB scale 
factor TPC-H query, I get comparable times for both q1 and q2, which take 10 
and 53 s on my m4 pro to run at that scale. I will next test on 1gb scale 
factor and then the tiny batches that were discussed in #1015 
   
   
![render-with-limited-returns](https://github.com/user-attachments/assets/3544ce81-8bbf-4cf4-9a64-afc193c5fb8c)
   
   One metric is comparing `df.show()` with `df.__repr__()`. The former calls 
the previous code essentially. The latter is the updated call. I also tested 
against main to find comparable values.
   
   For q2 for example:
   
   df.show():  53.881720781326294
   df.__repr__(): 52.33351922035217
   
   
   When dropping down to a 1GB data set
   
   df.show() took:  0.8244500160217285
   df.__repr__() took 0.8161180019378662
   
   The same 1GB against main
   
   df.show() took:  0.8473942279815674
   df.__repr__() took 0.8100850582122803
   
   Finally, for the tiny dataset (increased to 3 record batches so we do get 
multiple processing steps)
   
   Average runtime over 100 runs: 0.001016 seconds (this branch)
   Average runtime over 100 runs: 0.001011 seconds (main)
   
   And lastly, to verify it also resolves #1014
   
   https://github.com/user-attachments/assets/1a0ccd6a-ad0c-475f-a6ca-da1758930423";
 />
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Scrollable python notebook table rendering [datafusion-python]

2025-03-12 Thread via GitHub


timsaucer commented on PR #1036:
URL: 
https://github.com/apache/datafusion-python/pull/1036#issuecomment-2719319267

   Still needs careful testing and performance evaluation, but trying to pull 
in methods for this issue, #1041 and #1015 into a single PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Scrollable python notebook table rendering [datafusion-python]

2025-03-04 Thread via GitHub


kevinjqliu commented on code in PR #1036:
URL: 
https://github.com/apache/datafusion-python/pull/1036#discussion_r1979772203


##
src/dataframe.rs:
##
@@ -100,46 +106,153 @@ impl PyDataFrame {
 }
 
 fn _repr_html_(&self, py: Python) -> PyDataFusionResult {
-let mut html_str = "\n".to_string();
-
-let df = self.df.as_ref().clone().limit(0, Some(10))?;
-let batches = wait_for_future(py, df.collect())?;
-
+let (batches, mut has_more) =
+wait_for_future(py, 
get_first_few_record_batches(self.df.as_ref().clone()))?;
+let Some(batches) = batches else {
+return Ok("No data to display".to_string());
+};
 if batches.is_empty() {
-html_str.push_str("\n");
-return Ok(html_str);
+// This should not be reached, but do it for safety since we index 
into the vector below
+return Ok("No data to display".to_string());
 }
 
+let table_uuid = uuid::Uuid::new_v4().to_string();
+
+let mut html_str = "
+
+.expandable-container {
+display: inline-block;
+max-width: 200px;
+}
+.expandable {
+white-space: nowrap;
+overflow: hidden;
+text-overflow: ellipsis;
+display: block;
+}
+.full-text {
+display: none;
+white-space: normal;
+}
+.expand-btn {
+cursor: pointer;
+color: blue;
+text-decoration: underline;
+border: none;
+background: none;
+font-size: inherit;
+display: block;
+margin-top: 5px;
+}
+
+
+
+
+\n".to_string();
+
 let schema = batches[0].schema();
 
 let mut header = Vec::new();
 for field in schema.fields() {
-header.push(format!("{}", field.name()));
+header.push(format!("{}", field.name()));
 }
 let header_str = header.join("");
-html_str.push_str(&format!("{}\n", header_str));
+html_str.push_str(&format!("{}\n", 
header_str));
+
+let batch_formatters = batches
+.iter()
+.map(|batch| {
+batch
+.columns()
+.iter()
+.map(|c| ArrayFormatter::try_new(c.as_ref(), 
&FormatOptions::default()))
+.map(|c| {
+c.map_err(|e| PyValueError::new_err(format!("Error: 
{:?}", e.to_string(
+})
+.collect::, _>>()
+})
+.collect::, _>>()?;
+
+let total_memory: usize = batches
+.iter()
+.map(|batch| batch.get_array_memory_size())
+.sum();
+let rows_per_batch = batches.iter().map(|batch| batch.num_rows());
+let total_rows = rows_per_batch.clone().sum();
+
+// let (total_memory, total_rows) = batches.iter().fold((0, 0), |acc, 
batch| {
+// (acc.0 + batch.get_array_memory_size(), acc.1 + 
batch.num_rows())
+// });

Review Comment:
   nit remove commented out code



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Scrollable python notebook table rendering [datafusion-python]

2025-03-01 Thread via GitHub


timsaucer commented on PR #1036:
URL: 
https://github.com/apache/datafusion-python/pull/1036#issuecomment-2692325079

   I need to update the unit test. I don't think the just validating it hasn't 
changed is exactly what we need, so I may think about it some more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]