Re: [PR] Scrollable python notebook table rendering [datafusion-python]
timsaucer commented on PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#issuecomment-2721032760 I am testing my updated consolidated code now. When running on a 10 GB scale factor TPC-H query, I get comparable times for both q1 and q2, which take 10 and 53 s on my m4 pro to run at that scale. I will next test on 1gb scale factor and then the tiny batches that were discussed in #1015  One metric is comparing `df.show()` with `df.__repr__()`. The former calls the previous code essentially. The latter is the updated call. I also tested against main to find comparable values. For q2 for example: df.show(): 53.881720781326294 df.__repr__(): 52.33351922035217 When dropping down to a 1GB data set df.show() took: 0.8244500160217285 df.__repr__() took 0.8161180019378662 The same 1GB against main df.show() took: 0.8473942279815674 df.__repr__() took 0.8100850582122803 Finally, for the tiny dataset (increased to 3 record batches so we do get multiple processing steps) Average runtime over 100 runs: 0.001016 seconds (this branch) Average runtime over 100 runs: 0.001011 seconds (main) And lastly, to verify it also resolves #1014 https://github.com/user-attachments/assets/1a0ccd6a-ad0c-475f-a6ca-da1758930423"; /> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Scrollable python notebook table rendering [datafusion-python]
timsaucer commented on PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#issuecomment-2719319267 Still needs careful testing and performance evaluation, but trying to pull in methods for this issue, #1041 and #1015 into a single PR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Scrollable python notebook table rendering [datafusion-python]
kevinjqliu commented on code in PR #1036:
URL:
https://github.com/apache/datafusion-python/pull/1036#discussion_r1979772203
##
src/dataframe.rs:
##
@@ -100,46 +106,153 @@ impl PyDataFrame {
}
fn _repr_html_(&self, py: Python) -> PyDataFusionResult {
-let mut html_str = "\n".to_string();
-
-let df = self.df.as_ref().clone().limit(0, Some(10))?;
-let batches = wait_for_future(py, df.collect())?;
-
+let (batches, mut has_more) =
+wait_for_future(py,
get_first_few_record_batches(self.df.as_ref().clone()))?;
+let Some(batches) = batches else {
+return Ok("No data to display".to_string());
+};
if batches.is_empty() {
-html_str.push_str("\n");
-return Ok(html_str);
+// This should not be reached, but do it for safety since we index
into the vector below
+return Ok("No data to display".to_string());
}
+let table_uuid = uuid::Uuid::new_v4().to_string();
+
+let mut html_str = "
+
+.expandable-container {
+display: inline-block;
+max-width: 200px;
+}
+.expandable {
+white-space: nowrap;
+overflow: hidden;
+text-overflow: ellipsis;
+display: block;
+}
+.full-text {
+display: none;
+white-space: normal;
+}
+.expand-btn {
+cursor: pointer;
+color: blue;
+text-decoration: underline;
+border: none;
+background: none;
+font-size: inherit;
+display: block;
+margin-top: 5px;
+}
+
+
+
+
+\n".to_string();
+
let schema = batches[0].schema();
let mut header = Vec::new();
for field in schema.fields() {
-header.push(format!("{}", field.name()));
+header.push(format!("{}", field.name()));
}
let header_str = header.join("");
-html_str.push_str(&format!("{}\n", header_str));
+html_str.push_str(&format!("{}\n",
header_str));
+
+let batch_formatters = batches
+.iter()
+.map(|batch| {
+batch
+.columns()
+.iter()
+.map(|c| ArrayFormatter::try_new(c.as_ref(),
&FormatOptions::default()))
+.map(|c| {
+c.map_err(|e| PyValueError::new_err(format!("Error:
{:?}", e.to_string(
+})
+.collect::, _>>()
+})
+.collect::, _>>()?;
+
+let total_memory: usize = batches
+.iter()
+.map(|batch| batch.get_array_memory_size())
+.sum();
+let rows_per_batch = batches.iter().map(|batch| batch.num_rows());
+let total_rows = rows_per_batch.clone().sum();
+
+// let (total_memory, total_rows) = batches.iter().fold((0, 0), |acc,
batch| {
+// (acc.0 + batch.get_array_memory_size(), acc.1 +
batch.num_rows())
+// });
Review Comment:
nit remove commented out code
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Scrollable python notebook table rendering [datafusion-python]
timsaucer commented on PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#issuecomment-2692325079 I need to update the unit test. I don't think the just validating it hasn't changed is exactly what we need, so I may think about it some more. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
