timsaucer commented on code in PR #1015:
URL:
https://github.com/apache/datafusion-python/pull/1015#discussion_r1957173874
##########
src/dataframe.rs:
##########
@@ -90,8 +91,16 @@ impl PyDataFrame {
}
fn __repr__(&self, py: Python) -> PyDataFusionResult<String> {
- let df = self.df.as_ref().clone().limit(0, Some(10))?;
- let batches = wait_for_future(py, df.collect())?;
+ let df = self.df.as_ref().clone();
+
+ let stream = wait_for_future(py, df.execute_stream()).map_err(py_datafusion_err)?;
+
+ let batches: Vec<RecordBatch> = wait_for_future(
+ py,
+ stream.take(10).collect::<Vec<_>>())
+ .into_iter()
+ .collect::<Result<Vec<_>,_>>()?;
+
Review Comment:
I tested this, and it changes how `__repr__` behaves compared to what we
currently have. With this change it returns the first 10 record batches
instead of the first 10 rows, which is what I would expect. The idea of putting
the `limit(0, Some(10))` into the logical plan was so that you can get a small
sampling of the data.
I think we need to change this to fix the bug but also to make sure we
don't change the output here.
I suspect we have the same problem for `_repr_html_`.
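To illustrate the difference between taking 10 batches and taking 10 rows, here is a minimal std-only sketch. It is not the PR's code: `Batch` and `take_rows` are hypothetical stand-ins, with a plain `Vec<i64>` playing the role of a `RecordBatch`. The idea is to stop pulling batches once enough rows have been gathered and to truncate the last batch so the total never exceeds the limit. In the real code the truncation step would presumably slice the final `RecordBatch` instead.

```rust
/// Hypothetical stand-in for a record batch: a plain vector of row values.
type Batch = Vec<i64>;

/// Pull batches from `batches` until `max_rows` rows have been gathered.
/// The last batch is truncated so the total never exceeds `max_rows`,
/// and no further batches are requested once the limit is reached.
fn take_rows<I>(batches: I, max_rows: usize) -> Vec<Batch>
where
    I: IntoIterator<Item = Batch>,
{
    let mut iter = batches.into_iter();
    let mut out = Vec::new();
    let mut remaining = max_rows;

    while remaining > 0 {
        let mut batch = match iter.next() {
            Some(batch) => batch,
            None => break, // the stream ran out before the limit was hit
        };
        // Keep only as many rows as are still needed.
        batch.truncate(remaining);
        remaining -= batch.len();
        if !batch.is_empty() {
            out.push(batch);
        }
    }
    out
}
```

With three batches of four rows each and a limit of 10, this keeps the first two batches whole and the first two rows of the third, whereas `take(10)` on the batch stream would return all twelve rows.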
##########
src/dataframe.rs:
##########
@@ -90,8 +91,16 @@ impl PyDataFrame {
}
fn __repr__(&self, py: Python) -> PyDataFusionResult<String> {
- let df = self.df.as_ref().clone().limit(0, Some(10))?;
- let batches = wait_for_future(py, df.collect())?;
+ let df = self.df.as_ref().clone();
+
+ let stream = wait_for_future(py, df.execute_stream()).map_err(py_datafusion_err)?;
+
+ let batches: Vec<RecordBatch> = wait_for_future(
+ py,
+ stream.take(10).collect::<Vec<_>>())
+ .into_iter()
+ .collect::<Result<Vec<_>,_>>()?;
+
Review Comment:
As a side note, I wonder if we want to enhance `__repr__` to also check the
total number of rows in the DataFrame. My guess is that we don't want to do
that. But if we did, we could add a line in the return that was something
like `... {} additional rows`. A lighter-weight option would be to change the
limit to 11, get the number returned, show the first 10, and if 11 were
returned, say `... and additional rows` just so the end user knows that they
are only seeing a portion of the DF.
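The lighter-weight option could look roughly like the sketch below. It assumes the caller has already fetched at most one row more than it intends to display. `preview` and its parameters are hypothetical names for illustration, and rows are modeled as pre-formatted strings rather than Arrow data.

```rust
/// Render up to `shown` rows, one per line. The caller is assumed to have
/// fetched at most `shown + 1` rows, so a surplus row signals that the
/// DataFrame holds more data than is being displayed.
fn preview(rows: &[String], shown: usize) -> String {
    let mut lines: Vec<String> = rows.iter().take(shown).cloned().collect();
    if rows.len() > shown {
        // The exact count is unknown here; we only know there is more.
        lines.push("... and additional rows".to_string());
    }
    lines.join("\n")
}
```

Fetching 11 rows and calling `preview(&rows, 10)` prints the first 10 followed by the marker line. If 10 or fewer rows come back, the output is unchanged, and no full row count is ever computed.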
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]