timsaucer commented on code in PR #1015:
URL: https://github.com/apache/datafusion-python/pull/1015#discussion_r1957173874


##########
src/dataframe.rs:
##########
@@ -90,8 +91,16 @@ impl PyDataFrame {
     }
 
     fn __repr__(&self, py: Python) -> PyDataFusionResult<String> {
-        let df = self.df.as_ref().clone().limit(0, Some(10))?;
-        let batches = wait_for_future(py, df.collect())?;
+        let df = self.df.as_ref().clone();
+
+        let stream = wait_for_future(py, df.execute_stream()).map_err(py_datafusion_err)?;
+
+        let batches: Vec<RecordBatch>  = wait_for_future(
+            py,
+            stream.take(10).collect::<Vec<_>>())
+            .into_iter()
+            .collect::<Result<Vec<_>,_>>()?;
+

Review Comment:
   I ran a test, and this changes how `__repr__` behaves compared with what we 
currently have. With this change it returns the first 10 record batches, 
whereas I would expect the first 10 rows. The point of putting 
`limit(0, Some(10))` into the logical plan was to get a small sample of the 
data.
   
   I think we need to change this so that it still fixes the bug but does not 
alter the output here.
   
   I suspect we have the same problem in `_repr_html_`.
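
   To illustrate the difference between limiting by batches and limiting by 
rows, here is a minimal sketch using only the standard library. `Batch` is a 
hypothetical stand-in for `RecordBatch`, and `take_rows` and `total_rows` are 
illustrative helpers, not functions from this PR. In the real code the 
equivalent operations would presumably be `RecordBatch::num_rows` and 
`RecordBatch::slice`.

```rust
/// Hypothetical stand-in for an Arrow `RecordBatch`: a batch is just its rows.
type Batch = Vec<i64>;

/// Count the rows across all batches.
fn total_rows(batches: &[Batch]) -> usize {
    batches.iter().map(|b| b.len()).sum()
}

/// Pull batches from `stream` only until `limit` rows have been gathered,
/// truncating the last batch so the total never exceeds `limit`.
fn take_rows(mut stream: impl Iterator<Item = Batch>, limit: usize) -> Vec<Batch> {
    let mut out = Vec::new();
    let mut remaining = limit;
    while remaining > 0 {
        // Stop early if the stream runs dry before the limit is reached.
        let Some(mut batch) = stream.next() else { break };
        batch.truncate(remaining);
        remaining -= batch.len();
        if !batch.is_empty() {
            out.push(batch);
        }
    }
    out
}
```

   With 20 batches of 4 rows each, `stream.take(10)` would yield 40 rows, 
while `take_rows(stream, 10)` yields 10 rows and stops polling the stream 
after the third batch.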



##########
src/dataframe.rs:
##########
@@ -90,8 +91,16 @@ impl PyDataFrame {
     }
 
     fn __repr__(&self, py: Python) -> PyDataFusionResult<String> {
-        let df = self.df.as_ref().clone().limit(0, Some(10))?;
-        let batches = wait_for_future(py, df.collect())?;
+        let df = self.df.as_ref().clone();
+
+        let stream = wait_for_future(py, df.execute_stream()).map_err(py_datafusion_err)?;
+
+        let batches: Vec<RecordBatch>  = wait_for_future(
+            py,
+            stream.take(10).collect::<Vec<_>>())
+            .into_iter()
+            .collect::<Result<Vec<_>,_>>()?;
+

Review Comment:
   As a side note, I wonder whether we want to enhance `__repr__` to also 
check the total number of rows in the DataFrame. My guess is that we don't. 
But if we did, we could add a line to the output along the lines of 
`... {} additional rows`. A lighter-weight option would be to change the limit 
to 11, count the rows returned, show the first 10, and, if 11 came back, 
append `... and additional rows` so the end user knows they are only seeing a 
portion of the DF.
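
   A rough sketch of the lighter-weight variant, assuming the rows have 
already been formatted as strings. `render_preview` is a hypothetical helper, 
not part of the existing code. The caller is assumed to have fetched at most 
`display_limit + 1` rows, for example by changing the plan to 
`limit(0, Some(11))`.

```rust
/// Render at most `display_limit` rows and append a note when the caller
/// fetched more than that, which signals that the DataFrame has further rows.
/// `fetched` is assumed to hold up to `display_limit + 1` formatted rows.
fn render_preview(fetched: &[String], display_limit: usize) -> String {
    let mut lines: Vec<&str> = fetched
        .iter()
        .take(display_limit)
        .map(String::as_str)
        .collect();
    // The extra row is never shown; it only tells us the output is truncated.
    if fetched.len() > display_limit {
        lines.push("... and additional rows");
    }
    lines.join("\n")
}
```

   This avoids counting the whole DataFrame: one extra row is enough to tell 
whether the preview is complete.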



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

