[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12311: ARROW-10643: [Python] Pandas<->pyarrow roundtrip failing to recreate index for empty dataframe

GitBox Thu, 03 Feb 2022 00:49:10 -0800


jorisvandenbossche commented on a change in pull request #12311:
URL: https://github.com/apache/arrow/pull/12311#discussion_r798329894




##########
File path: python/pyarrow/pandas_compat.py
##########
@@ -623,7 +623,21 @@ def _can_definitely_zero_copy(arr):
     metadata.update(pandas_metadata)
     schema = schema.with_metadata(metadata)
 
-    return arrays, schema
+    # If dataframe is empty but with RangeIndex ->
+    # remember the length of the indexes
+    n_rows = None
+    if len(arrays) == 0 and schema is not None:

Review comment:
       Is the `schema is not None` check needed? There is an
`if schema is None` block above that creates a schema, so at this point
`schema` should never be `None`.

##########
File path: python/pyarrow/table.pxi
##########
@@ -1177,13 +1177,13 @@ cdef class RecordBatch(_PandasConvertible):
         pyarrow.RecordBatch
         """
         from pyarrow.pandas_compat import dataframe_to_arrays
-        arrays, schema = dataframe_to_arrays(
+        arrays, schema, n_rows = dataframe_to_arrays(
             df, schema, preserve_index, nthreads=nthreads, columns=columns
         )
-        return cls.from_arrays(arrays, schema=schema)
+        return cls.from_arrays(arrays, schema=schema, n_rows=n_rows)
 
     @staticmethod
-    def from_arrays(list arrays, names=None, schema=None, metadata=None):
+    def from_arrays(list arrays, names=None, schema=None, metadata=None, n_rows=None):

Review comment:
       If we keep the keyword, I would rename it to `num_rows` to keep it
consistent with the public property of a Table.
   Another option is to handle this `n_rows==0` case in `from_pandas`.
In theory it could be useful to expose the keyword here, since that would
make it possible to create an empty table with a given number of rows via the
Python API (although it is very much an esoteric use case).

##########
File path: python/pyarrow/pandas_compat.py
##########
@@ -623,7 +623,21 @@ def _can_definitely_zero_copy(arr):
     metadata.update(pandas_metadata)
     schema = schema.with_metadata(metadata)
 
-    return arrays, schema
+    # If dataframe is empty but with RangeIndex ->
+    # remember the length of the indexes
+    n_rows = None
+    if len(arrays) == 0 and schema is not None:
+        try:
+            kind = index_descriptors[0]["kind"]
+            if kind == "range":
+                start = index_descriptors[0]["start"]
+                stop = index_descriptors[0]["stop"]
+                step = index_descriptors[0]["step"]
+                n_rows = (stop - start - 1)//step + 1

Review comment:
       Is the `- 1` part needed? (In any case, it is not needed for the test
example of (0, 10, 3).)
   
   Also, a `step` can be negative, so we should take that into account, or at
least test it. It might already work, because `stop - start` will then also be
negative, so the result would still be positive.
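
   A quick way to check both questions is to compare the formula from the diff
against Python's built-in `range`, assuming a pandas `RangeIndex` has the same
length as the `range` with the same start, stop and step. This is only a
sketch: the function names below are hypothetical and not part of the PR, and
using `len(range(...))` as the reference is an assumption of this example, not
something the PR does.

```python
def n_rows_from_diff(start, stop, step):
    # Formula exactly as written in the diff under review.
    return (stop - start - 1) // step + 1


def n_rows_without_minus_one(start, stop, step):
    # Same formula with the "- 1" dropped, to see whether it matters.
    return (stop - start) // step + 1


def n_rows_reference(start, stop, step):
    # Reference length, delegating to Python's own range arithmetic
    # (handles negative steps and empty ranges).
    return len(range(start, stop, step))


# (start, stop, step) cases: the test example, an exact multiple,
# two negative steps, and an empty range.
cases = [(0, 10, 3), (0, 9, 3), (10, 0, -3), (9, 0, -3), (5, 0, 1)]

for start, stop, step in cases:
    print(
        (start, stop, step),
        "diff:", n_rows_from_diff(start, stop, step),
        "no -1:", n_rows_without_minus_one(start, stop, step),
        "reference:", n_rows_reference(start, stop, step),
    )
```

   In this sketch the `- 1` makes no difference for (0, 10, 3) but does for
(0, 9, 3), where dropping it over-counts by one. The negative-step case
(9, 0, -3) and the empty case (5, 0, 1) do not match the reference with either
variant, so those look worth covering in a test.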




