This is an automated email from the ASF dual-hosted git repository.

jorisvandenbossche pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new 788200a434 GH-40428: [Python][CI] Fix dataset partition filter tests with pandas nightly (#40429)
788200a434 is described below

commit 788200a434462325c9feff4b52203520a90694e4
Author: Joris Van den Bossche <[email protected]>
AuthorDate: Wed Mar 13 14:20:52 2024 +0100

    GH-40428: [Python][CI] Fix dataset partition filter tests with pandas nightly (#40429)
    
    ### Rationale for this change
    
    From debugging the failure, it seems this is caused by pandas changing a filter operation so that it now sometimes preserves a RangeIndex instead of returning an integer index (Int64Index in older pandas). The conversion to Arrow differs depending on that: by default a RangeIndex is stored as metadata only, while an integer index becomes a column.
    
    Therefore, make the tests more robust by ensuring the DataFrame always has at least one non-partition column, so that whether the result is empty no longer depends on the index.
    
    * GitHub Issue: #40428
    
    Authored-by: Joris Van den Bossche <[email protected]>
    Signed-off-by: Joris Van den Bossche <[email protected]>
---
 python/pyarrow/tests/parquet/test_dataset.py | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/python/pyarrow/tests/parquet/test_dataset.py b/python/pyarrow/tests/parquet/test_dataset.py
index 30dae05124..47e608a140 100644
--- a/python/pyarrow/tests/parquet/test_dataset.py
+++ b/python/pyarrow/tests/parquet/test_dataset.py
@@ -107,9 +107,9 @@ def test_filters_equivalency(tempdir):
     df = pd.DataFrame({
         'integer': np.array(integer_keys, dtype='i4').repeat(15),
         'string': np.tile(np.tile(np.array(string_keys, dtype=object), 5), 2),
-        'boolean': np.tile(np.tile(np.array(boolean_keys, dtype='bool'), 5),
-                           3),
-    }, columns=['integer', 'string', 'boolean'])
+        'boolean': np.tile(np.tile(np.array(boolean_keys, dtype='bool'), 5), 3),
+        'values': np.arange(30),
+    })
 
     _generate_partition_directories(local, base_path, partition_spec, df)
 
@@ -312,9 +312,9 @@ def test_filters_inclusive_set(tempdir):
     df = pd.DataFrame({
         'integer': np.array(integer_keys, dtype='i4').repeat(15),
         'string': np.tile(np.tile(np.array(string_keys, dtype=object), 5), 2),
-        'boolean': np.tile(np.tile(np.array(boolean_keys, dtype='bool'), 5),
-                           3),
-    }, columns=['integer', 'string', 'boolean'])
+        'boolean': np.tile(np.tile(np.array(boolean_keys, dtype='bool'), 5), 3),
+        'values': np.arange(30),
+    })
 
     _generate_partition_directories(local, base_path, partition_spec, df)
 
