This is an automated email from the ASF dual-hosted git repository.
jorisvandenbossche pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 788200a434 GH-40428: [Python][CI] Fix dataset partition filter tests with pandas nightly (#40429)
788200a434 is described below
commit 788200a434462325c9feff4b52203520a90694e4
Author: Joris Van den Bossche <[email protected]>
AuthorDate: Wed Mar 13 14:20:52 2024 +0100
GH-40428: [Python][CI] Fix dataset partition filter tests with pandas nightly (#40429)
### Rationale for this change
From debugging the failure, it seems that pandas changed a filter operation
so that it now sometimes preserves a RangeIndex instead of returning an integer
index (Int64Index). The conversion to Arrow depends on the index type: by
default a RangeIndex is stored as metadata only, while an integer index becomes
a column.
The tests are therefore made more robust by ensuring that the DataFrame always
has at least one non-partition column, so that whether the result is empty no
longer depends on the index.
* GitHub Issue: #40428
Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
---
python/pyarrow/tests/parquet/test_dataset.py | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/python/pyarrow/tests/parquet/test_dataset.py b/python/pyarrow/tests/parquet/test_dataset.py
index 30dae05124..47e608a140 100644
--- a/python/pyarrow/tests/parquet/test_dataset.py
+++ b/python/pyarrow/tests/parquet/test_dataset.py
@@ -107,9 +107,9 @@ def test_filters_equivalency(tempdir):
df = pd.DataFrame({
'integer': np.array(integer_keys, dtype='i4').repeat(15),
'string': np.tile(np.tile(np.array(string_keys, dtype=object), 5), 2),
- 'boolean': np.tile(np.tile(np.array(boolean_keys, dtype='bool'), 5),
- 3),
- }, columns=['integer', 'string', 'boolean'])
+ 'boolean': np.tile(np.tile(np.array(boolean_keys, dtype='bool'), 5), 3),
+ 'values': np.arange(30),
+ })
_generate_partition_directories(local, base_path, partition_spec, df)
@@ -312,9 +312,9 @@ def test_filters_inclusive_set(tempdir):
df = pd.DataFrame({
'integer': np.array(integer_keys, dtype='i4').repeat(15),
'string': np.tile(np.tile(np.array(string_keys, dtype=object), 5), 2),
- 'boolean': np.tile(np.tile(np.array(boolean_keys, dtype='bool'), 5),
- 3),
- }, columns=['integer', 'string', 'boolean'])
+ 'boolean': np.tile(np.tile(np.array(boolean_keys, dtype='bool'), 5), 3),
+ 'values': np.arange(30),
+ })
_generate_partition_directories(local, base_path, partition_spec, df)