jorisvandenbossche commented on code in PR #33969: URL: https://github.com/apache/arrow/pull/33969#discussion_r1099132752
########## python/pyarrow/tests/parquet/test_dataset.py: ########## @@ -1925,3 +1925,20 @@ def test_write_to_dataset_kwargs_passed(tempdir, write_dataset_kwarg): pq.write_to_dataset(table, path, **{key: arg}) _name, _args, kwargs = mock_write_dataset.mock_calls[0] assert kwargs[key] == arg + + +def test_partition_schema_validity(tempdir): Review Comment: Can you move this tests to `tests/test_dataset.py`? (this file is specifically for the ParquetDataset implemention part of `pyarrow.parquet`) ########## python/pyarrow/dataset.py: ########## @@ -437,7 +437,7 @@ def _filesystem_dataset(source, schema=None, filesystem=None, FileSystemDataset """ format = _ensure_format(format or 'parquet') - partitioning = _ensure_partitioning(partitioning) + partitioning = _ensure_partitioning(partitioning, schema) Review Comment: This doesn't seem right to me, since the user-specified `schema` is meant to be the full dataset schema, while the `partitioning` functions takes the schema of the only the partitioning (although it is then a bit surprising that no tests fails). Is this change needed to fix the issue (apart from the fix in C++)? (on first sight, it doesn't seem needed to have the python test you added passing) ########## python/pyarrow/tests/test_dataset.py: ########## @@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir): read_table = ds.dataset( source=path, format='ipc', - partitioning='hive', - schema=pa.schema([pa.field("exp_id", pa.int32()), - pa.field("exp_meta", pa.utf8())]) - ).to_table().combine_chunks() + schema=dt_table.schema, Review Comment: I am a trying to gauge the diff here, but did something actually change? It seems that the schema that was specified manually before should be identical to `dt_table.schema`, and so this change is just a simplification? But what is then the "bug when the schema was specified" you are mentioning? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org