[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

via GitHub Tue, 07 Feb 2023 11:55:24 -0800


jorisvandenbossche commented on code in PR #33969:
URL: https://github.com/apache/arrow/pull/33969#discussion_r1099132752



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1925,3 +1925,20 @@ def test_write_to_dataset_kwargs_passed(tempdir, 
write_dataset_kwarg):
         pq.write_to_dataset(table, path, **{key: arg})
         _name, _args, kwargs = mock_write_dataset.mock_calls[0]
         assert kwargs[key] == arg
+
+
+def test_partition_schema_validity(tempdir):

Review Comment:
   Can you move this tests to `tests/test_dataset.py`? (this file is 
specifically for the ParquetDataset implemention part of `pyarrow.parquet`)



##########
python/pyarrow/dataset.py:
##########
@@ -437,7 +437,7 @@ def _filesystem_dataset(source, schema=None, 
filesystem=None,
     FileSystemDataset
     """
     format = _ensure_format(format or 'parquet')
-    partitioning = _ensure_partitioning(partitioning)
+    partitioning = _ensure_partitioning(partitioning, schema)

Review Comment:
   This doesn't seem right to me, since the user-specified `schema` is meant to 
be the full dataset schema, while the `partitioning` functions takes the schema 
of the only the partitioning (although it is then a bit surprising that no 
tests fails). 
   
   Is this change needed to fix the issue (apart from the fix in C++)? 
   (on first sight, it doesn't seem needed to have the python test you added 
passing)
   



##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -5084,10 +5084,8 @@ def test_dataset_partition_with_slash(tmpdir):
     read_table = ds.dataset(
         source=path,
         format='ipc',
-        partitioning='hive',
-        schema=pa.schema([pa.field("exp_id", pa.int32()),
-                          pa.field("exp_meta", pa.utf8())])
-    ).to_table().combine_chunks()
+        schema=dt_table.schema,

Review Comment:
   I am a trying to gauge the diff here, but did something actually change? It 
seems that the schema that was specified manually before should be identical to 
`dt_table.schema`, and so this change is just a simplification? 
   But what is then the "bug when the schema was specified" you are mentioning?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #33969: GH-33668: [Python] Reading flat dataset with partitioning="hive" results in partition schema equal to dataset schema

Reply via email to