[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

GitBox Mon, 04 Apr 2022 08:41:17 -0700


jorisvandenbossche commented on code in PR #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r841886153



##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -569,6 +570,22 @@ def test_partitioning():
         with pytest.raises(pa.ArrowInvalid):
             partitioning.parse(shouldfail)
 
+    partitioning = ds.FilenamePartitioning(
+        pa.schema([
+            pa.field('group', pa.int64()),
+            pa.field('key', pa.float64())
+        ])
+    )
+    assert partitioning.dictionaries is None

Review Comment:
   Ah, so it's for the case where only a subset of your fields would be 
dictionary encoded, I see. 
   
   Now, in that case returning a shorter list can also be a bit 
confusing: to get the dictionaries for a certain partition key, you would need 
to count how many dictionary-encoded columns are present in the schema before 
the specific key you are looking for.  
   Another option could be to always return a list of the same length as the 
number of fields in the schema, but with `None` entries for keys that are 
not dictionary encoded?  (e.g. `[None, pa.array(["first", "second", "third"])]` 
for the specific case in the tests)
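
To make the difference concrete, here is a rough sketch of the two lookups in plain Python. Everything in it is an assumption for illustration: the field names, the `is_dictionary` flags, and the helper functions are hypothetical, and plain lists stand in for the `pa.array` dictionaries. It is not the actual `dictionaries` implementation.

```python
from typing import List, Optional, Sequence

# Hypothetical partitioning schema: "group" is a plain field, "key" is
# dictionary encoded. These names and flags are assumptions for illustration.
FIELD_NAMES = ["group", "key"]
IS_DICTIONARY = [False, True]

# Plain list standing in for pa.array(["first", "second", "third"]).
KEY_VALUES = ["first", "second", "third"]

# Compact form: one entry per dictionary-encoded field only.
COMPACT = [KEY_VALUES]

# Aligned form: one entry per schema field, None where not dictionary encoded.
ALIGNED = [None, KEY_VALUES]


def lookup_compact(
    name: str,
    field_names: Sequence[str],
    is_dictionary: Sequence[bool],
    dictionaries: Sequence[List[str]],
) -> Optional[List[str]]:
    """Find a field's dictionary in the compact list.

    The caller has to count the dictionary-encoded fields that come before
    the requested one to work out its position in the shorter list.
    """
    index = field_names.index(name)
    if not is_dictionary[index]:
        return None
    position = sum(1 for flag in is_dictionary[:index] if flag)
    return dictionaries[position]


def lookup_aligned(
    name: str,
    field_names: Sequence[str],
    dictionaries: Sequence[Optional[List[str]]],
) -> Optional[List[str]]:
    """Find a field's dictionary in the aligned list: same index as the schema."""
    return dictionaries[field_names.index(name)]


compact_result = lookup_compact("key", FIELD_NAMES, IS_DICTIONARY, COMPACT)
aligned_result = lookup_aligned("key", FIELD_NAMES, ALIGNED)
```

With the aligned form the schema index can be used directly, and a `None` entry tells the caller that the field is not dictionary encoded.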



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

