jorisvandenbossche commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-652493993


   @bkietz thanks for the update ensuring all uniques as dictionary values!
   
   Testing this out, I ran into an issue with HivePartitioning -> ARROW-9288 / #7608
   
   Further, a usability issue: this now creates partition expressions that use a dictionary type. That means that something like `dataset.to_table(filter=ds.field("part") == "A")`, which filters on the partition field with a plain string expression, doesn't work. This limits the usability of the option (and even with the new Python scalar stuff, it would not be easy to construct the correct expression):
   
   ```
   In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)
   
   In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet", partitioning=part)
   
   In [11]: fragment = list(dataset.get_fragments())[0]   
   
   In [12]: fragment.partition_expression  
   Out[12]: 
   <pyarrow.dataset.Expression (part == [
     "A",
     "B"
   ][0]:dictionary<values=string, indices=int32, ordered=0>)>
   
   In [13]: dataset.to_table(filter=ds.field("part") == "A") 
   ...
   ArrowNotImplementedError: cast from string
   ```
   
   It might also be an option to have the `partition_expression` use the dictionary's *value type* instead of the dictionary type?
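
To make that suggestion a bit more concrete, here is a toy sketch in plain Python. It is not pyarrow code and does not reflect how the dataset expressions are actually implemented; `DictScalar`, `strict_equal` and `decode` are hypothetical names made up for illustration. It only shows the idea: a comparison that refuses to cast fails against a dictionary-typed partition value, but succeeds once that value is decoded to its value type.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DictScalar:
    """Toy stand-in for a dictionary-typed scalar: an index into a tuple of values."""

    dictionary: Tuple[str, ...]
    index: int

    def value(self) -> str:
        # Decode to the dictionary's value type (a plain string here).
        return self.dictionary[self.index]


def strict_equal(partition_value, filter_value) -> bool:
    """Compare two values without any implicit cast (assumed behaviour for this sketch)."""
    if type(partition_value) is not type(filter_value):
        raise NotImplementedError(f"cast from {type(filter_value).__name__}")
    return partition_value == filter_value


def decode(partition_value):
    """Return the value-type scalar for a DictScalar, or the input unchanged."""
    if isinstance(partition_value, DictScalar):
        return partition_value.value()
    return partition_value


# Partition value as currently produced: dictionary ["A", "B"], index 0.
part = DictScalar(dictionary=("A", "B"), index=0)

# Filtering with a plain string against the dictionary-typed value fails.
try:
    strict_equal(part, "A")
    dictionary_typed_ok = True
except NotImplementedError:
    dictionary_typed_ok = False

# With the value type stored in the expression, the same filter works.
value_typed_ok = strict_equal(decode(part), "A")

print(dictionary_typed_ok, value_typed_ok)
```

In this toy model the decoding happens once, when the partition expression is built, so users could keep writing filters against plain strings.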



