jorisvandenbossche commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649500017


   Currently, the ParquetDataset also simply uses int32 for the indices. 
   
   Now, there is a more fundamental issue I had not thought of: the actual 
dictionary of the DictionaryArray. Right now, you create a DictionaryArray with 
only the (single) value of the partition field for that specific fragment 
(because we don't keep track of all unique values of a certain partition 
level?). 
   In the Python ParquetDataset, however, we create a DictionaryArray with all 
occurring values of that partition field (including those from the other 
fragments/pieces).
   
   To illustrate with a small dataset with `part=A` and `part=B` directories:
   
   ```python
   In [1]: import pyarrow.dataset as ds
   
   In [6]: part = 
ds.HivePartitioning.discover(max_partition_dictionary_size=-1)   
   
   In [9]: dataset = ds.dataset("test_partitioned/", format="parquet", 
partitioning=part) 
   
   In [10]: fragment = list(dataset.get_fragments())[0] 
   
   In [11]: fragment.to_table(schema=dataset.schema) 
   Out[11]: 
   pyarrow.Table
   dummy: int64
   part: dictionary<values=string, indices=int8, ordered=0>
   
   # only A included
   In [13]: fragment.to_table(schema=dataset.schema).column("part")  
   Out[13]: 
   <pyarrow.lib.ChunkedArray object at 0x7fb4a0b6c5e8>
   [
   
     -- dictionary:
       [
         "A"
       ]
     -- indices:
       [
         0,
         0
       ]
   ]
   
   In [15]: import pyarrow.parquet as pq  
   
   In [16]: dataset2 = pq.ParquetDataset("test_partitioned/")  
   
   In [19]: piece = dataset2.pieces[0] 
   
   In [25]: piece.read(partitions=dataset2.partitions) 
   Out[25]: 
   pyarrow.Table
   dummy: int64
   part: dictionary<values=string, indices=int32, ordered=0>
   
   # both A and B included
   In [26]: piece.read(partitions=dataset2.partitions).column("part")   
   Out[26]: 
   <pyarrow.lib.ChunkedArray object at 0x7fb4a08b26d8>
   [
   
     -- dictionary:
       [
         "A",
         "B"
       ]
     -- indices:
       [
         0,
         0
       ]
   ]
   ```
   
   I think for this to be valuable (e.g. in the context of dask, or for pandas 
when reading in only a part of the parquet dataset), it's important to get all 
values of the partition field. But I am not sure to what extent that fits in 
the Dataset design (although I think that during discovery in the Factory, we 
could keep track of all unique values of a partition field?)
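   To sketch what I mean: if discovery kept the full set of partition values, each fragment could still build its partition column against that full dictionary, with the indices all pointing at its own (single) value. A minimal illustration with plain pyarrow arrays (the `all_values` list and the fragment's value index are assumed inputs here, standing in for whatever the Factory would track):

   ```python
   import pyarrow as pa

   # Assumed: all partition values collected during dataset discovery.
   all_values = pa.array(["A", "B"])
   # Assumed: this fragment corresponds to part=A, i.e. index 0, with 2 rows.
   fragment_value_index = 0
   num_rows = 2

   # Build the fragment's partition column: the full dictionary is attached,
   # but every row's index points at this fragment's single value.
   indices = pa.array([fragment_value_index] * num_rows, type=pa.int8())
   part_column = pa.DictionaryArray.from_arrays(indices, all_values)

   print(part_column.dictionary)  # both "A" and "B", as with ParquetDataset
   print(part_column.indices)     # only index 0, as this fragment is part=A
   ```

   That way concatenating the fragments' tables would give dictionary columns with identical dictionaries, regardless of which subset of fragments was read.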

