Suvayu Ali created ARROW-1956:
---------------------------------
Summary: Support reading specific partitions from a partitioned
parquet dataset
Key: ARROW-1956
URL: https://issues.apache.org/jira/browse/ARROW-1956
Project: Apache Arrow
Issue Type: Improvement
Components: Format
Affects Versions: 0.8.0
Environment: Kernel: 4.14.8-300.fc27.x86_64
Python: 3.6.3
Reporter: Suvayu Ali
Priority: Minor
Attachments: so-example.py
I want to read specific partitions from a partitioned parquet dataset. This is
very useful for large datasets. I have attached a small script that creates a
dataset and shows what is expected when reading (quoting the salient points
below).
# There is no way to read specific partitions in Pandas
# In pyarrow I tried to achieve this by providing a list of
files/directories to ParquetDataset, but it didn't work.
# In PySpark it works if I simply do:
{code:none}
spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
{code}
I also couldn't find a way to easily write partitioned parquet files. In the
end I did it by hand: I created the directory hierarchy and wrote the
individual files myself (similar to the implementation in the attached script).
Again, in PySpark I can do
{code:none}
df.write.partitionBy(*list_of_partitions).parquet(output)
{code}
to achieve that.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)