[ 
https://issues.apache.org/jira/browse/ARROW-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Gruener updated ARROW-2656:
----------------------------------
    Description: 
When a parquet dataset is highly partitioned, calling the constructor for 
[ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588]
 takes a significant amount of time, since it visits directories serially to 
find all parquet files. For a dataset with thousands of partition values this 
can take several minutes on a personal laptop.

A quick win to vastly improve this would be to use a ThreadPool so that calls to 
{{_visit_level}} run concurrently, rather than spending most of the time blocked 
on I/O.
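
As a minimal sketch of that idea (the {{visit_level_concurrently}} helper and its 
arguments are hypothetical, standing in for the per-directory work that 
{{_visit_level}} does today):

{code:python}
# Minimal sketch, not the actual ParquetManifest implementation: fan the
# per-directory listing work out to a thread pool so the filesystem round
# trips (HDFS, S3, ...) overlap instead of running back to back.
from concurrent.futures import ThreadPoolExecutor

def visit_level_concurrently(fs, directories, visit_one, max_workers=16):
    # fs: a pyarrow filesystem object
    # directories: subdirectories discovered at the current partition level
    # visit_one: callable doing the same work _visit_level does for one directory
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(visit_one, fs, path) for path in directories]
        for future in futures:
            future.result()  # surface any listing errors to the caller
{code}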

An even faster option would be to allow optional indexing of the dataset 
metadata in something like the {{common_metadata}}. This index could contain all 
files in the manifest and their row_group information. It would also allow 
[split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
 to be implemented efficiently, without needing to open every parquet file in 
the dataset to retrieve its metadata, which is quite time consuming for large 
datasets. The main problem with the indexing approach is that it requires 
immutability of the dataset, which doesn't seem too unreasonable. This is 
related to https://issues.apache.org/jira/browse/ARROW-1983, but that issue 
only covers the write portion.
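
As a rough, hypothetical sketch (the {{build_manifest_index}} helper and the JSON 
index layout are assumptions, not an existing pyarrow API), such an index could be 
built once for an immutable dataset and consulted instead of walking directories 
and opening footers on every read:

{code:python}
# Hypothetical sketch: collect per-file metadata once and persist it alongside
# the dataset, so later reads can skip both the directory walk and footer reads.
import json
import pyarrow.parquet as pq

def build_manifest_index(fs, paths, index_path):
    # fs: a pyarrow filesystem; paths: parquet file paths found in one full walk
    index = []
    for path in paths:
        with fs.open(path, 'rb') as f:
            metadata = pq.ParquetFile(f).metadata
            index.append({
                'path': path,
                'num_rows': metadata.num_rows,
                'num_row_groups': metadata.num_row_groups,
            })
    # index_path could live next to _common_metadata, e.g.
    # <dataset_root>/_manifest_index.json
    with fs.open(index_path, 'wb') as f:
        f.write(json.dumps(index).encode('utf-8'))
{code}

With an index like this, {{split_row_groups}} could assign pieces from the stored 
row_group counts alone, and the full manifest walk would only be needed when the 
index is (re)built.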


> [Python] Improve ParquetManifest creation time 
> -----------------------------------------------
>
>                 Key: ARROW-2656
>                 URL: https://issues.apache.org/jira/browse/ARROW-2656
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Robert Gruener
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
