[ 
https://issues.apache.org/jira/browse/ARROW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3244:
--------------------------------
    Fix Version/s: 0.14.0

> [Python] Multi-file parquet loading without scan
> ------------------------------------------------
>
>                 Key: ARROW-3244
>                 URL: https://issues.apache.org/jira/browse/ARROW-3244
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Martin Durant
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.14.0
>
>
> Several mechanisms could avoid having to access and read the parquet footers 
> of every file in a data set consisting of a number of files. With a large 
> number of data files (perhaps split with directory partitioning) on remote 
> storage, reading every footer can be a significant overhead. This matters 
> particularly for Dask, which must have the metadata available in the 
> client before setting up computational graphs.
>  
> Here are some suggestions of what could be done.
>  
>  * some parquet writing frameworks include a `_metadata` file, which contains 
> all the information from the footers of the various files. If this file is 
> present, then this data can be read from one place, with a single file 
> access. For a large number of files, parsing the thrift information may, by 
> itself, be a non-negligible overhead.
>  * the schema (dtypes) can be read from a `_common_metadata` file, or from 
> any one of the data files, and then assumed (perhaps at the user's option) 
> to be the same for all of the files. However, the information about 
> the directory partitioning would not be available. Although Dask may infer 
> the information from the filenames, it would be preferable to go through the 
> machinery with parquet-cpp, and view the whole data-set as a single object. 
> Note that each file's footer will still need to be read to access the 
> data, for the byte offsets, but from Dask's point of view, this would be 
> deferred to tasks running in parallel.
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)
>  
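
For illustration only, the filename-based inference of directory partitioning mentioned in the second suggestion might look like the sketch below. It assumes Hive-style `key=value` directory names, which the issue itself does not specify, and `partitions_from_path` is a hypothetical helper, not part of Dask, fastparquet, or parquet-cpp. No file is opened; only the path strings are inspected.

```python
from pathlib import PurePosixPath


def partitions_from_path(path, root):
    """Return {key: value} for each `key=value` directory between root and the file.

    Hypothetical helper: assumes Hive-style partition directories.
    """
    relative = PurePosixPath(path).relative_to(PurePosixPath(root))
    partitions = {}
    # Every component except the last is a directory; the last is the data file.
    for part in relative.parts[:-1]:
        key, sep, value = part.partition("=")
        if sep:  # ignore directories that are not of the form key=value
            partitions[key] = value
    return partitions


# Example paths are made up for illustration.
paths = [
    "bucket/data/year=2018/month=9/part-0.parquet",
    "bucket/data/year=2018/month=10/part-0.parquet",
]
catalog = {p: partitions_from_path(p, "bucket/data") for p in paths}
```

Values recovered this way are plain strings, and nothing checks that the files share a schema, which is part of why going through the parquet-cpp machinery would be preferable.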



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
