Thank you for the responses, Wes and Joris! These summaries are very helpful to me.
I decided to look into ARROW-5349 to get my feet wet, and just submitted a WIP PR (https://github.com/apache/arrow/pull/4361). If you get a chance, please take a look and provide feedback. I have limited experience with C++ development, so feel free to be harsh. Once the metadata can include file-path information, I will be happy to help with ARROW-5138 as well.

Best,
Rick

On 5/16/19, 2:48 PM, "Joris Van den Bossche" <jorisvandenboss...@gmail.com> wrote:

Missed the email from Wes, but yes, I think we basically said the same.

Answer to another question you raised in the notebook:

> [about writing a _common_metadata file] ... uses the schema object for
> the 0th partition. This actually means that not *all* information in
> _common_metadata will be true for the entire dataset. More specifically,
> the "start" and "stop" values of the "index_columns" [in the
> pandas_metadata] will correspond to the 0th partition, rather than the
> global dataset.

That's indeed a problem with storing the index information not as a
column. We have seen some other related issues about this, such as
ARROW-5138 (when reading a single row group of a Parquet file). In those
cases, I think the only solution is to ignore this part of the metadata.
But, specifically for dask, I think the idea actually is to not write the
index at all (based on the discussion in
https://github.com/dask/dask/pull/4336), so then you would not have this
problem.

However, note that writing the _common_metadata file like that, from the
schema of the first partition, might not be fully correct: it might have
the correct schema, but it will not have the correct dataset size (e.g.
the number of row groups). Although I am not sure what the common
practice is on this aspect of the _common_metadata file.

Joris

On Thu, 16 May 2019 at 20:50, Joris Van den Bossche
<jorisvandenboss...@gmail.com> wrote:

> Hi Rick,
>
> Thanks for exploring this!
> I am still quite new to Parquet myself, so the following might not be
> fully correct, but based on my current understanding, to enable projects
> like dask to write the different pieces of a Parquet dataset using
> pyarrow, we need the following functionalities:
>
> - Write a single Parquet file (for one piece / partition) and get the
>   metadata of that file
>   -> Writing has long been possible, and ARROW-5258 (GH4236) enabled
>   getting the metadata
> - Update and combine this list of metadata objects
>   -> Dask needs a way to update the metadata (e.g. the exact file path
>   where they put it inside the partitioned dataset): I opened ARROW-5349
>   for this.
>   -> We need to combine the metadata, discussed in ARROW-1983
> - Write a metadata object (for both the _metadata and _common_metadata
>   files)
>   -> Also discussed in ARROW-1983. The Python interface could also
>   combine (the step above) and write in one go.
>
> But it would be good if some people more familiar with Parquet could
> chime in here.
>
> Best,
> Joris
>
> On Thu, 16 May 2019 at 16:37, Richard Zamora <rzam...@nvidia.com> wrote:
>
>> Note that I was asked to post here after making a similar comment on
>> GitHub (https://github.com/apache/arrow/pull/4236)…
>>
>> I am hoping to help improve the use of pyarrow.parquet within dask
>> (https://github.com/dask/dask). To this end, I put together a simple
>> notebook to explore how pyarrow.parquet can be used to read/write a
>> partitioned dataset without dask (see:
>> https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb).
>> If you search for "Assuming that a single-file metadata solution is
>> currently missing" in that notebook, you will see where I am unsure of
>> the best way to write/read metadata to/from a centralized location
>> using pyarrow.parquet.
>>
>> I believe that it would be best for dask to have a way to read/write a
>> single metadata file for a partitioned dataset using pyarrow (perhaps a
>> '_metadata' file?).
>> Am I correct to assume that: (1) this functionality is missing in
>> pyarrow, and (2) this approach is the best way to process a partitioned
>> dataset in parallel?
>>
>> Best,
>> Rick
>>
>> --
>> Richard J. Zamora
>> NVIDIA