Thank you for the responses, Wes and Joris! These summaries are very helpful to me.
I decided to look into ARROW-5349 to get my feet wet, and just submitted a WIP PR (https://github.com/apache/arrow/pull/4361). If you get a chance, please take a look and provide feedback. I have limited experience with C++ development, so feel free to be harsh. Once the metadata can include file-path information, I will be happy to help with ARROW-5138 as well.

Best,
Rick

On 5/16/19, 2:48 PM, "Joris Van den Bossche" <jorisvandenboss...@gmail.com> wrote:

Missed the email from Wes, but yes, I think we basically said the same.

Answer to another question you raised in the notebook:

> [about writing a _common_metadata file] ... uses the schema object for
> the 0th partition. This actually means that not *all* information in
> _common_metadata will be true for the entire dataset. More specifically,
> the "start" and "stop" values of the "index_columns" [in the
> pandas_metadata] will correspond to the 0th partition, rather than the
> global dataset.

That's indeed a problem with storing the index information not as a
column. We have seen some other related issues about this, such as
ARROW-5138 (when reading a single row group of a Parquet file). In those
cases, I think the only solution is to ignore this part of the metadata.
But, specifically for dask, I think the idea actually is to not write the
index at all (based on the discussion in
https://github.com/dask/dask/pull/4336), so then you would not have this
problem.

However, note that writing the _common_metadata file like that, from the
schema of the first partition, might not be fully correct: it might have
the correct schema, but it will not have the correct dataset size (e.g.
the number of row groups). Although I am not sure what the common
practice is on this aspect of the _common_metadata file.

Joris

On Thu, 16 May 2019 at 20:50, Joris Van den Bossche
<jorisvandenboss...@gmail.com> wrote:

> Hi Rick,
>
> Thanks for exploring this!
> I am still quite new to Parquet myself, so the following might not be
> fully correct, but based on my current understanding, to enable projects
> like dask to write the different pieces of a Parquet dataset using
> pyarrow, we need the following functionalities:
>
> - Write a single Parquet file (for one piece / partition) and get the
>   metadata of that file
>   -> Writing has long been possible, and ARROW-5258 (GH4236) enabled
>   getting the metadata
> - Update and combine this list of metadata objects
>   -> Dask needs a way to update the metadata (e.g. the exact file path
>   where they put it inside the partitioned dataset): I opened ARROW-5349
>   for this.
>   -> We need to combine the metadata, discussed in ARROW-1983
> - Write a metadata object (for both the _metadata and _common_metadata
>   files)
>   -> Also discussed in ARROW-1983. The Python interface could also
>   combine (the step above) and write in one go.
>
> But it would be good if some people more familiar with Parquet could
> chime in here.
>
> Best,
> Joris
>
> On Thu, 16 May 2019 at 16:37, Richard Zamora <rzam...@nvidia.com> wrote:
>
>> Note that I was asked to post here after making a similar comment on
>> GitHub (https://github.com/apache/arrow/pull/4236)…
>>
>> I am hoping to help improve the use of pyarrow.parquet within dask
>> (https://github.com/dask/dask). To this end, I put together a simple
>> notebook to explore how pyarrow.parquet can be used to read/write a
>> partitioned dataset without dask (see:
>> https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb).
>> If you search for "Assuming that a single-file metadata solution is
>> currently missing" in that notebook, you will see where I am unsure of
>> the best way to write/read metadata to/from a centralized location
>> using pyarrow.parquet.
>>
>> I believe that it would be best for dask to have a way to read/write a
>> single metadata file for a partitioned dataset using pyarrow (perhaps a
>> '_metadata' file?).
>> Am I correct to assume that: (1) this functionality is missing in
>> pyarrow, and (2) this approach is the best way to process a partitioned
>> dataset in parallel?
>>
>> Best,
>> Rick
>>
>> --
>> Richard J. Zamora
>> NVIDIA