[ https://issues.apache.org/jira/browse/ARROW-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071845#comment-17071845 ]

Joris Van den Bossche commented on ARROW-8244:
----------------------------------------------

So to summarize the issue: the Python {{write_to_dataset}} API provides an 
option to return the FileMetaData of the Parquet files it has written, through 
the {{metadata_collector}} keyword (the FileMetaData Python objects are 
appended to the list passed to this {{metadata_collector}} option).
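
For context, the usage looks roughly like this (the table and paths are only illustrative):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2019, 2020], "value": [1.0, 2.0]})

collected = []  # write_to_dataset appends one FileMetaData per written file
pq.write_to_dataset(
    table,
    root_path="dataset_root",
    partition_cols=["year"],
    metadata_collector=collected,
)
{code}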

In practice, this feature is used by Dask to collect the FileMetaData of all 
Parquet files in the partitioned dataset and concatenate them to create the 
{{_metadata}} file.
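
Schematically, that concatenation looks like this (assuming the relative paths were available, which is exactly what is missing today; the path strings below are hypothetical):

{code:python}
# 'collected' holds the FileMetaData objects from write_to_dataset,
# 'paths' the matching paths relative to the dataset root (hypothetical here)
paths = ["year=2019/part-0.parquet", "year=2020/part-0.parquet"]

for md, path in zip(collected, paths):
    md.set_file_path(path)  # point the column chunks at the right file

merged = collected[0]
for md in collected[1:]:
    merged.append_row_groups(md)

merged.write_metadata_file("dataset_root/_metadata")
{code}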

For this use case, though, the {{file_path}} field in the 
{{ColumnChunkMetaData}} of the {{FileMetaData}} object needs to be set to the 
file's relative path inside the partitioned dataset. The problem is that Dask 
doesn't have these paths (it only gets the list of FileMetaData objects, and 
the files are already written to disk by pyarrow). So we have the option to 
either return those paths in some way as well, or to set those paths before 
returning the FileMetaData objects from the {{write_to_dataset}} function (and 
to be clear, we would _only_ set the path in the FileMetaData being returned to 
the collector, and _not_ in the actual FileMetaData footer of the Parquet data 
files being written).
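
The proposal would roughly mean that {{write_to_dataset}} itself does the equivalent of the following for each file it writes (the names here are illustrative, not actual internals):

{code:python}
# inside write_to_dataset, after one file of the dataset has been written:
if metadata_collector is not None:
    # 'file_metadata' is the FileMetaData of the file just written and
    # 'relative_path' its path inside the dataset (both illustrative names)
    file_metadata.set_file_path(relative_path)
    metadata_collector.append(file_metadata)
# the footer already written to disk stays untouched; only the collected
# Python object carries the path
{code}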

I would personally just change this to set the paths in pyarrow (and consider 
it a bug fix), as I think creating the {{_metadata}} file is probably the only 
use case for this (but that is not a very educated guess). That way we don't 
need to complicate the API further with additional options to also set or 
return the paths (though this is certainly possible to do, if we don't want to 
change the current behaviour).

cc [~wesm] [~fsaintjacques]

> [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" 
> metadata fields
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8244
>                 URL: https://issues.apache.org/jira/browse/ARROW-8244
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>            Reporter: Rick Zamora
>            Priority: Minor
>              Labels: parquet
>             Fix For: 0.17.0
>
>
> Prior to [dask#6023|https://github.com/dask/dask/pull/6023], Dask had been 
> using the `write_to_dataset` API to write partitioned parquet datasets. That 
> PR switches to a (hopefully temporary) custom solution, because the API makes 
> it difficult to populate the "file_path" column-chunk metadata fields that 
> are returned within the optional `metadata_collector` kwarg. 
> Dask needs to set these fields correctly in order to generate a proper global 
> `"_metadata"` file.
> Possible solutions to this problem:
>  # Optionally populate the file-path fields within `write_to_dataset`
>  # Always populate the file-path fields within `write_to_dataset`
>  # Return the file paths for the data written within `write_to_dataset` (up 
> to the user to manually populate the file-path fields)


