[ https://issues.apache.org/jira/browse/ARROW-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071845#comment-17071845 ]
Joris Van den Bossche commented on ARROW-8244:
----------------------------------------------

So to summarize the issue: the Python {{write_to_dataset}} API provides an option to return the FileMetaData of the parquet files it has written, through {{metadata_collector}} (the FileMetaData Python objects are appended to the list passed to this {{metadata_collector}} option).

In practice, this feature is used by dask to collect the FileMetaData of all parquet files in the partitioned dataset and to concatenate them into the {{_metadata}} file. For this use case, though, the {{file_path}} field in the {{ColumnChunkMetaData}} of the {{FileMetaData}} object needs to be set to the relative path inside the partitioned dataset. The problem is that dask doesn't have these paths (it only gets the list of FileMetaData objects, and the files are already written to disk by pyarrow).

So we have the option to either return those paths in some way as well, or to set those paths before returning the FileMetaData objects from the {{write_to_dataset}} function (to be clear, we would _only_ set the path in the FileMetaData returned to the collector, and _not_ in the actual FileMetaData inside the parquet data files being written).

I would personally just change this to set the paths in pyarrow (and consider it a bug fix), as creating the {{_metadata}} file is probably the only use case for this feature (though that is an educated guess at best). That way we don't need to complicate the API further with additional options to also set or return the paths (which is certainly possible, if we don't want to change the current behaviour).

cc [~wesm] [~fsaintjacques]


> [Python][Parquet] Add `write_to_dataset` option to populate the "file_path" metadata fields
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8244
>                 URL: https://issues.apache.org/jira/browse/ARROW-8244
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>            Reporter: Rick Zamora
>            Priority: Minor
>              Labels: parquet
>             Fix For: 0.17.0
>
>
> Prior to [dask#6023|https://github.com/dask/dask/pull/6023], Dask had been using the `write_to_dataset` API to write partitioned parquet datasets. That PR switches to a (hopefully temporary) custom solution, because the API makes it difficult to populate the "file_path" column-chunk metadata fields that are returned within the optional `metadata_collector` kwarg. Dask needs to set these fields correctly in order to generate a proper global `"_metadata"` file.
> Possible solutions to this problem:
> # Optionally populate the file-path fields within `write_to_dataset`
> # Always populate the file-path fields within `write_to_dataset`
> # Return the file paths for the data written within `write_to_dataset` (up to the user to manually populate the file-path fields)
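For illustration, here is a minimal sketch of the workflow discussed above: collect per-file FileMetaData via {{metadata_collector}}, set the relative {{file_path}} values, and merge everything into a global {{_metadata}} file. The example table, {{root_path}} and {{relative_paths}} are made up, and the relative paths are exactly what pyarrow does not hand back today, which is the gap this issue is about.

{code:python}
# Sketch only: collect FileMetaData while writing a partitioned dataset,
# set the relative "file_path" fields, and write a global _metadata file.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})

collector = []
pq.write_to_dataset(
    table,
    root_path="dataset_root",
    partition_cols=["part"],
    metadata_collector=collector,  # one FileMetaData appended per written file
)

# The collected FileMetaData objects come back with empty file_path fields.
# If the relative paths inside the dataset were known (hypothetical values
# below; pyarrow does not currently return them), the caller could set them:
relative_paths = ["part=a/<uuid>.parquet", "part=b/<uuid>.parquet"]
for file_metadata, path in zip(collector, relative_paths):
    file_metadata.set_file_path(path)

# The partition column is not stored in the data files, so the schema used
# for the global _metadata file must exclude it.
data_schema = table.schema.remove(table.schema.get_field_index("part"))
pq.write_metadata(data_schema, "dataset_root/_metadata",
                  metadata_collector=collector)
{code}

If the paths were set inside {{write_to_dataset}} itself (the "bug fix" option suggested in the comment), the {{set_file_path}} loop above would become unnecessary, since the collected FileMetaData would already carry the relative paths.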