[ https://issues.apache.org/jira/browse/ARROW-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou updated ARROW-16613:
-----------------------------------
    Component/s: C++

> [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16613
>                 URL: https://issues.apache.org/jira/browse/ARROW-16613
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Parquet, Python
>    Affects Versions: 8.0.0
>            Reporter: Kyle Barron
>            Assignee: Antoine Pitrou
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 9.0.0
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Hello!
>
> I've noticed that writing a `_metadata` file with `pyarrow.parquet.write_metadata` is very slow when `metadata_collector` is large, exhibiting O(n^2) behavior. Specifically, the concatenation inside `metadata.append_row_groups` is the bottleneck: the writer first [iterates over every item of the list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302] and then [concatenates them on each iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].
>
> Would it be possible to make a vectorized implementation of this, where `append_row_groups` accepts a list of `FileMetaData` objects and concatenation happens only once?
>
> Repro (in IPython, to use `%time`):
>
> {code:python}
> from io import BytesIO
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
>
> def create_example_file_meta_data():
>     data = {
>         "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
>         "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
>         "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
>         "bool": pa.array([True, True, False, False], type=pa.bool_()),
>     }
>     table = pa.table(data)
>     metadata_collector = []
>     pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
>     return table.schema, metadata_collector[0]
>
>
> schema, meta = create_example_file_meta_data()
>
> metadata_collector = [meta] * 500
> %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
> # CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
> # Wall time: 234 ms
>
> metadata_collector = [meta] * 1000
> %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
> # CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
> # Wall time: 970 ms
>
> metadata_collector = [meta] * 2000
> %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
> # CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
> # Wall time: 4.3 s
>
> metadata_collector = [meta] * 4000
> %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
> # CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
> # Wall time: 17.3 s
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
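The asymptotics the report describes can be sketched without pyarrow. The model below is hypothetical (`FakeMetadata`, `merge_sequential`, and `merge_tree` are illustrative names, not pyarrow API): if each `append_row_groups` call copies every row group accumulated so far, appending n files one at a time costs about n^2/2 copies, while merging neighbors pairwise in a tree copies each row group only about log2(n) times. A vectorized `append_row_groups` taking a list would similarly concatenate once.

```python
# Hypothetical sketch, not pyarrow code: model why per-file appends are
# O(n^2) and why a single (or tree-shaped) concatenation is much cheaper.

class FakeMetadata:
    """Stand-in for FileMetaData; append_row_groups rebuilds the full list."""

    def __init__(self, row_groups):
        self.row_groups = list(row_groups)
        self.copies = 0  # total row-group copies spent building this object

    def append_row_groups(self, other):
        # Model the concatenation: building the combined list touches every
        # row group on both sides; carry over work already spent on `other`.
        self.copies += len(self.row_groups) + len(other.row_groups)
        self.copies += other.copies
        self.row_groups = self.row_groups + other.row_groups


def merge_sequential(metas):
    """Reported behavior: append each file's metadata one at a time."""
    head, *rest = metas
    for m in rest:
        head.append_row_groups(m)
    return head


def merge_tree(metas):
    """Alternative: merge adjacent pairs until one object remains."""
    metas = list(metas)
    while len(metas) > 1:
        nxt = []
        for i in range(0, len(metas) - 1, 2):
            metas[i].append_row_groups(metas[i + 1])
            nxt.append(metas[i])
        if len(metas) % 2:  # odd leftover passes through to the next round
            nxt.append(metas[-1])
        metas = nxt
    return metas[0]


n = 4000  # matches the largest repro size above
seq = merge_sequential([FakeMetadata([i]) for i in range(n)])
tree = merge_tree([FakeMetadata([i]) for i in range(n)])
assert seq.row_groups == tree.row_groups == list(range(n))
print(f"sequential copies: {seq.copies}, tree copies: {tree.copies}")
```

With n = 4000 the sequential strategy performs roughly 8 million modeled copies versus roughly 48 thousand for the tree merge, mirroring the superlinear wall times in the repro. Note the real `FileMetaData.append_row_groups` mutates its receiver, so any such scheme must not be run over a list that repeats the same object, as the repro's `[meta] * 4000` does.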