[ 
https://issues.apache.org/jira/browse/ARROW-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542233#comment-17542233
 ] 

Kshiteej K edited comment on ARROW-16613 at 5/25/22 9:17 PM:
-------------------------------------------------------------

Hi,

I'd be interested in giving this a shot :)

Draft PR ready at: https://github.com/apache/arrow/pull/13234



> [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector 
> appears to be O(n^2)
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16613
>                 URL: https://issues.apache.org/jira/browse/ARROW-16613
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Parquet, Python
>    Affects Versions: 8.0.0
>            Reporter: Kyle Barron
>            Priority: Critical
>             Fix For: 9.0.0
>
>
> Hello!
>  
> I've noticed that when writing a `_metadata` file with 
> `pyarrow.parquet.write_metadata`, it is very slow with a large 
> `metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears 
> that the concatenation inside `metadata.append_row_groups` is very slow. The 
> writer first [iterates over every item of the 
> list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302]
>  and then [concatenates them on each 
> iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].
>  
> Would it be possible to make a vectorized implementation of this, where 
> `append_row_groups` accepts a list of `FileMetaData` objects and 
> concatenation happens only once?
>  
> Repro (in IPython to use `%time`)
> {code:python}
> from io import BytesIO
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
>
> def create_example_file_meta_data():
>     # Build a small four-row table and capture the FileMetaData of one write.
>     data = {
>         "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
>         "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
>         "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
>         "bool": pa.array([True, True, False, False], type=pa.bool_()),
>     }
>     table = pa.table(data)
>     metadata_collector = []
>     pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
>     return table.schema, metadata_collector[0]
>
>
> schema, meta = create_example_file_meta_data()
>
> # Time write_metadata while doubling the length of metadata_collector.
> metadata_collector = [meta] * 500
> %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
> # CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
> # Wall time: 234 ms
>
> metadata_collector = [meta] * 1000
> %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
> # CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
> # Wall time: 970 ms
>
> metadata_collector = [meta] * 2000
> %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
> # CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
> # Wall time: 4.3 s
>
> metadata_collector = [meta] * 4000
> %time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
> # CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
> # Wall time: 17.3 s
> {code}
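
As a rough check on the O(n^2) reading, the wall times reported in the repro above can be fitted to a power law between consecutive sizes. This is only an illustrative sketch and not part of the original report: the helper name `scaling_exponents` is hypothetical, and the only inputs are the four wall times quoted in the issue description.

```python
import math

# Wall times in seconds, copied from the repro in the issue description,
# keyed by the length of metadata_collector.
WALL_TIMES = {500: 0.234, 1000: 0.970, 2000: 4.3, 4000: 17.3}


def scaling_exponents(timings):
    """Estimate k in time ~ n**k for each consecutive pair of sizes."""
    sizes = sorted(timings)
    exponents = []
    for small, large in zip(sizes, sizes[1:]):
        time_ratio = timings[large] / timings[small]
        size_ratio = large / small
        exponents.append(math.log(time_ratio) / math.log(size_ratio))
    return exponents


if __name__ == "__main__":
    for k in scaling_exponents(WALL_TIMES):
        print(f"estimated exponent: {k:.2f}")
```

Each doubling of the list length multiplies the wall time by roughly four, so every estimated exponent lands close to 2, which is what quadratic growth would predict.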

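One way to see why appending one item at a time could be quadratic: if every append rebuilds the accumulated row-group list, the total work grows like 1 + 2 + ... + n. The sketch below is a pure-Python model of that cost pattern only. It does not call pyarrow, the names `copies_incremental` and `copies_single_pass` are hypothetical rather than Arrow APIs, and whether the real implementation copies in this way is an assumption based on the description in the report.

```python
def copies_incremental(group_counts):
    """Count elements copied when the accumulator is rebuilt on every append."""
    accumulated = []
    copied = 0
    for count in group_counts:
        # Rebuilding copies everything gathered so far plus the new items.
        accumulated = accumulated + [None] * count
        copied += len(accumulated)
    return copied


def copies_single_pass(group_counts):
    """Count elements copied when all inputs are concatenated exactly once."""
    return sum(group_counts)


if __name__ == "__main__":
    for n in (500, 1000, 2000):
        # Assumption: one row group per collected FileMetaData object.
        counts = [1] * n
        print(n, copies_incremental(counts), copies_single_pass(counts))
```

Under this model the incremental variant copies n(n+1)/2 elements while the single-pass variant copies n, which is the kind of gap a list-accepting `append_row_groups` would aim to close.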


