[ https://issues.apache.org/jira/browse/ARROW-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Li resolved ARROW-14701.
------------------------------
    Fix Version/s: 7.0.0
       Resolution: Fixed

Issue resolved by pull request 11928
[https://github.com/apache/arrow/pull/11928]

> [Python] parquet.write_table has an undocumented and silent cap on row_group_size
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-14701
>                 URL: https://issues.apache.org/jira/browse/ARROW-14701
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 6.0.0
>            Reporter: Adrien Hoarau
>            Assignee: Will Jones
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> {code:java}
> from io import BytesIO
> import pandas as pd
> import pyarrow
> from pyarrow import parquet
> from pyarrow import fs
> print(pyarrow.__version__)
> def check_row_groups_created(size: int):
>     df = pd.DataFrame({"a": range(size)})
>     t = pyarrow.Table.from_pandas(df)
>     buffer = BytesIO()
>     parquet.write_table(t, buffer, row_group_size=size)
>     buffer.seek(0)
>     print(parquet.read_metadata(buffer))
>
> check_row_groups_created(50_000_000)
> check_row_groups_created(100_000_000)
> {code}
> outputs:
> {code:java}
> 6.0.0
> <pyarrow._parquet.FileMetaData object at 0x7f838584ab80>
>   created_by: parquet-cpp-arrow version 6.0.0
>   num_columns: 1
>   num_rows: 50000000
>   num_row_groups: 1
>   format_version: 1.0
>   serialized_size: 1493
> <pyarrow._parquet.FileMetaData object at 0x7f838584ab80>
>   created_by: parquet-cpp-arrow version 6.0.0
>   num_columns: 1
>   num_rows: 100000000
>   num_row_groups: 2
>   format_version: 1.0
>   serialized_size: 1640
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)