[ https://issues.apache.org/jira/browse/ARROW-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451371#comment-17451371 ]
Will Jones commented on ARROW-14701:
------------------------------------

{quote}The cap does not seem to be based on the number of rows, I would guess it's based on the memory size of the `pyarrow.Table` but I haven't investigated further.
{quote}
It does seem to me to be based on rows; what makes you say otherwise?

The Python write_table function dispatches to the C++ function parquet::arrow::FileWriterImpl::WriteTable(). The writer properties specify a maximum row group length, and this cap overrides the row_group_size you provided in these lines: [writer.cc|https://github.com/apache/arrow/blob/00d55bb84982cd8ea8f3968b4fab68af595e79fe/cpp/src/parquet/arrow/writer.cc#L331-L333]

The default (64 * 1024 * 1024 = 67,108,864 rows) is set here: [properties.h|https://github.com/apache/arrow/blob/00d55bb84982cd8ea8f3968b4fab68af595e79fe/cpp/src/parquet/properties.h#L97]

As far as I can tell, that option isn't currently exposed in the Python bindings and can't be changed, so the effective row group length is min(row_group_size, 67,108,864); see the sketch after the quoted issue below. You're correct that this isn't well documented by the write_table function, but it is documented in the ParquetWriter.write_table method that write_table wraps.

> [Python] parquet.write_table has an undocumented and silent cap on row_group_size
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-14701
>                 URL: https://issues.apache.org/jira/browse/ARROW-14701
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 6.0.0
>            Reporter: Adrien Hoarau
>            Priority: Minor
>
> {code:java}
> from io import BytesIO
>
> import pandas as pd
> import pyarrow
> from pyarrow import parquet
>
> print(pyarrow.__version__)
>
>
> def check_row_groups_created(size: int):
>     df = pd.DataFrame({"a": range(size)})
>     t = pyarrow.Table.from_pandas(df)
>     buffer = BytesIO()
>     parquet.write_table(t, buffer, row_group_size=size)
>     buffer.seek(0)
>     print(parquet.read_metadata(buffer))
>
>
> check_row_groups_created(50_000_000)
> check_row_groups_created(100_000_000)
> {code}
> outputs:
> {code:java}
> 6.0.0
> <pyarrow._parquet.FileMetaData object at 0x7f838584ab80>
>   created_by: parquet-cpp-arrow version 6.0.0
>   num_columns: 1
>   num_rows: 50000000
>   num_row_groups: 1
>   format_version: 1.0
>   serialized_size: 1493
> <pyarrow._parquet.FileMetaData object at 0x7f838584ab80>
>   created_by: parquet-cpp-arrow version 6.0.0
>   num_columns: 1
>   num_rows: 100000000
>   num_row_groups: 2
>   format_version: 1.0
>   serialized_size: 1640
> {code}
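As a minimal sketch of the capping behaviour described above (expected_num_row_groups is an illustrative helper, not a pyarrow API; it assumes only the default from the linked properties.h), the row group count write_table produces can be predicted like this:

{code:python}
import math

# Default from cpp/src/parquet/properties.h at the commit linked above:
# 64 * 1024 * 1024 = 67,108,864 rows.
DEFAULT_MAX_ROW_GROUP_LENGTH = 64 * 1024 * 1024


def expected_num_row_groups(num_rows: int, row_group_size: int) -> int:
    # The writer caps the requested row_group_size at max_row_group_length,
    # then splits the table into ceil(num_rows / effective) row groups.
    effective = min(row_group_size, DEFAULT_MAX_ROW_GROUP_LENGTH)
    return math.ceil(num_rows / effective)


# Matches the reporter's output: 1 row group for 50M rows, 2 for 100M.
print(expected_num_row_groups(50_000_000, 50_000_000))    # -> 1
print(expected_num_row_groups(100_000_000, 100_000_000))  # -> 2
{code}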