[
https://issues.apache.org/jira/browse/ARROW-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rok Mihevc updated ARROW-4542:
------------------------------
External issue URL: https://github.com/apache/arrow/issues/21090
> [C++][Parquet] Denominate row group size in bytes (not in no of rows)
> ---------------------------------------------------------------------
>
> Key: ARROW-4542
> URL: https://issues.apache.org/jira/browse/ARROW-4542
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Remek Zajac
> Priority: Major
>
> Both the C++ [implementation of the Parquet writer for
> Arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L1174]
> and the [Python code bound to
> it|https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L911]
> appear to denominate the row group size in the *number of rows* (without
> making this very explicit) - see the sketch after the points below. Whereas:
> (1) [The Apache Parquet
> documentation|https://parquet.apache.org/documentation/latest/] states:
> "_Row group size: Larger row groups allow for larger column chunks which
> makes it possible to do larger sequential IO. Larger groups also require more
> buffering in the write path (or a two pass write). *We recommend large row
> groups (512MB - 1GB)*. Since an entire row group might need to be read, we
> want it to completely fit on one HDFS block. Therefore, HDFS block sizes
> should also be set to be larger. An optimized read setup would be: 1GB row
> groups, 1GB HDFS block size, 1 HDFS block per HDFS file._"
> (2) The reference Apache [parquet-mr
> implementation|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L146]
> for Java accepts the row group size expressed in bytes.
> (3) The [low-level Parquet read-write
> example|https://github.com/apache/arrow/blob/master/cpp/examples/parquet/low-level-api/reader-writer2.cc#L88]
> also treats the row group size as denominated in bytes.
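> For illustration, a minimal sketch of the current semantics as exposed by
> pyarrow (the file name and column values are made up for the example):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> table = pa.table({"x": list(range(1_000_000))})
>
> # row_group_size here is a number of rows, not a number of bytes: this
> # produces row groups of 100,000 rows each, whose on-disk (compressed)
> # size depends on the data and the codec and cannot be targeted directly.
> pq.write_table(table, "example.parquet", row_group_size=100_000)
> {code}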
> These insights make me conclude that:
> * Per the Parquet design, and to take advantage of HDFS block-level
> operations, it only makes sense to work with row group sizes expressed in
> bytes - that is the only consequential quantity the caller can meaningfully
> want to influence.
> * The Arrow implementation of ParquetWriter would benefit from
> re-denominating its `row_group_size` in bytes. I will also note that it is
> currently impossible to use pyarrow to shape equally byte-sized row groups,
> as the size a row group ends up taking is post-compression and the caller
> only knows how much uncompressed data they have managed to put in (see the
> sketch below).
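> A hedged sketch of the best a caller can do today: derive a row count from
> the *uncompressed* in-memory size, since the compressed on-disk size of a
> row group is only known after writing (the helper name and the 512MB target
> are illustrative only):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> def rows_per_group_for(table: pa.Table, target_bytes: int) -> int:
>     # table.nbytes is the uncompressed Arrow buffer size, so the resulting
>     # row groups will typically end up much smaller than target_bytes once
>     # Parquet encoding and compression are applied.
>     bytes_per_row = max(table.nbytes / max(table.num_rows, 1), 1)
>     return max(int(target_bytes / bytes_per_row), 1)
>
> table = pa.table({"x": list(range(1_000_000))})
> pq.write_table(table, "data.parquet",
>                row_group_size=rows_per_group_for(table, 512 * 1024 * 1024))
> {code}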
> Now, my conclusions may be wrong and I may be blind to some line of
> reasoning, so this ticket is more of a question than a bug report: does the
> audience here agree with my reasoning, and if not, what detail have I
> missed?
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)