Hi all,

While looking into how file splits work, we noticed that parquet-java
assigns a row group to a split using the row group’s midpoint
(parquet-java code link:
<https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555>):

start = offset of the first column chunk

midpoint = start + total_compressed_size / 2
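
To make the rule concrete, here is a minimal sketch of the midpoint test as we
understand it. It is a simplification, not the actual parquet-java code: the
class and method names are hypothetical, and we assume a split covers the
half-open byte range [splitStart, splitEnd).

```java
// Simplified sketch of midpoint-based split assignment; not the parquet-java code.
final class MidpointSplitSketch {

    // Midpoint as described above: offset of the first column chunk
    // plus half of the row group's total compressed size.
    static long midpoint(long firstChunkOffset, long totalCompressedSize) {
        return firstChunkOffset + totalCompressedSize / 2;
    }

    // Assumption: a split owns the half-open byte range [splitStart, splitEnd),
    // and a row group belongs to the split that contains its midpoint.
    static boolean belongsToSplit(long firstChunkOffset, long totalCompressedSize,
                                  long splitStart, long splitEnd) {
        long mid = midpoint(firstChunkOffset, totalCompressedSize);
        return mid >= splitStart && mid < splitEnd;
    }

    public static void main(String[] args) {
        // Hypothetical row group: first chunk at byte 4, 1000 compressed bytes.
        System.out.println(midpoint(4, 1000));                  // 504
        System.out.println(belongsToSplit(4, 1000, 0, 512));    // true
        System.out.println(belongsToSplit(4, 1000, 512, 1024)); // false
    }
}
```

With this rule each row group is claimed by exactly one split, provided the
computed midpoint really lies inside the row group's bytes.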

This seems to rely on two assumptions:

   1. Column chunks are physically stored in schema order, so the first column
      in RowGroup.columns also has the smallest file offset.
   2. The column chunks of a row group are stored contiguously (i.e. row
      groups are not overlapping).
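
As an illustration, the two assumptions can be expressed as checks over a row
group's column chunks. This is only a sketch: the Chunk type below is a
hypothetical stand-in for the (starting offset, compressed size) pair taken
from the column chunk metadata, and the example layouts are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of checks for the two layout assumptions; all names are hypothetical.
final class RowGroupLayoutCheck {

    // Hypothetical minimal view of a column chunk.
    static final class Chunk {
        final long offset;          // starting file offset of the chunk
        final long compressedSize;  // total compressed size of the chunk

        Chunk(long offset, long compressedSize) {
            this.offset = offset;
            this.compressedSize = compressedSize;
        }
    }

    // Assumption 1: the first chunk in metadata (schema) order has the smallest offset.
    static boolean firstChunkHasSmallestOffset(List<Chunk> chunks) {
        long first = chunks.get(0).offset;
        for (Chunk c : chunks) {
            if (c.offset < first) {
                return false;
            }
        }
        return true;
    }

    // Assumption 2: sorted by offset, the chunks sit back to back with no gaps.
    static boolean chunksAreContiguous(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingLong((Chunk c) -> c.offset));
        for (int i = 1; i < sorted.size(); i++) {
            Chunk prev = sorted.get(i - 1);
            if (prev.offset + prev.compressedSize != sorted.get(i).offset) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Conventional layout: chunks written in schema order, back to back.
        List<Chunk> conventional = Arrays.asList(new Chunk(4, 100), new Chunk(104, 200));
        // Made-up layout: same bytes, but the first schema column is stored last.
        List<Chunk> reordered = Arrays.asList(new Chunk(204, 100), new Chunk(4, 200));

        System.out.println(firstChunkHasSmallestOffset(conventional)); // true
        System.out.println(chunksAreContiguous(conventional));         // true
        System.out.println(firstChunkHasSmallestOffset(reordered));    // false
        System.out.println(chunksAreContiguous(reordered));            // true
    }
}
```

In the reordered layout the row group occupies bytes 4 to 304, but the formula
above would give start = 204 and midpoint = 204 + 300 / 2 = 354, which lies
outside the row group entirely.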

Are these requirements guaranteed by the Parquet format specification, or
are they only conventions followed by common writers?

Based on our reading, the Parquet specification only requires the column
metadata to follow the same order as the SchemaElement list in FileMetaData.
It does not appear to require the corresponding file offsets to be in
ascending order, nor does it explicitly prohibit row groups from
overlapping.

Is that the correct interpretation? Either way, it would be helpful for the
specification to clarify these requirements and remove the ambiguity.

Best,
Jiayi
