Hi all,

While looking into how file splits work, we noticed that parquet-java assigns a row group to a split using the row group’s midpoint (parquet-java code: <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555>):

    start = offset of the first column chunk
    midpoint = start + total_compressed_size / 2

This seems to rely on two assumptions:

1. Column chunks are physically stored in schema order, so the first column in RowGroup.columns also has the smallest file offset.
2. The column chunks of a row group are stored contiguously (i.e. row groups are not overlapping).

Are these requirements guaranteed by the Parquet format specification, or are they only conventions followed by common writers?

Based on our reading, the Parquet specification only requires the column metadata to follow the same order as the SchemaElement list in FileMetaData. It does not appear to require the corresponding file offsets to be in ascending order, nor does it explicitly prohibit row groups from overlapping. Is that the correct interpretation?

Either way, it would be helpful for the specification to clarify these requirements and remove the ambiguity.

Best,
Jiayi
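
P.S. To make the concern concrete, here is a minimal Python sketch of the midpoint rule as described above. It is not the parquet-java implementation: the ColumnChunk and RowGroup classes, the helper names, the offsets, and the fixed-size split model are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnChunk:
    offset: int           # file offset where the chunk starts (hypothetical field)
    compressed_size: int  # number of bytes the chunk occupies


@dataclass
class RowGroup:
    columns: List[ColumnChunk]  # listed in schema order, as the metadata requires

    @property
    def total_compressed_size(self) -> int:
        return sum(c.compressed_size for c in self.columns)


def midpoint(rg: RowGroup) -> int:
    """Midpoint rule from this email: offset of the first listed chunk plus half the size."""
    start = rg.columns[0].offset
    return start + rg.total_compressed_size // 2


def true_midpoint(rg: RowGroup) -> int:
    """Midpoint of the byte range the row group actually spans in the file."""
    lo = min(c.offset for c in rg.columns)
    hi = max(c.offset + c.compressed_size for c in rg.columns)
    return lo + (hi - lo) // 2


def assigned_split(mid: int, split_size: int) -> int:
    """Index of the fixed-size split [i * split_size, (i + 1) * split_size) containing mid."""
    return mid // split_size


# Conventional layout: chunks written in schema order and contiguously.
ordered = RowGroup([ColumnChunk(100, 40), ColumnChunk(140, 60)])

# Same byte range [100, 200), but the first schema column is written last.
reordered = RowGroup([ColumnChunk(160, 40), ColumnChunk(100, 60)])

print(midpoint(ordered), true_midpoint(ordered))      # 150 150
print(midpoint(reordered), true_midpoint(reordered))  # 210 150
```

In the reordered case the computed midpoint (210) falls past the end of the row group's bytes (200). With 200-byte splits, the row group would be assigned to split 1 even though all of its data lies in split 0.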
