curioustien commented on PR #44043: URL: https://github.com/apache/arrow/pull/44043#issuecomment-2586007265
> Perhaps a quick check that other Parquet readers only use the row group ordinal for encrypted files? @mapleFU @pitrou I'm trying to pick up this task to ramp up on the parquet codebase. Upon checking the parquet java implementation of this `ordinal` field in `RowGroup`, I think there is a small bug/discrepancy in the current cpp implementation vs the java implementation. For the java implementation, there is [this check whether it's an encrypted file or not](https://github.com/apache/parquet-java/blob/apache-parquet-1.15.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L606-L653) by checking the `fileEncryptor` pointer. If it's not an encrypted file, the `rowGroupOrdinal` value is `-1` which I believe it means that Thrift won't set that field. Otherwise, we use the `rowGroupOrdinal` value which is [incremented every time we flush the row group](https://github.com/apache/parquet-java/blob/apache-parquet-1.15.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L207). Though, there is a small bug here in the java implementation that we use `int` instead of `short` for `int16 ordinal` variable. Plus, there are no checks here on integer overflow. In practice, people probably don't hit this limit on the number of row groups anyways, so this bug hasn't showed up yet. For the cpp implementation, we don't really have a check for the encrypted file like the java implementation when we write row group ordinal (not that I've found, or maybe I'm missing something), so we always write this ordinal value even for unencrypted files. In my opinion, we should change the logic here to behave like the java implementation with an additional check on integer overflow. What do you both think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
