nandorKollar commented on issue #14547: URL: https://github.com/apache/iceberg/issues/14547#issuecomment-4027202558
> Hi [@nandorKollar](https://github.com/nandorKollar), I was going through this today and I think it does throw the exception if the schema has some types which are not supported: https://github.com/apache/iceberg/blob/main/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowReader.java#L256 Let me know if I am missing something, or how I can reproduce this. Thanks

I think the problem is not there. If I recall correctly, the problem is how the Arrow reader interprets unsigned Parquet types. Say there's a Parquet file that was not written via Iceberg and uses unsigned 64-bit integers. I'm afraid the vectorized reader will simply allocate a vector for a signed long type [here](https://github.com/apache/iceberg/blob/main/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L594), and possibly read the value incorrectly as signed from the Parquet file. Compare this with [BaseParquetReaders](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetReaders.java#L197), which correctly throws an exception when it encounters an unsigned int64. This is an edge case, and fixing it in the Arrow reader is probably simple; what's more challenging is writing a test case for this scenario.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
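To make the suspected failure mode concrete, here is a minimal, self-contained sketch (not Iceberg code; the class and method names are hypothetical). Parquet stores an unsigned 64-bit value in the same eight bytes as a signed `int64`, so if the reader copies those bits into a signed long slot, any value above `Long.MAX_VALUE` appears negative unless the bits are reinterpreted as unsigned:

```java
import java.math.BigInteger;

// Illustrative sketch: why reading a Parquet unsigned int64 into a
// signed long vector misinterprets large values.
public class UnsignedInt64Demo {

    // Reinterpret the raw 64 bits as the unsigned value they encode.
    static BigInteger asUnsigned(long rawBits) {
        return new BigInteger(Long.toUnsignedString(rawBits));
    }

    public static void main(String[] args) {
        // Raw bits 0xFFFFFFFF_FFFFFFFF: the largest unsigned 64-bit value.
        long rawBits = -1L;

        // A signed-long view of the same bits reports -1 ...
        System.out.println("signed view:   " + rawBits);
        // ... while the correct unsigned interpretation is 2^64 - 1.
        System.out.println("unsigned view: " + asUnsigned(rawBits));
    }
}
```

This is also why the signed/unsigned confusion is silent rather than a crash: the bytes decode without error, only the interpretation is wrong, which is what makes throwing early (as `BaseParquetReaders` does) the safer behavior.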
