Parth Upadhyay created SPARK-43427:
--------------------------------------

             Summary: Unsigned integer types are deserialized as signed numeric equivalents
                 Key: SPARK-43427
                 URL: https://issues.apache.org/jira/browse/SPARK-43427
             Project: Spark
          Issue Type: Bug
          Components: Protobuf
    Affects Versions: 3.4.0
            Reporter: Parth Upadhyay
h2. Issue

Protobuf supports unsigned integer types, including `uint32` and `uint64`. When deserializing protobuf values with fields of these types, `uint32` is converted to `IntegerType` and `uint64` to `LongType` in the resulting Spark struct. `IntegerType` and `LongType` are [signed|https://spark.apache.org/docs/latest/sql-ref-datatypes.html] integer types, so this can lead to confusing results. Namely, if a stored `uint32` value is at or above 2^31, or a `uint64` value is at or above 2^63, its binary representation contains a 1 in the highest bit, which comes out negative when interpreted as a signed integer (i.e. it overflows).

I propose that we deserialize unsigned integer types into types that can represent them correctly, e.g.

* `uint32` => `LongType`
* `uint64` => `Decimal(20, 0)`

h2. Backwards Compatibility

Should we change the default behavior (with an option to go back to the old way), or should we keep the current default and add an option that allows deserializing these types differently? I think deserializing into the larger types by default makes more sense, since it is semantically correct. However, existing users of this library could be affected by the behavior change. We may be able to justify the change since the function is tagged as `Experimental`.

h2. Precedent

I believe unsigned integer types in Parquet are deserialized in a similar manner, i.e. placed into a larger type so that the unsigned representation fits natively. See https://issues.apache.org/jira/browse/SPARK-34817 and https://github.com/apache/spark/pull/31921.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
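As a standalone illustration of the overflow described above (plain Java rather than the connector's Scala, and independent of any Spark API), narrowing a `uint32` value above 2^31 to a signed 32-bit int flips it negative, while widening recovers it losslessly:

```java
import java.math.BigInteger;

public class UnsignedOverflowDemo {
    public static void main(String[] args) {
        // A uint32 value above 2^31, as it might appear in a proto field.
        long uint32Value = 3_000_000_000L;

        // Narrowing to a signed 32-bit int (what IntegerType stores) overflows:
        int asSignedInt = (int) uint32Value;
        System.out.println(asSignedInt); // -1294967296

        // Widening to a signed 64-bit long recovers the original value,
        // which is why the uint32 => LongType mapping is lossless.
        System.out.println(Integer.toUnsignedLong(asSignedInt)); // 3000000000

        // Similarly for uint64: the bit pattern of uint64 max (2^64 - 1)
        // is -1 as a signed long. A 20-digit unsigned decimal, the analogue
        // of Decimal(20, 0), holds it exactly.
        long uint64Value = -1L;
        System.out.println(Long.toUnsignedString(uint64Value)); // 18446744073709551615
        System.out.println(new BigInteger(Long.toUnsignedString(uint64Value)).toString().length()); // 20
    }
}
```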