[ 
https://issues.apache.org/jira/browse/SPARK-43427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Upadhyay updated SPARK-43427:
-----------------------------------
    Description: 
I'm not sure if "bug" is the correct tag for this jira, but i've tagged it like 
that for now since the behavior seems odd, happy to update to "improvement" or 
something else based on the conversation!

h2. Issue

Protobuf supports unsigned integer types, including `uint32` and `uint64`. When 
deserializing protobuf values with fields of these types, uint32 is converted 
to `IntegerType` and uint64 is converted to `LongType` in the resulting spark 
struct. `IntegerType` and `LongType` are 
[signed|https://spark.apache.org/docs/latest/sql-ref-datatypes.html] integer 
types, so this can lead to confusing results.

Namely, if a uint32 value in a stored proto is above 2^31 or a uint64 value is 
above 2^63, their representation in binary will contain a 1 in the highest bit, 
which when interpreted as a signed integer will come out as negative (I.e. 
overflow).

I propose that we deserialize unsigned integer types into a type that can 
contain them correctly, e.g.
uint32 => `LongType`
uint64 => `Decimal(20, 0)`

h2. Backwards Compatibility / Default Behavior
Should we maintain backwards compatibility and we add an option that allows 
deserializing these types differently? Or should we change change the default 
behavior (with an option to go back to the old way)? 

I think by default it makes more sense to deserialize them as the larger types 
so that it's semantically more correct. However, there may be existing users of 
this library that would be affected by this behavior change. Though, maybe we 
can justify the change since the function is tagged as `Experimental` (and 
spark 3.4.0 was only released very recently).

h2. Precedent
I believe that unsigned integer types in parquet are deserialized in a similar 
manner, i.e. put into a larger type so that the unsigned representation 
natively fits. https://issues.apache.org/jira/browse/SPARK-34817 and 
https://github.com/apache/spark/pull/31921

  was:
h2. Issue

Protobuf supports unsigned integer types, including `uint32` and `uint64`. When 
deserializing protobuf values with fields of these types, uint32 is converted 
to `IntegerType` and uint64 is converted to `LongType` in the resulting spark 
struct. `IntegerType` and `LongType` are 
[signed|https://spark.apache.org/docs/latest/sql-ref-datatypes.html] integer 
types, so this can lead to confusing results.

Namely, if a uint32 value in a stored proto is above 2^31 or a uint64 value is 
above 2^63, their representation in binary will contain a 1 in the highest bit, 
which when interpreted as a signed integer will come out as negative (I.e. 
overflow).

I propose that we deserialize unsigned integer types into a type that can 
contain them correctly, e.g.
uint32 => `LongType`
uint64 => `Decimal(20, 0)`

h2. Backwards Compatibility / Default Behavior
Should we maintain backwards compatibility and we add an option that allows 
deserializing these types differently? Or should we change change the default 
behavior (with an option to go back to the old way)? 

I think by default it makes more sense to deserialize them as the larger types 
so that it's semantically more correct. However, there may be existing users of 
this library that would be affected by this behavior change. Though, maybe we 
can justify the change since the function is tagged as `Experimental` (and 
spark 3.4.0 was only released very recently).

h2. Precedent
I believe that unsigned integer types in parquet are deserialized in a similar 
manner, i.e. put into a larger type so that the unsigned representation 
natively fits. https://issues.apache.org/jira/browse/SPARK-34817 and 
https://github.com/apache/spark/pull/31921


> Unsigned integer types are deserialized as signed numeric equivalents
> ---------------------------------------------------------------------
>
>                 Key: SPARK-43427
>                 URL: https://issues.apache.org/jira/browse/SPARK-43427
>             Project: Spark
>          Issue Type: Bug
>          Components: Protobuf
>    Affects Versions: 3.4.0
>            Reporter: Parth Upadhyay
>            Priority: Major
>
> I'm not sure if "bug" is the correct tag for this jira, but i've tagged it 
> like that for now since the behavior seems odd, happy to update to 
> "improvement" or something else based on the conversation!
> h2. Issue
> Protobuf supports unsigned integer types, including `uint32` and `uint64`. 
> When deserializing protobuf values with fields of these types, uint32 is 
> converted to `IntegerType` and uint64 is converted to `LongType` in the 
> resulting spark struct. `IntegerType` and `LongType` are 
> [signed|https://spark.apache.org/docs/latest/sql-ref-datatypes.html] integer 
> types, so this can lead to confusing results.
> Namely, if a uint32 value in a stored proto is above 2^31 or a uint64 value 
> is above 2^63, their representation in binary will contain a 1 in the highest 
> bit, which when interpreted as a signed integer will come out as negative 
> (I.e. overflow).
> I propose that we deserialize unsigned integer types into a type that can 
> contain them correctly, e.g.
> uint32 => `LongType`
> uint64 => `Decimal(20, 0)`
> h2. Backwards Compatibility / Default Behavior
> Should we maintain backwards compatibility and we add an option that allows 
> deserializing these types differently? Or should we change change the default 
> behavior (with an option to go back to the old way)? 
> I think by default it makes more sense to deserialize them as the larger 
> types so that it's semantically more correct. However, there may be existing 
> users of this library that would be affected by this behavior change. Though, 
> maybe we can justify the change since the function is tagged as 
> `Experimental` (and spark 3.4.0 was only released very recently).
> h2. Precedent
> I believe that unsigned integer types in parquet are deserialized in a 
> similar manner, i.e. put into a larger type so that the unsigned 
> representation natively fits. 
> https://issues.apache.org/jira/browse/SPARK-34817 and 
> https://github.com/apache/spark/pull/31921



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to