Thomas Buhrmann created ARROW-16843:
---------------------------------------

             Summary: [Python][CSV] CSV reader performs unsafe type conversion
                 Key: ARROW-16843
                 URL: https://issues.apache.org/jira/browse/ARROW-16843
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 8.0.0
            Reporter: Thomas Buhrmann


Hi, I've noticed that although pa.scalar and pa.array behave correctly when 
given the largest possible (uint64) value (i.e. they fail, as they should, when 
asked to cast it to float), the CSV reader happily converts strings representing 
uint64 values to float (see example below). Is this intended? Would it be 
possible to have a safe-conversion-only option?

The problem is that, at the moment, the only safe way to read a CSV whose 
column types are not known in advance is to disable type conversion entirely 
(read everything as string) and perform the type inference oneself.
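A minimal sketch of that workaround (the header peeking and the in-memory bytes are just illustrative; forcing every column to string via pa.csv.ConvertOptions is what disables inference):
{code:java}
import io
import pyarrow as pa
import pyarrow.csv

data = b"int64,uint64\n0,0\n4294967295,18446744073709551615"

# Peek at the header to learn the column names, then force every column
# to string so the reader performs no (potentially lossy) inference.
names = data.split(b"\n", 1)[0].decode("utf-8").split(",")
opts = pa.csv.ConvertOptions(column_types={name: pa.string() for name in names})
tbl = pa.csv.read_csv(io.BytesIO(data), convert_options=opts)
print(tbl.schema)  # both columns come back as string
{code}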

It would be okay if UInt64 types couldn't be inferred, as long as the 
corresponding columns aren't destructively coerced to float. If they were left 
as string columns instead, one could implement a custom conversion while still 
benefiting from the correct and automatic conversion of the remaining columns.
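To illustrate, a rough sketch of such a per-column fallback conversion (the helper name try_int_cast and the fallback order are just illustrative):
{code:java}
import pyarrow as pa

def try_int_cast(string_column):
    # Try int64 first, then fall back to uint64; keep the strings if nothing fits.
    for typ in (pa.int64(), pa.uint64()):
        try:
            return string_column.cast(typ)
        except pa.ArrowInvalid:
            continue
    return string_column

col = pa.array(["0", "18446744073709551615"], type=pa.string())
print(try_int_cast(col).type)  # uint64
{code}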

 

The following correctly rejects the float type for uint64 values:
{code:java}
import pyarrow as pa

uint64_max = 18_446_744_073_709_551_615

# Constructing a uint64 scalar/array from the maximum value works fine
type_ = pa.uint64()
uint64_scalar = pa.scalar(uint64_max, type=type_)
uint64_array = pa.array([uint64_max], type=type_)

# But converting it (or even uint64_max // 2) to float64 is rejected
try:
    f = pa.scalar(uint64_max, type=pa.float64())
except Exception as exc:
    print(exc)

try:
    f = pa.scalar(uint64_max // 2, type=pa.float64())
except Exception as exc:
    print(exc) {code}
{code:java}
>> PyLong is too large to fit int64
>> Integer value 9223372036854775807 is outside of the range exactly 
>> representable by a IEEE 754 double precision value
{code}
The CSV reader, on the other hand, doesn't infer UInt64 types (which is fine, 
as documented at https://arrow.apache.org/docs/cpp/csv.html#data-types), but it 
does coerce values to float that shouldn't be coercible according to the 
examples above:
{code:java}
import io
import pyarrow.csv  # needed: "import pyarrow" alone doesn't load the csv submodule

csv = "int64,uint64\n0,0\n4294967295,18446744073709551615"
tbl = pa.csv.read_csv(io.BytesIO(csv.encode("utf-8")))

print(tbl.schema)
# Compare against the uint64_scalar constructed in the first snippet
print(tbl.column("uint64")[1] == uint64_scalar)
print(tbl.column("uint64")[1].cast(pa.uint64())) {code}
{code:java}
int64: int64
uint64: double

False
0
{code}
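For completeness, the precision loss is also visible directly on the Python side (a small sketch; the printed value assumes standard IEEE 754 doubles):
{code:java}
coerced = tbl.column("uint64")[1].as_py()  # plain Python float after the lossy coercion
print(coerced)                     # 1.8446744073709552e+19
print(int(coerced) == uint64_max)  # False: the round trip is off by one
{code}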


