[ https://issues.apache.org/jira/browse/ARROW-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555180#comment-17555180 ]

Thomas Buhrmann commented on ARROW-16843:
-----------------------------------------

Or even expose the type inference itself in some way, so one could simply read 
all columns as strings and then apply the underlying type inference column by 
column, combined with additional custom logic. I'm currently building such an 
extra inference layer myself, e.g. one that also infers list types from string 
columns, parses timestamps with non-ISO formats, downcasts ints to the smallest 
possible type, etc. (the uint64 case is the only "problem" I've hit so far, fwiw).

> [Python][CSV] CSV reader performs unsafe type conversion
> --------------------------------------------------------
>
>                 Key: ARROW-16843
>                 URL: https://issues.apache.org/jira/browse/ARROW-16843
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 8.0.0
>            Reporter: Thomas Buhrmann
>            Priority: Major
>
> Hi, I've noticed that although pa.scalar and pa.array behave correctly when 
> given the largest possible uint64 value (i.e. they fail, as they should, when 
> asked to cast it to float), the CSV reader happily converts strings 
> representing uint64 values to float (see example below). Is this intended? 
> Would it be possible to have a safe-conversion-only option?
> The problem is that at the moment the only safe way to read a CSV whose 
> types are not known in advance is to read it without any conversion (strings 
> only) and perform the type inference oneself.
> It would be OK if UInt64 types couldn't be inferred, as long as the 
> corresponding columns aren't destructively coerced to float. If they were 
> left as string columns, one could implement a custom conversion for them 
> while still benefiting from the correct, automatic conversion of the 
> remaining columns.
>  
> The following correctly rejects the float type for uint64 values:
> {code:python}
> import pyarrow as pa
> uint64_max = 18_446_744_073_709_551_615
> type_ = pa.uint64()
> uint64_scalar = pa.scalar(uint64_max, type=type_)
> uint64_array = pa.array([uint64_max], type=type_)
> try:
>     f = pa.scalar(uint64_max, type=pa.float64())
> except Exception as exc:
>     print(exc)
>     
> try:
>     f = pa.scalar(uint64_max // 2, type=pa.float64())
> except Exception as exc:
>     print(exc)
> {code}
> {code}
> >> PyLong is too large to fit int64
> >> Integer value 9223372036854775807 is outside of the range exactly representable by a IEEE 754 double precision value
> {code}
> The CSV reader, on the other hand, doesn't infer UInt64 types (which is fine, 
> and documented here: https://arrow.apache.org/docs/cpp/csv.html#data-types), 
> but it does coerce values to float that shouldn't be coercible according to 
> the above examples:
> {code:python}
> import io
> import pyarrow.csv  # needed below for pa.csv.read_csv
> csv = "int64,uint64\n0,0\n4294967295,18446744073709551615"
> tbl = pa.csv.read_csv(io.BytesIO(csv.encode("utf-8")))
> print(tbl.schema)
> print(tbl.column("uint64")[1] == uint64_scalar)
> print(tbl.column("uint64")[1].cast(pa.uint64()))
> {code}
> {code}
> int64: int64
> uint64: double
> False
> 0
> {code}
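>  
> In the meantime, a minimal workaround sketch (assuming, for illustration, 
> that the affected column name is already known, which real inference of 
> course wouldn't have): force that column to string via ConvertOptions, then 
> cast it explicitly, which is lossless:
> {code:python}
> import io
> import pyarrow as pa
> import pyarrow.csv
> data = b"int64,uint64\n0,0\n4294967295,18446744073709551615"
> # Forcing the column to string prevents the reader's lossy float coercion.
> opts = pa.csv.ConvertOptions(column_types={"uint64": pa.string()})
> tbl = pa.csv.read_csv(io.BytesIO(data), convert_options=opts)
> # An explicit cast parses the strings as uint64 without loss.
> print(tbl.column("uint64").cast(pa.uint64())[1])  # 18446744073709551615
> {code}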


