[ https://issues.apache.org/jira/browse/ARROW-16843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555172#comment-17555172 ]

Thomas Buhrmann commented on ARROW-16843:
-----------------------------------------

You're right, this also performs destructive conversion:
{code:java}
pa.scalar("18446744073709551615").cast(pa.float64()) {code}
{noformat}
>> <pyarrow.DoubleScalar: 1.8446744073709552e+19>
{noformat}
This is why I think it would be good to have an option not to perform certain 
conversions automatically when they have the potential to be destructive (in 
the sense that one cannot cast back to string or another type without loss of 
information), even if the default behaviour remains the destructive one. E.g. 
it is quite common to have ID columns in the uint64 range, which at the moment 
cannot be read using the CSV reader (without disabling all type inference).

Another possibility would be to pass a list of inferrable types (so one could 
exclude float64), in addition to the explicit [column_types 
parameter|https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow3csv14ConvertOptions12column_typesE].
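
For reference, the existing column_types workaround already helps when the 
problematic columns are known in advance (a minimal sketch; the column name "id" 
is just an example, and this doesn't address the inference itself):
{code:java}
import io
import pyarrow as pa
import pyarrow.csv as pcsv

data = b"id,value\n18446744073709551615,1.5\n"

# Explicitly typing the known ID column avoids the lossy float64 inference,
# but this requires knowing the column names and types up front.
tbl = pcsv.read_csv(
    io.BytesIO(data),
    convert_options=pcsv.ConvertOptions(column_types={"id": pa.uint64()}),
)
print(tbl.schema)           # id: uint64, value: double
print(tbl.column("id")[0])  # 18446744073709551615
{code}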

> [Python][CSV] CSV reader performs unsafe type conversion
> --------------------------------------------------------
>
>                 Key: ARROW-16843
>                 URL: https://issues.apache.org/jira/browse/ARROW-16843
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 8.0.0
>            Reporter: Thomas Buhrmann
>            Priority: Major
>
> Hi, I've noticed that although pa.scalar and pa.array behave correctly when 
> given the largest possible (uint64) value (e.g. they correctly fail when 
> asked to cast it to float), the CSV reader happily converts strings 
> representing uint64 values to float (see example below). Is this intended? 
> Would it be possible to have a safe-conversion-only option?
> The problem is that at the moment the only safe option to read a CSV whose 
> types are not known in advance is to read without any conversion (string 
> only) and perform the type inference oneself.
> It would be ok if UInt64 types couldn't be inferred, as long as the 
> corresponding columns aren't coerced in a destructive manner to float. I.e., 
> if they were left as string columns, one could then implement a custom 
> conversion, while still benefiting from the correct and automatic conversion 
> of the remaining columns.
>  
> The following correctly rejects the float type for uint64 values:
> {code:java}
> import pyarrow as pa
> uint64_max = 18_446_744_073_709_551_615
> type_ = pa.uint64()
> uint64_scalar = pa.scalar(uint64_max, type=type_)
> uint64_array = pa.array([uint64_max], type=type_)
> try:
>     f = pa.scalar(uint64_max, type=pa.float64())
> except Exception as exc:
>     print(exc)
>     
> try:
>     f = pa.scalar(uint64_max // 2, type=pa.float64())
> except Exception as exc:
>     print(exc) {code}
> {code:java}
> >> PyLong is too large to fit int64
> >> Integer value 9223372036854775807 is outside of the range exactly 
> >> representable by a IEEE 754 double precision value
> {code}
> The CSV reader, on the other hand, doesn't infer UInt64 types (which is fine, 
> as documented at 
> [https://arrow.apache.org/docs/cpp/csv.html#data-types]), 
> but it does coerce values to float that shouldn't be coercible according to 
> the above examples:
> {code:java}
> import io
> csv = "int64,uint64\n0,0\n4294967295,18446744073709551615"
> tbl = pa.csv.read_csv(io.BytesIO(csv.encode("utf-8")))
> print(tbl.schema)
> print(tbl.column("uint64")[1] == uint64_scalar)
> print(tbl.column("uint64")[1].cast(pa.uint64())) {code}
> {code:java}
> int64: int64
> uint64: double
> False
> 0
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
