[ 
https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24357.
----------------------------------
    Resolution: Incomplete

> createDataFrame in Python infers large integers as long type and then fails 
> silently when converting them
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24357
>                 URL: https://issues.apache.org/jira/browse/SPARK-24357
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Joel Croteau
>            Priority: Major
>              Labels: bulk-closed
>
> When inferring the schema of an RDD passed to createDataFrame, PySpark 
> SQL infers any integral value as a LongType, a signed 64-bit integer, 
> without checking whether the value actually fits in 64 bits. If a value 
> is larger than 64 bits, the Java-side Unpickler converts it to a 
> BigInteger when the data is pickled and unpickled, and 
> applySchemaToPythonRDD then ignores the BigInteger type and returns 
> null. As a result, any large integers in the resulting DataFrame are 
> silently converted to None. This can produce very surprising and 
> difficult-to-debug behavior, particularly if you are unaware of the 
> limitation. There should either be a runtime error at some point in 
> this conversion chain, or else _infer_type should infer larger integers 
> as DecimalType with appropriate precision, or as BinaryType. The former 
> would be less convenient, but the latter may be problematic to 
> implement in practice. In any case, we should stop silently converting 
> large integers to None.
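To make the failure mode concrete, here is a minimal, PySpark-free sketch of the bounds check that _infer_type skips. The `fits_in_long` helper is hypothetical (not a Spark API); it only illustrates the signed 64-bit range that Spark's LongType can represent, and why a value like 2**100 ends up as null after the Java-side round trip described above.

```python
# Sketch only: plain Python, no PySpark required. Spark's LongType is a
# signed 64-bit integer, so any value outside [-2**63, 2**63 - 1] cannot
# be represented and, per this report, comes back as None/null.

LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1

def fits_in_long(value: int) -> bool:
    """Return True if `value` is representable as Spark's LongType.

    Hypothetical helper: this check is what _infer_type could perform
    before committing to LongType for an inferred integer column.
    """
    return LONG_MIN <= value <= LONG_MAX

assert fits_in_long(2**63 - 1)    # largest LongType value: fine
assert not fits_in_long(2**63)    # one past the max: out of range
assert not fits_in_long(2**100)   # the silently-corrupted case reported here
```

With the reported behavior, `spark.createDataFrame([(2**100,)])` would infer LongType and yield a row containing None rather than raising an error; a check like the one above is one place the conversion chain could fail fast instead.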



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
