[ https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486571#comment-16486571 ]

Joel Croteau commented on SPARK-24357:
--------------------------------------

Fair enough; here is some code to reproduce it:
{code:python}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=1 << 65)]  # value too wide for a 64-bit long


def init_session():
    builder = SparkSession.builder.appName("Demonstrate integer overflow")
    return builder.getOrCreate()


def main():
    spark = init_session()
    frame = spark.createDataFrame(TEST_DATA)
    frame.printSchema()
    print(frame.collect())


if __name__ == '__main__':
    main()

{code}
This should either infer a type that can hold the 1 << 65 value from TEST_DATA, 
or raise a runtime error while inferring the schema or serializing the data. 
Instead, this is the actual output:
{noformat}
root
 |-- data: long (nullable = true)

[Row(data=None)]
{noformat}
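
For comparison, the value does round-trip if the schema is supplied explicitly 
rather than inferred. A minimal sketch (my own workaround, not anything Spark 
does today; it assumes the values are wrapped in decimal.Decimal, since an 
explicit DecimalType field expects Decimal objects, and precision 38 easily 
holds the 20-digit value):
{code:python}
from decimal import Decimal

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import DecimalType, StructField, StructType

# Explicit schema: DecimalType(38, 0) is wide enough for 1 << 65 (20 digits).
SCHEMA = StructType([StructField("data", DecimalType(38, 0))])
TEST_DATA = [Row(data=Decimal(1 << 65))]


def main():
    spark = SparkSession.builder.appName("Explicit DecimalType schema").getOrCreate()
    frame = spark.createDataFrame(TEST_DATA, schema=SCHEMA)
    frame.printSchema()
    # Should print the full 1 << 65 value rather than None.
    print(frame.collect())


if __name__ == '__main__':
    main()
{code}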

> createDataFrame in Python infers large integers as long type and then fails 
> silently when converting them
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24357
>                 URL: https://issues.apache.org/jira/browse/SPARK-24357
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Joel Croteau
>            Priority: Major
>
> When inferring the schema type of an RDD passed to createDataFrame, PySpark 
> SQL will infer any integral type as a LongType, which is a 64-bit integer, 
> without actually checking whether the values will fit into a 64-bit slot. If 
> the values are larger than 64 bits, then when pickled and unpickled in Java, 
> Unpickler will convert them to BigIntegers. When applySchemaToPythonRDD is 
> called, it will ignore the BigInteger type and return null. This results in 
> any large integers in the resulting DataFrame being silently converted to 
> None. This can create some very surprising and difficult-to-debug behavior, 
> in particular if you are not aware of this limitation. There should either be 
> a runtime error at some point in this conversion chain, or else _infer_type 
> should infer larger integers as DecimalType with appropriate precision, or as 
> BinaryType. The former would be less convenient, but the latter may be 
> problematic to implement in practice. In any case, we should stop silently 
> converting large integers to None.
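
For illustration only, the missing range check described above would amount to 
something like the following (a hedged sketch with a made-up helper name, not 
Spark's actual internals):
{code:python}
# Bounds of Spark SQL's 64-bit LongType.
LONG_MIN, LONG_MAX = -(1 << 63), (1 << 63) - 1


def fits_in_long(value):
    """Return True if a Python int fits in a 64-bit LongType slot."""
    return LONG_MIN <= value <= LONG_MAX


assert fits_in_long((1 << 63) - 1)  # largest LongType value
assert not fits_in_long(1 << 65)    # would need DecimalType (or an error)
{code}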


