[ https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24357.
----------------------------------
    Resolution: Incomplete

> createDataFrame in Python infers large integers as long type and then fails
> silently when converting them
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-24357
>                 URL: https://issues.apache.org/jira/browse/SPARK-24357
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Joel Croteau
>            Priority: Major
>              Labels: bulk-closed
>
> When inferring the schema type of an RDD passed to createDataFrame, PySpark
> SQL infers any integral value as a LongType, a 64-bit signed integer, without
> checking whether the value actually fits into a 64-bit slot. If a value is
> wider than 64 bits, then when it is pickled in Python and unpickled in Java,
> the Unpickler converts it to a BigInteger. When applySchemaToPythonRDD is
> called, it does not recognize the BigInteger type and returns null. As a
> result, any large integers in the resulting DataFrame are silently converted
> to None. This produces surprising and difficult-to-debug behavior, especially
> for users unaware of the limitation. There should either be a runtime error
> at some point in this conversion chain, or _infer_type should infer larger
> integers as DecimalType with appropriate precision, or as BinaryType. The
> former would be less convenient, but the latter may be problematic to
> implement in practice. In any case, we should stop silently converting large
> integers to None.
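The range check proposed in the description can be sketched in pure Python. This is a hypothetical helper illustrating what _infer_type could do before mapping a Python int to LongType; it is not actual PySpark internals, and the DecimalType fallback follows the issue's suggestion:

```python
# Bounds of a Java long (Spark SQL's LongType): a 64-bit signed integer.
LONG_MIN = -(2 ** 63)
LONG_MAX = 2 ** 63 - 1

def infer_integral_type(value: int) -> str:
    """Return the Spark SQL type name a Python int should map to.

    Hypothetical sketch: values that fit in 64 bits map to LongType;
    wider values map to a DecimalType with enough precision to hold
    them losslessly, instead of silently becoming null.
    """
    if LONG_MIN <= value <= LONG_MAX:
        return "LongType"
    # Values wider than 64 bits cannot survive the round-trip through a
    # Java long; a sufficiently wide DecimalType preserves them.
    precision = len(str(abs(value)))
    return f"DecimalType({precision}, 0)"

print(infer_integral_type(2 ** 62))  # fits in a Java long
print(infer_integral_type(2 ** 64))  # 20 digits, wider than 64 bits
```

Raising a runtime error at this point instead of returning a DecimalType would be the stricter alternative the description mentions; either way the overflow is surfaced rather than hidden.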