[ 
https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Don Drake updated SPARK-5722:
-----------------------------
    Description: 
Python's integer type does not match the Scala/Java definition of an integer.
Schema inference therefore goes wrong when a value is larger than 2^31 - 1
(the maximum Java int) and is still inferred as an Integer.

Because Python integers cover a wider range than Java Integers, inference
cannot reliably choose between the Integer and Long datatypes.  This causes
problems when attempting to save a SchemaRDD as Parquet or JSON.

Here's an example:
{code}
>>> sqlCtx = SQLContext(sc)
>>> from pyspark.sql import Row
>>> rdd = sc.parallelize([Row(f1='a', f2=100000000000000)])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.schema()
StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
{code}
That number is a LongType in Java, but an Integer in Python.  We need to check
the value to see if it should really be a LongType when an IntegerType is
initially inferred.
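
A minimal sketch of such a value check, in plain Python with no Spark
dependency.  The helper name infer_integral_type and the string return values
are hypothetical and only illustrate the idea; this is not the pyspark.sql
implementation, and values wider than 64 bits are not handled here.

```python
# Bounds of the JVM's signed 32-bit int.
JAVA_INT_MIN = -2**31
JAVA_INT_MAX = 2**31 - 1


def infer_integral_type(value):
    """Hypothetical helper: pick a type name by value, not by Python type."""
    if JAVA_INT_MIN <= value <= JAVA_INT_MAX:
        return "IntegerType"
    # Anything outside the 32-bit range is widened to a long.
    # Assumption: values beyond the signed 64-bit range are out of scope.
    return "LongType"


for sample in (1, 2**31 - 1, 2**31, 100000000000000, 2**61):
    print("%d -> %s" % (sample, infer_integral_type(sample)))
```

With a check like this, the f2 value from the example above would be reported
as LongType rather than IntegerType.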

More tests:
{code}
>>> from pyspark.sql import _infer_type
# OK
>>> print _infer_type(1)
IntegerType
# OK: 2**31-1 is the largest Java int
>>> print _infer_type(2**31-1)
IntegerType
# WRONG: 2**31 exceeds the Java int range
>>> print _infer_type(2**31)
IntegerType
# WRONG: 2**61 exceeds the Java int range
>>> print _infer_type(2**61)
IntegerType
# OK
>>> print _infer_type(2**71)
LongType
{code}
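
The boundaries probed by these tests can also be checked without Spark.  The
sketch below uses the standard-library struct module as a stand-in for the
JVM's fixed widths ("<i" is a 4-byte signed integer, "<q" an 8-byte signed
integer); the helper name fits is hypothetical.

```python
import struct


def fits(fmt, value):
    """Return True if value packs into the fixed-width struct format fmt."""
    try:
        struct.pack(fmt, value)
    except struct.error:
        return False
    return True


# Sample values taken from the tests in this ticket.
for value in (2**31 - 1, 2**31, 2**61):
    print("%d: int=%s long=%s" % (value, fits("<i", value), fits("<q", value)))
```

Only 2**31-1 fits the 4-byte width; 2**31 and 2**61 need the 8-byte width,
which is why IntegerType is the wrong answer for them.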

Java Primitive Types defined:
http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html

Python Built-in Types:
https://docs.python.org/2/library/stdtypes.html#typesnumeric



> Infer_schema_type incorrect for Integers in pyspark
> ---------------------------------------------------
>
>                 Key: SPARK-5722
>                 URL: https://issues.apache.org/jira/browse/SPARK-5722
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>            Reporter: Don Drake



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
