[ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Don Drake updated SPARK-5722:
-----------------------------
    Summary: Infer_schema_type incorrect for Integers in pyspark  (was: Infer_schma_type incorrect for Integers in pyspark)

> Infer_schema_type incorrect for Integers in pyspark
> ---------------------------------------------------
>
>                 Key: SPARK-5722
>                 URL: https://issues.apache.org/jira/browse/SPARK-5722
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>            Reporter: Don Drake
>
> Python's integer type does not match the Scala/Java definition of an
> Integer. Schema and data type inference therefore fails for values beyond
> the Java Integer range (larger than 2^31 - 1), which are incorrectly
> inferred as Integer.
> Because the range of valid Python integers is wider than that of Java
> Integers, inference cannot reliably tell Integer from Long. This causes
> problems when attempting to save a SchemaRDD as Parquet or JSON.
> Here's an example:
> >>> sqlCtx = SQLContext(sc)
> >>> from pyspark.sql import Row
> >>> rdd = sc.parallelize([Row(f1='a', f2=100000000000000)])
> >>> srdd = sqlCtx.inferSchema(rdd)
> >>> srdd.schema()
> StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
> That number needs a LongType in Java, but it is an Integer in Python. We
> need to check the value to see if it should really be a LongType when an
> IntegerType is initially inferred.
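
The value check proposed above could look roughly like the following. This is only a sketch under stated assumptions, not the actual PySpark change: `infer_integral_type` and the `JAVA_*` constants are hypothetical names, the function returns type names as plain strings rather than Spark SQL type objects, and it is written for Python 3, where there is no separate `long` type.

```python
# Hypothetical helper, NOT PySpark's _infer_type: it picks an integral type
# name from the value's range instead of from its Python type.

JAVA_INT_MIN, JAVA_INT_MAX = -2**31, 2**31 - 1      # 32-bit signed (Java int)
JAVA_LONG_MIN, JAVA_LONG_MAX = -2**63, 2**63 - 1    # 64-bit signed (Java long)


def infer_integral_type(value):
    """Return the name of the narrowest Java integral type that holds value."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("expected an integer, got %s" % type(value).__name__)
    if JAVA_INT_MIN <= value <= JAVA_INT_MAX:
        return "IntegerType"
    if JAVA_LONG_MIN <= value <= JAVA_LONG_MAX:
        return "LongType"
    # Python integers are unbounded; no Java primitive can hold this value.
    raise ValueError("%d does not fit in a 64-bit signed long" % value)


if __name__ == "__main__":
    for n in (1, 2**31 - 1, 2**31, 2**61, 100000000000000):
        print(n, infer_integral_type(n))
```

With a range check like this, the value from the example (100000000000000) would map to LongType. Values outside the 64-bit range raise an error in this sketch; how the real inference should treat them is a separate design decision.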
> More tests:
> >>> from pyspark.sql import _infer_type
> # OK
> >>> print _infer_type(1)
> IntegerType
> # OK
> >>> print _infer_type(2**31-1)
> IntegerType
> # WRONG
> >>> print _infer_type(2**31)
> IntegerType
> # WRONG
> >>> print _infer_type(2**61)
> IntegerType
> # OK
> >>> print _infer_type(2**71)
> LongType
> Java Primitive Types defined:
> http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
> Python Built-in Types:
> https://docs.python.org/2/library/stdtypes.html#typesnumeric

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)