[ https://issues.apache.org/jira/browse/SPARK-16542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiang Gao updated SPARK-16542:
------------------------------
    Description:
This is a bug about types that results in an array of nulls when creating a DataFrame using Python.

Python's array.array has a richer set of element types than Python itself, e.g. we can have array('f',[1,2,3]) and array('d',[1,2,3]). The code in spark-sql does not take this into consideration, so you can get an array of null values when you have array('f') in your rows.

A simple script to reproduce this is:

{code:title=test.py|borderStyle=solid}
from pyspark import SparkContext
from pyspark.sql import SQLContext,Row,DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))
rows = sc.parallelize([ row1 ])
df = sqlContext.createDataFrame(rows)
df.show()
{code}

which produces the output

{code}
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
{code}

  was:
This is a bug about types that results in an array of nulls when creating a dataframe using Python.

Python's array.array has a richer set of element types than Python itself, e.g. we can have array('f',[1,2,3]) and array('d',[1,2,3]). The code in spark-sql does not take this into consideration, so you can get an array of null values when you have array('f') in your rows.

A simple script to reproduce this is:

{code:title=test.py|borderStyle=solid}
from pyspark import SparkContext
from pyspark.sql import SQLContext,Row,DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))
rows = sc.parallelize([ row1 ])
df = sqlContext.createDataFrame(rows)
df.show()
{code}

which produces the output

{code}
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
{code}


> bugs about types that result an array of null when creating dataframe using
> python
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-16542
>                 URL: https://issues.apache.org/jira/browse/SPARK-16542
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>            Reporter: Xiang Gao
>
> This is a bug about types that results in an array of nulls when creating a
> DataFrame using Python.
> Python's array.array has a richer set of element types than Python itself,
> e.g. we can have array('f',[1,2,3]) and array('d',[1,2,3]). The code in
> spark-sql does not take this into consideration, so you can get an array of
> null values when you have array('f') in your rows.
> A simple script to reproduce this is:
> {code:title=test.py|borderStyle=solid}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext,Row,DataFrame
> from array import array
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))
> rows = sc.parallelize([ row1 ])
> df = sqlContext.createDataFrame(rows)
> df.show()
> {code}
> which produces the output
> {code}
> +---------------+------------------+
> |    doublearray|        floatarray|
> +---------------+------------------+
> |[1.0, 2.0, 3.0]|[null, null, null]|
> +---------------+------------------+
> {code}
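
The report turns on the fact that array('f') and array('d') both hold Python floats but carry different type codes. The sketch below is not part of the original report: it inspects the two type codes and shows one possible workaround, converting an array.array to a plain list before it is placed in a Row. The helper name to_plain_list is hypothetical, and the workaround assumes that a plain list of Python floats is inferred correctly by createDataFrame; it has not been checked against the affected Spark version.

```python
from array import array


def to_plain_list(values):
    """Hypothetical helper: turn an array.array into a plain list.

    Any other value is returned unchanged.
    """
    if isinstance(values, array):
        return values.tolist()
    return values


floats = array('f', [1, 2, 3])   # single-precision storage
doubles = array('d', [1, 2, 3])  # double-precision storage

# The type code and item size are what distinguish the two arrays.
print(floats.typecode, floats.itemsize)
print(doubles.typecode, doubles.itemsize)

# After conversion both are ordinary lists of Python floats.
plain_floats = to_plain_list(floats)
plain_doubles = to_plain_list(doubles)
print(plain_floats)
print(plain_doubles)
```

In the reproduction script, the row could then be built as Row(floatarray=to_plain_list(array('f',[1,2,3])), doublearray=to_plain_list(array('d',[1,2,3]))), under the assumption stated above.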