[ https://issues.apache.org/jira/browse/SPARK-16542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiang Gao updated SPARK-16542:
------------------------------
    Description:
This is a bug about types that results in an array of nulls when creating a DataFrame from Python.

Python's array.array has a richer set of element types than Python itself, e.g. we can have array('f',[1,2,3]) and array('d',[1,2,3]). The code in spark-sql does not take this into account, so you can get an array of null values when a row contains an array('f').

A simple script to reproduce this is:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext,Row,DataFrame
    from array import array

    sc = SparkContext()
    sqlContext = SQLContext(sc)

    row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))
    rows = sc.parallelize([ row1 ])
    df = sqlContext.createDataFrame(rows)
    df.show()

which produces the output

    +---------------+------------------+
    |    doublearray|        floatarray|
    +---------------+------------------+
    |[1.0, 2.0, 3.0]|[null, null, null]|
    +---------------+------------------+

  was:
The same description; only the placement of the {{ }} monospace markup around the reproduction code differed.

> bugs about types that result an array of null when creating dataframe using python
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-16542
>                 URL: https://issues.apache.org/jira/browse/SPARK-16542
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>            Reporter: Xiang Gao

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
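
As a possible workaround (not part of the report, and not verified against the affected Spark version), the array.array values from the reproduction above could be converted to plain lists before the Row is built, so that every element is an ordinary Python float. The sketch below runs without Spark; the helper name arrays_to_lists is hypothetical, and the expectation that Spark then treats the float column like the double column is an assumption.

```python
from array import array


def arrays_to_lists(fields):
    """Return a copy of `fields` with every array.array value turned into a list.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    return {
        name: value.tolist() if isinstance(value, array) else value
        for name, value in fields.items()
    }


# The same data as in the report: one single-precision and one
# double-precision array.
floatarray = array('f', [1, 2, 3])
doublearray = array('d', [1, 2, 3])

# The typecode is what distinguishes the two; itemsize is the element
# width in bytes on the current platform.
print(floatarray.typecode, floatarray.itemsize)
print(doublearray.typecode, doublearray.itemsize)

# After conversion both values are lists of ordinary Python floats.
plain = arrays_to_lists({"floatarray": floatarray, "doublearray": doublearray})
print(plain)
```

The converted dictionary could then be passed on as Row(**plain) in place of row1 in the script above. Converting with tolist() gives up the compact array storage, so this is only a stopgap on the Python side.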