[ https://issues.apache.org/jira/browse/SPARK-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yin Huai updated SPARK-11868:
-----------------------------
    Target Version/s: 1.6.0

> Wrong results returned from a DataFrame created from Rows without a consistent
> schema in PySpark
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-11868
>                 URL: https://issues.apache.org/jira/browse/SPARK-11868
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.5.2
>         Environment: pyspark
>            Reporter: Yuval Tanny
>
> When the schema is inconsistent (but is the same for the first 10 rows), it is
> possible to create a DataFrame from dictionaries, and if a key is missing its
> value is None. But when creating a DataFrame from the corresponding Rows, we
> get incorrect behavior (values assigned to the wrong keys) without any
> exception. See the example below.
> The problems seem to be:
> 1. Rows are not all verified against the inferred schema.
> 2. In pyspark.sql.types._create_converter, None is set when converting a
> dictionary and a field does not exist:
> {code}
> return tuple([conv(d.get(name)) for name, conv in zip(names, converters)])
> {code}
> But for Rows, it is simply assumed that the number of fields in the tuple
> equals the number of fields in the inferred schema, and otherwise values are
> assigned to the wrong keys:
> {code}
> return tuple(conv(v) for v, conv in zip(obj, converters))
> {code}
> Thanks.
> Example:
> {code}
> dicts = [{'1': 1, '2': 2, '3': 3}] * 10 + [{'1': 1, '3': 3}]
> rows = [pyspark.sql.Row(**r) for r in dicts]
> rows_rdd = sc.parallelize(rows)
> dicts_rdd = sc.parallelize(dicts)
> rows_df = sqlContext.createDataFrame(rows_rdd)
> dicts_df = sqlContext.createDataFrame(dicts_rdd)
> print(rows_df.select(['2']).collect()[10])
> print(dicts_df.select(['2']).collect()[10])
> {code}
> Output:
> {code}
> Row(2=3)
> Row(2=None)
> {code}
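To make the misalignment concrete, and as a possible stopgap for anyone hitting this before a fix lands, here is a minimal sketch (it reuses sqlContext and rows_rdd from the example above; the workaround is illustrative, not the committed fix): because Row(**kwargs) sorts its field names, the eleventh Row has only the fields ('1', '3'), and zipping its values positionally against the three inferred converters shifts the value meant for column '3' into column '2'. Converting each Row back to a dict routes conversion through the by-name dict path quoted above, which fills missing keys with None.

{code}
import pyspark

# Row(**kwargs) sorts field names alphabetically, so this Row has fields
# ('1', '3') and values (1, 3). The schema inferred from the first 10 rows
# has fields ('1', '2', '3'); zip(obj, converters) pairs values by position,
# so the value 3 (intended for column '3') lands in column '2' -> Row(2=3).
short_row = pyspark.sql.Row(**{'1': 1, '3': 3})

# Possible workaround (a sketch against the 1.5.x API, not the actual fix):
# normalize every Row to a dict with asDict(), so the converter looks fields
# up by name and missing keys become None instead of shifting values.
fixed_df = sqlContext.createDataFrame(rows_rdd.map(lambda r: r.asDict()))
print(fixed_df.select(['2']).collect()[10])  # expected: Row(2=None)
{code}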