[ https://issues.apache.org/jira/browse/SPARK-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yin Huai updated SPARK-11868:
-----------------------------
    Target Version/s: 1.6.0

> Wrong results returned from a DataFrame created from Rows without a consistent
> schema in PySpark
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-11868
>                 URL: https://issues.apache.org/jira/browse/SPARK-11868
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.5.2
>         Environment: pyspark
>            Reporter: Yuval Tanny
>
> When the schema is inconsistent (but is the same for the first 10 rows), it is
> possible to create a DataFrame from dictionaries, and if a key is missing its
> value is None. But when creating a DataFrame from the corresponding Rows, we
> get incorrect behavior (values assigned to the wrong keys) without any
> exception. See the example below.
> The problems seem to be:
> 1. Rows are not all verified against the inferred schema.
> 2. In pyspark.sql.types._create_converter, None is set when converting a
> dictionary and a field does not exist:
> {code}
> return tuple([conv(d.get(name)) for name, conv in zip(names, converters)])
> {code}
> But for Rows, it is simply assumed that the number of fields in the tuple
> equals the number of fields in the inferred schema, and otherwise values are
> assigned to the wrong keys:
> {code}
> return tuple(conv(v) for v, conv in zip(obj, converters))
> {code}
> Thanks.
> Example:
> {code}
> dicts = [{'1': 1, '2': 2, '3': 3}] * 10 + [{'1': 1, '3': 3}]
> rows = [pyspark.sql.Row(**r) for r in dicts]
> rows_rdd = sc.parallelize(rows)
> dicts_rdd = sc.parallelize(dicts)
> rows_df = sqlContext.createDataFrame(rows_rdd)
> dicts_df = sqlContext.createDataFrame(dicts_rdd)
> print(rows_df.select(['2']).collect()[10])
> print(dicts_df.select(['2']).collect()[10])
> {code}
> Output:
> {code}
> Row(2=3)
> Row(2=None)
> {code}
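To make the misalignment concrete, and as a possible stopgap for anyone hitting this before a fix lands, here is a minimal sketch (it reuses sqlContext and rows_rdd from the example above; the workaround is illustrative, not the committed fix): because Row(**kwargs) sorts its field names, the eleventh Row has only the fields ('1', '3'), and zipping its values positionally against the three inferred converters shifts the value meant for column '3' into column '2'. Converting each Row back to a dict routes conversion through the by-name dict path quoted above, which fills missing keys with None.

{code}
import pyspark

# Row(**kwargs) sorts field names alphabetically, so this Row has fields
# ('1', '3') and values (1, 3). The schema inferred from the first 10 rows
# has fields ('1', '2', '3'); zip(obj, converters) pairs values by position,
# so the value 3 (intended for column '3') lands in column '2' -> Row(2=3).
short_row = pyspark.sql.Row(**{'1': 1, '3': 3})

# Possible workaround (a sketch against the 1.5.x API, not the actual fix):
# normalize every Row to a dict with asDict(), so the converter looks fields
# up by name and missing keys become None instead of shifting values.
fixed_df = sqlContext.createDataFrame(rows_rdd.map(lambda r: r.asDict()))
print(fixed_df.select(['2']).collect()[10])  # expected: Row(2=None)
{code}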