[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

Davies Liu (JIRA) Wed, 08 Oct 2014 16:00:54 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164342#comment-14164342
 ]


Davies Liu commented on SPARK-2870:
-----------------------------------

[~nchammas] There is a difference between inferSchema() and jsonRDD(): in 
inferSchema() we try to support both nested StructType (fields may have 
different types) and MapType (all values have the same type), but jsonRDD() 
cannot support MapType.

Maybe we can make a smart guess for each dict during inferSchema(): if all the 
keys are strings and the values have more than one type, we use StructType for 
it; otherwise we use MapType.
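
A minimal sketch of that guess, in plain Python so it runs without Spark. The 
function name guess_dict_type and the "struct"/"map" return labels are 
hypothetical stand-ins for illustration, not part of the PySpark API; a real 
implementation would return StructType or MapType instances.

```python
def guess_dict_type(d):
    """Guess whether a dict looks like a struct or a map.

    Hypothetical helper: returns the label "struct" or "map" rather than
    real Spark SQL type objects.
    """
    # Struct fields are named, so every key must be a string.
    keys_are_strings = all(isinstance(k, str) for k in d)
    # Collect the distinct Python types seen among the values.
    value_types = {type(v) for v in d.values()}
    # Mixed value types under string keys suggest named fields (a struct);
    # anything else is treated as a map with uniform values.
    if keys_are_strings and len(value_types) > 1:
        return "struct"
    return "map"
```

Under this rule a dict such as {"name": "cow", "legs": 4} would be treated as a 
struct, while {"a": 1, "b": 2} would be treated as a map. Note that a record 
whose fields all happen to share one type would also be guessed as a map, which 
is a known weakness of looking at a single dict in isolation.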

> Thorough schema inference directly on RDDs of Python dictionaries
> -----------------------------------------------------------------
>
>                 Key: SPARK-2870
>                 URL: https://issues.apache.org/jira/browse/SPARK-2870
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>            Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQLContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}}s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.
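
The use case in the quoted issue can be sketched in plain Python, with a list 
standing in for the RDD. The helper names parse_or_none and merge_field_types 
are hypothetical, and the "schema" here is just a mapping from field name to 
the set of Python type names observed across all records; it is meant to show 
why inference has to scan the whole data set, not how Spark implements it.

```python
import json


def parse_or_none(line):
    """Deserialize one JSON line; return None for bad elements."""
    try:
        obj = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    # Minimal schema validation: keep only JSON objects (Python dicts).
    return obj if isinstance(obj, dict) else None


def merge_field_types(records):
    """Union the value types seen for each field across every record."""
    schema = {}
    for rec in records:
        for key, value in rec.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


lines = ['{"a": 5}', '{"a": "cow"}', 'not json', '{"a": 7, "b": null}']
records = [r for r in map(parse_or_none, lines) if r is not None]
schema = merge_field_types(records)
```

Looking only at the first record would report field a as an integer and miss 
both the string value and field b; merging over every record captures them.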



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

