[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156904#comment-14156904 ]
Nicholas Chammas commented on SPARK-2870:
-----------------------------------------

[~marmbrus] - A related feature that I think would be very important and useful is the ability to infer a complete schema as described here, but to do so by key, i.e. something like {{inferSchemaByKey()}}.

Say I have a large, single RDD of data that includes many different event types. I want to key the RDD by event type and make a single pass over it to get the schema for each event type. This would probably yield something like a {{keyedSchemaRDD}}, which I would then want to register as multiple tables (one table per key/schema) in one go. (A rough sketch of how this could be emulated with the current API is at the end of this message.)

Do you think this would be a useful feature? If so, should I track it in a separate JIRA issue?

> Thorough schema inference directly on RDDs of Python dictionaries
> -----------------------------------------------------------------
>
>                 Key: SPARK-2870
>                 URL: https://issues.apache.org/jira/browse/SPARK-2870
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>            Reporter: Nicholas Chammas
>
> h4. Background
>
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. They process JSON text directly and infer a schema that covers the entire source data set.
>
> This is very important with semi-structured data like JSON, since individual elements in the data set are free to have different structures. Matching fields across elements may even have different value types. For example:
>
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
>
> To get a queryable schema that covers the whole data set, you need to infer a schema by looking at the whole data set. The aforementioned {{SQLContext.json...()}} methods do this very well.
>
> h4. Feature Request
>
> What we need is for {{SQLContext.inferSchema()}} to do this, too. Alternatively, we need a new {{SQLContext}} method that works on RDDs of Python dictionaries and does something functionally equivalent to this:
>
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
>
> As of 1.0.2, [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] just looks at the first element in the data set. This won't help much when the structure of the elements in the target RDD is variable.
>
> h4. Example Use Case
>
> * You have some JSON text data that you want to analyze using Spark SQL.
> * You would use one of the {{SQLContext.json...()}} methods, but you need to do some filtering on the data first to remove bad elements--basically, some minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}}s and filter out the bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole data set.
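For reference, here is a rough sketch of how per-key schema inference could be emulated with the current API: collect the distinct keys, then round-trip each key's records through JSON text so {{jsonRDD()}} can infer a full schema for them. Note that this makes one pass over the data per key rather than the single pass an {{inferSchemaByKey()}} would make, which is exactly the inefficiency the proposed feature would avoid. The sample records, the {{type}} key field, and the table-naming scheme are all made up for illustration:

{code}
import json

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="per-key-schema-sketch")
sqlContext = SQLContext(sc)

# Hypothetical data set mixing several event types; "type" is the key field.
events = sc.parallelize([
    {"type": "click", "x": 10, "y": 20},
    {"type": "click", "x": 3, "y": 4, "button": "left"},
    {"type": "purchase", "amount": 9.99, "item": "book"},
])

# First pass: discover the distinct keys (event types).
event_types = events.map(lambda e: e["type"]).distinct().collect()

# One additional pass per key: infer a schema over that key's records by
# round-tripping through JSON text, then register one table per key.
for t in event_types:
    subset = events.filter(lambda e, t=t: e["type"] == t)
    schema_rdd = sqlContext.jsonRDD(subset.map(json.dumps))
    schema_rdd.registerTempTable("events_" + t)  # registerAsTable() on 1.0.x
{code}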
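And here is a minimal, self-contained sketch of the workaround the issue describes for the use case above: deserialize the JSON text to Python dicts, filter out the bad elements, then serialize back to JSON text so {{jsonRDD()}} can infer a schema over the whole data set. The sample records and the {{parse()}} helper are hypothetical:

{code}
import json

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="infer-schema-workaround")
sqlContext = SQLContext(sc)

# JSON text with variable structure, plus one bad record to filter out.
raw = sc.parallelize([
    '{"a": 5}',
    '{"a": "cow"}',
    'not json at all',
])

def parse(line):
    """Deserialize one line of JSON, returning None for bad records."""
    try:
        return json.loads(line)
    except ValueError:
        return None

# Filter out the bad elements; this leaves an RDD of dictionaries.
dicts = raw.map(parse).filter(lambda d: d is not None)

# Workaround: serialize the dicts back to JSON text so jsonRDD() infers a
# schema over the whole data set (inferSchema() only looks at the first row).
schema_rdd = sqlContext.jsonRDD(dicts.map(json.dumps))
schema_rdd.printSchema()
{code}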