[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156904#comment-14156904 ]
Nicholas Chammas commented on SPARK-2870:
-----------------------------------------

[~marmbrus] - A related feature that I think would be very important and useful is the ability to infer a complete schema as described here, but to do so by key, i.e. something like {{inferSchemaByKey()}}.

Say I have a large, single RDD of data that includes many different event types. I want to key the RDD by event type and make a single pass over it to get the schema for each event type. This would probably yield something like a {{keyedSchemaRDD}}, which I would then want to register as multiple tables (one table per key/schema) in one go. (A rough sketch of how this could be emulated with the current API is at the end of this message.)

Do you think this would be a useful feature? If so, should I track it in a separate JIRA issue?

> Thorough schema inference directly on RDDs of Python dictionaries
> -----------------------------------------------------------------
>
>                 Key: SPARK-2870
>                 URL: https://issues.apache.org/jira/browse/SPARK-2870
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>            Reporter: Nicholas Chammas
>
> h4. Background
>
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. They process JSON text directly and infer a schema that covers the entire source data set.
>
> This is very important with semi-structured data like JSON, since individual elements in the data set are free to have different structures. Matching fields across elements may even have different value types. For example:
>
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
>
> To get a queryable schema that covers the whole data set, you need to infer a schema by looking at the whole data set. The aforementioned {{SQLContext.json...()}} methods do this very well.
>
> h4. Feature Request
>
> What we need is for {{SQLContext.inferSchema()}} to do this, too. Alternatively, we need a new {{SQLContext}} method that works on RDDs of Python dictionaries and does something functionally equivalent to this:
>
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
>
> As of 1.0.2, [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] just looks at the first element in the data set. This won't help much when the structure of the elements in the target RDD is variable.
>
> h4. Example Use Case
>
> * You have some JSON text data that you want to analyze using Spark SQL.
> * You would use one of the {{SQLContext.json...()}} methods, but you need to do some filtering on the data first to remove bad elements--basically, some minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}}s and filter out the bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole data set.
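For reference, here is a rough sketch of how per-key schema inference could be emulated with the current API: collect the distinct keys, then round-trip each key's records through JSON text so {{jsonRDD()}} can infer a full schema for them. Note that this makes one pass over the data per key rather than the single pass an {{inferSchemaByKey()}} would make, which is exactly the inefficiency the proposed feature would avoid. The sample records, the {{type}} key field, and the table-naming scheme are all made up for illustration:

{code}
import json

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="per-key-schema-sketch")
sqlContext = SQLContext(sc)

# Hypothetical data set mixing several event types; "type" is the key field.
events = sc.parallelize([
    {"type": "click", "x": 10, "y": 20},
    {"type": "click", "x": 3, "y": 4, "button": "left"},
    {"type": "purchase", "amount": 9.99, "item": "book"},
])

# First pass: discover the distinct keys (event types).
event_types = events.map(lambda e: e["type"]).distinct().collect()

# One additional pass per key: infer a schema over that key's records by
# round-tripping through JSON text, then register one table per key.
for t in event_types:
    subset = events.filter(lambda e, t=t: e["type"] == t)
    schema_rdd = sqlContext.jsonRDD(subset.map(json.dumps))
    schema_rdd.registerTempTable("events_" + t)  # registerAsTable() on 1.0.x
{code}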
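And here is a minimal, self-contained sketch of the workaround the issue describes for the use case above: deserialize the JSON text to Python dicts, filter out the bad elements, then serialize back to JSON text so {{jsonRDD()}} can infer a schema over the whole data set. The sample records and the {{parse()}} helper are hypothetical:

{code}
import json

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="infer-schema-workaround")
sqlContext = SQLContext(sc)

# JSON text with variable structure, plus one bad record to filter out.
raw = sc.parallelize([
    '{"a": 5}',
    '{"a": "cow"}',
    'not json at all',
])

def parse(line):
    """Deserialize one line of JSON, returning None for bad records."""
    try:
        return json.loads(line)
    except ValueError:
        return None

# Filter out the bad elements; this leaves an RDD of dictionaries.
dicts = raw.map(parse).filter(lambda d: d is not None)

# Workaround: serialize the dicts back to JSON text so jsonRDD() infers a
# schema over the whole data set (inferSchema() only looks at the first row).
schema_rdd = sqlContext.jsonRDD(dicts.map(json.dumps))
schema_rdd.printSchema()
{code}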