I was just about to ask about this. Currently, there are two methods, sqlContext.jsonFile() and sqlContext.jsonRDD(), that work on JSON text and infer a schema that covers the whole data set.
For example:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

>>> a = sqlContext.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}',
...                                        '{"foo":"boom", "baz":[1,2,3]}']))
>>> a.printSchema()
root
 |-- baz: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- foo: string (nullable = true)

It works really well! It handles fields with inconsistent value types by inferring a value type that covers all the possible values.

But say you've already deserialized the JSON to do some pre-processing or filtering, as you'd commonly want to do to remove bad data, for example. Now you have an RDD of Python dictionaries rather than an RDD of JSON strings. It would be perfect if you could get the completeness of the json...() methods, but against dictionaries.

Unfortunately, as you noted, inferSchema() only looks at the first element in the set. Furthermore, inferring schemata from RDDs of dictionaries is being deprecated <https://issues.apache.org/jira/browse/SPARK-2010> in favor of doing so from RDDs of Rows. I'm not sure what the intention behind this move is, but as a user I'd like to be able to convert RDDs of dictionaries directly to SchemaRDDs with the completeness of the jsonRDD()/jsonFile() methods. Right now, if I really want that, I have to serialize the dictionaries back to JSON text and then call jsonRDD(), which is expensive. (A sketch of that round trip follows the quoted message below.)

Nick

On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:

> Hi All,
>
> I have a data set where each record is serialized using JSON, and I'm
> interested in using SchemaRDDs to work with the data. Unfortunately I've
> hit a snag since some fields in the data are maps and lists, and are not
> guaranteed to be populated for each record. This seems to cause
> inferSchema to throw an error.
>
> Produces an error:
> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[]},
>                                           {'foo':'boom', 'baz':[1,2,3]}]))
>
> Works fine:
> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[1,2,3]},
>                                           {'foo':'boom', 'baz':[]}]))
>
> To be fair, inferSchema says it "peeks at the first row", so a possible
> work-around would be to make sure the type of any collection can be
> determined from the first instance. However, I don't believe that items
> in an RDD are guaranteed to remain ordered, so this approach seems
> somewhat brittle.
>
> Does anybody know a robust solution to this problem in PySpark? I am
> running the 1.0.1 release.
>
> -Brad
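
P.S. For concreteness, here is a minimal sketch of the round trip I described above, assuming an already-deserialized RDD of dictionaries (the records name is just for illustration) and the same sc/sqlContext as in my earlier example:

import json

# Hypothetical RDD of already-deserialized records (Python dicts),
# e.g. the output of some pre-processing or filtering step.
records = sc.parallelize([{'foo': 'bar', 'baz': []},
                          {'foo': 'boom', 'baz': [1, 2, 3]}])

# Re-serialize each dict back into a JSON string so jsonRDD() can infer
# a schema from the whole data set rather than just the first row.
srdd = sqlContext.jsonRDD(records.map(json.dumps))
srdd.printSchema()

It works, but it pays for an extra serialize-then-parse pass over the data, which is exactly the cost I'd like to avoid.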