Hi Nick,

Thanks for the great response.
I actually already investigated jsonRDD and jsonFile, although I did not realize
they provide more complete schema inference. I did, however, have other problems
with jsonRDD and jsonFile, but I will describe those in a separate thread with an
appropriate subject.

I did notice that when I run your example code, I do not receive the exact same
output. For example, I see:

> srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}', '{"foo":"boom", "baz":[1,2,3]}']))
> srdd.printSchema()
root
 |-- baz: ArrayType[IntegerType]
 |-- foo: StringType

Notice the difference in the schema. Are you running the 1.0.1 release, or a more
bleeding-edge version from the repository?

best,
-Brad

On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:

> I was just about to ask about this.
>
> Currently, there are two methods, sqlContext.jsonFile() and
> sqlContext.jsonRDD(), that work on JSON text and infer a schema that
> covers the whole data set.
>
> For example:
>
> from pyspark.sql import SQLContext
> sqlContext = SQLContext(sc)
>
> >>> a = sqlContext.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}', '{"foo":"boom", "baz":[1,2,3]}']))
> >>> a.printSchema()
> root
>  |-- baz: array (nullable = true)
>  |    |-- element: integer (containsNull = false)
>  |-- foo: string (nullable = true)
>
> It works really well! It handles fields with inconsistent value types by
> inferring a value type that covers all the possible values.
>
> But say you've already deserialized the JSON to do some pre-processing or
> filtering. You'd commonly want to do this, say, to remove bad data. So now
> you have an RDD of Python dictionaries, as opposed to an RDD of JSON
> strings. It would be perfect if you could get the completeness of the
> json...() methods, but against dictionaries.
>
> Unfortunately, as you noted, inferSchema() only looks at the first
> element in the set. Furthermore, inferring schemata from RDDs of
> dictionaries is being deprecated
> <https://issues.apache.org/jira/browse/SPARK-2010> in favor of doing so
> from RDDs of Rows.
>
> I'm not sure what the intention behind this move is, but as a user I'd
> like to be able to convert RDDs of dictionaries directly to SchemaRDDs
> with the completeness of the jsonRDD()/jsonFile() methods. Right now, if I
> really want that, I have to serialize the dictionaries to JSON text and
> then call jsonRDD(), which is expensive.
>
> Nick
>
>
> On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller <bmill...@eecs.berkeley.edu>
> wrote:
>
>> Hi All,
>>
>> I have a data set where each record is serialized using JSON, and I'm
>> interested in using SchemaRDDs to work with the data. Unfortunately I've
>> hit a snag since some fields in the data are maps and lists, and are not
>> guaranteed to be populated for each record. This seems to cause
>> inferSchema to throw an error:
>>
>> Produces error:
>> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[]},
>> {'foo':'boom', 'baz':[1,2,3]}]))
>>
>> Works fine:
>> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[1,2,3]},
>> {'foo':'boom', 'baz':[]}]))
>>
>> To be fair, inferSchema says it "peeks at the first row", so a possible
>> work-around would be to make sure the type of any collection can be
>> determined from the first instance. However, I don't believe that items
>> in an RDD are guaranteed to remain in order, so this approach seems
>> somewhat brittle.
>>
>> Does anybody know a robust solution to this problem in PySpark? I'm
>> running the 1.0.1 release.
>>
>> -Brad
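
P.S. In case it helps anyone else following the thread, below is a minimal sketch
of the workaround Nick describes: re-serialize the dictionaries to JSON text and
hand them to jsonRDD(), so that schema inference covers the whole data set instead
of peeking at the first row. The extra json.dumps pass over every record is the
expensive part he mentions; the variable names are just for illustration, and this
assumes the pyspark shell where sc is already defined:

import json
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# An RDD of already-deserialized Python dictionaries, e.g. after filtering out
# bad records.
dict_rdd = sc.parallelize([{'foo': 'bar', 'baz': []},
                           {'foo': 'boom', 'baz': [1, 2, 3]}])

# Serialize each dictionary back to a JSON string so jsonRDD() can infer a schema
# that covers every record, including the ones where 'baz' is empty.
json_rdd = dict_rdd.map(lambda d: json.dumps(d))

srdd = sqlContext.jsonRDD(json_rdd)
srdd.printSchema()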