I've followed up in a thread more directly related to jsonRDD and jsonFile, but it seems that even after building from the current master I'm still having some problems with nested dictionaries.

http://apache-spark-user-list.1001560.n3.nabble.com/trouble-with-jsonRDD-and-jsonFile-in-pyspark-tp11461p11517.html
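(A minimal sketch of the kind of nested-dictionary input at issue; this assumes an existing SparkContext sc, and the "meta"/"tags" field names are made up for illustration:)

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)

    # One record carries a populated nested dictionary, the other an
    # empty one; the question is whether the nested structure is
    # inferred consistently across the whole data set.
    a = sqlContext.jsonRDD(sc.parallelize(
        ['{"foo": "bar", "meta": {"tags": [1, 2]}}',
         '{"foo": "boom", "meta": {}}']))
    a.printSchema()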
On Tue, Aug 5, 2014 at 12:56 PM, Yin Huai <yh...@databricks.com> wrote:

> Yes, 2376 has been fixed in master. Can you give it a try?
>
> Also, for inferSchema, because Python is dynamically typed, I agree with
> Davies that we should provide a way to scan a subset (or all) of the
> dataset to figure out the proper schema. We will take a look at it.
>
> Thanks,
>
> Yin
>
>
> On Tue, Aug 5, 2014 at 12:20 PM, Brad Miller <bmill...@eecs.berkeley.edu>
> wrote:
>
>> Assuming updating to master fixes the bug I was experiencing with
>> jsonRDD and jsonFile, then pushing "sample" to master will probably not
>> be necessary.
>>
>> We believe that the link below was the bug I experienced, and I've been
>> told it is fixed in master.
>>
>> https://issues.apache.org/jira/browse/SPARK-2376
>>
>> best,
>> -brad
>>
>>
>> On Tue, Aug 5, 2014 at 12:18 PM, Davies Liu <dav...@databricks.com>
>> wrote:
>>
>>> The "sample" argument of inferSchema is still not in master; I will
>>> try to add it if it makes sense.
>>>
>>> On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller
>>> <bmill...@eecs.berkeley.edu> wrote:
>>> > Hi Davies,
>>> >
>>> > Thanks for the response and tips. Is the "sample" argument to
>>> > inferSchema available in the 1.0.1 release of pyspark? I'm not sure
>>> > (based on the documentation linked below) that it is.
>>> > http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema
>>> >
>>> > It sounds like updating to master may help address my issue (and may
>>> > also make the "sample" argument available), so I'm going to go ahead
>>> > and do that.
>>> >
>>> > best,
>>> > -Brad
>>> >
>>> >
>>> > On Tue, Aug 5, 2014 at 12:01 PM, Davies Liu <dav...@databricks.com>
>>> > wrote:
>>> >>
>>> >> On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas
>>> >> <nicholas.cham...@gmail.com> wrote:
>>> >> > I was just about to ask about this.
>>> >> >
>>> >> > Currently, there are two methods, sqlContext.jsonFile() and
>>> >> > sqlContext.jsonRDD(), that work on JSON text and infer a schema
>>> >> > that covers the whole data set.
>>> >> >
>>> >> > For example:
>>> >> >
>>> >> > from pyspark.sql import SQLContext
>>> >> > sqlContext = SQLContext(sc)
>>> >> >
>>> >> > >>> a = sqlContext.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}',
>>> >> > >>> '{"foo":"boom", "baz":[1,2,3]}']))
>>> >> > >>> a.printSchema()
>>> >> > root
>>> >> >  |-- baz: array (nullable = true)
>>> >> >  |    |-- element: integer (containsNull = false)
>>> >> >  |-- foo: string (nullable = true)
>>> >> >
>>> >> > It works really well! It handles fields with inconsistent value
>>> >> > types by inferring a value type that covers all the possible
>>> >> > values.
>>> >> >
>>> >> > But say you've already deserialized the JSON to do some
>>> >> > pre-processing or filtering. You'd commonly want to do this, say,
>>> >> > to remove bad data. So now you have an RDD of Python dictionaries,
>>> >> > as opposed to an RDD of JSON strings. It would be perfect if you
>>> >> > could get the completeness of the json...() methods, but against
>>> >> > dictionaries.
>>> >> >
>>> >> > Unfortunately, as you noted, inferSchema() only looks at the first
>>> >> > element in the set. Furthermore, inferring schemata from RDDs of
>>> >> > dictionaries is being deprecated in favor of doing so from RDDs of
>>> >> > Rows.
>>> >> >
>>> >> > I'm not sure what the intention behind this move is, but as a user
>>> >> > I'd like to be able to convert RDDs of dictionaries directly to
>>> >> > SchemaRDDs with the completeness of the jsonRDD()/jsonFile()
>>> >> > methods. Right now if I really want that, I have to serialize the
>>> >> > dictionaries to JSON text and then call jsonRDD(), which is
>>> >> > expensive.
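(A sketch of the round-trip workaround Nick describes; rdd_of_dicts stands in for an existing RDD of Python dictionaries and sqlContext for an existing SQLContext:)

    import json

    # Serialize each dict back to JSON text so that jsonRDD() can infer
    # a schema covering the whole data set; this pays for an extra
    # serialize/parse pass over the data.
    srdd = sqlContext.jsonRDD(rdd_of_dicts.map(json.dumps))
    srdd.printSchema()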
>>> >>
>>> >> Before the upcoming 1.1 release, we did not support nested
>>> >> structures via inferSchema; a nested dictionary is treated as
>>> >> MapType. This introduces an inconsistency for dictionaries: the top
>>> >> level becomes a struct type (can be accessed by field name) but
>>> >> nested ones become MapType (can be accessed as a map).
>>> >>
>>> >> So deprecating the top-level dictionary is an attempt to resolve
>>> >> this kind of inconsistency.
>>> >>
>>> >> The Row class in pyspark.sql has a similar interface to dict, so
>>> >> you can easily convert your dict into a Row:
>>> >>
>>> >> ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
>>> >>
>>> >> In order to get the correct schema, we may need another argument to
>>> >> specify the number of rows to be sampled, such as:
>>> >>
>>> >> inferSchema(rdd, sample=None)
>>> >>
>>> >> With sample=None it will take only the first row; otherwise it will
>>> >> sample the data to figure out the complete schema.
>>> >>
>>> >> Does this work for you?
>>> >>
>>> >> > Nick
>>> >> >
>>> >> >
>>> >> > On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller
>>> >> > <bmill...@eecs.berkeley.edu> wrote:
>>> >> >>
>>> >> >> Hi All,
>>> >> >>
>>> >> >> I have a data set where each record is serialized using JSON, and
>>> >> >> I'm interested in using SchemaRDDs to work with the data.
>>> >> >> Unfortunately I've hit a snag since some fields in the data are
>>> >> >> maps and lists, and are not guaranteed to be populated for each
>>> >> >> record. This seems to cause inferSchema to throw an error:
>>> >> >>
>>> >> >> Produces an error:
>>> >> >>
>>> >> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[]},
>>> >> >> {'foo':'boom', 'baz':[1,2,3]}]))
>>> >> >>
>>> >> >> Works fine:
>>> >> >>
>>> >> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar',
>>> >> >> 'baz':[1,2,3]}, {'foo':'boom', 'baz':[]}]))
>>> >> >>
>>> >> >> To be fair, inferSchema says it "peeks at the first row", so a
>>> >> >> possible work-around would be to make sure the type of any
>>> >> >> collection can be determined from its first instance. However, I
>>> >> >> don't believe that items in an RDD are guaranteed to remain in a
>>> >> >> particular order, so this approach seems somewhat brittle.
>>> >> >>
>>> >> >> Does anybody know a robust solution to this problem in PySpark?
>>> >> >> I'm running the 1.0.1 release.
>>> >> >>
>>> >> >> -Brad
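(Pulling Davies's Row suggestion into a self-contained sketch; the data and app name are illustrative, and Row(**d) assumes every dict has the same keys:)

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="infer-schema-sketch")
    sqlCtx = SQLContext(sc)

    # Convert each dict to a Row so the top level is inferred as a
    # struct rather than a MapType; records with missing keys would
    # need to be normalized before the conversion.
    rdd_of_dicts = sc.parallelize([{'foo': 'bar', 'baz': [1, 2, 3]},
                                   {'foo': 'boom', 'baz': [4]}])
    srdd = sqlCtx.inferSchema(rdd_of_dicts.map(lambda d: Row(**d)))
    srdd.printSchema()

Note that inferSchema still peeks only at the first row, so the empty-list ordering problem Brad describes remains until something like the proposed "sample" argument is available.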