Assuming that updating to master fixes the bug I was experiencing with jsonRDD and jsonFile, pushing "sample" to master will probably not be necessary.
We believe that the link below was the bug I experienced, and I've been told it is fixed in master.
https://issues.apache.org/jira/browse/SPARK-2376

best,
-brad

On Tue, Aug 5, 2014 at 12:18 PM, Davies Liu <dav...@databricks.com> wrote:
> This "sample" argument of inferSchema is still not in master; I will
> try to add it if it makes sense.
>
> On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller <bmill...@eecs.berkeley.edu>
> wrote:
> > Hi Davies,
> >
> > Thanks for the response and tips. Is the "sample" argument to inferSchema
> > available in the 1.0.1 release of pyspark? I'm not sure (based on the
> > documentation linked below) that it is.
> > http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema
> >
> > It sounds like updating to master may help address my issue (and may also
> > make the "sample" argument available), so I'm going to go ahead and do
> > that.
> >
> > best,
> > -Brad
> >
> >
> > On Tue, Aug 5, 2014 at 12:01 PM, Davies Liu <dav...@databricks.com>
> > wrote:
> >>
> >> On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas
> >> <nicholas.cham...@gmail.com> wrote:
> >> > I was just about to ask about this.
> >> >
> >> > Currently, there are two methods, sqlContext.jsonFile() and
> >> > sqlContext.jsonRDD(), which work on JSON text and infer a schema that
> >> > covers the whole data set.
> >> >
> >> > For example:
> >> >
> >> > from pyspark.sql import SQLContext
> >> > sqlContext = SQLContext(sc)
> >> >
> >> >>>> a = sqlContext.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}',
> >> >>>> '{"foo":"boom", "baz":[1,2,3]}']))
> >> >>>> a.printSchema()
> >> > root
> >> >  |-- baz: array (nullable = true)
> >> >  |    |-- element: integer (containsNull = false)
> >> >  |-- foo: string (nullable = true)
> >> >
> >> > It works really well! It handles fields with inconsistent value types
> >> > by inferring a value type that covers all the possible values.
> >> >
> >> > But say you’ve already deserialized the JSON to do some pre-processing
> >> > or filtering. You’d commonly want to do this, say, to remove bad data.
> >> > So now you have an RDD of Python dictionaries, as opposed to an RDD of
> >> > JSON strings. It would be perfect if you could get the completeness of
> >> > the json...() methods, but against dictionaries.
> >> >
> >> > Unfortunately, as you noted, inferSchema() only looks at the first
> >> > element in the set. Furthermore, inferring schemata from RDDs of
> >> > dictionaries is being deprecated in favor of doing so from RDDs of
> >> > Rows.
> >> >
> >> > I’m not sure what the intention behind this move is, but as a user I’d
> >> > like to be able to convert RDDs of dictionaries directly to SchemaRDDs
> >> > with the completeness of the jsonRDD()/jsonFile() methods. Right now if
> >> > I really want that, I have to serialize the dictionaries to JSON text
> >> > and then call jsonRDD(), which is expensive.
> >>
> >> Before the upcoming 1.1 release, we did not support nested structures
> >> via inferSchema; a nested dictionary becomes a MapType. This introduces
> >> an inconsistency for dictionaries: the top level becomes a struct type
> >> (fields can be accessed by name), but nested levels become MapType
> >> (accessed as a map).
> >>
> >> Deprecating top-level dictionaries is meant to resolve this
> >> inconsistency.
> >>
> >> The Row class in pyspark.sql has a similar interface to dict, so you
> >> can easily convert your dict into a Row:
> >>
> >> ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
> >>
> >> In order to get the correct schema, we may need another argument to
> >> specify the number of rows to be inferred, such as:
> >>
> >> inferSchema(rdd, sample=None)
> >>
> >> With sample=None it will take the first row; otherwise it will sample
> >> the RDD to figure out the complete schema.
> >>
> >> Does this work for you?
> >>
> >> > Nick
> >> >
> >> >
> >> > On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller
> >> > <bmill...@eecs.berkeley.edu> wrote:
> >> >>
> >> >> Hi All,
> >> >>
> >> >> I have a data set where each record is serialized using JSON, and I'm
> >> >> interested in using SchemaRDDs to work with the data. Unfortunately
> >> >> I've hit a snag since some fields in the data are maps and lists, and
> >> >> are not guaranteed to be populated for each record. This seems to
> >> >> cause inferSchema to throw an error:
> >> >>
> >> >> Produces error:
> >> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[]},
> >> >> {'foo':'boom', 'baz':[1,2,3]}]))
> >> >>
> >> >> Works fine:
> >> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[1,2,3]},
> >> >> {'foo':'boom', 'baz':[]}]))
> >> >>
> >> >> To be fair, inferSchema says it "peeks at the first row", so a
> >> >> possible work-around would be to make sure the type of any collection
> >> >> can be determined using the first instance. However, I don't believe
> >> >> that items in an RDD are guaranteed to remain in order, so this
> >> >> approach seems somewhat brittle.
> >> >>
> >> >> Does anybody know a robust solution to this problem in PySpark? I am
> >> >> running the 1.0.1 release.
> >> >>
> >> >> -Brad
> >> >>
> >> >
> >
> >
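For anyone following along, here is a minimal sketch of the two workarounds
discussed in this thread, assuming a running SparkContext named sc; the
variable names and sample data are made up for illustration. The jsonRDD()
route is the one Nick describes (re-serializing the dictionaries), and the
Row route is Davies's suggestion; the latter assumes a build where
pyspark.sql.Row accepts keyword arguments (master / the upcoming 1.1 at the
time of this thread).

import json

from pyspark.sql import SQLContext, Row

sqlCtx = SQLContext(sc)

# An RDD of already-deserialized dictionaries, e.g. after filtering bad records.
dicts = sc.parallelize([{'foo': 'bar', 'baz': []},
                        {'foo': 'boom', 'baz': [1, 2, 3]}])

# Workaround 1: re-serialize to JSON text and let jsonRDD() infer a schema
# covering the whole data set. Costs an extra round of serialization, and on
# 1.0.1 it may still be affected by SPARK-2376.
srdd_json = sqlCtx.jsonRDD(dicts.map(json.dumps))
srdd_json.printSchema()

# Workaround 2 (assumes Row(**kwargs) support, i.e. master / 1.1): convert
# each dict to a Row and call inferSchema(). Because inferSchema() only peeks
# at the first row, the first record must carry representative types for
# every field; the records here are ordered so the populated 'baz' comes
# first. This is exactly the brittleness discussed above that the proposed
# "sample" argument would address.
ordered = sc.parallelize([{'foo': 'boom', 'baz': [1, 2, 3]},
                          {'foo': 'bar', 'baz': []}])
srdd_rows = sqlCtx.inferSchema(ordered.map(lambda d: Row(**d)))
srdd_rows.printSchema()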