Assuming that updating to master fixes the bug I was experiencing with jsonRDD and jsonFile, pushing "sample" to master will probably not be necessary.
We believe that the link below was the bug I experienced, and I've been told it is fixed in master.
https://issues.apache.org/jira/browse/SPARK-2376

best,
-brad

On Tue, Aug 5, 2014 at 12:18 PM, Davies Liu <dav...@databricks.com> wrote:
> This "sample" argument of inferSchema is still not in master; I will
> try to add it if it makes sense.
>
> On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller <bmill...@eecs.berkeley.edu>
> wrote:
> > Hi Davies,
> >
> > Thanks for the response and tips. Is the "sample" argument to inferSchema
> > available in the 1.0.1 release of pyspark? I'm not sure (based on the
> > documentation linked below) that it is.
> > http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema
> >
> > It sounds like updating to master may help address my issue (and may also
> > make the "sample" argument available), so I'm going to go ahead and do
> > that.
> >
> > best,
> > -Brad
> >
> >
> > On Tue, Aug 5, 2014 at 12:01 PM, Davies Liu <dav...@databricks.com>
> > wrote:
> >>
> >> On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas
> >> <nicholas.cham...@gmail.com> wrote:
> >> > I was just about to ask about this.
> >> >
> >> > Currently, there are two methods, sqlContext.jsonFile() and
> >> > sqlContext.jsonRDD(), which work on JSON text and infer a schema that
> >> > covers the whole data set.
> >> >
> >> > For example:
> >> >
> >> > from pyspark.sql import SQLContext
> >> > sqlContext = SQLContext(sc)
> >> >
> >> >>>> a = sqlContext.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}',
> >> >>>> '{"foo":"boom", "baz":[1,2,3]}']))
> >> >>>> a.printSchema()
> >> > root
> >> >  |-- baz: array (nullable = true)
> >> >  |    |-- element: integer (containsNull = false)
> >> >  |-- foo: string (nullable = true)
> >> >
> >> > It works really well! It handles fields with inconsistent value types
> >> > by inferring a value type that covers all the possible values.
> >> >
> >> > But say you’ve already deserialized the JSON to do some pre-processing
> >> > or filtering. You’d commonly want to do this, say, to remove bad data.
> >> > So now you have an RDD of Python dictionaries, as opposed to an RDD of
> >> > JSON strings. It would be perfect if you could get the completeness of
> >> > the json...() methods, but against dictionaries.
> >> >
> >> > Unfortunately, as you noted, inferSchema() only looks at the first
> >> > element in the set. Furthermore, inferring schemata from RDDs of
> >> > dictionaries is being deprecated in favor of doing so from RDDs of
> >> > Rows.
> >> >
> >> > I’m not sure what the intention behind this move is, but as a user I’d
> >> > like to be able to convert RDDs of dictionaries directly to SchemaRDDs
> >> > with the completeness of the jsonRDD()/jsonFile() methods. Right now if
> >> > I really want that, I have to serialize the dictionaries to JSON text
> >> > and then call jsonRDD(), which is expensive.
> >>
> >> Before the upcoming 1.1 release, we did not support nested structures
> >> via inferSchema; a nested dictionary becomes a MapType. This introduces
> >> an inconsistency for dictionaries: the top level becomes a struct type
> >> (fields can be accessed by name), but nested levels become MapType
> >> (accessed as a map).
> >>
> >> Deprecating top-level dictionaries is meant to resolve this
> >> inconsistency.
> >>
> >> The Row class in pyspark.sql has a similar interface to dict, so you
> >> can easily convert your dict into a Row:
> >>
> >> ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
> >>
> >> In order to get the correct schema, we may need another argument to
> >> specify the number of rows to be inferred, such as:
> >>
> >> inferSchema(rdd, sample=None)
> >>
> >> With sample=None it will take the first row; otherwise it will sample
> >> the RDD to figure out the complete schema.
> >>
> >> Does this work for you?
> >>
> >> > Nick
> >> >
> >> >
> >> > On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller
> >> > <bmill...@eecs.berkeley.edu> wrote:
> >> >>
> >> >> Hi All,
> >> >>
> >> >> I have a data set where each record is serialized using JSON, and I'm
> >> >> interested in using SchemaRDDs to work with the data. Unfortunately
> >> >> I've hit a snag since some fields in the data are maps and lists, and
> >> >> are not guaranteed to be populated for each record. This seems to
> >> >> cause inferSchema to throw an error:
> >> >>
> >> >> Produces error:
> >> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[]},
> >> >> {'foo':'boom', 'baz':[1,2,3]}]))
> >> >>
> >> >> Works fine:
> >> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[1,2,3]},
> >> >> {'foo':'boom', 'baz':[]}]))
> >> >>
> >> >> To be fair, inferSchema says it "peeks at the first row", so a
> >> >> possible work-around would be to make sure the type of any collection
> >> >> can be determined using the first instance. However, I don't believe
> >> >> that items in an RDD are guaranteed to remain in order, so this
> >> >> approach seems somewhat brittle.
> >> >>
> >> >> Does anybody know a robust solution to this problem in PySpark? I am
> >> >> running the 1.0.1 release.
> >> >>
> >> >> -Brad
> >> >>
> >> >
> >
> >
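For anyone following along, here is a minimal sketch of the two workarounds
discussed in this thread, assuming a running SparkContext named sc; the
variable names and sample data are made up for illustration. The jsonRDD()
route is the one Nick describes (re-serializing the dictionaries), and the
Row route is Davies's suggestion; the latter assumes a build where
pyspark.sql.Row accepts keyword arguments (master / the upcoming 1.1 at the
time of this thread).

import json

from pyspark.sql import SQLContext, Row

sqlCtx = SQLContext(sc)

# An RDD of already-deserialized dictionaries, e.g. after filtering bad records.
dicts = sc.parallelize([{'foo': 'bar', 'baz': []},
                        {'foo': 'boom', 'baz': [1, 2, 3]}])

# Workaround 1: re-serialize to JSON text and let jsonRDD() infer a schema
# covering the whole data set. Costs an extra round of serialization, and on
# 1.0.1 it may still be affected by SPARK-2376.
srdd_json = sqlCtx.jsonRDD(dicts.map(json.dumps))
srdd_json.printSchema()

# Workaround 2 (assumes Row(**kwargs) support, i.e. master / 1.1): convert
# each dict to a Row and call inferSchema(). Because inferSchema() only peeks
# at the first row, the first record must carry representative types for
# every field; the records here are ordered so the populated 'baz' comes
# first. This is exactly the brittleness discussed above that the proposed
# "sample" argument would address.
ordered = sc.parallelize([{'foo': 'boom', 'baz': [1, 2, 3]},
                          {'foo': 'bar', 'baz': []}])
srdd_rows = sqlCtx.inferSchema(ordered.map(lambda d: Row(**d)))
srdd_rows.printSchema()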