I've followed up in a thread more directly related to jsonRDD and jsonFile, but it seems that even after building from the current master I'm still having some problems with nested dictionaries.

http://apache-spark-user-list.1001560.n3.nabble.com/trouble-with-jsonRDD-and-jsonFile-in-pyspark-tp11461p11517.html
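(A minimal sketch of the kind of nested-dictionary input at issue; this assumes an existing SparkContext sc, and the "meta"/"tags" field names are made up for illustration:)

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)

    # One record carries a populated nested dictionary, the other an
    # empty one; the question is whether the nested structure is
    # inferred consistently across the whole data set.
    a = sqlContext.jsonRDD(sc.parallelize(
        ['{"foo": "bar", "meta": {"tags": [1, 2]}}',
         '{"foo": "boom", "meta": {}}']))
    a.printSchema()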
On Tue, Aug 5, 2014 at 12:56 PM, Yin Huai <yh...@databricks.com> wrote:

> Yes, 2376 has been fixed in master. Can you give it a try?
>
> Also, for inferSchema, because Python is dynamically typed, I agree with
> Davies that we should provide a way to scan a subset (or all) of the
> dataset to figure out the proper schema. We will take a look at it.
>
> Thanks,
>
> Yin
>
>
> On Tue, Aug 5, 2014 at 12:20 PM, Brad Miller <bmill...@eecs.berkeley.edu>
> wrote:
>
>> Assuming updating to master fixes the bug I was experiencing with
>> jsonRDD and jsonFile, then pushing "sample" to master will probably not
>> be necessary.
>>
>> We believe that the link below was the bug I experienced, and I've been
>> told it is fixed in master.
>>
>> https://issues.apache.org/jira/browse/SPARK-2376
>>
>> best,
>> -brad
>>
>>
>> On Tue, Aug 5, 2014 at 12:18 PM, Davies Liu <dav...@databricks.com>
>> wrote:
>>
>>> The "sample" argument of inferSchema is still not in master; I will
>>> try to add it if it makes sense.
>>>
>>> On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller
>>> <bmill...@eecs.berkeley.edu> wrote:
>>> > Hi Davies,
>>> >
>>> > Thanks for the response and tips. Is the "sample" argument to
>>> > inferSchema available in the 1.0.1 release of pyspark? I'm not sure
>>> > (based on the documentation linked below) that it is.
>>> > http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema
>>> >
>>> > It sounds like updating to master may help address my issue (and may
>>> > also make the "sample" argument available), so I'm going to go ahead
>>> > and do that.
>>> >
>>> > best,
>>> > -Brad
>>> >
>>> >
>>> > On Tue, Aug 5, 2014 at 12:01 PM, Davies Liu <dav...@databricks.com>
>>> > wrote:
>>> >>
>>> >> On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas
>>> >> <nicholas.cham...@gmail.com> wrote:
>>> >> > I was just about to ask about this.
>>> >> >
>>> >> > Currently, there are two methods, sqlContext.jsonFile() and
>>> >> > sqlContext.jsonRDD(), that work on JSON text and infer a schema
>>> >> > that covers the whole data set.
>>> >> >
>>> >> > For example:
>>> >> >
>>> >> > from pyspark.sql import SQLContext
>>> >> > sqlContext = SQLContext(sc)
>>> >> >
>>> >> > >>> a = sqlContext.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}',
>>> >> > >>> '{"foo":"boom", "baz":[1,2,3]}']))
>>> >> > >>> a.printSchema()
>>> >> > root
>>> >> >  |-- baz: array (nullable = true)
>>> >> >  |    |-- element: integer (containsNull = false)
>>> >> >  |-- foo: string (nullable = true)
>>> >> >
>>> >> > It works really well! It handles fields with inconsistent value
>>> >> > types by inferring a value type that covers all the possible
>>> >> > values.
>>> >> >
>>> >> > But say you've already deserialized the JSON to do some
>>> >> > pre-processing or filtering. You'd commonly want to do this, say,
>>> >> > to remove bad data. So now you have an RDD of Python dictionaries,
>>> >> > as opposed to an RDD of JSON strings. It would be perfect if you
>>> >> > could get the completeness of the json...() methods, but against
>>> >> > dictionaries.
>>> >> >
>>> >> > Unfortunately, as you noted, inferSchema() only looks at the first
>>> >> > element in the set. Furthermore, inferring schemata from RDDs of
>>> >> > dictionaries is being deprecated in favor of doing so from RDDs of
>>> >> > Rows.
>>> >> >
>>> >> > I'm not sure what the intention behind this move is, but as a user
>>> >> > I'd like to be able to convert RDDs of dictionaries directly to
>>> >> > SchemaRDDs with the completeness of the jsonRDD()/jsonFile()
>>> >> > methods. Right now if I really want that, I have to serialize the
>>> >> > dictionaries to JSON text and then call jsonRDD(), which is
>>> >> > expensive.
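(A sketch of the round-trip workaround Nick describes; rdd_of_dicts stands in for an existing RDD of Python dictionaries and sqlContext for an existing SQLContext:)

    import json

    # Serialize each dict back to JSON text so that jsonRDD() can infer
    # a schema covering the whole data set; this pays for an extra
    # serialize/parse pass over the data.
    srdd = sqlContext.jsonRDD(rdd_of_dicts.map(json.dumps))
    srdd.printSchema()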
>>> >>
>>> >> Before the upcoming 1.1 release, we did not support nested
>>> >> structures via inferSchema; a nested dictionary is treated as
>>> >> MapType. This introduces an inconsistency for dictionaries: the top
>>> >> level becomes a struct type (can be accessed by field name) but
>>> >> nested ones become MapType (can be accessed as a map).
>>> >>
>>> >> So deprecating the top-level dictionary is an attempt to resolve
>>> >> this kind of inconsistency.
>>> >>
>>> >> The Row class in pyspark.sql has a similar interface to dict, so
>>> >> you can easily convert your dict into a Row:
>>> >>
>>> >> ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
>>> >>
>>> >> In order to get the correct schema, we may need another argument to
>>> >> specify the number of rows to be sampled, such as:
>>> >>
>>> >> inferSchema(rdd, sample=None)
>>> >>
>>> >> With sample=None it will take only the first row; otherwise it will
>>> >> sample the data to figure out the complete schema.
>>> >>
>>> >> Does this work for you?
>>> >>
>>> >> > Nick
>>> >> >
>>> >> >
>>> >> > On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller
>>> >> > <bmill...@eecs.berkeley.edu> wrote:
>>> >> >>
>>> >> >> Hi All,
>>> >> >>
>>> >> >> I have a data set where each record is serialized using JSON, and
>>> >> >> I'm interested in using SchemaRDDs to work with the data.
>>> >> >> Unfortunately I've hit a snag since some fields in the data are
>>> >> >> maps and lists, and are not guaranteed to be populated for each
>>> >> >> record. This seems to cause inferSchema to throw an error:
>>> >> >>
>>> >> >> Produces an error:
>>> >> >>
>>> >> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[]},
>>> >> >> {'foo':'boom', 'baz':[1,2,3]}]))
>>> >> >>
>>> >> >> Works fine:
>>> >> >>
>>> >> >> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar',
>>> >> >> 'baz':[1,2,3]}, {'foo':'boom', 'baz':[]}]))
>>> >> >>
>>> >> >> To be fair, inferSchema says it "peeks at the first row", so a
>>> >> >> possible work-around would be to make sure the type of any
>>> >> >> collection can be determined from its first instance. However, I
>>> >> >> don't believe that items in an RDD are guaranteed to remain in a
>>> >> >> particular order, so this approach seems somewhat brittle.
>>> >> >>
>>> >> >> Does anybody know a robust solution to this problem in PySpark?
>>> >> >> I'm running the 1.0.1 release.
>>> >> >>
>>> >> >> -Brad
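(Pulling Davies's Row suggestion into a self-contained sketch; the data and app name are illustrative, and Row(**d) assumes every dict has the same keys:)

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="infer-schema-sketch")
    sqlCtx = SQLContext(sc)

    # Convert each dict to a Row so the top level is inferred as a
    # struct rather than a MapType; records with missing keys would
    # need to be normalized before the conversion.
    rdd_of_dicts = sc.parallelize([{'foo': 'bar', 'baz': [1, 2, 3]},
                                   {'foo': 'boom', 'baz': [4]}])
    srdd = sqlCtx.inferSchema(rdd_of_dicts.map(lambda d: Row(**d)))
    srdd.printSchema()

Note that inferSchema still peeks only at the first row, so the empty-list ordering problem Brad describes remains until something like the proposed "sample" argument is available.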