I was just about to ask about this. Currently, there are two methods, sqlContext.jsonFile() and sqlContext.jsonRDD(), that work on JSON text and infer a schema that covers the whole data set.
For example:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

>>> a = sqlContext.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}',
...                                        '{"foo":"boom", "baz":[1,2,3]}']))
>>> a.printSchema()
root
 |-- baz: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- foo: string (nullable = true)

It works really well! It handles fields with inconsistent value types by inferring a value type that covers all the possible values.

But say you've already deserialized the JSON to do some pre-processing or filtering, as you'd commonly want to do to remove bad data, for example. Now you have an RDD of Python dictionaries rather than an RDD of JSON strings. It would be perfect if you could get the completeness of the json...() methods, but against dictionaries.

Unfortunately, as you noted, inferSchema() only looks at the first element in the set. Furthermore, inferring schemata from RDDs of dictionaries is being deprecated <https://issues.apache.org/jira/browse/SPARK-2010> in favor of doing so from RDDs of Rows. I'm not sure what the intention behind this move is, but as a user I'd like to be able to convert RDDs of dictionaries directly to SchemaRDDs with the completeness of the jsonRDD()/jsonFile() methods. Right now, if I really want that, I have to serialize the dictionaries back to JSON text and then call jsonRDD(), which is expensive. (A sketch of that round trip follows the quoted message below.)

Nick

On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:

> Hi All,
>
> I have a data set where each record is serialized using JSON, and I'm
> interested in using SchemaRDDs to work with the data. Unfortunately I've
> hit a snag since some fields in the data are maps and lists, and are not
> guaranteed to be populated for each record. This seems to cause
> inferSchema to throw an error.
>
> Produces an error:
> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[]},
>                                           {'foo':'boom', 'baz':[1,2,3]}]))
>
> Works fine:
> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[1,2,3]},
>                                           {'foo':'boom', 'baz':[]}]))
>
> To be fair, inferSchema says it "peeks at the first row", so a possible
> work-around would be to make sure the type of any collection can be
> determined from the first instance. However, I don't believe that items
> in an RDD are guaranteed to remain ordered, so this approach seems
> somewhat brittle.
>
> Does anybody know a robust solution to this problem in PySpark? I am
> running the 1.0.1 release.
>
> -Brad
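
P.S. For concreteness, here is a minimal sketch of the round trip I described above, assuming an already-deserialized RDD of dictionaries (the records name is just for illustration) and the same sc/sqlContext as in my earlier example:

import json

# Hypothetical RDD of already-deserialized records (Python dicts),
# e.g. the output of some pre-processing or filtering step.
records = sc.parallelize([{'foo': 'bar', 'baz': []},
                          {'foo': 'boom', 'baz': [1, 2, 3]}])

# Re-serialize each dict back into a JSON string so jsonRDD() can infer
# a schema from the whole data set rather than just the first row.
srdd = sqlContext.jsonRDD(records.map(json.dumps))
srdd.printSchema()

It works, but it pays for an extra serialize-then-parse pass over the data, which is exactly the cost I'd like to avoid.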