Hi Nick,

Thanks for the great response.
I actually already investigated jsonRDD and jsonFile, although I did not realize
they provide more complete schema inference. I did, however, have other problems
with jsonRDD and jsonFile, but I will describe those in a separate thread with an
appropriate subject.

I did notice that when I run your example code, I do not receive the exact same
output. For example, I see:

> srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}', '{"foo":"boom", "baz":[1,2,3]}']))
> srdd.printSchema()
root
 |-- baz: ArrayType[IntegerType]
 |-- foo: StringType

Notice the difference in the schema. Are you running the 1.0.1 release, or a more
bleeding-edge version from the repository?

best,
-Brad

On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:

> I was just about to ask about this.
>
> Currently, there are two methods, sqlContext.jsonFile() and
> sqlContext.jsonRDD(), that work on JSON text and infer a schema that
> covers the whole data set.
>
> For example:
>
> from pyspark.sql import SQLContext
> sqlContext = SQLContext(sc)
>
> >>> a = sqlContext.jsonRDD(sc.parallelize(['{"foo":"bar", "baz":[]}', '{"foo":"boom", "baz":[1,2,3]}']))
> >>> a.printSchema()
> root
>  |-- baz: array (nullable = true)
>  |    |-- element: integer (containsNull = false)
>  |-- foo: string (nullable = true)
>
> It works really well! It handles fields with inconsistent value types by
> inferring a value type that covers all the possible values.
>
> But say you've already deserialized the JSON to do some pre-processing or
> filtering. You'd commonly want to do this, say, to remove bad data. So now
> you have an RDD of Python dictionaries, as opposed to an RDD of JSON
> strings. It would be perfect if you could get the completeness of the
> json...() methods, but against dictionaries.
>
> Unfortunately, as you noted, inferSchema() only looks at the first
> element in the set. Furthermore, inferring schemata from RDDs of
> dictionaries is being deprecated
> <https://issues.apache.org/jira/browse/SPARK-2010> in favor of doing so
> from RDDs of Rows.
>
> I'm not sure what the intention behind this move is, but as a user I'd
> like to be able to convert RDDs of dictionaries directly to SchemaRDDs
> with the completeness of the jsonRDD()/jsonFile() methods. Right now, if I
> really want that, I have to serialize the dictionaries to JSON text and
> then call jsonRDD(), which is expensive.
>
> Nick
>
>
> On Tue, Aug 5, 2014 at 1:31 PM, Brad Miller <bmill...@eecs.berkeley.edu>
> wrote:
>
>> Hi All,
>>
>> I have a data set where each record is serialized using JSON, and I'm
>> interested in using SchemaRDDs to work with the data. Unfortunately I've
>> hit a snag since some fields in the data are maps and lists, and are not
>> guaranteed to be populated for each record. This seems to cause
>> inferSchema to throw an error:
>>
>> Produces error:
>> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[]},
>> {'foo':'boom', 'baz':[1,2,3]}]))
>>
>> Works fine:
>> srdd = sqlCtx.inferSchema(sc.parallelize([{'foo':'bar', 'baz':[1,2,3]},
>> {'foo':'boom', 'baz':[]}]))
>>
>> To be fair, inferSchema says it "peeks at the first row", so a possible
>> work-around would be to make sure the type of any collection can be
>> determined from the first instance. However, I don't believe that items
>> in an RDD are guaranteed to remain in order, so this approach seems
>> somewhat brittle.
>>
>> Does anybody know a robust solution to this problem in PySpark? I'm
>> running the 1.0.1 release.
>>
>> -Brad
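
P.S. In case it helps anyone else following the thread, below is a minimal sketch
of the workaround Nick describes: re-serialize the dictionaries to JSON text and
hand them to jsonRDD(), so that schema inference covers the whole data set instead
of peeking at the first row. The extra json.dumps pass over every record is the
expensive part he mentions; the variable names are just for illustration, and this
assumes the pyspark shell where sc is already defined:

import json
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# An RDD of already-deserialized Python dictionaries, e.g. after filtering out
# bad records.
dict_rdd = sc.parallelize([{'foo': 'bar', 'baz': []},
                           {'foo': 'boom', 'baz': [1, 2, 3]}])

# Serialize each dictionary back to a JSON string so jsonRDD() can infer a schema
# that covers every record, including the ones where 'baz' is empty.
json_rdd = dict_rdd.map(lambda d: json.dumps(d))

srdd = sqlContext.jsonRDD(json_rdd)
srdd.printSchema()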