Forking from this thread
<http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-inferSchema-tc11449.html>.


On Tue, Aug 5, 2014 at 3:01 PM, Davies Liu <dav...@databricks.com> wrote:

> Before the upcoming 1.1 release, we did not support nested structures
> via inferSchema; a nested dictionary will be a MapType. This introduces
> an inconsistency for dictionaries: the top level will be a struct type
> (can be accessed by field name), but nested levels will be MapType
> (can only be accessed as a map).
>
When you mention field access here, do you mean via SQL? Could you provide
a brief code example to illustrate your point?
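If it helps, here is roughly what I imagine you mean, written against the
current (pre-1.1) inferSchema behavior; the data and field names below are
just made up for illustration:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # sc is the usual SparkContext from the shell
    rdd = sc.parallelize([{"name": "Alice", "address": {"city": "SF"}}])

    srdd = sqlContext.inferSchema(rdd)
    srdd.registerAsTable("people")

    # The top level is a struct, so fields are addressed by name:
    sqlContext.sql("SELECT name FROM people").collect()

    # But the nested dict comes back as MapType, so (if I understand you)
    # it has to be pulled out with map-style access rather than address.city:
    sqlContext.sql("SELECT address['city'] FROM people").collect()

Is that the inconsistency you are describing, or something else?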

> The Row class in pyspark.sql has a similar interface to dict, so you
> can easily convert your dict into a Row:
>
> ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
>
I just tried that out and it seems to work well.
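For reference, a sketch of what I tried (toy data, run in the PySpark shell
where sc is already defined):

    from pyspark.sql import SQLContext, Row

    sqlContext = SQLContext(sc)
    rdd_of_dict = sc.parallelize([
        {"name": "Alice", "age": 1},
        {"name": "Bob", "age": 2},
    ])

    # Turn each dict into a Row so inferSchema sees named fields
    # instead of plain dictionaries.
    srdd = sqlContext.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
    srdd.registerAsTable("people")
    sqlContext.sql("SELECT name, age FROM people").collect()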

> In order to get the correct schema, do we need another argument to
> specify the number of rows to be inferred? Such as:
> ...
>
> Does this work for you?
>
Maybe; I’m not sure just yet. Basically, I’m looking for something
functionally equivalent to this:

sqlContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))

In other words, given an RDD of JSON-serializable Python dictionaries, I
want to be able to infer a schema that is guaranteed to cover the entire
data set. With semi-structured data, it is rarely useful to infer schema by
inspecting just one element.
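To make that concrete, here is a toy case (hypothetical records) where
inferring from only the first element would miss a field entirely:

    import json

    rdd = sc.parallelize([
        {"name": "Alice"},                           # no "address" key here
        {"name": "Bob", "address": {"city": "SF"}},  # nested field shows up later
    ])

    # jsonRDD looks at the data as a whole, so the inferred schema
    # still ends up with an "address" field:
    srdd = sqlContext.jsonRDD(rdd.map(lambda d: json.dumps(d)))
    srdd.printSchema()

    # A schema inferred from just the first record would only contain
    # "name", and Bob's "address" field would be lost or cause an error.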

Does that sound like something we want to support?

Nick
