Forking from this thread <http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-inferSchema-tc11449.html>.
On Tue, Aug 5, 2014 at 3:01 PM, Davies Liu <dav...@databricks.com> wrote:

> Before the upcoming 1.1 release, we did not support nested structures via
> inferSchema; a nested dictionary will be MapType. This introduces an
> inconsistency for dictionaries: the top level will be a struct type (it can
> be accessed by field name), but the others will be MapType (accessed as a
> map).

When you mention field access here, do you mean via SQL? Could you provide a brief code example to illustrate your point?

> The Row class in pyspark.sql has a similar interface to dict, so you can
> easily convert your dict into a Row:
>
> ctx.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))

I just tried that out and it seems to work well.

> In order to get the correct schema, do we need another argument to specify
> the number of rows to be inferred? Such as:
> ...
> Does this work for you?

Maybe; I'm not sure just yet. Basically, I'm looking for something functionally equivalent to this:

sqlContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))

In other words, given an RDD of JSON-serializable Python dictionaries, I want to be able to infer a schema that is guaranteed to cover the entire data set. With semi-structured data, it is rarely useful to infer schema by inspecting just one element.

Does that sound like something we want to support?

Nick
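P.S. For concreteness, here is roughly what I ran when I tried the Row conversion above. This is just a sketch: it assumes a live SparkContext `sc`, and the data and the `rdd_of_dict` name are made up for illustration.

    from pyspark.sql import SQLContext, Row

    sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

    # Toy stand-in for my real data: a small RDD of flat dictionaries.
    rdd_of_dict = sc.parallelize([
        {"name": "Alice", "age": 1},
        {"name": "Bob", "age": 2},
    ])

    # Davies' suggestion: turn each dict into a Row so the fields get names.
    people = sqlContext.inferSchema(rdd_of_dict.map(lambda d: Row(**d)))
    people.registerTempTable("people")  # registerAsTable on 1.0.x, I believe
    print sqlContext.sql("SELECT name FROM people WHERE age = 1").collect()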
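P.P.S. And here is a sketch of the jsonRDD round-trip I mean by "functionally equivalent", under the same assumptions (made-up data; printSchema() available on your build). The point is that inspecting only the first record would miss the `age` field on the second one:

    import json

    # Records whose keys differ across the data set.
    records = sc.parallelize([
        {"user": {"name": "Alice"}},
        {"user": {"name": "Bob", "age": 2}},
    ])

    # Round-trip through JSON strings so jsonRDD scans the whole data set
    # (nested dicts included) when inferring the schema.
    table = sqlContext.jsonRDD(records.map(lambda x: json.dumps(x)))
    table.printSchema()  # the union of fields across all records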