I'm working with RDD[Map[String,Any]] objects all over my codebase. These
objects were all originally parsed from JSON. The processing I do on RDDs
consists of parsing json -> grouping/transforming dataset into a feasible
report -> outputting data to a file.

I've been wanting to infer the schemas of the output Map[String,Any]
objects so that I can store the schemas off and use them when the report
files are loaded using sqlContext.jsonRDD(). The reason I'm trying to infer
the schema myself is that i want to avoid the expensive map/reduce phase
that occurs when a sqlContextRDD.jsonRDD() is called and schema is not
supplied. I figure since we are creating the json file to begin with, we
may as well be able to infer the schema ourselves.

So what I'd like to do is something like this for my processing phase:

val inputJson rdd = ...

inputJson.transform().transform()....transform().map(inferSchema()).outputToFile()

I'm trying to figure out the best way to do the inferSchema() so that I
don't need to do an extra reduce phase if possible. I'd be okay having the
driver (or something else) do the ultimate merging of the schemas.

Would a broadcasted value be useful here?  Or possibly an accumulator where
the accumuloation function merges two schemas together? Any help would be
greatly appreciated.

Reply via email to