I'm working with RDD[Map[String,Any]] objects all over my codebase. These objects were all originally parsed from JSON. The processing I do on the RDDs consists of parsing JSON -> grouping/transforming the dataset into a report -> writing the data to a file.
I've been wanting to infer the schemas of the output Map[String,Any] objects so that I can store the schemas and use them when the report files are loaded using sqlContext.jsonRDD(). The reason I'm trying to infer the schema myself is that I want to avoid the expensive map/reduce phase that occurs when sqlContext.jsonRDD() is called and no schema is supplied. I figure that since we are creating the JSON file to begin with, we should be able to infer the schema ourselves.

So what I'd like to do is something like this for my processing phase:

    val inputJson = ...  // an RDD
    inputJson.transform().transform()....transform().map(inferSchema()).outputToFile()

I'm trying to figure out the best way to do the inferSchema() step so that I don't need an extra reduce phase, if possible. I'd be okay with having the driver (or something else) do the final merging of the schemas. Would a broadcast value be useful here? Or possibly an accumulator whose accumulation function merges two schemas together?

Any help would be greatly appreciated.
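To make the "merge two schemas" idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. The `FieldType` hierarchy and the `SchemaSketch.infer` / `SchemaSketch.merge` names are hypothetical and only illustrate the shape of the problem; they are not Spark SQL's own schema types, and the widening rules (long to double, anything else conflicting to string) are an assumption rather than what jsonRDD() does internally.

```scala
// Hypothetical, simplified schema model (NOT Spark SQL's StructType).
sealed trait FieldType
case object NullT extends FieldType      // value was null or field was absent
case object BooleanT extends FieldType
case object LongT extends FieldType
case object DoubleT extends FieldType
case object StringT extends FieldType    // also the fallback for conflicts
final case class ArrayT(element: FieldType) extends FieldType
final case class StructT(fields: Map[String, FieldType]) extends FieldType

object SchemaSketch {
  // Infer a schema for one parsed JSON value (e.g. one Map[String, Any] record).
  def infer(value: Any): FieldType = value match {
    case null                                   => NullT
    case _: Boolean                             => BooleanT
    case _: Byte | _: Short | _: Int | _: Long  => LongT
    case _: Float | _: Double                   => DoubleT
    case _: String                              => StringT
    case m: Map[_, _] =>
      StructT(m.iterator.map { case (k, v) => k.toString -> infer(v) }.toMap)
    case s: Seq[_] =>
      // Element type is the merge of all element schemas; empty => NullT.
      ArrayT(s.map(infer).foldLeft(NullT: FieldType)(merge))
    case _                                      => StringT // assumption: unknown => string
  }

  // Combine two schemas into one that covers both inputs.
  def merge(a: FieldType, b: FieldType): FieldType = (a, b) match {
    case (x, y) if x == y                    => x
    case (NullT, y)                          => y
    case (x, NullT)                          => x
    case (LongT, DoubleT) | (DoubleT, LongT) => DoubleT
    case (ArrayT(x), ArrayT(y))              => ArrayT(merge(x, y))
    case (StructT(x), StructT(y)) =>
      StructT((x.keySet ++ y.keySet).iterator.map { k =>
        k -> merge(x.getOrElse(k, NullT), y.getOrElse(k, NullT))
      }.toMap)
    case _                                   => StringT // conflicting types widen to string
  }
}
```

Since `merge` is written so that the order in which records arrive should not matter, it looks like a plausible candidate for an accumulator's combining function (for example the `addInPlace` of an `AccumulatorParam`, with `NullT` as the zero), or for folding each partition down to a single schema and letting the driver merge one small value per partition. Whether an accumulator updated inside a transformation is reliable enough here is a separate question, since such updates can be applied more than once if a task is re-executed; a schema merge like this one is idempotent, which may make that tolerable.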