I'm loading sequence files that contain JSON blobs in the value, transforming them into an RDD[String], and then calling hiveContext.jsonRDD(). It looks like Spark reads the files twice: once when I define the jsonRDD() and then again when I actually make my call to hiveContext.sql().
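For reference, a minimal sketch of the pipeline described above, assuming Spark 1.x; the names `sc`, `hiveContext`, and the input path are my placeholders, not from the original. One partial workaround is to persist the RDD[String] so the second pass reads from the cache rather than re-reading the sequence files from HDFS:

```scala
import org.apache.hadoop.io.Text
import org.apache.spark.storage.StorageLevel

// Pull the JSON blob out of each sequence-file value.
val jsonStrings = sc
  .sequenceFile("hdfs:///path/to/input", classOf[Text], classOf[Text])
  .map { case (_, value) => value.toString }
  .persist(StorageLevel.MEMORY_AND_DISK) // cache so the schema-inference pass
                                         // and the query don't both hit HDFS

val table = hiveContext.jsonRDD(jsonStrings) // first pass: inferSchema()
table.registerTempTable("events")
hiveContext.sql("SELECT * FROM events")      // second pass: the actual query
```

This still iterates the data twice, but only the first iteration touches the files on disk.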
Looking at the code, I see an inferSchema() method that gets called under the hood. I also see an experimental jsonRDD() overload that takes a sampleRatio. My dataset is extremely large and I have a lot of processing to do on it, so looping through it twice is really not a luxury I can afford. I also know that the SQL I'm going to run matches at least some of the records in the files.

Would it make sense, or be possible with the current execution-plan design, to bypass inferring the schema for the sake of speed? I haven't dug deeper into the code than the implementations of the client API methods I'm calling, but I'm wondering whether it's theoretically possible to process the data without pre-determining the schema. I also don't have the luxury of supplying the full schema ahead of time, because I may want to do a "select * from table" while only knowing 2 or 3 of the JSON keys that are actually present. Thanks.
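For what it's worth, the experimental sampleRatio overload mentioned above can be used like this (a sketch under the same assumptions as before: Spark 1.x, with `hiveContext` and an RDD[String] named `jsonStrings` as hypothetical names):

```scala
// Infer the schema from roughly 10% of the records instead of a full pass
// over the entire dataset.
val table = hiveContext.jsonRDD(jsonStrings, 0.1)
table.registerTempTable("events")

// "select *" still works: the visible columns are whatever inference
// found in the sampled records.
hiveContext.sql("SELECT * FROM events")
```

The caveat is that any JSON key that never appears in the sampled records will be missing from the inferred schema, so queries touching rare keys may fail or silently return nulls.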