I'm loading sequence files containing JSON blobs in the value, transforming
them into RDD[String], and then using hiveContext.jsonRDD(). It looks like
Spark reads the files twice: once when I define the jsonRDD() and then
again when I actually make my call to hiveContext.sql().
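
For reference, here's roughly the pipeline I'm running (Spark 1.x; the
path and field names below are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // Writable conversions for sequenceFile
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("json-load"))
    val hiveContext = new HiveContext(sc)

    // sequence files with the JSON blob in the value
    val jsonStrings = sc.sequenceFile[String, String]("hdfs:///data/events")
      .map { case (_, value) => value }

    val events = hiveContext.jsonRDD(jsonStrings)      // pass 1: schema inference
    events.registerTempTable("events")
    hiveContext.sql("SELECT * FROM events").collect()  // pass 2: the actual query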

Looking at the code, I see an inferSchema() method that gets called under
the hood. I also see an experimental jsonRDD() overload that takes a
sampleRatio.
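
If I understand that overload correctly, it lets the inference pass scan
only a sample of the records, something like (0.01 is an arbitrary ratio
I picked):

    // infer the schema from roughly 1% of the records instead of a full scan
    val sampled = hiveContext.jsonRDD(jsonStrings, 0.01)

That would make the first pass cheaper but, as far as I can tell, not
eliminate it.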

My dataset is extremely large and I've got a lot of processing to do on
it, so looping through it twice is a luxury I really can't afford. I also
know that the SQL I am going to run matches at least "some" of the
records contained in the files. Would it make sense, or be possible with
the current execution plan design, to bypass inferring the schema for the
sake of speed?
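
The only workaround I've come up with so far is caching the string RDD so
the second pass at least reads from memory rather than re-reading the
sequence files:

    import org.apache.spark.storage.StorageLevel

    // keep the JSON strings around between the inference pass and the query
    jsonStrings.persist(StorageLevel.MEMORY_AND_DISK_SER)

but with a dataset this size that's a lot of memory/disk to burn just to
avoid a second scan.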

Though I haven't dug further into the code than the implementations of
the client API methods I'm calling, I'm wondering if there's a way to
process the data without pre-determining the schema. I also don't have
the luxury of supplying the full schema ahead of time, because I may want
to do a "select * from table" while only knowing 2 or 3 of the JSON keys
that are actually available.
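
I'm aware of the jsonRDD() overload that takes an explicit StructType,
which I believe skips inference entirely, but if I declare only the keys
I know about then "select *" only returns those fields (field names here
are hypothetical; the imports shown are the Spark 1.3+ paths):

    import org.apache.spark.sql.types._

    // partial schema containing only the keys I know up front
    val knownFields = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("ts", LongType, nullable = true)
    ))
    val typed = hiveContext.jsonRDD(jsonStrings, knownFields)
    typed.registerTempTable("events")  // "select *" now sees only id and ts

So that doesn't quite cover my use case either.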

Thanks.
