There are several overloaded versions of both |jsonFile| and |jsonRDD|.
Schema inference is relatively expensive since it requires an extra Spark
job. You can avoid it by storing the inferred schema and then using it
together with the following two methods:
* |def jsonFile(path: String, schema: StructType): SchemaRDD|
* |def jsonRDD(json: RDD[String], schema: StructType): SchemaRDD|
You can use |StructType.json|/|StructType.prettyJson| and
|DataType.fromJson| to store and load the schema.
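Something like the following sketch, for example. This assumes the 1.2-line import paths (|org.apache.spark.sql.catalyst.types|), an existing |sc| and |sqlContext|, and hypothetical HDFS paths; it runs one inference job up front, persists the schema as JSON, and reuses it for later data sets:

```scala
import org.apache.spark.sql.catalyst.types.{DataType, StructType}

// Infer the schema once (this is the extra Spark job)
val first = sqlContext.jsonFile("hdfs:///data/day1.json")  // hypothetical path
val schemaJson = first.schema.prettyJson  // plain JSON string, easy to persist

// Persist the schema, e.g. as a one-line text file on HDFS
sc.parallelize(Seq(schemaJson), 1).saveAsTextFile("hdfs:///data/schema")

// Later: load the stored schema and skip inference entirely
val storedSchema = DataType.fromJson(
  sc.textFile("hdfs:///data/schema").collect().mkString).asInstanceOf[StructType]
val later = sqlContext.jsonRDD(sc.textFile("hdfs:///data/day2.json"), storedSchema)
```

The |asInstanceOf[StructType]| cast is needed because |DataType.fromJson| returns a generic |DataType|, while the |jsonFile|/|jsonRDD| overloads above expect a |StructType|.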
Cheng
On 12/11/14 12:50 PM, Rakesh Nair wrote:
A couple of questions:
1. "sqlContext.jsonFile" reads a JSON file, infers the schema for the
data stored, and then returns a SchemaRDD. Now, I could also create a
SchemaRDD by reading a file as text (which returns RDD[String]) and
then using the "jsonRDD" method. My question: is the "jsonFile" way of
creating a SchemaRDD slower than the second method I mentioned (maybe
because jsonFile needs to infer the schema while jsonRDD just applies
the schema to a dataset)?
The workflow I am thinking of is:
1. For the first data set, use "jsonFile" and infer the schema.
2. Save the schema somewhere.
3. For later data sets, create an RDD[String] and then use the "jsonRDD"
method to convert the RDD[String] to a SchemaRDD.
2. What is the best way to store a schema? Or rather, how can I
serialize a StructType and store it in HDFS so that I can load it later?
--
Regards
Rakesh Nair