There are several overloaded versions of both |jsonFile| and |jsonRDD|. Schema inference is fairly expensive since it requires an extra Spark job. You can avoid it by storing the inferred schema and then reusing it with the following two methods, as sketched below:

 * |def jsonFile(path: String, schema: StructType): SchemaRDD|
 * |def jsonRDD(json: RDD[String], schema: StructType): SchemaRDD|
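
For example, a minimal sketch of skipping inference on subsequent reads (the file path and SparkContext setup are placeholders, not from this thread):

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)  // sc is an existing SparkContext

    // First run: let Spark infer the schema (this costs an extra job)
    val first = sqlContext.jsonFile("hdfs:///data/people.json")
    val schema = first.schema

    // Later runs: pass the saved schema so no inference job is triggered
    val later = sqlContext.jsonFile("hdfs:///data/people.json", schema)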

You can use |StructType.json|/|StructType.prettyJson| and |DataType.fromJson| to store and load the schema.
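
A minimal round-trip sketch, assuming the Spark 1.2 package layout; the HDFS paths and the Hadoop FileSystem calls are just one way to persist the string. Note that |DataType.fromJson| returns a |DataType|, so a cast to |StructType| is needed:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql._

    // Store: serialize the inferred schema as a JSON string on HDFS
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val out = fs.create(new Path("hdfs:///schemas/people.json"))
    out.write(schema.json.getBytes("UTF-8"))  // schema.prettyJson also works
    out.close()

    // Load: read the string back and rebuild the StructType
    val in = fs.open(new Path("hdfs:///schemas/people.json"))
    val loadedSchema =
      DataType.fromJson(scala.io.Source.fromInputStream(in).mkString)
        .asInstanceOf[StructType]
    in.close()

    // Apply it to new data without another inference job
    val rows = sqlContext.jsonRDD(sc.textFile("hdfs:///data/more.json"), loadedSchema)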

Cheng

On 12/11/14 12:50 PM, Rakesh Nair wrote:

Couple of questions:
1. "sqlContext.jsonFile" reads a JSON file, infers the schema of the stored data, and returns a SchemaRDD. I could also create a SchemaRDD by reading the file as text (which returns an RDD[String]) and then using the "jsonRDD" method. My question: is the "jsonFile" way of creating a SchemaRDD slower than the second approach, perhaps because "jsonFile" has to infer the schema while "jsonRDD" merely applies a given schema to the data set?

The workflow I am thinking of is:
1. For the first data set, use "jsonFile" and infer the schema.
2. Save the schema somewhere.
3. For later data sets, create an RDD[String] and then use the "jsonRDD" method to convert the RDD[String] to a SchemaRDD.


2. What is the best way to store a schema? Or rather, how can I serialize a StructType and store it in HDFS so that I can load it later?

--
Regards
Rakesh Nair
