There are several overloaded versions of both |jsonFile| and |jsonRDD|.
Schema inference is relatively expensive since it requires an extra Spark
job. You can avoid it by storing the inferred schema and then using it
together with the following two methods:
* |def jsonFile(path: String, schema: StructType): SchemaRDD|
* |def jsonRDD(json: RDD[String], schema: StructType): SchemaRDD|
You can use |StructType.json|/|StructType.prettyJson| and
|DataType.fromJson| to store and load the schema.
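Something like the following sketch, for example. This assumes the 1.2-line import paths (|org.apache.spark.sql.catalyst.types|), an existing |sc| and |sqlContext|, and hypothetical HDFS paths; it runs one inference job up front, persists the schema as JSON, and reuses it for later data sets:

```scala
import org.apache.spark.sql.catalyst.types.{DataType, StructType}

// Infer the schema once (this is the extra Spark job)
val first = sqlContext.jsonFile("hdfs:///data/day1.json")  // hypothetical path
val schemaJson = first.schema.prettyJson  // plain JSON string, easy to persist

// Persist the schema, e.g. as a one-line text file on HDFS
sc.parallelize(Seq(schemaJson), 1).saveAsTextFile("hdfs:///data/schema")

// Later: load the stored schema and skip inference entirely
val storedSchema = DataType.fromJson(
  sc.textFile("hdfs:///data/schema").collect().mkString).asInstanceOf[StructType]
val later = sqlContext.jsonRDD(sc.textFile("hdfs:///data/day2.json"), storedSchema)
```

The |asInstanceOf[StructType]| cast is needed because |DataType.fromJson| returns a generic |DataType|, while the |jsonFile|/|jsonRDD| overloads above expect a |StructType|.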
Cheng
On 12/11/14 12:50 PM, Rakesh Nair wrote:
A couple of questions:
1. "sqlContext.jsonFile" reads a JSON file, infers the schema for the
data stored, and then returns a SchemaRDD. Now, I could also create a
SchemaRDD by reading a file as text (which returns RDD[String]) and
then using the "jsonRDD" method. My question: is the "jsonFile" way of
creating a SchemaRDD slower than the second method I mentioned (maybe
because jsonFile needs to infer the schema while jsonRDD just applies
the schema to a dataset)?
The workflow I am thinking of is:
1. For the first data set, use "jsonFile" and infer the schema.
2. Save the schema somewhere.
3. For later data sets, create an RDD[String] and then use the "jsonRDD"
method to convert the RDD[String] to a SchemaRDD.
2. What is the best way to store a schema? Or rather, how can I
serialize a StructType and store it in HDFS so that I can load it later?
--
Regards
Rakesh Nair