Re: Spark SQL : sqlContext.jsonFile date type detection and performance

Yin Huai Tue, 21 Oct 2014 17:21:07 -0700

One more thing about question 1: once you get the SchemaRDD from
jsonFile/jsonRDD, you can use CAST(columnName AS DATE) in your query to
cast the column from StringType to DateType. The string must be in the
format "yyyy-[m]m-[d]d", and you need to use a HiveContext. Here is a
code snippet that may help.


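// CAST(... AS DATE) needs a HiveContext rather than a plain SQLContext.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val schemaRDD = hiveContext.jsonFile(...)
// Register the SchemaRDD so the query can refer to it by name.
schemaRDD.registerTempTable("jsonTable")
hiveContext.sql("SELECT CAST(columnName AS DATE) FROM jsonTable")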
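
To check up front whether your strings fit the "yyyy-[m]m-[d]d" format, here is a minimal sketch using java.sql.Date.valueOf, which is documented to accept that same pattern. The object and method names (DateFormatCheck, parsesAsDate) are made up for illustration, and it is an assumption that the CAST accepts exactly the strings this check does.

```scala
import java.sql.Date
import scala.util.Try

// Hypothetical helper, not part of Spark: reports whether a string
// can be parsed by java.sql.Date.valueOf ("yyyy-[m]m-[d]d").
object DateFormatCheck {
  def parsesAsDate(s: String): Boolean =
    // valueOf throws IllegalArgumentException on a malformed string.
    Try(Date.valueOf(s)).isSuccess
}
```

Strings such as "2014-10-21" and "2014-1-5" pass this check, while "10/21/2014" does not.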

Thanks,

Yin

On Tue, Oct 21, 2014 at 8:00 PM, Yin Huai <huaiyin....@gmail.com> wrote:

> Hello Tridib,
>
> I just saw this one.
>
> 1. Right now, jsonFile and jsonRDD do not detect the date type. Only
> IntegerType, LongType, DoubleType, DecimalType, StringType, BooleanType,
> StructType and ArrayType are detected automatically.
> 2. Schema inference makes one pass over the entire dataset to
> determine the schema, so you will see a job launched. Applying a
> specific schema to a dataset does not have this cost.
> 3. It is hard to comment on it without seeing your implementation. For our
> built-in JSON support, jsonFile and jsonRDD provide a very convenient way
> to work with JSON datasets with SQL. You do not need to define the schema
> in advance, and Spark SQL will automatically create the SchemaRDD for your
> dataset. You can start to query it with SQL by simply registering the
> returned SchemaRDD as a temp table. Regarding the implementation, we use a
> high-performance JSON lib (Jackson, https://github.com/FasterXML/jackson)
> to parse JSON records.
>
> Thanks,
>
> Yin
>
> On Mon, Oct 20, 2014 at 10:56 PM, tridib <tridib.sama...@live.com> wrote:
>
>> Hi Spark SQL team,
>> I am trying to explore automatic schema detection for JSON documents. I
>> have a few questions:
>> 1. What should be the date format to detect the fields as date type?
>> 2. Is automatic schema inference slower than applying a specific schema?
>> 3. At the moment I am parsing the JSON myself using a map function and
>> creating a SchemaRDD from the parsed JavaRDD. Is there any performance
>> impact from not using the built-in jsonFile()?
>>
>> Thanks
>> Tridib
>
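
On question 2 in the quoted reply, here is a toy sketch of why inference has to read every record: the type of a field can only be settled after all of its values have been seen. This is not Spark's implementation; the names (JType, InferSketch, inferSchema) and the simplified widening rules are assumptions made for illustration.

```scala
// Toy type lattice, much smaller than Spark SQL's real set of types.
sealed trait JType
case object JLong extends JType
case object JDouble extends JType
case object JBoolean extends JType
case object JString extends JType

object InferSketch {
  type Record = Map[String, Any]

  // Type of a single already-parsed value.
  def typeOf(v: Any): JType = v match {
    case _: Int | _: Long => JLong
    case _: Double        => JDouble
    case _: Boolean       => JBoolean
    case _                => JString
  }

  // Widen two observed types to one that can hold both.
  def merge(a: JType, b: JType): JType = (a, b) match {
    case (x, y) if x == y                    => x
    case (JLong, JDouble) | (JDouble, JLong) => JDouble
    case _                                   => JString
  }

  // One full pass over the data: every record can add a field or
  // widen the type of an existing one.
  def inferSchema(records: Iterator[Record]): Map[String, JType] =
    records.foldLeft(Map.empty[String, JType]) { (schema, rec) =>
      rec.foldLeft(schema) { case (acc, (field, value)) =>
        val t = typeOf(value)
        acc.updated(field, acc.get(field).map(merge(_, t)).getOrElse(t))
      }
    }
}
```

When the schema is supplied up front, this extra pass is not needed, which is the cost difference described in the reply.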
