To be a little more clear, jsonRDD and jsonFile use the same implementation
underneath.  jsonFile is just a convenience method that does
jsonRDD(sc.textFile(...))
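
For example (a sketch against the Spark 1.x API; `path` is a placeholder
for your input location):

    // these two produce the same SchemaRDD
    val people1 = sqlContext.jsonFile(path)
    val people2 = sqlContext.jsonRDD(sc.textFile(path))

So anything you can load as an RDD[String] can be fed to jsonRDD directly.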

On Wed, Dec 17, 2014 at 11:37 AM, Michael Armbrust <mich...@databricks.com>
wrote:
>
> The first pass is inferring the schema of the JSON data.  If you already
> know the schema you can skip this pass by specifying the schema as the
> second parameter to jsonRDD.
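>
> For example (a sketch only; the field names and the exact import are
> placeholders and may differ by Spark version):
>
>     import org.apache.spark.sql._
>     val schema = StructType(Seq(StructField("name", StringType, true)))
>     // schema inference pass is skipped because the schema is supplied
>     val json = sqlContext.jsonRDD(lines, schema)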
>
> On Wed, Dec 17, 2014 at 10:59 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>> Hi Michael,
>>
>> This is what I did. I was thinking if there is a more efficient way to
>> accomplish this.
>>
>> I was doing a very simple benchmark: Convert lzo compressed json files to
>> parquet files using SparkSQL vs. Hadoop MR.
>>
>> Spark SQL seems to require 2 stages to accomplish this task:
>> Stage 1: read the lzo files using newAPIHadoopFile with
>> LzoTextInputFormat and then convert it to JsonRDD
>> Stage 2: saveAsParquetFile from the JsonRDD
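>>
>> Roughly, in code (a sketch; class names are from elephant-bird and
>> Hadoop, and the exact signatures may differ in your setup):
>>
>>     // Stage 1: read lzo-compressed text, keep the line values
>>     val lines = sc.newAPIHadoopFile(path, classOf[LzoTextInputFormat],
>>       classOf[LongWritable], classOf[Text]).map(_._2.toString)
>>     // Stage 2: infer the schema and write Parquet
>>     sqlContext.jsonRDD(lines).saveAsParquetFile(out)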
>>
>> In Hadoop, it takes one step, a map-only job that reads the data and
>> writes the json out as a parquet file (I'm using elephant bird's
>> LzoJsonLoader to load the files).
>>
>> In some scenarios, Hadoop is faster because it saves one stage. Did I
>> do something wrong?
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Wed, Dec 17, 2014 at 1:29 PM, Michael Armbrust <mich...@databricks.com
>> > wrote:
>>>
>>> You can create an RDD[String] using whatever method and pass that to
>>> jsonRDD.
>>>
>>> On Wed, Dec 17, 2014 at 8:33 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>>>
>>>> Hi Ted,
>>>>
>>>> Thanks for your help.
>>>> I'm able to read lzo files using sparkContext.newAPIHadoopFile, but I
>>>> couldn't do the same with sqlContext because sqlContext.jsonFile does
>>>> not provide a way to configure the input file format. Do you know if
>>>> there is an API for that?
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>> On Wed, Dec 17, 2014 at 11:27 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>> See this thread: http://search-hadoop.com/m/JW1q5HAuFv
>>>>> which references https://issues.apache.org/jira/browse/SPARK-2394
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Wed, Dec 17, 2014 at 8:21 AM, Jerry Lam <chiling...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi spark users,
>>>>>>
>>>>>> Do you know how to read json files using Spark SQL that are LZO
>>>>>> compressed?
>>>>>>
>>>>>> I'm looking into sqlContext.jsonFile but I don't know how to
>>>>>> configure it to read lzo files.
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Jerry
>>>>>>
>>>>>