To be a little more clear: jsonRDD and jsonFile use the same implementation underneath. jsonFile is just a convenience method that does jsonRDD(sc.textFile(...)).
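A rough sketch of that equivalence (Spark 1.2-era API; the path and app name here are made up for illustration, and this needs a Spark runtime to execute):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("json-example"))
val sqlContext = new SQLContext(sc)

// jsonFile is shorthand for jsonRDD over a plain text RDD:
val viaFile = sqlContext.jsonFile("hdfs:///data/events.json")
val viaRDD  = sqlContext.jsonRDD(sc.textFile("hdfs:///data/events.json"))
// Both produce a SchemaRDD. The important point for this thread is that
// jsonRDD accepts any RDD[String], which is what lets you plug in a
// custom input format (e.g. an LZO-aware one) instead of textFile.
```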
On Wed, Dec 17, 2014 at 11:37 AM, Michael Armbrust <mich...@databricks.com> wrote:
>
> The first pass is inferring the schema of the JSON data. If you already
> know the schema, you can skip this pass by specifying the schema as the
> second parameter to jsonRDD.
>
> On Wed, Dec 17, 2014 at 10:59 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>> Hi Michael,
>>
>> This is what I did. I was wondering whether there is a more efficient
>> way to accomplish this.
>>
>> I was running a very simple benchmark: convert LZO-compressed JSON files
>> to Parquet files using Spark SQL vs. Hadoop MR.
>>
>> Spark SQL seems to require 2 stages to accomplish this task:
>> Stage 1: read the LZO files using newAPIHadoopFile with
>> LzoTextInputFormat and then convert the result to a JsonRDD.
>> Stage 2: saveAsParquetFile from the JsonRDD.
>>
>> In Hadoop, it takes one step: a map-only job reads the data and then
>> writes the JSON out to the Parquet file (I'm using elephant-bird's
>> LzoJsonLoader to load the files).
>>
>> In some scenarios, Hadoop is faster because it saves one stage. Did I
>> do something wrong?
>>
>> Best Regards,
>>
>> Jerry
>>
>> On Wed, Dec 17, 2014 at 1:29 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>
>>> You can create an RDD[String] using whatever method and pass that to
>>> jsonRDD.
>>>
>>> On Wed, Dec 17, 2014 at 8:33 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>>>
>>>> Hi Ted,
>>>>
>>>> Thanks for your help.
>>>> I'm able to read LZO files using sparkContext.newAPIHadoopFile, but I
>>>> couldn't do the same through sqlContext because sqlContext.jsonFile does
>>>> not provide a way to configure the input file format. Do you know if
>>>> there are some APIs to do that?
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>> On Wed, Dec 17, 2014 at 11:27 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>> See this thread: http://search-hadoop.com/m/JW1q5HAuFv
>>>>> which references https://issues.apache.org/jira/browse/SPARK-2394
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Wed, Dec 17, 2014 at 8:21 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>>>>>
>>>>>> Hi spark users,
>>>>>>
>>>>>> Do you know how to read JSON files with Spark SQL when they are LZO
>>>>>> compressed?
>>>>>>
>>>>>> I'm looking into sqlContext.jsonFile, but I don't know how to
>>>>>> configure it to read LZO files.
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Jerry
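Putting the thread's suggestions together, a hedged sketch of the whole LZO-to-Parquet pipeline (Spark 1.2-era API with elephant-bird's LzoTextInputFormat; the paths, field names, and schema are invented for illustration, and exact import locations for the schema types varied across early Spark releases):

```scala
import com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
// In Spark 1.2 the schema types were aliased under org.apache.spark.sql;
// in 1.3+ they moved to org.apache.spark.sql.types.
import org.apache.spark.sql.{StructType, StructField, StringType, LongType}

val sc = new SparkContext(new SparkConf().setAppName("lzo-json-to-parquet"))
val sqlContext = new SQLContext(sc)

// Read the LZO-compressed text through the Hadoop input format and keep
// only the line contents, giving the RDD[String] that jsonRDD wants.
val lines = sc.newAPIHadoopFile(
  "hdfs:///data/json-lzo",
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text]
).map(_._2.toString)

// Supplying the schema as the second argument skips the schema-inference
// pass Michael describes above.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true)
))

val json = sqlContext.jsonRDD(lines, schema)
json.saveAsParquetFile("hdfs:///data/json-parquet")
```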