Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
Hi spark users, Do you know how to read LZO-compressed JSON files using Spark SQL? I'm looking into sqlContext.jsonFile, but I don't know how to configure it to read LZO files. Best Regards, Jerry

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
Hi Ted, Thanks for your help. I'm able to read LZO files using sparkContext.newAPIHadoopFile, but I couldn't do the same with sqlContext because sqlContext.jsonFile does not provide a way to configure the input file format. Do you know if there are APIs for that? Best Regards, Jerry
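For context, a minimal sketch of what reading LZO-compressed text through the new Hadoop API can look like, assuming the Twitter hadoop-lzo package (which provides com.hadoop.mapreduce.LzoTextInputFormat) is on the classpath; the path is a placeholder:

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // Each record comes back as a (byte offset, line) pair; keep only the text.
    val lines = sc.newAPIHadoopFile(
      "hdfs:///data/logs/*.json.lzo",  // hypothetical path
      classOf[LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text]
    ).map(_._2.toString)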

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Ted Yu
See this thread: http://search-hadoop.com/m/JW1q5HAuFv which references https://issues.apache.org/jira/browse/SPARK-2394. Cheers

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Ted Yu
In SQLContext:

    def jsonFile(path: String, samplingRatio: Double): SchemaRDD = {
      val json = sparkContext.textFile(path)
      jsonRDD(json, samplingRatio)
    }

Looks like jsonFile() could be enhanced with a call to sparkContext.newAPIHadoopFile() with the proper input file format. Cheers
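A helper along the lines Ted suggests might look like the sketch below. lzoJsonFile is an illustrative name, not an actual SQLContext method, and it assumes the hadoop-lzo input format from the sketch above:

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    // Illustrative only: read LZO-compressed text, then reuse jsonRDD for parsing.
    def lzoJsonFile(sc: SparkContext, sqlContext: SQLContext, path: String) = {
      val json = sc.newAPIHadoopFile(path,
        classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
        .map(_._2.toString)
      sqlContext.jsonRDD(json)
    }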

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Michael Armbrust
You can create an RDD[String] using whatever method and pass that to jsonRDD.
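Concretely, the RDD[String] built with newAPIHadoopFile above can be handed straight to jsonRDD; a sketch reusing the earlier hypothetical lines RDD:

    // Infer the schema by scanning the data, then register the result for SQL queries.
    val events = sqlContext.jsonRDD(lines)
    events.registerTempTable("events")  // "events" is an illustrative table name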

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
Hi Michael, This is what I did. I was wondering if there is a more efficient way to accomplish it. I was doing a very simple benchmark: convert LZO-compressed JSON files to Parquet files using Spark SQL vs. Hadoop MR. Spark SQL seems to require 2 stages to accomplish this task: Stage 1: read the JSON data; Stage 2: convert it to Parquet.
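Once a SchemaRDD exists, the Parquet conversion itself is a single call; a sketch of the two stages Jerry describes, with hypothetical names and paths:

    // Stage 1: parse the JSON (this pass is also what infers the schema).
    val events = sqlContext.jsonRDD(lines)
    // Stage 2: write the result out as Parquet.
    events.saveAsParquetFile("hdfs:///data/logs-parquet")  // hypothetical path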

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Michael Armbrust
The first pass is inferring the schema of the JSON data. If you already know the schema, you can skip this pass by specifying the schema as the second parameter to jsonRDD.
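A sketch of skipping the inference pass by supplying the schema up front, assuming a simple two-field record (the field names are illustrative; in Spark 1.1 the type classes are exposed under org.apache.spark.sql):

    import org.apache.spark.sql._

    // Hand-built schema: no sampling pass over the data is needed.
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("message", StringType, nullable = true)))

    val events = sqlContext.jsonRDD(lines, schema)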

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Michael Armbrust
To be a little more clear, jsonRDD and jsonFile use the same implementation underneath. jsonFile is just a convenience method that does jsonRDD(sc.textFile(...)).