Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
Hi Ted,

Thanks for your help.
I'm able to read LZO files using sparkContext.newAPIHadoopFile, but I
couldn't do the same with sqlContext because sqlContext.jsonFile does not
provide a way to configure the input file format. Do you know if there is
an API for that?
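
(For reference, a minimal sketch of the newAPIHadoopFile read, assuming the
hadoop-lzo package and its native libraries are installed; the path is
illustrative:)

  import com.hadoop.mapreduce.LzoTextInputFormat
  import org.apache.hadoop.io.{LongWritable, Text}

  // LzoTextInputFormat splits and decompresses the .lzo files;
  // each record is a (byte offset, line of text) pair.
  val pairs = sc.newAPIHadoopFile(
    "hdfs:///data/events/*.lzo",
    classOf[LzoTextInputFormat],
    classOf[LongWritable],
    classOf[Text])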

Best Regards,

Jerry

On Wed, Dec 17, 2014 at 11:27 AM, Ted Yu yuzhih...@gmail.com wrote:

 See this thread: http://search-hadoop.com/m/JW1q5HAuFv
 which references https://issues.apache.org/jira/browse/SPARK-2394

 Cheers





Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Ted Yu
See this thread: http://search-hadoop.com/m/JW1q5HAuFv
which references https://issues.apache.org/jira/browse/SPARK-2394

Cheers

On Wed, Dec 17, 2014 at 8:21 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi spark users,

 Do you know how to read LZO-compressed JSON files using Spark SQL?

 I'm looking into sqlContext.jsonFile, but I don't know how to configure it
 to read LZO files.

 Best Regards,

 Jerry



Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Ted Yu
In SQLContext:

  def jsonFile(path: String, samplingRatio: Double): SchemaRDD = {
    val json = sparkContext.textFile(path)
    jsonRDD(json, samplingRatio)
  }

Looks like jsonFile() could be enhanced to call
sparkContext.newAPIHadoopFile() with the proper input file format.
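
One possible shape for that enhancement, mirroring jsonFile above inside
SQLContext (a hypothetical sketch, not an existing Spark API):

  import com.hadoop.mapreduce.LzoTextInputFormat
  import org.apache.hadoop.io.{LongWritable, Text}

  // Like jsonFile, but reads splittable LZO text instead of plain text,
  // then feeds the decoded lines to jsonRDD exactly as before.
  def lzoJsonFile(path: String, samplingRatio: Double = 1.0): SchemaRDD = {
    val json = sparkContext
      .newAPIHadoopFile(path, classOf[LzoTextInputFormat],
        classOf[LongWritable], classOf[Text])
      .map(_._2.toString)
    jsonRDD(json, samplingRatio)
  }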

Cheers

On Wed, Dec 17, 2014 at 8:33 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi Ted,

 Thanks for your help.
 I'm able to read LZO files using sparkContext.newAPIHadoopFile, but I
 couldn't do the same with sqlContext because sqlContext.jsonFile does not
 provide a way to configure the input file format. Do you know if there is
 an API for that?

 Best Regards,

 Jerry





Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Michael Armbrust
You can create an RDD[String] using whatever method and pass that to
jsonRDD.
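
For example, continuing from the newAPIHadoopFile sketch earlier in the
thread (so pairs is the RDD of (offset, line) records; names illustrative):

  // Turn the (offset, line) pairs into an RDD[String] of raw JSON lines,
  // then hand it to jsonRDD, which infers the schema from the data.
  val jsonLines = pairs.map(_._2.toString)
  val table = sqlContext.jsonRDD(jsonLines)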

On Wed, Dec 17, 2014 at 8:33 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi Ted,

 Thanks for your help.
 I'm able to read LZO files using sparkContext.newAPIHadoopFile, but I
 couldn't do the same with sqlContext because sqlContext.jsonFile does not
 provide a way to configure the input file format. Do you know if there is
 an API for that?

 Best Regards,

 Jerry





Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
Hi Michael,

This is what I did. I was wondering whether there is a more efficient way to
accomplish it.

I was running a very simple benchmark: converting LZO-compressed JSON files
to Parquet files using Spark SQL vs. Hadoop MR.

Spark SQL seems to require two stages to accomplish this task:
Stage 1: read the LZO files using newAPIHadoopFile with LzoTextInputFormat,
then convert the lines with jsonRDD
Stage 2: saveAsParquetFile on the resulting SchemaRDD
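
In code, roughly (a sketch reusing the illustrative paths and imports from
the earlier messages):

  // Stage 1: read and decompress the LZO text, then infer the JSON schema.
  val json = sc.newAPIHadoopFile("hdfs:///data/events/*.lzo",
      classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
    .map(_._2.toString)
  val table = sqlContext.jsonRDD(json)
  // Stage 2: write the result out as Parquet (output path illustrative).
  table.saveAsParquetFile("hdfs:///data/events-parquet")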

In Hadoop, it takes one step: a map-only job that reads the data and writes
the JSON out to the Parquet file (I'm using Elephant Bird's LzoJsonLoader to
load the files).

In some scenarios, Hadoop is faster because it saves one stage. Did I do
something wrong?

Best Regards,

Jerry


On Wed, Dec 17, 2014 at 1:29 PM, Michael Armbrust mich...@databricks.com
wrote:

 You can create an RDD[String] using whatever method and pass that to
 jsonRDD.





Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Michael Armbrust
The first pass is inferring the schema of the JSON data. If you already
know the schema, you can skip this pass by specifying it as the second
parameter to jsonRDD.
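
A sketch of that variant, assuming jsonLines is the RDD[String] built
earlier and a hypothetical two-field schema (field names are illustrative,
not from this thread):

  import org.apache.spark.sql._

  // Declaring the schema up front skips the inference pass over the data.
  val schema = StructType(Seq(
    StructField("id", LongType, true),
    StructField("name", StringType, true)))

  val table = sqlContext.jsonRDD(jsonLines, schema)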

On Wed, Dec 17, 2014 at 10:59 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi Michael,

 This is what I did. I was wondering whether there is a more efficient way to
 accomplish it.

 I was running a very simple benchmark: converting LZO-compressed JSON files
 to Parquet files using Spark SQL vs. Hadoop MR.

 Spark SQL seems to require two stages to accomplish this task:
 Stage 1: read the LZO files using newAPIHadoopFile with LzoTextInputFormat,
 then convert the lines with jsonRDD
 Stage 2: saveAsParquetFile on the resulting SchemaRDD

 In Hadoop, it takes one step: a map-only job that reads the data and writes
 the JSON out to the Parquet file (I'm using Elephant Bird's LzoJsonLoader to
 load the files).

 In some scenarios, Hadoop is faster because it saves one stage. Did I do
 something wrong?

 Best Regards,

 Jerry






Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Michael Armbrust
To be a little more clear: jsonRDD and jsonFile use the same implementation
underneath. jsonFile is just a convenience method that does
jsonRDD(sc.textFile(...)).
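
In other words (path illustrative):

  // These two produce the same SchemaRDD:
  val a = sqlContext.jsonFile("hdfs:///data/in.json")
  val b = sqlContext.jsonRDD(sc.textFile("hdfs:///data/in.json"))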

On Wed, Dec 17, 2014 at 11:37 AM, Michael Armbrust mich...@databricks.com
wrote:

 The first pass is inferring the schema of the JSON data. If you already
 know the schema, you can skip this pass by specifying it as the second
 parameter to jsonRDD.
