Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Cheng Lian
That's right. Also, Spark SQL can automatically infer schema from JSON 
datasets. You don't need to specify an Avro schema:


   sqlContext.jsonFile("json/path").saveAsParquetFile("parquet/path")

or with the new reader/writer API introduced in 1.4-SNAPSHOT:

   sqlContext.read.json("json/path").write.parquet("parquet/path")
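
For inputs that carry no schema of their own, such as CSV, the StructType can be spelled out by hand and passed to createDataFrame. The following is only a minimal sketch against the Spark 1.3 API: the two-column layout (name, age), the object name CsvToParquet, the local master setting and the naive comma split are all assumptions made for illustration, and the paths are the same placeholders as above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical driver: the object name and the column layout are illustrative only.
object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("csv-to-parquet").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Schema declared directly as a Spark SQL StructType; no Avro schema involved.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = false)))

    // Naive parsing: splits on commas and does not handle quoting or escaping.
    val rows = sc.textFile("csv/path")
      .map(_.split(","))
      .map(fields => Row(fields(0), fields(1).trim.toInt))

    val df = sqlContext.createDataFrame(rows, schema)

    // Spark 1.3 call; with the 1.4 writer API this would be df.write.parquet(...).
    df.saveAsParquetFile("parquet/path")

    sc.stop()
  }
}
```

The Parquet schema is then derived from the StructType, so the column names and types declared here are what end up in the output files.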

Cheng

On 5/19/15 6:07 PM, Ewan Leith wrote:


Thanks Cheng, that makes sense.

So for new DataFrame creation (not conversion from Avro but from JSON 
or CSV inputs) in Spark we shouldn’t worry about using Avro at all, 
and just use the Spark SQL StructType when building new DataFrames? If 
so, that will be a lot simpler!


Thanks,

Ewan

*From:*Cheng Lian [mailto:lian.cs@gmail.com]
*Sent:* 19 May 2015 11:01
*To:* Ewan Leith; user@spark.apache.org
*Subject:* Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext 
Save or createDataFrame Interfaces?


Hi Ewan,

Unlike AvroParquetWriter, Spark SQL uses StructType as the 
intermediate schema format. So when converting Avro files to 
Parquet files, we internally convert the Avro schema to a Spark SQL 
StructType first, and then convert the StructType to a Parquet schema.


Cheng

On 5/19/15 4:42 PM, Ewan Leith wrote:

Hi all,

I might be missing something, but does the new Spark 1.3
sqlContext save interface support using Avro as the schema
structure when writing Parquet files, in a similar way to
AvroParquetWriter (which I’ve got working)?

I've seen how you can load an avro file and save it as parquet from
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html,
but not using the 2 together.

Thanks, and apologies if I've missed something obvious!

Ewan





RE: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Thanks Cheng, that's brilliant, you've saved me a headache.

Ewan

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: 19 May 2015 11:58
To: Ewan Leith; user@spark.apache.org
Subject: Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or 
createDataFrame Interfaces?

That's right. Also, Spark SQL can automatically infer schema from JSON 
datasets. You don't need to specify an Avro schema:

   sqlContext.jsonFile("json/path").saveAsParquetFile("parquet/path")

or with the new reader/writer API introduced in 1.4-SNAPSHOT:

   sqlContext.read.json("json/path").write.parquet("parquet/path")

Cheng
On 5/19/15 6:07 PM, Ewan Leith wrote:
Thanks Cheng, that makes sense.

So for new DataFrame creation (not conversion from Avro but from JSON or CSV 
inputs) in Spark we shouldn't worry about using Avro at all, and just use the 
Spark SQL StructType when building new DataFrames? If so, that will be a lot 
simpler!

Thanks,
Ewan

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: 19 May 2015 11:01
To: Ewan Leith; user@spark.apache.org
Subject: Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or 
createDataFrame Interfaces?

Hi Ewan,

Unlike AvroParquetWriter, Spark SQL uses StructType as the intermediate 
schema format. So when converting Avro files to Parquet files, we internally 
convert the Avro schema to a Spark SQL StructType first, and then convert the 
StructType to a Parquet schema.

Cheng
On 5/19/15 4:42 PM, Ewan Leith wrote:
Hi all,

I might be missing something, but does the new Spark 1.3 sqlContext save 
interface support using Avro as the schema structure when writing Parquet 
files, in a similar way to AvroParquetWriter (which I've got working)?

I've seen how you can load an avro file and save it as parquet from 
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html,
 but not using the 2 together.

Thanks, and apologies if I've missed something obvious!

Ewan




Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Cheng Lian

Hi Ewan,

Unlike AvroParquetWriter, Spark SQL uses StructType as the 
intermediate schema format. So when converting Avro files to Parquet 
files, we internally convert the Avro schema to a Spark SQL StructType 
first, and then convert the StructType to a Parquet schema.
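
The intermediate StructType can be inspected directly on the loaded DataFrame before anything is written. This is only a rough sketch for Spark 1.3: it assumes the external spark-avro package is on the classpath under the data source name "com.databricks.spark.avro", and the object and method names (AvroToParquet, convert) are hypothetical.

```scala
import org.apache.spark.sql.SQLContext

// Hypothetical helper; assumes spark-avro is available as a data source.
object AvroToParquet {
  def convert(sqlContext: SQLContext, avroPath: String, parquetPath: String): Unit = {
    // Loading goes through the generic data source API; the Avro schema
    // is converted to a Spark SQL StructType at this point.
    val df = sqlContext.load(avroPath, "com.databricks.spark.avro")

    // Prints the StructType that Spark SQL derived from the Avro schema.
    df.printSchema()

    // On write, the StructType is converted to a Parquet schema.
    df.saveAsParquetFile(parquetPath)
  }
}
```

Comparing the printSchema() output with the original Avro schema is a quick way to see how each Avro type was mapped before the Parquet files are produced.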


Cheng

On 5/19/15 4:42 PM, Ewan Leith wrote:


Hi all,

I might be missing something, but does the new Spark 1.3 sqlContext 
save interface support using Avro as the schema structure when writing 
Parquet files, in a similar way to AvroParquetWriter (which I’ve got 
working)?


I've seen how you can load an avro file and save it as parquet from 
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html, 
but not using the 2 together.


Thanks, and apologies if I've missed something obvious!

Ewan





RE: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Thanks Cheng, that makes sense.

So for new DataFrame creation (not conversion from Avro but from JSON or CSV 
inputs) in Spark we shouldn't worry about using Avro at all, and just use the 
Spark SQL StructType when building new DataFrames? If so, that will be a lot 
simpler!

Thanks,
Ewan

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: 19 May 2015 11:01
To: Ewan Leith; user@spark.apache.org
Subject: Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or 
createDataFrame Interfaces?

Hi Ewan,

Unlike AvroParquetWriter, Spark SQL uses StructType as the intermediate 
schema format. So when converting Avro files to Parquet files, we internally 
convert the Avro schema to a Spark SQL StructType first, and then convert the 
StructType to a Parquet schema.

Cheng
On 5/19/15 4:42 PM, Ewan Leith wrote:
Hi all,

I might be missing something, but does the new Spark 1.3 sqlContext save 
interface support using Avro as the schema structure when writing Parquet 
files, in a similar way to AvroParquetWriter (which I've got working)?

I've seen how you can load an avro file and save it as parquet from 
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html,
 but not using the 2 together.

Thanks, and apologies if I've missed something obvious!

Ewan