Re: Kafka-HDFS to store as Parquet format

2014-10-07 Thread Soumitra Kumar
Currently I am not doing anything; if anything changes, I start from scratch.

In general, I doubt there are many options to account for schema changes. If you 
are reading the files using Impala, it may allow schema changes that are 
append-only. Otherwise, existing Parquet files have to be migrated to the new schema.
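
To make "append-only" concrete, here is a minimal sketch in plain Scala. It is not Impala's or Parquet's actual compatibility rule; the Field type, the SchemaCheck object, and the sample schemas are hypothetical names introduced only for illustration. It treats a schema as an ordered list of (name, type) pairs and calls a change append-only when the old schema is an unchanged prefix of the new one.

```scala
// Hypothetical, simplified model of a flat schema: ordered (name, type) pairs.
final case class Field(name: String, typeName: String)

object SchemaCheck {
  /** True when `newer` keeps every field of `older` unchanged and in order,
    * and only adds fields at the end. */
  def isAppendOnly(older: Seq[Field], newer: Seq[Field]): Boolean =
    newer.startsWith(older)

  // Sample schemas (made up for this sketch).
  val v1: Seq[Field] = Seq(Field("ts", "int64"), Field("user", "binary"))
  val v2: Seq[Field] = v1 :+ Field("country", "binary")                   // field appended
  val v3: Seq[Field] = Seq(Field("user", "binary"), Field("ts", "int64")) // fields reordered
  val v4: Seq[Field] = Seq(Field("ts", "int32"), Field("user", "binary")) // type changed

  def main(args: Array[String]): Unit = {
    println(isAppendOnly(v1, v2)) // true
    println(isAppendOnly(v1, v3)) // false
    println(isAppendOnly(v1, v4)) // false
  }
}
```

A check like this could run before each write, so that a non-append change triggers the "migrate existing files" path rather than silently producing files readers cannot combine.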

- Original Message -
From: Buntu Dev buntu...@gmail.com
To: Soumitra Kumar kumar.soumi...@gmail.com
Cc: u...@spark.incubator.apache.org
Sent: Tuesday, October 7, 2014 10:18:16 AM
Subject: Re: Kafka-HDFS to store as Parquet format


Thanks for the info, Soumitra. It's a good start for me. 


Just wanted to know how you are managing schema changes/evolution as 
parquetSchema is provided to setSchema in the above sample code. 


On Tue, Oct 7, 2014 at 10:09 AM, Soumitra Kumar  kumar.soumi...@gmail.com  
wrote: 


I have used it to write Parquet files as: 

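import org.apache.hadoop.mapreduce.Job
// Older parquet-mr releases use the package prefix parquet.* instead of org.apache.parquet.*
import org.apache.parquet.example.data.Group
import org.apache.parquet.hadoop.ParquetOutputFormat
import org.apache.parquet.hadoop.example.ExampleOutputFormat
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.schema.MessageTypeParser

val job = new Job
val conf = job.getConfiguration
conf.set(ParquetOutputFormat.COMPRESSION, CompressionCodecName.SNAPPY.name())
ExampleOutputFormat.setSchema(job, MessageTypeParser.parseMessageType(parquetSchema))
// rdd must be a pair RDD whose values are Parquet Group objects
rdd.saveAsNewAPIHadoopFile(rddToFileName(outputDir, em, time), classOf[Void], classOf[Group], classOf[ExampleOutputFormat], conf)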
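
The snippet above assumes a parquetSchema string and an RDD whose values are already Group objects, neither of which is shown in the thread. The following is a minimal sketch of how those two pieces might look, assuming the parquet-mr example classes (org.apache.parquet.* packages) are on the classpath. The schema text, the Event case class, and the toGroup helper are hypothetical and stand in for whatever the parsed JSON records actually contain.

```scala
import org.apache.parquet.example.data.Group
import org.apache.parquet.example.data.simple.SimpleGroupFactory
import org.apache.parquet.schema.MessageTypeParser

object GroupExample {
  // Hypothetical schema in Parquet's message-type syntax;
  // the thread never shows the real parquetSchema string.
  val parquetSchema: String =
    """message event {
      |  required int64 ts;
      |  required binary user (UTF8);
      |}""".stripMargin

  // Hypothetical record type standing in for one parsed JSON message.
  final case class Event(ts: Long, user: String)

  /** Build one Parquet Group for a record, using the field names of the schema above. */
  def toGroup(factory: SimpleGroupFactory, e: Event): Group =
    factory.newGroup().append("ts", e.ts).append("user", e.user)

  def main(args: Array[String]): Unit = {
    val schema  = MessageTypeParser.parseMessageType(parquetSchema)
    val factory = new SimpleGroupFactory(schema)
    val group   = toGroup(factory, Event(1412701096000L, "bdev"))
    // Read the values back by field name and index within the field.
    println(group.getLong("ts", 0))
    println(group.getString("user", 0))
  }
}
```

In a Spark job, one plausible wiring is to map each record to a (null, Group) pair before calling saveAsNewAPIHadoopFile, creating the factory inside the task rather than on the driver.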



- Original Message - 
From: bdev  buntu...@gmail.com  
To: u...@spark.incubator.apache.org 
Sent: Tuesday, October 7, 2014 9:51:40 AM 
Subject: Re: Kafka-HDFS to store as Parquet format 

After a bit of looking around, I found that saveAsNewAPIHadoopFile could be used 
to specify the ParquetOutputFormat. Has anyone used it to convert JSON to 
Parquet format? Any pointers are welcome, thanks! 



-- 
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-HDFS-to-store-as-Parquet-format-tp15768p15852.html
 
Sent from the Apache Spark User List mailing list archive at Nabble.com. 

- 
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
For additional commands, e-mail: user-h...@spark.apache.org 




Re: Kafka-HDFS to store as Parquet format

2014-10-07 Thread Buntu Dev
Thanks for the input Soumitra.
