val kafkaStream = KafkaUtils.createStream(... ) // see the example in my previous post
val transformedStream = kafkaStream.map(...)  // whatever transformation you want to do

transformedStream.foreachRDD((rdd: RDD[...], time: Time) => {
  // save the RDD as a parquet file, using time as the file name -- see the other link I sent on how to do it
  // every batch of data will create a new parquet file
})

Maybe Michael (cc'ed) will be able to give more insights about the parquet stuff. (A rough end-to-end sketch is also appended at the bottom of this message, after the quoted thread.)

TD

On Thu, Jul 17, 2014 at 3:59 AM, Mahebub Sayyed <mahebub...@gmail.com> wrote:

> Hi,
>
> To migrate data from *HBase* to *Parquet* we used the following query through *Impala*:
>
> INSERT INTO table PARQUET_HASHTAGS(
>   key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source,
>   hashtag_month, posted_time, hashtag_time, tweet_id, user_id, user_name,
>   hashtag_year
> ) *partition(year, month, day)*
> SELECT key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source,
>   hashtag_month, posted_time, hashtag_time, tweet_id, user_id, user_name, hashtag_year,
>   cast(hashtag_year as int), cast(hashtag_month as int), cast(hashtag_date as int)
> FROM HASHTAGS
> WHERE hashtag_year='2014' AND hashtag_month='04' AND hashtag_date='01'
> ORDER BY key, hashtag_year, hashtag_month, hashtag_date
> LIMIT 10000000 OFFSET 0;
>
> Using the above query we successfully migrated from HBase to Parquet files with proper partitions.
>
> Now we are storing data directly from *Kafka* to *Parquet*.
>
> *How is it possible to create partitions while storing data directly from Kafka to Parquet files?*
> *(like the partitions created by the above query)*
>
>
> On Thu, Jul 17, 2014 at 12:35 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>
>> 1. You can put multiple Kafka topics in the same Kafka input stream. See the example KafkaWordCount
>> <https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala>.
>> However, they will all be read through a single receiver (though with multiple threads, one per topic).
>> To parallelize the reads (for higher throughput), you can create multiple Kafka input streams and split
>> the topics appropriately between them.
>>
>> 2. You can easily read and write Parquet files in Spark. Any RDD (generated through DStreams in
>> Spark Streaming, or otherwise) can be converted to a SchemaRDD and then saved in the Parquet format
>> with rdd.saveAsParquetFile. See the Spark SQL guide
>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files> for more details.
>> So if you want to write the same dataset (as RDDs) to two different Parquet files, you just call
>> saveAsParquetFile twice (on the same or transformed versions of the RDD), as shown in the guide.
>>
>> Hope this helps!
>>
>> TD
>>
>>
>> On Thu, Jul 17, 2014 at 2:19 AM, Mahebub Sayyed <mahebub...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Currently we are reading (multiple) topics from Apache Kafka and storing them in HBase (multiple
>>> tables) using Twitter Storm (one tuple is stored in 4 different tables),
>>> but we are facing some performance issues with HBase,
>>> so we are replacing *HBase* with *Parquet* files and *Storm* with *Apache Spark*.
>>>
>>> Difficulties:
>>> 1. How to read multiple topics from Kafka using Spark?
>>> 2. One tuple belongs to multiple tables. How to write one topic to multiple Parquet files with
>>> proper partitioning using Spark?
>>>
>>> Please help me.
>>> Thanks in advance.
>>>
>>> --
>>> *Regards,*
>>>
>>> *Mahebub *
>>
>>
>
>
> --
> *Regards,*
> *Mahebub Sayyed*
>
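
For reference, a minimal end-to-end sketch of the approach described above (Kafka input stream -> map -> foreachRDD -> Parquet files written into Hive/Impala-style partition directories), assuming Spark 1.0-era APIs (KafkaUtils.createStream, SQLContext.createSchemaRDD, saveAsParquetFile). The topic names, ZooKeeper address, Hashtag fields, tab-separated parsing, batch interval, and output path below are all placeholder assumptions, not part of the thread:

// Minimal sketch, Spark 1.0.x-era APIs; needs spark-streaming, spark-streaming-kafka and spark-sql.
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical record type -- replace with the real hashtag/tweet fields.
case class Hashtag(key: String, hashtagText: String, postedTime: String,
                   year: Int, month: Int, day: Int)

object KafkaToPartitionedParquet {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("KafkaToPartitionedParquet")
    val ssc = new StreamingContext(conf, Seconds(60))            // assumed 60-second batches

    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.createSchemaRDD                            // implicit RDD[Hashtag] -> SchemaRDD

    // One receiver reading several topics (topic -> receiver threads); create several such streams
    // and split the topics between them if a single receiver is not enough, as TD suggests.
    val topics = Map("hashtags" -> 1, "tweets" -> 1)             // assumed topic names
    val kafkaStream = KafkaUtils.createStream(ssc, "zk1:2181", "parquet-writer", topics)

    // Parse each Kafka message value into a structured record; tab-separated input is an assumption.
    val hashtags = kafkaStream.map(_._2).map { line =>
      val f = line.split("\t")
      Hashtag(f(0), f(1), f(2), f(3).toInt, f(4).toInt, f(5).toInt)
    }

    hashtags.foreachRDD((rdd: RDD[Hashtag], time: Time) => {
      if (rdd.partitions.length > 0) {                           // skip batches with no received blocks
        // Derive the partition from the batch time (an assumption -- you could instead group the
        // RDD by each record's own year/month/day and write one file per group).
        val cal = java.util.Calendar.getInstance()
        cal.setTimeInMillis(time.milliseconds)
        val (y, m, d) = (cal.get(java.util.Calendar.YEAR),
                         cal.get(java.util.Calendar.MONTH) + 1,
                         cal.get(java.util.Calendar.DAY_OF_MONTH))

        // Hive/Impala-style partition directory; every batch creates a new Parquet file under it.
        val path = f"hdfs:///warehouse/parquet_hashtags/year=$y%04d/month=$m%02d/day=$d%02d/" +
                   s"batch-${time.milliseconds}"
        rdd.saveAsParquetFile(path)
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that Impala will not pick up the new directories automatically: each new partition still has to be registered once (e.g. ALTER TABLE parquet_hashtags ADD PARTITION (year=..., month=..., day=...) followed by a REFRESH of the table), similar to what you already do after the Impala INSERT above.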