We don't have support for partitioned parquet yet. There is a JIRA here: https://issues.apache.org/jira/browse/SPARK-2406
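Until that lands, one workaround along the lines TD sketches below is to build the Impala/Hive-style partition directories yourself inside foreachRDD and save each batch's Parquet output under them. Here is a rough, untested sketch (Spark 1.0-style APIs; the Hashtag case class, field names and paths are placeholders you would adapt to your PARQUET_HASHTAGS schema):

import java.util.Calendar

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.Time

// Placeholder record type: adapt the fields to your PARQUET_HASHTAGS columns.
case class Hashtag(key: String, hashtagText: String, userId: String, postedTime: Long)

object PartitionedParquetSink {

  // Saves one batch under an Impala/Hive-style partition directory derived from the
  // batch time, e.g. <base>/hashtag_year=2014/hashtag_month=07/hashtag_date=17/batch-1405555200000
  // Note: the partition values come from the *batch* time here; if your records carry
  // their own dates you would need to split the RDD by those values first.
  def saveBatch(sqlContext: SQLContext, rdd: RDD[Hashtag], time: Time, baseDir: String): Unit = {
    import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD (Spark 1.0.x)

    val cal = Calendar.getInstance()
    cal.setTimeInMillis(time.milliseconds)
    val path = "%s/hashtag_year=%d/hashtag_month=%02d/hashtag_date=%02d/batch-%d".format(
      baseDir,
      cal.get(Calendar.YEAR),
      cal.get(Calendar.MONTH) + 1,
      cal.get(Calendar.DAY_OF_MONTH),
      time.milliseconds)

    rdd.saveAsParquetFile(path)         // one new Parquet directory per batch
  }
}

Because the directories follow the key=value naming convention, Impala/Hive should be able to treat them as partitions once you register them in the table metadata (e.g. with ALTER TABLE ... ADD PARTITION).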
On Thu, Jul 17, 2014 at 5:00 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> val kafkaStream = KafkaUtils.createStream(...)  // see the example in my previous post
>
> val transformedStream = kafkaStream.map ...  // whatever transformation you want to do
>
> transformedStream.foreachRDD((rdd: RDD[...], time: Time) => {
>   // save the rdd to a parquet file, using the time as the file name; see the
>   // other link I sent on how to do it
>   // every batch of data will create a new parquet file
> })
>
> Maybe Michael (cc'ed) will be able to give more insight about the Parquet stuff.
>
> TD
>
> On Thu, Jul 17, 2014 at 3:59 AM, Mahebub Sayyed <mahebub...@gmail.com> wrote:
>
>> Hi,
>>
>> To migrate data from *HBase* to *Parquet* we used the following query through *Impala*:
>>
>> INSERT INTO TABLE PARQUET_HASHTAGS (
>>   key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source,
>>   hashtag_month, posted_time, hashtag_time, tweet_id, user_id, user_name,
>>   hashtag_year
>> ) *partition(year, month, day)*
>> SELECT key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source,
>>   hashtag_month, posted_time, hashtag_time, tweet_id, user_id, user_name, hashtag_year,
>>   cast(hashtag_year as int), cast(hashtag_month as int), cast(hashtag_date as int)
>> FROM HASHTAGS
>> WHERE hashtag_year='2014' AND hashtag_month='04' AND hashtag_date='01'
>> ORDER BY key, hashtag_year, hashtag_month, hashtag_date
>> LIMIT 10000000 OFFSET 0;
>>
>> Using the above query we successfully migrated from HBase to Parquet files with proper partitions.
>>
>> Now we are storing data directly from *Kafka* to *Parquet*.
>>
>> *How is it possible to create partitions while storing data directly from Kafka to Parquet files?*
>> *(like the partitions created by the above query)*
>>
>> On Thu, Jul 17, 2014 at 12:35 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>>> 1. You can put multiple Kafka topics in the same Kafka input stream. See the example KafkaWordCount
>>> <https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala>.
>>> However, they will all be read through a single receiver (though with multiple
>>> threads, one per topic). To parallelize the reads (to increase throughput), you
>>> can create multiple Kafka input streams and split the topics appropriately
>>> between them.
>>>
>>> 2. You can easily read and write Parquet files in Spark. Any RDD (generated
>>> through DStreams in Spark Streaming, or otherwise) can be converted to a
>>> SchemaRDD and then saved in the Parquet format with rdd.saveAsParquetFile.
>>> See the Spark SQL guide
>>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files>
>>> for more details. So if you want to write the same dataset (as RDDs) to two
>>> different Parquet files, you just have to call saveAsParquetFile twice (on the
>>> same or transformed versions of the RDD), as shown in the guide.
>>>
>>> Hope this helps!
>>>
>>> TD
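For what it's worth, here is a rough, untested sketch of how those two points fit together end to end: multiple Kafka input streams split across receivers and unioned, with each batch converted to a SchemaRDD and written out as Parquet. The topic names, ZooKeeper quorum, paths, Hashtag case class and the parsing logic are all placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder record type; adapt the fields and the parsing below to your messages.
case class Hashtag(key: String, hashtagText: String, userId: String, postedTime: Long)

object KafkaToParquet {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaToParquet"), Seconds(60))

    // Point 1: several Kafka input streams (one receiver each), each taking a subset
    // of the topics, unioned back into a single DStream.
    val topicGroups = Seq(Map("hashtags" -> 1), Map("tweets" -> 1))   // placeholder topics
    val kafkaStreams = topicGroups.map(topics =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "kafka-to-parquet", topics))
    val records = ssc.union(kafkaStreams).map { case (_, line) =>
      val f = line.split(",")                                         // placeholder parsing
      Hashtag(f(0), f(1), f(2), f(3).toLong)
    }

    // Point 2: every batch becomes a SchemaRDD (via the implicit conversion from an
    // RDD of a case class) and is written out as a new Parquet file named by batch time.
    records.foreachRDD((rdd: RDD[Hashtag], time: Time) => {
      val sqlContext = new SQLContext(rdd.sparkContext)
      import sqlContext.createSchemaRDD
      rdd.saveAsParquetFile(s"hdfs:///data/hashtags/batch-${time.milliseconds}")
    })

    ssc.start()
    ssc.awaitTermination()
  }
}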
>>> On Thu, Jul 17, 2014 at 2:19 AM, Mahebub Sayyed <mahebub...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Currently we are reading (multiple) topics from Apache Kafka and storing them
>>>> in HBase (multiple tables) using Twitter Storm (1 tuple is stored in 4 different
>>>> tables), but we are facing some performance issues with HBase, so we are
>>>> replacing *HBase* with *Parquet* files and *Storm* with *Apache Spark*.
>>>>
>>>> Difficulties:
>>>> 1. How to read multiple topics from Kafka using Spark?
>>>> 2. One tuple belongs to multiple tables; how to write one topic to multiple
>>>> Parquet files with proper partitioning using Spark?
>>>>
>>>> Please help me.
>>>> Thanks in advance.
>>>>
>>>> --
>>>> *Regards,*
>>>>
>>>> *Mahebub*
>>
>> --
>> *Regards,*
>> *Mahebub Sayyed*
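And for difficulty 2 above (one tuple feeding several tables): as TD notes, you can simply call saveAsParquetFile more than once, on projections of the same batch RDD. A minimal, untested fragment, reusing the Hashtag case class and the unified records stream from the sketch above, with a made-up UserActivity case class as the second "table":

// Second, narrower "table" derived from the same tuples (placeholder).
case class UserActivity(userId: String, postedTime: Long)

// records is the unified DStream[Hashtag] from the sketch above.
records.foreachRDD((rdd: RDD[Hashtag], time: Time) => {
  val sqlContext = new SQLContext(rdd.sparkContext)
  import sqlContext.createSchemaRDD

  // One save per logical table, each into its own directory tree.
  rdd.saveAsParquetFile(s"hdfs:///data/hashtags/batch-${time.milliseconds}")
  rdd.map(h => UserActivity(h.userId, h.postedTime))
     .saveAsParquetFile(s"hdfs:///data/user_activity/batch-${time.milliseconds}")
})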