Now we are storing data directly from Kafka to Parquet.
We are currently using Camus and wanted to know how you went about storing
to Parquet?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-kafka-spark-Parquet-tp10037p10441.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi All,
Currently we are reading (multiple) topics from Apache Kafka and storing
them in HBase (multiple tables) using Twitter Storm (1 tuple is stored in 4
different tables), but we are facing some performance issues with HBase, so
we are replacing *HBase* with *Parquet* files and *Storm* with *Spark
Streaming*.
1. You can put multiple Kafka topics in the same Kafka input stream. See
the example KafkaWordCount:
https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala
However, they will all be read into the same stream.
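As a sketch of point 1, in the Spark 1.x API the topics argument to KafkaUtils.createStream is a Map from topic name to the number of consumer threads, so several topics can share one input stream. The topic names below are hypothetical, for illustration only:

```scala
// Hypothetical topic names ("tweets", "hashtags") for illustration.
// KafkaUtils.createStream (Spark 1.x) takes a Map of
// topic name -> number of consumer threads, so multiple topics
// can be consumed through a single input stream.
val topics: Map[String, Int] = Map("tweets" -> 1, "hashtags" -> 1)

// Sketch only; ssc, zkQuorum and the consumer group come from your setup:
// val kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "my-group", topics)
```

Since all topics arrive on the same stream, each record would then need to be separated downstream (e.g. by keying on topic or message content) before writing to per-table outputs.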
Hi,
To migrate data from *HBase* to *Parquet* we used the following query
through *Impala*:
INSERT INTO TABLE PARQUET_HASHTAGS (
  key, city_name, country_name, hashtag_date, hashtag_text,
  hashtag_source, hashtag_month, posted_time, hashtag_time,
  tweet_id, user_id, user_name, hashtag_year
)
val kafkaStream = KafkaUtils.createStream(...) // see the example in my
previous post
val transformedStream = kafkaStream.map ... // whatever transformation
you want to do
transformedStream.foreachRDD((rdd: RDD[...], time: Time) => {
  // save the rdd to parquet file, using time as the file name
})
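A minimal sketch of the naming scheme the comment above suggests: deriving the output path from the batch time (Spark Streaming's Time wraps milliseconds). The helper name and base directory here are assumptions for illustration, not Spark API:

```scala
// Hypothetical helper: build a per-batch output file path from the batch
// time in milliseconds, e.g. to pass to the Parquet save call inside
// foreachRDD so each batch lands in its own file.
def parquetPath(baseDir: String, batchTimeMs: Long): String =
  s"$baseDir/parquet-$batchTimeMs"
```

Inside foreachRDD this would be called as something like `parquetPath("/data/tweets", time.milliseconds)`, giving one Parquet file per batch interval.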
We don't have support for partitioned parquet yet. There is a JIRA here:
https://issues.apache.org/jira/browse/SPARK-2406
On Thu, Jul 17, 2014 at 5:00 PM, Tathagata Das tathagata.das1...@gmail.com
wrote: