val kafkaStream = KafkaUtils.createStream(... ) // see the example in my previous post
val transformedStream = kafkaStream.map(...)  // whatever transformation you want to do

transformedStream.foreachRDD((rdd: RDD[...], time: Time) => {
  // save the RDD as a parquet file, using time as the file name -- see the other link I sent on how to do it
  // every batch of data will create a new parquet file
})

Maybe Michael (cc'ed) will be able to give more insights about the parquet stuff. (A rough end-to-end sketch is also appended at the bottom of this message, after the quoted thread.)

TD

On Thu, Jul 17, 2014 at 3:59 AM, Mahebub Sayyed <mahebub...@gmail.com> wrote:

> Hi,
>
> To migrate data from *HBase* to *Parquet* we used the following query through *Impala*:
>
> INSERT INTO table PARQUET_HASHTAGS(
>   key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source,
>   hashtag_month, posted_time, hashtag_time, tweet_id, user_id, user_name,
>   hashtag_year
> ) *partition(year, month, day)*
> SELECT key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source,
>   hashtag_month, posted_time, hashtag_time, tweet_id, user_id, user_name, hashtag_year,
>   cast(hashtag_year as int), cast(hashtag_month as int), cast(hashtag_date as int)
> FROM HASHTAGS
> WHERE hashtag_year='2014' AND hashtag_month='04' AND hashtag_date='01'
> ORDER BY key, hashtag_year, hashtag_month, hashtag_date
> LIMIT 10000000 OFFSET 0;
>
> Using the above query we successfully migrated from HBase to Parquet files with proper partitions.
>
> Now we are storing data directly from *Kafka* to *Parquet*.
>
> *How is it possible to create partitions while storing data directly from Kafka to Parquet files?*
> *(like the partitions created by the above query)*
>
>
> On Thu, Jul 17, 2014 at 12:35 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>
>> 1. You can put multiple Kafka topics in the same Kafka input stream. See the example KafkaWordCount
>> <https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala>.
>> However, they will all be read through a single receiver (though with multiple threads, one per topic).
>> To parallelize the reads (for higher throughput), you can create multiple Kafka input streams and split
>> the topics appropriately between them.
>>
>> 2. You can easily read and write Parquet files in Spark. Any RDD (generated through DStreams in
>> Spark Streaming, or otherwise) can be converted to a SchemaRDD and then saved in the Parquet format
>> with rdd.saveAsParquetFile. See the Spark SQL guide
>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files> for more details.
>> So if you want to write the same dataset (as RDDs) to two different Parquet files, you just call
>> saveAsParquetFile twice (on the same or transformed versions of the RDD), as shown in the guide.
>>
>> Hope this helps!
>>
>> TD
>>
>>
>> On Thu, Jul 17, 2014 at 2:19 AM, Mahebub Sayyed <mahebub...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Currently we are reading (multiple) topics from Apache Kafka and storing them in HBase (multiple
>>> tables) using Twitter Storm (one tuple is stored in 4 different tables),
>>> but we are facing some performance issues with HBase,
>>> so we are replacing *HBase* with *Parquet* files and *Storm* with *Apache Spark*.
>>>
>>> Difficulties:
>>> 1. How to read multiple topics from Kafka using Spark?
>>> 2. One tuple belongs to multiple tables. How to write one topic to multiple Parquet files with
>>> proper partitioning using Spark?
>>>
>>> Please help me.
>>> Thanks in advance.
>>>
>>> --
>>> *Regards,*
>>>
>>> *Mahebub *
>>
>>
>
>
> --
> *Regards,*
> *Mahebub Sayyed*
>
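
For reference, a minimal end-to-end sketch of the approach described above (Kafka input stream -> map -> foreachRDD -> Parquet files written into Hive/Impala-style partition directories), assuming Spark 1.0-era APIs (KafkaUtils.createStream, SQLContext.createSchemaRDD, saveAsParquetFile). The topic names, ZooKeeper address, Hashtag fields, tab-separated parsing, batch interval, and output path below are all placeholder assumptions, not part of the thread:

// Minimal sketch, Spark 1.0.x-era APIs; needs spark-streaming, spark-streaming-kafka and spark-sql.
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical record type -- replace with the real hashtag/tweet fields.
case class Hashtag(key: String, hashtagText: String, postedTime: String,
                   year: Int, month: Int, day: Int)

object KafkaToPartitionedParquet {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("KafkaToPartitionedParquet")
    val ssc = new StreamingContext(conf, Seconds(60))            // assumed 60-second batches

    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.createSchemaRDD                            // implicit RDD[Hashtag] -> SchemaRDD

    // One receiver reading several topics (topic -> receiver threads); create several such streams
    // and split the topics between them if a single receiver is not enough, as TD suggests.
    val topics = Map("hashtags" -> 1, "tweets" -> 1)             // assumed topic names
    val kafkaStream = KafkaUtils.createStream(ssc, "zk1:2181", "parquet-writer", topics)

    // Parse each Kafka message value into a structured record; tab-separated input is an assumption.
    val hashtags = kafkaStream.map(_._2).map { line =>
      val f = line.split("\t")
      Hashtag(f(0), f(1), f(2), f(3).toInt, f(4).toInt, f(5).toInt)
    }

    hashtags.foreachRDD((rdd: RDD[Hashtag], time: Time) => {
      if (rdd.partitions.length > 0) {                           // skip batches with no received blocks
        // Derive the partition from the batch time (an assumption -- you could instead group the
        // RDD by each record's own year/month/day and write one file per group).
        val cal = java.util.Calendar.getInstance()
        cal.setTimeInMillis(time.milliseconds)
        val (y, m, d) = (cal.get(java.util.Calendar.YEAR),
                         cal.get(java.util.Calendar.MONTH) + 1,
                         cal.get(java.util.Calendar.DAY_OF_MONTH))

        // Hive/Impala-style partition directory; every batch creates a new Parquet file under it.
        val path = f"hdfs:///warehouse/parquet_hashtags/year=$y%04d/month=$m%02d/day=$d%02d/" +
                   s"batch-${time.milliseconds}"
        rdd.saveAsParquetFile(path)
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that Impala will not pick up the new directories automatically: each new partition still has to be registered once (e.g. ALTER TABLE parquet_hashtags ADD PARTITION (year=..., month=..., day=...) followed by a REFRESH of the table), similar to what you already do after the Impala INSERT above.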