We don't have support for partitioned parquet yet. There is a JIRA here: https://issues.apache.org/jira/browse/SPARK-2406
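Until that lands, one workaround along the lines TD sketches below is to build the Impala/Hive-style partition directories yourself inside foreachRDD and save each batch's Parquet output under them. Here is a rough, untested sketch (Spark 1.0-style APIs; the Hashtag case class, field names and paths are placeholders you would adapt to your PARQUET_HASHTAGS schema):

import java.util.Calendar

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.Time

// Placeholder record type: adapt the fields to your PARQUET_HASHTAGS columns.
case class Hashtag(key: String, hashtagText: String, userId: String, postedTime: Long)

object PartitionedParquetSink {

  // Saves one batch under an Impala/Hive-style partition directory derived from the
  // batch time, e.g. <base>/hashtag_year=2014/hashtag_month=07/hashtag_date=17/batch-1405555200000
  // Note: the partition values come from the *batch* time here; if your records carry
  // their own dates you would need to split the RDD by those values first.
  def saveBatch(sqlContext: SQLContext, rdd: RDD[Hashtag], time: Time, baseDir: String): Unit = {
    import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD (Spark 1.0.x)

    val cal = Calendar.getInstance()
    cal.setTimeInMillis(time.milliseconds)
    val path = "%s/hashtag_year=%d/hashtag_month=%02d/hashtag_date=%02d/batch-%d".format(
      baseDir,
      cal.get(Calendar.YEAR),
      cal.get(Calendar.MONTH) + 1,
      cal.get(Calendar.DAY_OF_MONTH),
      time.milliseconds)

    rdd.saveAsParquetFile(path)         // one new Parquet directory per batch
  }
}

Because the directories follow the key=value naming convention, Impala/Hive should be able to treat them as partitions once you register them in the table metadata (e.g. with ALTER TABLE ... ADD PARTITION).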
On Thu, Jul 17, 2014 at 5:00 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> val kafkaStream = KafkaUtils.createStream(...)  // see the example in my previous post
>
> val transformedStream = kafkaStream.map ...  // whatever transformation you want to do
>
> transformedStream.foreachRDD((rdd: RDD[...], time: Time) => {
>   // save the rdd to a parquet file, using the time as the file name; see the
>   // other link I sent on how to do it
>   // every batch of data will create a new parquet file
> })
>
> Maybe Michael (cc'ed) will be able to give more insight about the Parquet stuff.
>
> TD
>
> On Thu, Jul 17, 2014 at 3:59 AM, Mahebub Sayyed <mahebub...@gmail.com> wrote:
>
>> Hi,
>>
>> To migrate data from *HBase* to *Parquet* we used the following query through *Impala*:
>>
>> INSERT INTO TABLE PARQUET_HASHTAGS (
>>   key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source,
>>   hashtag_month, posted_time, hashtag_time, tweet_id, user_id, user_name,
>>   hashtag_year
>> ) *partition(year, month, day)*
>> SELECT key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source,
>>   hashtag_month, posted_time, hashtag_time, tweet_id, user_id, user_name, hashtag_year,
>>   cast(hashtag_year as int), cast(hashtag_month as int), cast(hashtag_date as int)
>> FROM HASHTAGS
>> WHERE hashtag_year='2014' AND hashtag_month='04' AND hashtag_date='01'
>> ORDER BY key, hashtag_year, hashtag_month, hashtag_date
>> LIMIT 10000000 OFFSET 0;
>>
>> Using the above query we successfully migrated from HBase to Parquet files with proper partitions.
>>
>> Now we are storing data directly from *Kafka* to *Parquet*.
>>
>> *How is it possible to create partitions while storing data directly from Kafka to Parquet files?*
>> *(like the partitions created by the above query)*
>>
>> On Thu, Jul 17, 2014 at 12:35 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>>> 1. You can put multiple Kafka topics in the same Kafka input stream. See the example KafkaWordCount
>>> <https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala>.
>>> However, they will all be read through a single receiver (though with multiple
>>> threads, one per topic). To parallelize the reads (to increase throughput), you
>>> can create multiple Kafka input streams and split the topics appropriately
>>> between them.
>>>
>>> 2. You can easily read and write Parquet files in Spark. Any RDD (generated
>>> through DStreams in Spark Streaming, or otherwise) can be converted to a
>>> SchemaRDD and then saved in the Parquet format with rdd.saveAsParquetFile.
>>> See the Spark SQL guide
>>> <http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files>
>>> for more details. So if you want to write the same dataset (as RDDs) to two
>>> different Parquet files, you just have to call saveAsParquetFile twice (on the
>>> same or transformed versions of the RDD), as shown in the guide.
>>>
>>> Hope this helps!
>>>
>>> TD
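For what it's worth, here is a rough, untested sketch of how those two points fit together end to end: multiple Kafka input streams split across receivers and unioned, with each batch converted to a SchemaRDD and written out as Parquet. The topic names, ZooKeeper quorum, paths, Hashtag case class and the parsing logic are all placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder record type; adapt the fields and the parsing below to your messages.
case class Hashtag(key: String, hashtagText: String, userId: String, postedTime: Long)

object KafkaToParquet {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaToParquet"), Seconds(60))

    // Point 1: several Kafka input streams (one receiver each), each taking a subset
    // of the topics, unioned back into a single DStream.
    val topicGroups = Seq(Map("hashtags" -> 1), Map("tweets" -> 1))   // placeholder topics
    val kafkaStreams = topicGroups.map(topics =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "kafka-to-parquet", topics))
    val records = ssc.union(kafkaStreams).map { case (_, line) =>
      val f = line.split(",")                                         // placeholder parsing
      Hashtag(f(0), f(1), f(2), f(3).toLong)
    }

    // Point 2: every batch becomes a SchemaRDD (via the implicit conversion from an
    // RDD of a case class) and is written out as a new Parquet file named by batch time.
    records.foreachRDD((rdd: RDD[Hashtag], time: Time) => {
      val sqlContext = new SQLContext(rdd.sparkContext)
      import sqlContext.createSchemaRDD
      rdd.saveAsParquetFile(s"hdfs:///data/hashtags/batch-${time.milliseconds}")
    })

    ssc.start()
    ssc.awaitTermination()
  }
}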
>>> On Thu, Jul 17, 2014 at 2:19 AM, Mahebub Sayyed <mahebub...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Currently we are reading (multiple) topics from Apache Kafka and storing them
>>>> in HBase (multiple tables) using Twitter Storm (1 tuple is stored in 4 different
>>>> tables), but we are facing some performance issues with HBase, so we are
>>>> replacing *HBase* with *Parquet* files and *Storm* with *Apache Spark*.
>>>>
>>>> Difficulties:
>>>> 1. How to read multiple topics from Kafka using Spark?
>>>> 2. One tuple belongs to multiple tables; how to write one topic to multiple
>>>> Parquet files with proper partitioning using Spark?
>>>>
>>>> Please help me.
>>>> Thanks in advance.
>>>>
>>>> --
>>>> *Regards,*
>>>>
>>>> *Mahebub*
>>
>> --
>> *Regards,*
>> *Mahebub Sayyed*
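And for difficulty 2 above (one tuple feeding several tables): as TD notes, you can simply call saveAsParquetFile more than once, on projections of the same batch RDD. A minimal, untested fragment, reusing the Hashtag case class and the unified records stream from the sketch above, with a made-up UserActivity case class as the second "table":

// Second, narrower "table" derived from the same tuples (placeholder).
case class UserActivity(userId: String, postedTime: Long)

// records is the unified DStream[Hashtag] from the sketch above.
records.foreachRDD((rdd: RDD[Hashtag], time: Time) => {
  val sqlContext = new SQLContext(rdd.sparkContext)
  import sqlContext.createSchemaRDD

  // One save per logical table, each into its own directory tree.
  rdd.saveAsParquetFile(s"hdfs:///data/hashtags/batch-${time.milliseconds}")
  rdd.map(h => UserActivity(h.userId, h.postedTime))
     .saveAsParquetFile(s"hdfs:///data/user_activity/batch-${time.milliseconds}")
})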