1. You can put multiple Kafka topics in the same Kafka input stream. See the example KafkaWordCount <https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala>. However, they will all be read through a single receiver (though with multiple threads, one per topic). To parallelize the reads (and increase throughput), you can create multiple Kafka input streams and split the topics appropriately between them.
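For illustration, here is a minimal sketch of that pattern against the Spark 1.x KafkaUtils.createStream API; the Zookeeper quorum, consumer group, topic names, and number of streams below are placeholders, not something from your setup.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MultiTopicKafka {
  def main(args: Array[String]): Unit = {
    // placeholder Zookeeper quorum and consumer group
    val zkQuorum = "zkhost1:2181,zkhost2:2181"
    val group = "spark-consumer-group"
    // two topics, each read by 2 threads within a single receiver
    val topicMap = Map("topicA" -> 2, "topicB" -> 2)

    val conf = new SparkConf().setAppName("MultiTopicKafka")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Create several input streams (one receiver each) and union them,
    // so the topics are consumed by multiple receivers in parallel.
    val numStreams = 4
    val kafkaStreams = (1 to numStreams).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    }
    val lines = ssc.union(kafkaStreams)

    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Each call to createStream sets up its own receiver, so the unioned DStream is fed by numStreams receivers running in parallel rather than a single one.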
2. You can easily read and write Parquet files in Spark. Any RDD (whether generated through DStreams in Spark Streaming or otherwise) can be converted to a SchemaRDD and then saved in the Parquet format with rdd.saveAsParquetFile. See the Spark SQL guide <http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files> for more details. So if you want to write the same dataset (as RDDs) to two different Parquet files, you just have to call saveAsParquetFile twice (on the same or transformed versions of the RDD), as shown in the guide and in the sketch at the end of this message.

Hope this helps!

TD

On Thu, Jul 17, 2014 at 2:19 AM, Mahebub Sayyed <mahebub...@gmail.com> wrote:
> Hi All,
>
> Currently we are reading (multiple) topics from Apache Kafka and storing
> them in HBase (multiple tables) using Twitter Storm (1 tuple is stored in 4
> different tables).
> However, we are facing some performance issues with HBase,
> so we are replacing *HBase* with *Parquet* files and *Storm* with *Apache
> Spark*.
>
> Difficulties:
> 1. How do we read multiple topics from Kafka using Spark?
> 2. One tuple belongs to multiple tables. How do we write one topic to
> multiple Parquet files with proper partitioning using Spark?
>
> Please help me.
> Thanks in advance.
>
> --
> *Regards,*
>
> *Mahebub*
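As a rough sketch for point 2, using the Spark 1.0 Spark SQL API inside foreachRDD; the Record case class, the comma-split parsing, and the HDFS output paths are placeholders for whatever your tuples and tables actually look like.

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

// hypothetical schema for one Kafka message
case class Record(key: String, value: String)

def writeToParquet(lines: DStream[String]): Unit = {
  lines.foreachRDD { (rdd, time) =>
    val sqlContext = new SQLContext(rdd.sparkContext)
    import sqlContext.createSchemaRDD  // implicit: RDD of case classes -> SchemaRDD

    // hypothetical parsing of each message into the case class
    val records = rdd.map { line =>
      val fields = line.split(",")
      Record(fields(0), fields(1))
    }

    // call saveAsParquetFile twice to write the same dataset
    // (or transformed versions of it) to two different Parquet outputs
    records.saveAsParquetFile(s"hdfs:///data/table1/batch-${time.milliseconds}")
    records.filter(_.value.nonEmpty)
      .saveAsParquetFile(s"hdfs:///data/table2/batch-${time.milliseconds}")
  }
}

Each batch is written under its own directory (keyed by the batch time) because saveAsParquetFile does not append to an existing Parquet output; you can partition the output paths however suits your tables.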