Hi Dibyendu,

That would be great. One of the biggest drawbacks of kafka-utils, as well as of your implementation, is that I am unable to scale out processing. I am relatively new to Spark and Spark Streaming - from what I have read and what I observe with my deployment, an RDD created on one receiver is processed by at most two nodes in my cluster (most likely because the default replication factor is 2 and Spark schedules processing close to where the data is). I tried rdd.replicate() to no avail.
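For context, the pattern being discussed - one receiver per Kafka partition, a union of the resulting DStreams, and a repartition so downstream stages can use the whole cluster - can be sketched with Spark Streaming's Scala API. This is a minimal sketch, not code from the thread: `zkQuorum`, `group`, `"mytopic"`, and the partition counts are placeholder assumptions, and the receiver-based `KafkaUtils.createStream` stands in for whichever custom receiver is actually used.

```scala
// Sketch only: assumes Spark Streaming and the receiver-based
// spark-streaming-kafka artifact are on the classpath. All connection
// values below are placeholders, not values from this thread.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object UnionedKafkaStreams {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UnionedKafkaStreams")
    val ssc  = new StreamingContext(conf, Seconds(2))

    val zkQuorum      = "zk-host:2181" // placeholder ZooKeeper quorum
    val group         = "my-consumer"  // placeholder consumer group
    val numPartitions = 4              // placeholder: partitions of the topic

    // One receiver per Kafka partition, as proposed in the thread.
    // Each createStream call allocates its own receiver on an executor.
    val streams = (1 to numPartitions).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, group, Map("mytopic" -> 1))
    }

    // Union the per-receiver DStreams into one, then repartition so
    // subsequent processing is spread across the cluster instead of
    // staying on the nodes that hold the received blocks.
    val unioned = ssc.union(streams).repartition(16)
    unioned.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The `repartition` after the union is what addresses the "processed by at most 2 nodes" observation above: it shuffles the received data across executors rather than relying on block replication for parallelism.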
Would Chris's and your proposal to have a union of DStreams for all these receivers still allow scaling out the subsequent processing?

Thanks,
Bharat

On Tue, Aug 26, 2014 at 10:59 PM, Dibyendu Bhattacharya <dibyendu.bhattach...@gmail.com> wrote:

> Thanks Chris and Bharat for your inputs. I agree, running multiple
> receivers/dstreams is desirable for scalability and fault tolerance, and
> this is easily doable. In the present KafkaReceiver I am creating as many
> threads as there are Kafka topic partitions, but I can definitely create
> multiple KafkaReceivers, one for every partition. As Chris mentioned, in
> this case I then need to take the union of the DStreams for all these
> receivers. I will try this out and let you know.
>
> Dib
>
> On Wed, Aug 27, 2014 at 9:10 AM, Chris Fregly <ch...@fregly.com> wrote:
>
>> great work, Dibyendu. looks like this would be a popular contribution.
>>
>> expanding on bharat's question a bit:
>>
>> what happens if you submit multiple receivers to the cluster by creating
>> and unioning multiple DStreams, as in the kinesis example here:
>>
>> https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L123
>>
>> for more context, the kinesis implementation above uses the Kinesis
>> Client Library (KCL) to automatically assign - and load balance - stream
>> shards among all KCL threads from all receivers (potentially coming and
>> going as nodes die) on all executors/nodes, using DynamoDB as the
>> association data store.
>>
>> ZooKeeper would be used for your Kafka consumers, of course. and
>> ZooKeeper watches to handle the ephemeral nodes. and I see you're using
>> Curator, which makes things easier.
>>
>> as bharat suggested, running multiple receivers/dstreams may be desirable
>> from a scalability and fault tolerance standpoint.
>> is this type of load balancing possible among your different Kafka
>> consumers running in different ephemeral JVMs?
>>
>> and isn't it fun proposing a popular piece of code? the question
>> floodgates have opened! haha. :)
>>
>> -chris
>>
>> On Tue, Aug 26, 2014 at 7:29 AM, Dibyendu Bhattacharya <
>> dibyendu.bhattach...@gmail.com> wrote:
>>
>>> Hi Bharat,
>>>
>>> Thanks for your email. If the "Kafka Reader" worker process dies, it
>>> will be replaced by a different machine, and it will start consuming
>>> from the offset where it left off (for each partition). The same would
>>> happen even if I had an individual receiver for every partition.
>>>
>>> Regards,
>>> Dibyendu
>>>
>>> On Tue, Aug 26, 2014 at 5:43 AM, bharatvenkat <bvenkat.sp...@gmail.com>
>>> wrote:
>>>
>>>> I like this consumer for what it promises - better control over
>>>> offsets and recovery from failures. If I understand this right, it
>>>> still uses a single worker process to read from Kafka (one thread per
>>>> partition) - is there a way to specify multiple worker processes (on
>>>> different machines) to read from Kafka? Maybe one worker process for
>>>> each partition?
>>>>
>>>> If there is no such option, what happens when the single machine
>>>> hosting the "Kafka Reader" worker process dies and is replaced by a
>>>> different machine (like in the cloud)?
>>>>
>>>> Thanks,
>>>> Bharat
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p12788.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org