[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15111243#comment-15111243 ]

Saisai Shao commented on SPARK-11045:
-------------------------------------

Hi [~dibbhatt], I'm afraid I can't agree with your comment about the current 
receiver-based Kafka connector. Spark Streaming currently uses the high-level 
API to fetch data; beyond calling the Kafka-provided API, Spark's own part is 
quite simple. If you're saying this approach is buggy and faulty, I think the 
bugs and faults are in the Kafka high-level API, not in the way Spark 
Streaming uses it. What you're doing is fixing things that should be fixed in 
Kafka, not in Spark. Yes, we could improve the current receiver-based 
approach, but building it up from the ground to fix issues that will also be 
fixed in Kafka would make it diverge from Kafka's way. Just my two cents :).

> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to 
> Apache Spark Project
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11045
>                 URL: https://issues.apache.org/jira/browse/SPARK-11045
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of contributing the Receiver-based 
> low-level Kafka consumer from spark-packages 
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to 
> the Apache Spark project.
> This Kafka consumer has been around for more than a year and has matured 
> over time. I see many adoptions of this package, and I receive positive 
> feedback that this consumer gives better performance and fault-tolerance 
> capabilities. 
> The primary intent of this JIRA is to give the community a better 
> alternative if they want to use the Receiver-based model. 
> If this consumer makes it into Spark Core, it will definitely see more 
> adoption and support from the community, and it will help the many who 
> still prefer the Receiver-based model of Kafka consumer. 
> I understand that the Direct Stream is the consumer that can give 
> exactly-once semantics and that uses the Kafka low-level API, which is good. 
> But the Direct Stream has concerns around recovering from a checkpoint 
> after a driver code change. Application developers need to manage their own 
> offsets, which is complex. And even when someone does manage their own 
> offsets, it limits the parallelism Spark Streaming can achieve: if you want 
> more parallelism and set spark.streaming.concurrentJobs higher than 1, you 
> can no longer rely on storing offsets externally, because you have no 
> control over which batch will run in which sequence. 
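> To make the offset-management burden concrete, here is a minimal sketch of 
> what the application itself has to do with the Direct Stream (Spark 1.3+ 
> API; assumes ssc, kafkaParams, and topics are already defined, and 
> saveOffsets is a hypothetical helper backed by the application's own store): 
>
>     import kafka.serializer.StringDecoder
>     import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
>
>     val stream = KafkaUtils.createDirectStream[String, String,
>       StringDecoder, StringDecoder](ssc, kafkaParams, topics)
>     stream.foreachRDD { rdd =>
>       // Each RDD produced by the Direct Stream carries its Kafka offsets.
>       val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>       // ... process rdd ...
>       saveOffsets(ranges) // hypothetical: persist to ZooKeeper/a DB yourself
>     }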
> Furthermore, the Direct Stream has higher latency, as it fetches messages 
> from Kafka during the RDD action. Also, the number of RDD partitions is 
> limited to the number of topic partitions, so unless your Kafka topic has 
> enough partitions, you have limited parallelism during RDD processing. 
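> (The usual workaround, at the cost of an extra shuffle, is an explicit 
> repartition of the stream; the partition count below is illustrative: 
>
>     // Redistribute records across more tasks than there are topic partitions.
>     val widened = stream.repartition(32)
>
> which trades shuffle overhead for processing parallelism.) 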
> Due to the above-mentioned concerns, many people who do not need 
> exactly-once semantics still prefer the Receiver-based model. Unfortunately, 
> when users fall back to the KafkaUtils.createStream approach, which uses 
> the Kafka high-level consumer, there are other issues around the 
> reliability of the Kafka high-level API. The Kafka high-level API is buggy 
> and has serious issues around consumer rebalance. Hence I do not think it 
> is correct to advise people to use KafkaUtils.createStream in production. 
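> For reference, the receiver-based call being discouraged here looks like 
> the following (the ZooKeeper quorum, group id, and topic map are 
> illustrative values): 
>
>     import org.apache.spark.storage.StorageLevel
>     import org.apache.spark.streaming.kafka.KafkaUtils
>
>     // High-level-consumer receiver: offsets live in ZooKeeper, and the
>     // rebalance behaviour comes from the Kafka high-level API itself.
>     val lines = KafkaUtils.createStream(ssc, "zk1:2181", "my-group",
>       Map("mytopic" -> 2), StorageLevel.MEMORY_AND_DISK_SER_2)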
> The better option at present is to use the consumer from spark-packages. It 
> uses the Kafka low-level consumer API, stores offsets in ZooKeeper, and can 
> recover from any failure. Below are a few highlights of this consumer:
> 1. It has an inbuilt PID controller for dynamic rate limiting.
> 2. In this consumer, rate limiting is done by controlling the size of the 
> blocks, i.e. the number of bytes pulled from Kafka, whereas in Spark rate 
> limiting is done by controlling the number of messages. The issue with 
> throttling by message count is that if message sizes vary, block sizes will 
> also vary. Say your Kafka topic has messages of different sizes, from 10 KB 
> to 500 KB: throttling by message count can never give a deterministic block 
> size, so there is no guarantee that memory back-pressure can really take 
> effect. (A simplified sketch of the difference follows this list.) 
> 3. This consumer uses the Kafka low-level API, which gives better 
> performance than the high-level API behind KafkaUtils.createStream.
> 4. This consumer can give an end-to-end no-data-loss channel when enabled 
> with the WAL (see the configuration note after the sketch below).
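> As a simplified illustration of point 2 (not the package's actual code), a 
> size-based throttle caps the bytes per block rather than the message count, 
> so block sizes stay deterministic even when message sizes vary: 
>
>     // Fill one block up to a byte budget instead of a message-count budget.
>     def fillBlock(msgs: Iterator[Array[Byte]],
>                   maxBlockBytes: Long): Seq[Array[Byte]] = {
>       val block = scala.collection.mutable.ArrayBuffer.empty[Array[Byte]]
>       var bytes = 0L
>       while (msgs.hasNext && bytes < maxBlockBytes) {
>         val m = msgs.next() // may overshoot the budget by at most one message
>         block += m
>         bytes += m.length
>       }
>       block.toSeq
>     }
>
> The WAL mentioned in point 4 is enabled with the standard Spark setting 
> spark.streaming.receiver.writeAheadLog.enable=true plus a fault-tolerant 
> checkpoint directory; no consumer-specific code is needed. 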
> By accepting this low-level Kafka consumer from spark-packages into the 
> Apache Spark project, we will give the community better options for Kafka 
> connectivity, for both the receiver-less and the Receiver-based model. 


