[jira] [Comment Edited] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Dibyendu Bhattacharya (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951926#comment-14951926 ]

Dibyendu Bhattacharya edited comment on SPARK-11045 at 10/11/15 4:47 AM:

I agree, Sean; this space is already getting complicated. My intention was not 
at all to make it more confusing.

What I see is that many customers are a little reluctant to use this consumer 
from spark-packages, thinking that it will get less support. Because it sits 
in spark-packages, many do not even consider using it for their use cases and 
instead use whatever Receiver based model is documented with Spark. For those 
who want to fall back to the Receiver based model, the out-of-the-box Spark 
Receivers do not give them a better choice, and many customers do not know 
that a better choice exists in spark-packages.





> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to 
> Apache Spark Project
> 
>
>                 Key: SPARK-11045
>                 URL: https://issues.apache.org/jira/browse/SPARK-11045
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of contributing the Receiver based Low 
> Level Kafka Consumer from spark-packages 
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to 
> the Apache Spark Project.
> This Kafka consumer has been around for more than a year and has matured 
> over time. I see many adoptions of this package, and I receive positive 
> feedback that this consumer gives better performance and fault-tolerance 
> capabilities.
> The primary intent of this JIRA is to give the community a better 
> alternative if they want to use the Receiver based model.
> If this consumer makes it into Spark Core, it will definitely see more 
> adoption and support from the community, and it will help the many who 
> still prefer the Receiver based model of Kafka consumer.
> I understand that the Direct Stream is the consumer which can give Exactly 
> Once semantics and uses the Kafka Low Level API, which is good. But the 
> Direct Stream has concerns around recovering the checkpoint on a driver 
> code change. Application developers need to manage their own offsets, which 
> is complex. And even if someone does manage their own offsets, it limits 
> the parallelism Spark Streaming can achieve: if someone wants more 
> parallelism and sets spark.streaming.concurrentJobs to more than 1, you can 
> no longer rely on storing offsets externally, as you have no control over 
> which batch will run in which sequence.
> Furthermore, the Direct Stream has higher latency, as it fetches messages 
> from Kafka during the RDD action. Also, the number of RDD partitions is 
> limited to the number of topic partitions, so unless your Kafka topic has 
> enough partitions, you have limited parallelism while processing the RDD.
> Due to the above-mentioned concerns, many people who do not want Exactly 
> Once semantics still prefer the Receiver based model. Unfortunately, when 
> customers fall back to the KafkaUtils.createStream approach, which uses the 
> Kafka High Level Consumer, there are other issues around the reliability of 
> the Kafka High Level API: it is buggy and has a serious issue around 
> consumer re-balance. Hence I do not think it is correct to advise people to 
> use KafkaUtils.createStream in production.
> The better option at present is to use the consumer from spark-packages. It 
> uses the Kafka Low Level Consumer API, stores offsets in Zookeeper, and can 
> recover from any failure. Below are a few highlights of this consumer:
> 1. It has an inbuilt PID controller for dynamic rate limiting.
> 2. In this consumer, rate limiting is done by controlling the size of the 
> blocks, i.e. the size of the messages pulled from Kafka, whereas in Spark 
> rate limiting is done by controlling the number of messages. The issue with 
> throttling by the number of messages is that if message sizes vary, the 
> block size will also vary. Say your Kafka topic has messages of different 
> sizes, from 10 KB to 500 KB. Thus throttling
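
To make the offset-management concern in the description concrete, here is a 
minimal sketch against the Spark 1.x direct stream API. KafkaUtils.createDirectStream 
and HasOffsetRanges are the documented API; the broker address, topic name, 
and the println placeholder are illustrative assumptions.

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val ssc = new StreamingContext(
  new SparkConf().setAppName("DirectStreamOffsetSketch"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // illustrative
// Direct stream: one RDD partition per Kafka topic partition.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events")) // "events" is an illustrative topic name

stream.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd ...
  // The application must persist these ranges to an external store after
  // each batch. With spark.streaming.concurrentJobs > 1 batches can complete
  // out of order, so a "store the latest offset" scheme is no longer safe.
  offsets.foreach(o =>
    println(s"${o.topic}/${o.partition}: ${o.fromOffset} -> ${o.untilOffset}"))
}
{code}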
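
For comparison, the documented fallback the description criticizes is the 
high-level-consumer receiver. A minimal sketch; the ZooKeeper quorum, group 
id, and topic map are illustrative values:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(
  new SparkConf().setAppName("HighLevelReceiverSketch"), Seconds(10))
// Kafka High Level Consumer: group management and re-balance are handled by
// Kafka/ZooKeeper, which is where the reliability issues mentioned above live.
val lines = KafkaUtils.createStream(
  ssc,
  "zk1:2181",           // ZooKeeper quorum
  "my-consumer-group",  // consumer group id
  Map("events" -> 1))   // topic -> number of receiver threads
{code}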
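
And a toy sketch of size-based throttling driven by a PID controller, in the 
spirit of highlights 1 and 2 above; the gains and the bytes-per-second target 
are assumptions for illustration, not the package's actual controller:

{code}
/** Toy PID controller that picks the next fetch budget in bytes so that the
  * observed throughput tracks a target. Throttling by size keeps block sizes
  * stable even when message sizes range from 10 KB to 500 KB; gains are
  * illustrative, not the spark-packages consumer's actual values. */
class PidRateLimiter(kp: Double = 1.0, ki: Double = 0.2, kd: Double = 0.0) {
  private var integral = 0.0
  private var lastError = 0.0

  def nextFetchBytes(targetBps: Double, measuredBps: Double, dtSec: Double): Long = {
    val error = targetBps - measuredBps          // proportional term
    integral += error * dtSec                    // integral term absorbs steady bias
    val derivative = (error - lastError) / dtSec // derivative damps oscillation
    lastError = error
    val correctedBps = measuredBps + kp * error + ki * integral + kd * derivative
    math.max(0L, (correctedBps * dtSec).toLong)  // bytes to pull next interval
  }
}
{code}

Throttling by message count instead would let a batch of 500 KB messages 
occupy fifty times the memory of a batch of 10 KB messages for the same 
count, which is exactly the variance the size-based scheme avoids.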

[jira] [Comment Edited] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-10 Thread Dibyendu Bhattacharya (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951915#comment-14951915 ]

Dibyendu Bhattacharya edited comment on SPARK-11045 at 10/11/15 4:45 AM:

Hi Cody, thanks for your comments.

My point on parallelism is not about the receiving parallelism from Kafka, 
which is the same in both receiver and direct stream modes. My thought was on 
the parallelism while processing the RDD. In DirectStream the number of RDD 
partitions is the number of topic partitions, so if your Kafka topic has 10 
partitions, your RDD will have 10 partitions, and that is the maximum 
parallelism while processing the RDD (unless you re-partition, which comes at 
a cost). Whereas in the Receiver based model, the number of partitions is 
dictated by the block interval and the batch interval: if your block interval 
is 200 ms and your batch interval is 10 seconds, your RDD will have 
10 s / 200 ms = 50 partitions!

I believe that gives much better parallelism while processing the RDD.
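
A minimal sketch of that arithmetic, assuming nothing beyond the standard 
spark.streaming.blockInterval setting:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Receiver based model: every block interval produces one block, and every
// block becomes one partition of the batch RDD, so
//   partitions per batch = batch interval / block interval
//                        = 10 s / 200 ms = 50
val conf = new SparkConf()
  .setAppName("ReceiverParallelismSketch")
  .set("spark.streaming.blockInterval", "200ms")
val ssc = new StreamingContext(conf, Seconds(10))
{code}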

Regarding the state of the spark-packages code, your comment is not in good 
taste. There are many companies who think otherwise and use the 
spark-packages consumer in production.

As I said earlier, DirectStream is definitely the choice if one needs 
"Exactly Once", but there are many who do not want "Exactly Once" and do not 
want the overhead of using DirectStream. Unfortunately, for them the other 
alternatives, which use the Kafka high level API, are also not good enough. I 
am here trying to give a better alternative in the form of a much better 
Receiver based approach.











