As I understand the matter: Option 1) has benefits when you think that your network bandwidth may be a bottleneck, because Spark opens several network connections, possibly on several different physical machines.
Option 2) - as you already pointed out - has the benefit that you occupy fewer worker cores with receiver tasks.

Regards,
Jeff

2015-02-26 9:38 GMT+01:00 bit1...@163.com <bit1...@163.com>:

> Sure, Thanks Tathagata!
>
> ------------------------------
> bit1...@163.com
>
>
> *From:* Tathagata Das <t...@databricks.com>
> *Date:* 2015-02-26 14:47
> *To:* bit1...@163.com
> *CC:* Akhil Das <ak...@sigmoidanalytics.com>; user <user@spark.apache.org>
> *Subject:* Re: Re: Many Receiver vs. Many threads per Receiver
> Spark Streaming has a new Kafka direct stream, to be released as an
> experimental feature with 1.3. That uses a low-level consumer. Not sure if
> it satisfies your purpose.
> If you want more control, it's best to create your own Receiver with the
> low-level Kafka API.
>
> TD
>
> On Tue, Feb 24, 2015 at 12:09 AM, bit1...@163.com <bit1...@163.com> wrote:
>
>> Thanks Akhil.
>> Not sure whether the low-level consumer
>> <https://github.com/dibbhatt/kafka-spark-consumer> will be officially
>> supported by Spark Streaming. So far, I don't see it mentioned/documented
>> in the Spark Streaming programming guide.
>>
>> ------------------------------
>> bit1...@163.com
>>
>>
>> *From:* Akhil Das <ak...@sigmoidanalytics.com>
>> *Date:* 2015-02-24 16:21
>> *To:* bit1...@163.com
>> *CC:* user <user@spark.apache.org>
>> *Subject:* Re: Many Receiver vs. Many threads per Receiver
>> I believe when you go with 1, it will distribute the consumers across your
>> cluster (possibly on 6 machines), but I still don't see a way to tell
>> which partition each will consume from, etc. If you are looking to have a
>> consumer where you can specify the partition details and all, then you are
>> better off with the low-level consumer
>> <https://github.com/dibbhatt/kafka-spark-consumer>.
>>
>>
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Feb 24, 2015 at 9:36 AM, bit1...@163.com <bit1...@163.com> wrote:
>>
>>> Hi,
>>> I am experimenting with the Spark Streaming and Kafka integration. To read
>>> messages from Kafka in parallel, there are basically two ways:
>>> 1. Create many receivers, like (1 to 6).map(_ =>
>>> KafkaUtils.createStream(...)).
>>> 2. Specify many threads when calling KafkaUtils.createStream, e.g. with
>>> val topicMap = Map("myTopic" -> 6); this will create one receiver with 6
>>> reading threads.
>>>
>>> My question is which option is better. Option 2 sounds better to me
>>> because it saves a lot of cores (one receiver occupies one core), but I
>>> learned from somewhere else that option 1 is better, so I would like to
>>> hear how you guys elaborate on this. Thanks.
>>>
>>> ------------------------------
>>> bit1...@163.com
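For anyone reading this thread later, here is a minimal sketch of the two approaches being discussed, using the receiver-based KafkaUtils.createStream API from Spark Streaming 1.x. This is not from the original mails; the topic name, ZooKeeper quorum, and consumer group id are placeholders you would replace with your own values:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaParallelismSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-receiver-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val zkQuorum = "zkhost:2181"  // placeholder ZooKeeper quorum
    val groupId  = "my-group"     // placeholder consumer group id

    // Option 1: six separate receivers, each with one consumer thread.
    // Each receiver occupies one worker core and may be scheduled on a
    // different executor, so network I/O is spread across machines.
    val streamsOpt1 = (1 to 6).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, Map("myTopic" -> 1))
    }
    val unionedOpt1 = ssc.union(streamsOpt1)

    // Option 2: a single receiver with six consumer threads.
    // Only one core is tied up by a receiver, but all data arrives
    // through one executor's network connection.
    val streamOpt2 = KafkaUtils.createStream(
      ssc, zkQuorum, groupId, Map("myTopic" -> 6),
      StorageLevel.MEMORY_AND_DISK_SER)

    unionedOpt1.count().print()
    streamOpt2.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that with option 1 the six streams must be unioned before further processing, otherwise each is handled independently. As TD mentions above, from Spark 1.3 onward KafkaUtils.createDirectStream sidesteps this trade-off entirely by reading Kafka partitions directly without any long-running receivers.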