Re: Kafka streams vs Spark streaming

Sachin Mittal Wed, 11 Oct 2017 04:48:02 -0700

No, it won't work this way.
Say you have 9 partitions and 3 instances:
1 = {1, 2, 3}
2 = {4, 5, 6}
3 = {7, 8, 9}
And let's say a particular key (k1) is always written to partition 4.
Now say you increase the partitions to 12; you may have:
1 = {1, 2, 3, 4}
2 = {5, 6, 7, 8}
3 = {9, 10, 11, 12}


Now it is possible that k1 is still written to partition 4 and is now
processed by instance 1, or it may be redistributed to some new partition,
say 9, and processed by instance 3.
So we would have some old data for that key in partition 4 and new data,
written after the partitions were added, in partition 9.

Hence an aggregation over a time window that spans both partitions would
report incorrect results, because no single instance sees all the records
for k1.
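
To make the remapping concrete, here is a minimal Python sketch of the
scenario above. It assumes keys are placed by hash(key) mod partition count
and that partitions are split into contiguous ranges across instances, as in
the example. The function names (partition_for, owner_of, moved_keys) are
hypothetical, and CRC32 is only a stand-in: Kafka's default partitioner uses
a murmur2 hash, so the actual partition numbers will differ.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a 1-based partition number via hash(key) mod partition count.

    CRC32 is a stand-in hash chosen for illustration; it is not the hash
    Kafka's default partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions + 1


def owner_of(partition: int, num_partitions: int, num_instances: int) -> int:
    """Return the 1-based instance that owns a partition under the contiguous
    split used in the example (assumes num_partitions divides evenly)."""
    per_instance = num_partitions // num_instances
    return (partition - 1) // per_instance + 1


def moved_keys(keys, old_count, new_count):
    """List the keys whose partition differs between the two partition counts."""
    return [k for k in keys
            if partition_for(k, old_count) != partition_for(k, new_count)]


if __name__ == "__main__":
    keys = [f"k{i}" for i in range(1000)]
    moved = moved_keys(keys, old_count=9, new_count=12)
    print(f"{len(moved)} of {len(keys)} keys land on a different partition")
    # Partition 4 belongs to instance 2 with 9 partitions, instance 1 with 12.
    print(owner_of(4, 9, 3), owner_of(4, 12, 3))
```

With a uniform hash, a key keeps its partition in this 9-to-12 case only when
its hash modulo 36 is below 9, so most keys move. Records already written
stay where they were.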

As a result, it is general practice to create somewhat more partitions than
the anticipated data volume requires. Hence Kafka topics and Kafka Streams
are not as elastic as Spark.



On Wed, Oct 11, 2017 at 1:56 PM, Sabarish Sasidharan <sabarish....@gmail.com
> wrote:

> @Sachin
> >> It is not elastic. You need to anticipate beforehand the volume of data
> >> you will have. Very difficult to add and reduce topic partitions later on.
>
> Why do you say so Sachin? Kafka Streams will readjust once we add more
> partitions to the Kafka topic. And when we add more machines, rebalancing
> auto distributes the partitions among the new stream threads.
>
> Regards
> Sab
>
> On 11 Oct 2017 1:44 pm, "Sachin Mittal" <sjmit...@gmail.com> wrote:
>
>> Kafka Streams has a gentler learning curve, and if your source data is in
>> Kafka topics it is pretty simple to integrate with.
>> It runs as a library inside your main program.
>>
>> So, as compared to Spark Streaming, it
>> 1. Is much simpler to implement.
>> 2. Is not as heavy on hardware as Spark.
>>
>>
>> On the downside:
>> 1. It is not elastic. You need to anticipate beforehand the volume of
>> data you will have. It is very difficult to add or remove topic partitions
>> later on.
>> 2. The partition key is very important if you need to run multiple
>> instances of the streams application, with each instance processing only
>> certain partitions.
>>      In case you need aggregation on a different key, you may need to
>> re-partition the data to a new topic and run a new streams app against that.
>>
>> So yes, if you have a good idea about your data, it comes from Kafka,
>> and you want to build something quickly without much hardware, Kafka
>> Streams is the way to go.
>>
>> We first tried Spark Streaming, but given hardware limitations and the
>> complexity of fetching data from MongoDB, we decided on Kafka Streams as
>> the way forward.
>>
>> Thanks
>> Sachin
>>
>>
>>
>>
>>
>> On Wed, Oct 11, 2017 at 1:01 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Has anyone had experience using Kafka Streams versus Spark?
>>>
>>> I am not familiar with the Kafka Streams concept except that it is a set
>>> of libraries.
>>>
>>> Any feedback will be appreciated.
>>>
>>> Regards,
>>>
>>> Mich
>>>
>>>
>>>
>>> LinkedIn:
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>
