Hi guys,

I have a somewhat general question, to get a better understanding of
stream vs. finite data transformation. More specifically, I am trying
to understand the lifecycle of 'entities' during processing.

1) For example, in the case of streams: suppose we start with some
key-value source and split it into 2 parallel streams by key. Each
stream modifies the entries' values, let's say adds some fields, and we
want to merge them back later. How does that happen? Does the merging
point keep some finite buffer of entries? Bounded by time or by size?

I understand that the right solution in this case would probably be to
have one stream and get more performance by increasing parallelism, but
what if I have 2 sources from the beginning?
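
To make it concrete, here is roughly what I have in mind (just a
sketch, and I am assuming the DataStream Java API here; the sources,
keys and field names are made up):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TwoSourcesMergeSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // two independent key-value sources (example data is made up)
        DataStream<Tuple2<String, String>> source1 =
                env.fromElements(Tuple2.of("key1", "a"), Tuple2.of("key2", "b"));
        DataStream<Tuple2<String, String>> source2 =
                env.fromElements(Tuple2.of("key1", "c"), Tuple2.of("key3", "d"));

        // each stream is keyed and enriches its entries on its own
        DataStream<Tuple2<String, String>> enriched1 = source1.keyBy(byKey()).map(addField("fromSource1"));
        DataStream<Tuple2<String, String>> enriched2 = source2.keyBy(byKey()).map(addField("fromSource2"));

        // the "merging point" I am asking about
        DataStream<Tuple2<String, String>> merged = enriched1.union(enriched2);

        merged.print();
        env.execute("two sources merge sketch");
    }

    // key selector: the first tuple field is the key
    private static KeySelector<Tuple2<String, String>, String> byKey() {
        return new KeySelector<Tuple2<String, String>, String>() {
            @Override
            public String getKey(Tuple2<String, String> entry) {
                return entry.f0;
            }
        };
    }

    // "adds some fields": here it just appends a tag to the value
    private static MapFunction<Tuple2<String, String>, Tuple2<String, String>> addField(final String tag) {
        return new MapFunction<Tuple2<String, String>, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> map(Tuple2<String, String> entry) {
                return Tuple2.of(entry.f0, entry.f1 + "|" + tag);
            }
        };
    }
}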


2) I also assume that in the streaming case each entry is considered
'processed' once it has passed the whole chain and been emitted into
some sink, so after that it no longer consumes resources; basically
similar to what Storm does. But in the case of finite data (data sets):
how much data will the system keep in memory? The whole set?

I probably have an example of a dataset vs. stream 'mix': I need to
*transform* a big but finite chunk of data, but I don't really need to
do any joins, grouping or anything like that, so I never need to store
the whole dataset in memory/storage. What would my choice be in this
case?
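
For example, something like this (again just a sketch, assuming the
DataSet Java API; the paths are made up, and the point is that the
whole job is a purely element-wise map with no joins, grouping or
state):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class TransformOnlySketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // big but finite input; the path is made up
        DataSet<String> input = env.readTextFile("hdfs:///path/to/big-but-finite-input");

        // purely element-wise transformation: no joins, no grouping, no state
        DataSet<String> transformed = input.map(new MapFunction<String, String>() {
            @Override
            public String map(String record) {
                // e.g. add a field / reformat the record
                return record + "|transformed";
            }
        });

        transformed.writeAsText("hdfs:///path/to/output");
        env.execute("transform only sketch");
    }
}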

Thanks!
Konstantin
