Guozhang,

It's a bit difficult to explain, but let me try. The basic idea is that we can assume most messages share the same clock (at least per partition). Then, if a committed offset carries metadata recording the target time at that offset, fail-over works. For example:
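(A minimal sketch of that recovery rule, in plain Java with no Kafka dependency. The metadata encoding and all names here are my own illustration, not from this thread; in practice the encoded string would go into the metadata field of a committed offset, e.g. Kafka's OffsetAndMetadata.)

```java
import java.time.Instant;

// Sketch: encode a watermark time into the commit metadata string, and
// after a fail-over use it to skip replayed records that were already
// folded into committed aggregates. Names are illustrative.
class OffsetTimeFilter {
    // What would be stored as the metadata string alongside the offset.
    static String encodeMetadata(Instant watermark) {
        return Long.toString(watermark.toEpochMilli());
    }

    static Instant decodeMetadata(String metadata) {
        return Instant.ofEpochMilli(Long.parseLong(metadata));
    }

    // After restoring from the committed offset, only records strictly
    // later than the metadata time are processed.
    static boolean shouldProcess(Instant recordTime, Instant watermark) {
        return recordTime.isAfter(watermark);
    }
}
```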
Offset = 12222, Metadata Time = 2/12/2017 12:00:00

After this offset (12222), I only process messages later than the metadata time. The application commits only when it's safe to move the time bucket forward for all keys; it picks the earliest time of the time bucket from each key. But that 'move forward' is not a per-key decision, it's a per-partition decision, so I need to define the maximum time range for which I need to keep data.

It's not always that simple, though: there are outliers (network hiccups) and delayed arrivals (wrong NTP). My plan is to mark those keys as a 'best effort' group when that happens. For those keys, as long as the JVM keeps running, I can handle them, but they won't be part of the 'per partition' decision.

Hope this gives some idea of what I'm trying to do. I'm planning to use the Processor API for this.

Thanks,
-Kohki

On Sun, Feb 26, 2017 at 8:56 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> Hello Kohki,
>
> Given your data traffic and the state volume, I cannot think of a better
> solution but to suggest using a large number of partitioned local states.
>
> I'm wondering how a "per partition watermark" can help with your
> traffic?
>
> Guozhang
>
> On Sun, Feb 26, 2017 at 10:45 AM, Kohki Nishio <tarop...@gmail.com> wrote:
>
> > Guozhang,
> >
> > Let me explain what I'm trying to do. The message volume is large (TB
> > per day), and that is coming into a topic.
> > Now I want to do per-minute aggregation (windowed) and send the output
> > downstream (to a topic):
> >
> > (Topic1 - Large Volume) -> [Stream App] -> (Topic2 - Large Volume)
> >
> > I assume the internal topic would have the same amount of data as
> > Topic1, and the same goes for the local store. I know we can tweak the
> > retention period, but the network traffic would be the same (or even
> > more).
> >
> > The key point here is that most of the incoming stream will end up as a
> > single data point per one-minute aggregation window, but the variance
> > of the key is huge (high cardinality), so buffering wouldn't really
> > help reduce the data/traffic size.
> >
> > I'm going to do something like per-partition watermarking with offset
> > metadata. It wouldn't increase the traffic that much.
> >
> > Thanks,
> > -Kohki
>
> --
> -- Guozhang

--
Kohki Nishio