Hey Ged,

> So when the task streams its aggregated result to task B, will it stop
> for a moment whilst it does this ?
Yes and no. An individual SamzaContainer is single-threaded, so when output
is being flushed, all other StreamTasks in the container are blocked. No
other processing is happening. But if you have multiple containers, the
other containers will still be sending their output, so the downstream
consumer will probably be receiving data anyway. This is somewhat dependent
on the partitioning of your topic, but usually this is the case.

> Sensor fusion being one and financial trading being another.

These use cases are not ones that we've been targeting so far. Both require
very tight SLAs on latency. We've been focusing more on tight data-loss SLAs
(we don't want to lose data).

> If a task is having problems keeping up can I real time add another CPU
> core or task instance ?

Since Samza's containers are single-threaded, this is modeled by adding a
new container to the job (yarn.container.count). This operation typically
takes a few seconds, though, since the resources have to be negotiated with
YARN. In addition, we don't do this dynamically right now. It *could* be
done dynamically, but it would involve monitoring your job's metrics,
updating configuration, and restarting the job. Again, this triggers a blip
in latency while the job is restarted.

> Can I skip some of the 10k per second in real time and give a less
> correct result to task b ?

In the *specific* example you're giving, your StreamTask can opt not to send
output that it has buffered, in cases where the output is already far enough
behind that the data is no longer valuable. For example, if your StreamTask
accrues 10k messages, and the send takes 200ms, but you have a 100ms SLA,
then your task could opt to drop (not send) all messages older than 100ms.
This would cause the StreamTask to speed up, as it simply inspects and skips
messages that are too old. (A rough sketch of this idea, and of the config
settings discussed below, is at the bottom of this mail, below your quoted
message.)

> To alleviate the "stopping while it sends the results" to the next
> task...

The best way to accomplish this is to use Kafka's async producer with Samza:

systems.kafka.producer.producer.type=async

The async producer will cause sending to happen on a separate thread, which
means that flush calls will be non-blocking. **Note that this means you can
lose data if a failure occurs.**

You can also speed things up by disabling acks in Kafka:

systems.kafka.producer.request.required.acks=0

If you use the async producer, you'll need to be sensitive to the size of
the batch. A smaller batch size means that flushes will happen more
frequently, which helps latency, but throughput might suffer since
per-request network overhead starts to have an impact.

I should also note that a lot of these settings will go away when we switch
to the new Kafka producer:

https://issues.apache.org/jira/browse/SAMZA-227

This is currently being worked on. The new producer pipelines requests, so
batching can be done with no cost to latency.

This feature:

https://issues.apache.org/jira/browse/SAMZA-459

Might also be useful to you. It would allow you to explicitly flush output
even if the output batch is not yet full.

Cheers,
Chris

On 1/12/15 3:29 PM, "Gerard Webb" <[email protected]> wrote:

>Samza looks very flexible.
>Let's say I had messages streaming in to a task at 100 k / sec duty cycle.
>The task has an SLA to output the results every .1 seconds. So it would
>have swallowed about 10 k of messages; of course this will vary based on
>many factors.
>So when the task streams its aggregated result to task B, will it stop for
>a moment whilst it does this ? A bit like a garbage collection metaphor.
>
>The reason I ask this is because Samza looks nice for lots of dataflow
>scenarios, but I am wondering about best practice for getting it to do
>some sort of deterministic QoS for time-sensitive situations.
>Sensor fusion being one and financial trading being another.
>
>Ways to approach the problem:
>1. If a task is having problems keeping up, can I add another CPU core or
>task instance in real time ?
>
>2. Can I skip some of the 10k per second in real time and give a less
>correct result to task B ?
>
>3. To alleviate the "stopping while it sends the results" to the next
>task, could I store the aggregated results in memory, and at the time of
>handing it on have another task do the aggregated hand-on, so the task
>never pauses ?
>
>The metaphor is a typical function. It has many parameters on the function
>signature and a single return. Inside the function it's doing a loop on
>data (the stream), and when it reaches a certain timer point, it returns
>the aggregated result. Then it just keeps going.
>
>Hope this is understandable and you can see what I am getting at.
>
>Ged
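P.S. Since the "skip stale output" idea above is easiest to see in code,
here is a very rough, untested sketch of it. This is not existing Samza
example code: the class name, the 100ms budget, and the output stream are
all made up, and it assumes you buffer in process() and flush from window()
(i.e. the task also implements WindowableTask and the job sets
task.window.ms).

import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

public class DropStaleOutputTask implements StreamTask, WindowableTask {
  // Made-up latency budget and output stream, for illustration only.
  private static final long SLA_MS = 100;
  private static final SystemStream OUTPUT =
      new SystemStream("kafka", "aggregated-output");

  // A buffered message plus the wall-clock time at which it arrived.
  private static final class Buffered {
    final long arrivedAtMs;
    final Object message;

    Buffered(long arrivedAtMs, Object message) {
      this.arrivedAtMs = arrivedAtMs;
      this.message = message;
    }
  }

  private final Deque<Buffered> buffer = new ArrayDeque<Buffered>();

  @Override
  public void process(IncomingMessageEnvelope envelope,
      MessageCollector collector, TaskCoordinator coordinator) {
    // Just buffer here; the flush happens in window().
    buffer.addLast(new Buffered(System.currentTimeMillis(), envelope.getMessage()));
  }

  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) {
    long cutoff = System.currentTimeMillis() - SLA_MS;
    while (!buffer.isEmpty()) {
      Buffered b = buffer.pollFirst();
      if (b.arrivedAtMs < cutoff) {
        continue; // already too old to be useful downstream, so skip it
      }
      collector.send(new OutgoingMessageEnvelope(OUTPUT, b.message));
    }
  }
}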

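And, for reference, a hypothetical job config fragment pulling the settings
mentioned above together. The two producer properties and
yarn.container.count are the ones quoted in this thread; task.window.ms only
matters if you flush from window(); the batch size key (batch.num.messages)
is an old-producer Kafka setting, so double-check it against your Kafka
version before relying on it.

# Run two containers instead of one.
yarn.container.count=2

# Async producer: sending happens off the processing thread (can lose data on failure).
systems.kafka.producer.producer.type=async

# Don't wait for broker acks (also can lose data).
systems.kafka.producer.request.required.acks=0

# Old-producer batch size; smaller batches flush more often, at the cost of more network overhead.
systems.kafka.producer.batch.num.messages=200

# How often window() fires, if you flush from WindowableTask.window().
task.window.ms=100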