Hey Ged,

> So when the task streams its aggregated result to task B, will it stop
> for a moment whilst it does this ?
Yes and no. An individual SamzaContainer is single-threaded, so when output
is being flushed, all other StreamTasks in the container are blocked. No
other processing is happening. But if you have multiple containers, the
other containers will still be sending their output, so the downstream
consumer will probably be receiving data anyway. This is somewhat dependent
on the partitioning of your topic, but usually this is the case.

> Sensor fusion being one and financial trading being another.

These use cases are not ones that we've been targeting so far. Both require
very tight SLAs on latency. We've been focusing more on tight data-loss SLAs
(we don't want to lose data).

> If a task is having problems keeping up can I real time add another CPU
> core or task instance ?

Since Samza's containers are single-threaded, this is modeled by adding a
new container to the job (yarn.container.count). This operation typically
takes a few seconds, though, since the resources have to be negotiated with
YARN. In addition, we don't do this dynamically right now. It *could* be
done dynamically, but it would involve monitoring your job's metrics,
updating configuration, and restarting the job. Again, this triggers a blip
in latency while the job is restarted.

> Can I skip some of the 10k per second in real time and give a less
> correct result to task b ?

In the *specific* example you're giving, your StreamTask can opt not to send
output that it has buffered, in cases where the output is already far enough
behind that the data is no longer valuable. For example, if your StreamTask
accrues 10k messages, and the send takes 200ms, but you have a 100ms SLA,
then your task could opt to drop (not send) all messages older than 100ms.
This would cause the StreamTask to speed up, as it simply inspects and skips
messages that are too old. (A rough sketch of this idea, and of the config
settings discussed below, is at the bottom of this mail, below your quoted
message.)

> To alleviate the "stopping while it sends the results" to the next
> task...

The best way to accomplish this is to use Kafka's async producer with Samza:

systems.kafka.producer.producer.type=async

The async producer will cause sending to happen on a separate thread, which
means that flush calls will be non-blocking. **Note that this means you can
lose data if a failure occurs.**

You can also speed things up by disabling acks in Kafka:

systems.kafka.producer.request.required.acks=0

If you use the async producer, you'll need to be sensitive to the size of
the batch. A smaller batch size means that flushes will happen more
frequently, which helps latency, but throughput might suffer since
per-request network overhead starts to have an impact.

I should also note that a lot of these settings will go away when we switch
to the new Kafka producer:

https://issues.apache.org/jira/browse/SAMZA-227

This is currently being worked on. The new producer pipelines requests, so
batching can be done with no cost to latency.

This feature:

https://issues.apache.org/jira/browse/SAMZA-459

Might also be useful to you. It would allow you to explicitly flush output
even if the output batch is not yet full.

Cheers,
Chris

On 1/12/15 3:29 PM, "Gerard Webb" <[email protected]> wrote:

>Samza looks very flexible.
>Let's say I had messages streaming in to a task at 100 k / sec duty cycle.
>The task has an SLA to output the results every .1 seconds. So it would
>have swallowed about 10 k of messages; of course this will vary based on
>many factors.
>So when the task streams its aggregated result to task B, will it stop for
>a moment whilst it does this ? A bit like a garbage collection metaphor.
>
>The reason I ask this is because Samza looks nice for lots of dataflow
>scenarios, but I am wondering about best practice for getting it to do
>some sort of deterministic QoS for time-sensitive situations.
>Sensor fusion being one and financial trading being another.
>
>Ways to approach the problem:
>1. If a task is having problems keeping up, can I add another CPU core or
>task instance in real time ?
>
>2. Can I skip some of the 10k per second in real time and give a less
>correct result to task B ?
>
>3. To alleviate the "stopping while it sends the results" to the next
>task, could I store the aggregated results in memory, and at the time of
>handing it on have another task do the aggregated hand-on, so the task
>never pauses ?
>
>The metaphor is a typical function. It has many parameters on the function
>signature and a single return. Inside the function it's doing a loop on
>data (the stream), and when it reaches a certain timer point, it returns
>the aggregated result. Then it just keeps going.
>
>Hope this is understandable and you can see what I am getting at.
>
>Ged
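P.S. Since the "skip stale output" idea above is easiest to see in code,
here is a very rough, untested sketch of it. This is not existing Samza
example code: the class name, the 100ms budget, and the output stream are
all made up, and it assumes you buffer in process() and flush from window()
(i.e. the task also implements WindowableTask and the job sets
task.window.ms).

import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

public class DropStaleOutputTask implements StreamTask, WindowableTask {
  // Made-up latency budget and output stream, for illustration only.
  private static final long SLA_MS = 100;
  private static final SystemStream OUTPUT =
      new SystemStream("kafka", "aggregated-output");

  // A buffered message plus the wall-clock time at which it arrived.
  private static final class Buffered {
    final long arrivedAtMs;
    final Object message;

    Buffered(long arrivedAtMs, Object message) {
      this.arrivedAtMs = arrivedAtMs;
      this.message = message;
    }
  }

  private final Deque<Buffered> buffer = new ArrayDeque<Buffered>();

  @Override
  public void process(IncomingMessageEnvelope envelope,
      MessageCollector collector, TaskCoordinator coordinator) {
    // Just buffer here; the flush happens in window().
    buffer.addLast(new Buffered(System.currentTimeMillis(), envelope.getMessage()));
  }

  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) {
    long cutoff = System.currentTimeMillis() - SLA_MS;
    while (!buffer.isEmpty()) {
      Buffered b = buffer.pollFirst();
      if (b.arrivedAtMs < cutoff) {
        continue; // already too old to be useful downstream, so skip it
      }
      collector.send(new OutgoingMessageEnvelope(OUTPUT, b.message));
    }
  }
}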

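And, for reference, a hypothetical job config fragment pulling the settings
mentioned above together. The two producer properties and
yarn.container.count are the ones quoted in this thread; task.window.ms only
matters if you flush from window(); the batch size key (batch.num.messages)
is an old-producer Kafka setting, so double-check it against your Kafka
version before relying on it.

# Run two containers instead of one.
yarn.container.count=2

# Async producer: sending happens off the processing thread (can lose data on failure).
systems.kafka.producer.producer.type=async

# Don't wait for broker acks (also can lose data).
systems.kafka.producer.request.required.acks=0

# Old-producer batch size; smaller batches flush more often, at the cost of more network overhead.
systems.kafka.producer.batch.num.messages=200

# How often window() fires, if you flush from WindowableTask.window().
task.window.ms=100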