[ 
https://issues.apache.org/jira/browse/SAMZA-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218739#comment-14218739
 ] 

Ben Kirwin commented on SAMZA-353:
----------------------------------

I'd like to throw a couple more use-cases in the ring.

- Storm / Trident have a concept of a 'DRPC' topology. It's a pretty typical 
scatter / gather pattern: there's a stream of incoming requests that get 
broadcast to a bunch of workers, and the results get funnelled together at the 
bottom. (Some examples / commentary in here: 
https://storm.apache.org/documentation/Trident-tutorial.html#reach ) 'Global' 
streams seem like a good abstraction for this.

- It allows you to trade off between bandwidth and compute. For example, say I 
want to join a tiny topic with 2 partitions and a large topic with 32 
partitions. Right now, I'd be forced to use only 2 tasks -- which might be too 
few to keep up with the data in the large topic. If we allowed the same SSP 
to go to multiple tasks, I could make a task for each partition in the large 
topic -- this means I'd be consuming the small topic from many tasks, but that 
extra bandwidth might be worth it to scale the job across many machines.

- It seems like a common use case for 'windowing' is to periodically dump some 
accumulated stats / etc. Having a broadcast stream would let you trigger this 
from outside the job, so you could (say) have an external service send a 
message every hour or whenever a user requested a more up-to-date version. (If 
a machine was down at the time, it would catch up when the task is restarted.)
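
To make the join example in the second bullet concrete, here's a rough sketch 
of what such a grouping could look like. This is illustrative Python, not 
Samza's grouper API: `group_for_join`, the task names, and the (topic, 
partition) tuples standing in for SSPs are all made up. It also assumes both 
topics are partitioned by the same non-negative hash modulo the partition 
count, and that the large partition count is a multiple of the small one, so 
the keys in large partition p all live in small partition p % 2.

```python
from collections import defaultdict


def group_for_join(large_topic, large_partitions, small_topic, small_partitions):
    """Hypothetical grouper: one task per partition of the large topic.

    Assumes both topics use the same hash-modulo partitioning, so every key
    in large partition p is found in small partition p % small_partitions.
    """
    if large_partitions % small_partitions != 0:
        raise ValueError("large partition count must be a multiple of the small one")
    assignment = {}
    for p in range(large_partitions):
        task_name = f"Partition {p}"
        # The same small-topic SSP ends up assigned to many tasks.
        assignment[task_name] = {
            (large_topic, p),
            (small_topic, p % small_partitions),
        }
    return assignment


assignment = group_for_join("large", 32, "small", 2)

# Invert the mapping to see how many tasks consume each SSP.
consumers = defaultdict(set)
for task, ssps in assignment.items():
    for ssp in ssps:
        consumers[ssp].add(task)
```

With 32 and 2 partitions this yields 32 tasks, and each partition of the small 
topic is read by 16 of them; that's the extra bandwidth being traded for 
parallelism.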

Personally, I like the original proposal best: make the offsets ordered, take 
the minimum offset on container start, then drop messages for the tasks where 
the latest commit is greater than the stream's offset. From a user's 
perspective, this is very simple: it doesn't introduce any new concepts or 
change existing behaviour, it removes a corner case from the grouping, and it 
keeps the tasks roughly in sync.
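
Roughly, the behaviour I have in mind looks like the sketch below. It's 
illustrative Python rather than container code, and it makes some assumptions 
that aren't in the proposal: offsets are plain integers, the log is a list 
indexed by offset, and each task's checkpoint holds the next offset it needs 
to process. `replay` is a made-up name.

```python
def replay(log, checkpoints):
    """Deliver one shared stream to several tasks after a container start.

    log:         list of messages, indexed by integer offset (assumption)
    checkpoints: task name -> next offset that task needs (assumption)
    """
    delivered = {task: [] for task in checkpoints}
    if not checkpoints:
        return delivered
    # Start reading from the minimum committed offset across all tasks.
    start = min(checkpoints.values())
    for offset in range(start, len(log)):
        for task, next_offset in checkpoints.items():
            # Drop the message for tasks whose commit is already past it.
            if next_offset > offset:
                continue
            delivered[task].append(log[offset])
    return delivered


log = ["m0", "m1", "m2", "m3", "m4", "m5"]
delivered = replay(log, {"task-a": 2, "task-b": 4})
```

Here the stream is read once from offset 2; task-a sees m2 onwards, while 
task-b only sees m4 and m5.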

(Incidentally: to get exactly-once messaging/state semantics, my project 
'coast' also expects offsets to be ordered within a partition. If you can't 
assume an ordering, even Camus' simple trick of storing the offset alongside 
the state doesn't work.)
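
For anyone who hasn't seen the offset-alongside-state trick, here's a minimal 
sketch of the idea. This is illustrative Python, not code from Camus or 
coast; `CountingState` is a made-up name, and it assumes integer offsets and 
that the state and the offset are persisted together in one atomic write. 
Note that the `<=` comparison is exactly where the ordering assumption comes 
in.

```python
class CountingState:
    """Toy state that remembers the offset of the last message it applied."""

    def __init__(self):
        self.count = 0
        self.last_offset = -1  # nothing applied yet

    def apply(self, offset, value):
        """Apply a message once; return False if it was already applied."""
        # Only meaningful if offsets are ordered within the partition.
        if offset <= self.last_offset:
            return False
        # In a real store, both fields would be written atomically (assumption).
        self.count, self.last_offset = self.count + value, offset
        return True


state = CountingState()
for offset, value in [(0, 10), (1, 20), (2, 30)]:
    state.apply(offset, value)

# After a restart the input is replayed from an earlier offset;
# offsets 1 and 2 are skipped, and only offset 3 changes the state.
replayed = [state.apply(offset, value) for offset, value in [(1, 20), (2, 30), (3, 40)]]
```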

> Support assigning the same SSP to multiple tasknames
> ----------------------------------------------------
>
>                 Key: SAMZA-353
>                 URL: https://issues.apache.org/jira/browse/SAMZA-353
>             Project: Samza
>          Issue Type: Bug
>          Components: container
>    Affects Versions: 0.8.0
>            Reporter: Jakob Homan
>              Labels: design
>             Fix For: 0.8.0
>
>         Attachments: DESIGN-SAMZA-353-0.md, DESIGN-SAMZA-353-0.pdf
>
>
> Post SAMZA-123, it is possible to add the same SSP to multiple tasknames, 
> although currently we check for this and error out if this is done.  We 
> should think through the implications of having the same SSP appear in 
> multiple tasknames and support this if it makes sense.  
> This could be used as a broadcast stream that's either added by Samza itself 
> to each taskname, or individual groupers could do this as makes sense.  Right 
> now the container maintains a map of SSP to TaskInstance and delivers the ssp 
> to that task instance.  With this change, we'd need to change the map to SSP 
> to Set[TaskInstance] and deliver the message to each TI in the set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
