[
https://issues.apache.org/jira/browse/SAMZA-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218739#comment-14218739
]
Ben Kirwin commented on SAMZA-353:
----------------------------------
I'd like to throw a couple more use-cases in the ring.
- Storm / Trident have a concept of a 'DRPC' topology. It's a pretty typical
scatter / gather pattern: there's a stream of incoming requests that get
broadcast to a bunch of workers, and the results get funnelled together at the
bottom. (Some examples / commentary in here:
https://storm.apache.org/documentation/Trident-tutorial.html#reach ) 'Global'
streams seem like a good abstraction for this.
- Sharing an SSP lets you trade bandwidth for compute. For example, say I
want to join a tiny topic with 2 partitions and a large topic with 32
partitions. Right now, I'd be forced to use only 2 tasks -- which might be too
few to keep up with the data in the large topic. If we allowed the same SSP
to go to multiple tasks, I could make a task for each partition in the large
topic -- this means I'd be consuming the small topic from many tasks, but that
extra bandwidth might be worth it to scale the job across many machines.
- It seems like a common use case for 'windowing' is to periodically dump some
accumulated stats / etc. Having a broadcast stream would let you trigger this
from outside the job, so you could (say) have an external service send a
message every hour or whenever a user requested a more up-to-date version. (If
a machine was down at the time, it would catch up when the task is restarted.)
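
To make the join example in the second bullet concrete, here is a rough sketch of the grouping I have in mind. Everything in it is hypothetical: the class, the method, and the "topic/partition" strings are stand-ins rather than Samza APIs, and I'm assuming each task needs every partition of the small topic (if the two topics were co-partitioned, one small partition per task might be enough).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not a real Samza grouper.
final class SharedSspGroupingSketch {

    // Builds one task per partition of the large topic. Every task is also
    // assigned all partitions of the small topic, so the same small-topic
    // SSP shows up under many task names.
    static Map<String, List<String>> group(String largeTopic, int largePartitions,
                                           String smallTopic, int smallPartitions) {
        Map<String, List<String>> tasks = new LinkedHashMap<>();
        for (int p = 0; p < largePartitions; p++) {
            List<String> ssps = new ArrayList<>();
            ssps.add(largeTopic + "/" + p);          // this task's own large partition
            for (int s = 0; s < smallPartitions; s++) {
                ssps.add(smallTopic + "/" + s);      // shared small-topic partitions
            }
            tasks.put("task-" + p, ssps);
        }
        return tasks;
    }
}
```

With 32 large partitions and 2 small ones this gives 32 tasks instead of 2, at the cost of reading the small topic 32 times.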
Personally, I like the original proposal best: make the offsets ordered, start
reading from the minimum committed offset on container start, then drop
messages for any task whose latest commit is greater than the message's
offset. From a user's perspective, this is very simple: it doesn't introduce
any new concepts or change existing behaviour, it removes a corner case from
the grouping, and it keeps the tasks roughly in sync.
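
A minimal sketch of that rule, to show how little is involved. The names here are made up for illustration and are not Samza APIs. I'm also assuming that offsets are plain ordered longs and that a task's commit records the next offset it still needs to process, which is what makes "greater than" the right comparison.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Samza.
final class SharedSspReplaySketch {

    // Where the container starts reading a shared SSP: the smallest commit
    // among the tasks that consume it. Throws NoSuchElementException if the
    // map is empty.
    static long startingOffset(Map<String, Long> lastCommitted) {
        return Collections.min(lastCommitted.values());
    }

    // Tasks that should receive the message at the given offset. A task is
    // skipped when its latest commit is greater than the offset, because it
    // has already processed that message.
    static List<String> recipients(Map<String, Long> lastCommitted, long offset) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastCommitted.entrySet()) {
            if (e.getValue() <= offset) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

So if task-0 committed at 5 and task-1 at 8, the container would start at 5, deliver offsets 5 to 7 to task-0 only, and deliver to both tasks from 8 onwards.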
(Incidentally: to get exactly-once messaging/state semantics, my project
'coast' also expects offsets to be ordered within a partition. If you can't
assume an ordering, even Camus' simple trick of storing the offset alongside
the state doesn't work.)
> Support assigning the same SSP to multiple tasknames
> ----------------------------------------------------
>
> Key: SAMZA-353
> URL: https://issues.apache.org/jira/browse/SAMZA-353
> Project: Samza
> Issue Type: Bug
> Components: container
> Affects Versions: 0.8.0
> Reporter: Jakob Homan
> Labels: design
> Fix For: 0.8.0
>
> Attachments: DESIGN-SAMZA-353-0.md, DESIGN-SAMZA-353-0.pdf
>
>
> Post SAMZA-123, it is possible to add the same SSP to multiple tasknames,
> although currently we check for this and error out if this is done. We
> should think through the implications of having the same SSP appear in
> multiple tasknames and support this if it makes sense.
> This could be used as a broadcast stream that's either added by Samza itself
> to each taskname, or individual groupers could do this as makes sense. Right
> now the container maintains a map of SSP to TaskInstance and delivers the ssp
> to that task instance. With this change, we'd need to change the map to SSP
> to Set[TaskInstance] and deliver the message to each TI in the set.