[ https://issues.apache.org/jira/browse/SAMZA-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115682#comment-14115682 ]
Chinmay Soman commented on SAMZA-353:
-------------------------------------
Essentially, we are building a distributed read-only key-value store on top of
Kafka? That seems very useful.
I do have a couple of questions, though:
1) Priority of the bootstrap stream?
In the ip-domain case, marking the stream as 'bootstrap=True' works while the
container is starting up. In this phase, the MessageChooser simply prioritizes
'ip-domain' messages over those in 'page-views'. Neat!
However, what happens when a few hours pass and new data is written to
ip-domain? Do we give ip-domain higher priority again, or do we continue to
multiplex messages from the two streams?
Pros of always giving the bootstrap stream higher priority: we are always
guaranteed to have the latest data in the global state store.
Cons: this essentially brings the container to a halt until the bootstrap is
done.
My opinion: we give higher priority to the bootstrap streams only on startup.
After that, we treat all the streams as equally important and live with the
resulting staleness.
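To make the "only on startup" option concrete, here is a minimal sketch in
Python. It only loosely follows the update/choose shape of a MessageChooser and
is not Samza's actual API. The class name, the startup_newest_offsets argument,
and the (stream, offset, message) tuples are all made up for illustration, and
I am assuming "caught up" means "reached the offset that was newest when the
container started".

```python
from collections import deque


class StartupBootstrapChooser:
    """Hypothetical chooser: bootstrap streams win only until they catch up."""

    def __init__(self, startup_newest_offsets):
        # Assumption: stream name -> newest offset seen at container startup.
        # A value of None means the stream was empty, so nothing to bootstrap.
        self.targets = dict(startup_newest_offsets)
        self.lagging = {s for s, off in self.targets.items() if off is not None}
        # Pending (stream, offset, message) envelopes in arrival order.
        self.buffered = deque()

    def update(self, stream, offset, message):
        self.buffered.append((stream, offset, message))

    def choose(self):
        if not self.buffered:
            return None
        if self.lagging:
            # Still bootstrapping: only lagging bootstrap streams are eligible.
            for i, envelope in enumerate(self.buffered):
                stream, offset, _ = envelope
                if stream in self.lagging:
                    del self.buffered[i]
                    if offset >= self.targets[stream]:
                        # Reached the startup high-water mark: stop prioritizing.
                        self.lagging.discard(stream)
                    return envelope
            # Hold back everything else until more bootstrap data arrives.
            return None
        # After bootstrap, all streams are equal: plain arrival order.
        return self.buffered.popleft()
```

With this shape, a late write to ip-domain is just another message in the
queue, so page-views is never blocked after startup. The cost is the staleness
mentioned above.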
2) Reading the bootstrap stream (ip-domain) during startup?
For a given container, do we still read from all the partitions, or do we read
only from the partition(s) assigned to that container? It seems to me that
under this design you should read only from the assigned partitions. Can you
confirm?
If we do indeed read from different partitions for ip-domain (and then use
Kafka to make sure all the containers get all the data), what guarantees that
all the containers have fully bootstrapped the global state store? This new
technique is more asynchronous, since the writes and reads are separated by
Kafka.
Happy to talk in person if I'm not making any sense :)
> Support assigning the same SSP to multiple tasknames
> ----------------------------------------------------
>
> Key: SAMZA-353
> URL: https://issues.apache.org/jira/browse/SAMZA-353
> Project: Samza
> Issue Type: Bug
> Components: container
> Affects Versions: 0.8.0
> Reporter: Jakob Homan
> Attachments: DESIGN-SAMZA-353-0.md, DESIGN-SAMZA-353-0.pdf
>
>
> Post SAMZA-123, it is possible to add the same SSP to multiple tasknames,
> although currently we check for this and error out if this is done. We
> should think through the implications of having the same SSP appear in
> multiple tasknames and support this if it makes sense.
> This could be used as a broadcast stream that's either added by Samza itself
> to each taskname, or individual groupers could do this as makes sense. Right
> now the container maintains a map of SSP to TaskInstance and delivers the ssp
> to that task instance. With this change, we'd need to change the map to SSP
> to Set[TaskInstance] and deliver the message to each TI in the set.
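The map change described in the issue can be sketched roughly as follows. This
is a Python illustration under stated assumptions, not Samza code: Container,
register, deliver, and RecordingTask are hypothetical names, and an SSP is
modeled as a plain (system, stream, partition) tuple.

```python
from collections import defaultdict


class Container:
    """Hypothetical container that fans one SSP out to several task instances."""

    def __init__(self):
        # SSP -> set of task instances (previously SSP -> one task instance).
        self.ssp_to_tasks = defaultdict(set)

    def register(self, ssp, task):
        self.ssp_to_tasks[ssp].add(task)

    def deliver(self, ssp, message):
        # Hand the same message to every task instance registered for this SSP.
        tasks = self.ssp_to_tasks.get(ssp, ())
        for task in tasks:
            task.process(ssp, message)
        return len(tasks)  # number of task instances that received the message


class RecordingTask:
    """Stand-in task instance that just records what it was given."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def process(self, ssp, message):
        self.received.append((ssp, message))
```

Registering the same ("kafka", "ip-domain", 0) tuple for two tasks makes
deliver call process on both, which is the broadcast behaviour the issue asks
about. An SSP registered for a single task behaves as before.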