[
https://issues.apache.org/jira/browse/SAMZA-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129093#comment-14129093
]
Chris Riccomini commented on SAMZA-402:
---------------------------------------
bq. Say you write to the stream using a client (in some other system) which
doesn't do key partitioning in the same way, or which accidentally omits the
partitioning key. If we're consuming a single partition, that write will either
take effect (if it's in partition 0) or be ignored, but the outcome is
deterministic.
But in this case, isn't determinism beside the point? The state is effectively
corrupted, because some of the writes are silently disregarded. In this
scenario it seems more desirable to fail outright if the input stream has more
than one partition.
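To make the difference concrete, here is a rough sketch of the two behaviours. It is not Samza code: a stream is modelled as a plain list of partitions, each a list of (key, value) writes, and the function names are made up for illustration.

```python
def build_state_from_partition_zero(partitions):
    """Build the state from partition 0 only; writes elsewhere are silently ignored."""
    state = {}
    for key, value in partitions[0]:
        state[key] = value  # later writes to the same key overwrite earlier ones
    return state


def build_state_or_fail(partitions):
    """Refuse to build the state unless the stream has exactly one partition."""
    if len(partitions) != 1:
        raise ValueError(
            "expected exactly 1 partition for the global state stream, got %d"
            % len(partitions)
        )
    return build_state_from_partition_zero(partitions)


if __name__ == "__main__":
    # A mis-partitioned write to key "b" landed in partition 1.
    stream = [[("a", 1)], [("b", 2)]]
    print(build_state_from_partition_zero(stream))  # "b" is missing from the result
    try:
        build_state_or_fail(stream)
    except ValueError as error:
        print("failed fast:", error)
```

The first function is deterministic but loses the write to "b" without any signal; the second surfaces the problem as soon as the job starts.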
bq. So it seems to me that a single partition is less error-prone, and I can't
see a compelling advantage of using multiple partitions.
Two reasons that I like the multi-partition approach are:
# The Samza job reading the input stream for global state might not have
control over the partition count, for example when it consumes the changelog
of another Samza job to build its global state.
# I can foresee people pushing data from Hadoop (or some other mechanism) to a
topic that doesn't yet exist. The topic will then be created with the default
partition count, which in most real-world production clusters is > 1 (a wild
guess, but it's true at LI, anyway). The Samza job will then either disregard
all of the state except partition 0, or fail (depending on implementation).
The developer is then forced to either reduce the topic's partition count
(can't be done in Kafka), or create a new stream and delete the old one
(topics can't be deleted in Kafka either, yet).
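As a rough illustration of the multi-partition alternative (again not Samza code; the stream is modelled as a list of partitions of (key, value) writes, and the function name is hypothetical), the job could read every partition into the global state and flag keys that show up in more than one partition, since ordering between partitions is not defined:

```python
def build_global_state(partitions):
    """Merge all partitions into one state and report cross-partition keys.

    Returns (state, conflicts). A key in `conflicts` was written to more than
    one partition, so its final value depends on the read order used here
    (partition 0 first, then 1, and so on), which is an assumption of this sketch.
    """
    state = {}
    first_seen_in = {}  # key -> index of the partition where it first appeared
    conflicts = set()
    for index, partition in enumerate(partitions):
        for key, value in partition:
            if key in first_seen_in and first_seen_in[key] != index:
                conflicts.add(key)
            first_seen_in.setdefault(key, index)
            state[key] = value
    return state, conflicts


if __name__ == "__main__":
    stream = [[("a", 1), ("a", 2)], [("b", 3)], [("a", 9)]]
    state, conflicts = build_global_state(stream)
    print(state, conflicts)
```

With this shape no write is dropped regardless of the topic's partition count, and mis-partitioned keys can at least be detected rather than silently lost.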
> Provide a "shared state" store among StreamTasks
> ------------------------------------------------
>
> Key: SAMZA-402
> URL: https://issues.apache.org/jira/browse/SAMZA-402
> Project: Samza
> Issue Type: Bug
> Components: container, kv
> Affects Versions: 0.8.0
> Reporter: Chris Riccomini
> Attachments: DESIGN-SAMZA-402-0.md, DESIGN-SAMZA-402-0.pdf,
> DESIGN-SAMZA-402-1.md, DESIGN-SAMZA-402-1.pdf
>
>
> There has been a lot of discussion about shared state stores in SAMZA-353.
> Initially, it seemed as though we might implement them through SAMZA-353, but
> now it seems preferable to implement them separately. As such, this
> ticket is to discuss global state/shared state (terms that are being used
> interchangeably) between StreamTasks.