[ 
https://issues.apache.org/jira/browse/SAMZA-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129093#comment-14129093
 ] 

Chris Riccomini commented on SAMZA-402:
---------------------------------------

bq. Say you write to the stream using a client (in some other system) which 
doesn't do key partitioning in the same way, or which accidentally omits the 
partitioning key. If we're consuming a single partition, that write will either 
take effect (if it's in partition 0) or be ignored, but the outcome is 
deterministic.

But in this case, isn't the determinism kind of irrelevant, given that the 
state is basically corrupted (a portion of the writes is totally 
disregarded)? In this scenario, it seems more desirable to just fail outright 
if the input stream has more than one partition.
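
A minimal sketch of that fail-fast option, in Java. The class and method names 
are hypothetical (they are not Samza APIs), and the partition count is assumed 
to have been looked up elsewhere, e.g. from the system's stream metadata.

```java
final class GlobalStatePartitionCheck {

    /**
     * Hypothetical helper: rejects a global-state input stream that does not
     * have exactly one partition, instead of silently ignoring partitions > 0.
     */
    static void requireSinglePartition(String streamName, int partitionCount) {
        if (partitionCount != 1) {
            throw new IllegalStateException(
                "Global state stream '" + streamName + "' has " + partitionCount
                    + " partitions; exactly 1 is required");
        }
    }
}
```

Calling this at container startup would surface the misconfiguration 
immediately rather than leaving a partially populated store.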

bq. So it seems to me that a single partition is less error-prone, and I can't 
see a compelling advantage of using multiple partitions.

Two reasons that I like the multi-partition approach are:

# The Samza job reading the input stream for global state might not have 
control over the partition count, for example if it's consuming from the 
changelog of another Samza job to build its global state.
# I can foresee people pushing data from Hadoop (or some other mechanism) to a 
topic that doesn't yet exist. When this happens, the default partition count 
will be used, which in most real-world production clusters is greater than 1 
(a wild guess, but it's true at LI, anyway). The Samza job will then either 
disregard all of the state except partition 0, or fail (depending on 
implementation). The developer will then be forced to either reduce the 
topic's partition count (can't be done in Kafka), or create a new stream and 
delete the old one (topics can't be deleted in Kafka either, yet).
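
To make the difference concrete, here is a small self-contained Java sketch 
that builds a key-value view of the global state two ways: from partition 0 
only, and from every partition. `GlobalStateBuilder` and `Write` are 
hypothetical stand-ins for illustration; no real Samza or Kafka classes are 
used.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GlobalStateBuilder {

    /** Hypothetical stand-in for one message written to the global-state stream. */
    static final class Write {
        final int partition;
        final String key;
        final String value;

        Write(int partition, String key, String value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    /** Single-partition approach: writes outside partition 0 are silently dropped. */
    static Map<String, String> fromPartitionZero(List<Write> writes) {
        Map<String, String> state = new HashMap<>();
        for (Write w : writes) {
            if (w.partition == 0) {
                state.put(w.key, w.value);
            }
        }
        return state;
    }

    /**
     * Multi-partition approach: every write is applied. The list order stands in
     * for arrival order; across partitions that order is not guaranteed, so two
     * writes to the same key in different partitions may resolve either way.
     */
    static Map<String, String> fromAllPartitions(List<Write> writes) {
        Map<String, String> state = new HashMap<>();
        for (Write w : writes) {
            state.put(w.key, w.value);
        }
        return state;
    }
}
```

With one write in partition 0 and another in partition 3, the first method 
keeps only the partition-0 key, while the second keeps both.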

> Provide a "shared state" store among StreamTasks
> ------------------------------------------------
>
>                 Key: SAMZA-402
>                 URL: https://issues.apache.org/jira/browse/SAMZA-402
>             Project: Samza
>          Issue Type: Bug
>          Components: container, kv
>    Affects Versions: 0.8.0
>            Reporter: Chris Riccomini
>         Attachments: DESIGN-SAMZA-402-0.md, DESIGN-SAMZA-402-0.pdf, 
> DESIGN-SAMZA-402-1.md, DESIGN-SAMZA-402-1.pdf
>
>
> There has been a lot of discussion about shared state stores in SAMZA-353. 
> Initially, it seemed as though we might implement them through SAMZA-353, but 
> now it seems preferable to implement them separately. As such, this 
> ticket is to discuss global state/shared state (terms that are being used 
> interchangeably) between StreamTasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
