[ https://issues.apache.org/jira/browse/SAMZA-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aditya reassigned SAMZA-1613: ----------------------------- Assignee: Ahmed Abdul Hamid > Stream-table join: Intermediate streams do not inherit bootstrap stream > semantic > -------------------------------------------------------------------------------- > > Key: SAMZA-1613 > URL: https://issues.apache.org/jira/browse/SAMZA-1613 > Project: Samza > Issue Type: Bug > Reporter: Aditya > Assignee: Ahmed Abdul Hamid > Priority: Major > > There are two issues with repartitioning a bootstrap stream: > * Samza does not propagate the bootstrap semantics to the intermediate > stream. > * For a bootstrap stream, Samza gets the newest offset for that stream > before consumption and considers the bootstrap to be complete once the > message with that offset is consumed. This is clearly a problem for > intermediate streams as there might not be any messages at the beginning of > the job. > Even though the first stage abides by the bootstrap semantics and does not > consume from non-bootstrap streams until all the bootstrap stream partitions > owned by that container are re-partitioned, there will be scenarios where one > container finishes bootstrap faster than others and also the rate of > consumption from the intermediate stream in the second stage might be slower. > All these would result in the job breaking the bootstrap semantics across the > two stages. Please note that this issue is not just specific to > repartitioning but could happen with any intermediate streams. > To support this, we will have to solve both the afore-mentioned issues: > * Propagate bootstrap semantics to the intermediate streams if the > repartitioned stream is bootstrap stream. > * Either add end of bootstrap control message (note that all the containers > have to co-ordinate here) or wait for event time support in Samza. > This is clearly a problem for Stream-Table join. Consider the scenario where > the load on the stream is much higher than the load on the stream marked as > table. Consequently, the number of partitions in the stream will be higher > than that of the table. To join these two, we will have to increase the > number of partitions in the table by repartitioning, which is where we end up > in the premature join of stream with the table, before seeding the table. -- This message was sent by Atlassian JIRA (v7.6.3#76005)