[ 
https://issues.apache.org/jira/browse/SAMZA-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya reassigned SAMZA-1613:
-----------------------------

    Assignee: Ahmed Abdul Hamid

> Stream-table join: Intermediate streams do not inherit bootstrap stream 
> semantic
> --------------------------------------------------------------------------------
>
>                 Key: SAMZA-1613
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1613
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Aditya
>            Assignee: Ahmed Abdul Hamid
>            Priority: Major
>
> There are two issues with repartitioning a bootstrap stream: 
>  * Samza does not propagate the bootstrap semantics to the intermediate 
> stream. 
>  * For a bootstrap stream, Samza gets the newest offset for that stream 
> before consumption and considers the bootstrap to be complete once the 
> message with that offset is consumed. This is clearly a problem for 
> intermediate streams as there might not be any messages at the beginning of 
> the job. 
> Even though the first stage abides by the bootstrap semantics and does not 
> consume from non-bootstrap streams until all the bootstrap stream partitions 
> owned by that container are re-partitioned, there will be scenarios where one 
> container finishes bootstrap faster than others and also the rate of 
> consumption from the intermediate stream in the second stage might be slower. 
> All these would result in the job breaking the bootstrap semantics across the 
> two stages. Please note that this issue is not just specific to 
> repartitioning but could happen with any intermediate streams. 
> To support this, we will have to solve both the afore-mentioned issues: 
>  * Propagate bootstrap semantics to the intermediate streams if the 
> repartitioned stream is bootstrap stream. 
>  * Either add end of bootstrap control message (note that all the containers 
> have to co-ordinate here) or wait for event time support in Samza. 
> This is clearly a problem for Stream-Table join. Consider the scenario where 
> the load on the stream is much higher than the load on the stream marked as 
> table. Consequently, the number of partitions in the stream will be higher 
> than that of the table. To join these two, we will have to increase the 
> number of partitions in the table by repartitioning, which is where we end up 
> in the premature join of stream with the table, before seeding the table. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to