Aditya created SAMZA-1613:
-----------------------------

             Summary: Stream-table join: Intermediate streams do not inherit 
bootstrap stream semantic
                 Key: SAMZA-1613
                 URL: https://issues.apache.org/jira/browse/SAMZA-1613
             Project: Samza
          Issue Type: Bug
            Reporter: Aditya


There are two issues with repartitioning a bootstrap stream: 
 * Samza does not propagate the bootstrap semantics to the intermediate stream. 
 * For a bootstrap stream, Samza gets the newest offset for that stream before 
consumption and considers the bootstrap to be complete once the message with 
that offset is consumed. This is clearly a problem for intermediate streams as 
there might not be any messages at the beginning of the job. 

Even though the first stage abides by the bootstrap semantics and does not 
consume from non-bootstrap streams until all the bootstrap stream partitions 
owned by that container are re-partitioned, there will be scenarios where one 
container finishes bootstrap faster than others and also the rate of 
consumption from the intermediate stream in the second stage might be slower. 
All these would result in the job breaking the bootstrap semantics across the 
two stages. Please note that this issue is not just specific to repartitioning 
but could happen with any intermediate streams. 

To support this, we will have to solve both the afore-mentioned issues: 
 * Propagate bootstrap semantics to the intermediate streams if the 
repartitioned stream is bootstrap stream. 
 * Either add end of bootstrap control message (note that all the containers 
have to co-ordinate here) or wait for event time support in Samza. 

This is clearly a problem for Stream-Table join. Consider the scenario where 
the load on the stream is much higher than the load on the stream marked as 
table. Consequently, the number of partitions in the stream will be higher than 
that of the table. To join these two, we will have to increase the number of 
partitions in the table by repartitioning, which is where we end up in the 
premature join of stream with the table, before seeding the table. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to