[ https://issues.apache.org/jira/browse/SAMZA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Riccomini updated SAMZA-461:
----------------------------------
    Fix Version/s: 0.9.0

> Race when initializing offsets at job startup leads to skipped messages
> -----------------------------------------------------------------------
>
>                 Key: SAMZA-461
>                 URL: https://issues.apache.org/jira/browse/SAMZA-461
>             Project: Samza
>          Issue Type: Bug
>          Components: kafka
>            Reporter: Ben Kirwin
>             Fix For: 0.9.0
>
>
> If the default offset is set to oldest, a Samza job should start from the 
> very beginning of the stream:
> {code}
> systems.kafka.samza.offset.default=oldest
> {code}
> However, if the very first messages are added to the stream while the job is 
> booting up, it's possible for those messages to be skipped entirely.
> When there are no messages in a stream, Samza reads the 'oldest' offset as 
> null. This null value is added to the map of starting offsets in the offset 
> manager. When the Kafka broker proxy gets the null offset, it complains:
> {code}
> It appears that we received an invalid or empty offset [...] Attempting to 
> use Kafka's auto.offset.reset setting. This can result in data loss if 
> processing continues.
> {code}
> If auto.offset.reset is not manually configured, the consumer defaults to 
> starting from the latest offset. If messages have appeared in the stream in 
> the meantime, the job will start *after* those messages, and data is indeed 
> lost.
> It seems like setting oldestOffset to equal upcomingOffset would solve the 
> issue. (It's also semantically reasonable -- the upcoming offset is indeed 
> the oldest offset that will ever be read.)
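
The fix suggested at the end of the description can be sketched roughly as follows. This is a minimal, self-contained Java illustration and not Samza's actual code: `StartingOffsets`, `PartitionMetadata`, `resolveOldest`, and `startingOffsets` are hypothetical names, and offsets are modeled as plain strings where null means the partition was empty when its metadata was fetched.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these types are Samza classes.
final class StartingOffsets {

    // Stand-in for per-partition stream metadata. A null oldestOffset models
    // the empty-partition case described in the report.
    static final class PartitionMetadata {
        final String oldestOffset;
        final String upcomingOffset;

        PartitionMetadata(String oldestOffset, String upcomingOffset) {
            this.oldestOffset = oldestOffset;
            this.upcomingOffset = upcomingOffset;
        }
    }

    // Proposed behavior: for an empty partition, treat the upcoming offset as
    // the oldest offset instead of handing a null to the consumer.
    static String resolveOldest(PartitionMetadata metadata) {
        return metadata.oldestOffset != null
                ? metadata.oldestOffset
                : metadata.upcomingOffset;
    }

    // Builds the map of starting offsets (partition id -> offset) for a job
    // configured with an "oldest" default offset.
    static Map<Integer, String> startingOffsets(Map<Integer, PartitionMetadata> byPartition) {
        Map<Integer, String> result = new HashMap<>();
        for (Map.Entry<Integer, PartitionMetadata> entry : byPartition.entrySet()) {
            result.put(entry.getKey(), resolveOldest(entry.getValue()));
        }
        return result;
    }
}
```

Under these assumptions, an empty partition whose upcoming offset is "0" gets a starting offset of "0" rather than null, so the consumer never has to fall back on auto.offset.reset and messages written during startup are still read.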



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
