Re: Processing time series data in order
The batch size can be large, so in-memory ordering isn't an option, unfortunately.

On Thu, Dec 22, 2016 at 7:09 AM, Jesse Hodges wrote:
> Depending on the expected max out-of-order window, why not order them in
> memory? Then you don't need to reread from Cassandra; in case of a problem
> you can reread data from Kafka.
>
> -Jesse
>
> > On Dec 21, 2016, at 7:24 PM, Ali Akhtar wrote:
> >
> > - I'm receiving a batch of messages to a Kafka topic.
> >
> > Each message has a timestamp; however, the messages can arrive / get
> > processed out of order. I.e., event 1's timestamp could be a few seconds
> > before event 2's, and event 2 could still get processed before event 1.
> >
> > - I know the number of messages that are sent per batch.
> >
> > - I need to process the messages in order. The messages are basically
> > providing the history of an item. I need to be able to track the history
> > accurately (i.e., if an event occurred 3 times, I need to accurately log
> > the dates of the first, 2nd, and 3rd time it occurred).
> >
> > The approach I'm considering is:
> >
> > - Creating a Cassandra table which is ordered by the timestamp of the
> > messages.
> >
> > - Once a batch of messages has arrived, writing them all to Cassandra,
> > counting on them being ordered by the timestamp even if they are
> > processed out of order.
> >
> > - Then iterating over the messages in the Cassandra table, to process
> > them in order.
> >
> > However, I'm concerned about Cassandra's eventual consistency. Could it
> > be that even though I wrote the messages, they are not there when I try
> > to read them (which would be almost immediately after they are written)?
> >
> > Should I enforce consistency = ALL to make sure the messages will be
> > available immediately after being written?
> >
> > Is there a better way to handle this through either Kafka Streams or
> > Cassandra?
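When a batch is too large to sort in memory, one standard workaround is an external merge sort: sort fixed-size chunks, spill each sorted run to disk, then stream-merge the runs. A minimal sketch of that idea in Python (all names here are illustrative, not something proposed in the thread, and `pickle` stands in for whatever serialization the real pipeline uses):

```python
import heapq
import os
import pickle
import tempfile

def external_sort(events, chunk_size=100_000):
    """Yield (timestamp, payload) events in timestamp order even when the
    full batch doesn't fit in memory: sort chunk_size-sized chunks, spill
    each to a temp file, then lazily merge the sorted runs with a heap."""
    run_files = []
    chunk = []
    for event in events:
        chunk.append(event)
        if len(chunk) >= chunk_size:
            run_files.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        run_files.append(_spill(sorted(chunk)))
    # heapq.merge streams the runs; only one record per run is in memory.
    yield from heapq.merge(*(_read(path) for path in run_files))
    for path in run_files:
        os.unlink(path)

def _spill(sorted_chunk):
    """Write one sorted run to a temp file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for event in sorted_chunk:
            pickle.dump(event, f)
    return path

def _read(path):
    """Stream events back from a spilled run."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
```

Memory use is bounded by `chunk_size` plus one buffered record per run, so "too large for memory" only rules out a single in-memory sort, not ordering per se.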
Re: Processing time series data in order
Depending on the expected max out-of-order window, why not order them in
memory? Then you don't need to reread from Cassandra; in case of a problem
you can reread data from Kafka.

-Jesse

> On Dec 21, 2016, at 7:24 PM, Ali Akhtar wrote:
>
> - I'm receiving a batch of messages to a Kafka topic.
>
> Each message has a timestamp; however, the messages can arrive / get
> processed out of order. I.e., event 1's timestamp could be a few seconds
> before event 2's, and event 2 could still get processed before event 1.
>
> - I know the number of messages that are sent per batch.
>
> - I need to process the messages in order. The messages are basically
> providing the history of an item. I need to be able to track the history
> accurately (i.e., if an event occurred 3 times, I need to accurately log
> the dates of the first, 2nd, and 3rd time it occurred).
>
> The approach I'm considering is:
>
> - Creating a Cassandra table which is ordered by the timestamp of the
> messages.
>
> - Once a batch of messages has arrived, writing them all to Cassandra,
> counting on them being ordered by the timestamp even if they are processed
> out of order.
>
> - Then iterating over the messages in the Cassandra table, to process them
> in order.
>
> However, I'm concerned about Cassandra's eventual consistency. Could it be
> that even though I wrote the messages, they are not there when I try to
> read them (which would be almost immediately after they are written)?
>
> Should I enforce consistency = ALL to make sure the messages will be
> available immediately after being written?
>
> Is there a better way to handle this through either Kafka Streams or
> Cassandra?
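The in-memory ordering Jesse suggests is essentially a watermark buffer: hold events in a min-heap keyed by timestamp, and only release those older than the latest timestamp seen minus the max out-of-order window. A hedged sketch, with all class and method names invented for illustration:

```python
import heapq

class OrderingBuffer:
    """Reorders events that arrive at most max_delay out of order.
    An event delayed longer than max_delay could still be emitted
    late, so the window must cover the worst expected skew."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.heap = []           # min-heap of (timestamp, event)
        self.latest_seen = None  # highest timestamp observed so far

    def add(self, timestamp, event):
        """Buffer one event; return the events now safe to emit, in order."""
        heapq.heappush(self.heap, (timestamp, event))
        if self.latest_seen is None or timestamp > self.latest_seen:
            self.latest_seen = timestamp
        # Anything at or below the watermark can no longer be preceded
        # by a late arrival (under the max_delay assumption).
        watermark = self.latest_seen - self.max_delay
        ready = []
        while self.heap and self.heap[0][0] <= watermark:
            ready.append(heapq.heappop(self.heap))
        return ready

    def flush(self):
        """Drain remaining events in order, e.g. at end of a batch."""
        out = []
        while self.heap:
            out.append(heapq.heappop(self.heap))
        return out
```

Since the batch size is known (as the original post notes), the consumer also knows exactly when to call `flush()`.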
Processing time series data in order
- I'm receiving a batch of messages to a Kafka topic.

Each message has a timestamp; however, the messages can arrive / get processed out of order. I.e., event 1's timestamp could be a few seconds before event 2's, and event 2 could still get processed before event 1.

- I know the number of messages that are sent per batch.

- I need to process the messages in order. The messages are basically providing the history of an item. I need to be able to track the history accurately (i.e., if an event occurred 3 times, I need to accurately log the dates of the first, 2nd, and 3rd time it occurred).

The approach I'm considering is:

- Creating a Cassandra table which is ordered by the timestamp of the messages.

- Once a batch of messages has arrived, writing them all to Cassandra, counting on them being ordered by the timestamp even if they are processed out of order.

- Then iterating over the messages in the Cassandra table, to process them in order.

However, I'm concerned about Cassandra's eventual consistency. Could it be that even though I wrote the messages, they are not there when I try to read them (which would be almost immediately after they are written)?

Should I enforce consistency = ALL to make sure the messages will be available immediately after being written?

Is there a better way to handle this through either Kafka Streams or Cassandra?
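On the consistency question above: in Cassandra, a table clustered by the event timestamp does give per-partition ordering on read, and `ALL` is not the only way to read your own writes, since `ALL` fails if any replica is down. Reading and writing at `QUORUM` also guarantees it, because the acknowledged write replicas and the consulted read replicas must overlap whenever W + R > RF. A minimal sketch of that arithmetic (the functions are invented for illustration, not driver code):

```python
def replicas_required(level, replication_factor):
    """Replicas that must respond for a given Cassandra consistency level."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[level]

def read_your_writes(write_level, read_level, replication_factor):
    """Read-your-writes holds when the write and read replica sets are
    forced to overlap: W + R > RF."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return w + r > replication_factor
```

With the common RF of 3, `QUORUM`/`QUORUM` gives 2 + 2 > 3 and tolerates one replica being down on both the write and the read path, which `ALL` does not.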