Processing time series data in order

2016-12-21 Thread Ali Akhtar
- I'm receiving a batch of messages to a Kafka topic.

Each message has a timestamp; however, the messages can arrive / get
processed out of order. I.e., event 1's timestamp could've been a few seconds
before event 2's, and event 2 could still get processed before event 1.

- I know the number of messages that are sent per batch.

- I need to process the messages in order. The messages are basically
providing the history of an item. I need to be able to track the history
accurately (i.e., if an event occurred 3 times, I need to accurately log the
dates of the first, 2nd, and 3rd time it occurred).

The approach I'm considering is:

- Creating a Cassandra table which is ordered by the timestamp of the
messages.

- Once a batch of messages has arrived, writing them all to Cassandra,
counting on them being ordered by the timestamp even if they are processed
out of order.

- Then iterating over the messages in the Cassandra table, to process them
in order.

However, I'm concerned about Cassandra's eventual consistency. Could it be
that even though I wrote the messages, they are not there when I try to
read them (which would be almost immediately after they are written)?

Should I enforce consistency = ALL to make sure the messages will be
available immediately after being written?

Is there a better way to handle this through either Kafka Streams or Cassandra?
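
On the consistency question above, a rough sketch of the usual rule of thumb may help: a read is expected to see an acknowledged write when the number of replicas that acknowledged the write plus the number of replicas consulted by the read exceeds the replication factor. The function names below are hypothetical and only illustrate that arithmetic; they are not part of any Cassandra driver, and the sketch assumes the write was acknowledged at the stated level.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a majority (QUORUM)."""
    return replication_factor // 2 + 1


def read_sees_acked_write(replication_factor: int,
                          write_replicas: int,
                          read_replicas: int) -> bool:
    """True when every read replica set must overlap every write replica set.

    Illustrative only: assumes the write was acknowledged by
    `write_replicas` replicas before the read is issued.
    """
    return write_replicas + read_replicas > replication_factor


rf = 3
# ONE / ONE: the read may hit a replica that has not received the write yet.
print(read_sees_acked_write(rf, 1, 1))
# QUORUM / QUORUM: the two replica sets always share at least one node.
print(read_sees_acked_write(rf, quorum(rf), quorum(rf)))
# ALL on write, ONE on read: also overlaps, but a single down replica fails the write.
print(read_sees_acked_write(rf, rf, 1))
```

Under these assumptions, writing and reading at QUORUM would already give read-your-writes behaviour, so ALL is one option rather than the only one.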


Re: Processing time series data in order

2016-12-21 Thread Ali Akhtar
The batch size can be large, so in-memory ordering isn't an option,
unfortunately.

On Thu, Dec 22, 2016 at 7:09 AM, Jesse Hodges wrote:

> Depending on the expected max out of order window, why not order them in
> memory? Then you don't need to reread from Cassandra, in case of a problem
> you can reread data from Kafka.
>
> -Jesse


Re: Processing time series data in order

2016-12-21 Thread Jesse Hodges
Depending on the expected max out-of-order window, why not order them in
memory? Then you don't need to reread from Cassandra; in case of a problem,
you can reread the data from Kafka.

-Jesse 
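
The in-memory idea above can be sketched as a small reordering buffer. This is only an illustration under an assumption not confirmed in the thread: no event arrives more than `max_delay_ms` behind the newest timestamp seen so far. The names `reorder` and `max_delay_ms` are made up for the example, and memory use is bounded by the out-of-order window rather than by the batch size.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, str]  # (timestamp_ms, payload)


def reorder(events: Iterable[Event], max_delay_ms: int) -> Iterator[Event]:
    """Yield events in timestamp order, buffering only the out-of-order window."""
    heap: List[Event] = []
    max_seen = None
    for event in events:
        heapq.heappush(heap, event)
        ts = event[0]
        max_seen = ts if max_seen is None else max(max_seen, ts)
        # Events at or below the watermark can no longer be overtaken
        # (given the bounded-lateness assumption), so release them.
        watermark = max_seen - max_delay_ms
        while heap and heap[0][0] <= watermark:
            yield heapq.heappop(heap)
    # End of input: flush whatever is still buffered, in order.
    while heap:
        yield heapq.heappop(heap)


arrivals = [(1000, "a"), (3000, "c"), (2000, "b"), (6000, "d"), (5000, "e")]
for ts, payload in reorder(arrivals, max_delay_ms=2000):
    print(ts, payload)
```

If an event can be later than the chosen window, it would be emitted out of order, so the window has to be picked from the worst lateness actually expected.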



Re: Processing time series data in order

2016-12-26 Thread Asaf Mesika
There is a much easier approach: you can route all messages of a given ID
to a specific partition. Since each partition has a single writer, you get
the ordering you wish for. Of course, this won't work if your updates occur
on different hosts.
Also, maybe Kafka Streams can help shard the messages based on item ID to a
second topic.


Re: Processing time series data in order

2016-12-26 Thread Ali Akhtar
How would I route the messages to a specific partition?



Re: Processing time series data in order

2016-12-27 Thread Tauzell, Dave
If you specify a key with each message then all messages with the same key get 
sent to the same partition.
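
A minimal sketch of why keying works, using a stand-in hash: Kafka's default partitioner hashes the key bytes (with murmur2) modulo the partition count, while this example uses `zlib.crc32` purely to stay within the standard library. `partition_for` and `route` are hypothetical helper names, not Kafka APIs.

```python
import zlib
from typing import List, Tuple

Message = Tuple[str, str]  # (key, value)


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (stand-in for Kafka's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages: List[Message], num_partitions: int) -> List[List[Message]]:
    """Append each message to the partition chosen by its key, in send order."""
    partitions: List[List[Message]] = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


sent = [("item-1", "created"), ("item-2", "created"),
        ("item-1", "updated"), ("item-1", "deleted")]
for index, partition in enumerate(route(sent, num_partitions=4)):
    print(index, partition)
```

Because the mapping depends only on the key and the partition count, every message for `item-1` ends up in one partition in the order it was appended. Changing the number of partitions changes the mapping.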



Re: Processing time series data in order

2016-12-28 Thread Ali Akhtar
This will only ensure the order of delivery though, not the actual order of
the events, right?

I.e., if the producer sends A, then B, but due to network lag or any other
reason B arrives before A, then B will be returned before A even if
they both went to the same partition. Am I correct about that?

Or can I use KTables to ensure A is processed before B? (Both messages will
have a timestamp which is being extracted by a TimestampExtractor.)



Re: Processing time series data in order

2016-12-29 Thread Ewen Cheslack-Postava
The best you can do to ensure ordering today is to set:

acks = all
retries = Integer.MAX_VALUE
max.block.ms = Long.MAX_VALUE
max.in.flight.requests.per.connection = 1

This ensures there's only one outstanding produce request (batch of
messages) at a time; it will be retried indefinitely on retriable errors,
it will be fully replicated before it is acked, and if you run out of
buffer space you will block indefinitely until some of the data is
successfully produced and frees up buffer space. This effectively makes the
scenario you describe impossible.

-Ewen
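
To illustrate why the in-flight limit matters, here is a toy simulation, not the real producer. It assumes a deliberately simplified model: up to `max_in_flight` batches are outstanding per round, a batch listed in `fail_first_attempt` fails once and is retried ahead of unsent batches, and the broker appends successes in arrival order. All names are invented for the example.

```python
from collections import deque
from typing import Iterable, List, Set


def simulate(batches: Iterable[str], max_in_flight: int,
             fail_first_attempt: Set[str]) -> List[str]:
    """Return the order in which batches land in the partition log."""
    pending = deque(batches)
    log: List[str] = []
    failed_once: Set[str] = set()
    while pending:
        # One round of outstanding requests, at most max_in_flight of them.
        in_flight = [pending.popleft()
                     for _ in range(min(max_in_flight, len(pending)))]
        retry = []
        for batch in in_flight:
            if batch in fail_first_attempt and batch not in failed_once:
                failed_once.add(batch)   # first attempt fails with a retriable error
                retry.append(batch)
            else:
                log.append(batch)        # broker appends in arrival order
        # Retried batches go back to the front of the queue, keeping their order.
        pending.extendleft(reversed(retry))
    return log


# Two requests in flight: A fails once, B succeeds, then A is retried after B.
print(simulate(["A", "B", "C"], max_in_flight=2, fail_first_attempt={"A"}))
# One request in flight: A is retried before B is ever sent.
print(simulate(["A", "B", "C"], max_in_flight=1, fail_first_attempt={"A"}))
```

In this model the first call ends with B ahead of A, while the second keeps the send order, which is the effect the `max.in.flight.requests.per.connection = 1` setting is meant to achieve.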
