Re: Most common Kafka client consumer implementations?

2014-08-13 Thread Anand Nalya
Hi Jim,

In one of the applications, we implemented option #1:

messageList = getNext(1000)
process(messageList)
commit()

In case of failure, this resulted in duplicate processing for at most 1000
records per partition.
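
For reference, a minimal Java sketch of that loop, assuming the 0.8
high-level consumer with auto-commit disabled; the topic name, group id, and
process() hook are illustrative placeholders rather than anything from the
original message:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class BatchCommitConsumer {

    // Placeholder for the application's work on one batch of messages.
    static void process(List<byte[]> batch) { /* application logic */ }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181"); // illustrative
        props.put("group.id", "batch-processor");         // illustrative
        props.put("auto.commit.enable", "false");         // commit manually per batch

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        KafkaStream<byte[], byte[]> stream = connector
                .createMessageStreams(Collections.singletonMap("my-topic", 1))
                .get("my-topic").get(0);

        List<byte[]> batch = new ArrayList<byte[]>();
        ConsumerIterator<byte[], byte[]> it = stream.iterator();
        while (it.hasNext()) {
            batch.add(it.next().message());
            if (batch.size() >= 1000) {
                process(batch);            // process the whole batch first
                connector.commitOffsets(); // then commit; a crash before this
                                           // line replays at most this batch
                batch.clear();
            }
        }
    }
}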

Regards,
Anand


On 1 August 2014 20:35, Jim jimi...@gmail.com wrote:

 Thanks Guozhang,

 I was looking for actual real-world workflows. I realize you can commit
 after each message, but if you're using ZK for offsets, for instance, you'll
 put too much write load on the nodes and crush your throughput. So I was
 interested in batching strategies people have used that balance high
 throughput against fully committed events.


 On Thu, Jul 31, 2014 at 8:16 AM, Guozhang Wang wangg...@gmail.com wrote:

  Hi Jim,
 
  Whether to use the high level consumer or the simple consumer depends on
  your use case. If you need to manually manage partition assignments among
  your consumers, or you need to commit your offsets somewhere other than ZK,
  or you do not want automatic rebalancing of consumers upon failures, etc.,
  you will use the simple consumer; otherwise use the high level consumer.

  From your description of pulling a batch of messages it seems you are
  currently using the simple consumer. Suppose you are using the high level
  consumer; to achieve at-least-once you can basically do something like:
 
  message = consumer.iter.next()
  process(message)
  consumer.commit()
 
  which is effectively the same as option 2 when using a simple consumer. Of
  course, doing so has the heavy overhead of one commit per message; you can
  also do option 1, at the cost of duplicates, which is tolerable for
  at-least-once.
 
  Guozhang
 
 
  On Wed, Jul 30, 2014 at 8:25 PM, Jim jimi...@gmail.com wrote:
 
   Curious on a couple questions...
  
   Are most people (are you?) using the simple consumer vs the high level
   consumer in production?
  
  
   What is the common processing paradigm for maintaining a full pipeline for
   Kafka consumers for at-least-once messaging? E.g. you pull a batch of 1000
   messages and:
  
   option 1.
   you wait for the slowest worker to finish working on that message; when you
   get back 1000 acks internally you commit your offset and pull another batch
  
   option 2.
   you feed your workers n msgs at a time in sequence and move your offset up
   as you work through your batch
  
   option 3.
   you maintain a full stream of 1000 messages ideally and as you get acks
   back from your workers you see if you can move your offset up in the stream
   to pull n more messages to fill up your pipeline so you're not blocked by
   the slowest consumer (probability wise)
  
  
   any good docs or articles on the subject would be great, thanks!
  
 
 
 
  --
  -- Guozhang
 



Re: Most common Kafka client consumer implementations?

2014-08-01 Thread Jim
Thanks Guozhang,

I was looking for actual real-world workflows. I realize you can commit
after each message, but if you're using ZK for offsets, for instance, you'll
put too much write load on the nodes and crush your throughput. So I was
interested in batching strategies people have used that balance high
throughput against fully committed events.
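
One batching strategy in this vein, sketched here purely as an illustration
(it is not something prescribed in this thread), is to process messages one
at a time but commit offsets only every N messages or every few seconds,
whichever comes first, so the write rate to ZK stays bounded. The sketch
assumes an 0.8 high-level ConsumerConnector and KafkaStream already set up
with auto-commit disabled; all names are illustrative:

import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class PeriodicCommitLoop {

    // Placeholder for the application's per-message work.
    static void process(byte[] message) { /* application logic */ }

    // Process one message at a time, but commit offsets only every
    // commitEveryN messages or commitIntervalMs milliseconds, whichever
    // comes first. On failure, at most the uncommitted tail is replayed.
    static void run(ConsumerConnector connector,
                    KafkaStream<byte[], byte[]> stream,
                    int commitEveryN, long commitIntervalMs) {
        ConsumerIterator<byte[], byte[]> it = stream.iterator();
        int uncommitted = 0;
        long lastCommit = System.currentTimeMillis();
        while (it.hasNext()) {
            process(it.next().message());
            uncommitted++;
            long now = System.currentTimeMillis();
            if (uncommitted >= commitEveryN || now - lastCommit >= commitIntervalMs) {
                connector.commitOffsets();
                uncommitted = 0;
                lastCommit = now;
            }
        }
    }
}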


On Thu, Jul 31, 2014 at 8:16 AM, Guozhang Wang wangg...@gmail.com wrote:

 Hi Jim,

 Whether to use the high level consumer or the simple consumer depends on
 your use case. If you need to manually manage partition assignments among
 your consumers, or you need to commit your offsets somewhere other than ZK,
 or you do not want automatic rebalancing of consumers upon failures, etc.,
 you will use the simple consumer; otherwise use the high level consumer.

 From your description of pulling a batch of messages it seems you are
 currently using the simple consumer. Suppose you are using the high level
 consumer; to achieve at-least-once you can basically do something like:

 message = consumer.iter.next()
 process(message)
 consumer.commit()

 which is effectively the same as option 2 when using a simple consumer. Of
 course, doing so has the heavy overhead of one commit per message; you can
 also do option 1, at the cost of duplicates, which is tolerable for
 at-least-once.

 Guozhang


 On Wed, Jul 30, 2014 at 8:25 PM, Jim jimi...@gmail.com wrote:

  Curious on a couple questions...
 
  Are most people (are you?) using the simple consumer vs the high level
  consumer in production?
 
 
  What is the common processing paradigm for maintaining a full pipeline for
  Kafka consumers for at-least-once messaging? E.g. you pull a batch of 1000
  messages and:
 
  option 1.
  you wait for the slowest worker to finish working on that message; when you
  get back 1000 acks internally you commit your offset and pull another batch
 
  option 2.
  you feed your workers n msgs at a time in sequence and move your offset up
  as you work through your batch
 
  option 3.
  you maintain a full stream of 1000 messages ideally and as you get acks
  back from your workers you see if you can move your offset up in the stream
  to pull n more messages to fill up your pipeline so you're not blocked by
  the slowest consumer (probability wise)
 
 
  any good docs or articles on the subject would be great, thanks!
 



 --
 -- Guozhang



Re: Most common Kafka client consumer implementations?

2014-07-31 Thread Guozhang Wang
Hi Jim,

Whether to use the high level consumer or the simple consumer depends on
your use case. If you need to manually manage partition assignments among
your consumers, or you need to commit your offsets somewhere other than ZK,
or you do not want automatic rebalancing of consumers upon failures, etc.,
you will use the simple consumer; otherwise use the high level consumer.

From your description of pulling a batch of messages it seems you are
currently using the simple consumer. Suppose you are using the high level
consumer; to achieve at-least-once you can basically do something like:

message = consumer.iter.next()
process(message)
consumer.commit()

which is effectively the same as option 2 when using a simple consumer. Of
course, doing so has the heavy overhead of one commit per message; you can
also do option 1, at the cost of duplicates, which is tolerable for
at-least-once.
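
In Java, that per-message loop might look roughly like the sketch below. It
assumes an 0.8 high-level ConsumerConnector and KafkaStream created elsewhere
with auto.commit.enable=false; the process() hook is a placeholder:

import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class CommitPerMessageLoop {

    // Placeholder for the application's per-message work.
    static void process(byte[] message) { /* application logic */ }

    static void run(ConsumerConnector connector, KafkaStream<byte[], byte[]> stream) {
        ConsumerIterator<byte[], byte[]> it = stream.iterator();
        while (it.hasNext()) {
            byte[] message = it.next().message();
            process(message);          // finish the work first...
            connector.commitOffsets(); // ...then commit, so a crash replays at
                                       // most the message currently in flight
        }
    }
}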

Guozhang


On Wed, Jul 30, 2014 at 8:25 PM, Jim jimi...@gmail.com wrote:

 Curious on a couple questions...

 Are most people (are you?) using the simple consumer vs the high level
 consumer in production?


 What is the common processing paradigm for maintaining a full pipeline for
 Kafka consumers for at-least-once messaging? E.g. you pull a batch of 1000
 messages and:

 option 1.
 you wait for the slowest worker to finish working on that message; when you
 get back 1000 acks internally you commit your offset and pull another batch

 option 2.
 you feed your workers n msgs at a time in sequence and move your offset up
 as you work through your batch

 option 3.
 you maintain a full stream of 1000 messages ideally and as you get acks
 back from your workers you see if you can move your offset up in the stream
 to pull n more messages to fill up your pipeline so you're not blocked by
 the slowest consumer (probability wise)


 any good docs or articles on the subject would be great, thanks!
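
To make option 3 above concrete: the per-partition bookkeeping behind it is
typically a "low watermark" tracker. Workers ack offsets out of order, and
the offset you are allowed to commit only advances across a contiguous run of
acks. A minimal sketch follows (the class and method names are illustrative,
and the actual commit call, whether to ZK or via an offset-commit request, is
left out):

import java.util.TreeSet;

public class AckTracker {

    private long committable;  // next offset we are allowed to commit
    private final TreeSet<Long> acked = new TreeSet<Long>(); // acks ahead of it

    public AckTracker(long startOffset) {
        this.committable = startOffset;
    }

    // Called by a worker once the message at `offset` is fully processed.
    public synchronized void ack(long offset) {
        acked.add(offset);
        // Slide the watermark forward over any contiguous run of acks.
        while (!acked.isEmpty() && acked.first() == committable) {
            acked.pollFirst();
            committable++;
        }
    }

    // A committer thread would periodically read this and write it to the
    // offset store; messages beyond it may still be in flight.
    public synchronized long committableOffset() {
        return committable;
    }
}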




-- 
-- Guozhang