RE: New consumer API waits indefinitely

2016-04-12 Thread Lohith Samaga M
Dear All,
After a system restart, the new consumer is working as expected.

Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga




-Original Message-
From: Lohith Samaga M [mailto:lohith.sam...@mphasis.com] 
Sent: Tuesday, April 12, 2016 17.00
To: users@kafka.apache.org
Subject: RE: New consumer API waits indefinitely

Dear All,
I installed Kafka on a Linux VM.
Here too:
1. The producer is able to store messages in the topic (sent from 
Windows host).
2. The consumer is unable to read it either from Windows host or from 
kafka-console-consumer on the Linux VM console.

In the logs, I see:
[2016-04-12 16:51:00,672] INFO [GroupCoordinator 0]: Stabilized group console-consumer-39913 generation 1 (kafka.coordinator.GroupCoordinator)
[2016-04-12 16:51:00,676] INFO [GroupCoordinator 0]: Assignment received from leader for group console-consumer-39913 for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-04-12 16:51:09,638] INFO [GroupCoordinator 0]: Preparing to restabilize group console-consumer-39913 with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-04-12 16:51:09,640] INFO [GroupCoordinator 0]: Group console-consumer-39913 generation 1 is dead and removed (kafka.coordinator.GroupCoordinator)
[2016-04-12 16:53:08,489] INFO [Group Metadata Manager on Broker 0]: Removed 0 expired offsets in 1 milliseconds. (kafka.coordinator.GroupMetadataManager)

When I run my Java code, I still get the exception - 
org.apache.kafka.clients.consumer.internals.SendFailedException


So, is it advisable to use the old consumer on Kafka 0.9.0.1?

Please help.

Thanks in advance.


Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga



-Original Message-
From: Lohith Samaga M [mailto:lohith.sam...@mphasis.com]
Sent: Tuesday, April 05, 2016 13.36
To: users@kafka.apache.org
Subject: RE: New consumer API waits indefinitely

Hi Ismael, Niko,
After cleaning up the zookeeper and kafka logs, I do not get the below 
server exception anymore. I think Kafka did not like me opening the .log file 
in notepad.

The only exception that I now get is 
org.apache.kafka.clients.consumer.internals.SendFailedException in 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.RequestFutureCompletionHandler.
After that, the consumer goes into a loop.

Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga



-Original Message-
From: Lohith Samaga M [mailto:lohith.sam...@mphasis.com]
Sent: Tuesday, April 05, 2016 12.38
To: users@kafka.apache.org
Subject: RE: New consumer API waits indefinitely

Hi Ismael,
I see the following exception when I (re)start Kafka (even a fresh 
setup after the previous one). And where is the configuration to set the data 
directory for Kafka (not the logs)?

java.io.IOException: The requested operation cannot be performed on a file with a user-mapped section open
at java.io.RandomAccessFile.setLength(Native Method)
at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:285)
at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:276)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
at kafka.log.OffsetIndex.resize(OffsetIndex.scala:276)
at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply$mcV$sp(OffsetIndex.scala:265)
at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
at kafka.log.OffsetIndex.trimToValidSize(OffsetIndex.scala:264)
at kafka.log.LogSegment.recover(LogSegment.scala:199)
at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:188)
at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:160)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777)
at kafka.log.Log.loadSegments(Log.scala:160)
at kafka.log.Log.<init>(Log.scala:90)
at kafka.log.LogManager.createLog(LogManager.scala:357)
at kafka.cluster.Partition.getOrCreateReplica(Partition.scala:91)
at kafka.cluster.Partition$$anonfun$4$$anonfun$apply$2.apply(Partition.scala:173)
at kafka.cluster.Partition$$anonfun$4$$anonfun$apply$2.apply(Partition.scala:173)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:79)
at kafka.cluster.Partition$$anonfun$4.apply(Partition.scala:173)
at kafka.cluster.Partition$$anonfun$4.apply

RE: New consumer API waits indefinitely

2016-04-12 Thread Lohith Samaga M
Dear All,
I installed Kafka on a Linux VM.
Here too:
1. The producer is able to store messages in the topic (sent from 
Windows host).
2. The consumer is unable to read it either from Windows host or from 
kafka-console-consumer on the Linux VM console.

In the logs, I see:
[2016-04-12 16:51:00,672] INFO [GroupCoordinator 0]: Stabilized group console-consumer-39913 generation 1 (kafka.coordinator.GroupCoordinator)
[2016-04-12 16:51:00,676] INFO [GroupCoordinator 0]: Assignment received from leader for group console-consumer-39913 for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-04-12 16:51:09,638] INFO [GroupCoordinator 0]: Preparing to restabilize group console-consumer-39913 with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-04-12 16:51:09,640] INFO [GroupCoordinator 0]: Group console-consumer-39913 generation 1 is dead and removed (kafka.coordinator.GroupCoordinator)
[2016-04-12 16:53:08,489] INFO [Group Metadata Manager on Broker 0]: Removed 0 expired offsets in 1 milliseconds. (kafka.coordinator.GroupMetadataManager)

When I run my Java code, I still get the exception - 
org.apache.kafka.clients.consumer.internals.SendFailedException


So, is it advisable to use the old consumer on Kafka 0.9.0.1?

Please help.

Thanks in advance.


Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga



-Original Message-
From: Lohith Samaga M [mailto:lohith.sam...@mphasis.com] 
Sent: Tuesday, April 05, 2016 13.36
To: users@kafka.apache.org
Subject: RE: New consumer API waits indefinitely

Hi Ismael, Niko,
After cleaning up the zookeeper and kafka logs, I do not get the below 
server exception anymore. I think Kafka did not like me opening the .log file 
in notepad.

The only exception that I now get is 
org.apache.kafka.clients.consumer.internals.SendFailedException in 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.RequestFutureCompletionHandler.
After that, the consumer goes into a loop.

Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga



-Original Message-
From: Lohith Samaga M [mailto:lohith.sam...@mphasis.com]
Sent: Tuesday, April 05, 2016 12.38
To: users@kafka.apache.org
Subject: RE: New consumer API waits indefinitely

Hi Ismael,
I see the following exception when I (re)start Kafka (even a fresh 
setup after the previous one). And where is the configuration to set the data 
directory for Kafka (not the logs)?

java.io.IOException: The requested operation cannot be performed on a file with a user-mapped section open
at java.io.RandomAccessFile.setLength(Native Method)
at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:285)
at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:276)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
at kafka.log.OffsetIndex.resize(OffsetIndex.scala:276)
at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply$mcV$sp(OffsetIndex.scala:265)
at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
at kafka.log.OffsetIndex.trimToValidSize(OffsetIndex.scala:264)
at kafka.log.LogSegment.recover(LogSegment.scala:199)
at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:188)
at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:160)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777)
at kafka.log.Log.loadSegments(Log.scala:160)
at kafka.log.Log.<init>(Log.scala:90)
at kafka.log.LogManager.createLog(LogManager.scala:357)
at kafka.cluster.Partition.getOrCreateReplica(Partition.scala:91)
at kafka.cluster.Partition$$anonfun$4$$anonfun$apply$2.apply(Partition.scala:173)
at kafka.cluster.Partition$$anonfun$4$$anonfun$apply$2.apply(Partition.scala:173)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:79)
at kafka.cluster.Partition$$anonfun$4.apply(Partition.scala:173)
at kafka.cluster.Partition$$anonfun$4.apply(Partition.scala:165)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:270)
at kafka.cluster.Partition.makeLeader(Partition.scala:165)
at kafka.server.ReplicaManager$$anonfun$makeLeaders$4.apply(ReplicaManager.scala:692)
at kafka.server.ReplicaManager$$anonfun$makeLeaders

RE: New consumer API waits indefinitely

2016-04-05 Thread Lohith Samaga M
Hi Ismael, Niko,
After cleaning up the zookeeper and kafka logs, I do not get the below 
server exception anymore. I think Kafka did not like me opening the .log file 
in notepad.

The only exception that I now get is 
org.apache.kafka.clients.consumer.internals.SendFailedException in 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.RequestFutureCompletionHandler.
After that, the consumer goes into a loop.

Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga



-Original Message-
From: Lohith Samaga M [mailto:lohith.sam...@mphasis.com] 
Sent: Tuesday, April 05, 2016 12.38
To: users@kafka.apache.org
Subject: RE: New consumer API waits indefinitely

Hi Ismael,
I see the following exception when I (re)start Kafka (even a fresh 
setup after the previous one). And where is the configuration to set the data 
directory for Kafka (not the logs)?

java.io.IOException: The requested operation cannot be performed on a file with a user-mapped section open
at java.io.RandomAccessFile.setLength(Native Method)
at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:285)
at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:276)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
at kafka.log.OffsetIndex.resize(OffsetIndex.scala:276)
at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply$mcV$sp(OffsetIndex.scala:265)
at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
at kafka.log.OffsetIndex.trimToValidSize(OffsetIndex.scala:264)
at kafka.log.LogSegment.recover(LogSegment.scala:199)
at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:188)
at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:160)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777)
at kafka.log.Log.loadSegments(Log.scala:160)
at kafka.log.Log.<init>(Log.scala:90)
at kafka.log.LogManager.createLog(LogManager.scala:357)
at kafka.cluster.Partition.getOrCreateReplica(Partition.scala:91)
at kafka.cluster.Partition$$anonfun$4$$anonfun$apply$2.apply(Partition.scala:173)
at kafka.cluster.Partition$$anonfun$4$$anonfun$apply$2.apply(Partition.scala:173)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:79)
at kafka.cluster.Partition$$anonfun$4.apply(Partition.scala:173)
at kafka.cluster.Partition$$anonfun$4.apply(Partition.scala:165)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:270)
at kafka.cluster.Partition.makeLeader(Partition.scala:165)
at kafka.server.ReplicaManager$$anonfun$makeLeaders$4.apply(ReplicaManager.scala:692)
at kafka.server.ReplicaManager$$anonfun$makeLeaders$4.apply(ReplicaManager.scala:691)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at kafka.server.ReplicaManager.makeLeaders(ReplicaManager.scala:691)
at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:637)
at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:131)
at kafka.server.KafkaApis.handle(KafkaApis.scala:72)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
at java.lang.Thread.run(Thread.java:724)




Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga



-Original Message-
From: isma...@gmail.com [mailto:isma...@gmail.com] On Behalf Of Ismael Juma
Sent: Monday, April 04, 2016 17.21
To: users@kafka.apache.org
Subject: Re: New consumer API waits indefinitely

Hi Lohith,

Are there any errors in your broker logs? I think there may be some issues with 
compacted topics on Windows and the new consumer uses a compacted topic to 
store offsets.

Ismael

On Mon, Apr 4, 2016 at 12:20 PM, Lohith Samaga M 
wrote:

> Dear All,
> The error seems to be NOT_COORDINATOR_FOR_GROUP.
> The exception thrown in
> org.apache.kafka.clients.consumer.internals.Requ

RE: New consumer API waits indefinitely

2016-04-05 Thread Lohith Samaga M
Hi Ismael,
I see the following exception when I (re)start Kafka (even a fresh 
setup after the previous one). And where is the configuration to set the data 
directory for Kafka (not the logs)?

java.io.IOException: The requested operation cannot be performed on a file with a user-mapped section open
at java.io.RandomAccessFile.setLength(Native Method)
at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:285)
at kafka.log.OffsetIndex$$anonfun$resize$1.apply(OffsetIndex.scala:276)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
at kafka.log.OffsetIndex.resize(OffsetIndex.scala:276)
at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply$mcV$sp(OffsetIndex.scala:265)
at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
at kafka.log.OffsetIndex$$anonfun$trimToValidSize$1.apply(OffsetIndex.scala:265)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
at kafka.log.OffsetIndex.trimToValidSize(OffsetIndex.scala:264)
at kafka.log.LogSegment.recover(LogSegment.scala:199)
at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:188)
at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:160)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777)
at kafka.log.Log.loadSegments(Log.scala:160)
at kafka.log.Log.<init>(Log.scala:90)
at kafka.log.LogManager.createLog(LogManager.scala:357)
at kafka.cluster.Partition.getOrCreateReplica(Partition.scala:91)
at kafka.cluster.Partition$$anonfun$4$$anonfun$apply$2.apply(Partition.scala:173)
at kafka.cluster.Partition$$anonfun$4$$anonfun$apply$2.apply(Partition.scala:173)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:79)
at kafka.cluster.Partition$$anonfun$4.apply(Partition.scala:173)
at kafka.cluster.Partition$$anonfun$4.apply(Partition.scala:165)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:270)
at kafka.cluster.Partition.makeLeader(Partition.scala:165)
at kafka.server.ReplicaManager$$anonfun$makeLeaders$4.apply(ReplicaManager.scala:692)
at kafka.server.ReplicaManager$$anonfun$makeLeaders$4.apply(ReplicaManager.scala:691)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at kafka.server.ReplicaManager.makeLeaders(ReplicaManager.scala:691)
at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:637)
at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:131)
at kafka.server.KafkaApis.handle(KafkaApis.scala:72)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
at java.lang.Thread.run(Thread.java:724)




Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga



-Original Message-
From: isma...@gmail.com [mailto:isma...@gmail.com] On Behalf Of Ismael Juma
Sent: Monday, April 04, 2016 17.21
To: users@kafka.apache.org
Subject: Re: New consumer API waits indefinitely

Hi Lohith,

Are there any errors in your broker logs? I think there may be some issues with 
compacted topics on Windows and the new consumer uses a compacted topic to 
store offsets.

Ismael

On Mon, Apr 4, 2016 at 12:20 PM, Lohith Samaga M 
wrote:

> Dear All,
> The error seems to be NOT_COORDINATOR_FOR_GROUP.
> The exception thrown in
> org.apache.kafka.clients.consumer.internals.RequestFuture is:
> org.apache.kafka.common.errors.NotCoordinatorForGroupException:
> This is not the correct coordinator for this group.
>
> However, this exception is considered RetriableException in 
> org.apache.kafka.clients.consumer.internals.RequestFuture.
> So, the retry goes on - in a loop.
>
> It also happens that the Coordinator object becomes null in 
> AbstractCoordinator class.
>
> Can somebody please help?
>
>
> Best regards / Mit freundlichen Grüßen / Sincères salutations M. 
> Lohith Samaga
>
>
>
>
> -Original Message-
> From: Ratha v [mailto:vijayara...@gmail.com]
> Sent: Monday, April 04, 2016 12.22
> To: users@kafka.apache.org
> Subjec

RE: New consumer API waits indefinitely

2016-04-05 Thread Lohith Samaga M
Thanks Niko!

I think I missed an 
org.apache.kafka.clients.consumer.internals.SendFailedException exception at 
the very beginning (or at least it is giving an exception today).

Even after using a new install of Kafka, I get the same errors. Strangely, all 
topics are re-created in the logs. I cannot find the data directory in my drive.
How can I clean up and start again?

Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga


-Original Message-
From: Niko Davor [mailto:nikoda...@gmail.com] 
Sent: Monday, April 04, 2016 23.59
To: users@kafka.apache.org
Subject: RE: New consumer API waits indefinitely

M. Lohith Samaga,

Your Java code looks fine.

Usually, if consumer.poll(100); doesn't return, there is probably a basic 
connection error. If Kafka can't connect, it will internally go into an 
infinite loop. To me, that doesn't seem like a good design, but that's a 
separate tangent.

Turn SLF4J root logging up to debug and you will probably see the connection 
error messages.

A second thought is it might be worth trying using Kafka on a small Linux VM. 
The docs say, "Windows is not currently a well supported platform though we 
would be happy to change that.". Even if you want to use Windows as a server in 
the long run, at least as a development test option, I'd want to be able to 
test with a Linux VM.

FYI, I'm a Kafka newbie, and I've had no problems getting working code samples 
up and running with Kafka 0.9.0.1 and the new Producer/Consumer APIs. I've 
gotten code samples running in Java, Scala, and Python, and everything works, 
including cross language tests.

Lastly, as a mailing list question, how do I reply to a question like this if I 
see the original question in the web archives but it is not in my mail client? 
I suspect that this reply will show up as a different thread which is not what 
I want.


Re: New consumer API waits indefinitely

2016-04-04 Thread Ratha v
This is the same logs i get with my local kafka server, that works fine..

On 5 April 2016 at 10:20, Ratha v  wrote:

> HI Niko;
> I face this issue with linux systems..
> I changed the logging level to debug and when I start and stop my consumer
> (stopping the program)
>  I get same exception. What is the cause here?
>
> [2016-04-05 00:01:08,784] DEBUG Connection with /192.xx.xx.248
> disconnected (org.apache.kafka.common.network.Selector)
>
> kafka_1 | java.io.EOFException
>
> kafka_1 | at
> org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)
>
> kafka_1 | at
> org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
>
> kafka_1 | at
> org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:153)
>
> kafka_1 | at
> org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:134)
>
> kafka_1 | at
> org.apache.kafka.common.network.Selector.poll(Selector.java:286)
>
> kafka_1 | at kafka.network.Processor.run(SocketServer.scala:413)
>
> kafka_1 | at java.lang.Thread.run(Thread.java:745)
>
> kafka_1 | [2016-04-05 00:01:09,236] DEBUG Got ping response for
> sessionid: 0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnxn)
>
> kafka_1 | [2016-04-05 00:01:11,236] DEBUG Got ping response for
> sessionid: 0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnxn)
>
> kafka_1 | [2016-04-05 00:01:13,238] DEBUG Got ping response for
> sessionid: 0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnxn)
>
> kafka_1 | [2016-04-05 00:01:14,078] DEBUG Connection with /192.168.0.248
> disconnected (org.apache.kafka.common.network.Selector)
>
> kafka_1 | java.io.EOFException
>
> kafka_1 | at
> org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)
>
> kafka_1 | at
> org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
>
> kafka_1 | at
> org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:153)
>
> kafka_1 | at
> org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:134)
>
> kafka_1 | at
> org.apache.kafka.common.network.Selector.poll(Selector.java:286)
>
> kafka_1 | at kafka.network.Processor.run(SocketServer.scala:413)
>
> kafka_1 | at java.lang.Thread.run(Thread.java:745)
>
> kafka_1 | [2016-04-05 00:01:15,240] DEBUG Got ping response for
> sessionid: 0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnxn)
>
> kafka_1 | [2016-04-05 00:01:17,240] DEBUG Got ping response for
> sessionid: 0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnxn)
>
> kafka_1 | [2016-04-05 00:01:19,242] DEBUG Got ping response for
> sessionid: 0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnxn)
>
> kafka_1 | [2016-04-05 00:01:19,558] DEBUG Connection with /192.xx.xx.248
> disconnected (org.apache.kafka.common.network.Selector)
>
> kafka_1 | java.io.EOFException
>
> kafka_1 | at
> org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)
>
> kafka_1 | at
> org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
>
> kafka_1 | at
> org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:153)
>
> kafka_1 | at
> org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:134)
>
> kafka_1 | at
> org.apache.kafka.common.network.Selector.poll(Selector.java:286)
>
> kafka_1 | at kafka.network.Processor.run(SocketServer.scala:413)
>
> kafka_1 | at java.lang.Thread.run(Thread.java:745)
>
> kafka_1 | [2016-04-05 00:01:21,242] DEBUG Got ping response for
> sessionid: 0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnx
>
>
> On 5 April 2016 at 04:29, Niko Davor  wrote:
>
>> M. Lohith Samaga,
>>
>> Your Java code looks fine.
>>
>> Usually, if consumer.poll(100); doesn't return, there is probably a basic
>> connection error. If Kafka can't connect, it will internally go into an
>> infinite loop. To me, that doesn't seem like a good design, but that's a
>> separate tangent.
>>
>> Turn SLF4J root logging up to debug and you will probably see the
>> connection error messages.
>>
>> A second thought is it might be worth trying using Kafka on a small Linux
>> VM. The docs say, "Windows is not currently a well supported platform
>> though we would be happy to change that.". Even if you want to use Windows
>> as a server in the long run, at least as a development test option, I'd
>> want to be able to test with a Linux VM.
>>
>> FYI, I'm a Kafka newbie, and I've had no problems getting working code
>> samples up and running with Kafka 0.9.0.1 and the new Producer/Consumer
>> APIs. I've gotten code samples running in Java, Scala, and Python, and
>> everything works, including cross language tests.
>>
>> Lastly, as a mailing list question, how do I reply to a question like this
>> if I see the original question in the web archives but it is not in my
>> mail
>> client? I suspect that this reply will show up as a different thread which
>> is not what I want.
>

Re: New consumer API waits indefinitely

2016-04-04 Thread Ratha v
HI Niko;
I face this issue with linux systems..
I changed the logging level to debug and when I start and stop my consumer
(stopping the program)
 I get same exception. What is the cause here?

[2016-04-05 00:01:08,784] DEBUG Connection with /192.xx.xx.248 disconnected
(org.apache.kafka.common.network.Selector)

kafka_1 | java.io.EOFException

kafka_1 | at
org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)

kafka_1 | at
org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)

kafka_1 | at
org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:153)

kafka_1 | at
org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:134)

kafka_1 | at
org.apache.kafka.common.network.Selector.poll(Selector.java:286)

kafka_1 | at kafka.network.Processor.run(SocketServer.scala:413)

kafka_1 | at java.lang.Thread.run(Thread.java:745)

kafka_1 | [2016-04-05 00:01:09,236] DEBUG Got ping response for sessionid:
0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnxn)

kafka_1 | [2016-04-05 00:01:11,236] DEBUG Got ping response for sessionid:
0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnxn)

kafka_1 | [2016-04-05 00:01:13,238] DEBUG Got ping response for sessionid:
0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnxn)

kafka_1 | [2016-04-05 00:01:14,078] DEBUG Connection with /192.168.0.248
disconnected (org.apache.kafka.common.network.Selector)

kafka_1 | java.io.EOFException

kafka_1 | at
org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)

kafka_1 | at
org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)

kafka_1 | at
org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:153)

kafka_1 | at
org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:134)

kafka_1 | at
org.apache.kafka.common.network.Selector.poll(Selector.java:286)

kafka_1 | at kafka.network.Processor.run(SocketServer.scala:413)

kafka_1 | at java.lang.Thread.run(Thread.java:745)

kafka_1 | [2016-04-05 00:01:15,240] DEBUG Got ping response for sessionid:
0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnxn)

kafka_1 | [2016-04-05 00:01:17,240] DEBUG Got ping response for sessionid:
0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnxn)

kafka_1 | [2016-04-05 00:01:19,242] DEBUG Got ping response for sessionid:
0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnxn)

kafka_1 | [2016-04-05 00:01:19,558] DEBUG Connection with /192.xx.xx.248
disconnected (org.apache.kafka.common.network.Selector)

kafka_1 | java.io.EOFException

kafka_1 | at
org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)

kafka_1 | at
org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)

kafka_1 | at
org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:153)

kafka_1 | at
org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:134)

kafka_1 | at
org.apache.kafka.common.network.Selector.poll(Selector.java:286)

kafka_1 | at kafka.network.Processor.run(SocketServer.scala:413)

kafka_1 | at java.lang.Thread.run(Thread.java:745)

kafka_1 | [2016-04-05 00:01:21,242] DEBUG Got ping response for sessionid:
0x253405b88b300a4 after 0ms (org.apache.zookeeper.ClientCnx


On 5 April 2016 at 04:29, Niko Davor  wrote:

> M. Lohith Samaga,
>
> Your Java code looks fine.
>
> Usually, if consumer.poll(100); doesn't return, there is probably a basic
> connection error. If Kafka can't connect, it will internally go into an
> infinite loop. To me, that doesn't seem like a good design, but that's a
> separate tangent.
>
> Turn SLF4J root logging up to debug and you will probably see the
> connection error messages.
>
> A second thought is it might be worth trying using Kafka on a small Linux
> VM. The docs say, "Windows is not currently a well supported platform
> though we would be happy to change that.". Even if you want to use Windows
> as a server in the long run, at least as a development test option, I'd
> want to be able to test with a Linux VM.
>
> FYI, I'm a Kafka newbie, and I've had no problems getting working code
> samples up and running with Kafka 0.9.0.1 and the new Producer/Consumer
> APIs. I've gotten code samples running in Java, Scala, and Python, and
> everything works, including cross language tests.
>
> Lastly, as a mailing list question, how do I reply to a question like this
> if I see the original question in the web archives but it is not in my mail
> client? I suspect that this reply will show up as a different thread which
> is not what I want.
>



-- 
-Ratha
http://vvratha.blogspot.com/


RE: New consumer API waits indefinitely

2016-04-04 Thread Niko Davor
M. Lohith Samaga,

Your Java code looks fine.

Usually, if consumer.poll(100); doesn't return, there is probably a basic
connection error. If Kafka can't connect, it will internally go into an
infinite loop. To me, that doesn't seem like a good design, but that's a
separate tangent.

Turn SLF4J root logging up to debug and you will probably see the
connection error messages.
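
For example, with a log4j binding on the classpath (a sketch only; the exact mechanism depends on which SLF4J backend the application actually uses), a log4j.properties along these lines turns the root logger up to DEBUG:

    log4j.rootLogger=DEBUG, stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c)%n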

A second thought is it might be worth trying using Kafka on a small Linux
VM. The docs say, "Windows is not currently a well supported platform
though we would be happy to change that.". Even if you want to use Windows
as a server in the long run, at least as a development test option, I'd
want to be able to test with a Linux VM.

FYI, I'm a Kafka newbie, and I've had no problems getting working code
samples up and running with Kafka 0.9.0.1 and the new Producer/Consumer
APIs. I've gotten code samples running in Java, Scala, and Python, and
everything works, including cross language tests.

Lastly, as a mailing list question, how do I reply to a question like this
if I see the original question in the web archives but it is not in my mail
client? I suspect that this reply will show up as a different thread which
is not what I want.


Re: New consumer API waits indefinitely

2016-04-04 Thread Ismael Juma
Hi Lohith,

Are there any errors in your broker logs? I think there may be some issues
with compacted topics on Windows and the new consumer uses a compacted
topic to store offsets.

Ismael

On Mon, Apr 4, 2016 at 12:20 PM, Lohith Samaga M 
wrote:

> Dear All,
> The error seems to be NOT_COORDINATOR_FOR_GROUP.
> The exception thrown in
> org.apache.kafka.clients.consumer.internals.RequestFuture is:
> org.apache.kafka.common.errors.NotCoordinatorForGroupException:
> This is not the correct coordinator for this group.
>
> However, this exception is considered RetriableException in
> org.apache.kafka.clients.consumer.internals.RequestFuture.
> So, the retry goes on - in a loop.
>
> It also happens that the Coordinator object becomes null in
> AbstractCoordinator class.
>
> Can somebody please help?
>
>
> Best regards / Mit freundlichen Grüßen / Sincères salutations
> M. Lohith Samaga
>
>
>
>
> -Original Message-
> From: Ratha v [mailto:vijayara...@gmail.com]
> Sent: Monday, April 04, 2016 12.22
> To: users@kafka.apache.org
> Subject: Re: New consumer API waits indefinitely
>
> Still struggling :)
> Check following threads;
>
>- If my producer producing, then why the consumer couldn't consume? it
>stuck @ poll()
>- Consumer thread is waiting forever, not returning any objects
>
>
> I think new APIs are recommended.
>
>
> On 4 April 2016 at 16:37, Lohith Samaga M 
> wrote:
>
> > Thanks for letting me know.
> >
> > Is there any work around? A fix?
> >
> > Which set of API is recommended for production use?
> >
> > Best regards / Mit freundlichen Grüßen / Sincères salutations M.
> > Lohith Samaga
> >
> >
> >
> >
> > -Original Message-
> > From: Ratha v [mailto:vijayara...@gmail.com]
> > Sent: Monday, April 04, 2016 11.27
> > To: users@kafka.apache.org
> > Subject: Re: New consumer API waits indefinitely
> >
> > I too face same issue:(
> >
> > On 4 April 2016 at 15:51, Lohith Samaga M 
> > wrote:
> >
> > > HI,
> > > Good morning.
> > >
> > > I am new to Kafka. So, please bear with me.
> > > I am using the new Producer and Consumer API with
> > > Kafka
> > > 0.9.0.1 running on Windows 7 laptop with zookeeper.
> > >
> > > I was able to send messages using the new Producer
> > > API. I can see the messages in the Kafka data directory.
> > >
> > > However, when I run the consumer, it does not
> > > retrieve the messages. It keeps waiting for the messages indefinitely.
> > > My code (taken from Javadoc and modified)  is as below:
> > >
> > > props.put("bootstrap.servers", "localhost:9092");
> > > props.put("group.id", "new01");
> > > props.put("enable.auto.commit", "true");
> > > props.put("auto.commit.interval.ms", "1000");
> > > props.put("session.timeout.ms", "3");
> > > props.put("key.deserializer",
> > > "org.apache.kafka.common.serialization.StringDeserializer");
> > > props.put("value.deserializer",
> > > "org.apache.kafka.common.serialization.StringDeserializer");
> > >
> > > KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
> > > consumer.subscribe(Arrays.asList("new-producer"));
> > > while (true) {
> > > ConsumerRecords<String, String> records = consumer.poll(100);
> > > for (ConsumerRecord<String, String> record : records)
> > > System.out.printf("offset = %d, key = %s, value = %s", record.offset(), record.key(), record.value());
> > > }
> > >
> > > Can anybody please tell me what went wrong?
> > >
> > > Thanks & Regards,
> > > M. Lohith Samaga
> > >

RE: New consumer API waits indefinitely

2016-04-04 Thread Lohith Samaga M
Dear All,
The error seems to be NOT_COORDINATOR_FOR_GROUP.
The exception thrown in 
org.apache.kafka.clients.consumer.internals.RequestFuture is:
org.apache.kafka.common.errors.NotCoordinatorForGroupException: This is 
not the correct coordinator for this group.

However, this exception is considered RetriableException in 
org.apache.kafka.clients.consumer.internals.RequestFuture.
So, the retry goes on - in a loop.

It also happens that the Coordinator object becomes null in 
AbstractCoordinator class.

Can somebody please help?


Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga




-Original Message-
From: Ratha v [mailto:vijayara...@gmail.com] 
Sent: Monday, April 04, 2016 12.22
To: users@kafka.apache.org
Subject: Re: New consumer API waits indefinitely

Still struggling :)
Check following threads;

   - If my producer producing, then why the consumer couldn't consume? it
   stuck @ poll()
   - Consumer thread is waiting forever, not returning any objects


I think new APIs are recommended.


On 4 April 2016 at 16:37, Lohith Samaga M  wrote:

> Thanks for letting me know.
>
> Is there any work around? A fix?
>
> Which set of API is recommended for production use?
>
> Best regards / Mit freundlichen Grüßen / Sincères salutations M. 
> Lohith Samaga
>
>
>
>
> -Original Message-
> From: Ratha v [mailto:vijayara...@gmail.com]
> Sent: Monday, April 04, 2016 11.27
> To: users@kafka.apache.org
> Subject: Re: New consumer API waits indefinitely
>
> I too face same issue:(
>
> On 4 April 2016 at 15:51, Lohith Samaga M 
> wrote:
>
> > HI,
> > Good morning.
> >
> > I am new to Kafka. So, please bear with me.
> > I am using the new Producer and Consumer API with 
> > Kafka
> > 0.9.0.1 running on Windows 7 laptop with zookeeper.
> >
> > I was able to send messages using the new Producer 
> > API. I can see the messages in the Kafka data directory.
> >
> > However, when I run the consumer, it does not 
> > retrieve the messages. It keeps waiting for the messages indefinitely.
> > My code (taken from Javadoc and modified)  is as below:
> >
> > props.put("bootstrap.servers", "localhost:9092");
> > props.put("group.id", "new01");
> > props.put("enable.auto.commit", "true");
> > props.put("auto.commit.interval.ms", "1000");
> > props.put("session.timeout.ms", "3");
> > props.put("key.deserializer", 
> > "org.apache.kafka.common.serialization.StringDeserializer");
> > props.put("value.deserializer", 
> > "org.apache.kafka.common.serialization.StringDeserializer");
> >
> > KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
> > consumer.subscribe(Arrays.asList("new-producer"));
> > while (true) {
> > ConsumerRecords<String, String> records = consumer.poll(100);
> > for (ConsumerRecord<String, String> record : records)
> > System.out.printf("offset = %d, key = %s, value = %s", record.offset(), record.key(), record.value());
> > }
> >
> > Can anybody please tell me what went wrong?
> >
> > Thanks & Regards,
> > M. Lohith Samaga
> >
> >
>
>
>
> --
> -Ratha
> http://vvratha.blogspot.com/

RE: New consumer API waits indefinitely

2016-04-04 Thread Lohith Samaga M
Thanks Ratha.

I am trying to understand the code...

Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga


-Original Message-
From: Ratha v [mailto:vijayara...@gmail.com] 
Sent: Monday, April 04, 2016 12.22
To: users@kafka.apache.org
Subject: Re: New consumer API waits indefinitely

Still struggling :)
Check following threads;

   - If my producer producing, then why the consumer couldn't consume? it
   stuck @ poll()
   - Consumer thread is waiting forever, not returning any objects


I think new APIs are recommended.


On 4 April 2016 at 16:37, Lohith Samaga M  wrote:

> Thanks for letting me know.
>
> Is there any work around? A fix?
>
> Which set of API is recommended for production use?
>
> Best regards / Mit freundlichen Grüßen / Sincères salutations M. 
> Lohith Samaga
>
>
>
>
> -Original Message-
> From: Ratha v [mailto:vijayara...@gmail.com]
> Sent: Monday, April 04, 2016 11.27
> To: users@kafka.apache.org
> Subject: Re: New consumer API waits indefinitely
>
> I too face same issue:(
>
> On 4 April 2016 at 15:51, Lohith Samaga M 
> wrote:
>
> > HI,
> > Good morning.
> >
> > I am new to Kafka. So, please bear with me.
> > I am using the new Producer and Consumer API with 
> > Kafka
> > 0.9.0.1 running on Windows 7 laptop with zookeeper.
> >
> > I was able to send messages using the new Producer 
> > API. I can see the messages in the Kafka data directory.
> >
> > However, when I run the consumer, it does not 
> > retrieve the messages. It keeps waiting for the messages indefinitely.
> > My code (taken from Javadoc and modified)  is as below:
> >
> > props.put("bootstrap.servers", "localhost:9092");
> > props.put("group.id", "new01");
> > props.put("enable.auto.commit", "true");
> > props.put("auto.commit.interval.ms", "1000");
> > props.put("session.timeout.ms", "3");
> > props.put("key.deserializer", 
> > "org.apache.kafka.common.serialization.StringDeserializer");
> > props.put("value.deserializer", 
> > "org.apache.kafka.common.serialization.StringDeserializer");
> >
> > KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
> > consumer.subscribe(Arrays.asList("new-producer"));
> > while (true) {
> > ConsumerRecords<String, String> records = consumer.poll(100);
> > for (ConsumerRecord<String, String> record : records)
> > System.out.printf("offset = %d, key = %s, value = %s", record.offset(), record.key(), record.value());
> > }
> >
> > Can anybody please tell me what went wrong?
> >
> > Thanks & Regards,
> > M. Lohith Samaga
> >
> >
>
>
>
> --
> -Ratha
> http://vvratha.blogspot.com/
>



--
-Ratha
http://vvratha.blogspot.com/


Re: New consumer API waits indefinitely

2016-04-03 Thread Ratha v
Still struggling :)
Check following threads;

   - If my producer producing, then why the consumer couldn't consume? it
   stuck @ poll()
   - Consumer thread is waiting forever, not returning any objects


I think new APIs are recommended.


On 4 April 2016 at 16:37, Lohith Samaga M  wrote:

> Thanks for letting me know.
>
> Is there any work around? A fix?
>
> Which set of API is recommended for production use?
>
> Best regards / Mit freundlichen Grüßen / Sincères salutations
> M. Lohith Samaga
>
>
>
>
> -Original Message-
> From: Ratha v [mailto:vijayara...@gmail.com]
> Sent: Monday, April 04, 2016 11.27
> To: users@kafka.apache.org
> Subject: Re: New consumer API waits indefinitely
>
> I too face same issue:(
>
> On 4 April 2016 at 15:51, Lohith Samaga M 
> wrote:
>
> > HI,
> > Good morning.
> >
> > I am new to Kafka. So, please bear with me.
> > I am using the new Producer and Consumer API with
> > Kafka
> > 0.9.0.1 running on Windows 7 laptop with zookeeper.
> >
> > I was able to send messages using the new Producer
> > API. I can see the messages in the Kafka data directory.
> >
> > However, when I run the consumer, it does not retrieve
> > the messages. It keeps waiting for the messages indefinitely.
> > My code (taken from Javadoc and modified)  is as below:
> >
> > props.put("bootstrap.servers", "localhost:9092");
> > props.put("group.id", "new01");
> > props.put("enable.auto.commit", "true");
> > props.put("auto.commit.interval.ms", "1000");
> > props.put("session.timeout.ms", "3");
> > props.put("key.deserializer",
> > "org.apache.kafka.common.serialization.StringDeserializer");
> > props.put("value.deserializer",
> > "org.apache.kafka.common.serialization.StringDeserializer");
> >
> > KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
> > consumer.subscribe(Arrays.asList("new-producer"));
> > while (true) {
> > ConsumerRecords<String, String> records = consumer.poll(100);
> > for (ConsumerRecord<String, String> record : records)
> > System.out.printf("offset = %d, key = %s, value = %s", record.offset(), record.key(), record.value());
> > }
> >
> > Can anybody please tell me what went wrong?
> >
> > Thanks & Regards,
> > M. Lohith Samaga
> >
> >
>
>
>
> --
> -Ratha
> http://vvratha.blogspot.com/
>



-- 
-Ratha
http://vvratha.blogspot.com/


RE: New consumer API waits indefinitely

2016-04-03 Thread Lohith Samaga M
Thanks for letting me know.

Is there any work around? A fix?

Which set of API is recommended for production use?

Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga




-Original Message-
From: Ratha v [mailto:vijayara...@gmail.com] 
Sent: Monday, April 04, 2016 11.27
To: users@kafka.apache.org
Subject: Re: New consumer API waits indefinitely

I too face same issue:(

On 4 April 2016 at 15:51, Lohith Samaga M  wrote:

> HI,
> Good morning.
>
> I am new to Kafka. So, please bear with me.
> I am using the new Producer and Consumer API with 
> Kafka
> 0.9.0.1 running on Windows 7 laptop with zookeeper.
>
> I was able to send messages using the new Producer 
> API. I can see the messages in the Kafka data directory.
>
> However, when I run the consumer, it does not retrieve 
> the messages. It keeps waiting for the messages indefinitely.
> My code (taken from Javadoc and modified)  is as below:
>
> props.put("bootstrap.servers", "localhost:9092");
> props.put("group.id", "new01");
> props.put("enable.auto.commit", "true");
> props.put("auto.commit.interval.ms", "1000");
> props.put("session.timeout.ms", "3");
> props.put("key.deserializer", 
> "org.apache.kafka.common.serialization.StringDeserializer");
> props.put("value.deserializer", 
> "org.apache.kafka.common.serialization.StringDeserializer");
>
> KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
> consumer.subscribe(Arrays.asList("new-producer"));
> while (true) {
> ConsumerRecords<String, String> records = consumer.poll(100);
> for (ConsumerRecord<String, String> record : records)
> System.out.printf("offset = %d, key = %s, value = %s", record.offset(), record.key(), record.value());
> }
>
> Can anybody please tell me what went wrong?
>
> Thanks & Regards,
> M. Lohith Samaga
>
>



--
-Ratha
http://vvratha.blogspot.com/


Re: New consumer API waits indefinitely

2016-04-03 Thread Ratha v
I too face same issue:(

On 4 April 2016 at 15:51, Lohith Samaga M  wrote:

> HI,
> Good morning.
>
> I am new to Kafka. So, please bear with me.
> I am using the new Producer and Consumer API with Kafka
> 0.9.0.1 running on Windows 7 laptop with zookeeper.
>
> I was able to send messages using the new Producer API. I
> can see the messages in the Kafka data directory.
>
> However, when I run the consumer, it does not retrieve the
> messages. It keeps waiting for the messages indefinitely.
> My code (taken from Javadoc and modified)  is as below:
>
> props.put("bootstrap.servers", "localhost:9092");
> props.put("group.id", "new01");
> props.put("enable.auto.commit", "true");
> props.put("auto.commit.interval.ms", "1000");
> props.put("session.timeout.ms", "3");
> props.put("key.deserializer",
> "org.apache.kafka.common.serialization.StringDeserializer");
> props.put("value.deserializer",
> "org.apache.kafka.common.serialization.StringDeserializer");
>
> KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
> consumer.subscribe(Arrays.asList("new-producer"));
> while (true) {
> ConsumerRecords<String, String> records = consumer.poll(100);
> for (ConsumerRecord<String, String> record : records)
> System.out.printf("offset = %d, key = %s, value = %s", record.offset(), record.key(), record.value());
> }
>
> Can anybody please tell me what went wrong?
>
> Thanks & Regards,
> M. Lohith Samaga
>
>



-- 
-Ratha
http://vvratha.blogspot.com/


Re: new consumer api / heartbeat, manual commit & long to process messages

2016-02-26 Thread Jason Gustafson
Hey Guven,

A heartbeat API actually came up in the discussion of KIP-41. Ultimately we
rejected it because it led to confusing API semantics. The problem is that
heartbeat responses are used by the coordinator to tell consumers when a
rebalance is needed. But what should the user do if they call heartbeat()
and find that the group is rebalancing? If they don't stop message
processing and rejoin, then they may be kicked out of the group just as if
they had failed to heartbeat before expiration of the session timeout.
Alternatively, if we made heartbeat() blocking and let the rebalance
complete in the call itself, then the consumer may no longer be assigned
the same partitions. So either way, unless you can preempt message
processing, you may fall out of the group and pending messages will need to
be reprocessed after the rebalance completes. And if you can preempt
message processing, then you can ensure that heartbeats get sent by always
preempting the processor before the session timeout expires.

In the end, we felt that max.poll.records was a simpler option since it
gives you fine control over the poll loop and doesn't require any confusing
API changes. As long as you can put some upper bound on the processing
time, you can set max.poll.records=1 and the session timeout to whatever
the upper bound is.
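
As a sketch of that combination (assuming a client from trunk/0.10 where max.poll.records exists, and a broker whose group.max.session.timeout.ms allows the larger session timeout; the five-minute figure is just a placeholder for the per-record upper bound):

    props.put("max.poll.records", "1");          // hand back at most one record per poll()
    props.put("session.timeout.ms", "300000");   // upper bound on processing a single record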

However, if you have a use case where there is a very high variance in
message processing times, it may not be so helpful. In that case, the best
options I can think of at the moment are the following:

1. Move the processing to another thread. Basically the workflow would be
something like this: 1) receive records for a partition in poll(), 2)
submit them to an executor for processing, 3) pause the partition, and 4)
continue the poll loop. When the processor finishes with the records, you
can use resume() to reenable fetching. You'll have to manage offset commits
yourself since you wouldn't want to commit before the thread has actually
finished processing. You'll also have to account for the possibility of a
rebalance completing while the thread is still processing a batch (an easy
way to do this would probably be to just ignore CommitFailedException
thrown from commit).
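
A minimal sketch of that workflow, assuming the 0.10-style pause()/resume() overloads that take a Collection (types come from org.apache.kafka.clients.consumer, org.apache.kafka.common, java.util and java.util.concurrent; processRecord() and the pool size are placeholders, and a real version would also handle partitions revoked by a rebalance, e.g. via a ConsumerRebalanceListener):

    ExecutorService pool = Executors.newFixedThreadPool(4);
    Map<TopicPartition, OffsetAndMetadata> finished = new ConcurrentHashMap<>();
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);   // enable.auto.commit=false in props
    consumer.subscribe(Arrays.asList("new-producer"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (TopicPartition tp : records.partitions()) {
            List<ConsumerRecord<String, String>> batch = records.records(tp);
            consumer.pause(Collections.singleton(tp));      // stop fetching tp until its batch is done
            pool.submit(() -> {
                for (ConsumerRecord<String, String> r : batch)
                    processRecord(r);                       // the slow work, off the poll thread
                long next = batch.get(batch.size() - 1).offset() + 1;
                finished.put(tp, new OffsetAndMetadata(next));
            });
        }
        // still on the poll thread: poll() keeps the heartbeat alive while partitions are paused
        for (TopicPartition tp : new ArrayList<>(finished.keySet())) {
            try {
                consumer.commitSync(Collections.singletonMap(tp, finished.remove(tp)));
                consumer.resume(Collections.singleton(tp));
            } catch (CommitFailedException e) {
                // a rebalance completed while the batch was in flight; the records will be redelivered
            }
        }
    }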

2. This is a tad hacky, but you could take advantage of the fact that the
coordinator treats commits as heartbeats and call commitSync() periodically
while handling a batch of records. Note in this case that you should not
use the no-arg commitSync() variant which will commit the offsets for the
full batch returned from the last poll(). Instead you should pass the
offsets of the records already processed explicitly in
commitSync(Map<TopicPartition, OffsetAndMetadata>).
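
Roughly like this, as a sketch (processRecord() and the 30-second commit interval are placeholders; the interval just needs to stay comfortably inside the session timeout):

    ConsumerRecords<String, String> records = consumer.poll(100);
    long lastCommit = System.currentTimeMillis();
    for (TopicPartition tp : records.partitions()) {
        for (ConsumerRecord<String, String> r : records.records(tp)) {
            processRecord(r);                                         // potentially very slow
            if (System.currentTimeMillis() - lastCommit > 30000) {
                // commit only what has actually been handled so far, never the whole batch
                consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(r.offset() + 1)));
                lastCommit = System.currentTimeMillis();
            }
        }
    }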

3. Use the consumer in "simple" mode. If you don't actually need group
coordination, then you can assign the partitions you want to read from
manually and consume them at your own rate. There is no heartbeating or
rebalancing to worry about.
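
A minimal sketch of that mode (the topic name, partition numbers, props and process() are placeholders; assign() takes a List in the 0.9.0.x client):

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.assign(Arrays.asList(new TopicPartition("my-topic", 0),
                                  new TopicPartition("my-topic", 1)));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records)
            process(record);   // take as long as needed; there is no group to fall out of
    }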

-Jason

On Fri, Feb 26, 2016 at 1:20 AM, Guven Demir 
wrote:

> thanks for the response Jason,
>
> i've already experimented with a similar solution myself, lowering
> max.partition.fetch.bytes to barely fit the largest message (2k at the
> moment)
>
> still, i've observed similar problems, which is caused by really long
> processing times, e.g. downloading a large video via a link received in the
> message
>
> it's not very feasible to increase the heartbeat timeout too much, as
> session timeout is recommended to be at least 3 times that of heartbeat
> timeout. and that is bounded by broker's group.max.session.timeout.ms,
> which i would not want to increase as it would affect all other
> topics/consumers
>
> could there be an api for triggering the heartbeat manually maybe? it can
> be argued that that would beat the purpose of a heartbeat though, it might
> be used improperly, i.e. in my case rather than sending heartbeats inside
> the download/save loop but in an empty loop waiting for the download to
> complete, which might never happen. again, sending heartbeats in
> application code might be considered tight coupling as well
>
> other than that, i will experiment with the pause() api, separate thread
> for the actual message processing and poll()'ing with all partitions paused
>
> guven
>
>
> > On 25 Feb 2016, at 20:19, Jason Gustafson  wrote:
> >
> > Hey Guven,
> >
> > This problem is what KIP-41 was created for:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-41%3A+KafkaConsumer+Max+Records
> > .
> >
> > The patch for this was committed yesterday and will be included in 0.10.
> If
> > you need something in the shorter term, you could probably use the client
> > from trunk (no changes to the server are needed).
> >
> > If this is still not sufficient, I recommend looking into the pause()
> API,
> > which can facilitate asynchronous message processing in another thread.
> >
> > -Jason
> >
> > On Thu, Feb 25, 2016 at 8:53 AM, Guven Demir  >
> > wrote:
> >
> >> hi all,
> >>
> >> i'm having trouble processing a topic which includes paths to images
> 

Re: new consumer api / heartbeat, manual commit & long to process messages

2016-02-26 Thread Guven Demir
thanks for the response Jason,

i've already experimented with a similar solution myself, lowering 
max.partition.fetch.bytes to barely fit the largest message (2k at the moment)

still, i've observed similar problems, which is caused by really long 
processing times, e.g. downloading a large video via a link received in the 
message

it's not very feasible to increase the heartbeat timeout too much, as session 
timeout is recommended to be at least 3 times that of heartbeat timeout. and 
that is bounded by broker's group.max.session.timeout.ms, which i would not 
want to increase as it would affect all other topics/consumers

could there be an api for triggering the heartbeat manually maybe? it can be 
argued that that would beat the purpose of a heartbeat though, it might be used 
improperly, i.e. in my case rather than sending heartbeats inside the 
download/save loop but in an empty loop waiting for the download to complete, 
which might never happen. again, sending heartbeats in application code might 
be considered tight coupling as well

other than that, i will experiment with the pause() api, separate thread for 
the actual message processing and poll()'ing with all partitions paused

guven


> On 25 Feb 2016, at 20:19, Jason Gustafson  wrote:
> 
> Hey Guven,
> 
> This problem is what KIP-41 was created for:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-41%3A+KafkaConsumer+Max+Records
> .
> 
> The patch for this was committed yesterday and will be included in 0.10. If
> you need something in the shorter term, you could probably use the client
> from trunk (no changes to the server are needed).
> 
> If this is still not sufficient, I recommend looking into the pause() API,
> which can facilitate asynchronous message processing in another thread.
> 
> -Jason
> 
> On Thu, Feb 25, 2016 at 8:53 AM, Guven Demir 
> wrote:
> 
>> hi all,
>> 
>> i'm having trouble processing a topic which includes paths to images which
>> need to be downloaded and saved to disk (each takes ~3-5 seconds) and
>> several are received on each poll
>> 
>> within this scenario, i'm receiving the following error:
>> 
>>org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot
>> be completed due to group rebalance
>> 
>> which i assume is due to heartbeat failure and broker re-assigning the
>> consumer's partition to another consumer
>> 
>> are there any recommendations for processing long to process messages?
>> 
>> thanks in advance,
>> guven
>> 
>> 
>> 



Re: new consumer api / heartbeat, manual commit & long to process messages

2016-02-25 Thread Jason Gustafson
Hey Guven,

This problem is what KIP-41 was created for:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-41%3A+KafkaConsumer+Max+Records
.

The patch for this was committed yesterday and will be included in 0.10. If
you need something in the shorter term, you could probably use the client
from trunk (no changes to the server are needed).

If this is still not sufficient, I recommend looking into the pause() API,
which can facilitate asynchronous message processing in another thread.

-Jason

On Thu, Feb 25, 2016 at 8:53 AM, Guven Demir 
wrote:

> hi all,
>
> i'm having trouble processing a topic which includes paths to images which
> need to be downloaded and saved to disk (each takes ~3-5 seconds) and
> several are received on each poll
>
> within this scenario, i'm receiving the following error:
>
> org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot
> be completed due to group rebalance
>
> which i assume is due to heartbeat failure and broker re-assigning the
> consumer's partition to another consumer
>
> are there any recommendations for processing long to process messages?
>
> thanks in advance,
> guven
>
>
>


Re: New Consumer API + Reactive Kafka

2015-12-02 Thread Guozhang Wang
In the new API commitSync() handles retries and reconnecting, and will only
throw an exception if it encounters a non-retriable error (e.g. it has been
told that the partitions it wants to commit no longer belong to it) or
a timeout has elapsed. You can find the possible exceptions thrown from this
function (commitSync) here:

http://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
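
As an illustration, one way a caller might react to that (the offsets map and the createConsumer() factory are assumptions, not part of the API):

    try {
        consumer.commitSync(offsets);
    } catch (CommitFailedException e) {
        // Non-retriable: the group rebalanced and these partitions no longer belong
        // to this consumer. The simplest recovery is to close it and start over.
        consumer.close();
        consumer = createConsumer();
    }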

Guozhang


On Wed, Dec 2, 2015 at 8:58 AM, Krzysztof Ciesielski <
krzysztof.ciesiel...@softwaremill.pl> wrote:

> I see, that’s actually a very important point, thanks Jay.
> I think that we are very optimistic about updating Reactive Kafka now
> after getting all these details :)
> I have one more question: in the new client we only have to call
> commitSync(offsets). This is a ‘void’ method so I suspect that it commits
> atomically?
> In our current native committer, we have quite a lot of additional code
> for retries, reconnecting or finding new channel coordinator. I suspect
> that the new API handles it all internally and if commitSync() fails then
> it means that the only thing we can do is kill the consumer and try to
> create a new one?
>
> —
> Bests,
> Chris
> SoftwareMill
> On 2 December 2015 at 17:42:24, Jay Kreps (j...@confluent.io) wrote:
>
> It's worth noting that both the old and new consumer are identical in the
> number of records fetched at once and this is bounded by the fetch size and
> the number of partitions you subscribe to. The old consumer held these in
> memory internally and waited for you to ask for them, the new consumer
> immediately gives you what it has. Overall, though, the new consumer gives
> much better control over what is being fetched since it only uses memory
> when you call poll(); the old consumer had a background thread doing this
> which would only stop when it filled up a queue of unprocessed
> chunks...this is a lot harder to predict.
>
> -Jay
>
> On Wed, Dec 2, 2015 at 7:13 AM, Gwen Shapira  wrote:
>
> > On Wed, Dec 2, 2015 at 10:44 PM, Krzysztof Ciesielski <
> > krzysztof.ciesiel...@softwaremill.pl> wrote:
> >
> > > Hello,
> > >
> > > I’m the main maintainer of Reactive Kafka - a wrapper library that
> > > provides Kafka API as Reactive Streams (
> > > https://github.com/softwaremill/reactive-kafka).
> > > I’m a bit concerned about switching to Kafka 0.9 because of the new
> > > Consumer API which doesn’t seem to fit well into this paradigm,
> comparing
> > > to the old one. My main concerns are:
> > >
> > > 1. Our current code uses the KafkaIterator and reads messages
> > > sequentially, then sends them further upstream. In the new API, you
> > cannot
> > > control how many messages are returned with poll(), so we would need to
> > > introduce some kind of in-memory buffering.
> > > 2. You cannot specify which offsets to commit. Our current native
> > > committer (
> > >
> >
> https://github.com/softwaremill/reactive-kafka/blob/4055e88c09b8e08aefe8dbbd4748605df5779b07/core/src/main/scala/com/softwaremill/react/kafka/commit/native/NativeCommitter.scala
> > )
> > > uses the OffsetCommitRequest/Response API and
> > > kafka.api.ConsumerMetadataRequest/Response for resolving brokers.
> > Switching
> > > to Kafka 0.9 brings some compilation errors that raise questions.
> > >
> > > My questions are:
> > >
> > > 1. Do I understand the capabilities and limitations of new API
> correctly?
> > > :)
> > >
> >
> > The first limitation is correct - poll() may return any number of records
> > and you need to handle this.
> > The second is not correct - commitSync() can take a map of TopicPartition
> > and Offsets, so you would only commit specific offsets of specific
> > partitions.
> >
> >
> >
> > > 2. Can we stay with the old iterator-based client, or is it going to
> get
> > > abandoned in future Kafka versions, or discouraged for some reasons?
> > >
> >
> > It is already a bit behind - only the new client includes support for
> > secured clusters (authentication and encryption). It will get deprecated
> in
> > the future.
> >
> >
> > > 3. Can we still use the OffsetCommitRequest/Response API to commit
> > > messages manually? If yes, could someone update this example:
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Committing+and+fetching+consumer+offsets+in+Kafka
> > or
> > > give me a few hints on how to do this with 0.9?
> > >
> >
> > AFAIK, the wire protocol and the API is not going anywhere. Hopefully you
> > can use the new objects we provide in the clients jar
> > (org.apache.kafka.common.requests).
> >
> >
> > >
> > > By the way, we’d like our library to appear on the Ecosystem Wiki, I’m
> > not
> > > sure how to request that officially :)
> > >
> >
> > Let us know what to write there and where to link :)
> >
> >
> > >
> > > —
> > > Bests,
> > > Chris
> > > SoftwareMill
> >
>



-- 
-- Guozhang


Re: New Consumer API + Reactive Kafka

2015-12-02 Thread Krzysztof Ciesielski
I see, that’s actually a very important point, thanks Jay.
I think that we are very optimistic about updating Reactive Kafka now after 
getting all these details :)
I have one more question: in the new client we only have to call 
commitSync(offsets). This is a ‘void’ method so I suspect that it commits 
atomically?
In our current native committer, we have quite a lot of additional code for 
retries, reconnecting or finding new channel coordinator. I suspect that the 
new API handles it all internally and if commitSync() fails then it means that 
the only thing we can do is kill the consumer and try to create a new one?

— 
Bests,
Chris
SoftwareMill
On 2 December 2015 at 17:42:24, Jay Kreps (j...@confluent.io) wrote:

It's worth noting that both the old and new consumer are identical in the  
number of records fetched at once and this is bounded by the fetch size and  
the number of partitions you subscribe to. The old consumer held these in  
memory internally and waited for you to ask for them, the new consumer  
immediately gives you what it has. Overall, though, the new consumer gives  
much better control over what is being fetched since it only uses memory  
when you call poll(); the old consumer had a background thread doing this  
which would only stop when it filled up a queue of unprocessed  
chunks...this is a lot harder to predict.  

-Jay  

On Wed, Dec 2, 2015 at 7:13 AM, Gwen Shapira  wrote:  

> On Wed, Dec 2, 2015 at 10:44 PM, Krzysztof Ciesielski <  
> krzysztof.ciesiel...@softwaremill.pl> wrote:  
>  
> > Hello,  
> >  
> > I’m the main maintainer of Reactive Kafka - a wrapper library that  
> > provides Kafka API as Reactive Streams (  
> > https://github.com/softwaremill/reactive-kafka).  
> > I’m a bit concerned about switching to Kafka 0.9 because of the new  
> > Consumer API which doesn’t seem to fit well into this paradigm, comparing  
> > to the old one. My main concerns are:  
> >  
> > 1. Our current code uses the KafkaIterator and reads messages  
> > sequentially, then sends them further upstream. In the new API, you  
> cannot  
> > control how many messages are returned with poll(), so we would need to  
> > introduce some kind of in-memory buffering.  
> > 2. You cannot specify which offsets to commit. Our current native  
> > committer (  
> >  
> https://github.com/softwaremill/reactive-kafka/blob/4055e88c09b8e08aefe8dbbd4748605df5779b07/core/src/main/scala/com/softwaremill/react/kafka/commit/native/NativeCommitter.scala
>   
> )  
> > uses the OffsetCommitRequest/Response API and  
> > kafka.api.ConsumerMetadataRequest/Response for resolving brokers.  
> Switching  
> > to Kafka 0.9 brings some compilation errors that raise questions.  
> >  
> > My questions are:  
> >  
> > 1. Do I understand the capabilities and limitations of new API correctly?  
> > :)  
> >  
>  
> The first limitation is correct - poll() may return any number of records  
> and you need to handle this.  
> The second is not correct - commitSync() can take a map of TopicPartition  
> and Offsets, so you would only commit specific offsets of specific  
> partitions.  
>  
>  
>  
> > 2. Can we stay with the old iterator-based client, or is it going to get  
> > abandoned in future Kafka versions, or discouraged for some reasons?  
> >  
>  
> It is already a bit behind - only the new client includes support for  
> secured clusters (authentication and encryption). It will get deprecated in  
> the future.  
>  
>  
> > 3. Can we still use the OffsetCommitRequest/Response API to commit  
> > messages manually? If yes, could someone update this example:  
> >  
> https://cwiki.apache.org/confluence/display/KAFKA/Committing+and+fetching+consumer+offsets+in+Kafka
>   
> or  
> > give me a few hints on how to do this with 0.9?  
> >  
>  
> AFAIK, the wire protocol and the API is not going anywhere. Hopefully you  
> can use the new objects we provide in the clients jar  
> (org.apache.kafka.common.requests).  
>  
>  
> >  
> > By the way, we’d like our library to appear on the Ecosystem Wiki, I’m  
> not  
> > sure how to request that officially :)  
> >  
>  
> Let us know what to write there and where to link :)  
>  
>  
> >  
> > —  
> > Bests,  
> > Chris  
> > SoftwareMill  
>  


Re: New Consumer API + Reactive Kafka

2015-12-02 Thread Jay Kreps
It's worth noting that both the old and new consumer are identical in the
number of records fetched at once and this is bounded by the fetch size and
the number of partitions you subscribe to. The old consumer held these in
memory internally and waited for you to ask for them, the new consumer
immediately gives you what it has. Overall, though, the new consumer gives
much better control over what is being fetched since it only uses memory
when you call poll(); the old consumer had a background thread doing this
which would only stop when it filled up a queue of unprocessed
chunks...this is a lot harder to predict.

-Jay

On Wed, Dec 2, 2015 at 7:13 AM, Gwen Shapira  wrote:

> On Wed, Dec 2, 2015 at 10:44 PM, Krzysztof Ciesielski <
> krzysztof.ciesiel...@softwaremill.pl> wrote:
>
> > Hello,
> >
> > I’m the main maintainer of Reactive Kafka - a wrapper library that
> > provides Kafka API as Reactive Streams (
> > https://github.com/softwaremill/reactive-kafka).
> > I’m a bit concerned about switching to Kafka 0.9 because of the new
> > Consumer API which doesn’t seem to fit well into this paradigm, comparing
> > to the old one. My main concerns are:
> >
> > 1. Our current code uses the KafkaIterator and reads messages
> > sequentially, then sends them further upstream. In the new API, you
> cannot
> > control how many messages are returned with poll(), so we would need to
> > introduce some kind of in-memory buffering.
> > 2. You cannot specify which offsets to commit. Our current native
> > committer (
> >
> https://github.com/softwaremill/reactive-kafka/blob/4055e88c09b8e08aefe8dbbd4748605df5779b07/core/src/main/scala/com/softwaremill/react/kafka/commit/native/NativeCommitter.scala
> )
> > uses the OffsetCommitRequest/Response API and
> > kafka.api.ConsumerMetadataRequest/Response for resolving brokers.
> Switching
> > to Kafka 0.9 brings some compilation errors that raise questions.
> >
> > My questions are:
> >
> > 1. Do I understand the capabilities and limitations of new API correctly?
> > :)
> >
>
> The first limitation is correct - poll() may return any number of records
> and you need to handle this.
> The second is not correct - commitSync() can take a map of TopicPartition
> and Offsets, so you would only commit specific offsets of specific
> partitions.
>
>
>
> > 2. Can we stay with the old iterator-based client, or is it going to get
> > abandoned in future Kafka versions, or discouraged for some reasons?
> >
>
> It is already a bit behind - only the new client includes support for
> secured clusters (authentication and encryption). It will get deprecated in
> the future.
>
>
> > 3. Can we still use the OffsetCommitRequest/Response API to commit
> > messages manually? If yes, could someone update this example:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Committing+and+fetching+consumer+offsets+in+Kafka
> or
> > give me a few hints on how to do this with 0.9?
> >
>
> AFAIK, the wire protocol and the API is not going anywhere. Hopefully you
> can use the new objects we provide in the clients jar
> (org.apache.kafka.common.requests).
>
>
> >
> > By the way, we’d like our library to appear on the Ecosystem Wiki, I’m
> not
> > sure how to request that officially :)
> >
>
> Let us know what to write there and where to link :)
>
>
> >
> > —
> > Bests,
> > Chris
> > SoftwareMill
>


Re: New Consumer API + Reactive Kafka

2015-12-02 Thread Gwen Shapira
On Wed, Dec 2, 2015 at 10:44 PM, Krzysztof Ciesielski <
krzysztof.ciesiel...@softwaremill.pl> wrote:

> Hello,
>
> I’m the main maintainer of Reactive Kafka - a wrapper library that
> provides Kafka API as Reactive Streams (
> https://github.com/softwaremill/reactive-kafka).
> I’m a bit concerned about switching to Kafka 0.9 because of the new
> Consumer API which doesn’t seem to fit well into this paradigm, comparing
> to the old one. My main concerns are:
>
> 1. Our current code uses the KafkaIterator and reads messages
> sequentially, then sends them further upstream. In the new API, you cannot
> control how many messages are returned with poll(), so we would need to
> introduce some kind of in-memory buffering.
> 2. You cannot specify which offsets to commit. Our current native
> committer (
> https://github.com/softwaremill/reactive-kafka/blob/4055e88c09b8e08aefe8dbbd4748605df5779b07/core/src/main/scala/com/softwaremill/react/kafka/commit/native/NativeCommitter.scala)
> uses the OffsetCommitRequest/Response API and
> kafka.api.ConsumerMetadataRequest/Response for resolving brokers. Switching
> to Kafka 0.9 brings some compilation errors that raise questions.
>
> My questions are:
>
> 1. Do I understand the capabilities and limitations of new API correctly?
> :)
>

The first limitation is correct - poll() may return any number of records
and you need to handle this.
The second is not correct - commitSync() can take a map of TopicPartition
and Offsets, so you would only commit specific offsets of specific
partitions.
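
For example, something along these lines (topic and partition are placeholders, lastProcessedOffset is whatever the application has tracked, and the convention is to commit the offset of the next message to read, hence the + 1):

    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
    offsets.put(new TopicPartition("my-topic", 0),
                new OffsetAndMetadata(lastProcessedOffset + 1));
    consumer.commitSync(offsets);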



> 2. Can we stay with the old iterator-based client, or is it going to get
> abandoned in future Kafka versions, or discouraged for some reasons?
>

It is already a bit behind - only the new client includes support for
secured clusters (authentication and encryption). It will get deprecated in
the future.


> 3. Can we still use the OffsetCommitRequest/Response API to commit
> messages manually? If yes, could someone update this example:
> https://cwiki.apache.org/confluence/display/KAFKA/Committing+and+fetching+consumer+offsets+in+Kafka
>  or
> give me a few hints on how to do this with 0.9?
>

AFAIK, the wire protocol and the API is not going anywhere. Hopefully you
can use the new objects we provide in the clients jar
(org.apache.kafka.common.requests).


>
> By the way, we’d like our library to appear on the Ecosystem Wiki, I’m not
> sure how to request that officially :)
>

Let us know what to write there and where to link :)


>
> —
> Bests,
> Chris
> SoftwareMill


Re: new consumer API & release 0.8.3

2015-09-04 Thread Jason Gustafson
Hey Shashank,

If you'd like to get started with the new consumer, I urge you to checkout
trunk and take it for a spin. The API is still a little unstable, but I
doubt that changes from here on will be too dramatic. If you have any
questions or run into any issues, this mailing list is a great place to get
help.

Also, I think 0.8.3 is starting to get closer, but I'm not sure that any
specific dates have been announced.

Thanks,
Jason

On Fri, Sep 4, 2015 at 12:20 PM, Shashank Singh 
wrote:

> Hi
>
> I am eager to get to use the enhanced Consumer API which provides better
> control in terms of offset management etc. As I believe from reading
> through forums it is coming as part of 0.8.3 release. However there is no
> tentative date for the same.
>
> Can you please give any hint on that. Also which is the best forum to ask
> questions on how these new APIs are shaping up and the details about the
> same..
>
> --
>
> *Warm Regards,*
>
> *Shashank  *
>
> *Mobile: +91 9910478553 *
>
> *Linkedin: in.linkedin.com/pub/shashank-singh/13/763/906/
> *
>


Re: New Consumer API and Range Consumption with Fail-over

2015-08-05 Thread Bhavesh Mistry
Hi Jason,

Thanks for info.  I will implement (by end of next week) what you have
proposed.  If I encounter any issue,  I will let you know.

Indeed, adding new API would be uphill battle.  I did follow email chain
"Re: Kafka Consumer thoughts".

Thanks,

Bhavesh

On Wed, Aug 5, 2015 at 10:03 AM, Jason Gustafson  wrote:

> Hey Bhavesh,
>
> I think your use case can be handled with the new consumer API in roughly
> the manner I suggested previously. It might be a little easier if we added
> the ability to set the end offset for consumption. Perhaps something like
> this:
>
> // stop consumption from the partition when offset is reached
> void limit(TopicPartition partition, long offset)
>
> My guess is that we'd have a bit of an uphill battle to get this into the
> first release, but it may be possible if the use case is common enough. In
> any case, I think consuming to the limit offset and manually pausing the
> partition is a viable alternative.
>
> As for your question about fail-over, the new consumer provides a similar
> capability to the old high-level consumer. Here is a link to the wiki which
> describes its design:
>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
>
> -Jason
>
> On Tue, Aug 4, 2015 at 12:01 AM, Bhavesh Mistry <
> mistry.p.bhav...@gmail.com>
> wrote:
>
> > Hi Jason and Kafka Dev Team,
> >
> >
> >
> > First of all thanks for responding and I think you got expected behavior
> > correctly.
> >
> >
> >
> > The use-case is offset range consumption.  We store each minute highest
> > offset for each topic per partition.  So if we need to reload or
> re-consume
> > data from yesterday per say 8AM to noon, we would have offset start
> mapping
> > at 8AM and end offset mapping at noon in Time Series Database.
> >
> >
> >
> > I was trying to load this use case with New Consumer API.   Do you or
> Kafka
> > Dev team agree with request to either have API that takes in topic and
> its
> > start/end offset for High Level Consumer group  (With older consumer API
> we
> > used Simple consumer before without fail-over).  Also, for each
> > range-consumption, there will be different group id  and group id will
> not
> > be reused.  The main purpose is to reload or process past data again (due
> > to production bugs or downtime etc occasionally and let main
> consumer-group
> > continue to consume latest records).
> >
> >
> > void subscribe(TopicPartition[] startOffsetPartitions, TopicPartition[]
> > endOffsetPartitions)
> >
> >
> >
> > or something similar which will allow following:
> >
> >
> >
> > 1)   When consumer group already exists (meaning have consumed data and
> > committed offset to storage system either Kafka or ZK) ignore start
> offset
> > positions and use committed offset.  If not committed use start Offset
> > Partition.
> >
> > 2)   When partition consumption has reached end Offset for given
> partition,
> > pause is fine or this assigned thread become fail over or wait for
> > reassignment.
> >
> > 3)   When all are Consumer Group is done consuming all partitions offset
> > ranges (start to end), gracefully shutdown entire consumer group.
> >
> > 4)   While consuming records, if one of node or consuming thread goes
> down
> > automatic fail-over to others (Similar to High Level Consumer for OLD
> > Consumer API.   I am not sure if there exists High level and/or Simple
> > Consumer concept for New API  )
> >
> >
> >
> > I hope above explanation clarifies use-case and intended behavior.
> Thanks
> > for clarifications, and you are correct we need pause(TopicPartition tp),
> > resume(TopicPartition tp), and/or API to set to end offset for each
> > partition.
> >
> >
> >
> > Please do let us know your preference to support above simple use-case.
> >
> >
> > Thanks,
> >
> >
> > Bhavesh
> >
> > On Thu, Jul 30, 2015 at 1:23 PM, Jason Gustafson 
> > wrote:
> >
> > > Hi Bhavesh,
> > >
> > > I'm not totally sure I understand the expected behavior, but I think
> this
> > > can work. Instead of seeking to the start of the range before the poll
> > > loop, you should probably provide a ConsumerRebalanceCallback to get
> > > notifications when group assignment has changed (e.g. when one of your
> > > nodes dies). When a new partition is assigned, the callback will be
> > invoked
> > > by the consumer and you can use it to check if there's a committed
> > position
> > > in the range or if you need to seek to the beginning of the range. For
> > > example:
> > >
> > > void onPartitionsAssigned(consumer, partitions) {
> > >   for (partition : partitions) {
> > >  try {
> > >offset = consumer.committed(partition)
> > >consumer.seek(partition, offset)
> > >  } catch (NoOffsetForPartition) {
> > >consumer.seek(partition, rangeStart)
> > >  }
> > >   }
> > > }
> > >
> > > If a failure occurs, then the partitions will be rebalanced across
> > > whichever consumers are still active. The case of the entire cluster
> > being
> > > reb

Re: New Consumer API and Range Consumption with Fail-over

2015-08-05 Thread Jason Gustafson
Hey Bhavesh,

I think your use case can be handled with the new consumer API in roughly
the manner I suggested previously. It might be a little easier if we added
the ability to set the end offset for consumption. Perhaps something like
this:

// stop consumption from the partition when offset is reached
void limit(TopicPartition partition, long offset)

My guess is that we'd have a bit of an uphill battle to get this into the
first release, but it may be possible if the use case is common enough. In
any case, I think consuming to the limit offset and manually pausing the
partition is a viable alternative.
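
One possible sketch of that alternative (endOffsets is an assumed Map<TopicPartition, Long> of last offsets loaded from the time-series store, process() is the application's handler, and pause() takes varargs in the 0.9.0.x client):

    Set<TopicPartition> finished = new HashSet<>();
    while (finished.size() < endOffsets.size()) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records) {
            TopicPartition tp = new TopicPartition(record.topic(), record.partition());
            if (record.offset() > endOffsets.get(tp))
                continue;                       // past the end of the range, ignore
            process(record);
            if (record.offset() >= endOffsets.get(tp)) {
                consumer.pause(tp);             // keep the assignment but stop fetching
                finished.add(tp);
            }
        }
    }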

As for your question about fail-over, the new consumer provides a similar
capability to the old high-level consumer. Here is a link to the wiki which
describes its design:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design

-Jason

On Tue, Aug 4, 2015 at 12:01 AM, Bhavesh Mistry 
wrote:

> Hi Jason and Kafka Dev Team,
>
>
>
> First of all thanks for responding and I think you got expected behavior
> correctly.
>
>
>
> The use-case is offset range consumption.  We store each minute highest
> offset for each topic per partition.  So if we need to reload or re-consume
> data from yesterday per say 8AM to noon, we would have offset start mapping
> at 8AM and end offset mapping at noon in Time Series Database.
>
>
>
> I was trying to load this use case with New Consumer API.   Do you or Kafka
> Dev team agree with request to either have API that takes in topic and its
> start/end offset for High Level Consumer group  (With older consumer API we
> used Simple consumer before without fail-over).  Also, for each
> range-consumption, there will be different group id  and group id will not
> be reused.  The main purpose is to reload or process past data again (due
> to production bugs or downtime etc occasionally and let main consumer-group
> continue to consume latest records).
>
>
> void subscribe(TopicPartition[] startOffsetPartitions, TopicPartition[]
> endOffsetPartitions)
>
>
>
> or something similar which will allow following:
>
>
>
> 1)   When consumer group already exists (meaning have consumed data and
> committed offset to storage system either Kafka or ZK) ignore start offset
> positions and use committed offset.  If not committed use start Offset
> Partition.
>
> 2)   When partition consumption has reached end Offset for given partition,
> pause is fine or this assigned thread become fail over or wait for
> reassignment.
>
> 3)   When all are Consumer Group is done consuming all partitions offset
> ranges (start to end), gracefully shutdown entire consumer group.
>
> 4)   While consuming records, if one of node or consuming thread goes down
> automatic fail-over to others (Similar to High Level Consumer for OLD
> Consumer API.   I am not sure if there exists High level and/or Simple
> Consumer concept for New API  )
>
>
>
> I hope above explanation clarifies use-case and intended behavior.  Thanks
> for clarifications, and you are correct we need pause(TopicPartition tp),
> resume(TopicPartition tp), and/or API to set to end offset for each
> partition.
>
>
>
> Please do let us know your preference to support above simple use-case.
>
>
> Thanks,
>
>
> Bhavesh
>
> On Thu, Jul 30, 2015 at 1:23 PM, Jason Gustafson 
> wrote:
>
> > Hi Bhavesh,
> >
> > I'm not totally sure I understand the expected behavior, but I think this
> > can work. Instead of seeking to the start of the range before the poll
> > loop, you should probably provide a ConsumerRebalanceCallback to get
> > notifications when group assignment has changed (e.g. when one of your
> > nodes dies). When a new partition is assigned, the callback will be
> invoked
> > by the consumer and you can use it to check if there's a committed
> position
> > in the range or if you need to seek to the beginning of the range. For
> > example:
> >
> > void onPartitionsAssigned(consumer, partitions) {
> >   for (partition : partitions) {
> >  try {
> >offset = consumer.committed(partition)
> >consumer.seek(partition, offset)
> >  } catch (NoOffsetForPartition) {
> >consumer.seek(partition, rangeStart)
> >  }
> >   }
> > }
> >
> > If a failure occurs, then the partitions will be rebalanced across
> > whichever consumers are still active. The case of the entire cluster
> being
> > rebooted is not really different. When the consumers come back, they
> check
> > the committed position and resume where they left off. Does that make
> > sense?
> >
> > After you are finished consuming a partition's range, you can use
> > KafkaConsumer.pause(partition) to prevent further fetches from being
> > initiated while still maintaining the current assignment. The patch to
> add
> > pause() is not in trunk yet, but it probably will be before too long.
> >
> > One potential problem is that you wouldn't be able to reuse the same
> group
> > to consume a different range because of the way it depends on the
> comm

Re: new consumer api?

2015-08-04 Thread Jason Gustafson
Hey Simon,

The new consumer has the ability to forego group management and assign
partitions directly. Once assigned, you can seek to any offset you want.
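
A minimal sketch of that usage (the topic, partition, startOffset and props are placeholders):

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    TopicPartition tp = new TopicPartition("my-topic", 0);
    consumer.assign(Arrays.asList(tp));   // no group, no coordinator, no commits required
    consumer.seek(tp, startOffset);
    ConsumerRecords<String, String> records = consumer.poll(1000);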

-Jason

On Tue, Aug 4, 2015 at 5:08 AM, Simon Cooper <
simon.coo...@featurespace.co.uk> wrote:

> Reading on the consumer docs, there's no mention of a relatively simple
> consumer that doesn't need groups, coordinators, commits, anything like
> that - just read and poll from specified offsets of specific topic
> partitions - but automatically deals with leadership changes and connection
> losses (so one level up from SimpleConsumer).
>
> Will the new API be able to be used in this relatively simple way?
> SimonC
>
> -Original Message-
> From: Jun Rao [mailto:j...@confluent.io]
> Sent: 03 August 2015 18:19
> To: users@kafka.apache.org
> Subject: Re: new consumer api?
>
> Jalpesh,
>
> We are still iterating on the new consumer a bit and are waiting for some
> of the security jiras to be committed. So now, we are shooting for
> releasing 0.8.3 in Oct (just updated
> https://cwiki.apache.org/confluence/display/KAFKA/Future+release+plan).
>
> Thanks,
>
> Jun
>
> On Mon, Aug 3, 2015 at 8:41 AM, Jalpesh Patadia <
> jalpesh.pata...@clickbank.com> wrote:
>
> > Hello guys,
> >
> > A while ago i read that the new consumer api was going to be released
> > sometime in July as part of the 0.8.3/0.9 release.
> > https://cwiki.apache.org/confluence/display/KAFKA/Future+release+plan
> >
> >
> > Do we have an update when we think that can happen?
> >
> >
> > Thanks,
> >
> > Jalpesh
> >
> >
> >
>


RE: new consumer api?

2015-08-04 Thread Simon Cooper
Reading on the consumer docs, there's no mention of a relatively simple 
consumer that doesn't need groups, coordinators, commits, anything like that - 
just read and poll from specified offsets of specific topic partitions - but 
automatically deals with leadership changes and connection losses (so one level 
up from SimpleConsumer).

Will the new API be able to be used in this relatively simple way?
SimonC

-Original Message-
From: Jun Rao [mailto:j...@confluent.io] 
Sent: 03 August 2015 18:19
To: users@kafka.apache.org
Subject: Re: new consumer api?

Jalpesh,

We are still iterating on the new consumer a bit and are waiting for some of 
the security jiras to be committed. So now, we are shooting for releasing 0.8.3 
in Oct (just updated 
https://cwiki.apache.org/confluence/display/KAFKA/Future+release+plan).

Thanks,

Jun

On Mon, Aug 3, 2015 at 8:41 AM, Jalpesh Patadia < 
jalpesh.pata...@clickbank.com> wrote:

> Hello guys,
>
> A while ago i read that the new consumer api was going to be released 
> sometime in July as part of the 0.8.3/0.9 release.
> https://cwiki.apache.org/confluence/display/KAFKA/Future+release+plan
>
>
> Do we have an update when we think that can happen?
>
>
> Thanks,
>
> Jalpesh
>
>
>


Re: New Consumer API and Range Consumption with Fail-over

2015-08-04 Thread Bhavesh Mistry
Hi Jason and Kafka Dev Team,



First of all thanks for responding and I think you got expected behavior
correctly.



The use-case is offset range consumption.  We store each minute highest
offset for each topic per partition.  So if we need to reload or re-consume
data from yesterday per say 8AM to noon, we would have offset start mapping
at 8AM and end offset mapping at noon in Time Series Database.



I was trying to load this use case with New Consumer API.   Do you or Kafka
Dev team agree with request to either have API that takes in topic and its
start/end offset for High Level Consumer group  (With older consumer API we
used Simple consumer before without fail-over).  Also, for each
range-consumption, there will be different group id  and group id will not
be reused.  The main purpose is to reload or process past data again (due
to production bugs or downtime etc occasionally and let main consumer-group
continue to consume latest records).


void subscribe(TopicPartition[] startOffsetPartitions, TopicPartition[]
endOffsetPartitions)



or something similar which will allow following:



1)   When consumer group already exists (meaning have consumed data and
committed offset to storage system either Kafka or ZK) ignore start offset
positions and use committed offset.  If not committed use start Offset
Partition.

2)   When partition consumption has reached end Offset for given partition,
pause is fine or this assigned thread become fail over or wait for
reassignment.

3)   When all are Consumer Group is done consuming all partitions offset
ranges (start to end), gracefully shutdown entire consumer group.

4)   While consuming records, if one of node or consuming thread goes down
automatic fail-over to others (Similar to High Level Consumer for OLD
Consumer API.   I am not sure if there exists High level and/or Simple
Consumer concept for New API  )



I hope above explanation clarifies use-case and intended behavior.  Thanks
for clarifications, and you are correct we need pause(TopicPartition tp),
resume(TopicPartition tp), and/or API to set to end offset for each
partition.



Please do let us know your preference to support above simple use-case.


Thanks,


Bhavesh

On Thu, Jul 30, 2015 at 1:23 PM, Jason Gustafson  wrote:

> Hi Bhavesh,
>
> I'm not totally sure I understand the expected behavior, but I think this
> can work. Instead of seeking to the start of the range before the poll
> loop, you should probably provide a ConsumerRebalanceCallback to get
> notifications when group assignment has changed (e.g. when one of your
> nodes dies). When a new partition is assigned, the callback will be invoked
> by the consumer and you can use it to check if there's a committed position
> in the range or if you need to seek to the beginning of the range. For
> example:
>
> void onPartitionsAssigned(consumer, partitions) {
>   for (partition : partitions) {
>  try {
>offset = consumer.committed(partition)
>consumer.seek(partition, offset)
>  } catch (NoOffsetForPartition) {
>consumer.seek(partition, rangeStart)
>  }
>   }
> }
>
> If a failure occurs, then the partitions will be rebalanced across
> whichever consumers are still active. The case of the entire cluster being
> rebooted is not really different. When the consumers come back, they check
> the committed position and resume where they left off. Does that make
> sense?
>
> After you are finished consuming a partition's range, you can use
> KafkaConsumer.pause(partition) to prevent further fetches from being
> initiated while still maintaining the current assignment. The patch to add
> pause() is not in trunk yet, but it probably will be before too long.
>
> One potential problem is that you wouldn't be able to reuse the same group
> to consume a different range because of the way it depends on the committed
> offsets. Kafka's commit API actually allows some additional metadata to go
> along with a committed offset and that could potentially be used to tie the
> commit to the range, but it's not yet exposed in KafkaConsumer. I assume it
> will be eventually, but I'm not sure whether that will be part of the
> initial release.
>
>
> Hope that helps!
>
> Jason
>
> On Thu, Jul 30, 2015 at 7:54 AM, Bhavesh Mistry <
> mistry.p.bhav...@gmail.com>
> wrote:
>
> > Hello Kafka Dev Team,
> >
> >
> > With new Consumer API redesign  (
> >
> >
> https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java
> > ),  is there a capability to consume given the topic and partition
> start/
> > end position.  How would I achieve following use case of range
> consumption
> > with fail-over.
> >
> >
> > Use Case:
> > Ability to reload data given topic and its partition offset start/end
> with
> > High Level Consumer with fail over.   Basically, High Level Range
> > consumption and consumer group dies while main consumer group.
> >
> >
> > Suppose you have a topic called “test-topi

Re: new consumer api?

2015-08-03 Thread Jun Rao
Jalpesh,

We are still iterating on the new consumer a bit and are waiting for some
of the security jiras to be committed. So now, we are shooting for
releasing 0.8.3 in Oct (just updated
https://cwiki.apache.org/confluence/display/KAFKA/Future+release+plan).

Thanks,

Jun

On Mon, Aug 3, 2015 at 8:41 AM, Jalpesh Patadia <
jalpesh.pata...@clickbank.com> wrote:

> Hello guys,
>
> A while ago i read that the new consumer api was going to be released
> sometime in July as part of the 0.8.3/0.9 release.
> https://cwiki.apache.org/confluence/display/KAFKA/Future+release+plan
>
>
> Do we have an update when we think that can happen?
>
>
> Thanks,
>
> Jalpesh
>
>
>


Re: New Consumer API and Range Consumption with Fail-over

2015-07-30 Thread Jason Gustafson
Hi Bhavesh,

I'm not totally sure I understand the expected behavior, but I think this
can work. Instead of seeking to the start of the range before the poll
loop, you should probably provide a ConsumerRebalanceCallback to get
notifications when group assignment has changed (e.g. when one of your
nodes dies). When a new partition is assigned, the callback will be invoked
by the consumer and you can use it to check if there's a committed position
in the range or if you need to seek to the beginning of the range. For
example:

void onPartitionsAssigned(consumer, partitions) {
  for (partition : partitions) {
 try {
   offset = consumer.committed(partition)
   consumer.seek(partition, offset)
 } catch (NoOffsetForPartition) {
   consumer.seek(partition, rangeStart)
 }
  }
}
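
In the client as eventually released, this callback is spelled ConsumerRebalanceListener; a sketch of the same logic against it (rangeStart(tp) stands in for however the start of the range is looked up, and the consumer reference is assumed to be in scope) might be:

    consumer.subscribe(Arrays.asList("test-topic"), new ConsumerRebalanceListener() {
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }

        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata committed = consumer.committed(tp);
                if (committed != null)
                    consumer.seek(tp, committed.offset());   // resume where the group left off
                else
                    consumer.seek(tp, rangeStart(tp));        // no prior commit: start of range
            }
        }
    });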

If a failure occurs, then the partitions will be rebalanced across
whichever consumers are still active. The case of the entire cluster being
rebooted is not really different. When the consumers come back, they check
the committed position and resume where they left off. Does that make sense?

After you are finished consuming a partition's range, you can use
KafkaConsumer.pause(partition) to prevent further fetches from being
initiated while still maintaining the current assignment. The patch to add
pause() is not in trunk yet, but it probably will be before too long.

One potential problem is that you wouldn't be able to reuse the same group
to consume a different range because of the way it depends on the committed
offsets. Kafka's commit API actually allows some additional metadata to go
along with a committed offset and that could potentially be used to tie the
commit to the range, but it's not yet exposed in KafkaConsumer. I assume it
will be eventually, but I'm not sure whether that will be part of the
initial release.
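
As it turned out, the released OffsetAndMetadata type does take an optional metadata string, so a commit tagged with its range could look roughly like this (partition and nextOffset are assumed variables, and the range label is just an example):

    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
    offsets.put(partition, new OffsetAndMetadata(nextOffset, "range:100-500000"));
    consumer.commitSync(offsets);

    // The tag can later be read back to check which range a commit belongs to:
    String rangeTag = consumer.committed(partition).metadata();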


Hope that helps!

Jason

On Thu, Jul 30, 2015 at 7:54 AM, Bhavesh Mistry 
wrote:

> Hello Kafka Dev Team,
>
>
> With new Consumer API redesign  (
>
> https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java
> ),  is there a capability to consume given the topic and partition  start/
> end position.  How would I achieve following use case of range consumption
> with fail-over.
>
>
> Use Case:
> Ability to reload data given topic and its partition offset start/end with
> High Level Consumer with fail over.   Basically, High Level Range
> consumption and consumer group dies while main consumer group.
>
>
> Suppose you have a topic called “test-topic” and its partition begin and
> end offset.
>
> {
>
> topic:  test-topic,
>
> [   {  partition id : 1 , offset start:   100,  offset end:
> 500,000 },
>
>
> {  partition id : 2 ,  offset start:   200,000, offset end:
> 500,000
>
> ….. for n partitions
>
> ]
>
> }
>
> Each you create consumer group: “Range-Consumer “ and use seek method and
> for each partition.   Your feedback is greatly appreciated.
>
>
> In each JVM,
>
>
> For each consumption tread:
>
>
> Consumer c = KafkaConsumer( { group.id=”Range-consumer}…)
>
> Map parttionTOEndOfsetMapping ….
>
> for(TopicPartition tp : topicPartitionlist){
>
> seek(TopicPartition(Parition 1), long offset)
>
> }
>
>
>
> while(true){
>
> ConsumerRecords records = consumer.poll(1);
>
> // for each record check the offset
>
> record = record.iterator().next();
>
> if(parttionTOEndOfsetMapping(record.getPartition()) <=
> record.getoffset) {
>   // consume  record
>
> //commit  offset
>
>   consumer.commit(CommitType.SYNC);
>
> }else {
>
> // Should I unsubscribe it now  for this partition
> ?
>
> consumer.unscribe(record.getPartition)
>
> }
>
>
>
> }
>
>
>
>
> Please let me know if the above approach is valid:
>
> 1) how will fail-over work.
>
> 2) how does rebooting the entire consumer group impact offset seek? Since offsets
> are stored by Kafka itself.
>
> Thanks ,
>
> Bhavesh
>


Re: New consumer API used in mirror maker

2015-07-12 Thread Jiangjie Qin
Yes, we are going to use new consumer after it is ready.


Jiangjie (Becket) Qin

On 7/12/15, 8:21 PM, "tao xiao"  wrote:

>Hi team,
>
>The trunk code of mirror maker now uses the old consumer API, Is there any
>plan to use new Java consumer api in mirror maker?



Re: New Consumer API discussion

2014-03-27 Thread Neha Narkhede
If people don't have any more thoughts on this, I will go ahead and submit
a reviewboard to https://issues.apache.org/jira/browse/KAFKA-1328.

Thanks,
Neha


On Mon, Mar 24, 2014 at 5:39 PM, Neha Narkhede wrote:

> I took some time to write some example code using the new consumer APIs to
> cover a range of use cases. This exercise was very useful (thanks for the
> suggestion, Jay!) since I found several improvements to the APIs to make
> them more usable. Here are some of the changes I made -
>
> 1. Added usage examples to the KafkaConsumer 
> javadoc.
> I find it useful for the examples to be in the javadoc vs some wiki. Please
> go through these examples and suggest improvements. The goal would be to
> document a limited set of examples that cover every major use case.
> 2. All APIs that either accept or return offsets are changed to
> Map instead of TopicPartitionOffset... In all the
> examples that I wrote, it was much easier to deal with offsets and pass
> them around in the consumer APIs if they were maps instead of lists
> 3. Due to the above change, I had to introduce commit() and commitAsync()
> APIs explicitly, in addition to
> commit(Map offsets) and
> commitAsync(Map offsets), since the no-argument case
> would not be covered automatically with Map as the input parameter to the
> commit APIs
> 4. Offset rewind logic is funky with group management. I took a stab at it
> and wrote examples to cover the various offset rewind use cases I could
> think of. I'm not so sure I like it, so I encourage people to take a look
> at the examples and provide feedback. This feedback is very critical in
> finalizing the consumer APIs as we might have to add/change APIs to make
> offset rewind intuitive and easy to use. (Please see the 3rd and 4th
> examples here)
>
> Once I have feedback on the above, I will go ahead and submit a review
> board for the new APIs and javadoc.
>
> Thanks
> Neha
>
>
> On Mon, Mar 24, 2014 at 5:29 PM, Neha Narkhede wrote:
>
>> Hey Chris,
>>
>> Really sorry for the late reply, wonder how this fell through the cracks.
>> Anyhow, thanks for the great feedback! Here are my comments -
>>
>>
>> 1. Why is the config String->Object instead of String->String?
>>
>> This is probably more of a feedback about the new config management that
>> we adopted in the new clients. I think it is more convenient to write
>> configs.put("a", 42);
>> instead of
>> configs.put("a", Integer.toString(42));
>>
>> 2. Are these Java docs correct?
>>
>>   KafkaConsumer(java.util.Map<
>> java.lang.String,java.lang.Object> configs)
>>   A consumer is instantiated by providing a set of key-value pairs as
>> configuration and a ConsumerRebalanceCallback implementation
>>
>> There is no ConsumerRebalanceCallback parameter.
>>
>> Fixed.
>>
>>
>> 3. Would like to have a method:
>>
>>   poll(long timeout, java.util.concurrent.TimeUnit timeUnit,
>> TopicPartition... topicAndPartitionsToPoll)
>>
>> I see I can effectively do this by just fiddling with subscribe and
>> unsubscribe before each poll. Is this a low-overhead operation? Can I just
>> unsubscribe from everything after each poll, then re-subscribe to a topic
>> the next iteration. I would probably be doing this in a fairly tight loop.
>>
>> The subscribe and unsubscribe will be very lightweight in-memory
>> operations,
>> so it shouldn't be a problem to just use those APIs directly.
>> Let me know if you think otherwise.
>>
>> 4. The behavior of AUTO_OFFSET_RESET_CONFIG is overloaded. I think there
>> are use cases for decoupling "what to do when no offset exists" from "what
>> to do when I'm out of range". I might want to start from smallest the
>> first time I run, but fail if I ever get offset out of range.
>>
>> How about adding a third option "disable" to "auto.offset.reset"?
>> What this says is that never automatically reset the offset, either if
>> one is not found or if the offset
>> falls out of range. Presumably, you would want to turn this off when you
>> want to control the offsets
>> yourself and use custom rewind/replay logic to reset the consumer's
>> offset. In this case, you would
>> want to turn this feature off so Kafka does not accidentally reset the
>> offset to something else.
>>
>> I'm not so sure when you would want to make the distinction regarding
>> startup and offset falling out
>> of range. Presumably, if you don't trust Kafka to reset the offset, then
>> you can always turn this off
>> and use commit/commitAsync and seek() to set the consumer to the right
>> offset on startup and every
>> time your co

Re: New Consumer API discussion

2014-03-24 Thread Neha Narkhede
I took some time to write some example code using the new consumer APIs to
cover a range of use cases. This exercise was very useful (thanks for the
suggestion, Jay!) since I found several improvements to the APIs to make
them more usable. Here are some of the changes I made -

1. Added usage examples to the KafkaConsumer
javadoc.
I find it useful for the examples to be in the javadoc vs some wiki. Please
go through these examples and suggest improvements. The goal would be to
document a limited set of examples that cover every major use case.
2. All APIs that either accept or return offsets are changed to
Map instead of TopicPartitionOffset... In all the
examples that I wrote, it was much easier to deal with offsets and pass
them around in the consumer APIs if they were maps instead of lists
3. Due to the above change, I had to introduce commit() and commitAsync()
APIs explicitly, in addition to
commit(Map offsets) and
commitAsync(Map offsets), since the no-argument case
would not be covered automatically with Map as the input parameter to the
commit APIs
4. Offset rewind logic is funky with group management. I took a stab at it
and wrote examples to cover the various offset rewind use cases I could
think of. I'm not so sure I like it, so I encourage people to take a look
at the examples and provide feedback. This feedback is very critical in
finalizing the consumer APIs as we might have to add/change APIs to make
offset rewind intuitive and easy to use. (Please see the 3rd and 4th
examples here)

Once I have feedback on the above, I will go ahead and submit a review
board for the new APIs and javadoc.

Thanks
Neha


On Mon, Mar 24, 2014 at 5:29 PM, Neha Narkhede wrote:

> Hey Chris,
>
> Really sorry for the late reply, wonder how this fell through the cracks.
> Anyhow, thanks for the great feedback! Here are my comments -
>
>
> 1. Why is the config String->Object instead of String->String?
>
> This is probably more of a feedback about the new config management that
> we adopted in the new clients. I think it is more convenient to write
> configs.put("a", 42);
> instead of
> configs.put("a", Integer.toString(42));
>
> 2. Are these Java docs correct?
>
>   KafkaConsumer(java.util.Map<
> java.lang.String,java.lang.Object> configs)
>   A consumer is instantiated by providing a set of key-value pairs as
> configuration and a ConsumerRebalanceCallback implementation
>
> There is no ConsumerRebalanceCallback parameter.
>
> Fixed.
>
>
> 3. Would like to have a method:
>
>   poll(long timeout, java.util.concurrent.TimeUnit timeUnit,
> TopicPartition... topicAndPartitionsToPoll)
>
> I see I can effectively do this by just fiddling with subscribe and
> unsubscribe before each poll. Is this a low-overhead operation? Can I just
> unsubscribe from everything after each poll, then re-subscribe to a topic
> the next iteration. I would probably be doing this in a fairly tight loop.
>
> The subscribe and unsubscribe will be very lightweight in-memory
> operations,
> so it shouldn't be a problem to just use those APIs directly.
> Let me know if you think otherwise.
>
> 4. The behavior of AUTO_OFFSET_RESET_CONFIG is overloaded. I think there
> are use cases for decoupling "what to do when no offset exists" from "what
> to do when I'm out of range". I might want to start from smallest the
> first time I run, but fail if I ever get offset out of range.
>
> How about adding a third option "disable" to "auto.offset.reset"?
> What this says is that never automatically reset the offset, either if one
> is not found or if the offset
> falls out of range. Presumably, you would want to turn this off when you
> want to control the offsets
> yourself and use custom rewind/replay logic to reset the consumer's
> offset. In this case, you would
> want to turn this feature off so Kafka does not accidentally reset the
> offset to something else.
>
> I'm not so sure when you would want to make the distinction regarding
> startup and offset falling out
> of range. Presumably, if you don't trust Kafka to reset the offset, then
> you can always turn this off
> and use commit/commitAsync and seek() to set the consumer to the right
> offset on startup and every
> time your consumer falls out of range.
>
> Does that make sense?
>
> 5. ENABLE_JMX could use Java docs, even though it's fairly
> self-explanatory.
>
> Fixed.
>
> 6. Clarity about whether FETCH_BUFFER_CONFIG is per-topic/partition, or
> across all topic/partitions is useful. I believe it's per-topic/partition,
> right? That is, setting to 2 megs with two TopicAndPartit

Re: New Consumer API discussion

2014-03-24 Thread Neha Narkhede
Hey Chris,

Really sorry for the late reply, wonder how this fell through the cracks.
Anyhow, thanks for the great feedback! Here are my comments -

1. Why is the config String->Object instead of String->String?

This is probably more of a feedback about the new config management that
we adopted in the new clients. I think it is more convenient to write
configs.put("a", 42);
instead of
configs.put("a", Integer.toString(42));

2. Are these Java docs correct?

  KafkaConsumer(java.util.Map<
java.lang.String,java.lang.Object> configs)
  A consumer is instantiated by providing a set of key-value pairs as
configuration and a ConsumerRebalanceCallback implementation

There is no ConsumerRebalanceCallback parameter.

Fixed.

3. Would like to have a method:

  poll(long timeout, java.util.concurrent.TimeUnit timeUnit,
TopicPartition... topicAndPartitionsToPoll)

I see I can effectively do this by just fiddling with subscribe and
unsubscribe before each poll. Is this a low-overhead operation? Can I just
unsubscribe from everything after each poll, then re-subscribe to a topic
the next iteration? I would probably be doing this in a fairly tight loop.

The subscribe and unsubscribe will be very lightweight in-memory operations,
so it shouldn't be a problem to just use those APIs directly.
Let me know if you think otherwise.
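For what it's worth, the per-iteration pattern you describe would look
roughly like this (a sketch using the method names from the draft javadoc;
nextTopicToRead() and process() are hypothetical):

while (running) {
    String topic = nextTopicToRead();               // pick the topic for this iteration
    consumer.subscribe(topic);                      // cheap, in-memory
    List<ConsumerRecord> records = consumer.poll(100, TimeUnit.MILLISECONDS);
    process(records);
    consumer.unsubscribe(topic);                    // also cheap, in-memory
}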

4. The behavior of AUTO_OFFSET_RESET_CONFIG is overloaded. I think there
are use cases for decoupling "what to do when no offset exists" from "what
to do when I'm out of range". I might want to start from smallest the
first time I run, but fail if I ever get offset out of range.

How about adding a third option, "disable", to "auto.offset.reset"? What
this says is: never automatically reset the offset, either when no offset
is found or when the offset falls out of range. Presumably, you would
choose this when you want to control the offsets yourself and use custom
rewind/replay logic to reset the consumer's offset, so that Kafka does not
accidentally reset the offset to something else.

I'm not so sure when you would want to make the distinction regarding
startup and offset falling out
of range. Presumably, if you don't trust Kafka to reset the offset, then
you can always turn this off
and use commit/commitAsync and seek() to set the consumer to the right
offset on startup and every
time your consumer falls out of range.

Does that make sense?
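To make the custom-reset pattern concrete, here is a sketch (the "disable"
value is the proposal above and eventually shipped as "none";
NoOffsetForPartitionException is the exception the released client throws
in that case, and findReplayOffset()/process() are hypothetical):

configs.put("auto.offset.reset", "disable");    // "none" in the released client

try {
    List<ConsumerRecord> records = consumer.poll(timeout, TimeUnit.MILLISECONDS);
    process(records);
} catch (NoOffsetForPartitionException e) {
    // No committed offset and no automatic reset: decide where to start yourself.
    consumer.seek(partition, findReplayOffset(partition));
}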

5. ENABLE_JMX could use Java docs, even though it's fairly
self-explanatory.

Fixed.

6. Clarity about whether FETCH_BUFFER_CONFIG is per-topic/partition, or
across all topic/partitions is useful. I believe it's per-topic/partition,
right? That is, setting to 2 megs with two TopicAndPartitions would result
in 4 megs worth of data coming in per fetch, right?

Good point, clarified that. Take a look again to see if it makes sense now.
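Put differently, the worst case is roughly the per-partition fetch buffer
times the number of partitions being fetched, e.g.:

int fetchBufferPerPartition = 2 * 1024 * 1024;                        // 2 MB, as above
int partitions = 2;
long worstCaseInFlight = (long) fetchBufferPerPartition * partitions; // ~4 MB per fetch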

7. What does the consumer do if METADATA_FETCH_TIMEOUT_CONFIG times out?
Retry, or throw exception?

Throw a TimeoutException. Clarified that in the docs.

8. Does RECONNECT_BACKOFF_MS_CONFIG apply to both metadata requests and
fetch requests?

Applies to all requests. Clarified that in the docs.

9. What does SESSION_TIMEOUT_MS default to?

Defaults are largely TODO, but session.timeout.ms currently defaults to
1000.

10. Is this consumer thread-safe?

It should be. Updated the docs to clarify that.

11. How do you use a different offset management strategy? Your email
implies that it's pluggable, but I don't see how. "The offset management
strategy defaults to Kafka based offset management and the API provides a
way for the user to use a customized offset store to manage the consumer's
offsets."

12. If I wish to decouple the consumer from the offset checkpointing, is
it OK to use Joel's offset management stuff directly, rather than through
the consumer's commit API?

For #11 and #12, I updated the docs to include actual usage examples.
Could you take a look and see if that answers your questions?
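The external-store pattern from #11/#12 boils down to roughly this shape (a
sketch; offsetStore and process() are hypothetical, and the partition-level
subscribe/seek/position methods are the ones from the draft javadoc):

// On startup: attach to a specific partition and restore its position
// from your own store instead of Kafka.
consumer.subscribe(partition.topic(), partition.partition());
consumer.seek(partition, offsetStore.read(partition));

while (running) {
    List<ConsumerRecord> records = consumer.poll(timeout, TimeUnit.MILLISECONDS);
    process(records);
    offsetStore.write(partition, consumer.position(partition));  // checkpoint externally
}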

Thanks,
Neha



On Mon, Mar 3, 2014 at 10:28 AM, Chris Riccomini wrote:

> Hey Guys,
>
> Also, for reference, we'll be looking to implement new Samza consumers
> which have these APIs:
>
> http://samza.incubator.apache.org/learn/documentation/0.7.0/api/javadocs/or
> g/apache/samza/system/SystemConsumer.html
>
> http://samza.incubator.apache.org/learn/documentation/0.7.0/api/javadocs/or
> g/apache/samza/checkpoint/CheckpointManager.html
>
>
> Question (3) below is a result of having Samza's SystemConsumers poll
> allow specific topic/partitions to be specified.
>
> The split between consumer and checkpoint manager is the reason for
> question (12) be

Re: New Consumer API discussion

2014-03-17 Thread Neha Narkhede
I'm not quite sure I fully understood your question. The consumer API
exposes a close() method that shuts down the consumer's connections to
all brokers and frees up the resources that the consumer uses.

I've updated the javadoc for the new consumer API to include a few examples
of different ways of using the consumer. You might find it useful -
http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
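For the shutdown-from-another-thread scenario in the question below, one
workable shape is roughly this (a sketch; the wakeup()/WakeupException pair
is from the released 0.9 client, "closed" is a shared AtomicBoolean, and
process() is hypothetical):

// Polling thread
try {
    while (!closed.get()) {
        process(consumer.poll(timeout, TimeUnit.MILLISECONDS));
    }
} catch (WakeupException e) {
    // expected when shutdown was requested; rethrow otherwise
    if (!closed.get()) throw e;
} finally {
    consumer.close();           // frees connections and other resources
}

// Shutdown thread (e.g. looked up from your own groupId -> consumer map)
closed.set(true);
consumer.wakeup();              // interrupts a blocking poll()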

Thanks,
Neha


On Sun, Mar 16, 2014 at 7:55 PM, Shanmugam, Srividhya <
srividhyashanmu...@fico.com> wrote:

> Can the consumer API provide a way to shut down the connector by doing a
> look up by the consumer group Id? For example, application may be consuming
> the messages in one thread whereas the shutdown call can  be initiated in a
> different thread.


Re: New Consumer API discussion

2014-03-16 Thread Shanmugam, Srividhya
Can the consumer API provide a way to shut down the connector by doing a
lookup by the consumer group Id? For example, the application may be
consuming messages in one thread while the shutdown call is initiated in a
different thread.



Re: New Consumer API discussion

2014-03-03 Thread Chris Riccomini
Hey Guys,

Also, for reference, we'll be looking to implement new Samza consumers
which have these APIs:

http://samza.incubator.apache.org/learn/documentation/0.7.0/api/javadocs/or
g/apache/samza/system/SystemConsumer.html

http://samza.incubator.apache.org/learn/documentation/0.7.0/api/javadocs/or
g/apache/samza/checkpoint/CheckpointManager.html


Question (3) below is a result of Samza's SystemConsumers poll allowing
specific topic/partitions to be specified.

The split between consumer and checkpoint manager is the reason for
question (12) below.

Cheers,
Chris

On 3/3/14 10:19 AM, "Chris Riccomini"  wrote:

>Hey Guys,
>
>Sorry for the late follow up. Here are my questions/thoughts on the API:
>
>1. Why is the config String->Object instead of String->String?
>
>2. Are these Java docs correct?
>
>  KafkaConsumer(java.util.Map<java.lang.String,java.lang.Object> configs)
>  A consumer is instantiated by providing a set of key-value pairs as
>configuration and a ConsumerRebalanceCallback implementation
>
>There is no ConsumerRebalanceCallback parameter.
>
>3. Would like to have a method:
>
>  poll(long timeout, java.util.concurrent.TimeUnit timeUnit,
>TopicPartition... topicAndPartitionsToPoll)
>
>I see I can effectively do this by just fiddling with subscribe and
>unsubscribe before each poll. Is this a low-overhead operation? Can I just
>unsubscribe from everything after each poll, then re-subscribe to a topic
>the next iteration. I would probably be doing this in a fairly tight loop.
>
>4. The behavior of AUTO_OFFSET_RESET_CONFIG is overloaded. I think there
>are use cases for decoupling "what to do when no offset exists" from "what
>to do when I'm out of range". I might want to start from smallest the
>first time I run, but fail if I ever get offset out of range.
>
>5. ENABLE_JMX could use Java docs, even though it's fairly
>self-explanatory.
>
>6. Clarity about whether FETCH_BUFFER_CONFIG is per-topic/partition, or
>across all topic/partitions is useful. I believe it's per-topic/partition,
>right? That is, setting to 2 megs with two TopicAndPartitions would result
>in 4 megs worth of data coming in per fetch, right?
>
>7. What does the consumer do if METADATA_FETCH_TIMEOUT_CONFIG times out?
>Retry, or throw exception?
>
>8. Does RECONNECT_BACKOFF_MS_CONFIG apply to both metadata requests and
>fetch requests?
>
>9. What does SESSION_TIMEOUT_MS default to?
>
>10. Is this consumer thread-safe?
>
>11. How do you use a different offset management strategy? Your email
>implies that it's pluggable, but I don't see how. "The offset management
>strategy defaults to Kafka based offset management and the API provides a
>way for the user to use a customized offset store to manage the consumer's
>offsets."
>
>12. If I wish to decouple the consumer from the offset checkpointing, is
>it OK to use Joel's offset management stuff directly, rather than through
>the consumer's commit API?
>
>
>Cheers,
>Chris
>
>On 2/10/14 10:54 AM, "Neha Narkhede"  wrote:
>
>>As mentioned in previous emails, we are also working on a
>>re-implementation
>>of the consumer. I would like to use this email thread to discuss the
>>details of the public API. I would also like us to be picky about this
>>public api now so it is as good as possible and we don't need to break it
>>in the future.
>>
>>The best way to get a feel for the API is actually to take a look at the
>>javadoc
>><http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html>,
>>the hope is to get the api docs good enough so that it is
>>self-explanatory.
>>You can also take a look at the configs here
>><http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html>
>>
>>Some background info on implementation:
>>
>>At a high level the primary difference in this consumer is that it
>>removes
>>the distinction between the "high-level" and "low-level" consumer. The
>>new
>>consumer API is non blocking and instead of returning a blocking
>>iterator,
>>the consumer provides a poll() API that returns a list of records. We
>>think
>>this is better compared to the blocking iterators since it effectively
>>decouples the threading strategy used for processing messages from the
>>consumer. It is worth noting that the consumer is entirely single
>>threaded
>>and runs in the user thread. The advantage is that it can be easily
>>rewritten in less multi-threading-friendly languages. The consumer
>>batches
>>data and multiplexes I/O over TCP connections to each of the brokers it
>>communicates with, for high throughput. The consumer also allows long
>>poll
>>to reduce the end-to-end message latency for low throughput data.
>>
>>The consumer provides a group management facility that supports the
>>concept
>>of a group with multiple consumer instances (just like the current
>>consumer). This is done through a custom heartbeat and group management
>>protocol transparent to the user. At the same time, it allows users th

Re: New Consumer API discussion

2014-03-03 Thread Chris Riccomini
Hey Guys,

Sorry for the late follow up. Here are my questions/thoughts on the API:

1. Why is the config String->Object instead of String->String?

2. Are these Java docs correct?

  KafkaConsumer(java.util.Map<java.lang.String,java.lang.Object> configs)
  A consumer is instantiated by providing a set of key-value pairs as
configuration and a ConsumerRebalanceCallback implementation

There is no ConsumerRebalanceCallback parameter.

3. Would like to have a method:

  poll(long timeout, java.util.concurrent.TimeUnit timeUnit,
TopicPartition... topicAndPartitionsToPoll)

I see I can effectively do this by just fiddling with subscribe and
unsubscribe before each poll. Is this a low-overhead operation? Can I just
unsubscribe from everything after each poll, then re-subscribe to a topic
the next iteration? I would probably be doing this in a fairly tight loop.

4. The behavior of AUTO_OFFSET_RESET_CONFIG is overloaded. I think there
are use cases for decoupling "what to do when no offset exists" from "what
to do when I'm out of range". I might want to start from smallest the
first time I run, but fail if I ever get offset out of range.

5. ENABLE_JMX could use Java docs, even though it's fairly
self-explanatory.

6. Clarity about whether FETCH_BUFFER_CONFIG is per-topic/partition, or
across all topic/partitions is useful. I believe it's per-topic/partition,
right? That is, setting to 2 megs with two TopicAndPartitions would result
in 4 megs worth of data coming in per fetch, right?

7. What does the consumer do if METADATA_FETCH_TIMEOUT_CONFIG times out?
Retry, or throw exception?

8. Does RECONNECT_BACKOFF_MS_CONFIG apply to both metadata requests and
fetch requests?

9. What does SESSION_TIMEOUT_MS default to?

10. Is this consumer thread-safe?

11. How do you use a different offset management strategy? Your email
implies that it's pluggable, but I don't see how. "The offset management
strategy defaults to Kafka based offset management and the API provides a
way for the user to use a customized offset store to manage the consumer's
offsets."

12. If I wish to decouple the consumer from the offset checkpointing, is
it OK to use Joel's offset management stuff directly, rather than through
the consumer's commit API?


Cheers,
Chris

On 2/10/14 10:54 AM, "Neha Narkhede"  wrote:

>As mentioned in previous emails, we are also working on a
>re-implementation
>of the consumer. I would like to use this email thread to discuss the
>details of the public API. I would also like us to be picky about this
>public api now so it is as good as possible and we don't need to break it
>in the future.
>
>The best way to get a feel for the API is actually to take a look at the
>javadoc
><http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html>,
>the hope is to get the api docs good enough so that it is
>self-explanatory.
>You can also take a look at the configs here
><http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html>
>
>Some background info on implementation:
>
>At a high level the primary difference in this consumer is that it removes
>the distinction between the "high-level" and "low-level" consumer. The new
>consumer API is non blocking and instead of returning a blocking iterator,
>the consumer provides a poll() API that returns a list of records. We
>think
>this is better compared to the blocking iterators since it effectively
>decouples the threading strategy used for processing messages from the
>consumer. It is worth noting that the consumer is entirely single threaded
>and runs in the user thread. The advantage is that it can be easily
>rewritten in less multi-threading-friendly languages. The consumer batches
>data and multiplexes I/O over TCP connections to each of the brokers it
>communicates with, for high throughput. The consumer also allows long poll
>to reduce the end-to-end message latency for low throughput data.
>
>The consumer provides a group management facility that supports the
>concept
>of a group with multiple consumer instances (just like the current
>consumer). This is done through a custom heartbeat and group management
>protocol transparent to the user. At the same time, it allows users the
>option to subscribe to a fixed set of partitions and not use group
>management at all. The offset management strategy defaults to Kafka based
>offset management and the API provides a way for the user to use a
>customized offset store to manage the consumer's offsets.
>
>A key difference in this consumer also is the fact that it does not depend
>on zookeeper at all.
>
>More details about the new consumer design are here
><https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design>
>
>Please take a look at the new API
><http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html>
>and give us any thoughts you may have.
>
>Thanks,
>Neha



Re: New Consumer API discussion

2014-02-28 Thread Neha Narkhede
ere are some comments -
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> 1. The using of ellipsis: This may make passing a list of
> items
> > >> from
> > >>>>>> a
> > >>>>>>>>>> collection to the api a bit harder. Suppose that you have a
> list
> > >> of
> > >>>>>>>>>> topics
> > >>>>>>>>>> stored in
> > >>>>>>>>>>
> > >>>>>>>>>> ArrayList topics;
> > >>>>>>>>>>
> > >>>>>>>>>> If you want subscribe to all topics in one call, you will have
> > to
> > >>>> do:
> > >>>>>>>>>>
> > >>>>>>>>>> String[] topicArray = new String[topics.size()];
> > >>>>>>>>>> consumer.subscribe(topics.
> > >>>>>>>>>> toArray(topicArray));
> > >>>>>>>>>>
> > >>>>>>>>>> A similar argument can be made for arguably the more common
> use
> > >> case
> > >>>>>> of
> > >>>>>>>>>> subscribing to a single topic as well. In these cases, user is
> > >>>>>> required
> > >>>>>>>>>> to write more
> > >>>>>>>>>> code to create a single item collection and pass it in. Since
> > >>>>>>>>>> subscription is extremely lightweight
> > >>>>>>>>>> invoking it multiple times also seems like a workable
> solution,
> > >> no?
> > >>>>>>>>>>
> > >>>>>>>>>> 2. It would be good to document that the following apis are
> > >> mutually
> > >>>>>>>>>> exclusive. Also, if the partition level subscription is
> > specified,
> > >>>>>>> there
> > >>>>>>>>>> is
> > >>>>>>>>>> no group management. Finally, unsubscribe() can only be used
> to
> > >>>>>> cancel
> > >>>>>>>>>> subscriptions with the same pattern. For example, you can't
> > >>>>>> unsubscribe
> > >>>>>>>>>> at
> > >>>>>>>>>> the partition level if the subscription is done at the topic
> > >> level.
> > >>>>>>>>>>
> > >>>>>>>>>> *subscribe*(java.lang.String... topics)
> > >>>>>>>>>> *subscribe*(java.lang.String topic, int... partitions)
> > >>>>>>>>>>
> > >>>>>>>>>> Makes sense. Made the suggested improvements to the docs<
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/Consumer.html#subscribe%28java.lang.String...%29
> > >>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> 3.commit(): The following comment in the doc should probably
> say
> > >>>>>>> "commit
> > >>>>>>>>>> offsets for partitions assigned to this consumer".
> > >>>>>>>>>>
> > >>>>>>>>>> If no partitions are specified, commits offsets for the
> > subscribed
> > >>>>>>> list
> > >>>>>>>>>> of
> > >>>>>>>>>> topics and partitions to Kafka.
> > >>>>>>>>>>
> > >>>>>>>>>> Could you give more context on this suggestion? Here is the
> > entire
> > >>>>>> doc
> > >>>>>>> -
> > >>>>>>>>>>
> > >>>>>>>>>> Synchronously commits the specified offsets for the specified
> > list
> > >>>> of
> > >>>>>>>>>> topics and partitions to *Kafka*. If no partitions are
> > specified,
> &

Re: New Consumer API discussion

2014-02-28 Thread S Ahmed
api
> >>>> to
> >>>>>>>>> fetch the last offset from the server for a partition. Something
> >> like
> >>>>>>>>> long lastOffset(TopicPartition tp)
> >>>>>>>>> and for symmetry
> >>>>>>>>> long firstOffset(TopicPartition tp)
> >>>>>>>>>
> >>>>>>>>> Likely this would have to be batched.
> >>>>>>>>>
> >>>>>>>>> A fixed range of data load can be done using the existing APIs as
> >>>>>>>>> follows. This assumes you know the endOffset which can be
> >>>>>> currentOffset
> >>>>>>> + n
> >>>>>>>>> (number of messages in the load)
> >>>>>>>>>
> >>>>>>>>> long startOffset = consumer.position(partition);
> >>>>>>>>> long endOffset = startOffset + n;
> >>>>>>>>> while(consumer.position(partition) <= endOffset) {
> >>>>>>>>>   List messages = consumer.poll(timeout,
> >>>>>>>>> TimeUnit.MILLISECONDS);
> >>>>>>>>>   process(messages, endOffset);  // processes messages
> >>>>>> until
> >>>>>>>>> endOffset
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> Does that make sense?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Feb 25, 2014 at 9:49 AM, Neha Narkhede <
> >>>>>> neha.narkh...@gmail.com
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks for the review, Jun. Here are some comments -
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 1. The using of ellipsis: This may make passing a list of items
> >> from
> >>>>>> a
> >>>>>>>>>> collection to the api a bit harder. Suppose that you have a list
> >> of
> >>>>>>>>>> topics
> >>>>>>>>>> stored in
> >>>>>>>>>>
> >>>>>>>>>> ArrayList topics;
> >>>>>>>>>>
> >>>>>>>>>> If you want subscribe to all topics in one call, you will have
> to
> >>>> do:
> >>>>>>>>>>
> >>>>>>>>>> String[] topicArray = new String[topics.size()];
> >>>>>>>>>> consumer.subscribe(topics.
> >>>>>>>>>> toArray(topicArray));
> >>>>>>>>>>
> >>>>>>>>>> A similar argument can be made for arguably the more common use
> >> case
> >>>>>> of
> >>>>>>>>>> subscribing to a single topic as well. In these cases, user is
> >>>>>> required
> >>>>>>>>>> to write more
> >>>>>>>>>> code to create a single item collection and pass it in. Since
> >>>>>>>>>> subscription is extremely lightweight
> >>>>>>>>>> invoking it multiple times also seems like a workable solution,
> >> no?
> >>>>>>>>>>
> >>>>>>>>>> 2. It would be good to document that the following apis are
> >> mutually
> >>>>>>>>>> exclusive. Also, if the partition level subscription is
> specified,
> >>>>>>> there
> >>>>>>>>>> is
> >>>>>>>>>> no group management. Finally, unsubscribe() can only be used to
> >>>>>> cancel
> >>>>>>>>>> subscriptions with the same pattern. For example, you can't
> >>>>>> unsubscribe
> >>>>>>>>>> at
> >>>>>>>>>> the partition level if the subscription is done at the topic
> >> level.
> >>>>>>>>>>
> >>>>>>>>>> *subscribe*(java.lang.String... topics)
> >>>>>>>>>> *subscribe*(java.lang.String topic, int...

Re: New Consumer API discussion

2014-02-27 Thread Robert Withers
; 
>>>>>>>>>> Thanks for the review, Jun. Here are some comments -
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 1. The using of ellipsis: This may make passing a list of items
>> from
>>>>>> a
>>>>>>>>>> collection to the api a bit harder. Suppose that you have a list
>> of
>>>>>>>>>> topics
>>>>>>>>>> stored in
>>>>>>>>>> 
>>>>>>>>>> ArrayList topics;
>>>>>>>>>> 
>>>>>>>>>> If you want subscribe to all topics in one call, you will have to
>>>> do:
>>>>>>>>>> 
>>>>>>>>>> String[] topicArray = new String[topics.size()];
>>>>>>>>>> consumer.subscribe(topics.
>>>>>>>>>> toArray(topicArray));
>>>>>>>>>> 
>>>>>>>>>> A similar argument can be made for arguably the more common use
>> case
>>>>>> of
>>>>>>>>>> subscribing to a single topic as well. In these cases, user is
>>>>>> required
>>>>>>>>>> to write more
>>>>>>>>>> code to create a single item collection and pass it in. Since
>>>>>>>>>> subscription is extremely lightweight
>>>>>>>>>> invoking it multiple times also seems like a workable solution,
>> no?
>>>>>>>>>> 
>>>>>>>>>> 2. It would be good to document that the following apis are
>> mutually
>>>>>>>>>> exclusive. Also, if the partition level subscription is specified,
>>>>>>> there
>>>>>>>>>> is
>>>>>>>>>> no group management. Finally, unsubscribe() can only be used to
>>>>>> cancel
>>>>>>>>>> subscriptions with the same pattern. For example, you can't
>>>>>> unsubscribe
>>>>>>>>>> at
>>>>>>>>>> the partition level if the subscription is done at the topic
>> level.
>>>>>>>>>> 
>>>>>>>>>> *subscribe*(java.lang.String... topics)
>>>>>>>>>> *subscribe*(java.lang.String topic, int... partitions)
>>>>>>>>>> 
>>>>>>>>>> Makes sense. Made the suggested improvements to the docs<
>>>>>>> 
>>>>>> 
>>>> 
>> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/Consumer.html#subscribe%28java.lang.String...%29
>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 3.commit(): The following comment in the doc should probably say
>>>>>>> "commit
>>>>>>>>>> offsets for partitions assigned to this consumer".
>>>>>>>>>> 
>>>>>>>>>> If no partitions are specified, commits offsets for the subscribed
>>>>>>> list
>>>>>>>>>> of
>>>>>>>>>> topics and partitions to Kafka.
>>>>>>>>>> 
>>>>>>>>>> Could you give more context on this suggestion? Here is the entire
>>>>>> doc
>>>>>>> -
>>>>>>>>>> 
>>>>>>>>>> Synchronously commits the specified offsets for the specified list
>>>> of
>>>>>>>>>> topics and partitions to *Kafka*. If no partitions are specified,
>>>>>>>>>> commits offsets for the subscribed list of topics and partitions.
>>>>>>>>>> 
>>>>>>>>>> The hope is to convey that if no partitions are specified, offsets
>>>>>> will
>>>>>>>>>> be committed for the subscribed list of partitions. One
>> improvement
>>>>>>> could
>>>>>>>>>> be to
>>>>>>>>>> explicitly state that the offsets returned on the last poll will
>> be
>>>>>>>>>> committed. I updated this to -
>>>>>>>>

Re: New Consumer API discussion

2014-02-27 Thread Neha Narkhede
> >>>>>>>> A similar argument can be made for arguably the more common use
> case
> >>>> of
> >>>>>>>> subscribing to a single topic as well. In these cases, user is
> >>>> required
> >>>>>>>> to write more
> >>>>>>>> code to create a single item collection and pass it in. Since
> >>>>>>>> subscription is extremely lightweight
> >>>>>>>> invoking it multiple times also seems like a workable solution,
> no?
> >>>>>>>>
> >>>>>>>> 2. It would be good to document that the following apis are
> mutually
> >>>>>>>> exclusive. Also, if the partition level subscription is specified,
> >>>>> there
> >>>>>>>> is
> >>>>>>>> no group management. Finally, unsubscribe() can only be used to
> >>>> cancel
> >>>>>>>> subscriptions with the same pattern. For example, you can't
> >>>> unsubscribe
> >>>>>>>> at
> >>>>>>>> the partition level if the subscription is done at the topic
> level.
> >>>>>>>>
> >>>>>>>> *subscribe*(java.lang.String... topics)
> >>>>>>>> *subscribe*(java.lang.String topic, int... partitions)
> >>>>>>>>
> >>>>>>>> Makes sense. Made the suggested improvements to the docs<
> >>>>>
> >>>>
> >>
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/Consumer.html#subscribe%28java.lang.String...%29
> >>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 3.commit(): The following comment in the doc should probably say
> >>>>> "commit
> >>>>>>>> offsets for partitions assigned to this consumer".
> >>>>>>>>
> >>>>>>>> If no partitions are specified, commits offsets for the subscribed
> >>>>> list
> >>>>>>>> of
> >>>>>>>> topics and partitions to Kafka.
> >>>>>>>>
> >>>>>>>> Could you give more context on this suggestion? Here is the entire
> >>>> doc
> >>>>> -
> >>>>>>>>
> >>>>>>>> Synchronously commits the specified offsets for the specified list
> >> of
> >>>>>>>> topics and partitions to *Kafka*. If no partitions are specified,
> >>>>>>>> commits offsets for the subscribed list of topics and partitions.
> >>>>>>>>
> >>>>>>>> The hope is to convey that if no partitions are specified, offsets
> >>>> will
> >>>>>>>> be committed for the subscribed list of partitions. One
> improvement
> >>>>> could
> >>>>>>>> be to
> >>>>>>>> explicitly state that the offsets returned on the last poll will
> be
> >>>>>>>> committed. I updated this to -
> >>>>>>>>
> >>>>>>>> Synchronously commits the specified offsets for the specified list
> >> of
> >>>>>>>> topics and partitions to *Kafka*. If no offsets are specified,
> >>>> commits
> >>>>>>>> offsets returned on the last {@link #poll(long, TimeUnit) poll()}
> >> for
> >>>>>>>> the subscribed list of topics and partitions.
> >>>>>>>>
> >>>>>>>> 4. There is inconsistency in specifying partitions. Sometimes we
> use
> >>>>>>>> TopicPartition and some other times we use String and int (see
> >>>>>>>> examples below).
> >>>>>>>>
> >>>>>>>> void onPartitionsAssigned(Consumer consumer,
> >>>>>>>> TopicPartition...partitions)
> >>>>>>>>
> >>>>>>>> public void *subscribe*(java.lang.String topic, int... partitions)
> >>>>>>>>
> >>>>>>>> Yes, this was discussed previously. I think generally the
> consensus
> >>>>>>>> seems to be to use the higher level
> >>>&

Re: New Consumer API discussion

2014-02-27 Thread Robert Withers
ents to the docs<
>>>>> 
>>>> 
>> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/Consumer.html#subscribe%28java.lang.String...%29
>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 3.commit(): The following comment in the doc should probably say
>>>>> "commit
>>>>>>>> offsets for partitions assigned to this consumer".
>>>>>>>> 
>>>>>>>> If no partitions are specified, commits offsets for the subscribed
>>>>> list
>>>>>>>> of
>>>>>>>> topics and partitions to Kafka.
>>>>>>>> 
>>>>>>>> Could you give more context on this suggestion? Here is the entire
>>>> doc
>>>>> -
>>>>>>>> 
>>>>>>>> Synchronously commits the specified offsets for the specified list
>> of
>>>>>>>> topics and partitions to *Kafka*. If no partitions are specified,
>>>>>>>> commits offsets for the subscribed list of topics and partitions.
>>>>>>>> 
>>>>>>>> The hope is to convey that if no partitions are specified, offsets
>>>> will
>>>>>>>> be committed for the subscribed list of partitions. One improvement
>>>>> could
>>>>>>>> be to
>>>>>>>> explicitly state that the offsets returned on the last poll will be
>>>>>>>> committed. I updated this to -
>>>>>>>> 
>>>>>>>> Synchronously commits the specified offsets for the specified list
>> of
>>>>>>>> topics and partitions to *Kafka*. If no offsets are specified,
>>>> commits
>>>>>>>> offsets returned on the last {@link #poll(long, TimeUnit) poll()}
>> for
>>>>>>>> the subscribed list of topics and partitions.
>>>>>>>> 
>>>>>>>> 4. There is inconsistency in specifying partitions. Sometimes we use
>>>>>>>> TopicPartition and some other times we use String and int (see
>>>>>>>> examples below).
>>>>>>>> 
>>>>>>>> void onPartitionsAssigned(Consumer consumer,
>>>>>>>> TopicPartition...partitions)
>>>>>>>> 
>>>>>>>> public void *subscribe*(java.lang.String topic, int... partitions)
>>>>>>>> 
>>>>>>>> Yes, this was discussed previously. I think generally the consensus
>>>>>>>> seems to be to use the higher level
>>>>>>>> classes everywhere. Made those changes.
>>>>>>>> 
>>>>>>>> What's the use case of position()? Isn't that just the nextOffset()
>>>> on
>>>>>>>> the
>>>>>>>> last message returned from poll()?
>>>>>>>> 
>>>>>>>> Yes, except in the case where a rebalance is triggered and poll() is
>>>>> not
>>>>>>>> yet invoked. Here, you would use position() to get the new fetch
>>>>> position
>>>>>>>> for the specific partition. Even if this is not a common use case,
>>>> IMO
>>>>> it
>>>>>>>> is much easier to use position() to get the fetch offset than
>>>> invoking
>>>>>>>> nextOffset() on the last message. This also keeps the APIs
>> symmetric,
>>>>> which
>>>>>>>> is nice.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Feb 24, 2014 at 7:06 PM, Withers, Robert <
>>>>>>>> robert.with...@dish.com> wrote:
>>>>>>>> 
>>>>>>>>> That's wonderful.  Thanks for kafka.
>>>>>>>>> 
>>>>>>>>> Rob
>>>>>>>>> 
>>>>>>>>> On Feb 24, 2014, at 9:58 AM, Guozhang Wang >>>> >>>>>>>> wangg...@gmail.com>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Robert,
>>>>>>>>> 
>>>>>>>>> Yes, you can check out the 

Re: New Consumer API discussion

2014-02-27 Thread Neha Narkhede
>>>>> topics and partitions to *Kafka*. If no partitions are specified,
> >>>>>> commits offsets for the subscribed list of topics and partitions.
> >>>>>>
> >>>>>> The hope is to convey that if no partitions are specified, offsets
> >> will
> >>>>>> be committed for the subscribed list of partitions. One improvement
> >>> could
> >>>>>> be to
> >>>>>> explicitly state that the offsets returned on the last poll will be
> >>>>>> committed. I updated this to -
> >>>>>>
> >>>>>> Synchronously commits the specified offsets for the specified list
> of
> >>>>>> topics and partitions to *Kafka*. If no offsets are specified,
> >> commits
> >>>>>> offsets returned on the last {@link #poll(long, TimeUnit) poll()}
> for
> >>>>>> the subscribed list of topics and partitions.
> >>>>>>
> >>>>>> 4. There is inconsistency in specifying partitions. Sometimes we use
> >>>>>> TopicPartition and some other times we use String and int (see
> >>>>>> examples below).
> >>>>>>
> >>>>>> void onPartitionsAssigned(Consumer consumer,
> >>>>>> TopicPartition...partitions)
> >>>>>>
> >>>>>> public void *subscribe*(java.lang.String topic, int... partitions)
> >>>>>>
> >>>>>> Yes, this was discussed previously. I think generally the consensus
> >>>>>> seems to be to use the higher level
> >>>>>> classes everywhere. Made those changes.
> >>>>>>
> >>>>>> What's the use case of position()? Isn't that just the nextOffset()
> >> on
> >>>>>> the
> >>>>>> last message returned from poll()?
> >>>>>>
> >>>>>> Yes, except in the case where a rebalance is triggered and poll() is
> >>> not
> >>>>>> yet invoked. Here, you would use position() to get the new fetch
> >>> position
> >>>>>> for the specific partition. Even if this is not a common use case,
> >> IMO
> >>> it
> >>>>>> is much easier to use position() to get the fetch offset than
> >> invoking
> >>>>>> nextOffset() on the last message. This also keeps the APIs
> symmetric,
> >>> which
> >>>>>> is nice.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Feb 24, 2014 at 7:06 PM, Withers, Robert <
> >>>>>> robert.with...@dish.com> wrote:
> >>>>>>
> >>>>>>> That's wonderful.  Thanks for kafka.
> >>>>>>>
> >>>>>>> Rob
> >>>>>>>
> >>>>>>> On Feb 24, 2014, at 9:58 AM, Guozhang Wang  >>>  >>>>>>> wangg...@gmail.com>> wrote:
> >>>>>>>
> >>>>>>> Hi Robert,
> >>>>>>>
> >>>>>>> Yes, you can check out the callback functions in the new API
> >>>>>>>
> >>>>>>> onPartitionDesigned
> >>>>>>> onPartitionAssigned
> >>>>>>>
> >>>>>>> and see if they meet your needs.
> >>>>>>>
> >>>>>>> Guozhang
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Feb 24, 2014 at 8:18 AM, Withers, Robert <
> >>>>>>> robert.with...@dish.com<mailto:robert.with...@dish.com>>wrote:
> >>>>>>>
> >>>>>>> Jun,
> >>>>>>>
> >>>>>>> Are you saying it is possible to get events from the high-level
> >>> consumer
> >>>>>>> regarding various state machine changes?  For instance, can we get
> a
> >>>>>>> notification when a rebalance starts and ends, when a partition is
> >>>>>>> assigned/unassigned, when an offset is committed on a partition,
> >> when
> >>> a
> >>>>>>> leader changes and so on?  I call this OOB traffic, since they are
> >> not
> >>>>>>> t

Re: New Consumer API discussion

2014-02-26 Thread Robert Withers
umer.poll(timeout,
>>>>> TimeUnit.MILLISECONDS);
>>>>> process(messages, endOffset);  // processes messages
>> until
>>>>> endOffset
>>>>> }
>>>>> 
>>>>> Does that make sense?
>>>>> 
>>>>> 
>>>>> On Tue, Feb 25, 2014 at 9:49 AM, Neha Narkhede <
>> neha.narkh...@gmail.com
>>>> wrote:
>>>>> 
>>>>>> Thanks for the review, Jun. Here are some comments -
>>>>>> 
>>>>>> 
>>>>>> 1. The using of ellipsis: This may make passing a list of items from
>> a
>>>>>> collection to the api a bit harder. Suppose that you have a list of
>>>>>> topics
>>>>>> stored in
>>>>>> 
>>>>>> ArrayList topics;
>>>>>> 
>>>>>> If you want subscribe to all topics in one call, you will have to do:
>>>>>> 
>>>>>> String[] topicArray = new String[topics.size()];
>>>>>> consumer.subscribe(topics.
>>>>>> toArray(topicArray));
>>>>>> 
>>>>>> A similar argument can be made for arguably the more common use case
>> of
>>>>>> subscribing to a single topic as well. In these cases, user is
>> required
>>>>>> to write more
>>>>>> code to create a single item collection and pass it in. Since
>>>>>> subscription is extremely lightweight
>>>>>> invoking it multiple times also seems like a workable solution, no?
>>>>>> 
>>>>>> 2. It would be good to document that the following apis are mutually
>>>>>> exclusive. Also, if the partition level subscription is specified,
>>> there
>>>>>> is
>>>>>> no group management. Finally, unsubscribe() can only be used to
>> cancel
>>>>>> subscriptions with the same pattern. For example, you can't
>> unsubscribe
>>>>>> at
>>>>>> the partition level if the subscription is done at the topic level.
>>>>>> 
>>>>>> *subscribe*(java.lang.String... topics)
>>>>>> *subscribe*(java.lang.String topic, int... partitions)
>>>>>> 
>>>>>> Makes sense. Made the suggested improvements to the docs<
>>> 
>> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/Consumer.html#subscribe%28java.lang.String...%29
>>>> 
>>>>>> 
>>>>>> 
>>>>>> 3.commit(): The following comment in the doc should probably say
>>> "commit
>>>>>> offsets for partitions assigned to this consumer".
>>>>>> 
>>>>>> If no partitions are specified, commits offsets for the subscribed
>>> list
>>>>>> of
>>>>>> topics and partitions to Kafka.
>>>>>> 
>>>>>> Could you give more context on this suggestion? Here is the entire
>> doc
>>> -
>>>>>> 
>>>>>> Synchronously commits the specified offsets for the specified list of
>>>>>> topics and partitions to *Kafka*. If no partitions are specified,
>>>>>> commits offsets for the subscribed list of topics and partitions.
>>>>>> 
>>>>>> The hope is to convey that if no partitions are specified, offsets
>> will
>>>>>> be committed for the subscribed list of partitions. One improvement
>>> could
>>>>>> be to
>>>>>> explicitly state that the offsets returned on the last poll will be
>>>>>> committed. I updated this to -
>>>>>> 
>>>>>> Synchronously commits the specified offsets for the specified list of
>>>>>> topics and partitions to *Kafka*. If no offsets are specified,
>> commits
>>>>>> offsets returned on the last {@link #poll(long, TimeUnit) poll()} for
>>>>>> the subscribed list of topics and partitions.
>>>>>> 
>>>>>> 4. There is inconsistency in specifying partitions. Sometimes we use
>>>>>> TopicPartition and some other times we use String and int (see
>>>>>> examples below).
>>>>>> 
>>>>>> void onPartitionsAssigned(Consumer consumer,
>>>>>> TopicPartition...partitions)
>>>

Re: New Consumer API discussion

2014-02-25 Thread Neha Narkhede
>>>
> > >>>
> > >>> 1. The using of ellipsis: This may make passing a list of items from
> a
> > >>> collection to the api a bit harder. Suppose that you have a list of
> > >>> topics
> > >>> stored in
> > >>>
> > >>> ArrayList topics;
> > >>>
> > >>> If you want subscribe to all topics in one call, you will have to do:
> > >>>
> > >>> String[] topicArray = new String[topics.size()];
> > >>> consumer.subscribe(topics.
> > >>> toArray(topicArray));
> > >>>
> > >>> A similar argument can be made for arguably the more common use case
> of
> > >>> subscribing to a single topic as well. In these cases, user is
> required
> > >>> to write more
> > >>> code to create a single item collection and pass it in. Since
> > >>> subscription is extremely lightweight
> > >>> invoking it multiple times also seems like a workable solution, no?
> > >>>
> > >>> 2. It would be good to document that the following apis are mutually
> > >>> exclusive. Also, if the partition level subscription is specified,
> > there
> > >>> is
> > >>> no group management. Finally, unsubscribe() can only be used to
> cancel
> > >>> subscriptions with the same pattern. For example, you can't
> unsubscribe
> > >>> at
> > >>> the partition level if the subscription is done at the topic level.
> > >>>
> > >>> *subscribe*(java.lang.String... topics)
> > >>> *subscribe*(java.lang.String topic, int... partitions)
> > >>>
> > >>> Makes sense. Made the suggested improvements to the docs<
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/Consumer.html#subscribe%28java.lang.String...%29
> > >
> > >>>
> > >>>
> > >>> 3.commit(): The following comment in the doc should probably say
> > "commit
> > >>> offsets for partitions assigned to this consumer".
> > >>>
> > >>>  If no partitions are specified, commits offsets for the subscribed
> > list
> > >>> of
> > >>> topics and partitions to Kafka.
> > >>>
> > >>> Could you give more context on this suggestion? Here is the entire
> doc
> > -
> > >>>
> > >>> Synchronously commits the specified offsets for the specified list of
> > >>> topics and partitions to *Kafka*. If no partitions are specified,
> > >>> commits offsets for the subscribed list of topics and partitions.
> > >>>
> > >>> The hope is to convey that if no partitions are specified, offsets
> will
> > >>> be committed for the subscribed list of partitions. One improvement
> > could
> > >>> be to
> > >>> explicitly state that the offsets returned on the last poll will be
> > >>> committed. I updated this to -
> > >>>
> > >>> Synchronously commits the specified offsets for the specified list of
> > >>> topics and partitions to *Kafka*. If no offsets are specified,
> commits
> > >>> offsets returned on the last {@link #poll(long, TimeUnit) poll()} for
> > >>> the subscribed list of topics and partitions.
> > >>>
> > >>> 4. There is inconsistency in specifying partitions. Sometimes we use
> > >>> TopicPartition and some other times we use String and int (see
> > >>> examples below).
> > >>>
> > >>> void onPartitionsAssigned(Consumer consumer,
> > >>> TopicPartition...partitions)
> > >>>
> > >>> public void *subscribe*(java.lang.String topic, int... partitions)
> > >>>
> > >>> Yes, this was discussed previously. I think generally the consensus
> > >>> seems to be to use the higher level
> > >>> classes everywhere. Made those changes.
> > >>>
> > >>> What's the use case of position()? Isn't that just the nextOffset()
> on
> > >>> the
> > >>> last message returned from poll()?
> > >>>
> > >>> Yes, except in the case where a rebalance is triggered and poll() is
> > not
> > >>> yet invoked. Here, you would use position() to get the new fetch
> > 

Re: New Consumer API discussion

2014-02-25 Thread Jay Kreps
, you can't unsubscribe
> >>> at
> >>> the partition level if the subscription is done at the topic level.
> >>>
> >>> *subscribe*(java.lang.String... topics)
> >>> *subscribe*(java.lang.String topic, int... partitions)
> >>>
> >>> Makes sense. Made the suggested improvements to the docs<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/Consumer.html#subscribe%28java.lang.String...%29
> >
> >>>
> >>>
> >>> 3.commit(): The following comment in the doc should probably say
> "commit
> >>> offsets for partitions assigned to this consumer".
> >>>
> >>>  If no partitions are specified, commits offsets for the subscribed
> list
> >>> of
> >>> topics and partitions to Kafka.
> >>>
> >>> Could you give more context on this suggestion? Here is the entire doc
> -
> >>>
> >>> Synchronously commits the specified offsets for the specified list of
> >>> topics and partitions to *Kafka*. If no partitions are specified,
> >>> commits offsets for the subscribed list of topics and partitions.
> >>>
> >>> The hope is to convey that if no partitions are specified, offsets will
> >>> be committed for the subscribed list of partitions. One improvement
> could
> >>> be to
> >>> explicitly state that the offsets returned on the last poll will be
> >>> committed. I updated this to -
> >>>
> >>> Synchronously commits the specified offsets for the specified list of
> >>> topics and partitions to *Kafka*. If no offsets are specified, commits
> >>> offsets returned on the last {@link #poll(long, TimeUnit) poll()} for
> >>> the subscribed list of topics and partitions.
> >>>
> >>> 4. There is inconsistency in specifying partitions. Sometimes we use
> >>> TopicPartition and some other times we use String and int (see
> >>> examples below).
> >>>
> >>> void onPartitionsAssigned(Consumer consumer,
> >>> TopicPartition...partitions)
> >>>
> >>> public void *subscribe*(java.lang.String topic, int... partitions)
> >>>
> >>> Yes, this was discussed previously. I think generally the consensus
> >>> seems to be to use the higher level
> >>> classes everywhere. Made those changes.
> >>>
> >>> What's the use case of position()? Isn't that just the nextOffset() on
> >>> the
> >>> last message returned from poll()?
> >>>
> >>> Yes, except in the case where a rebalance is triggered and poll() is
> not
> >>> yet invoked. Here, you would use position() to get the new fetch
> position
> >>> for the specific partition. Even if this is not a common use case, IMO
> it
> >>> is much easier to use position() to get the fetch offset than invoking
> >>> nextOffset() on the last message. This also keeps the APIs symmetric,
> which
> >>> is nice.
> >>>
> >>>
> >>>
> >>>
> >>> On Mon, Feb 24, 2014 at 7:06 PM, Withers, Robert <
> >>> robert.with...@dish.com> wrote:
> >>>
> >>>> That's wonderful.  Thanks for kafka.
> >>>>
> >>>> Rob
> >>>>
> >>>> On Feb 24, 2014, at 9:58 AM, Guozhang Wang   >>>> wangg...@gmail.com>> wrote:
> >>>>
> >>>> Hi Robert,
> >>>>
> >>>> Yes, you can check out the callback functions in the new API
> >>>>
> >>>> onPartitionDesigned
> >>>> onPartitionAssigned
> >>>>
> >>>> and see if they meet your needs.
> >>>>
> >>>> Guozhang
> >>>>
> >>>>
> >>>> On Mon, Feb 24, 2014 at 8:18 AM, Withers, Robert <
> >>>> robert.with...@dish.com<mailto:robert.with...@dish.com>>wrote:
> >>>>
> >>>> Jun,
> >>>>
> >>>> Are you saying it is possible to get events from the high-level
> consumer
> >>>> regarding various state machine changes?  For instance, can we get a
> >>>> notification when a rebalance starts and ends, when a partition is
> >>>> assigned/unassigned, when an offset is committed on a partition, when
> a
> >>>> leader changes and so on?  I cal

Re: New Consumer API discussion

2014-02-25 Thread Jay Kreps
asses everywhere. Made those changes.
> >
> > What's the use case of position()? Isn't that just the nextOffset() on
> the
> > last message returned from poll()?
> >
> > Yes, except in the case where a rebalance is triggered and poll() is not
> > yet invoked. Here, you would use position() to get the new fetch position
> > for the specific partition. Even if this is not a common use case, IMO it
> > is much easier to use position() to get the fetch offset than invoking
> > nextOffset() on the last message. This also keeps the APIs symmetric,
> which
> > is nice.
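A sketch of that position()-in-a-rebalance usage, using the callback
signature quoted earlier in the thread (the body is illustrative only):

public void onPartitionsAssigned(Consumer consumer, TopicPartition... partitions) {
    for (TopicPartition tp : partitions) {
        long fetchPosition = consumer.position(tp);   // new fetch position after the rebalance
        System.out.println("assigned " + tp + ", fetching from " + fetchPosition);
    }
}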
> >
> >
> >
> >
> > On Mon, Feb 24, 2014 at 7:06 PM, Withers, Robert <
> robert.with...@dish.com>wrote:
> >
> >> That's wonderful.  Thanks for kafka.
> >>
> >> Rob
> >>
> >> On Feb 24, 2014, at 9:58 AM, Guozhang Wang  >> wangg...@gmail.com>> wrote:
> >>
> >> Hi Robert,
> >>
> >> Yes, you can check out the callback functions in the new API
> >>
> >> onPartitionDesigned
> >> onPartitionAssigned
> >>
> >> and see if they meet your needs.
> >>
> >> Guozhang
> >>
> >>
> >> On Mon, Feb 24, 2014 at 8:18 AM, Withers, Robert <
> robert.with...@dish.com
> >> <mailto:robert.with...@dish.com>>wrote:
> >>
> >> Jun,
> >>
> >> Are you saying it is possible to get events from the high-level consumer
> >> regarding various state machine changes?  For instance, can we get a
> >> notification when a rebalance starts and ends, when a partition is
> >> assigned/unassigned, when an offset is committed on a partition, when a
> >> leader changes and so on?  I call this OOB traffic, since they are not
> the
> >> core messages streaming, but side-band events, yet they are still
> >> potentially useful to consumers.
> >>
> >> Thank you,
> >> Robert
> >>
> >>
> >> Robert Withers
> >> Staff Analyst/Developer
> >> o: (720) 514-8963
> >> c:  (571) 262-1873
> >>
> >>
> >>
> >> -Original Message-
> >> From: Jun Rao [mailto:jun...@gmail.com]
> >> Sent: Sunday, February 23, 2014 4:19 PM
> >> To: users@kafka.apache.org<mailto:users@kafka.apache.org>
> >> Subject: Re: New Consumer API discussion
> >>
> >> Robert,
> >>
> >> For the push orient api, you can potentially implement your own
> >> MessageHandler with those methods. In the main loop of our new consumer
> >> api, you can just call those methods based on the events you get.
> >>
> >> Also, we already have an api to get the first and the last offset of a
> >> partition (getOffsetBefore).
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >>
> >> On Sat, Feb 22, 2014 at 11:29 AM, Withers, Robert
> >> mailto:robert.with...@dish.com>>wrote:
> >>
> >> This is a good idea, too.  I would modify it to include stream
> >> marking, then you can have:
> >>
> >> long end = consumer.lastOffset(tp);
> >> consumer.setMark(end);
> >> while(consumer.beforeMark()) {
> >>   process(consumer.pollToMark());
> >> }
> >>
> >> or
> >>
> >> long end = consumer.lastOffset(tp);
> >> consumer.setMark(end);
> >> for(Object msg : consumer.iteratorToMark()) {
> >>   process(msg);
> >> }
> >>
> >> I actually have 4 suggestions, then:
> >>
> >> *   pull: stream marking
> >> *   pull: finite streams, bound by time range (up-to-now, yesterday) or
> >> offset
> >> *   pull: async api
> >> *   push: KafkaMessageSource, for a push model, with msg and OOB events.
> >> Build one in either individual or chunk mode and have a listener for
> >> each msg or a listener for a chunk of msgs.  Make it composable and
> >> policy driven (chunked, range, commitOffsets policy, retry policy,
> >> transactional)
> >>
> >> Thank you,
> >> Robert
> >>
> >> On Feb 22, 2014, at 11:21 AM, Jay Kreps  >> jay.kr...@gmail.com> >> jay.kr...@gmail.com<mailto:jay.kr...@gmail.com>>> wrote:
> >>
> >> I think what Robert is saying is that we need to think through the
> >> offset API to enable "batch processing" of topic data. Think of a
> >> process that periodically kicks off to compute a data summary

Re: New Consumer API discussion

2014-02-25 Thread Jun Rao
tion is
> > assigned/unassigned, when an offset is committed on a partition, when a
> > leader changes and so on?  I call this OOB traffic, since they are not
> the
> > core messages streaming, but side-band events, yet they are still
> > potentially useful to consumers.
> >
> > Thank you,
> > Robert
> >
> >
> > Robert Withers
> > Staff Analyst/Developer
> > o: (720) 514-8963
> > c:  (571) 262-1873
> >
> >
> >
> > -Original Message-
> > From: Jun Rao [mailto:jun...@gmail.com]
> > Sent: Sunday, February 23, 2014 4:19 PM
> > To: users@kafka.apache.org<mailto:users@kafka.apache.org>
> > Subject: Re: New Consumer API discussion
> >
> > Robert,
> >
> > For the push orient api, you can potentially implement your own
> > MessageHandler with those methods. In the main loop of our new consumer
> > api, you can just call those methods based on the events you get.
> >
> > Also, we already have an api to get the first and the last offset of a
> > partition (getOffsetBefore).
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Sat, Feb 22, 2014 at 11:29 AM, Withers, Robert
> > mailto:robert.with...@dish.com>>wrote:
> >
> > This is a good idea, too.  I would modify it to include stream
> > marking, then you can have:
> >
> > long end = consumer.lastOffset(tp);
> > consumer.setMark(end);
> > while(consumer.beforeMark()) {
> >   process(consumer.pollToMark());
> > }
> >
> > or
> >
> > long end = consumer.lastOffset(tp);
> > consumer.setMark(end);
> > for(Object msg : consumer.iteratorToMark()) {
> >   process(msg);
> > }
> >
> > I actually have 4 suggestions, then:
> >
> > *   pull: stream marking
> > *   pull: finite streams, bound by time range (up-to-now, yesterday) or
> > offset
> > *   pull: async api
> > *   push: KafkaMessageSource, for a push model, with msg and OOB events.
> > Build one in either individual or chunk mode and have a listener for
> > each msg or a listener for a chunk of msgs.  Make it composable and
> > policy driven (chunked, range, commitOffsets policy, retry policy,
> > transactional)
> >
> > Thank you,
> > Robert
> >
> > On Feb 22, 2014, at 11:21 AM, Jay Kreps  > jay.kr...@gmail.com> > jay.kr...@gmail.com<mailto:jay.kr...@gmail.com>>> wrote:
> >
> > I think what Robert is saying is that we need to think through the
> > offset API to enable "batch processing" of topic data. Think of a
> > process that periodically kicks off to compute a data summary or do a
> > data load or something like that. I think what we need to support this
> > is an api to fetch the last offset from the server for a partition.
> > Something like
> >  long lastOffset(TopicPartition tp)
> > and for symmetry
> >  long firstOffset(TopicPartition tp)
> >
> > Likely this would have to be batched. Essentially we should add this
> > use case to our set of code examples to write and think through.
> >
> > The usage would be something like
> >
> > long end = consumer.lastOffset(tp);
> > while(consumer.position < end)
> >   process(consumer.poll());
> >
> > -Jay
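For reference, a sketch of this bounded-read pattern against the consumer
API that eventually shipped, where seekToEnd() plus position() plays the
role of the proposed lastOffset() (process() is hypothetical, and "start"
is wherever your load should begin):

consumer.seekToEnd(tp);                  // jump to the current log end
long end = consumer.position(tp);        // effectively lastOffset(tp)
consumer.seek(tp, start);                // rewind to where the batch load begins
while (consumer.position(tp) < end) {
    process(consumer.poll(timeout));     // poll(long timeoutMs) in the released client
}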
> >
> >
> > On Sat, Feb 22, 2014 at 1:52 AM, Withers, Robert
> > mailto:robert.with...@dish.com>
> > <mailto:robert.with...@dish.com>>wrote:
> >
> > Jun,
> >
> > I was originally thinking a non-blocking read from a distributed
> > stream should distinguish between "no local messages, but a fetch is
> > occurring"
> > versus "you have drained the stream".  The reason this may be valuable
> > to me is so I can write consumers that read all known traffic then
> > terminate.
> > You caused me to reconsider and I think I am conflating 2 things.  One
> > is a sync/async api while the other is whether to have an infinite or
> > finite stream.  Is it possible to build a finite KafkaStream on a
> > range of messages?
> >
> > Perhaps a Simple Consumer would do just fine and then I could start
> > off getting the writeOffset from zookeeper and tell it to read a
> > specified range per partition.  I've done this and forked a simple
> > consumer runnable for each partition, for one of our analyzers.  The
> > great thing about the high-level consumer is that rebalance, so I can
> > fork however many stream readers I want and you just figure it out for
> > me.  In that way you offer us the control over the resour

Re: New Consumer API discussion

2014-02-25 Thread Neha Narkhede
er".
>>>
>>>  If no partitions are specified, commits offsets for the subscribed list
>>> of
>>> topics and partitions to Kafka.
>>>
>>> Could you give more context on this suggestion? Here is the entire doc -
>>>
>>> Synchronously commits the specified offsets for the specified list of
>>> topics and partitions to *Kafka*. If no partitions are specified,
>>> commits offsets for the subscribed list of topics and partitions.
>>>
>>> The hope is to convey that if no partitions are specified, offsets will

Re: New Consumer API discussion

2014-02-25 Thread Neha Narkhede
Thanks for the review, Jun. Here are some comments -

1. The use of ellipsis (varargs): this may make passing a list of items from a
collection to the API a bit harder. Suppose that you have a list of topics
stored in

ArrayList<String> topics;

If you want subscribe to all topics in one call, you will have to do:

String[] topicArray = new String[topics.size()];
consumer.subscribe(topics.toArray(topicArray));

A similar argument can be made for arguably the more common use case of
subscribing to a single topic as well. In these cases, the user is required to
write more code to create a single-item collection and pass it in. Since
subscription is extremely lightweight, invoking it multiple times also seems
like a workable solution, no?
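
For what it's worth, a minimal sketch of the two call patterns being compared,
written against the draft subscribe(String...) signature above; the consumer
instance, the topic names, and the java.util imports are assumed, and none of
this API was final at the time:

List<String> topics = Arrays.asList("orders", "payments");

// Option 1: copy the collection into an array to satisfy the varargs signature.
consumer.subscribe(topics.toArray(new String[0]));

// Option 2: since subscription is lightweight, one call per topic also works.
for (String topic : topics) {
    consumer.subscribe(topic);
}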

2. It would be good to document that the following apis are mutually
exclusive. Also, if the partition level subscription is specified, there is
no group management. Finally, unsubscribe() can only be used to cancel
subscriptions with the same pattern. For example, you can't unsubscribe at
the partition level if the subscription is done at the topic level.

*subscribe*(java.lang.String... topics)
*subscribe*(java.lang.String topic, int... partitions)

Makes sense. Made the suggested improvements to the docs:
http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/Consumer.html#subscribe%28java.lang.String...%29

3. commit(): The following comment in the doc should probably say "commit
offsets for partitions assigned to this consumer".

 If no partitions are specified, commits offsets for the subscribed list of
topics and partitions to Kafka.

Could you give more context on this suggestion? Here is the entire doc -

Synchronously commits the specified offsets for the specified list of
topics and partitions to *Kafka*. If no partitions are specified, commits
offsets for the subscribed list of topics and partitions.

The hope is to convey that if no partitions are specified, offsets will be
committed for the subscribed list of partitions. One improvement could be to
explicitly state that the offsets returned on the last poll will be
committed. I updated this to -

Synchronously commits the specified offsets for the specified list of
topics and partitions to *Kafka*. If no offsets are specified, commits
offsets returned on the last {@link #poll(long, TimeUnit) poll()} for the
subscribed list of topics and partitions.
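
To make the two forms concrete, here is a rough sketch against the draft API
above; the consumer instance, the topic name, the TimeUnit import, and the
(topic, partition, offset) constructor on TopicPartitionOffset are all
assumptions, since the API was still being designed:

// No-argument form: commit the offsets returned by the last poll(), for the
// full subscribed set of topics and partitions.
consumer.poll(100, TimeUnit.MILLISECONDS);
consumer.commit();

// Explicit form: commit a specific offset for a single partition only.
consumer.commit(new TopicPartitionOffset("orders", 0, 4312L));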

4. There is inconsistency in specifying partitions. Sometimes we use
TopicPartition and some other times we use String and int (see
examples below).

void onPartitionsAssigned(Consumer consumer, TopicPartition...partitions)

public void *subscribe*(java.lang.String topic, int... partitions)

Yes, this was discussed previously. I think generally the consensus seems
to be to use the higher level
classes everywhere. Made those changes.

What's the use case of position()? Isn't that just the nextOffset() on the
last message returned from poll()?

Yes, except in the case where a rebalance is triggered and poll() is not
yet invoked. Here, you would use position() to get the new fetch position
for the specific partition. Even if this is not a common use case, IMO it
is much easier to use position() to get the fetch offset than invoking
nextOffset() on the last message. This also keeps the APIs symmetric, which
is nice.
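
As a small illustration of that case, again hedged against the draft API (the
partition and the rebalance callback wiring are assumed):

// Right after a rebalance, before the next poll(), ask where fetching will
// resume for a partition that has just been assigned to this consumer.
TopicPartition tp = new TopicPartition("orders", 0);
long fetchPosition = consumer.position(tp);
System.out.println("partition " + tp + " will next be fetched from offset " + fetchPosition);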




On Mon, Feb 24, 2014 at 7:06 PM, Withers, Robert wrote:

> That's wonderful.  Thanks for kafka.
>
> Rob
>
> On Feb 24, 2014, at 9:58 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>
> Hi Robert,
>
> Yes, you can check out the callback functions in the new API
>
> onPartitionDesigned
> onPartitionAssigned
>
> and see if they meet your needs.
>
> Guozhang
>
>
> On Mon, Feb 24, 2014 at 8:18 AM, Withers, Robert <robert.with...@dish.com> wrote:
>
> Jun,
>
> Are you saying it is possible to get events from the high-level consumer
> regarding various state machine changes?  For instance, can we get a
> notification when a rebalance starts and ends, when a partition is
> assigned/unassigned, when an offset is committed on a partition, when a
> leader changes and so on?  I call this OOB traffic, since they are not the
> core messages streaming, but side-band events, yet they are still
> potentially useful to consumers.
>
> Thank you,
> Robert
>
>
> Robert Withers
> Staff Analyst/Developer
> o: (720) 514-8963
> c:  (571) 262-1873
>
>
>
> -Original Message-----
> From: Jun Rao [mailto:jun...@gmail.com]
> Sent: Sunday, February 23, 2014 4:19 PM
> To: users@kafka.apache.org<mailto:users@kafka.apache.org>
> Subject: Re: New Consumer API discussion
>
> Robert,
>
> For the push orient api, you can potentially implement your own
> MessageHandler with those methods. In the main loop of our new consumer
> api, you can just call tho

Re: New Consumer API discussion

2014-02-24 Thread Withers, Robert
That’s wonderful.  Thanks for kafka.

Rob

On Feb 24, 2014, at 9:58 AM, Guozhang Wang <wangg...@gmail.com> wrote:

Hi Robert,

Yes, you can check out the callback functions in the new API

onPartitionDesigned
onPartitionAssigned

and see if they meet your needs.

Guozhang


On Mon, Feb 24, 2014 at 8:18 AM, Withers, Robert <robert.with...@dish.com> wrote:

Jun,

Are you saying it is possible to get events from the high-level consumer
regarding various state machine changes?  For instance, can we get a
notification when a rebalance starts and ends, when a partition is
assigned/unassigned, when an offset is committed on a partition, when a
leader changes and so on?  I call this OOB traffic, since they are not the
core messages streaming, but side-band events, yet they are still
potentially useful to consumers.

Thank you,
Robert


Robert Withers
Staff Analyst/Developer
o: (720) 514-8963
c:  (571) 262-1873



-Original Message-
From: Jun Rao [mailto:jun...@gmail.com]
Sent: Sunday, February 23, 2014 4:19 PM
To: users@kafka.apache.org<mailto:users@kafka.apache.org>
Subject: Re: New Consumer API discussion

Robert,

For the push orient api, you can potentially implement your own
MessageHandler with those methods. In the main loop of our new consumer
api, you can just call those methods based on the events you get.

Also, we already have an api to get the first and the last offset of a
partition (getOffsetBefore).

Thanks,

Jun


On Sat, Feb 22, 2014 at 11:29 AM, Withers, Robert <robert.with...@dish.com> wrote:

This is a good idea, too.  I would modify it to include stream
marking, then you can have:

long end = consumer.lastOffset(tp);
consumer.setMark(end);
while(consumer.beforeMark()) {
  process(consumer.pollToMark());
}

or

long end = consumer.lastOffset(tp);
consumer.setMark(end);
for(Object msg : consumer.iteratorToMark()) {
  process(msg);
}

I actually have 4 suggestions, then:

*   pull: stream marking
*   pull: finite streams, bound by time range (up-to-now, yesterday) or
offset
*   pull: async api
*   push: KafkaMessageSource, for a push model, with msg and OOB events.
Build one in either individual or chunk mode and have a listener for
each msg or a listener for a chunk of msgs.  Make it composable and
policy driven (chunked, range, commitOffsets policy, retry policy,
transactional)

Thank you,
Robert

On Feb 22, 2014, at 11:21 AM, Jay Kreps <jay.kr...@gmail.com> wrote:

I think what Robert is saying is that we need to think through the
offset API to enable "batch processing" of topic data. Think of a
process that periodically kicks off to compute a data summary or do a
data load or something like that. I think what we need to support this
is an api to fetch the last offset from the server for a partition.
Something like
 long lastOffset(TopicPartition tp)
and for symmetry
 long firstOffset(TopicPartition tp)

Likely this would have to be batched. Essentially we should add this
use case to our set of code examples to write and think through.

The usage would be something like

long end = consumer.lastOffset(tp);
while(consumer.position < end)
  process(consumer.poll());

-Jay
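
Filling in the assumed pieces, that batch-processing pattern might look roughly
like the sketch below; lastOffset() was only a proposal at this point, and the
subscribe/poll/commit calls use the draft signatures discussed in this thread
rather than any released API:

// Drain one partition up to the offset that was the log end when the job
// started, then stop; records appended later are left for the next run.
TopicPartition tp = new TopicPartition("orders", 0);
consumer.subscribe("orders", 0);
long end = consumer.lastOffset(tp);              // proposed, hypothetical API
while (consumer.position(tp) < end) {
    process(consumer.poll(100, TimeUnit.MILLISECONDS));
}
consumer.commit();                               // checkpoint progress for the next run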


On Sat, Feb 22, 2014 at 1:52 AM, Withers, Robert <robert.with...@dish.com> wrote:

Jun,

I was originally thinking a non-blocking read from a distributed
stream should distinguish between "no local messages, but a fetch is
occurring"
versus "you have drained the stream".  The reason this may be valuable
to me is so I can write consumers that read all known traffic then
terminate.
You caused me to reconsider and I think I am conflating 2 things.  One
is a sync/async api while the other is whether to have an infinite or
finite stream.  Is it possible to build a finite KafkaStream on a
range of messages?

Perhaps a Simple Consumer would do just fine and then I could start
off getting the writeOffset from zookeeper and tell it to read a
specified range per partition.  I've done this and forked a simple
consumer runnable for each partition, for one of our analyzers.  The
great thing about the high-level consumer is that rebalance, so I can
fork however many stream readers I want and you just figure it out for
me.  In that way you offer us the control over the resource
consumption within a pull model.  This is best to regulate message
pressure, they say.

Combining that high-level rebalance ability with a ranged partition
drain could be really nice...build the stream with an ending position
and it is a finite stream, but retain the high-level rebalance.  With
a finite stream, you would know the difference of the 2 async
scenarios: fetch-in-progress versus end-of-stream.  With an infinite
stream, you never get end-of-stream.

Aside from a high-level consumer over a finite range within each
partition, the other feature I can think of is mor

Re: New Consumer API discussion

2014-02-24 Thread Guozhang Wang
Hi Robert,

Yes, you can check out the callback functions in the new API

onPartitionDesigned
onPartitionAssigned

and see if they meet your needs.

Guozhang


On Mon, Feb 24, 2014 at 8:18 AM, Withers, Robert wrote:

> Jun,
>
> Are you saying it is possible to get events from the high-level consumer
> regarding various state machine changes?  For instance, can we get a
> notification when a rebalance starts and ends, when a partition is
> assigned/unassigned, when an offset is committed on a partition, when a
> leader changes and so on?  I call this OOB traffic, since they are not the
> core messages streaming, but side-band events, yet they are still
> potentially useful to consumers.
>
> Thank you,
> Robert
>
>
> Robert Withers
> Staff Analyst/Developer
> o: (720) 514-8963
> c:  (571) 262-1873
>
>
>
> -Original Message-
> From: Jun Rao [mailto:jun...@gmail.com]
> Sent: Sunday, February 23, 2014 4:19 PM
> To: users@kafka.apache.org
> Subject: Re: New Consumer API discussion
>
> Robert,
>
> For the push orient api, you can potentially implement your own
> MessageHandler with those methods. In the main loop of our new consumer
> api, you can just call those methods based on the events you get.
>
> Also, we already have an api to get the first and the last offset of a
> partition (getOffsetBefore).
>
> Thanks,
>
> Jun
>
>
> On Sat, Feb 22, 2014 at 11:29 AM, Withers, Robert
> wrote:
>
> > This is a good idea, too.  I would modify it to include stream
> > marking, then you can have:
> >
> > long end = consumer.lastOffset(tp);
> > consumer.setMark(end);
> > while(consumer.beforeMark()) {
> >process(consumer.pollToMark());
> > }
> >
> > or
> >
> > long end = consumer.lastOffset(tp);
> > consumer.setMark(end);
> > for(Object msg : consumer.iteratorToMark()) {
> >process(msg);
> > }
> >
> > I actually have 4 suggestions, then:
> >
> >  *   pull: stream marking
> >  *   pull: finite streams, bound by time range (up-to-now, yesterday) or
> > offset
> >  *   pull: async api
> >  *   push: KafkaMessageSource, for a push model, with msg and OOB events.
> >  Build one in either individual or chunk mode and have a listener for
> > each msg or a listener for a chunk of msgs.  Make it composable and
> > policy driven (chunked, range, commitOffsets policy, retry policy,
> > transactional)
> >
> > Thank you,
> > Robert
> >
> > On Feb 22, 2014, at 11:21 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
> >
> > I think what Robert is saying is that we need to think through the
> > offset API to enable "batch processing" of topic data. Think of a
> > process that periodically kicks off to compute a data summary or do a
> > data load or something like that. I think what we need to support this
> > is an api to fetch the last offset from the server for a partition.
> Something like
> >   long lastOffset(TopicPartition tp)
> > and for symmetry
> >   long firstOffset(TopicPartition tp)
> >
> > Likely this would have to be batched. Essentially we should add this
> > use case to our set of code examples to write and think through.
> >
> > The usage would be something like
> >
> > long end = consumer.lastOffset(tp);
> > while(consumer.position < end)
> >process(consumer.poll());
> >
> > -Jay
> >
> >
> > On Sat, Feb 22, 2014 at 1:52 AM, Withers, Robert <robert.with...@dish.com> wrote:
> >
> > Jun,
> >
> > I was originally thinking a non-blocking read from a distributed
> > stream should distinguish between "no local messages, but a fetch is
> occurring"
> > versus "you have drained the stream".  The reason this may be valuable
> > to me is so I can write consumers that read all known traffic then
> terminate.
> > You caused me to reconsider and I think I am conflating 2 things.  One
> > is a sync/async api while the other is whether to have an infinite or
> > finite stream.  Is it possible to build a finite KafkaStream on a
> > range of messages?
> >
> > Perhaps a Simple Consumer would do just fine and then I could start
> > off getting the writeOffset from zookeeper and tell it to read a
> > specified range per partition.  I've done this and forked a simple
> > consumer runnable for each partition, for one of our analyzers.  The
> > great thing about the high-level consumer is that rebalance, so I can
> > fork however many stream readers I want and you just f

RE: New Consumer API discussion

2014-02-24 Thread Withers, Robert
Jun,

Are you saying it is possible to get events from the high-level consumer 
regarding various state machine changes?  For instance, can we get a 
notification when a rebalance starts and ends, when a partition is 
assigned/unassigned, when an offset is committed on a partition, when a leader 
changes and so on?  I call this OOB traffic, since they are not the core 
messages streaming, but side-band events, yet they are still potentially useful 
to consumers.

Thank you,
Robert


Robert Withers
Staff Analyst/Developer
o: (720) 514-8963
c:  (571) 262-1873



-Original Message-
From: Jun Rao [mailto:jun...@gmail.com]
Sent: Sunday, February 23, 2014 4:19 PM
To: users@kafka.apache.org
Subject: Re: New Consumer API discussion

Robert,

For the push orient api, you can potentially implement your own MessageHandler 
with those methods. In the main loop of our new consumer api, you can just call 
those methods based on the events you get.

Also, we already have an api to get the first and the last offset of a 
partition (getOffsetBefore).

Thanks,

Jun


On Sat, Feb 22, 2014 at 11:29 AM, Withers, Robert
wrote:

> This is a good idea, too.  I would modify it to include stream
> marking, then you can have:
>
> long end = consumer.lastOffset(tp);
> consumer.setMark(end);
> while(consumer.beforeMark()) {
>process(consumer.pollToMark());
> }
>
> or
>
> long end = consumer.lastOffset(tp);
> consumer.setMark(end);
> for(Object msg : consumer.iteratorToMark()) {
>process(msg);
> }
>
> I actually have 4 suggestions, then:
>
>  *   pull: stream marking
>  *   pull: finite streams, bound by time range (up-to-now, yesterday) or
> offset
>  *   pull: async api
>  *   push: KafkaMessageSource, for a push model, with msg and OOB events.
>  Build one in either individual or chunk mode and have a listener for
> each msg or a listener for a chunk of msgs.  Make it composable and
> policy driven (chunked, range, commitOffsets policy, retry policy,
> transactional)
>
> Thank you,
> Robert
>
> On Feb 22, 2014, at 11:21 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
>
> I think what Robert is saying is that we need to think through the
> offset API to enable "batch processing" of topic data. Think of a
> process that periodically kicks off to compute a data summary or do a
> data load or something like that. I think what we need to support this
> is an api to fetch the last offset from the server for a partition. Something 
> like
>   long lastOffset(TopicPartition tp)
> and for symmetry
>   long firstOffset(TopicPartition tp)
>
> Likely this would have to be batched. Essentially we should add this
> use case to our set of code examples to write and think through.
>
> The usage would be something like
>
> long end = consumer.lastOffset(tp);
> while(consumer.position < end)
>process(consumer.poll());
>
> -Jay
>
>
> On Sat, Feb 22, 2014 at 1:52 AM, Withers, Robert <robert.with...@dish.com> wrote:
>
> Jun,
>
> I was originally thinking a non-blocking read from a distributed
> stream should distinguish between "no local messages, but a fetch is 
> occurring"
> versus "you have drained the stream".  The reason this may be valuable
> to me is so I can write consumers that read all known traffic then terminate.
> You caused me to reconsider and I think I am conflating 2 things.  One
> is a sync/async api while the other is whether to have an infinite or
> finite stream.  Is it possible to build a finite KafkaStream on a
> range of messages?
>
> Perhaps a Simple Consumer would do just fine and then I could start
> off getting the writeOffset from zookeeper and tell it to read a
> specified range per partition.  I've done this and forked a simple
> consumer runnable for each partition, for one of our analyzers.  The
> great thing about the high-level consumer is that rebalance, so I can
> fork however many stream readers I want and you just figure it out for
> me.  In that way you offer us the control over the resource
> consumption within a pull model.  This is best to regulate message pressure, 
> they say.
>
> Combining that high-level rebalance ability with a ranged partition
> drain could be really nice...build the stream with an ending position
> and it is a finite stream, but retain the high-level rebalance.  With
> a finite stream, you would know the difference of the 2 async
> scenarios: fetch-in-progress versus end-of-stream.  With an infinite
> stream, you never get end-of-stream.
>
> Aside from a high-level consumer over a finite range within each
> partition, the other feature I can think of is more complicated.  A
> high-level consumer has state machine changes that the client can

Re: New Consumer API discussion

2014-02-23 Thread Jun Rao
at is just our use, but instead of a pull-oriented KafkaStream,
> is there any sense in your providing a push-oriented KafkaMessageSource
> publishing OOB messages?
>
> thank you,
> Robert
>
> On Feb 21, 2014, at 5:59 PM, Jun Rao <jun...@gmail.com> wrote:
>
> Robert,
>
> Could you explain why you want to distinguish btw
> FetchingInProgressException
> and NoMessagePendingException? The nextMsgs() method that you want is
> exactly what poll() does.
>
> Thanks,
>
> Jun
>
>
> On Wed, Feb 19, 2014 at 8:45 AM, Withers, Robert <robert.with...@dish.com> wrote:
>
> I am not clear on why the consumer stream should be positionable,
> especially if it is limited to the in-memory fetched messages.  Could
> someone explain to me, please?  I really like the idea of committing the
> offset specifically on those partitions with changed read offsets, only.
>
>
>
> 2 items I would like to see added to the KafkaStream are:
>
> * a non-blocking next(), throws several exceptions
> (FetchingInProgressException and a NoMessagePendingException or something)
> to differentiate between fetching or no messages left.
>
> * A nextMsgs() method which returns all locally available messages
> and kicks off a fetch for the next chunk.
>
>
>
> If you are trying to add transactional features, then formally define a
> DTP capability and pull in other server frameworks to share the
> implementation.  Should it be XA/Open?  How about a new peer2peer DTP
> protocol?
>
>
>
> Thank you,
>
> Robert
>
>
>
> Robert Withers
>
> Staff Analyst/Developer
>
> o: (720) 514-8963
>
> c:  (571) 262-1873
>
>
>
> -Original Message-
> From: Jay Kreps [mailto:jay.kr...@gmail.com]
> Sent: Sunday, February 16, 2014 10:13 AM
> To: users@kafka.apache.org
> Subject: Re: New Consumer API discussion
>
>
>
> +1 I think those are good. It is a little weird that changing the fetch
>
> point is not batched but changing the commit point is, but I suppose there
> is no helping that.
>
>
>
> -Jay
>
>
>
>
>
> On Sat, Feb 15, 2014 at 7:52 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
>
>
> Jay,
>
>
>
> That makes sense. position/seek deal with changing the consumers
>
> in-memory data, so there is no remote rpc there. For some reason, I
>
> got committed and seek mixed up in my head at that time :)
>
>
>
> So we still end up with
>
>
>
>  long position(TopicPartition tp)
>
>  void seek(TopicPartitionOffset p)
>
>  Map committed(TopicPartition tp);
>
>  void commit(TopicPartitionOffset...);
>
>
>
> Thanks,
>
> Neha
>
>
>
> On Friday, February 14, 2014, Jay Kreps <jay.kr...@gmail.com> wrote:
>
>
>
> Oh, interesting. So I am assuming the following implementation:
>
> 1. We have an in-memory fetch position which controls the next fetch
>
> offset.
>
> 2. Changing this has no effect until you poll again at which point
>
> your fetch request will be from the newly specified offset 3. We
>
> then have an in-memory but also remotely stored committed offset.
>
> 4. Calling commit has the effect of saving the fetch position as
>
> both the in memory committed position and in the remote store 5.
>
> Auto-commit is the same as periodically calling commit on all
>
> positions.
>
>
>
> So batching on commit as well as getting the committed position
>
> makes sense, but batching the fetch position wouldn't, right? I
>
> think you are actually thinking of a different approach.
>
>
>
> -Jay
>
>
>
>
>
> On Thu, Feb 13, 2014 at 10:40 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
>
>
> I think you are saying both, i.e. if you have committed on a
>
> partition it returns you that value but if you
>
> haven't
>
> it does a remote lookup?
>
>
>
> Correct.
>
>
>
> The other argument for making committed batched is that commit()
>
> is batched, so there is symmetry.
>
>
>
> position() and seek() are always in memory changes (I assume) so
>
> there
>
> is
>
> no need to batch them.
>

Re: New Consumer API discussion

2014-02-23 Thread Withers, Robert
We use Kafka as a durable buffer for 3rd-party event traffic.  It acts as the 
event source in a lambda architecture.  We want it to be exactly-once and we 
are close, though we can still lose messages while aggregating into Hadoop.  To 
really tie this all together, I think there should be an Apache project to 
implement a proper 3-phase distributed transaction capability, which the Kafka 
and Hadoop communities could implement together.  This paper looks promising.  
It is a 3-RTT protocol, but it is non-blocking.  This could be a part of a new 
consumer API, at some point.

http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1703048

regards,
Rob

Re: New Consumer API discussion

2014-02-22 Thread Withers, Robert
th changed read offsets, only.



2 items I would like to see added to the KafkaStream are:

* a non-blocking next(), throws several exceptions
(FetchingInProgressException and a NoMessagePendingException or something)
to differentiate between fetching or no messages left.

* A nextMsgs() method which returns all locally available messages
and kicks off a fetch for the next chunk.



If you are trying to add transactional features, then formally define a
DTP capability and pull in other server frameworks to share the
implementation.  Should it be XA/Open?  How about a new peer2peer DTP
protocol?



Thank you,

Robert



Robert Withers

Staff Analyst/Developer

o: (720) 514-8963

c:  (571) 262-1873



-Original Message-
From: Jay Kreps [mailto:jay.kr...@gmail.com]
Sent: Sunday, February 16, 2014 10:13 AM
To: users@kafka.apache.org
Subject: Re: New Consumer API discussion



+1 I think those are good. It is a little weird that changing the fetch

point is not batched but changing the commit point is, but I suppose there
is no helping that.



-Jay





On Sat, Feb 15, 2014 at 7:52 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:



Jay,



That makes sense. position/seek deal with changing the consumers

in-memory data, so there is no remote rpc there. For some reason, I

got committed and seek mixed up in my head at that time :)



So we still end up with



 long position(TopicPartition tp)

 void seek(TopicPartitionOffset p)

 Map committed(TopicPartition tp);

 void commit(TopicPartitionOffset...);



Thanks,

Neha



On Friday, February 14, 2014, Jay Kreps <jay.kr...@gmail.com> wrote:



Oh, interesting. So I am assuming the following implementation:

1. We have an in-memory fetch position which controls the next fetch

offset.

2. Changing this has no effect until you poll again at which point

your fetch request will be from the newly specified offset 3. We

then have an in-memory but also remotely stored committed offset.

4. Calling commit has the effect of saving the fetch position as

both the in memory committed position and in the remote store 5.

Auto-commit is the same as periodically calling commit on all

positions.



So batching on commit as well as getting the committed position

makes sense, but batching the fetch position wouldn't, right? I

think you are actually thinking of a different approach.



-Jay





On Thu, Feb 13, 2014 at 10:40 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:



I think you are saying both, i.e. if you have committed on a

partition it returns you that value but if you

haven't

it does a remote lookup?



Correct.



The other argument for making committed batched is that commit()

is batched, so there is symmetry.



position() and seek() are always in memory changes (I assume) so

there

is

no need to batch them.



I'm not as sure as you are about that assumption being true.

Basically

in

my example above, the batching argument for committed() also

applies to

position() since one purpose of fetching a partition's offset is

to use

it

to set the position of the consumer to that offset. Since that

might

lead

to a remote OffsetRequest call, I think we probably would be

better off batching it.



Another option for naming would be position/reposition instead of

position/seek.



I think position/seek is better since it aligns with Java file APIs.



I also think your suggestion about ConsumerPosition makes sense.



Thanks,

Neha

On Feb 13, 2014 9:22 PM, "Jay Kreps" <jay.kr...@gmail.com> wrote:



Hey Neha,



I actually wasn't proposing the name TopicOffsetPosition, that

was

just a

typo. I meant TopicPartitionOffset, and I was just referencing

what

was

in

the javadoc. So to restate my proposal without the typo, using

just

the

existing classes (that naming is a separate question):

 long position(TopicPartition tp)

 void seek(TopicPartitionOffset p)

 long committed(TopicPartition tp)

 void commit(TopicPartitionOffset...);



So I may be unclear on committed() (AKA lastCommittedOffset). Is

it returning the in-memory value from the last commit by this

consumer,

or

is

it doing a remote fetch, or both? I think you are saying both, i.e.

if

you

have committed on a partition it returns you that value but if

you

haven't

it does a remote lookup?



The other argument for making committed batched is that commit()

is batched, so there is symmetry.



position() and seek() are always in memory changes (I assume) so

there

is

n

Re: New Consumer API discussion

2014-02-22 Thread Jay Kreps
 new peer2peer DTP
> protocol?
>
>
>
> Thank you,
>
> Robert
>
>
>
> Robert Withers
>
> Staff Analyst/Developer
>
> o: (720) 514-8963
>
> c:  (571) 262-1873
>
>
>
> -Original Message-
> From: Jay Kreps [mailto:jay.kr...@gmail.com]
> Sent: Sunday, February 16, 2014 10:13 AM
> To: users@kafka.apache.org<mailto:users@kafka.apache.org>
> Subject: Re: New Consumer API discussion
>
>
>
> +1 I think those are good. It is a little weird that changing the fetch
>
> point is not batched but changing the commit point is, but I suppose there
> is no helping that.
>
>
>
> -Jay
>
>
>
>
>
> On Sat, Feb 15, 2014 at 7:52 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
>
>
> Jay,
>
>
>
> That makes sense. position/seek deal with changing the consumers
>
> in-memory data, so there is no remote rpc there. For some reason, I
>
> got committed and seek mixed up in my head at that time :)
>
>
>
> So we still end up with
>
>
>
>   long position(TopicPartition tp)
>
>   void seek(TopicPartitionOffset p)
>
>   Map committed(TopicPartition tp);
>
>   void commit(TopicPartitionOffset...);
>
>
>
> Thanks,
>
> Neha
>
>
>
> On Friday, February 14, 2014, Jay Kreps <jay.kr...@gmail.com> wrote:
>
>
>
> Oh, interesting. So I am assuming the following implementation:
>
> 1. We have an in-memory fetch position which controls the next fetch
>
> offset.
>
> 2. Changing this has no effect until you poll again at which point
>
> your fetch request will be from the newly specified offset 3. We
>
> then have an in-memory but also remotely stored committed offset.
>
> 4. Calling commit has the effect of saving the fetch position as
>
> both the in memory committed position and in the remote store 5.
>
> Auto-commit is the same as periodically calling commit on all
>
> positions.
>
>
>
> So batching on commit as well as getting the committed position
>
> makes sense, but batching the fetch position wouldn't, right? I
>
> think you are actually thinking of a different approach.
>
>
>
> -Jay
>
>
>
>
>
> On Thu, Feb 13, 2014 at 10:40 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
>
>
> I think you are saying both, i.e. if you have committed on a
>
> partition it returns you that value but if you
>
> haven't
>
> it does a remote lookup?
>
>
>
> Correct.
>
>
>
> The other argument for making committed batched is that commit()
>
> is batched, so there is symmetry.
>
>
>
> position() and seek() are always in memory changes (I assume) so
>
> there
>
> is
>
> no need to batch them.
>
>
>
> I'm not as sure as you are about that assumption being true.
>
> Basically
>
> in
>
> my example above, the batching argument for committed() also
>
> applies to
>
> position() since one purpose of fetching a partition's offset is
>
> to use
>
> it
>
> to set the position of the consumer to that offset. Since that
>
> might
>
> lead
>
> to a remote OffsetRequest call, I think we probably would be
>
> better off batching it.
>
>
>
> Another option for naming would be position/reposition instead of
>
> position/seek.
>
>
>
> I think position/seek is better since it aligns with Java file APIs.
>
>
>
> I also think your suggestion about ConsumerPosition makes sense.
>
>
>
> Thanks,
>
> Neha
>
> On Feb 13, 2014 9:22 PM, "Jay Kreps" <jay.kr...@gmail.com> wrote:
>
>
>
> Hey Neha,
>
>
>
> I actually wasn't proposing the name TopicOffsetPosition, that
>
> was
>
> just a
>
> typo. I meant TopicPartitionOffset, and I was just referencing
>
> what
>
> was
>
> in
>
> the javadoc. So to restate my proposal without the typo, using
>
> just
>
> the
>
> existing classes (that naming is a separate question):
>
>   long position(TopicPartition tp)
>
>   void seek(TopicPartitionOffset p)
>
>   long committed(TopicPartition tp)
>
>   void commit(TopicPartitionOffset...);
>
>
>
> So I may be unclear on committed() (AKA lastCommittedOffset). Is
>
> it returning the in-memory value from the last commit by this
>
> consumer,
>
> or
>
> is
>
> it doing a remote fetch, or both? I think yo

Re: New Consumer API discussion

2014-02-22 Thread Withers, Robert
Jun,

I was originally thinking a non-blocking read from a distributed stream should 
distinguish between "no local messages, but a fetch is occurring” versus “you 
have drained the stream”.  The reason this may be valuable to me is so I can 
write consumers that read all known traffic then terminate.  You caused me to 
reconsider and I think I am conflating 2 things.  One is a sync/async api while 
the other is whether to have an infinite or finite stream.  Is it possible to 
build a finite KafkaStream on a range of messages?

Perhaps a Simple Consumer would do just fine and then I could start off getting 
the writeOffset from zookeeper and tell it to read a specified range per 
partition.  I’ve done this and forked a simple consumer runnable for each 
partition, for one of our analyzers.  The great thing about the high-level 
consumer is that rebalance, so I can fork however many stream readers I want 
and you just figure it out for me.  In that way you offer us the control over 
the resource consumption within a pull model.  This is best to regulate message 
pressure, they say.

Combining that high-level rebalance ability with a ranged partition drain could 
be really nice…build the stream with an ending position and it is a finite 
stream, but retain the high-level rebalance.  With a finite stream, you would 
know the difference of the 2 async scenarios: fetch-in-progress versus 
end-of-stream.  With an infinite stream, you never get end-of-stream.

Aside from a high-level consumer over a finite range within each partition, the 
other feature I can think of is more complicated.  A high-level consumer has 
state machine changes that the client cannot access, to my knowledge.  Our use 
of kafka has us invoke a message handler with each message we consumer from the 
KafkaStream, so we convert a pull-model to a push-model.  Including the idea of 
receiving notifications from state machine changes, what would be really nice 
is to have a KafkaMessageSource, that is an eventful push model.  If it were 
thread-safe, then we could register listeners for various events:

 *   opening-stream
 *   closing-stream
 *   message-arrived
 *   end-of-stream/no-more-messages-in-partition (for finite streams)
 *   rebalance started
 *   partition assigned
 *   partition unassigned
 *   rebalance finished
 *   partition-offset-committed

Perhaps that is just our use, but instead of a pull-oriented KafkaStream, is 
there any sense in your providing a push-oriented KafkaMessageSource publishing 
OOB messages?

thank you,
Robert
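
One rough sketch of what such a push-oriented source could look like is below;
every name in it is hypothetical (nothing like this exists in the proposed API),
and it simply restates the event list above as a listener interface that a
KafkaMessageSource would drive from its own poll loop:

public interface KafkaStreamListener {
    void onStreamOpened();
    void onMessage(ConsumerRecord record);                         // core traffic
    void onEndOfStream(TopicPartition partition);                  // finite streams only
    void onRebalanceStarted();
    void onPartitionAssigned(TopicPartition partition);
    void onPartitionUnassigned(TopicPartition partition);
    void onRebalanceFinished();
    void onOffsetCommitted(TopicPartition partition, long offset);
    void onStreamClosed();
}

// A KafkaMessageSource would own the consumer and its poll loop, and fan both
// messages and OOB events out to registered listeners, e.g.:
//   source.addListener(new MyListener());
//   source.start();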

On Feb 21, 2014, at 5:59 PM, Jun Rao <jun...@gmail.com> wrote:

Robert,

Could you explain why you want to distinguish btw FetchingInProgressException
and NoMessagePendingException? The nextMsgs() method that you want is
exactly what poll() does.

Thanks,

Jun


On Wed, Feb 19, 2014 at 8:45 AM, Withers, Robert <robert.with...@dish.com> wrote:

I am not clear on why the consumer stream should be positionable,
especially if it is limited to the in-memory fetched messages.  Could
someone explain to me, please?  I really like the idea of committing the
offset specifically on those partitions with changed read offsets, only.



2 items I would like to see added to the KafkaStream are:

* a non-blocking next(), throws several exceptions
(FetchingInProgressException and a NoMessagePendingException or something)
to differentiate between fetching or no messages left.

* A nextMsgs() method which returns all locally available messages
and kicks off a fetch for the next chunk.
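
Read generously, those two exceptions would let a caller tell "a fetch is still
in flight" apart from "this range is drained". A hypothetical sketch of that
distinction is below; the method shape, the buffered deque, and the exception
types are illustrative only, not part of any proposed API:

// Non-blocking next() over a locally buffered, finite range of messages.
public ConsumerRecord next() {
    ConsumerRecord record = buffered.poll();     // buffered: a local Deque<ConsumerRecord>
    if (record != null) {
        return record;
    }
    if (fetchInProgress) {
        // More messages may still arrive; the caller can retry shortly.
        throw new FetchingInProgressException();
    }
    // Nothing buffered and no fetch outstanding: the range is fully drained.
    throw new NoMessagePendingException();
}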



If you are trying to add transactional features, then formally define a
DTP capability and pull in other server frameworks to share the
implementation.  Should it be XA/Open?  How about a new peer2peer DTP
protocol?



Thank you,

Robert



Robert Withers

Staff Analyst/Developer

o: (720) 514-8963

c:  (571) 262-1873



-Original Message-
From: Jay Kreps [mailto:jay.kr...@gmail.com]
Sent: Sunday, February 16, 2014 10:13 AM
To: users@kafka.apache.org<mailto:users@kafka.apache.org>
Subject: Re: New Consumer API discussion



+1 I think those are good. It is a little weird that changing the fetch

point is not batched but changing the commit point is, but I suppose there
is no helping that.



-Jay





On Sat, Feb 15, 2014 at 7:52 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:



Jay,



That makes sense. position/seek deal with changing the consumers

in-memory data, so there is no remote rpc there. For some reason, I

got committed and seek mixed up in my head at that time :)



So we still end up with



  long position(TopicPartition tp)

  void seek(TopicPartitionOffset p)

  Map committed(TopicPartition tp);

  void commit(TopicPartitionOffset...);



Thanks,

Neha



On Friday, February 14, 2014, Jay Kreps <jay.kr...@gmail.com> wrote:



Oh, interesting. So I a

Re: New Consumer API discussion

2014-02-21 Thread Jay Kreps
Yes, but the problem is that poll() actually has side effects if you are
using auto commit. So you have to do an awkward thing where you track the
last offset you've seen and somehow keep this up to date as the partitions
you own change. Likewise, if you want this value prior to reading any
messages, that won't work.

-Jay
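
To spell that out, a hedged sketch against the draft API in this thread; the
ConsumerRecord accessors and the shape of what poll() returns are assumptions,
since both were still in flux:

// Without position(): track the next fetch offset per partition by hand as
// records arrive, and remember that auto-commit may commit underneath you.
Map<TopicPartition, Long> nextOffsets = new HashMap<>();
for (ConsumerRecord record : lastPollResults) {   // records from the last poll()
    nextOffsets.put(new TopicPartition(record.topic(), record.partition()),
                    record.offset() + 1);
}

// With position(): ask the consumer directly, even before any records have
// been returned for a newly assigned partition.
long next = consumer.position(new TopicPartition("orders", 0));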


On Fri, Feb 21, 2014 at 4:56 PM, Jun Rao  wrote:

> What's the use case of position()? Isn't that just the nextOffset() on the
> last message returned from poll()?
>
> Thanks,
>
> Jun
>
>
> On Sun, Feb 16, 2014 at 9:12 AM, Jay Kreps  wrote:
>
> > +1 I think those are good. It is a little weird that changing the fetch
> > point is not batched but changing the commit point is, but I suppose
> there
> > is no helping that.
> >
> > -Jay
> >
> >
> > On Sat, Feb 15, 2014 at 7:52 AM, Neha Narkhede  > >wrote:
> >
> > > Jay,
> > >
> > > That makes sense. position/seek deal with changing the consumers
> > in-memory
> > > data, so there is no remote rpc there. For some reason, I got committed
> > and
> > > seek mixed up in my head at that time :)
> > >
> > > So we still end up with
> > >
> > >long position(TopicPartition tp)
> > >void seek(TopicPartitionOffset p)
> > >Map committed(TopicPartition tp);
> > >void commit(TopicPartitionOffset...);
> > >
> > > Thanks,
> > > Neha
> > >
> > > On Friday, February 14, 2014, Jay Kreps  wrote:
> > >
> > > > Oh, interesting. So I am assuming the following implementation:
> > > > 1. We have an in-memory fetch position which controls the next fetch
> > > > offset.
> > > > 2. Changing this has no effect until you poll again at which point
> your
> > > > fetch request will be from the newly specified offset
> > > > 3. We then have an in-memory but also remotely stored committed
> offset.
> > > > 4. Calling commit has the effect of saving the fetch position as both
> > the
> > > > in memory committed position and in the remote store
> > > > 5. Auto-commit is the same as periodically calling commit on all
> > > positions.
> > > >
> > > > So batching on commit as well as getting the committed position makes
> > > > sense, but batching the fetch position wouldn't, right? I think you
> are
> > > > actually thinking of a different approach.
> > > >
> > > > -Jay
> > > >
> > > >
> > > > On Thu, Feb 13, 2014 at 10:40 PM, Neha Narkhede <
> > neha.narkh...@gmail.com
> > > 
> > > > >wrote:
> > > >
> > > > > I think you are saying both, i.e. if you
> > > > > have committed on a partition it returns you that value but if you
> > > > haven't
> > > > > it does a remote lookup?
> > > > >
> > > > > Correct.
> > > > >
> > > > > The other argument for making committed batched is that commit() is
> > > > > batched, so there is symmetry.
> > > > >
> > > > > position() and seek() are always in memory changes (I assume) so
> > there
> > > is
> > > > > no need to batch them.
> > > > >
> > > > > I'm not as sure as you are about that assumption being true.
> > Basically
> > > in
> > > > > my example above, the batching argument for committed() also
> applies
> > to
> > > > > position() since one purpose of fetching a partition's offset is to
> > use
> > > > it
> > > > > to set the position of the consumer to that offset. Since that
> might
> > > lead
> > > > > to a remote OffsetRequest call, I think we probably would be better
> > off
> > > > > batching it.
> > > > >
> > > > > Another option for naming would be position/reposition instead
> > > > > of position/seek.
> > > > >
> > > > > I think position/seek is better since it aligns with Java file
> APIs.
> > > > >
> > > > > I also think your suggestion about ConsumerPosition makes sense.
> > > > >
> > > > > Thanks,
> > > > > Neha
> > > > > On Feb 13, 2014 9:22 PM, "Jay Kreps"  wrote:
> > > > >
> > > > > > Hey Neha,
> > > > > >
> > > > > > I actually wasn't proposing the name TopicOffsetPosition, that
> was
> > > > just a
> > > > > > typo. I meant TopicPartitionOffset, and I was just referencing
> what
> > > was
> > > > > in
> > > > > > the javadoc. So to restate my proposal without the typo, using
> just
> > > the
> > > > > > existing classes (that naming is a separate question):
> > > > > >long position(TopicPartition tp)
> > > > > >void seek(TopicPartitionOffset p)
> > > > > >long committed(TopicPartition tp)
> > > > > >void commit(TopicPartitionOffset...);
> > > > > >
> > > > > > So I may be unclear on committed() (AKA lastCommittedOffset). Is
> it
> > > > > > returning the in-memory value from the last commit by this
> > consumer,
> > > or
> > > > > is
> > > > > > it doing a remote fetch, or both? I think you are saying both,
> i.e.
> > > if
> > > > > you
> > > > > > have committed on a partition it returns you that value but if
> you
> > > > > haven't
> > > > > > it does a remote lookup?
> > > > > >
> > > > > > The other argument for making committed batched is that commit()
> is
> > > > > > batched, so there is symmetry.
> > > > > >
> > > > > > position() and seek() are always in memory changes (I assum

Re: New Consumer API discussion

2014-02-21 Thread Jun Rao
Robert,

Could you explain why you want to distinguish btw FetchingInProgressException
and NoMessagePendingException? The nextMsgs() method that you want is
exactly what poll() does.

Thanks,

Jun


On Wed, Feb 19, 2014 at 8:45 AM, Withers, Robert wrote:

> I am not clear on why the consumer stream should be positionable,
> especially if it is limited to the in-memory fetched messages.  Could
> someone explain to me, please?  I really like the idea of committing the
> offset specifically on those partitions with changed read offsets, only.
>
>
>
> 2 items I would like to see added to the KafkaStream are:
>
> * a non-blocking next(), throws several exceptions
> (FetchingInProgressException and a NoMessagePendingException or something)
> to differentiate between fetching or no messages left.
>
> * A nextMsgs() method which returns all locally available messages
> and kicks off a fetch for the next chunk.
>
>
>
> If you are trying to add transactional features, then formally define a
> DTP capability and pull in other server frameworks to share the
> implementation.  Should it be XA/Open?  How about a new peer2peer DTP
> protocol?
>
>
>
> Thank you,
>
> Robert
>
>
>
> Robert Withers
>
> Staff Analyst/Developer
>
> o: (720) 514-8963
>
> c:  (571) 262-1873
>
>
>
> -Original Message-
> From: Jay Kreps [mailto:jay.kr...@gmail.com]
> Sent: Sunday, February 16, 2014 10:13 AM
> To: users@kafka.apache.org
> Subject: Re: New Consumer API discussion
>
>
>
> +1 I think those are good. It is a little weird that changing the fetch
>
> point is not batched but changing the commit point is, but I suppose there
> is no helping that.
>
>
>
> -Jay
>
>
>
>
>
> On Sat, Feb 15, 2014 at 7:52 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
>
>
> > Jay,
>
> >
>
> > That makes sense. position/seek deal with changing the consumers
>
> > in-memory data, so there is no remote rpc there. For some reason, I
>
> > got committed and seek mixed up in my head at that time :)
>
> >
>
> > So we still end up with
>
> >
>
> >long position(TopicPartition tp)
>
> >void seek(TopicPartitionOffset p)
>
> >Map committed(TopicPartition tp);
>
> >void commit(TopicPartitionOffset...);
>
> >
>
> > Thanks,
>
> > Neha
>
> >
>
> > On Friday, February 14, 2014, Jay Kreps <jay.kr...@gmail.com> wrote:
>
> >
>
> > > Oh, interesting. So I am assuming the following implementation:
>
> > > 1. We have an in-memory fetch position which controls the next fetch
>
> > > offset.
>
> > > 2. Changing this has no effect until you poll again at which point
>
> > > your fetch request will be from the newly specified offset 3. We
>
> > > then have an in-memory but also remotely stored committed offset.
>
> > > 4. Calling commit has the effect of saving the fetch position as
>
> > > both the in memory committed position and in the remote store 5.
>
> > > Auto-commit is the same as periodically calling commit on all
>
> > positions.
>
> > >
>
> > > So batching on commit as well as getting the committed position
>
> > > makes sense, but batching the fetch position wouldn't, right? I
>
> > > think you are actually thinking of a different approach.
>
> > >
>
> > > -Jay
>
> > >
>
> > >
>
> > > On Thu, Feb 13, 2014 at 10:40 PM, Neha Narkhede
>
> > > 
> > 
>
> > > >wrote:
>
> > >
>
> > > > I think you are saying both, i.e. if you have committed on a
>
> > > > partition it returns you that value but if you
>
> > > haven't
>
> > > > it does a remote lookup?
>
> > > >
>
> > > > Correct.
>
> > > >
>
> > > > The other argument for making committed batched is that commit()
>
> > > > is batched, so there is symmetry.
>
> > > >
>
> > > > position() and seek() are always in memory changes (I assume) so
>
> > > > there
>
> > is
>
> > > > no need to batch them.
>
> > > >
>
> > > > I'm not as sure as you are about that assumption being true.
>
> > > > Basically
>
> > in
>
> > > > my example above, the batching argument for committed() also
>
> > > > applies to
>
> > > > position() since one purpose of fe

Re: New Consumer API discussion

2014-02-21 Thread Jun Rao
What's the use case of position()? Isn't that just the nextOffset() on the
last message returned from poll()?

Thanks,

Jun


On Sun, Feb 16, 2014 at 9:12 AM, Jay Kreps  wrote:

> +1 I think those are good. It is a little weird that changing the fetch
> point is not batched but changing the commit point is, but I suppose there
> is no helping that.
>
> -Jay
>
>
> On Sat, Feb 15, 2014 at 7:52 AM, Neha Narkhede  >wrote:
>
> > Jay,
> >
> > That makes sense. position/seek deal with changing the consumers
> in-memory
> > data, so there is no remote rpc there. For some reason, I got committed
> and
> > seek mixed up in my head at that time :)
> >
> > So we still end up with
> >
> >long position(TopicPartition tp)
> >void seek(TopicPartitionOffset p)
> >Map committed(TopicPartition tp);
> >void commit(TopicPartitionOffset...);
> >
> > Thanks,
> > Neha
> >
> > On Friday, February 14, 2014, Jay Kreps  wrote:
> >
> > > Oh, interesting. So I am assuming the following implementation:
> > > 1. We have an in-memory fetch position which controls the next fetch
> > > offset.
> > > 2. Changing this has no effect until you poll again at which point your
> > > fetch request will be from the newly specified offset
> > > 3. We then have an in-memory but also remotely stored committed offset.
> > > 4. Calling commit has the effect of saving the fetch position as both
> the
> > > in memory committed position and in the remote store
> > > 5. Auto-commit is the same as periodically calling commit on all
> > positions.
> > >
> > > So batching on commit as well as getting the committed position makes
> > > sense, but batching the fetch position wouldn't, right? I think you are
> > > actually thinking of a different approach.
> > >
> > > -Jay
> > >
> > >
> > > On Thu, Feb 13, 2014 at 10:40 PM, Neha Narkhede <
> neha.narkh...@gmail.com
> > 
> > > >wrote:
> > >
> > > > I think you are saying both, i.e. if you
> > > > have committed on a partition it returns you that value but if you
> > > haven't
> > > > it does a remote lookup?
> > > >
> > > > Correct.
> > > >
> > > > The other argument for making committed batched is that commit() is
> > > > batched, so there is symmetry.
> > > >
> > > > position() and seek() are always in memory changes (I assume) so
> there
> > is
> > > > no need to batch them.
> > > >
> > > > I'm not as sure as you are about that assumption being true.
> Basically
> > in
> > > > my example above, the batching argument for committed() also applies
> to
> > > > position() since one purpose of fetching a partition's offset is to
> use
> > > it
> > > > to set the position of the consumer to that offset. Since that might
> > lead
> > > > to a remote OffsetRequest call, I think we probably would be better
> off
> > > > batching it.
> > > >
> > > > Another option for naming would be position/reposition instead
> > > > of position/seek.
> > > >
> > > > I think position/seek is better since it aligns with Java file APIs.
> > > >
> > > > I also think your suggestion about ConsumerPosition makes sense.
> > > >
> > > > Thanks,
> > > > Neha
> > > > On Feb 13, 2014 9:22 PM, "Jay Kreps"  wrote:
> > > >
> > > > > Hey Neha,
> > > > >
> > > > > I actually wasn't proposing the name TopicOffsetPosition, that was
> > > just a
> > > > > typo. I meant TopicPartitionOffset, and I was just referencing what
> > was
> > > > in
> > > > > the javadoc. So to restate my proposal without the typo, using just
> > the
> > > > > existing classes (that naming is a separate question):
> > > > >long position(TopicPartition tp)
> > > > >void seek(TopicPartitionOffset p)
> > > > >long committed(TopicPartition tp)
> > > > >void commit(TopicPartitionOffset...);
> > > > >
> > > > > So I may be unclear on committed() (AKA lastCommittedOffset). Is it
> > > > > returning the in-memory value from the last commit by this
> consumer,
> > or
> > > > is
> > > > > it doing a remote fetch, or both? I think you are saying both, i.e.
> > if
> > > > you
> > > > > have committed on a partition it returns you that value but if you
> > > > haven't
> > > > > it does a remote lookup?
> > > > >
> > > > > The other argument for making committed batched is that commit() is
> > > > > batched, so there is symmetry.
> > > > >
> > > > > position() and seek() are always in memory changes (I assume) so
> > there
> > > is
> > > > > no need to batch them.
> > > > >
> > > > > So taking all that into account what if we revise it to
> > > > >long position(TopicPartition tp)
> > > > >void seek(TopicPartitionOffset p)
> > > > >Map committed(TopicPartition tp);
> > > > >void commit(TopicPartitionOffset...);
> > > > >
> > > > > This is not symmetric between position/seek and commit/committed
> but
> > it
> > > > is
> > > > > convenient. Another option for naming would be position/reposition
> > > > instead
> > > > > of position/seek.
> > > > >
> > > > > With respect to the name TopicPartitionOffset, what I was trying to
> > say
> > > > is
> > >

RE: New Consumer API discussion

2014-02-19 Thread Withers, Robert
I am not clear on why the consumer stream should be positionable, especially if
it is limited to the in-memory fetched messages. Could someone explain, please?
I really like the idea of committing the offset only on those partitions whose
read offsets have actually changed.

Two items I would like to see added to the KafkaStream are:

* A non-blocking next() that throws distinct exceptions
(FetchingInProgressException and a NoMessagePendingException, or something
similar) to differentiate between a fetch still in progress and no messages
left.

* A nextMsgs() method that returns all locally available messages and
kicks off a fetch for the next chunk.
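
A rough sketch of what those two additions might look like, purely as an
illustration: none of these types exist in the proposal, and the names simply
mirror the two methods requested above.

    import java.util.List;

    // Hypothetical sketch only: these types are not part of the proposed API;
    // they illustrate the two methods requested above.
    interface NonBlockingStream<R> {

        // Returns the next locally buffered record without blocking. Throws
        // FetchingInProgressException if a fetch is still outstanding, or
        // NoMessagePendingException if the local buffer is drained and nothing
        // is in flight.
        R next() throws FetchingInProgressException, NoMessagePendingException;

        // Returns every locally buffered record and kicks off a background
        // fetch for the next chunk.
        List<R> nextMsgs();
    }

    class FetchingInProgressException extends Exception {}
    class NoMessagePendingException extends Exception {}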



If you are trying to add transactional features, then formally define a DTP
capability and pull in other server frameworks to share the implementation.
Should it be XA/Open? How about a new peer2peer DTP protocol?

Thank you,

Robert

Robert Withers
Staff Analyst/Developer
o: (720) 514-8963
c: (571) 262-1873



-Original Message-
From: Jay Kreps [mailto:jay.kr...@gmail.com]
Sent: Sunday, February 16, 2014 10:13 AM
To: users@kafka.apache.org
Subject: Re: New Consumer API discussion



+1 I think those are good. It is a little weird that changing the fetch
point is not batched but changing the commit point is, but I suppose there is
no helping that.

-Jay


On Sat, Feb 15, 2014 at 7:52 AM, Neha Narkhede wrote:

> Jay,
>
> That makes sense. position/seek deal with changing the consumers
> in-memory data, so there is no remote rpc there. For some reason, I
> got committed and seek mixed up in my head at that time :)
>
> So we still end up with
>
>    long position(TopicPartition tp)
>    void seek(TopicPartitionOffset p)
>    Map committed(TopicPartition tp);
>    void commit(TopicPartitionOffset...);
>
> Thanks,
> Neha
>
> On Friday, February 14, 2014, Jay Kreps wrote:
>
> > Oh, interesting. So I am assuming the following implementation:
> > 1. We have an in-memory fetch position which controls the next fetch
> > offset.
> > 2. Changing this has no effect until you poll again at which point
> > your fetch request will be from the newly specified offset
> > 3. We then have an in-memory but also remotely stored committed offset.
> > 4. Calling commit has the effect of saving the fetch position as
> > both the in memory committed position and in the remote store
> > 5. Auto-commit is the same as periodically calling commit on all
> > positions.
> >
> > So batching on commit as well as getting the committed position
> > makes sense, but batching the fetch position wouldn't, right? I
> > think you are actually thinking of a different approach.
> >
> > -Jay
> >
> >
> > On Thu, Feb 13, 2014 at 10:40 PM, Neha Narkhede wrote:
> >
> > > I think you are saying both, i.e. if you have committed on a
> > > partition it returns you that value but if you haven't
> > > it does a remote lookup?
> > >
> > > Correct.
> > >
> > > The other argument for making committed batched is that commit()
> > > is batched, so there is symmetry.
> > >
> > > position() and seek() are always in memory changes (I assume) so
> > > there is no need to batch them.
> > >
> > > I'm not as sure as you are about that assumption being true.
> > > Basically in my example above, the batching argument for committed()
> > > also applies to position() since one purpose of fetching a partition's
> > > offset is to use it to set the position of the consumer to that offset.
> > > Since that might lead to a remote OffsetRequest call, I think we
> > > probably would be better off batching it.
> > >
> > > Another option for naming would be position/reposition instead of
> > > position/seek.
> > >
> > > I think position/seek is better since it aligns with Java file APIs.
> > >
> > > I also think your suggestion about ConsumerPosition makes sense.
> > >
> > > Thanks,
> > > Neha
> > > On Feb 13, 2014 9:22 PM, "Jay Kreps" wrote:
> > >
> > > > Hey Neha,
> > > >
> > > > I actually wasn't proposing the name TopicOffsetPosition, that
> > > > was just a
> > > > typo. I meant TopicPartitionOffset, and I was just refer

Re: New Consumer API discussion

2014-02-16 Thread Jay Kreps
+1 I think those are good. It is a little weird that changing the fetch
point is not batched but changing the commit point is, but I suppose there
is no helping that.

-Jay


On Sat, Feb 15, 2014 at 7:52 AM, Neha Narkhede wrote:

> Jay,
>
> That makes sense. position/seek deal with changing the consumers in-memory
> data, so there is no remote rpc there. For some reason, I got committed and
> seek mixed up in my head at that time :)
>
> So we still end up with
>
>long position(TopicPartition tp)
>void seek(TopicPartitionOffset p)
>Map committed(TopicPartition tp);
>void commit(TopicPartitionOffset...);
>
> Thanks,
> Neha
>
> On Friday, February 14, 2014, Jay Kreps  wrote:
>
> > Oh, interesting. So I am assuming the following implementation:
> > 1. We have an in-memory fetch position which controls the next fetch
> > offset.
> > 2. Changing this has no effect until you poll again at which point your
> > fetch request will be from the newly specified offset
> > 3. We then have an in-memory but also remotely stored committed offset.
> > 4. Calling commit has the effect of saving the fetch position as both the
> > in memory committed position and in the remote store
> > 5. Auto-commit is the same as periodically calling commit on all
> positions.
> >
> > So batching on commit as well as getting the committed position makes
> > sense, but batching the fetch position wouldn't, right? I think you are
> > actually thinking of a different approach.
> >
> > -Jay
> >
> >
> > On Thu, Feb 13, 2014 at 10:40 PM, Neha Narkhede  
> > >wrote:
> >
> > > I think you are saying both, i.e. if you
> > > have committed on a partition it returns you that value but if you
> > haven't
> > > it does a remote lookup?
> > >
> > > Correct.
> > >
> > > The other argument for making committed batched is that commit() is
> > > batched, so there is symmetry.
> > >
> > > position() and seek() are always in memory changes (I assume) so there
> is
> > > no need to batch them.
> > >
> > > I'm not as sure as you are about that assumption being true. Basically
> in
> > > my example above, the batching argument for committed() also applies to
> > > position() since one purpose of fetching a partition's offset is to use
> > it
> > > to set the position of the consumer to that offset. Since that might
> lead
> > > to a remote OffsetRequest call, I think we probably would be better off
> > > batching it.
> > >
> > > Another option for naming would be position/reposition instead
> > > of position/seek.
> > >
> > > I think position/seek is better since it aligns with Java file APIs.
> > >
> > > I also think your suggestion about ConsumerPosition makes sense.
> > >
> > > Thanks,
> > > Neha
> > > On Feb 13, 2014 9:22 PM, "Jay Kreps"  wrote:
> > >
> > > > Hey Neha,
> > > >
> > > > I actually wasn't proposing the name TopicOffsetPosition, that was
> > just a
> > > > typo. I meant TopicPartitionOffset, and I was just referencing what
> was
> > > in
> > > > the javadoc. So to restate my proposal without the typo, using just
> the
> > > > existing classes (that naming is a separate question):
> > > >long position(TopicPartition tp)
> > > >void seek(TopicPartitionOffset p)
> > > >long committed(TopicPartition tp)
> > > >void commit(TopicPartitionOffset...);
> > > >
> > > > So I may be unclear on committed() (AKA lastCommittedOffset). Is it
> > > > returning the in-memory value from the last commit by this consumer,
> or
> > > is
> > > > it doing a remote fetch, or both? I think you are saying both, i.e.
> if
> > > you
> > > > have committed on a partition it returns you that value but if you
> > > haven't
> > > > it does a remote lookup?
> > > >
> > > > The other argument for making committed batched is that commit() is
> > > > batched, so there is symmetry.
> > > >
> > > > position() and seek() are always in memory changes (I assume) so
> there
> > is
> > > > no need to batch them.
> > > >
> > > > So taking all that into account what if we revise it to
> > > >long position(TopicPartition tp)
> > > >void seek(TopicPartitionOffset p)
> > > >Map committed(TopicPartition tp);
> > > >void commit(TopicPartitionOffset...);
> > > >
> > > > This is not symmetric between position/seek and commit/committed but
> it
> > > is
> > > > convenient. Another option for naming would be position/reposition
> > > instead
> > > > of position/seek.
> > > >
> > > > With respect to the name TopicPartitionOffset, what I was trying to
> say
> > > is
> > > > that I recommend we change that to something shorter. I think
> > > TopicPosition
> > > > or ConsumerPosition might be better. Position does not refer to the
> > > > variables in the object, it refers to the meaning of the object--it
> > > > represents a position within a topic. The offset field in that object
> > is
> > > > still called the offset. TopicOffset, PartitionOffset, or
> > ConsumerOffset
> > > > would all be workable too. Basically I am just objecting to
> > concatenating

Re: New Consumer API discussion

2014-02-15 Thread Neha Narkhede
Jay,

That makes sense. position/seek deal with changing the consumers in-memory
data, so there is no remote rpc there. For some reason, I got committed and
seek mixed up in my head at that time :)

So we still end up with

   long position(TopicPartition tp)
   void seek(TopicPartitionOffset p)
   Map committed(TopicPartition tp);
   void commit(TopicPartitionOffset...);

Thanks,
Neha

On Friday, February 14, 2014, Jay Kreps  wrote:

> Oh, interesting. So I am assuming the following implementation:
> 1. We have an in-memory fetch position which controls the next fetch
> offset.
> 2. Changing this has no effect until you poll again at which point your
> fetch request will be from the newly specified offset
> 3. We then have an in-memory but also remotely stored committed offset.
> 4. Calling commit has the effect of saving the fetch position as both the
> in memory committed position and in the remote store
> 5. Auto-commit is the same as periodically calling commit on all positions.
>
> So batching on commit as well as getting the committed position makes
> sense, but batching the fetch position wouldn't, right? I think you are
> actually thinking of a different approach.
>
> -Jay
>
>
> On Thu, Feb 13, 2014 at 10:40 PM, Neha Narkhede 
> 
> >wrote:
>
> > I think you are saying both, i.e. if you
> > have committed on a partition it returns you that value but if you
> haven't
> > it does a remote lookup?
> >
> > Correct.
> >
> > The other argument for making committed batched is that commit() is
> > batched, so there is symmetry.
> >
> > position() and seek() are always in memory changes (I assume) so there is
> > no need to batch them.
> >
> > I'm not as sure as you are about that assumption being true. Basically in
> > my example above, the batching argument for committed() also applies to
> > position() since one purpose of fetching a partition's offset is to use
> it
> > to set the position of the consumer to that offset. Since that might lead
> > to a remote OffsetRequest call, I think we probably would be better off
> > batching it.
> >
> > Another option for naming would be position/reposition instead
> > of position/seek.
> >
> > I think position/seek is better since it aligns with Java file APIs.
> >
> > I also think your suggestion about ConsumerPosition makes sense.
> >
> > Thanks,
> > Neha
> > On Feb 13, 2014 9:22 PM, "Jay Kreps"  wrote:
> >
> > > Hey Neha,
> > >
> > > I actually wasn't proposing the name TopicOffsetPosition, that was
> just a
> > > typo. I meant TopicPartitionOffset, and I was just referencing what was
> > in
> > > the javadoc. So to restate my proposal without the typo, using just the
> > > existing classes (that naming is a separate question):
> > >long position(TopicPartition tp)
> > >void seek(TopicPartitionOffset p)
> > >long committed(TopicPartition tp)
> > >void commit(TopicPartitionOffset...);
> > >
> > > So I may be unclear on committed() (AKA lastCommittedOffset). Is it
> > > returning the in-memory value from the last commit by this consumer, or
> > is
> > > it doing a remote fetch, or both? I think you are saying both, i.e. if
> > you
> > > have committed on a partition it returns you that value but if you
> > haven't
> > > it does a remote lookup?
> > >
> > > The other argument for making committed batched is that commit() is
> > > batched, so there is symmetry.
> > >
> > > position() and seek() are always in memory changes (I assume) so there
> is
> > > no need to batch them.
> > >
> > > So taking all that into account what if we revise it to
> > >long position(TopicPartition tp)
> > >void seek(TopicPartitionOffset p)
> > >Map committed(TopicPartition tp);
> > >void commit(TopicPartitionOffset...);
> > >
> > > This is not symmetric between position/seek and commit/committed but it
> > is
> > > convenient. Another option for naming would be position/reposition
> > instead
> > > of position/seek.
> > >
> > > With respect to the name TopicPartitionOffset, what I was trying to say
> > is
> > > that I recommend we change that to something shorter. I think
> > TopicPosition
> > > or ConsumerPosition might be better. Position does not refer to the
> > > variables in the object, it refers to the meaning of the object--it
> > > represents a position within a topic. The offset field in that object
> is
> > > still called the offset. TopicOffset, PartitionOffset, or
> ConsumerOffset
> > > would all be workable too. Basically I am just objecting to
> concatenating
> > > three nouns together. :-)
> > >
> > > -Jay
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Feb 13, 2014 at 1:54 PM, Neha Narkhede <
> neha.narkh...@gmail.com
> > > >wrote:
> > >
> > > > 2. It returns a list of results. But how can you use the list? The
> only
> > > way
> > > > to use the list is to make a map of tp=>offset and then look up
> results
> > > in
> > > > this map (or do a for loop over the list for the partition you
> want). I
> > > > recommend that if this is an in-memor

Re: New Consumer API discussion

2014-02-14 Thread Jay Kreps
Oh, interesting. So I am assuming the following implementation:
1. We have an in-memory fetch position which controls the next fetch
offset.
2. Changing this has no effect until you poll again at which point your
fetch request will be from the newly specified offset
3. We then have an in-memory but also remotely stored committed offset.
4. Calling commit has the effect of saving the fetch position as both the
in memory committed position and in the remote store
5. Auto-commit is the same as periodically calling commit on all positions.

So batching on commit as well as getting the committed position makes
sense, but batching the fetch position wouldn't, right? I think you are
actually thinking of a different approach.

-Jay


On Thu, Feb 13, 2014 at 10:40 PM, Neha Narkhede wrote:

> I think you are saying both, i.e. if you
> have committed on a partition it returns you that value but if you haven't
> it does a remote lookup?
>
> Correct.
>
> The other argument for making committed batched is that commit() is
> batched, so there is symmetry.
>
> position() and seek() are always in memory changes (I assume) so there is
> no need to batch them.
>
> I'm not as sure as you are about that assumption being true. Basically in
> my example above, the batching argument for committed() also applies to
> position() since one purpose of fetching a partition's offset is to use it
> to set the position of the consumer to that offset. Since that might lead
> to a remote OffsetRequest call, I think we probably would be better off
> batching it.
>
> Another option for naming would be position/reposition instead
> of position/seek.
>
> I think position/seek is better since it aligns with Java file APIs.
>
> I also think your suggestion about ConsumerPosition makes sense.
>
> Thanks,
> Neha
> On Feb 13, 2014 9:22 PM, "Jay Kreps"  wrote:
>
> > Hey Neha,
> >
> > I actually wasn't proposing the name TopicOffsetPosition, that was just a
> > typo. I meant TopicPartitionOffset, and I was just referencing what was
> in
> > the javadoc. So to restate my proposal without the typo, using just the
> > existing classes (that naming is a separate question):
> >long position(TopicPartition tp)
> >void seek(TopicPartitionOffset p)
> >long committed(TopicPartition tp)
> >void commit(TopicPartitionOffset...);
> >
> > So I may be unclear on committed() (AKA lastCommittedOffset). Is it
> > returning the in-memory value from the last commit by this consumer, or
> is
> > it doing a remote fetch, or both? I think you are saying both, i.e. if
> you
> > have committed on a partition it returns you that value but if you
> haven't
> > it does a remote lookup?
> >
> > The other argument for making committed batched is that commit() is
> > batched, so there is symmetry.
> >
> > position() and seek() are always in memory changes (I assume) so there is
> > no need to batch them.
> >
> > So taking all that into account what if we revise it to
> >long position(TopicPartition tp)
> >void seek(TopicPartitionOffset p)
> >Map committed(TopicPartition tp);
> >void commit(TopicPartitionOffset...);
> >
> > This is not symmetric between position/seek and commit/committed but it
> is
> > convenient. Another option for naming would be position/reposition
> instead
> > of position/seek.
> >
> > With respect to the name TopicPartitionOffset, what I was trying to say
> is
> > that I recommend we change that to something shorter. I think
> TopicPosition
> > or ConsumerPosition might be better. Position does not refer to the
> > variables in the object, it refers to the meaning of the object--it
> > represents a position within a topic. The offset field in that object is
> > still called the offset. TopicOffset, PartitionOffset, or ConsumerOffset
> > would all be workable too. Basically I am just objecting to concatenating
> > three nouns together. :-)
> >
> > -Jay
> >
> >
> >
> >
> >
> > On Thu, Feb 13, 2014 at 1:54 PM, Neha Narkhede  > >wrote:
> >
> > > 2. It returns a list of results. But how can you use the list? The only
> > way
> > > to use the list is to make a map of tp=>offset and then look up results
> > in
> > > this map (or do a for loop over the list for the partition you want). I
> > > recommend that if this is an in-memory check we just do one at a time.
> > E.g.
> > > long committedPosition(
> > > TopicPosition).
> > >
> > > This was discussed in the previous emails. There is a choice between
> > > returning a map or a list. Some people found the map to be more usable.
> > >
> > > What if we made it:
> > >long position(TopicPartition tp)
> > >void seek(TopicOffsetPosition p)
> > >long committed(TopicPartition tp)
> > >void commit(TopicOffsetPosition...);
> > >
> > > This is fine, but TopicOffsetPosition doesn't make sense. Offset and
> > > Position is confusing. Also both fetch and commit positions are related
> > to
> > > partitions, not topics. Some more options are TopicPartitionPosition or
> > > TopicPart

Re: New Consumer API discussion

2014-02-13 Thread Neha Narkhede
I think you are saying both, i.e. if you
have committed on a partition it returns you that value but if you haven't
it does a remote lookup?

Correct.

The other argument for making committed batched is that commit() is
batched, so there is symmetry.

position() and seek() are always in memory changes (I assume) so there is
no need to batch them.

I'm not as sure as you are about that assumption being true. Basically in
my example above, the batching argument for committed() also applies to
position() since one purpose of fetching a partition's offset is to use it
to set the position of the consumer to that offset. Since that might lead
to a remote OffsetRequest call, I think we probably would be better off
batching it.

Another option for naming would be position/reposition instead
of position/seek.

I think position/seek is better since it aligns with Java file APIs.

I also think your suggestion about ConsumerPosition makes sense.

Thanks,
Neha
On Feb 13, 2014 9:22 PM, "Jay Kreps"  wrote:

> Hey Neha,
>
> I actually wasn't proposing the name TopicOffsetPosition, that was just a
> typo. I meant TopicPartitionOffset, and I was just referencing what was in
> the javadoc. So to restate my proposal without the typo, using just the
> existing classes (that naming is a separate question):
>long position(TopicPartition tp)
>void seek(TopicPartitionOffset p)
>long committed(TopicPartition tp)
>void commit(TopicPartitionOffset...);
>
> So I may be unclear on committed() (AKA lastCommittedOffset). Is it
> returning the in-memory value from the last commit by this consumer, or is
> it doing a remote fetch, or both? I think you are saying both, i.e. if you
> have committed on a partition it returns you that value but if you haven't
> it does a remote lookup?
>
> The other argument for making committed batched is that commit() is
> batched, so there is symmetry.
>
> position() and seek() are always in memory changes (I assume) so there is
> no need to batch them.
>
> So taking all that into account what if we revise it to
>long position(TopicPartition tp)
>void seek(TopicPartitionOffset p)
>Map committed(TopicPartition tp);
>void commit(TopicPartitionOffset...);
>
> This is not symmetric between position/seek and commit/committed but it is
> convenient. Another option for naming would be position/reposition instead
> of position/seek.
>
> With respect to the name TopicPartitionOffset, what I was trying to say is
> that I recommend we change that to something shorter. I think TopicPosition
> or ConsumerPosition might be better. Position does not refer to the
> variables in the object, it refers to the meaning of the object--it
> represents a position within a topic. The offset field in that object is
> still called the offset. TopicOffset, PartitionOffset, or ConsumerOffset
> would all be workable too. Basically I am just objecting to concatenating
> three nouns together. :-)
>
> -Jay
>
>
>
>
>
> On Thu, Feb 13, 2014 at 1:54 PM, Neha Narkhede  >wrote:
>
> > 2. It returns a list of results. But how can you use the list? The only
> way
> > to use the list is to make a map of tp=>offset and then look up results
> in
> > this map (or do a for loop over the list for the partition you want). I
> > recommend that if this is an in-memory check we just do one at a time.
> E.g.
> > long committedPosition(
> > TopicPosition).
> >
> > This was discussed in the previous emails. There is a choice between
> > returning a map or a list. Some people found the map to be more usable.
> >
> > What if we made it:
> >long position(TopicPartition tp)
> >void seek(TopicOffsetPosition p)
> >long committed(TopicPartition tp)
> >void commit(TopicOffsetPosition...);
> >
> > This is fine, but TopicOffsetPosition doesn't make sense. Offset and
> > Position is confusing. Also both fetch and commit positions are related
> to
> > partitions, not topics. Some more options are TopicPartitionPosition or
> > TopicPartitionOffset. And we should use either position everywhere in
> Kafka
> > or offset but having both is confusing.
> >
> >void seek(TopicOffsetPosition p)
> >long committed(TopicPartition tp)
> >
> > Whether these are batched or not really depends on how flexible we want
> > these APIs to be. The question is whether we allow a consumer to fetch or
> > set the offsets for partitions that it doesn't own or consume. For
> example,
> > if I choose to skip group management and do my own partition assignment
> but
> > choose Kafka based offset management. I could imagine a use case where I
> > want to change the partition assignment on the fly, and to do that, I
> would
> > need to fetch the last committed offsets of partitions that I currently
> > don't consume.
> >
> > If we want to allow this, these APIs would be more performant if batched.
> > And would probably look like -
> >Map positions(TopicPartition... tp)
> >void seek(TopicOffsetPosition... p)
> >Map committed(TopicPartit

Re: New Consumer API discussion

2014-02-13 Thread Jay Kreps
Hey Neha,

I actually wasn't proposing the name TopicOffsetPosition, that was just a
typo. I meant TopicPartitionOffset, and I was just referencing what was in
the javadoc. So to restate my proposal without the typo, using just the
existing classes (that naming is a separate question):
   long position(TopicPartition tp)
   void seek(TopicPartitionOffset p)
   long committed(TopicPartition tp)
   void commit(TopicPartitionOffset...);

So I may be unclear on committed() (AKA lastCommittedOffset). Is it
returning the in-memory value from the last commit by this consumer, or is
it doing a remote fetch, or both? I think you are saying both, i.e. if you
have committed on a partition it returns you that value but if you haven't
it does a remote lookup?

The other argument for making committed batched is that commit() is
batched, so there is symmetry.

position() and seek() are always in memory changes (I assume) so there is
no need to batch them.

So taking all that into account what if we revise it to
   long position(TopicPartition tp)
   void seek(TopicPartitionOffset p)
   Map committed(TopicPartition tp);
   void commit(TopicPartitionOffset...);

This is not symmetric between position/seek and commit/committed but it is
convenient. Another option for naming would be position/reposition instead
of position/seek.
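
As a rough sketch, the revised shape above might read as the following
interface fragment. The method names come from this thread; the Map generic
parameters and the varargs arity of committed() are assumptions (the generics
are not spelled out above), and the two record types are minimal stand-ins,
not the proposed classes.

    import java.util.Map;

    // Sketch of the API shape under discussion, not the final KafkaConsumer API.
    interface ConsumerPositions {

        // Fetch position: an in-memory value that takes effect on the next poll().
        long position(TopicPartition tp);
        void seek(TopicPartitionOffset p);

        // Committed position: may involve a remote lookup, hence batched.
        Map<TopicPartition, Long> committed(TopicPartition... tps);
        void commit(TopicPartitionOffset... offsets);
    }

    // Minimal stand-ins so the sketch is self-contained; the real classes live
    // in the proposed consumer package.
    record TopicPartition(String topic, int partition) {}
    record TopicPartitionOffset(String topic, int partition, long offset) {}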

With respect to the name TopicPartitionOffset, what I was trying to say is
that I recommend we change that to something shorter. I think TopicPosition
or ConsumerPosition might be better. Position does not refer to the
variables in the object, it refers to the meaning of the object--it
represents a position within a topic. The offset field in that object is
still called the offset. TopicOffset, PartitionOffset, or ConsumerOffset
would all be workable too. Basically I am just objecting to concatenating
three nouns together. :-)

-Jay





On Thu, Feb 13, 2014 at 1:54 PM, Neha Narkhede wrote:

> 2. It returns a list of results. But how can you use the list? The only way
> to use the list is to make a map of tp=>offset and then look up results in
> this map (or do a for loop over the list for the partition you want). I
> recommend that if this is an in-memory check we just do one at a time. E.g.
> long committedPosition(
> TopicPosition).
>
> This was discussed in the previous emails. There is a choice between
> returning a map or a list. Some people found the map to be more usable.
>
> What if we made it:
>long position(TopicPartition tp)
>void seek(TopicOffsetPosition p)
>long committed(TopicPartition tp)
>void commit(TopicOffsetPosition...);
>
> This is fine, but TopicOffsetPosition doesn't make sense. Offset and
> Position is confusing. Also both fetch and commit positions are related to
> partitions, not topics. Some more options are TopicPartitionPosition or
> TopicPartitionOffset. And we should use either position everywhere in Kafka
> or offset but having both is confusing.
>
>void seek(TopicOffsetPosition p)
>long committed(TopicPartition tp)
>
> Whether these are batched or not really depends on how flexible we want
> these APIs to be. The question is whether we allow a consumer to fetch or
> set the offsets for partitions that it doesn't own or consume. For example,
> if I choose to skip group management and do my own partition assignment but
> choose Kafka based offset management. I could imagine a use case where I
> want to change the partition assignment on the fly, and to do that, I would
> need to fetch the last committed offsets of partitions that I currently
> don't consume.
>
> If we want to allow this, these APIs would be more performant if batched.
> And would probably look like -
>Map positions(TopicPartition... tp)
>void seek(TopicOffsetPosition... p)
>Map committed(TopicPartition... tp)
>void commit(TopicOffsetPosition...)
>
> These are definitely more clunky than the non batched ones though.
>
> Thanks,
> Neha
>
>
>
> On Thu, Feb 13, 2014 at 1:24 PM, Jay Kreps  wrote:
>
> > Hey guys,
> >
> > One thing that bugs me is the lack of symmetric for the different
> position
> > calls. The way I see it there are two positions we maintain: the fetch
> > position and the last commit position. There are two things you can do to
> > these positions: get the current value or change the current value. But
> the
> > names somewhat obscure this:
> >   Fetch position:
> > - No get
> > - set by positions(TopicOffsetPosition...)
> >   Committed position:
> > - get by List lastCommittedPosition(
> > TopicPartition...)
> > - set by commit or commitAsync
> >
> > The lastCommittedPosition is particular bothersome because:
> > 1. The name is weird and long
> > 2. It returns a list of results. But how can you use the list? The only
> way
> > to use the list is to make a map of tp=>offset and then look up results
> in
> > this map (or do a for loop over the list for the partition you want). I
> > recommend that if this is an in-memory check we just

Re: New Consumer API discussion

2014-02-13 Thread Tom Brown
Conceptually, do the position methods only apply to topics you've
subscribed to, or do they apply to all topics in the cluster?

E.g., could I retrieve or set the committed position of any partition?

The positive use case for having access to all partition information would
be to set up an active monitoring system (that can feed the positions to a
pretty GUI, for instance).

A downside is that you could have invalid partition offsets committed
(perhaps being reset to 0 by an overzealous client).

--Tom
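
A minimal sketch of that monitoring idea, assuming the batched committed()
lookup being discussed can be called for partitions the process does not
itself consume. It reuses the hypothetical ConsumerPositions/TopicPartition
stand-ins sketched earlier in this thread; nothing here is a final API.

    import java.util.List;
    import java.util.Map;

    // Sketch: periodically read committed offsets for an arbitrary set of
    // partitions and report them to a dashboard.
    class OffsetMonitor {
        private final ConsumerPositions consumer;
        private final List<TopicPartition> watched;

        OffsetMonitor(ConsumerPositions consumer, List<TopicPartition> watched) {
            this.consumer = consumer;
            this.watched = watched;
        }

        // One reporting pass; a real monitor would run this on a timer.
        void reportOnce() {
            Map<TopicPartition, Long> committed =
                    consumer.committed(watched.toArray(new TopicPartition[0]));
            for (Map.Entry<TopicPartition, Long> e : committed.entrySet()) {
                System.out.printf("%s-%d committed at %d%n",
                        e.getKey().topic(), e.getKey().partition(), e.getValue());
            }
        }
    }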


On Thu, Feb 13, 2014 at 5:15 PM, Pradeep Gollakota wrote:

> Hi Neha,
>
>6. It seems like #4 can be avoided by using Map<TopicPartition, Long> or Map as the argument type.
> >
> > How? lastCommittedOffsets() is independent of positions(). I'm not sure I
> > understood your suggestion.
>
> I think of subscription as you're subscribing to a Set of TopicPartitions.
> Because the argument to positions() is TopicPartitionOffset ... it's
> conceivable that the method can be called with two offsets for the same
> TopicPartition. One way to handle this, is to accept either the first or
> the last offset for a TopicPartition. However, if the argument type is
> changed to Map it precludes the possibility of
> getting duplicate offsets of the same TopicPartition.
>
>7. To address #3, maybe we can return List that are
>   invalid.
> >
> > I don't particularly see the advantage of returning a list of invalid
> > partitions from position(). It seems a bit awkward to return a list to
> > indicate what is obviously a bug. Prefer throwing an error since the user
> > should just fix that logic.
>
> I'm not sure if an Exception is needed or desirable here. I don't see this
> as a catastrophic failure or a non-recoverable failure. Even if we just
> write the bad offsets to a log file and call it a day, I'm ok with that.
> But my main goal is to communicate to the API users somehow that they've
> provided bad offsets which are simply being ignored.
>
> Hi Jay,
>
> I would also like to shorten the name TopicOffsetPosition. Offset and
> > Position are duplicative of each other. So perhaps we could call it a
> > PartitionOffset or a TopicPosition or something like that. In general
> class
> > names that are just a concatenation of the fields (e.g.
> > TopicAndPartitionAndOffset) seem kind of lazy to me since the name
> doesn't
> > really describe it just enumerates. But that is more of a nit pick.
>
>
>1. Did you mean to say TopicPartitionOffset instead of
>TopicOffsetPosition?
>2. +1 on PartitionOffset
>
> The lastCommittedPosition is particular bothersome because:
> > 1. The name is weird and long
> > 2. It returns a list of results. But how can you use the list? The only
> way
> > to use the list is to make a map of tp=>offset and then look up results
> in
> > this map (or do a for loop over the list for the partition you want).
>
> This is sort of what I was talking about in my previous email. My
> suggestion was to change the return type to Map.
>
> What if we made it:
> >long position(TopicPartition tp)
> >void seek(TopicOffsetPosition p)
> >long committed(TopicPartition tp)
> >void commit(TopicOffsetPosition...);
>
>
>1. Absolutely love the idea of position(TopicPartition tp).
>2. I think we also need to provide a method for accessing all positions
>positions() which maybe returns a Map?
>3. What is the difference between position(TopicPartition tp) and
> committed(TopicPartition
>tp)?
>4. +1 on commit(PartitionOffset...)
>5. +1 on seek(PartitionOffset p)
>6. We should also provide a seek(PartitionOffset... offsets)
>
> Finally, in all the methods where we're using varargs, we should use an
> appropriate Collection data structure. For example, for the
> subscribe(TopicPartition...
> partitions) method, I think a more accurate API would be
> subscribe(Set
> partitions). This allows for the code to be self-documenting.
>


Re: New Consumer API discussion

2014-02-13 Thread Pradeep Gollakota
Hi Neha,

   6. It seems like #4 can be avoided by using Map<TopicPartition, Long> or Map as the argument type.
>
> How? lastCommittedOffsets() is independent of positions(). I'm not sure I
> understood your suggestion.

I think of subscription as you're subscribing to a Set of TopicPartitions.
Because the argument to positions() is TopicPartitionOffset ... it's
conceivable that the method can be called with two offsets for the same
TopicPartition. One way to handle this is to accept either the first or
the last offset for a TopicPartition. However, if the argument type is
changed to Map it precludes the possibility of
getting duplicate offsets of the same TopicPartition.
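
Something like the following shows the difference, reusing the hypothetical
stand-ins from the interface sketch earlier in this thread; the topic name
and offsets are made up.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the ambiguity described above.
    class DuplicateOffsetExample {
        static void illustrate() {
            // With varargs, nothing stops a caller from passing the same
            // partition twice; which offset should win is undefined:
            //   consumer.positions(new TopicPartitionOffset("logs", 0, 10L),
            //                      new TopicPartitionOffset("logs", 0, 99L));

            // With a Map argument the duplicate is impossible by construction:
            Map<TopicPartition, Long> desired = new HashMap<>();
            desired.put(new TopicPartition("logs", 0), 10L);
            desired.put(new TopicPartition("logs", 0), 99L); // overwrites, no ambiguity
            // a Map-taking positions() variant would consume `desired` here
        }
    }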

   7. To address #3, maybe we can return List that are
   invalid.
>
> I don't particularly see the advantage of returning a list of invalid
> partitions from position(). It seems a bit awkward to return a list to
> indicate what is obviously a bug. Prefer throwing an error since the user
> should just fix that logic.

I'm not sure if an Exception is needed or desirable here. I don't see this
as a catastrophic failure or a non-recoverable failure. Even if we just
write the bad offsets to a log file and call it a day, I'm ok with that.
But my main goal is to communicate to the API users somehow that they've
provided bad offsets which are simply being ignored.

Hi Jay,

I would also like to shorten the name TopicOffsetPosition. Offset and
> Position are duplicative of each other. So perhaps we could call it a
> PartitionOffset or a TopicPosition or something like that. In general class
> names that are just a concatenation of the fields (e.g.
> TopicAndPartitionAndOffset) seem kind of lazy to me since the name doesn't
> really describe it just enumerates. But that is more of a nit pick.


   1. Did you mean to say TopicPartitionOffset instead of
   TopicOffsetPosition?
   2. +1 on PartitionOffset

The lastCommittedPosition is particular bothersome because:
> 1. The name is weird and long
> 2. It returns a list of results. But how can you use the list? The only way
> to use the list is to make a map of tp=>offset and then look up results in
> this map (or do a for loop over the list for the partition you want).

This is sort of what I was talking about in my previous email. My
suggestion was to change the return type to Map.

What if we made it:
>long position(TopicPartition tp)
>void seek(TopicOffsetPosition p)
>long committed(TopicPartition tp)
>void commit(TopicOffsetPosition...);


   1. Absolutely love the idea of position(TopicPartition tp).
   2. I think we also need to provide a method for accessing all positions
   positions() which maybe returns a Map?
   3. What is the difference between position(TopicPartition tp) and
committed(TopicPartition
   tp)?
   4. +1 on commit(PartitionOffset...)
   5. +1 on seek(PartitionOffset p)
   6. We should also provide a seek(PartitionOffset... offsets)

Finally, in all the methods where we're using varargs, we should use an
appropriate Collection data structure. For example, for the
subscribe(TopicPartition... partitions) method, I think a more accurate API
would be subscribe(Set<TopicPartition> partitions). This allows for the code
to be self-documenting.


Re: New Consumer API discussion

2014-02-13 Thread Neha Narkhede
2. It returns a list of results. But how can you use the list? The only way
to use the list is to make a map of tp=>offset and then look up results in
this map (or do a for loop over the list for the partition you want). I
recommend that if this is an in-memory check we just do one at a time. E.g.
long committedPosition(
TopicPosition).

This was discussed in the previous emails. There is a choice between
returning a map or a list. Some people found the map to be more usable.

What if we made it:
   long position(TopicPartition tp)
   void seek(TopicOffsetPosition p)
   long committed(TopicPartition tp)
   void commit(TopicOffsetPosition...);

This is fine, but TopicOffsetPosition doesn't make sense. Offset and
Position is confusing. Also both fetch and commit positions are related to
partitions, not topics. Some more options are TopicPartitionPosition or
TopicPartitionOffset. And we should use either position everywhere in Kafka
or offset but having both is confusing.

   void seek(TopicOffsetPosition p)
   long committed(TopicPartition tp)

Whether these are batched or not really depends on how flexible we want
these APIs to be. The question is whether we allow a consumer to fetch or
set the offsets for partitions that it doesn't own or consume. For example,
if I choose to skip group management and do my own partition assignment but
choose Kafka based offset management. I could imagine a use case where I
want to change the partition assignment on the fly, and to do that, I would
need to fetch the last committed offsets of partitions that I currently
don't consume.

If we want to allow this, these APIs would be more performant if batched.
And would probably look like -
   Map positions(TopicPartition... tp)
   void seek(TopicOffsetPosition... p)
   Map committed(TopicPartition... tp)
   void commit(TopicOffsetPosition...)

These are definitely more clunky than the non-batched ones though.
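
As a sketch of the use case described above, assuming the batched shapes just
listed and reusing the hypothetical stand-ins from the earlier interface
sketch; the default of 0L for a never-committed partition is an assumption.

    import java.util.Map;

    // Sketch: a consumer doing its own partition assignment adopts a new
    // partition at runtime and resumes from its last committed offset.
    class ManualReassignment {
        static void adopt(ConsumerPositions consumer, TopicPartition newlyAssigned) {
            // Batched committed() lets us ask about a partition we do not yet consume.
            Map<TopicPartition, Long> committed = consumer.committed(newlyAssigned);
            long resumeFrom = committed.getOrDefault(newlyAssigned, 0L);

            // Move the in-memory fetch position; it takes effect on the next poll().
            consumer.seek(new TopicPartitionOffset(
                    newlyAssigned.topic(), newlyAssigned.partition(), resumeFrom));
        }
    }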

Thanks,
Neha



On Thu, Feb 13, 2014 at 1:24 PM, Jay Kreps  wrote:

> Hey guys,
>
> One thing that bugs me is the lack of symmetric for the different position
> calls. The way I see it there are two positions we maintain: the fetch
> position and the last commit position. There are two things you can do to
> these positions: get the current value or change the current value. But the
> names somewhat obscure this:
>   Fetch position:
> - No get
> - set by positions(TopicOffsetPosition...)
>   Committed position:
> - get by List lastCommittedPosition(
> TopicPartition...)
> - set by commit or commitAsync
>
> The lastCommittedPosition is particular bothersome because:
> 1. The name is weird and long
> 2. It returns a list of results. But how can you use the list? The only way
> to use the list is to make a map of tp=>offset and then look up results in
> this map (or do a for loop over the list for the partition you want). I
> recommend that if this is an in-memory check we just do one at a time. E.g.
> long committedPosition(TopicPosition).
>
> What if we made it:
>long position(TopicPartition tp)
>void seek(TopicOffsetPosition p)
>long committed(TopicPartition tp)
>void commit(TopicOffsetPosition...);
>
> This still isn't terribly consistent, but I think it is better.
>
> I would also like to shorten the name TopicOffsetPosition. Offset and
> Position are duplicative of each other. So perhaps we could call it a
> PartitionOffset or a TopicPosition or something like that. In general class
> names that are just a concatenation of the fields (e.g.
> TopicAndPartitionAndOffset) seem kind of lazy to me since the name doesn't
> really describe it just enumerates. But that is more of a nit pick.
>
> -Jay
>
>
> On Mon, Feb 10, 2014 at 10:54 AM, Neha Narkhede  >wrote:
>
> > As mentioned in previous emails, we are also working on a
> re-implementation
> > of the consumer. I would like to use this email thread to discuss the
> > details of the public API. I would also like us to be picky about this
> > public api now so it is as good as possible and we don't need to break it
> > in the future.
> >
> > The best way to get a feel for the API is actually to take a look at the
> > javadoc<
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> > >,
> > the hope is to get the api docs good enough so that it is
> self-explanatory.
> > You can also take a look at the configs
> > here<
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html
> > >
> >
> > Some background info on implementation:
> >
> > At a high level the primary difference in this consumer is that it
> removes
> > the distinction between the "high-level" and "low-level" consumer. The
> new
> > consumer API is non blocking and instead of returning a blocking
> iterator,
> > the consumer provides a poll() API that returns a list of records. We
> think
> > this is better compared to the blocking iterators since it effec

Re: New Consumer API discussion

2014-02-13 Thread Jay Kreps
Hey guys,

One thing that bugs me is the lack of symmetry in the different position
calls. The way I see it there are two positions we maintain: the fetch
position and the last commit position. There are two things you can do to
these positions: get the current value or change the current value. But the
names somewhat obscure this:
  Fetch position:
- No get
- set by positions(TopicOffsetPosition...)
  Committed position:
- get by List lastCommittedPosition(
TopicPartition...)
- set by commit or commitAsync

The lastCommittedPosition is particularly bothersome because:
1. The name is weird and long
2. It returns a list of results. But how can you use the list? The only way
to use the list is to make a map of tp=>offset and then look up results in
this map (or do a for loop over the list for the partition you want). I
recommend that if this is an in-memory check we just do one at a time. E.g.
long committedPosition(TopicPosition).

What if we made it:
   long position(TopicPartition tp)
   void seek(TopicOffsetPosition p)
   long committed(TopicPartition tp)
   void commit(TopicOffsetPosition...);

This still isn't terribly consistent, but I think it is better.

I would also like to shorten the name TopicOffsetPosition. Offset and
Position are duplicative of each other. So perhaps we could call it a
PartitionOffset or a TopicPosition or something like that. In general class
names that are just a concatenation of the fields (e.g.
TopicAndPartitionAndOffset) seem kind of lazy to me since the name doesn't
really describe it just enumerates. But that is more of a nit pick.

-Jay


On Mon, Feb 10, 2014 at 10:54 AM, Neha Narkhede wrote:

> As mentioned in previous emails, we are also working on a re-implementation
> of the consumer. I would like to use this email thread to discuss the
> details of the public API. I would also like us to be picky about this
> public api now so it is as good as possible and we don't need to break it
> in the future.
>
> The best way to get a feel for the API is actually to take a look at the
> javadoc<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> >,
> the hope is to get the api docs good enough so that it is self-explanatory.
> You can also take a look at the configs
> here<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html
> >
>
> Some background info on implementation:
>
> At a high level the primary difference in this consumer is that it removes
> the distinction between the "high-level" and "low-level" consumer. The new
> consumer API is non blocking and instead of returning a blocking iterator,
> the consumer provides a poll() API that returns a list of records. We think
> this is better compared to the blocking iterators since it effectively
> decouples the threading strategy used for processing messages from the
> consumer. It is worth noting that the consumer is entirely single threaded
> and runs in the user thread. The advantage is that it can be easily
> rewritten in less multi-threading-friendly languages. The consumer batches
> data and multiplexes I/O over TCP connections to each of the brokers it
> communicates with, for high throughput. The consumer also allows long poll
> to reduce the end-to-end message latency for low throughput data.
>
> The consumer provides a group management facility that supports the concept
> of a group with multiple consumer instances (just like the current
> consumer). This is done through a custom heartbeat and group management
> protocol transparent to the user. At the same time, it allows users the
> option to subscribe to a fixed set of partitions and not use group
> management at all. The offset management strategy defaults to Kafka based
> offset management and the API provides a way for the user to use a
> customized offset store to manage the consumer's offsets.
>
> A key difference in this consumer also is the fact that it does not depend
> on zookeeper at all.
>
> More details about the new consumer design are
> here<
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
> >
>
> Please take a look at the new
> API<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> >and
> give us any thoughts you may have.
>
> Thanks,
> Neha
>


Re: New Consumer API discussion

2014-02-13 Thread Neha Narkhede
Pradeep -

Thanks for your detailed comments.

1.

   subscribe(String topic, int... partitions) and unsubscribe(String topic,
   int... partitions) should be subscribe(TopicPartition... topicPartitions)
   and unsubscribe(TopicPartition... topicPartitions)

I think that is reasonable. Overall, I'm in favor of exposing
TopicPartition and TopicPartitionOffset as public APIs. They make the APIs
more readable especially given that the consumer aims to provide a small
set of APIs to support a wide range of functionalities. I will make that
change if there are no objections.

2.

   Does it make sense to provide a convenience method to subscribe to
   topics at a particular offset directly? E.g.
   subscribe(TopicPartitionOffset... offsets)

 I view subscriptions a little differently. One subscribes to resources. In
this case, either topics (when you use group management) or specific
partitions. Offsets are specific to the consumption protocol and unrelated
to subscription which just expresses the user's interest in certain
resources. Also, if we have one way to specify fetch offsets (positions()),
I'd like to avoid creating *n* APIs to do the same thing, since that just
makes the consumer APIs more bulky and eventually confusing.

3.

   The javadoc makes no mention of what would happen if positions() is
   called with a TopicPartitionOffset to which the Consumer is not
   subscribed to.

 Good point. Fixed the
javadoc

4.

   The javadoc makes no mention of what would happen if positions() is
   called with two different offsets for a single TopicPartition

positions() can be called multiple times and hence with different offsets.
I think I mentioned in the latest javadoc that positions() will change the
offset on the next fetch request (poll()). Improved the javadoc to
explicitly mention this case.

5. The javadoc shows lastCommittedOffsets() return type as
   List. This should either be Map or Map

 This depends on how the user would use the committed offsets. One example
I could think off and is mentioned in the javadoc for
lastCommittedOffsets() is to rewind consumption. In this case, you may or
may not require random access to a particular partition's offset, depending
on whether you want to selectively rewind consumption or not. So it may be
fine to return a map. I'm not sure if people can think of other uses of
this API though. In any case, if we
wanted to change this to a map, I'd prefer Map.

   6. It seems like #4 can be avoided by using Map or Map as the argument type.

How? lastCommittedOffsets() is independent of positions(). I'm not sure I
understood your suggestion.

   7. To address #3, maybe we can return List that
   are invalid.

I don't particularly see the advantage of returning a list of invalid
partitions from position(). It seems a bit awkward to return a list to
indicate what is obviously a bug. Prefer throwing an error since the user
should just fix that logic.

Thanks,
Neha



On Wed, Feb 12, 2014 at 3:59 PM, Jay Kreps  wrote:

> Ah, gotcha.
>
> -Jay
>
>
> On Wed, Feb 12, 2014 at 8:40 AM, Neha Narkhede  >wrote:
>
> > Jay
> >
> > Well none kind of address the common case which is to commit all
> > partitions. For these I was thinking just
> >commit();
> > The advantage of this simpler method is that you don't need to bother
> about
> > partitions you just consume the messages given to you and then commit
> them
> >
> > This is already what the commit() API is supposed to do. Here is the
> > javadoc -
> >
> > * Synchronously commits the specified offsets for the specified list
> of
> > topics and partitions to Kafka. If no partitions are specified,
> >  * commits offsets for the subscribed list of topics and partitions
> to
> > Kafka.
> >
> > public void commit(TopicPartitionOffset... offsets);
> >
> > Could you take another look at the
> > javadoc<
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> > >?
> > I've uploaded changes from the previous discussions and included some of
> > your review suggestions.
> >
> >
> >
> > On Wed, Feb 12, 2014 at 8:32 AM, Neha Narkhede  > >wrote:
> >
> > > Imran,
> > >
> > >
> > > Sorry I am probably missing
> > > something basic, but I'm not sure how a multi-threaded consumer would
> > > work.  I can imagine its either:
> > >
> > > a) I just have one thread poll kafka.  If I want to process msgs in
> > > multiple threads, than I deal w/ that after polling, eg. stick them
> into
> > a
> > > blocking queue or something, and have more threads that read from the
> > > queue.
> > >
> > > b) each thread creates its own KafkaConsumer.  They are all registered
> > the
> > > same way, and I leave it to kafka to figure out what data to give to
> each
> > > one.
> > >
> > > We designed the new consume

Re: New Consumer API discussion

2014-02-12 Thread Jay Kreps
Ah, gotcha.

-Jay


On Wed, Feb 12, 2014 at 8:40 AM, Neha Narkhede wrote:

> Jay
>
> Well none kind of address the common case which is to commit all
> partitions. For these I was thinking just
>commit();
> The advantage of this simpler method is that you don't need to bother about
> partitions you just consume the messages given to you and then commit them
>
> This is already what the commit() API is supposed to do. Here is the
> javadoc -
>
> * Synchronously commits the specified offsets for the specified list of
> topics and partitions to Kafka. If no partitions are specified,
>  * commits offsets for the subscribed list of topics and partitions to
> Kafka.
>
> public void commit(TopicPartitionOffset... offsets);
>
> Could you take another look at the
> javadoc<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> >?
> I've uploaded changes from the previous discussions and included some of
> your review suggestions.
>
>
>
> On Wed, Feb 12, 2014 at 8:32 AM, Neha Narkhede  >wrote:
>
> > Imran,
> >
> >
> > Sorry I am probably missing
> > something basic, but I'm not sure how a multi-threaded consumer would
> > work.  I can imagine its either:
> >
> > a) I just have one thread poll kafka.  If I want to process msgs in
> > multiple threads, than I deal w/ that after polling, eg. stick them into
> a
> > blocking queue or something, and have more threads that read from the
> > queue.
> >
> > b) each thread creates its own KafkaConsumer.  They are all registered
> the
> > same way, and I leave it to kafka to figure out what data to give to each
> > one.
> >
> > We designed the new consumer API to not require multi threading on
> > purpose.
> > The reason this is better than the existing ZookeeperConsumerConnector is
> > that
> > it effectively allows the user to use whatever threading and load balance
> > message
> > processing amongst those threads. For example, you might want more
> threads
> > dedicated
> > to a certain high throughput partition compared to other partitions. In
> > option a) above, you can
> > create your own thread pool and hand over the messages returned by poll
> > using a blocking
> > queue or any other approach. Option b) would work as well and the user
> > has to figure out which
> > topics each KafkaConsumer subscribes to.
> >
> >
> > (a) certainly makes things simple, but I worry about throughput -- is
> that
> > just as good as having one thread trying to consumer each partition?
> >
> > (b) makes it a bit of a pain to figure out how many threads to use.  I
> > assume there is no point in using more threads than there are partitions,
> > so first you've got to figure out how many partitions there are in each
> > topic.  Might be nice if there were some util functions to simplify this.
> >
> > The user can pick the number of threads. That is still better as only the
> > user knows how
> > slow/fast the message processing of her application is.
> >
> > Also, since the initial call to subscribe doesn't give the partition
> > assignment, does that mean the first call to poll() will always call the
> > ConsumerRebalanceCallback?
> >
> > Assuming you choose to use group management (by using subscribe(topics)),
> > poll() will invoke
> > the ConsumerRebalanceCallback on every single rebalance attempt. Improved
> > the javadoc<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerRebalanceCallback.html
> >to
> > explain that. Could you give that another look?
> >
> > If I'm on the right track, I'd like to expand this example, showing how
> > each "MyConsumer" can keep track of its partitions & offsets, even in the
> > face of rebalances.  As Jay said, I think a minimal code example could
> > really help us see the utility & faults of the api.
> >
> > Sure, please look at the javadoc<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> >.
> > I've tried to include code examples there. Please help in
> > improving those or adding more. Looks like we should add some multi
> > threading examples. I avoided
> > adding those since there are many ways to handling the message processing
> > and it will not be feasible
> > to list all of those. If we list one, people might think that is the only
> > recommended approach.
> >
> > With that said, here is an example of using Option b) above -
> >
> >
> > List<MyConsumer> consumers = new ArrayList<MyConsumer>();
> > List<List<String>> topics = new ArrayList<List<String>>(); // populate topics
> > assert(consumers.size == topics.size);
> >
> > for (int i = 0; i < numThreads; i++) {
> >   MyConsumer c = new MyConsumer();
> >   c.subscribe(topics(i));
> >   consumers.add(c);
> > }
> > // poll each consumer in a separate thread.
> > for (int i = 0; i < numThreads; i++) {
> >executorService.submit(new Runnable() {
> > @Override
> >  public void run() {
> >  new ProcessMessagesTask(consume

Re: New Consumer API discussion

2014-02-12 Thread Neha Narkhede
Jay

Well none kind of address the common case which is to commit all
partitions. For these I was thinking just
   commit();
The advantage of this simpler method is that you don't need to bother about
partitions you just consume the messages given to you and then commit them

This is already what the commit() API is supposed to do. Here is the
javadoc -

* Synchronously commits the specified offsets for the specified list of
topics and partitions to Kafka. If no partitions are specified,
 * commits offsets for the subscribed list of topics and partitions to
Kafka.

public void commit(TopicPartitionOffset... offsets);

Could you take another look at the
javadoc?
I've uploaded changes from the previous discussions and included some of
your review suggestions.



On Wed, Feb 12, 2014 at 8:32 AM, Neha Narkhede wrote:

> Imran,
>
>
> Sorry I am probably missing
> something basic, but I'm not sure how a multi-threaded consumer would
> work.  I can imagine its either:
>
> a) I just have one thread poll kafka.  If I want to process msgs in
> multiple threads, than I deal w/ that after polling, eg. stick them into a
> blocking queue or something, and have more threads that read from the
> queue.
>
> b) each thread creates its own KafkaConsumer.  They are all registered the
> same way, and I leave it to kafka to figure out what data to give to each
> one.
>
> We designed the new consumer API to not require multi threading on
> purpose.
> The reason this is better than the existing ZookeeperConsumerConnector is
> that
> it effectively allows the user to use whatever threading and load balance
> message
> processing amongst those threads. For example, you might want more threads
> dedicated
> to a certain high throughput partition compared to other partitions. In
> option a) above, you can
> create your own thread pool and hand over the messages returned by poll
> using a blocking
> queue or any other approach. Option b) would work as well and the user
> has to figure out which
> topics each KafkaConsumer subscribes to.
>
>
> (a) certainly makes things simple, but I worry about throughput -- is that
> just as good as having one thread trying to consumer each partition?
>
> (b) makes it a bit of a pain to figure out how many threads to use.  I
> assume there is no point in using more threads than there are partitions,
> so first you've got to figure out how many partitions there are in each
> topic.  Might be nice if there were some util functions to simplify this.
>
> The user can pick the number of threads. That is still better as only the
> user knows how
> slow/fast the message processing of her application is.
>
> Also, since the initial call to subscribe doesn't give the partition
> assignment, does that mean the first call to poll() will always call the
> ConsumerRebalanceCallback?
>
> Assuming you choose to use group management (by using subscribe(topics)),
> poll() will invoke
> the ConsumerRebalanceCallback on every single rebalance attempt. Improved
> the javadoc to
> explain that. Could you give that another look?
>
> If I'm on the right track, I'd like to expand this example, showing how
> each "MyConsumer" can keep track of its partitions & offsets, even in the
> face of rebalances.  As Jay said, I think a minimal code example could
> really help us see the utility & faults of the api.
>
> Sure, please look at the 
> javadoc.
> I've tried to include code examples there. Please help in
> improving those or adding more. Looks like we should add some multi
> threading examples. I avoided
> adding those since there are many ways to handling the message processing
> and it will not be feasible
> to list all of those. If we list one, people might think that is the only
> recommended approach.
>
> With that said, here is an example of using Option b) above -
>
>
> List<MyConsumer> consumers = new ArrayList<MyConsumer>();
> List<List<String>> topics = new ArrayList<List<String>>(); // populate topics
> assert(consumers.size == topics.size);
>
> for (int i = 0; i < numThreads; i++) {
>   MyConsumer c = new MyConsumer();
>   c.subscribe(topics(i));
>   consumers.add(c);
> }
> // poll each consumer in a separate thread.
> for (int i = 0; i < numThreads; i++) {
>executorService.submit(new Runnable() {
> @Override
>  public void run() {
>  new ProcessMessagesTask(consumers(i));
>  }
>});
> }
>
> Let me know what you think.
>
> Thanks,
> Neha
>
> On Tue, Feb 11, 2014 at 3:54 PM, Jay Kreps  wrote:
>
>> Comments inline:
>>
>>
>> On Mon, Feb 10, 2014 at 2:31 PM, Guozhang Wang 
>> wrote:
>>
>> > Hello Jay,
>> >
>> > Thanks for the detailed comments.
>> >
>> > 1. Yeah we could

Re: New Consumer API discussion

2014-02-12 Thread Neha Narkhede
Imran,

Sorry I am probably missing
something basic, but I'm not sure how a multi-threaded consumer would
work.  I can imagine it's either:

a) I just have one thread poll kafka.  If I want to process msgs in
multiple threads, then I deal w/ that after polling, e.g. stick them into a
blocking queue or something, and have more threads that read from the queue.

b) each thread creates its own KafkaConsumer.  They are all registered the
same way, and I leave it to kafka to figure out what data to give to each
one.

We designed the new consumer API to not require multi threading on purpose.
The reason this is better than the existing ZookeeperConsumerConnector is
that
it effectively allows the user to use whatever threading and load balance
message
processing amongst those threads. For example, you might want more threads
dedicated
to a certain high throughput partition compared to other partitions. In
option a) above, you can
create your own thread pool and hand over the messages returned by poll
using a blocking
queue or any other approach. Option b) would work as well and the user has
to figure out which
topics each KafkaConsumer subscribes to.

(a) certainly makes things simple, but I worry about throughput -- is that
just as good as having one thread trying to consume each partition?

(b) makes it a bit of a pain to figure out how many threads to use.  I
assume there is no point in using more threads than there are partitions,
so first you've got to figure out how many partitions there are in each
topic.  Might be nice if there were some util functions to simplify this.

The user can pick the number of threads. That is still better as only the
user knows how
slow/fast the message processing of her application is.

Also, since the initial call to subscribe doesn't give the partition
assignment, does that mean the first call to poll() will always call the
ConsumerRebalanceCallback?

Assuming you choose to use group management (by using subscribe(topics)),
poll() will invoke
the ConsumerRebalanceCallback on every single rebalance attempt. Improved
the javadoc to
explain that. Could you give that another look?

If I'm on the right track, I'd like to expand this example, showing how
each "MyConsumer" can keep track of its partitions & offsets, even in the
face of rebalances.  As Jay said, I think a minimal code example could
really help us see the utility & faults of the api.

Sure, please look at the
javadoc.
I've tried to include code examples there. Please help in
improving those or adding more. Looks like we should add some multi
threading examples. I avoided
adding those since there are many ways of handling the message processing
and it will not be feasible
to list all of those. If we list one, people might think that is the only
recommended approach.

With that said, here is an example of using Option b) above -

List<MyConsumer> consumers = new ArrayList<MyConsumer>();
List<List<String>> topics = new ArrayList<List<String>>(); // populate topics

On Tue, Feb 11, 2014 at 3:54 PM, Jay Kreps wrote:

> Comments inline:
>
>
> On Mon, Feb 10, 2014 at 2:31 PM, Guozhang Wang  wrote:
>
> > Hello Jay,
> >
> > Thanks for the detailed comments.
> >
> > 1. Yeah we could discuss a bit more on that.
> >
> > 2. Since subscribe() is incremental, adding one topic-partition is OK,
> and
> > personally I think it is cleaner than subscribe(String topic,
> > int...partition)?
> >
> I am not too particular. Have you actually tried this? I think writing
> actual sample code is important.
>
>
> > 3. Originally I was thinking about two interfaces:
> >
> > getOffsets() // offsets for all partitions that I am consuming now
> >
> > getOffset(topc-partition) // offset of the specified topic-partition,
> will
> > throw exception if it is not currently consumed.
> >
> > What do you think about these?
> >
>
> The naming needs to distinguish committed offset position versus fetch
> offset position. Also we aren't using the getX convention.
>
>
> > 4. Yes, that remains a config.
> >
>
> Does that make sense given that you change your position via an api now?
>
>
> > 5. Agree.
> >
> > 6. If the time out value is null then it will "logically" return
> > immediately with whatever data is available. I think an indefinitely
> poll()
> > function could be replaced with just
> >
> > while (true) poll(some-time)?
> >
>
> That is fine but we should provide a no arg poll for that, poll(null) isn't
> clear. We should add the timeunit as per the post java 5 convention as that
> makes the call more readable. E.g.
>poll(5) vs poll(5, TimeUnit.MILLISECONDS)
>
>
> > 7. I am open with either approach.
> >
>
> Cool.
>
> 8. I was thinking about two interfaces for the commit functionality:
> >
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
> >
> > Do those sound better?
> >
>
> Well none k

Re: New Consumer API discussion

2014-02-11 Thread Jay Kreps
Comments inline:


On Mon, Feb 10, 2014 at 2:31 PM, Guozhang Wang  wrote:

> Hello Jay,
>
> Thanks for the detailed comments.
>
> 1. Yeah we could discuss a bit more on that.
>
> 2. Since subscribe() is incremental, adding one topic-partition is OK, and
> personally I think it is cleaner than subscribe(String topic,
> int...partition)?
>
I am not too particular. Have you actually tried this? I think writing
actual sample code is important.


> 3. Originally I was thinking about two interfaces:
>
> getOffsets() // offsets for all partitions that I am consuming now
>
> getOffset(topc-partition) // offset of the specified topic-partition, will
> throw exception if it is not currently consumed.
>
> What do you think about these?
>

The naming needs to distinguish committed offset position versus fetch
offset position. Also we aren't using the getX convention.


> 4. Yes, that remains a config.
>

Does that make sense given that you change your position via an api now?


> 5. Agree.
>
> 6. If the time out value is null then it will "logically" return
> immediately with whatever data is available. I think an indefinitely poll()
> function could be replaced with just
>
> while (true) poll(some-time)?
>

That is fine, but we should provide a no-arg poll for that; poll(null) isn't
clear. We should add the TimeUnit as per the post-Java 5 convention, as that
makes the call more readable. E.g.
   poll(5) vs poll(5, TimeUnit.MILLISECONDS)


> 7. I am open with either approach.
>

Cool.

8. I was thinking about two interfaces for the commit functionality:
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
>
> Do those sound better?
>

Well, none of them really addresses the common case, which is to commit all
partitions. For that I was thinking just
   commit();
The advantage of this simpler method is that you don't need to bother about
partitions: you just consume the messages given to you and then commit them.

9. Currently I think about un-subscribe as "close and re-subscribe", and
> would like to hear people's opinion about it.
>

Hmm, I think it is a little weird if there is a subscribe which can be
called at any time but no unsubscribe. Would this be hard to do?


> 10. Yes. Position() is an API function, and as and API it means "be called
> at any time" and will change the next fetching starting offset.
>

Cool.


> 11. The ConsumerRecord would have the offset info of the message. Is that
> what you want?
>

But that is only after I have gotten a message. I'm not sure if that covers
all cases or not.


> About use cases: great point. I will add some more examples of using the
> API functions in the wiki pages.
>
> Guozhang
>
>
>
>
> On Mon, Feb 10, 2014 at 12:20 PM, Jay Kreps  wrote:
>
> > A few items:
> > 1. ConsumerRebalanceCallback
> >a. onPartitionsRevoked would be a better name.
> >b. We should discuss the possibility of splitting this into two
> > interfaces. The motivation would be that in Java 8 single method
> interfaces
> > can directly take methods which might be more intuitive.
> >c. If we stick with a single interface I would prefer the name
> > RebalanceCallback as its more concise
> > 2. Should subscribe(String topic, int partition) should be
> subscribe(String
> > topic, int...partition)?
> > 3. Is lastCommittedOffset call just a local access? If so it would be
> more
> > convenient not to batch it.
> > 4. How are we going to handle the earliest/latest starting position
> > functionality we currently have. Does that remain a config?
> > 5. Do we need to expose the general ability to get known positions from
> the
> > log? E.g. the functionality in the OffsetRequest...? That would make the
> > ability to change position a little easier.
> > 6. Should poll(java.lang.Long timeout) be poll(long timeout, TimeUnit
> > unit)? Is it Long because it allows null? If so should we just add a
> poll()
> > that polls indefinitely?
> > 7. I recommend we remove the boolean parameter from commit as it is
> really
> > hard to read code that has boolean parameters without named arguments.
> Can
> > we make it something like commit(...) and commitAsync(...)?
> > 8. What about the common case where you just want to commit the current
> > position for all partitions?
> > 9. How do you unsubscribe?
> > 10. You say in a few places that positions() only impacts the starting
> > position, but surely that isn't the case, right? Surely it controls the
> > fetch position for that partition and can be called at any time?
> Otherwise
> > it is a pretty weird api, right?
> > 11. How do I get my current position? Not the committed position but the
> > offset of the next message that will be given to me?
> >
> > One thing that I really found helpful for the API design was writing out
> > actual code for different scenarios against the API. I think it might be
> > good to do that for this too--i.e. enumerate the various use cases and
> code
> > that use case up to see how it looks. I'm not sure

Re: New Consumer API discussion

2014-02-11 Thread Guozhang Wang
Hi Imran,

1. I think choosing between a) and b) really depends on the consuming
traffic. We decided to make the consumer client single-threaded and let
users decide whether to use one or multiple clients based on traffic, mainly
because with a multi-threaded client the fetcher thread could die silently
while the user thread keeps working and gets blocked forever.

2. Yes. If the subscription is a list of topics, which means it relies on
Kafka to assign partitions, then the first poll will trigger the group
management protocol, and upon receiving the partitions the callback function
will be executed.

3. The wiki page (
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design)
has some example usages of the new consumer API (there might be some minor
function signature differences from the javadoc). Would you want to take a
look and give some thoughts about that?

Guozhang


On Tue, Feb 11, 2014 at 1:50 PM, Imran Rashid  wrote:

> Hi,
>
> thanks for sharing this and getting feedback.  Sorry I am probably missing
> something basic, but I'm not sure how a multi-threaded consumer would
> work.  I can imagine its either:
>
> a) I just have one thread poll kafka.  If I want to process msgs in
> multiple threads, than I deal w/ that after polling, eg. stick them into a
> blocking queue or something, and have more threads that read from the
> queue.
>
> b) each thread creates its own KafkaConsumer.  They are all registered the
> same way, and I leave it to kafka to figure out what data to give to each
> one.
>
>
> (a) certainly makes things simple, but I worry about throughput -- is that
> just as good as having one thread trying to consumer each partition?
>
> (b) makes it a bit of a pain to figure out how many threads to use.  I
> assume there is no point in using more threads than there are partitions,
> so first you've got to figure out how many partitions there are in each
> topic.  Might be nice if there were some util functions to simplify this.
>
>
> Also, since the initial call to subscribe doesn't give the partition
> assignment, does that mean the first call to poll() will always call the
> ConsumerRebalanceCallback?
>
> probably a short code-sample would clear up all my questions.  I'm
> imagining pseudo-code like:
>
>
> int numPartitions = ...
> int numThreads = min(maxThreads, numPartitions);
> //maybe should be something even more complicated, to take into account how
> many other active consumers there are right now for the given group
>
> List<MyConsumer> consumers = new ArrayList<MyConsumer>();
> for (int i = 0; i < numThreads; i++) {
>   MyConsumer c = new MyConsumer();
>   c.subscribe(...);
>   //if subscribe is expensive, then this should already happen in another
> thread
>   consumers.add(c);
> }
>
> // if each subscribe() happened in a different thread, we should put a
> barrier in here, so everybody subscribes before they begin polling
>
> //now launch a thread per consumer, where they each poll
>
>
>
> If I'm on the right track, I'd like to expand this example, showing how
> each "MyConsumer" can keep track of its partitions & offsets, even in the
> face of rebalances.  As Jay said, I think a minimal code example could
> really help us see the utility & faults of the api.
>
> overall I really like what I see, seems like a big improvement!
>
> thanks,
> Imran
>
>
>
> > On Mon, Feb 10, 2014 at 12:54 PM, Neha Narkhede wrote:
>
> > As mentioned in previous emails, we are also working on a
> re-implementation
> > of the consumer. I would like to use this email thread to discuss the
> > details of the public API. I would also like us to be picky about this
> > public api now so it is as good as possible and we don't need to break it
> > in the future.
> >
> > The best way to get a feel for the API is actually to take a look at the
> > javadoc<
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> > >,
> > the hope is to get the api docs good enough so that it is
> self-explanatory.
> > You can also take a look at the configs
> > here<
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html
> > >
> >
> > Some background info on implementation:
> >
> > At a high level the primary difference in this consumer is that it
> removes
> > the distinction between the "high-level" and "low-level" consumer. The
> new
> > consumer API is non blocking and instead of returning a blocking
> iterator,
> > the consumer provides a poll() API that returns a list of records. We
> think
> > this is better compared to the blocking iterators since it effectively
> > decouples the threading strategy used for processing messages from the
> > consumer. It is worth noting that the consumer is entirely single
> threaded
> > and runs in the user thread. The advantage is that it can be easily
> > rewritten in less multi-threading-friendly languages. The consumer
> batches
> > data and multiplexes I/O 

Re: New Consumer API discussion

2014-02-11 Thread Imran Rashid
Hi,

thanks for sharing this and getting feedback.  Sorry I am probably missing
something basic, but I'm not sure how a multi-threaded consumer would
work.  I can imagine it's either:

a) I just have one thread poll kafka.  If I want to process msgs in
multiple threads, then I deal w/ that after polling, e.g. stick them into a
blocking queue or something, and have more threads that read from the queue.

b) each thread creates its own KafkaConsumer.  They are all registered the
same way, and I leave it to kafka to figure out what data to give to each
one.


(a) certainly makes things simple, but I worry about throughput -- is that
just as good as having one thread trying to consume each partition?

(b) makes it a bit of a pain to figure out how many threads to use.  I
assume there is no point in using more threads than there are partitions,
so first you've got to figure out how many partitions there are in each
topic.  Might be nice if there were some util functions to simplify this.


Also, since the initial call to subscribe doesn't give the partition
assignment, does that mean the first call to poll() will always call the
ConsumerRebalanceCallback?

probably a short code-sample would clear up all my questions.  I'm
imagining pseudo-code like:


int numPartitions = ...
int numThreads = min(maxThreads, numPartitions);
//maybe should be something even more complicated, to take into account how
many other active consumers there are right now for the given group

List<MyConsumer> consumers = new ArrayList<MyConsumer>();
for (int i = 0; i < numThreads; i++) {
  MyConsumer c = new MyConsumer();
  c.subscribe(...);
  //if subscribe is expensive, then this should already happen in another
thread
  consumers.add(c);
}

// if each subscribe() happened in a different thread, we should put a
barrier in here, so everybody subscribes before they begin polling

//now launch a thread per consumer, where they each poll



If I'm on the right track, I'd like to expand this example, showing how
each "MyConsumer" can keep track of its partitions & offsets, even in the
face of rebalances.  As Jay said, I think a minimal code example could
really help us see the utility & faults of the api.

overall I really like what I see, seems like a big improvement!

thanks,
Imran



On Mon, Feb 10, 2014 at 12:54 PM, Neha Narkhede wrote:

> As mentioned in previous emails, we are also working on a re-implementation
> of the consumer. I would like to use this email thread to discuss the
> details of the public API. I would also like us to be picky about this
> public api now so it is as good as possible and we don't need to break it
> in the future.
>
> The best way to get a feel for the API is actually to take a look at the
> javadoc<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> >,
> the hope is to get the api docs good enough so that it is self-explanatory.
> You can also take a look at the configs
> here<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html
> >
>
> Some background info on implementation:
>
> At a high level the primary difference in this consumer is that it removes
> the distinction between the "high-level" and "low-level" consumer. The new
> consumer API is non blocking and instead of returning a blocking iterator,
> the consumer provides a poll() API that returns a list of records. We think
> this is better compared to the blocking iterators since it effectively
> decouples the threading strategy used for processing messages from the
> consumer. It is worth noting that the consumer is entirely single threaded
> and runs in the user thread. The advantage is that it can be easily
> rewritten in less multi-threading-friendly languages. The consumer batches
> data and multiplexes I/O over TCP connections to each of the brokers it
> communicates with, for high throughput. The consumer also allows long poll
> to reduce the end-to-end message latency for low throughput data.
>
> The consumer provides a group management facility that supports the concept
> of a group with multiple consumer instances (just like the current
> consumer). This is done through a custom heartbeat and group management
> protocol transparent to the user. At the same time, it allows users the
> option to subscribe to a fixed set of partitions and not use group
> management at all. The offset management strategy defaults to Kafka based
> offset management and the API provides a way for the user to use a
> customized offset store to manage the consumer's offsets.
>
> A key difference in this consumer also is the fact that it does not depend
> on zookeeper at all.
>
> More details about the new consumer design are
> here<
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
> >
>
> Please take a look at the new
> API<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/Kafk

Re: New Consumer API discussion

2014-02-11 Thread Pradeep Gollakota
Updated thoughts.

   1. subscribe(String topic, int... partitions) and unsubscribe(String topic,
   int... partitions) should be subscribe(TopicPartition... topicPartitions)
   and unsubscribe(TopicPartition... topicPartitions)
   2. Does it make sense to provide a convenience method to subscribe to
   topics at a particular offset directly? E.g.
   subscribe(TopicPartitionOffset... offsets)
   3. The javadoc makes no mention of what would happen if positions() is
   called with a TopicPartitionOffset to which the Consumer is not
   subscribed.
   4. The javadoc makes no mention of what would happen if positions() is
   called with two different offsets for a single TopicPartition.
   5. The javadoc shows lastCommittedOffsets() return type as
   List<TopicPartitionOffset>. This should probably be a Map keyed by
   TopicPartition (e.g. Map<TopicPartition, Long>) instead.
   6. It seems like #4 can be avoided by using a Map keyed by TopicPartition
   as the argument type.
   7. To address #3, maybe we can return the List<TopicPartitionOffset> that
   are invalid.



On Tue, Feb 11, 2014 at 12:04 PM, Neha Narkhede wrote:

> Pradeep,
>
> To be clear, we want to get feedback on the APIs from the
> javadoc<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> >since
> the wiki will be slightly behind on the APIs.
>
> 1. Regarding consistency, do you have specific feedback on which APIs
> should have different arguments/return types?
> 2. lastCommittedOffsets() does what you said in the javadoc.
>
> Thanks,
> Neha
>
>
> On Tue, Feb 11, 2014 at 11:45 AM, Pradeep Gollakota wrote:
>
> > Hi Jay,
> >
> > I apologize for derailing the conversation about the consumer API. We
> > should start a new discussion about hierarchical topics, if we want to
> keep
> > talking about it. My final thought on the matter is that, hierarchical
> > topics is still an important feature to have in Kafka, because it gives
> us
> > flexibility to do namespace level access controls.
> >
> > Getting back to the topic of the Consumer API:
> >
> >1. Any thoughts on consistency for method arguments and return types?
> >    2. lastCommittedOffsets() method returns a
> > List<TopicPartitionOffset> whereas the confluence page suggested a
> > Map<TopicPartition, Long>. I would think that a Map is the more appropriate return type.
> >
> >
> >
> > On Tue, Feb 11, 2014 at 8:04 AM, Jay Kreps  wrote:
> >
> > > Hey Pradeep,
> > >
> > > That wiki is fairly old and it predated more flexible subscription
> > > mechanisms. In the high-level consumer you currently have wildcard
> > > subscription and in the new proposed interface you can actually
> subscribe
> > > based on any logic you want to create a "union" of streams. Personally
> I
> > > think this gives you everything you would want with a hierarchy and
> more
> > > actual flexibility (since you can define groupings however you want).
> > What
> > > do you think?
> > >
> > > -Jay
> > >
> > >
> > > On Mon, Feb 10, 2014 at 3:37 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
> > >
> > > > WRT to hierarchical topics, I'm referring to
> > > > KAFKA-1175.
> > > > I would just like to think through the implications for the Consumer
> > API
> > > if
> > > > and when we do implement hierarchical topics. For example, in the
> > > > proposal<
> > > >
> https://cwiki.apache.org/confluence/display/KAFKA/Hierarchical+Topics#
> > > > >written
> > > > by Jay, he says that initially wildcard subscriptions are not going
> > > > to be supported. But does that mean that they will be supported in
> v2?
> > If
> > > > that's the case, that would change the semantics of the Consumer API.
> > > >
> > > > As to having classes for Topic, PartitionId, etc. it looks like I was
> > > > referring to the TopicPartition and TopicPartitionOffset classes (I
> > > didn't
> > > > realize these were already there). I was only looking at the
> confluence
> > > > page which shows List[(String, Int, Long)] instead of
> > > > List[TopicParitionOffset] (as is shown in the javadoc). However, I
> did
> > > > notice that we're not being consistent in the Java version. E.g. we
> > have
> > > > commit(TopicPartitionOffset... offsets) and
> > > > lastCommittedOffsets(TopicPartition... partitions) on the one hand.
> On
> > > the
> > > > other hand we have subscribe(String topic, int... partitions). I
> agree
> > > that
> > > > creating a class for TopicId today would probably not make too much
> > sense
> > > > today. But with hierarchical topics, I may change my mind. This is
> > > exactly
> > > > what was done in the HBase API in 0.96 when namespaces were added.
> 0.96
> > > > HBase API introduced a class called 'TableName' to represent the
> > > namespace
> > > > and table name.
> > > >
> > > >
> > > > On Mon, Feb 10, 2014 at 3:08 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
> > > >
> > > > > Thanks for the feedback.
> > > > >
> > > > > Mattijs -
> > > > >
> > > > > - Constructors link to
> > > > > http://kafka.apache.org/documentation.html#con

Re: New Consumer API discussion

2014-02-11 Thread Neha Narkhede
Pradeep,

To be clear, we want to get feedback on the APIs from the
javadoc since
the wiki will be slightly behind on the APIs.

1. Regarding consistency, do you have specific feedback on which APIs
should have different arguments/return types?
2. lastCommittedOffsets() does what you said in the javadoc.

Thanks,
Neha


On Tue, Feb 11, 2014 at 11:45 AM, Pradeep Gollakota wrote:

> Hi Jay,
>
> I apologize for derailing the conversation about the consumer API. We
> should start a new discussion about hierarchical topics, if we want to keep
> talking about it. My final thought on the matter is that, hierarchical
> topics is still an important feature to have in Kafka, because it gives us
> flexibility to do namespace level access controls.
>
> Getting back to the topic of the Consumer API:
>
>1. Any thoughts on consistency for method arguments and return types?
>    2. lastCommittedOffsets() method returns a
> List<TopicPartitionOffset> whereas the confluence page suggested a
> Map<TopicPartition, Long>. I would think that a Map is the more appropriate return type.
>
>
>
> On Tue, Feb 11, 2014 at 8:04 AM, Jay Kreps  wrote:
>
> > Hey Pradeep,
> >
> > That wiki is fairly old and it predated more flexible subscription
> > mechanisms. In the high-level consumer you currently have wildcard
> > subscription and in the new proposed interface you can actually subscribe
> > based on any logic you want to create a "union" of streams. Personally I
> > think this gives you everything you would want with a hierarchy and more
> > actual flexibility (since you can define groupings however you want).
> What
> > do you think?
> >
> > -Jay
> >
> >
> > On Mon, Feb 10, 2014 at 3:37 PM, Pradeep Gollakota wrote:
> >
> > > WRT to hierarchical topics, I'm referring to
> > > KAFKA-1175.
> > > I would just like to think through the implications for the Consumer
> API
> > if
> > > and when we do implement hierarchical topics. For example, in the
> > > proposal<
> > > https://cwiki.apache.org/confluence/display/KAFKA/Hierarchical+Topics#
> > > >written
> > > by Jay, he says that initially wildcard subscriptions are not going
> > > to be supported. But does that mean that they will be supported in v2?
> If
> > > that's the case, that would change the semantics of the Consumer API.
> > >
> > > As to having classes for Topic, PartitionId, etc. it looks like I was
> > > referring to the TopicPartition and TopicPartitionOffset classes (I
> > didn't
> > > realize these were already there). I was only looking at the confluence
> > > page which shows List[(String, Int, Long)] instead of
> > > List[TopicParitionOffset] (as is shown in the javadoc). However, I did
> > > notice that we're not being consistent in the Java version. E.g. we
> have
> > > commit(TopicPartitionOffset... offsets) and
> > > lastCommittedOffsets(TopicPartition... partitions) on the one hand. On
> > the
> > > other hand we have subscribe(String topic, int... partitions). I agree
> > that
> > > creating a class for TopicId today would probably not make too much
> sense
> > > today. But with hierarchical topics, I may change my mind. This is
> > exactly
> > > what was done in the HBase API in 0.96 when namespaces were added. 0.96
> > > HBase API introduced a class called 'TableName' to represent the
> > namespace
> > > and table name.
> > >
> > >
> > > On Mon, Feb 10, 2014 at 3:08 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
> > >
> > > > Thanks for the feedback.
> > > >
> > > > Mattijs -
> > > >
> > > > - Constructors link to
> > > > http://kafka.apache.org/documentation.html#consumerconfigs for valid
> > > > configurations, which lists zookeeper.connect rather than
> > > > metadata.broker.list, the value for BROKER_LIST_CONFIG in
> > ConsumerConfig.
> > > > Fixed it to just point to ConsumerConfig for now until we finalize
> the
> > > new
> > > > configs
> > > > - Docs for poll(long) mention consumer.commit(true), which I can't
> find
> > > in
> > > > the Consumer docs. For a simple consumer setup, that call is
> something
> > > that
> > > > would make a lot of sense.
> > > > Missed changing the examples to use consumer.commit(true, offsets).
> The
> > > > suggestions by Jay would change it to commit(offsets) and
> > > > commitAsync(offsets), which will hopefully make it easier to
> understand
> > > > those commit APIs.
> > > > - Love the addition of MockConsumer, awesome for unittesting :)
> > > > I'm not quite satisfied with what it does as of right now, but we
> will
> > > > surely improve it as we start writing the consumer.
> > > >
> > > > Jay -
> > > >
> > > > 1. ConsumerRebalanceCallback
> > > > a. Makes sense. Renamed to onPartitionsRevoked
> > > > b. Ya, it will be good to make it forward compatible with Java 8
> > > > capabilities. We can change it to PartitionsAssignedCallback and
> > > >  Partitions

Re: New Consumer API discussion

2014-02-11 Thread Pradeep Gollakota
Hi Jay,

I apologize for derailing the conversation about the consumer API. We
should start a new discussion about hierarchical topics, if we want to keep
talking about it. My final thought on the matter is that hierarchical
topics are still an important feature to have in Kafka, because they give us
flexibility to do namespace-level access controls.

Getting back to the topic of the Consumer API:

   1. Any thoughts on consistency for method arguments and return types?
   2. lastCommittedOffsets() method returns a
List<TopicPartitionOffset> whereas the confluence page suggested a
Map<TopicPartition, Long>. I would think that a Map is the more appropriate return type.



On Tue, Feb 11, 2014 at 8:04 AM, Jay Kreps  wrote:

> Hey Pradeep,
>
> That wiki is fairly old and it predated more flexible subscription
> mechanisms. In the high-level consumer you currently have wildcard
> subscription and in the new proposed interface you can actually subscribe
> based on any logic you want to create a "union" of streams. Personally I
> think this gives you everything you would want with a hierarchy and more
> actual flexibility (since you can define groupings however you want). What
> do you think?
>
> -Jay
>
>
> On Mon, Feb 10, 2014 at 3:37 PM, Pradeep Gollakota wrote:
>
> > WRT to hierarchical topics, I'm referring to
> > KAFKA-1175.
> > I would just like to think through the implications for the Consumer API
> if
> > and when we do implement hierarchical topics. For example, in the
> > proposal<
> > https://cwiki.apache.org/confluence/display/KAFKA/Hierarchical+Topics#
> > >written
> > by Jay, he says that initially wildcard subscriptions are not going
> > to be supported. But does that mean that they will be supported in v2? If
> > that's the case, that would change the semantics of the Consumer API.
> >
> > As to having classes for Topic, PartitionId, etc. it looks like I was
> > referring to the TopicPartition and TopicPartitionOffset classes (I
> didn't
> > realize these were already there). I was only looking at the confluence
> > page which shows List[(String, Int, Long)] instead of
> > List[TopicParitionOffset] (as is shown in the javadoc). However, I did
> > notice that we're not being consistent in the Java version. E.g. we have
> > commit(TopicPartitionOffset... offsets) and
> > lastCommittedOffsets(TopicPartition... partitions) on the one hand. On
> the
> > other hand we have subscribe(String topic, int... partitions). I agree
> that
> > creating a class for TopicId today would probably not make too much sense
> > today. But with hierarchical topics, I may change my mind. This is
> exactly
> > what was done in the HBase API in 0.96 when namespaces were added. 0.96
> > HBase API introduced a class called 'TableName' to represent the
> namespace
> > and table name.
> >
> >
> > On Mon, Feb 10, 2014 at 3:08 PM, Neha Narkhede wrote:
> >
> > > Thanks for the feedback.
> > >
> > > Mattijs -
> > >
> > > - Constructors link to
> > > http://kafka.apache.org/documentation.html#consumerconfigs for valid
> > > configurations, which lists zookeeper.connect rather than
> > > metadata.broker.list, the value for BROKER_LIST_CONFIG in
> ConsumerConfig.
> > > Fixed it to just point to ConsumerConfig for now until we finalize the
> > new
> > > configs
> > > - Docs for poll(long) mention consumer.commit(true), which I can't find
> > in
> > > the Consumer docs. For a simple consumer setup, that call is something
> > that
> > > would make a lot of sense.
> > > Missed changing the examples to use consumer.commit(true, offsets). The
> > > suggestions by Jay would change it to commit(offsets) and
> > > commitAsync(offsets), which will hopefully make it easier to understand
> > > those commit APIs.
> > > - Love the addition of MockConsumer, awesome for unittesting :)
> > > I'm not quite satisfied with what it does as of right now, but we will
> > > surely improve it as we start writing the consumer.
> > >
> > > Jay -
> > >
> > > 1. ConsumerRebalanceCallback
> > > a. Makes sense. Renamed to onPartitionsRevoked
> > > b. Ya, it will be good to make it forward compatible with Java 8
> > > capabilities. We can change it to PartitionsAssignedCallback and
> > >  PartitionsRevokedCallback or RebalanceBeginCallback and
> > > RebalanceEndCallback?
> > > c. Ya, I thought about that but then didn't name it just
> > > RebalanceCallback since there could be a conflict with a controller
> side
> > > rebalance callback if/when we have one. However, you can argue that at
> > that
> > > time we can name it ControllerRebalanceCallback instead of polluting a
> > user
> > > facing API. So agree with you here.
> > > 2. Ya, that is a good idea. Changed to subscribe(String topic,
> > > int...partitions).
> > > 3. lastCommittedOffset() is not necessarily a local access since the
> > > consumer can potentially ask for the last committed offsets of
> partitions
> > > that the consumer does not consume and maintain the offsets 

Re: New Consumer API discussion

2014-02-11 Thread Jay Kreps
Hey Pradeep,

That wiki is fairly old and it predated more flexible subscription
mechanisms. In the high-level consumer you currently have wildcard
subscription and in the new proposed interface you can actually subscribe
based on any logic you want to create a "union" of streams. Personally I
think this gives you everything you would want with a hierarchy and more
actual flexibility (since you can define groupings however you want). What
do you think?

-Jay


On Mon, Feb 10, 2014 at 3:37 PM, Pradeep Gollakota wrote:

> WRT to hierarchical topics, I'm referring to
> KAFKA-1175.
> I would just like to think through the implications for the Consumer API if
> and when we do implement hierarchical topics. For example, in the
> proposal<
> https://cwiki.apache.org/confluence/display/KAFKA/Hierarchical+Topics#
> >written
> by Jay, he says that initially wildcard subscriptions are not going
> to be supported. But does that mean that they will be supported in v2? If
> that's the case, that would change the semantics of the Consumer API.
>
> As to having classes for Topic, PartitionId, etc. it looks like I was
> referring to the TopicPartition and TopicPartitionOffset classes (I didn't
> realize these were already there). I was only looking at the confluence
> page which shows List[(String, Int, Long)] instead of
> List[TopicParitionOffset] (as is shown in the javadoc). However, I did
> notice that we're not being consistent in the Java version. E.g. we have
> commit(TopicPartitionOffset... offsets) and
> lastCommittedOffsets(TopicPartition... partitions) on the one hand. On the
> other hand we have subscribe(String topic, int... partitions). I agree that
> creating a class for TopicId today would probably not make too much sense
> today. But with hierarchical topics, I may change my mind. This is exactly
> what was done in the HBase API in 0.96 when namespaces were added. 0.96
> HBase API introduced a class called 'TableName' to represent the namespace
> and table name.
>
>
> On Mon, Feb 10, 2014 at 3:08 PM, Neha Narkhede wrote:
>
> > Thanks for the feedback.
> >
> > Mattijs -
> >
> > - Constructors link to
> > http://kafka.apache.org/documentation.html#consumerconfigs for valid
> > configurations, which lists zookeeper.connect rather than
> > metadata.broker.list, the value for BROKER_LIST_CONFIG in ConsumerConfig.
> > Fixed it to just point to ConsumerConfig for now until we finalize the
> new
> > configs
> > - Docs for poll(long) mention consumer.commit(true), which I can't find
> in
> > the Consumer docs. For a simple consumer setup, that call is something
> that
> > would make a lot of sense.
> > Missed changing the examples to use consumer.commit(true, offsets). The
> > suggestions by Jay would change it to commit(offsets) and
> > commitAsync(offsets), which will hopefully make it easier to understand
> > those commit APIs.
> > - Love the addition of MockConsumer, awesome for unittesting :)
> > I'm not quite satisfied with what it does as of right now, but we will
> > surely improve it as we start writing the consumer.
> >
> > Jay -
> >
> > 1. ConsumerRebalanceCallback
> > a. Makes sense. Renamed to onPartitionsRevoked
> > b. Ya, it will be good to make it forward compatible with Java 8
> > capabilities. We can change it to PartitionsAssignedCallback and
> >  PartitionsRevokedCallback or RebalanceBeginCallback and
> > RebalanceEndCallback?
> > c. Ya, I thought about that but then didn't name it just
> > RebalanceCallback since there could be a conflict with a controller side
> > rebalance callback if/when we have one. However, you can argue that at
> that
> > time we can name it ControllerRebalanceCallback instead of polluting a
> user
> > facing API. So agree with you here.
> > 2. Ya, that is a good idea. Changed to subscribe(String topic,
> > int...partitions).
> > 3. lastCommittedOffset() is not necessarily a local access since the
> > consumer can potentially ask for the last committed offsets of partitions
> > that the consumer does not consume and maintain the offsets for. That's
> the
> > reason it is batched right now.
> > 4. Yes, look at
> >
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html#AUTO_OFFSET_RESET_CONFIG
> > 5. Sure, but that is not part of the consumer API right? I think you're
> > suggesting looking at OffsetRequest to see if it would do that properly?
> > 6. Good point. Changed to poll(long timeout, TimeUnit) and poll with a
> > negative timeout will poll indefinitely?
> > 7. Good point. Changed to commit(...) and commitAsync(...)
> > 8. To commit the current position for all partitions owned by the
> consumer,
> > you can use commit(). If you don't use group management, then
> > commit(customListOfPartitions)
> > 9. Forgot to include unsubscribe. Done now.
> > 10. positions() can be called at any time and affects the next fetch on
> the
> > next 

Re: New Consumer API discussion

2014-02-10 Thread Pradeep Gollakota
WRT to hierarchical topics, I'm referring to
KAFKA-1175.
I would just like to think through the implications for the Consumer API if
and when we do implement hierarchical topics. For example, in the
proposal written
by Jay, he says that initially wildcard subscriptions are not going
to be supported. But does that mean that they will be supported in v2? If
that's the case, that would change the semantics of the Consumer API.

As to having classes for Topic, PartitionId, etc. it looks like I was
referring to the TopicPartition and TopicPartitionOffset classes (I didn't
realize these were already there). I was only looking at the confluence
page which shows List[(String, Int, Long)] instead of
List[TopicParitionOffset] (as is shown in the javadoc). However, I did
notice that we're not being consistent in the Java version. E.g. we have
commit(TopicPartitionOffset... offsets) and
lastCommittedOffsets(TopicPartition... partitions) on the one hand. On the
other hand we have subscribe(String topic, int... partitions). I agree that
creating a class for TopicId today would probably not make too much sense
today. But with hierarchical topics, I may change my mind. This is exactly
what was done in the HBase API in 0.96 when namespaces were added. 0.96
HBase API introduced a class called 'TableName' to represent the namespace
and table name.


On Mon, Feb 10, 2014 at 3:08 PM, Neha Narkhede wrote:

> Thanks for the feedback.
>
> Mattijs -
>
> - Constructors link to
> http://kafka.apache.org/documentation.html#consumerconfigs for valid
> configurations, which lists zookeeper.connect rather than
> metadata.broker.list, the value for BROKER_LIST_CONFIG in ConsumerConfig.
> Fixed it to just point to ConsumerConfig for now until we finalize the new
> configs
> - Docs for poll(long) mention consumer.commit(true), which I can't find in
> the Consumer docs. For a simple consumer setup, that call is something that
> would make a lot of sense.
> Missed changing the examples to use consumer.commit(true, offsets). The
> suggestions by Jay would change it to commit(offsets) and
> commitAsync(offsets), which will hopefully make it easier to understand
> those commit APIs.
> - Love the addition of MockConsumer, awesome for unittesting :)
> I'm not quite satisfied with what it does as of right now, but we will
> surely improve it as we start writing the consumer.
>
> Jay -
>
> 1. ConsumerRebalanceCallback
> a. Makes sense. Renamed to onPartitionsRevoked
> b. Ya, it will be good to make it forward compatible with Java 8
> capabilities. We can change it to PartitionsAssignedCallback and
>  PartitionsRevokedCallback or RebalanceBeginCallback and
> RebalanceEndCallback?
> c. Ya, I thought about that but then didn't name it just
> RebalanceCallback since there could be a conflict with a controller side
> rebalance callback if/when we have one. However, you can argue that at that
> time we can name it ControllerRebalanceCallback instead of polluting a user
> facing API. So agree with you here.
> 2. Ya, that is a good idea. Changed to subscribe(String topic,
> int...partitions).
> 3. lastCommittedOffset() is not necessarily a local access since the
> consumer can potentially ask for the last committed offsets of partitions
> that the consumer does not consume and maintain the offsets for. That's the
> reason it is batched right now.
> 4. Yes, look at
>
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html#AUTO_OFFSET_RESET_CONFIG
> 5. Sure, but that is not part of the consumer API right? I think you're
> suggesting looking at OffsetRequest to see if it would do that properly?
> 6. Good point. Changed to poll(long timeout, TimeUnit) and poll with a
> negative timeout will poll indefinitely?
> 7. Good point. Changed to commit(...) and commitAsync(...)
> 8. To commit the current position for all partitions owned by the consumer,
> you can use commit(). If you don't use group management, then
> commit(customListOfPartitions)
> 9. Forgot to include unsubscribe. Done now.
> 10. positions() can be called at any time and affects the next fetch on the
> next poll(). Fixed the places that said "starting fetch offsets"
> 11. Can we not look that up by going through the messages returned and
> getting the offset from the ConsumerRecord?
>
> One thing that I really found helpful for the API design was writing out
> actual code for different scenarios against the API. I think it might be
> good to do that for this too--i.e. enumerate the various use cases and code
> that use case up to see how it looks
> The javadocs include examples for almost all possible scenarios of consumer
> usage, that I could come up with. Will add more to the javadocs as I get
> more feedback from our users. The advantage of having the examples in the
> javadoc itself is to that the

Re: New Consumer API discussion

2014-02-10 Thread Neha Narkhede
Thanks for the feedback.

Mattijs -

- Constructors link to
http://kafka.apache.org/documentation.html#consumerconfigs for valid
configurations, which lists zookeeper.connect rather than
metadata.broker.list, the value for BROKER_LIST_CONFIG in ConsumerConfig.
Fixed it to just point to ConsumerConfig for now until we finalize the new
configs
- Docs for poll(long) mention consumer.commit(true), which I can't find in
the Consumer docs. For a simple consumer setup, that call is something that
would make a lot of sense.
Missed changing the examples to use consumer.commit(true, offsets). The
suggestions by Jay would change it to commit(offsets) and
commitAsync(offsets), which will hopefully make it easier to understand
those commit APIs.
- Love the addition of MockConsumer, awesome for unittesting :)
I'm not quite satisfied with what it does as of right now, but we will
surely improve it as we start writing the consumer.

Jay -

1. ConsumerRebalanceCallback
a. Makes sense. Renamed to onPartitionsRevoked
b. Ya, it will be good to make it forward compatible with Java 8
capabilities. We can change it to PartitionsAssignedCallback and
 PartitionsRevokedCallback or RebalanceBeginCallback and
RebalanceEndCallback?
c. Ya, I thought about that but then didn't name it just
RebalanceCallback since there could be a conflict with a controller side
rebalance callback if/when we have one. However, you can argue that at that
time we can name it ControllerRebalanceCallback instead of polluting a user
facing API. So agree with you here.
2. Ya, that is a good idea. Changed to subscribe(String topic,
int...partitions).
3. lastCommittedOffset() is not necessarily a local access since the
consumer can potentially ask for the last committed offsets of partitions
that the consumer does not consume and maintain the offsets for. That's the
reason it is batched right now.
4. Yes, look at
http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html#AUTO_OFFSET_RESET_CONFIG
5. Sure, but that is not part of the consumer API right? I think you're
suggesting looking at OffsetRequest to see if it would do that properly?
6. Good point. Changed to poll(long timeout, TimeUnit) and poll with a
negative timeout will poll indefinitely?
7. Good point. Changed to commit(...) and commitAsync(...)
8. To commit the current position for all partitions owned by the consumer,
you can use commit(). If you don't use group management, then
commit(customListOfPartitions)
9. Forgot to include unsubscribe. Done now.
10. positions() can be called at any time and affects the next fetch on the
next poll(). Fixed the places that said "starting fetch offsets"
11. Can we not look that up by going through the messages returned and
getting the offset from the ConsumerRecord?

One thing that I really found helpful for the API design was writing out
actual code for different scenarios against the API. I think it might be
good to do that for this too--i.e. enumerate the various use cases and code
that use case up to see how it looks
The javadocs include examples for almost all possible scenarios of consumer
usage that I could come up with. Will add more to the javadocs as I get
more feedback from our users. The advantage of having the examples in the
javadoc itself is that the usage is self-explanatory to new users.

Pradeep -

2. Changed to poll(long, TimeUnit) and a negative value for the timeout
would block in the poll forever until there is new data
3. We don't have hierarchical topics support. Would you mind explaining
what you meant?
4. I'm not so sure that we need a class to express a topic which is a
string and a separate class for just partition id. We do have a class for
TopicPartition which uniquely identifies a partition of a topic

Thanks,
Neha


On Mon, Feb 10, 2014 at 12:36 PM, Pradeep Gollakota wrote:

> Couple of very quick thoughts.
>
> 1. +1 about renaming commit(...) and commitAsync(...)
> 2. I'd also like to extend the above for the poll()  method as well. poll()
> and pollWithTimeout(long, TimeUnit)?
> 3. Have you guys given any thought around how this API would be used with
> hierarchical topics?
> 4. Would it make sense to add classes such as TopicId, PartitionId, etc?
> Seems like it would be easier to read code with these classes as opposed to
> string and longs.
>
> - Pradeep
>
>
> On Mon, Feb 10, 2014 at 12:20 PM, Jay Kreps  wrote:
>
> > A few items:
> > 1. ConsumerRebalanceCallback
> >a. onPartitionsRevoked would be a better name.
> >b. We should discuss the possibility of splitting this into two
> > interfaces. The motivation would be that in Java 8 single method
> interfaces
> > can directly take methods which might be more intuitive.
> >c. If we stick with a single interface I would prefer the name
> > RebalanceCallback as its more concise
> > 2. Should subscribe(String topic, int partition) should be
> subscribe(String
> > topic, int...partition)?
> > 3.

Re: New Consumer API discussion

2014-02-10 Thread Guozhang Wang
Hi Mattijs:

2. As Neha said, one design goal of the new consumer is to have a non-blocking
consuming API instead of a blocking one. Do you have a strong reason in mind
to still keep the blocking API instead of just using "while(no-data)
poll(timeout)"?

3. No we have not thought about hierarchical topics. Could you elaborate on
some use cases?

4. The Consumer will share some common code with the Producer, whose
ProducerRecord has

private final String topic;
private final Integer partition;
private final byte[] key;
private final byte[] value;
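
A minimal sketch of the "while(no-data) poll(timeout)" pattern from point 2 above, assuming poll() returns an empty list when no data arrives within the timeout (names from the draft API, subject to change):

// Sketch only: emulate a blocking read on top of the non-blocking poll().
List<ConsumerRecord> records = consumer.poll(100, TimeUnit.MILLISECONDS);
while (records.isEmpty()) {
    records = consumer.poll(100, TimeUnit.MILLISECONDS);  // keep polling until data shows up
}
// records is now non-empty and can be processed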

Thanks,

Guozhang


On Mon, Feb 10, 2014 at 2:31 PM, Guozhang Wang  wrote:

> Hello Jay,
>
> Thanks for the detailed comments.
>
> 1. Yeah we could discuss a bit more on that.
>
> 2. Since subscribe() is incremental, adding one topic-partition is OK, and
> personally I think it is cleaner than subscribe(String topic,
> int...partition)?
>
> 3. Originally I was thinking about two interfaces:
>
> getOffsets() // offsets for all partitions that I am consuming now
>
> getOffset(topic-partition) // offset of the specified topic-partition; will
> throw an exception if it is not currently consumed.
>
> What do you think about these?
>
> 4. Yes, that remains a config.
>
> 5. Agree.
>
> 6. If the time out value is null then it will "logically" return
> immediately with whatever data is available. I think an indefinitely poll()
> function could be replaced with just
>
> while (true) poll(some-time)?
>
> 7. I am open with either approach.
>
> 8. I was thinking about two interfaces for the commit functionality:
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
>
> Do those sound better?
>
> 9. Currently I think about un-subscribe as "close and re-subscribe", and
> would like to hear people's opinion about it.
>
> 10. Yes. position() is an API function, and as an API it can be called at
> any time and will change the starting offset of the next fetch.
>
> 11. The ConsumerRecord would have the offset info of the message. Is that
> what you want?
>
> About use cases: great point. I will add some more examples of using the
> API functions in the wiki pages.
>
> Guozhang
>
>
>
>
> On Mon, Feb 10, 2014 at 12:20 PM, Jay Kreps  wrote:
>
>> A few items:
>> 1. ConsumerRebalanceCallback
>>a. onPartitionsRevoked would be a better name.
>>b. We should discuss the possibility of splitting this into two
>> interfaces. The motivation would be that in Java 8 single method
>> interfaces
>> can directly take methods which might be more intuitive.
>>c. If we stick with a single interface I would prefer the name
>> RebalanceCallback as its more concise
>> 2. Should subscribe(String topic, int partition) should be
>> subscribe(String
>> topic, int...partition)?
>> 3. Is lastCommittedOffset call just a local access? If so it would be more
>> convenient not to batch it.
>> 4. How are we going to handle the earliest/latest starting position
>> functionality we currently have. Does that remain a config?
>> 5. Do we need to expose the general ability to get known positions from
>> the
>> log? E.g. the functionality in the OffsetRequest...? That would make the
>> ability to change position a little easier.
>> 6. Should poll(java.lang.Long timeout) be poll(long timeout, TimeUnit
>> unit)? Is it Long because it allows null? If so should we just add a
>> poll()
>> that polls indefinitely?
>> 7. I recommend we remove the boolean parameter from commit as it is really
>> hard to read code that has boolean parameters without named arguments. Can
>> we make it something like commit(...) and commitAsync(...)?
>> 8. What about the common case where you just want to commit the current
>> position for all partitions?
>> 9. How do you unsubscribe?
>> 10. You say in a few places that positions() only impacts the starting
>> position, but surely that isn't the case, right? Surely it controls the
>> fetch position for that partition and can be called at any time? Otherwise
>> it is a pretty weird api, right?
>> 11. How do I get my current position? Not the committed position but the
>> offset of the next message that will be given to me?
>>
>> One thing that I really found helpful for the API design was writing out
>> actual code for different scenarios against the API. I think it might be
>> good to do that for this too--i.e. enumerate the various use cases and
>> code
>> that use case up to see how it looks. I'm not sure if it would be useful
>> to
>> collect these kinds of scenarios from people. I know they have
>> sporadically
>> popped up on the mailing list.
>>
>> -Jay
>>
>>
>> On Mon, Feb 10, 2014 at 10:54 AM, Neha Narkhede > >wrote:
>>
>> > As mentioned in previous emails, we are also working on a
>> re-implementation
>> > of the consumer. I would like to use this email thread to discuss the
>> > details of the public API. I would also like us to be picky about this
>> > public api now so it is as good as possible and we don't need to break
>> it
>> > i

Re: New Consumer API discussion

2014-02-10 Thread Guozhang Wang
Hello Jay,

Thanks for the detailed comments.

1. Yeah we could discuss a bit more on that.

2. Since subscribe() is incremental, adding one topic-partition is OK, and
personally I think it is cleaner than subscribe(String topic,
int...partition)?

3. Originally I was thinking about two interfaces:

getOffsets() // offsets for all partitions that I am consuming now

getOffset(topic-partition) // offset of the specified topic-partition; will
throw an exception if it is not currently consumed.

What do you think about these? (A usage sketch follows this list.)

4. Yes, that remains a config.

5. Agree.

6. If the timeout value is null then it will "logically" return immediately
with whatever data is available. I think an indefinite poll() function
could be replaced with just

while (true) poll(some-time)?

7. I am open to either approach.

8. I was thinking about two interfaces for the commit functionality:

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design

Do those sound better?

9. Currently I think about unsubscribe as "close and re-subscribe", and
would like to hear people's opinions about it.

10. Yes. position() is an API function, and as an API it can be called at
any time and will change the starting offset of the next fetch.

11. The ConsumerRecord would have the offset info of the message. Is that
what you want?
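
As a usage sketch for 3 and 10: getOffset()/getOffsets() are only proposed names from this thread and positions() comes from the draft discussion, so all signatures here are assumptions:

// Sketch only: proposed offset/position accessors, signatures assumed.
TopicPartition tp = new TopicPartition("my-topic", 0);
long consumed = consumer.getOffset(tp);                 // throws if tp is not currently consumed
Map<TopicPartition, Long> all = consumer.getOffsets();  // offsets for everything being consumed

// positions() can be called at any time; it changes the starting offset of the next fetch.
consumer.positions(Collections.singletonMap(tp, 5000L));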

About use cases: great point. I will add some more examples of using the
API functions in the wiki pages.

Guozhang




On Mon, Feb 10, 2014 at 12:20 PM, Jay Kreps  wrote:

> A few items:
> 1. ConsumerRebalanceCallback
>a. onPartitionsRevoked would be a better name.
>b. We should discuss the possibility of splitting this into two
> interfaces. The motivation would be that in Java 8 single method interfaces
> can directly take methods which might be more intuitive.
>c. If we stick with a single interface I would prefer the name
> RebalanceCallback as its more concise
> 2. Should subscribe(String topic, int partition) should be subscribe(String
> topic, int...partition)?
> 3. Is lastCommittedOffset call just a local access? If so it would be more
> convenient not to batch it.
> 4. How are we going to handle the earliest/latest starting position
> functionality we currently have. Does that remain a config?
> 5. Do we need to expose the general ability to get known positions from the
> log? E.g. the functionality in the OffsetRequest...? That would make the
> ability to change position a little easier.
> 6. Should poll(java.lang.Long timeout) be poll(long timeout, TimeUnit
> unit)? Is it Long because it allows null? If so should we just add a poll()
> that polls indefinitely?
> 7. I recommend we remove the boolean parameter from commit as it is really
> hard to read code that has boolean parameters without named arguments. Can
> we make it something like commit(...) and commitAsync(...)?
> 8. What about the common case where you just want to commit the current
> position for all partitions?
> 9. How do you unsubscribe?
> 10. You say in a few places that positions() only impacts the starting
> position, but surely that isn't the case, right? Surely it controls the
> fetch position for that partition and can be called at any time? Otherwise
> it is a pretty weird api, right?
> 11. How do I get my current position? Not the committed position but the
> offset of the next message that will be given to me?
>
> One thing that I really found helpful for the API design was writing out
> actual code for different scenarios against the API. I think it might be
> good to do that for this too--i.e. enumerate the various use cases and code
> that use case up to see how it looks. I'm not sure if it would be useful to
> collect these kinds of scenarios from people. I know they have sporadically
> popped up on the mailing list.
>
> -Jay
>
>
> On Mon, Feb 10, 2014 at 10:54 AM, Neha Narkhede  >wrote:
>
> > As mentioned in previous emails, we are also working on a
> re-implementation
> > of the consumer. I would like to use this email thread to discuss the
> > details of the public API. I would also like us to be picky about this
> > public api now so it is as good as possible and we don't need to break it
> > in the future.
> >
> > The best way to get a feel for the API is actually to take a look at the
> > javadoc<
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> > >,
> > the hope is to get the api docs good enough so that it is
> self-explanatory.
> > You can also take a look at the configs
> > here<
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html
> > >
> >
> > Some background info on implementation:
> >
> > At a high level the primary difference in this consumer is that it
> removes
> > the distinction between the "high-level" and "low-level" consumer. The
> new
> > consumer API is non blocking and instead of returning a blocking
> iterator,
> > the consume

Re: New Consumer API discussion

2014-02-10 Thread Guozhang Wang
Hi Mattijs:

We have not updated the wiki pages for config yet, and they will not be
updated until we release 0.9 with these changes.

Currently consumers do have a commitOffsets function that can be called by
the users, but for most use cases auto.commit is turned on and this
function gets called by the consumer client itself.
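
For reference, a minimal sketch of that with the existing high-level consumer, assuming the 0.8 property name auto.commit.enable and the ConsumerConnector.commitOffsets() call:

// Sketch only against the current high-level consumer; adjust to your client version.
Properties props = new Properties();
props.put("zookeeper.connect", "localhost:2181");
props.put("group.id", "test-group");
props.put("auto.commit.enable", "false");  // turn auto commit off to call commitOffsets() manually

ConsumerConnector connector =
    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
// ... consume from the streams returned by connector.createMessageStreams(...) ...
connector.commitOffsets();                 // commit offsets for everything consumed so far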

Guozhang



On Mon, Feb 10, 2014 at 11:18 AM, Mattijs Ugen  wrote:

> Hey Neha,
>
> This looks really promising, I particularly like the ability to commit
> offsets for topic/partition tuples over just commit(). Some remarks:
>
> - Constructors link to http://kafka.apache.org/documentation.html#
> consumerconfigs for valid configurations, which lists zookeeper.connect
> rather than metadata.broker.list, the value for BROKER_LIST_CONFIG in
> ConsumerConfig.
> - Docs for poll(long) mention consumer.commit(true), which I can't find in
> the Consumer docs. For a simple consumer setup, that call is something that
> would make a lot of sense.
> - Love the addition of MockConsumer, awesome for unittesting :)
>
> Digging these open discussions on API changes on the mailing list btw,
> keep up the good work :)
>
> Kind regards,
>
> Mattijs
>



-- 
-- Guozhang


Re: New Consumer API discussion

2014-02-10 Thread Pradeep Gollakota
Couple of very quick thoughts.

1. +1 about renaming commit(...) and commitAsync(...)
2. I'd also like to extend the above for the poll()  method as well. poll()
and pollWithTimeout(long, TimeUnit)?
3. Have you guys given any thought around how this API would be used with
hierarchical topics?
4. Would it make sense to add classes such as TopicId, PartitionId, etc?
Seems like it would be easier to read code with these classes as opposed to
string and longs.

- Pradeep


On Mon, Feb 10, 2014 at 12:20 PM, Jay Kreps  wrote:

> A few items:
> 1. ConsumerRebalanceCallback
>a. onPartitionsRevoked would be a better name.
>b. We should discuss the possibility of splitting this into two
> interfaces. The motivation would be that in Java 8 single method interfaces
> can directly take methods which might be more intuitive.
>c. If we stick with a single interface I would prefer the name
> RebalanceCallback as its more concise
> 2. Should subscribe(String topic, int partition) should be subscribe(String
> topic, int...partition)?
> 3. Is lastCommittedOffset call just a local access? If so it would be more
> convenient not to batch it.
> 4. How are we going to handle the earliest/latest starting position
> functionality we currently have. Does that remain a config?
> 5. Do we need to expose the general ability to get known positions from the
> log? E.g. the functionality in the OffsetRequest...? That would make the
> ability to change position a little easier.
> 6. Should poll(java.lang.Long timeout) be poll(long timeout, TimeUnit
> unit)? Is it Long because it allows null? If so should we just add a poll()
> that polls indefinitely?
> 7. I recommend we remove the boolean parameter from commit as it is really
> hard to read code that has boolean parameters without named arguments. Can
> we make it something like commit(...) and commitAsync(...)?
> 8. What about the common case where you just want to commit the current
> position for all partitions?
> 9. How do you unsubscribe?
> 10. You say in a few places that positions() only impacts the starting
> position, but surely that isn't the case, right? Surely it controls the
> fetch position for that partition and can be called at any time? Otherwise
> it is a pretty weird api, right?
> 11. How do I get my current position? Not the committed position but the
> offset of the next message that will be given to me?
>
> One thing that I really found helpful for the API design was writing out
> actual code for different scenarios against the API. I think it might be
> good to do that for this too--i.e. enumerate the various use cases and code
> that use case up to see how it looks. I'm not sure if it would be useful to
> collect these kinds of scenarios from people. I know they have sporadically
> popped up on the mailing list.
>
> -Jay
>
>
> On Mon, Feb 10, 2014 at 10:54 AM, Neha Narkhede  >wrote:
>
> > As mentioned in previous emails, we are also working on a
> re-implementation
> > of the consumer. I would like to use this email thread to discuss the
> > details of the public API. I would also like us to be picky about this
> > public api now so it is as good as possible and we don't need to break it
> > in the future.
> >
> > The best way to get a feel for the API is actually to take a look at the
> > javadoc<
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> > >,
> > the hope is to get the api docs good enough so that it is
> self-explanatory.
> > You can also take a look at the configs
> > here<
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html
> > >
> >
> > Some background info on implementation:
> >
> > At a high level the primary difference in this consumer is that it
> removes
> > the distinction between the "high-level" and "low-level" consumer. The
> new
> > consumer API is non blocking and instead of returning a blocking
> iterator,
> > the consumer provides a poll() API that returns a list of records. We
> think
> > this is better compared to the blocking iterators since it effectively
> > decouples the threading strategy used for processing messages from the
> > consumer. It is worth noting that the consumer is entirely single
> threaded
> > and runs in the user thread. The advantage is that it can be easily
> > rewritten in less multi-threading-friendly languages. The consumer
> batches
> > data and multiplexes I/O over TCP connections to each of the brokers it
> > communicates with, for high throughput. The consumer also allows long
> poll
> > to reduce the end-to-end message latency for low throughput data.
> >
> > The consumer provides a group management facility that supports the
> concept
> > of a group with multiple consumer instances (just like the current
> > consumer). This is done through a custom heartbeat and group management
> > protocol transparent to the user. At the same time, it allows users the
> >

Re: New Consumer API discussion

2014-02-10 Thread Jay Kreps
A few items:
1. ConsumerRebalanceCallback
   a. onPartitionsRevoked would be a better name.
   b. We should discuss the possibility of splitting this into two
interfaces. The motivation would be that in Java 8 single method interfaces
can directly take methods which might be more intuitive.
   c. If we stick with a single interface I would prefer the name
RebalanceCallback as it's more concise.
2. Should subscribe(String topic, int partition) be subscribe(String
topic, int... partition)?
3. Is lastCommittedOffset call just a local access? If so it would be more
convenient not to batch it.
4. How are we going to handle the earliest/latest starting position
functionality we currently have. Does that remain a config?
5. Do we need to expose the general ability to get known positions from the
log? E.g. the functionality in the OffsetRequest...? That would make the
ability to change position a little easier.
6. Should poll(java.lang.Long timeout) be poll(long timeout, TimeUnit
unit)? Is it Long because it allows null? If so should we just add a poll()
that polls indefinitely?
7. I recommend we remove the boolean parameter from commit as it is really
hard to read code that has boolean parameters without named arguments. Can
we make it something like commit(...) and commitAsync(...)?
8. What about the common case where you just want to commit the current
position for all partitions?
9. How do you unsubscribe?
10. You say in a few places that positions() only impacts the starting
position, but surely that isn't the case, right? Surely it controls the
fetch position for that partition and can be called at any time? Otherwise
it is a pretty weird api, right?
11. How do I get my current position? Not the committed position but the
offset of the next message that will be given to me?

One thing that I really found helpful for the API design was writing out
actual code for different scenarios against the API. I think it might be
good to do that for this too--i.e. enumerate the various use cases and code
that use case up to see how it looks. I'm not sure if it would be useful to
collect these kinds of scenarios from people. I know they have sporadically
popped up on the mailing list.
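
As one such scenario, sketched against the draft names (treat the callback interface and signatures as assumptions): a group-managed consumer that commits its progress when partitions are about to be revoked:

// Sketch only: interface and method shapes assumed from this discussion.
ConsumerRebalanceCallback callback = new ConsumerRebalanceCallback() {
    public void onPartitionsAssigned(Consumer consumer, Collection<TopicPartition> partitions) {
        // nothing to do; fetching starts from the last committed offsets
    }
    public void onPartitionsRevoked(Consumer consumer, Collection<TopicPartition> partitions) {
        consumer.commit();  // flush progress before the partitions move to another instance
    }
};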

-Jay


On Mon, Feb 10, 2014 at 10:54 AM, Neha Narkhede wrote:

> As mentioned in previous emails, we are also working on a re-implementation
> of the consumer. I would like to use this email thread to discuss the
> details of the public API. I would also like us to be picky about this
> public api now so it is as good as possible and we don't need to break it
> in the future.
>
> The best way to get a feel for the API is actually to take a look at the
> javadoc<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> >,
> the hope is to get the api docs good enough so that it is self-explanatory.
> You can also take a look at the configs
> here<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html
> >
>
> Some background info on implementation:
>
> At a high level the primary difference in this consumer is that it removes
> the distinction between the "high-level" and "low-level" consumer. The new
> consumer API is non blocking and instead of returning a blocking iterator,
> the consumer provides a poll() API that returns a list of records. We think
> this is better compared to the blocking iterators since it effectively
> decouples the threading strategy used for processing messages from the
> consumer. It is worth noting that the consumer is entirely single threaded
> and runs in the user thread. The advantage is that it can be easily
> rewritten in less multi-threading-friendly languages. The consumer batches
> data and multiplexes I/O over TCP connections to each of the brokers it
> communicates with, for high throughput. The consumer also allows long poll
> to reduce the end-to-end message latency for low throughput data.
>
> The consumer provides a group management facility that supports the concept
> of a group with multiple consumer instances (just like the current
> consumer). This is done through a custom heartbeat and group management
> protocol transparent to the user. At the same time, it allows users the
> option to subscribe to a fixed set of partitions and not use group
> management at all. The offset management strategy defaults to Kafka based
> offset management and the API provides a way for the user to use a
> customized offset store to manage the consumer's offsets.
>
> A key difference in this consumer also is the fact that it does not depend
> on zookeeper at all.
>
> More details about the new consumer design are
> here<
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
> >
>
> Please take a look at the new
> API<
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> >and
> give us any thoughts y

Re: New Consumer API discussion

2014-02-10 Thread Mattijs Ugen

Hey Neha,

This looks really promising; I particularly like the ability to commit
offsets for topic/partition tuples rather than just commit(). Some remarks:


- Constructors link to 
http://kafka.apache.org/documentation.html#consumerconfigs for valid 
configurations, which lists zookeeper.connect rather than 
metadata.broker.list, the value for BROKER_LIST_CONFIG in ConsumerConfig.
- Docs for poll(long) mention consumer.commit(true), which I can't find 
in the Consumer docs. For a simple consumer setup, that call is 
something that would make a lot of sense.

- Love the addition of MockConsumer, awesome for unit testing :)
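
A minimal sketch of what a MockConsumer-based unit test could look like; addRecord(), the record constructor, and the handler class here are hypothetical names used only for illustration, not the mock's actual interface:

// Sketch only: hypothetical MockConsumer usage, names assumed for illustration.
MockConsumer consumer = new MockConsumer();
consumer.subscribe("my-topic");
// Hypothetical helper that pre-loads a record into the mock before the code under test polls.
consumer.addRecord(new ConsumerRecord("my-topic", 0, 42L, "key".getBytes(), "value".getBytes()));

MyMessageHandler handler = new MyMessageHandler(consumer);  // code under test, placeholder
handler.runOnce();                                          // polls once and processes the record
assertTrue(handler.processedOffsets().contains(42L));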

Digging these open discussions on API changes on the mailing list btw, 
keep up the good work :)


Kind regards,

Mattijs