Hey Guys,

Also, for reference, we'll be looking to implement new Samza consumers which have these APIs:
http://samza.incubator.apache.org/learn/documentation/0.7.0/api/javadocs/org/apache/samza/system/SystemConsumer.html
http://samza.incubator.apache.org/learn/documentation/0.7.0/api/javadocs/org/apache/samza/checkpoint/CheckpointManager.html

Question (3) below is a result of Samza's SystemConsumer poll allowing specific topic/partitions to be specified. The split between consumer and checkpoint manager is the reason for question (12) below.

Cheers,
Chris

On 3/3/14 10:19 AM, "Chris Riccomini" <criccom...@linkedin.com> wrote:

>Hey Guys,
>
>Sorry for the late follow up. Here are my questions/thoughts on the API:
>
>1. Why is the config String->Object instead of String->String?
>
>2. Are these Java docs correct?
>
>  KafkaConsumer(java.util.Map<java.lang.String,java.lang.Object> configs)
>  A consumer is instantiated by providing a set of key-value pairs as
>  configuration and a ConsumerRebalanceCallback implementation
>
>There is no ConsumerRebalanceCallback parameter.
>
>3. Would like to have a method:
>
>  poll(long timeout, java.util.concurrent.TimeUnit timeUnit,
>       TopicPartition... topicAndPartitionsToPoll)
>
>I see I can effectively do this by just fiddling with subscribe and
>unsubscribe before each poll. Is this a low-overhead operation? Can I just
>unsubscribe from everything after each poll, then re-subscribe to a topic
>the next iteration? I would probably be doing this in a fairly tight loop.
>
>4. The behavior of AUTO_OFFSET_RESET_CONFIG is overloaded. I think there
>are use cases for decoupling "what to do when no offset exists" from "what
>to do when I'm out of range". I might want to start from smallest the
>first time I run, but fail if I ever get an offset out of range.
>
>5. ENABLE_JMX could use Java docs, even though it's fairly
>self-explanatory.
>
>6. Clarity about whether FETCH_BUFFER_CONFIG is per-topic/partition, or
>across all topic/partitions, would be useful. I believe it's
>per-topic/partition, right?
>That is, setting it to 2 megs with two TopicAndPartitions would result
>in 4 megs worth of data coming in per fetch, right?
>
>7. What does the consumer do if METADATA_FETCH_TIMEOUT_CONFIG times out?
>Retry, or throw an exception?
>
>8. Does RECONNECT_BACKOFF_MS_CONFIG apply to both metadata requests and
>fetch requests?
>
>9. What does SESSION_TIMEOUT_MS default to?
>
>10. Is this consumer thread-safe?
>
>11. How do you use a different offset management strategy? Your email
>implies that it's pluggable, but I don't see how. "The offset management
>strategy defaults to Kafka based offset management and the API provides a
>way for the user to use a customized offset store to manage the consumer's
>offsets."
>
>12. If I wish to decouple the consumer from the offset checkpointing, is
>it OK to use Joel's offset management stuff directly, rather than through
>the consumer's commit API?
>
>Cheers,
>Chris
>
>On 2/10/14 10:54 AM, "Neha Narkhede" <neha.narkh...@gmail.com> wrote:
>
>>As mentioned in previous emails, we are also working on a
>>re-implementation of the consumer. I would like to use this email thread
>>to discuss the details of the public API. I would also like us to be
>>picky about this public API now so it is as good as possible and we
>>don't need to break it in the future.
>>
>>The best way to get a feel for the API is actually to take a look at the
>>javadoc
>><http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html>;
>>the hope is to get the API docs good enough that they are
>>self-explanatory. You can also take a look at the configs here:
>><http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html>
>>
>>Some background info on the implementation:
>>
>>At a high level, the primary difference in this consumer is that it
>>removes the distinction between the "high-level" and "low-level"
>>consumer.
>>The new consumer API is non-blocking: instead of returning a blocking
>>iterator, the consumer provides a poll() API that returns a list of
>>records. We think this is better than the blocking iterators since it
>>effectively decouples the threading strategy used for processing
>>messages from the consumer. It is worth noting that the consumer is
>>entirely single-threaded and runs in the user thread. The advantage is
>>that it can be easily rewritten in less multi-threading-friendly
>>languages. The consumer batches data and multiplexes I/O over TCP
>>connections to each of the brokers it communicates with, for high
>>throughput. The consumer also allows long poll to reduce the end-to-end
>>message latency for low-throughput data.
>>
>>The consumer provides a group management facility that supports the
>>concept of a group with multiple consumer instances (just like the
>>current consumer). This is done through a custom heartbeat and group
>>management protocol that is transparent to the user. At the same time,
>>it allows users the option to subscribe to a fixed set of partitions and
>>not use group management at all. The offset management strategy defaults
>>to Kafka-based offset management, and the API provides a way for the
>>user to use a customized offset store to manage the consumer's offsets.
>>
>>A key difference in this consumer is also the fact that it does not
>>depend on ZooKeeper at all.
>>
>>More details about the new consumer design are here:
>><https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design>
>>
>>Please take a look at the new API
>><http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html>
>>and give us any thoughts you may have.
>>
>>Thanks,
>>Neha
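P.S. For anyone skimming the thread, here is a rough sketch of how I'd expect Samza's usage of the poll-based, non-blocking consumer Neha describes to look. This is only an illustration of the draft API, not working code: the class and method names (KafkaConsumer, subscribe, unsubscribe, poll, commit) and the return type of poll() are taken loosely from the linked javadoc draft and should be treated as assumptions, since the API is still being designed and nothing here compiles against a released artifact.

```java
// Sketch only -- assumes the draft consumer API under discussion, which
// is not released. All names below are illustrative, not final.
import java.util.Map;
import java.util.Properties;

public class SamzaStylePollLoop {
    public static void main(String[] args) {
        Properties config = new Properties();
        config.put("bootstrap.servers", "broker1:9092");   // assumed config key
        config.put("group.id", "samza-container-group");   // assumed config key

        KafkaConsumer consumer = new KafkaConsumer(config);

        // Group-managed subscription by topic...
        consumer.subscribe("page-views");
        // ...or, for the fixed-partition mode Samza would want (question 3),
        // subscribe to specific topic/partitions and skip group management:
        // consumer.subscribe(new TopicPartition("page-views", 0));

        while (true) {
            // Non-blocking semantics: poll returns whatever records arrived
            // within the timeout, rather than handing back a blocking iterator.
            Map<String, ConsumerRecords> records = consumer.poll(1000);
            process(records);

            // Kafka-based offset commit; per questions 11/12, Samza would
            // instead checkpoint offsets through its own CheckpointManager.
            consumer.commit(true);
        }
    }

    private static void process(Map<String, ConsumerRecords> records) {
        // Application-specific message handling goes here.
    }
}
```

The tight subscribe/unsubscribe-per-poll pattern from question (3) would replace the single subscribe call above with a subscribe before and an unsubscribe after each poll(), which is exactly why I'm asking whether those calls are low-overhead.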