Re: New Consumer API discussion

Neha Narkhede Thu, 13 Feb 2014 13:56:02 -0800

2. It returns a list of results. But how can you use the list? The only way
to use the list is to make a map of tp=>offset and then look up results in
this map (or do a for loop over the list for the partition you want). I
recommend that if this is an in-memory check we just do one at a time. E.g.
long committedPosition(
TopicPosition).


This was discussed in the previous emails. There is a choice between
returning a map or a list. Some people found the map to be more usable.

What if we made it:
   long position(TopicPartition tp)
   void seek(TopicOffsetPosition p)
   long committed(TopicPartition tp)
   void commit(TopicOffsetPosition...);

This is fine, but TopicOffsetPosition doesn't make sense. Offset and
Position is confusing. Also both fetch and commit positions are related to
partitions, not topics. Some more options are TopicPartitionPosition or
TopicPartitionOffset. And we should use either position everywhere in Kafka
or offset but having both is confusing.

   void seek(TopicOffsetPosition p)
   long committed(TopicPartition tp)

Whether these are batched or not really depends on how flexible we want
these APIs to be. The question is whether we allow a consumer to fetch or
set the offsets for partitions that it doesn't own or consume. For example,
if I choose to skip group management and do my own partition assignment but
choose Kafka based offset management. I could imagine a use case where I
want to change the partition assignment on the fly, and to do that, I would
need to fetch the last committed offsets of partitions that I currently
don't consume.

If we want to allow this, these APIs would be more performant if batched.
And would probably look like -
   Map<TopicPartition, Long> positions(TopicPartition... tp)
   void seek(TopicOffsetPosition... p)
   Map<TopicPartition, Long> committed(TopicPartition... tp)
   void commit(TopicOffsetPosition...)

These are definitely more clunky than the non batched ones though.

Thanks,
Neha



On Thu, Feb 13, 2014 at 1:24 PM, Jay Kreps <jay.kr...@gmail.com> wrote:

> Hey guys,
>
> One thing that bugs me is the lack of symmetric for the different position
> calls. The way I see it there are two positions we maintain: the fetch
> position and the last commit position. There are two things you can do to
> these positions: get the current value or change the current value. But the
> names somewhat obscure this:
>   Fetch position:
>     - No get
>     - set by positions(TopicOffsetPosition...)
>   Committed position:
>     - get by List<TopicOffsetPosition> lastCommittedPosition(
> TopicPartition...)
>     - set by commit or commitAsync
>
> The lastCommittedPosition is particular bothersome because:
> 1. The name is weird and long
> 2. It returns a list of results. But how can you use the list? The only way
> to use the list is to make a map of tp=>offset and then look up results in
> this map (or do a for loop over the list for the partition you want). I
> recommend that if this is an in-memory check we just do one at a time. E.g.
> long committedPosition(TopicPosition).
>
> What if we made it:
>    long position(TopicPartition tp)
>    void seek(TopicOffsetPosition p)
>    long committed(TopicPartition tp)
>    void commit(TopicOffsetPosition...);
>
> This still isn't terribly consistent, but I think it is better.
>
> I would also like to shorten the name TopicOffsetPosition. Offset and
> Position are duplicative of each other. So perhaps we could call it a
> PartitionOffset or a TopicPosition or something like that. In general class
> names that are just a concatenation of the fields (e.g.
> TopicAndPartitionAndOffset) seem kind of lazy to me since the name doesn't
> really describe it just enumerates. But that is more of a nit pick.
>
> -Jay
>
>
> On Mon, Feb 10, 2014 at 10:54 AM, Neha Narkhede <neha.narkh...@gmail.com
> >wrote:
>
> > As mentioned in previous emails, we are also working on a
> re-implementation
> > of the consumer. I would like to use this email thread to discuss the
> > details of the public API. I would also like us to be picky about this
> > public api now so it is as good as possible and we don't need to break it
> > in the future.
> >
> > The best way to get a feel for the API is actually to take a look at the
> > javadoc<
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> > >,
> > the hope is to get the api docs good enough so that it is
> self-explanatory.
> > You can also take a look at the configs
> > here<
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html
> > >
> >
> > Some background info on implementation:
> >
> > At a high level the primary difference in this consumer is that it
> removes
> > the distinction between the "high-level" and "low-level" consumer. The
> new
> > consumer API is non blocking and instead of returning a blocking
> iterator,
> > the consumer provides a poll() API that returns a list of records. We
> think
> > this is better compared to the blocking iterators since it effectively
> > decouples the threading strategy used for processing messages from the
> > consumer. It is worth noting that the consumer is entirely single
> threaded
> > and runs in the user thread. The advantage is that it can be easily
> > rewritten in less multi-threading-friendly languages. The consumer
> batches
> > data and multiplexes I/O over TCP connections to each of the brokers it
> > communicates with, for high throughput. The consumer also allows long
> poll
> > to reduce the end-to-end message latency for low throughput data.
> >
> > The consumer provides a group management facility that supports the
> concept
> > of a group with multiple consumer instances (just like the current
> > consumer). This is done through a custom heartbeat and group management
> > protocol transparent to the user. At the same time, it allows users the
> > option to subscribe to a fixed set of partitions and not use group
> > management at all. The offset management strategy defaults to Kafka based
> > offset management and the API provides a way for the user to use a
> > customized offset store to manage the consumer's offsets.
> >
> > A key difference in this consumer also is the fact that it does not
> depend
> > on zookeeper at all.
> >
> > More details about the new consumer design are
> > here<
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design
> > >
> >
> > Please take a look at the new
> > API<
> >
> http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html
> > >and
> > give us any thoughts you may have.
> >
> > Thanks,
> > Neha
> >
>

Re: New Consumer API discussion

Reply via email to