Neha, what does the use of the RebalanceBeginCallback and RebalanceEndCallback 
look like?

thanks,
Rob

On Feb 25, 2014, at 3:51 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> How do you know n? The whole point is that you need to be able to fetch the
> end offset. You can't a priori decide you will load 1m messages without
> knowing what is there.
> 
> Hmm. I think what you are pointing out is that in the new consumer API, we
> don't have a way to issue the equivalent of the existing getOffsetsBefore()
> API. Agree that is a flaw that we should fix.
> 
> Will update the docs/wiki with a few use cases that I've collected so far
> and see if the API covers those.
> 
> I would prefer PartitionsAssigned and PartitionsRevoked as that seems
> clearer to me
> 
> Well the RebalanceBeginCallback interface will have onPartitionsAssigned()
> as the callback. Similarly, the RebalanceEndCallback interface will have
> onPartitionsRevoked() as the callback. Makes sense?
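To make that concrete for Rob's question at the top of the thread, here is a sketch of how those two single-method interfaces could look and be used. Everything below is an assumption drawn from this discussion, not a released API: the proposal also passes the Consumer itself as a callback argument (omitted here for brevity), and TopicPartition is mocked so the example is self-contained. Single-method interfaces are what keeps the callbacks forward compatible with Java 8 lambdas, as discussed further down the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal stand-in for the real TopicPartition class.
class TopicPartition {
    final String topic;
    final int partition;
    TopicPartition(String topic, int partition) {
        this.topic = topic;
        this.partition = partition;
    }
    @Override
    public String toString() { return topic + "-" + partition; }
}

// Hypothetical single-method callback interfaces, lambda-friendly in Java 8.
interface RebalanceBeginCallback {
    void onPartitionsAssigned(TopicPartition... partitions);
}

interface RebalanceEndCallback {
    void onPartitionsRevoked(TopicPartition... partitions);
}

public class CallbackSketch {
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        RebalanceBeginCallback begin =
            ps -> log.add("assigned " + Arrays.toString(ps));
        RebalanceEndCallback end =
            ps -> log.add("revoked " + Arrays.toString(ps));

        TopicPartition tp = new TopicPartition("events", 0);
        // A consumer would fire these around a rebalance:
        end.onPartitionsRevoked(tp);    // before: commit/flush per-partition state
        begin.onPartitionsAssigned(tp); // after: re-initialize for new partitions
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The revoked callback fires before partitions are taken away (a natural place to commit offsets), and the assigned callback fires once the new assignment is known.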
> 
> Thanks,
> Neha
> 
> 
> On Tue, Feb 25, 2014 at 2:38 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> 
>> 1. I would prefer PartitionsAssigned and PartitionsRevoked as that seems
>> clearer to me.
>> 
>> -Jay
>> 
>> 
>> On Tue, Feb 25, 2014 at 10:19 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>> 
>>> Thanks for the reviews so far! There are a few outstanding questions -
>>> 
>>> 1. It will be good to make the rebalance callbacks forward compatible
>>> with Java 8 capabilities. We can change it to PartitionsAssignedCallback
>>> and PartitionsRevokedCallback, or RebalanceBeginCallback and
>>> RebalanceEndCallback?
>>> 
>>> If there are no objections, I will change it to RebalanceBeginCallback
>>> and RebalanceEndCallback.
>>> 
>>> 2. The return type for committed() is List<TopicPartitionOffset>. There
>>> was a suggestion to change it to either Map<TopicPartition, Long> or
>>> Map<TopicPartition, TopicPartitionOffset>.
>>> 
>>> Do people have feedback on this suggestion?
>>> 
>>> 
>>> On Tue, Feb 25, 2014 at 9:56 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>>> 
>>>> Robert,
>>>> 
>>>> Are you saying it is possible to get events from the high-level
>>>> consumer regarding various state machine changes?  For instance, can we
>>>> get a notification when a rebalance starts and ends, when a partition
>>>> is assigned/unassigned, when an offset is committed on a partition,
>>>> when a leader changes and so on?  I call this OOB traffic, since they
>>>> are not the core messages streaming, but side-band events, yet they
>>>> are still potentially useful to consumers.
>>>> 
>>>> In the current proposal, you get notified when the state machine
>>>> changes, i.e. before and after a rebalance is triggered. Look at
>>>> ConsumerRebalanceCallback
>>>> <http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerRebalanceCallback.html>.
>>>> Leader changes do not count as state machine changes for consumer
>>>> rebalance purposes.
>>>> 
>>>> Thanks,
>>>> Neha
>>>> 
>>>> 
>>>> On Tue, Feb 25, 2014 at 9:54 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>>>> 
>>>>> Jay/Robert -
>>>>> 
>>>>> 
>>>>> I think what Robert is saying is that we need to think through the
>>>>> offset API to enable "batch processing" of topic data. Think of a
>>>>> process that periodically kicks off to compute a data summary or do a
>>>>> data load or something like that. I think what we need to support this
>>>>> is an api to fetch the last offset from the server for a partition.
>>>>> Something like
>>>>> 
>>>>>   long lastOffset(TopicPartition tp)
>>>>> 
>>>>> and for symmetry
>>>>> 
>>>>>   long firstOffset(TopicPartition tp)
>>>>> 
>>>>> Likely this would have to be batched.
>>>>> 
>>>>> A fixed range of data load can be done using the existing APIs as
>>>>> follows. This assumes you know the endOffset, which can be
>>>>> currentOffset + n (the number of messages in the load):
>>>>> 
>>>>> long startOffset = consumer.position(partition);
>>>>> long endOffset = startOffset + n;
>>>>> while(consumer.position(partition) <= endOffset) {
>>>>>     List<ConsumerRecord> messages = consumer.poll(timeout, TimeUnit.MILLISECONDS);
>>>>>     process(messages, endOffset);  // processes messages until endOffset
>>>>> }
>>>>> 
>>>>> Does that make sense?
>>>>> 
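The bounded-load pattern above can be exercised against a toy in-memory consumer. MockConsumer, position(), and poll() below are illustrative stand-ins for the proposed API, not the real client. One subtlety the sketch makes explicit: the final poll() may return messages past endOffset, so the processing step has to ignore the tail, which is why the snippet above passes endOffset to process().

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory stand-in for the proposed consumer; position() and poll()
// mirror the names in this thread but this is not the real Kafka client.
class MockConsumer {
    private final List<String> log;  // the partition's message log
    private long position;           // next offset to fetch

    MockConsumer(List<String> log, long startOffset) {
        this.log = log;
        this.position = startOffset;
    }

    long position() { return position; }

    // Returns up to max locally available messages, advancing the position.
    List<String> poll(int max) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < max && position < log.size()) {
            batch.add(log.get((int) position++));
        }
        return batch;
    }
}

public class BoundedLoad {
    // Drain exactly [startOffset, startOffset + n) and stop; the tail of
    // the final batch may land past endOffset and must be ignored.
    static List<String> load(MockConsumer consumer, long n) {
        long endOffset = consumer.position() + n;
        List<String> processed = new ArrayList<>();
        while (consumer.position() < endOffset) {
            long batchStart = consumer.position();
            List<String> batch = consumer.poll(2);
            if (batch.isEmpty()) break;  // log exhausted before the bound
            for (int i = 0; i < batch.size(); i++) {
                if (batchStart + i < endOffset) {
                    processed.add(batch.get(i));  // process(msg)
                }
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        for (int i = 0; i < 5; i++) log.add("m" + i);
        System.out.println(load(new MockConsumer(log, 1), 3));  // [m1, m2, m3]
    }
}
```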
>>>>> 
>>>>> On Tue, Feb 25, 2014 at 9:49 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>>>>> 
>>>>>> Thanks for the review, Jun. Here are some comments -
>>>>>> 
>>>>>> 
>>>>>> 1. The use of ellipsis: This may make passing a list of items from a
>>>>>> collection to the api a bit harder. Suppose that you have a list of
>>>>>> topics stored in
>>>>>> 
>>>>>> ArrayList<String> topics;
>>>>>> 
>>>>>> If you want to subscribe to all topics in one call, you will have to do:
>>>>>> 
>>>>>> String[] topicArray = new String[topics.size()];
>>>>>> consumer.subscribe(topics.toArray(topicArray));
>>>>>> 
>>>>>> A similar argument can be made for arguably the more common use case
>>>>>> of subscribing to a single topic as well. In these cases, the user is
>>>>>> required to write more code to create a single-item collection and
>>>>>> pass it in. Since subscription is extremely lightweight, invoking it
>>>>>> multiple times also seems like a workable solution, no?
>>>>>> 
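The two styles weighed in point 1 (converting the collection for one varargs call versus several lightweight subscribe() calls) can be sketched side by side. RecordingConsumer here is a hypothetical stand-in that only records what was subscribed; it is not the proposed client.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative consumer that just records subscriptions; not the real client.
class RecordingConsumer {
    final List<String> subscribed = new ArrayList<>();
    void subscribe(String... topics) {
        for (String t : topics) subscribed.add(t);
    }
}

public class SubscribeVarargs {
    public static void main(String[] args) {
        List<String> topics = new ArrayList<>();
        topics.add("clicks");
        topics.add("impressions");

        // Option A: convert the collection once for a single varargs call.
        RecordingConsumer a = new RecordingConsumer();
        a.subscribe(topics.toArray(new String[0]));

        // Option B: subscription is lightweight, so just call once per topic.
        RecordingConsumer b = new RecordingConsumer();
        for (String t : topics) b.subscribe(t);

        System.out.println(a.subscribed.equals(b.subscribed));  // true
    }
}
```

Both styles end in the same subscription state, which is the crux of the argument that varargs plus repeated calls is a workable substitute for a collection-taking overload.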
>>>>>> 2. It would be good to document that the following apis are mutually
>>>>>> exclusive. Also, if the partition level subscription is specified,
>>>>>> there is no group management. Finally, unsubscribe() can only be used
>>>>>> to cancel subscriptions with the same pattern. For example, you can't
>>>>>> unsubscribe at the partition level if the subscription is done at the
>>>>>> topic level.
>>>>>> 
>>>>>> subscribe(java.lang.String... topics)
>>>>>> subscribe(java.lang.String topic, int... partitions)
>>>>>> 
>>>>>> Makes sense. Made the suggested improvements to the docs
>>>>>> <http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/Consumer.html#subscribe%28java.lang.String...%29>
>>>>>> 
>>>>>> 
>>>>>> 3. commit(): The following comment in the doc should probably say
>>>>>> "commit offsets for partitions assigned to this consumer".
>>>>>> 
>>>>>> If no partitions are specified, commits offsets for the subscribed
>>>>>> list of topics and partitions to Kafka.
>>>>>> 
>>>>>> Could you give more context on this suggestion? Here is the entire doc:
>>>>>> 
>>>>>> Synchronously commits the specified offsets for the specified list of
>>>>>> topics and partitions to Kafka. If no partitions are specified,
>>>>>> commits offsets for the subscribed list of topics and partitions.
>>>>>> 
>>>>>> The hope is to convey that if no partitions are specified, offsets
>>>>>> will be committed for the subscribed list of partitions. One
>>>>>> improvement could be to explicitly state that the offsets returned on
>>>>>> the last poll will be committed. I updated this to:
>>>>>> 
>>>>>> Synchronously commits the specified offsets for the specified list of
>>>>>> topics and partitions to Kafka. If no offsets are specified, commits
>>>>>> offsets returned on the last {@link #poll(long, TimeUnit) poll()} for
>>>>>> the subscribed list of topics and partitions.
>>>>>> 
>>>>>> 4. There is inconsistency in specifying partitions. Sometimes we use
>>>>>> TopicPartition and some other times we use String and int (see
>>>>>> examples below).
>>>>>> 
>>>>>> void onPartitionsAssigned(Consumer consumer, TopicPartition... partitions)
>>>>>> 
>>>>>> public void subscribe(java.lang.String topic, int... partitions)
>>>>>> 
>>>>>> Yes, this was discussed previously. I think generally the consensus
>>>>>> seems to be to use the higher level classes everywhere. Made those
>>>>>> changes.
>>>>>> 
>>>>>> What's the use case of position()? Isn't that just the nextOffset()
>>>>>> on the last message returned from poll()?
>>>>>> 
>>>>>> Yes, except in the case where a rebalance is triggered and poll() is
>>>>>> not yet invoked. Here, you would use position() to get the new fetch
>>>>>> position for the specific partition. Even if this is not a common use
>>>>>> case, IMO it is much easier to use position() to get the fetch offset
>>>>>> than invoking nextOffset() on the last message. This also keeps the
>>>>>> APIs symmetric, which is nice.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Mon, Feb 24, 2014 at 7:06 PM, Withers, Robert <robert.with...@dish.com> wrote:
>>>>>> 
>>>>>>> That's wonderful.  Thanks for kafka.
>>>>>>> 
>>>>>>> Rob
>>>>>>> 
>>>>>>> On Feb 24, 2014, at 9:58 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi Robert,
>>>>>>> 
>>>>>>> Yes, you can check out the callback functions in the new API
>>>>>>> 
>>>>>>> onPartitionDeassigned
>>>>>>> onPartitionAssigned
>>>>>>> 
>>>>>>> and see if they meet your needs.
>>>>>>> 
>>>>>>> Guozhang
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Feb 24, 2014 at 8:18 AM, Withers, Robert <robert.with...@dish.com> wrote:
>>>>>>> 
>>>>>>> Jun,
>>>>>>> 
>>>>>>> Are you saying it is possible to get events from the high-level
>>>>>>> consumer regarding various state machine changes?  For instance, can
>>>>>>> we get a notification when a rebalance starts and ends, when a
>>>>>>> partition is assigned/unassigned, when an offset is committed on a
>>>>>>> partition, when a leader changes and so on?  I call this OOB
>>>>>>> traffic, since they are not the core messages streaming, but
>>>>>>> side-band events, yet they are still potentially useful to consumers.
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> Robert
>>>>>>> 
>>>>>>> 
>>>>>>> Robert Withers
>>>>>>> Staff Analyst/Developer
>>>>>>> o: (720) 514-8963
>>>>>>> c:  (571) 262-1873
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Jun Rao [mailto:jun...@gmail.com]
>>>>>>> Sent: Sunday, February 23, 2014 4:19 PM
>>>>>>> To: users@kafka.apache.org
>>>>>>> Subject: Re: New Consumer API discussion
>>>>>>> 
>>>>>>> Robert,
>>>>>>> 
>>>>>>> For the push oriented api, you can potentially implement your own
>>>>>>> MessageHandler with those methods. In the main loop of our new
>>>>>>> consumer api, you can just call those methods based on the events
>>>>>>> you get.
>>>>>>> 
>>>>>>> Also, we already have an api to get the first and the last offset
>>>>>>> of a partition (getOffsetsBefore).
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Jun
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Feb 22, 2014 at 11:29 AM, Withers, Robert
>>>>>>> <robert.with...@dish.com> wrote:
>>>>>>> 
>>>>>>> This is a good idea, too.  I would modify it to include stream
>>>>>>> marking, then you can have:
>>>>>>> 
>>>>>>> long end = consumer.lastOffset(tp);
>>>>>>> consumer.setMark(end);
>>>>>>> while(consumer.beforeMark()) {
>>>>>>>  process(consumer.pollToMark());
>>>>>>> }
>>>>>>> 
>>>>>>> or
>>>>>>> 
>>>>>>> long end = consumer.lastOffset(tp);
>>>>>>> consumer.setMark(end);
>>>>>>> for(Object msg : consumer.iteratorToMark()) {
>>>>>>>  process(msg);
>>>>>>> }
>>>>>>> 
>>>>>>> I actually have 4 suggestions, then:
>>>>>>> 
>>>>>>> *   pull: stream marking
>>>>>>> *   pull: finite streams, bound by time range (up-to-now, yesterday)
>>>>>>>     or offset
>>>>>>> *   pull: async api
>>>>>>> *   push: KafkaMessageSource, for a push model, with msg and OOB
>>>>>>>     events. Build one in either individual or chunk mode and have a
>>>>>>>     listener for each msg or a listener for a chunk of msgs. Make it
>>>>>>>     composable and policy driven (chunked, range, commitOffsets
>>>>>>>     policy, retry policy, transactional)
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> Robert
>>>>>>> 
>>>>>>> On Feb 22, 2014, at 11:21 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
>>>>>>> 
>>>>>>> I think what Robert is saying is that we need to think through the
>>>>>>> offset API to enable "batch processing" of topic data. Think of a
>>>>>>> process that periodically kicks off to compute a data summary or do
>>>>>>> a data load or something like that. I think what we need to support
>>>>>>> this is an api to fetch the last offset from the server for a
>>>>>>> partition. Something like
>>>>>>> 
>>>>>>> long lastOffset(TopicPartition tp)
>>>>>>> 
>>>>>>> and for symmetry
>>>>>>> 
>>>>>>> long firstOffset(TopicPartition tp)
>>>>>>> 
>>>>>>> Likely this would have to be batched. Essentially we should add this
>>>>>>> use case to our set of code examples to write and think through.
>>>>>>> 
>>>>>>> The usage would be something like
>>>>>>> 
>>>>>>> long end = consumer.lastOffset(tp);
>>>>>>> while(consumer.position < end)
>>>>>>>   process(consumer.poll());
>>>>>>> 
>>>>>>> -Jay
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Feb 22, 2014 at 1:52 AM, Withers, Robert
>>>>>>> <robert.with...@dish.com> wrote:
>>>>>>> 
>>>>>>> Jun,
>>>>>>> 
>>>>>>> I was originally thinking a non-blocking read from a distributed
>>>>>>> stream should distinguish between "no local messages, but a fetch is
>>>>>>> occurring" versus "you have drained the stream".  The reason this
>>>>>>> may be valuable to me is so I can write consumers that read all
>>>>>>> known traffic then terminate.  You caused me to reconsider and I
>>>>>>> think I am conflating 2 things.  One is a sync/async api while the
>>>>>>> other is whether to have an infinite or finite stream.  Is it
>>>>>>> possible to build a finite KafkaStream on a range of messages?
>>>>>>> 
>>>>>>> Perhaps a Simple Consumer would do just fine and then I could start
>>>>>>> off getting the writeOffset from zookeeper and tell it to read a
>>>>>>> specified range per partition.  I've done this and forked a simple
>>>>>>> consumer runnable for each partition, for one of our analyzers.  The
>>>>>>> great thing about the high-level consumer is the rebalance, so I can
>>>>>>> fork however many stream readers I want and you just figure it out
>>>>>>> for me.  In that way you offer us the control over the resource
>>>>>>> consumption within a pull model.  This is best to regulate message
>>>>>>> pressure, they say.
>>>>>>> 
>>>>>>> Combining that high-level rebalance ability with a ranged partition
>>>>>>> drain could be really nice...build the stream with an ending
>>>>>>> position and it is a finite stream, but retain the high-level
>>>>>>> rebalance.  With a finite stream, you would know the difference
>>>>>>> between the 2 async scenarios: fetch-in-progress versus
>>>>>>> end-of-stream.  With an infinite stream, you never get end-of-stream.
>>>>>>> 
>>>>>>> Aside from a high-level consumer over a finite range within each
>>>>>>> partition, the other feature I can think of is more complicated.  A
>>>>>>> high-level consumer has state machine changes that the client cannot
>>>>>>> access, to my knowledge.  Our use of kafka has us invoke a message
>>>>>>> handler with each message we consume from the KafkaStream, so we
>>>>>>> convert a pull-model to a push-model.  Including the idea of
>>>>>>> receiving notifications from state machine changes, what would be
>>>>>>> really nice is to have a KafkaMessageSource that is an eventful push
>>>>>>> model.  If it were thread-safe, then we could register listeners for
>>>>>>> various events:
>>>>>>> 
>>>>>>> *   opening-stream
>>>>>>> *   closing-stream
>>>>>>> *   message-arrived
>>>>>>> *   end-of-stream/no-more-messages-in-partition (for finite streams)
>>>>>>> *   rebalance started
>>>>>>> *   partition assigned
>>>>>>> *   partition unassigned
>>>>>>> *   rebalance finished
>>>>>>> *   partition-offset-committed
>>>>>>> 
>>>>>>> Perhaps that is just our use, but instead of a pull-oriented
>>>>>>> KafkaStream, is there any sense in your providing a push-oriented
>>>>>>> KafkaMessageSource publishing OOB messages?
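A minimal sketch of the push-oriented source described above, assuming a finite stream: the pull loop lives inside the source, which fans each message (and an end-of-stream event) out to registered listeners. All names here (KafkaMessageSource, MessageListener) are hypothetical; nothing like this exists in the proposal, and the OOB events are reduced to just two for brevity.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical listener interface for message + OOB events.
interface MessageListener {
    void onMessage(String msg);
    void onEndOfStream();
}

// Hypothetical push-oriented source: drives a pull iterator internally and
// publishes each message (and end-of-stream, for finite streams) to listeners.
class KafkaMessageSource {
    private final List<MessageListener> listeners = new ArrayList<>();
    private final Iterator<String> stream;  // stand-in for a finite KafkaStream

    KafkaMessageSource(Iterator<String> stream) { this.stream = stream; }

    void addListener(MessageListener l) { listeners.add(l); }

    // Pull-to-push conversion: the pull loop lives here, not in user code.
    void run() {
        while (stream.hasNext()) {
            String msg = stream.next();
            for (MessageListener l : listeners) l.onMessage(msg);
        }
        for (MessageListener l : listeners) l.onEndOfStream();
    }
}

public class PushDemo {
    static List<String> events(List<String> msgs) {
        List<String> log = new ArrayList<>();
        KafkaMessageSource source = new KafkaMessageSource(msgs.iterator());
        source.addListener(new MessageListener() {
            public void onMessage(String m) { log.add("msg:" + m); }
            public void onEndOfStream() { log.add("eos"); }
        });
        source.run();
        return log;
    }

    public static void main(String[] args) {
        System.out.println(events(java.util.Arrays.asList("a", "b")));
    }
}
```

The full event list above (rebalance started/finished, partition assigned/unassigned, offset committed, and so on) would just be more methods on the listener interface, following the same fan-out pattern.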
>>>>>>> 
>>>>>>> thank you,
>>>>>>> Robert
>>>>>>> 
>>>>>>> On Feb 21, 2014, at 5:59 PM, Jun Rao <jun...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Robert,
>>>>>>> 
>>>>>>> Could you explain why you want to distinguish btw
>>>>>>> FetchingInProgressException and NoMessagePendingException? The
>>>>>>> nextMsgs() method that you want is exactly what poll() does.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Jun
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Feb 19, 2014 at 8:45 AM, Withers, Robert
>>>>>>> <robert.with...@dish.com> wrote:
>>>>>>> 
>>>>>>> I am not clear on why the consumer stream should be positionable,
>>>>>>> especially if it is limited to the in-memory fetched messages.
>>>>>>> Could someone explain to me, please?  I really like the idea of
>>>>>>> committing the offset specifically on those partitions with changed
>>>>>>> read offsets, only.
>>>>>>> 
>>>>>>> 2 items I would like to see added to the KafkaStream are:
>>>>>>> 
>>>>>>> *   a non-blocking next(), which throws several exceptions
>>>>>>>     (FetchingInProgressException and a NoMessagePendingException or
>>>>>>>     something) to differentiate between fetching or no messages left.
>>>>>>> *   a nextMsgs() method which returns all locally available messages
>>>>>>>     and kicks off a fetch for the next chunk.
>>>>>>> 
>>>>>>> If you are trying to add transactional features, then formally
>>>>>>> define a DTP capability and pull in other server frameworks to share
>>>>>>> the implementation.  Should it be XA/Open?  How about a new
>>>>>>> peer2peer DTP protocol?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> 
>>>>>>> Robert
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Jay Kreps [mailto:jay.kr...@gmail.com]
>>>>>>> Sent: Sunday, February 16, 2014 10:13 AM
>>>>>>> To: users@kafka.apache.org
>>>>>>> Subject: Re: New Consumer API discussion
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> +1 I think those are good. It is a little weird that changing the
>>>>>>> fetch point is not batched but changing the commit point is, but I
>>>>>>> suppose there is no helping that.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -Jay
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Feb 15, 2014 at 7:52 AM, Neha Narkhede
>>>>>>> <neha.narkh...@gmail.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Jay,
>>>>>>> 
>>>>>>> That makes sense. position/seek deal with changing the consumer's
>>>>>>> in-memory data, so there is no remote rpc there. For some reason, I
>>>>>>> got committed and seek mixed up in my head at that time :)
>>>>>>> 
>>>>>>> So we still end up with
>>>>>>> 
>>>>>>> long position(TopicPartition tp)
>>>>>>> void seek(TopicPartitionOffset p)
>>>>>>> Map<TopicPartition, Long> committed(TopicPartition tp);
>>>>>>> void commit(TopicPartitionOffset...);
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Neha
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Friday, February 14, 2014, Jay Kreps <jay.kr...@gmail.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Oh, interesting. So I am assuming the following implementation:
>>>>>>> 
>>>>>>> 1. We have an in-memory fetch position which controls the next fetch
>>>>>>> offset.
>>>>>>> 2. Changing this has no effect until you poll again, at which point
>>>>>>> your fetch request will be from the newly specified offset.
>>>>>>> 3. We then have an in-memory but also remotely stored committed
>>>>>>> offset.
>>>>>>> 4. Calling commit has the effect of saving the fetch position as
>>>>>>> both the in-memory committed position and in the remote store.
>>>>>>> 5. Auto-commit is the same as periodically calling commit on all
>>>>>>> positions.
>>>>>>> 
>>>>>>> So batching on commit as well as getting the committed position
>>>>>>> makes sense, but batching the fetch position wouldn't, right? I
>>>>>>> think you are actually thinking of a different approach.
>>>>>>> 
>>>>>>> -Jay
>>>>>>> 
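Jay's five assumptions can be captured in a toy state model: seek() is a purely in-memory change, poll() advances the fetch position, and commit() snapshots all positions both in memory and to a stand-in "remote store". This is a sketch of the assumed semantics only, not Kafka code; partitions are plain strings and poll() just advances by a count.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the assumptions above: an in-memory fetch position per
// partition, plus an in-memory and "remote" committed offset; commit()
// snapshots every fetch position (the batched behavior being discussed).
class OffsetModel {
    final Map<String, Long> position = new HashMap<>();    // next fetch offset
    final Map<String, Long> committed = new HashMap<>();   // in-memory committed
    final Map<String, Long> remoteStore = new HashMap<>(); // stand-in for broker

    // Assumption 1-2: seek is in-memory only; it takes effect at next poll.
    void seek(String tp, long offset) { position.put(tp, offset); }

    // A poll of n messages advances the fetch position.
    long poll(String tp, int n) {
        long next = position.getOrDefault(tp, 0L) + n;
        position.put(tp, next);
        return next;
    }

    // Assumption 3-4: commit saves fetch positions locally and remotely.
    void commit() {
        committed.putAll(position);
        remoteStore.putAll(position);
    }
}

public class OffsetModelDemo {
    public static void main(String[] args) {
        OffsetModel m = new OffsetModel();
        m.seek("t-0", 100);      // no rpc; takes effect on the next poll
        m.poll("t-0", 5);        // fetches from 100, position -> 105
        m.commit();              // batched: snapshots all positions at once
        System.out.println(m.committed.get("t-0"));  // prints 105
    }
}
```

Auto-commit (assumption 5) would just be a timer invoking commit() periodically, which is why batching commit, but not seek, falls out naturally of this model.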
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Feb 13, 2014 at 10:40 PM, Neha Narkhede
>>>>>>> <neha.narkh...@gmail.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I think you are saying both, i.e. if you have committed on a
>>>>>>> partition it returns you that value but if you haven't it does a
>>>>>>> remote lookup?
>>>>>>> 
>>>>>>> Correct.
>>>>>>> 
>>>>>>> The other argument for making committed batched is that commit()
>>>>>>> is batched, so there is symmetry.
>>>>>>> 
>>>>>>> position() and seek() are always in memory changes (I assume) so
>>>>>>> there is no need to batch them.
>>>>>>> 
>>>>>>> I'm not as sure as you are about that assumption being true.
>>>>>>> Basically in my example above, the batching argument for committed()
>>>>>>> also applies to position() since one purpose of fetching a
>>>>>>> partition's offset is to use it to set the position of the consumer
>>>>>>> to that offset. Since that might lead to a remote OffsetRequest
>>>>>>> call, I think we probably would be better off batching it.
>>>>>>> 
>>>>>>> Another option for naming would be position/reposition instead of
>>>>>>> position/seek.
>>>>>>> 
>>>>>>> I think position/seek is better since it aligns with Java file APIs.
>>>>>>> 
>>>>>>> I also think your suggestion about ConsumerPosition makes sense.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Neha
>>>>>>> 
>>>>>>> On Feb 13, 2014 9:22 PM, "Jay Kreps" <jay.kr...@gmail.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Hey Neha,
>>>>>>> 
>>>>>>> I actually wasn't proposing the name TopicOffsetPosition, that was
>>>>>>> just a typo. I meant TopicPartitionOffset, and I was just
>>>>>>> referencing what was in the javadoc. So to restate my proposal
>>>>>>> without the typo, using just the existing classes (that naming is a
>>>>>>> separate question):
>>>>>>> 
>>>>>>> long position(TopicPartition tp)
>>>>>>> void seek(TopicPartitionOffset p)
>>>>>>> long committed(TopicPartition tp)
>>>>>>> void commit(TopicPartitionOffset...);
>>>>>>> 
>>>>>>> So I may be unclear on committed() (AKA lastCommittedOffset). Is it
>>>>>>> returning the in-memory value from the last commit by this consumer,
>>>>>>> or is it doing a remote fetch, or both? I think you are saying both,
>>>>>>> i.e. if you have committed on a partition it returns you that value
>>>>>>> but if you haven't it does a remote lookup?
>>>>>>> 
>>>>>>> The other argument for making committed batched is that commit() is
>>>>>>> batched, so there is symmetry.
>>>>>>> 
>>>>>>> position() and seek() are always in memory changes (I assume) so
>>>>>>> there is no need to batch them.
>>>>>>> 
>>>>>>> So taking all that into account, what if we revise it to
>>>>>>> 
>>>>>>> long position(TopicPartition tp)
>>>>>>> void seek(TopicPartitionOffset p)
>>>>>>> Map<TopicPartition, Long> committed(TopicPartition tp);
>>>>>>> void commit(TopicPartitionOffset...);
>>>>>>> 
>>>>>>> This is not symmetric between position/seek and commit/committed but
>>>>>>> it is convenient. Another option for naming would be
>>>>>>> position/reposition instead of position/seek.
>>>>>>> 
>>>>>>> With respect to the name TopicPartitionOffset, what I was trying to
>>>>>>> say is that I recommend we change that to something shorter. I think
>>>>>>> TopicPosition or ConsumerPosition might be better. Position does not
>>>>>>> refer to the variables in the object, it refers to the meaning of
>>>>>>> the object--it represents a position within a topic. The offset
>>>>>>> field in that object is still called the offset. TopicOffset,
>>>>>>> PartitionOffset, or ConsumerOffset would all be workable too.
>>>>>>> Basically I am just objecting to concatenating three nouns
>>>>>>> together. :-)
>>>>>>> 
>>>>>>> -Jay
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Feb 13, 2014 at 1:54 PM, Neha Narkhede
>>>>>>> <neha.narkh...@gmail.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 2. It returns a list of results. But how can you use the list? The
>>>>>>> only way to use the list is to make a map of tp=>offset and then
>>>>>>> look up results in this map (or do a for loop over the list for the
>>>>>>> partition you want). I recommend that if this is an in-memory check
>>>>>>> we just do one at a time. E.g.
>>>>>>> 
>>>>>>> long committedPosition(TopicPosition).
>>>>>>> 
>>>>>>> This was discussed in the previous emails. There is a choic
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Robert Withers
>>>>>>> robert.with...@dish.com
>>>>>>> c: 303.919.5856
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> -- Guozhang
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
