Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-13 Thread Colin McCabe
That is a good point-- we should get KIP-396 voted on. I will review it today. best, Colin On Tue, Aug 13, 2019, at 05:58, Gabor Somogyi wrote: > I've had a look on KIP-396 and until now only 1 binding vote arrived. Hope > others would consider it as a good solution... > > G > > > On Tue, Au

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-13 Thread Dongjin Lee
Sorry for being late. It seems like I found a case which requires a method to update Consumer metadata. In short, kafka-console-consumer.sh is working differently from 2.1.0 for lack of this functionality. https://issues.apache.org/jira/browse/KAFKA-8789 https://github.com/apache/kafka/pull/7206

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-13 Thread Gabor Somogyi
I've had a look on KIP-396 and until now only 1 binding vote arrived. Hope others would consider it as a good solution... G On Tue, Aug 13, 2019 at 11:52 AM Gabor Somogyi wrote: > I've had concerns calling AdminClient.listTopics because on big clusters > I've seen OOM because of too many Topic

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-13 Thread Gabor Somogyi
I've had concerns calling AdminClient.listTopics because on big clusters I've seen OOM because of too many TopicPartitions. On the other this problem already exists in the actual implementation because as Colin said Consumer is doing the same on client side. All in all this part is fine. I've chec

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-12 Thread Jungtaek Lim
So in overall, AdminClient covers the necessary to retrieve up-to-date topic-partitions, whereas KIP-396 will cover the necessary to retrieve offset (EARLIEST, LATEST, timestamp) on partition. Gabor, could you please add the input if I'm missing something? I'd like to double-check on this. Assumi

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-12 Thread Jungtaek Lim
On Tue, Aug 13, 2019 at 10:01 AM Colin McCabe wrote: > On Mon, Aug 12, 2019, at 14:54, Jungtaek Lim wrote: > > Thanks for the feedbacks Colin and Matthias. > > > > I agree with you regarding getting topics and partitions via AdminClient, > > just curious how much the overhead would be. Would it b

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-12 Thread Jungtaek Lim
For now Spark needs to know about exact offset for EARLIEST, LATEST per partition so that it can handle users' query on EARLIEST/LATEST and write exact offset in checkpoint. I guess Spark would also want to validate the known offset, but I guess that could be covered by knowing range of available o

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-12 Thread Jungtaek Lim
Hi Harsha, I'm not sure what exactly the class is doing, but if I can't get all the necessary information from that class, I would end up with calling others and go back to same issue. And skimming the class, it seems to be complicated one (end-users unfriendly, as it's designed to be used interna

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-12 Thread Colin McCabe
On Mon, Aug 12, 2019, at 14:54, Jungtaek Lim wrote: > Thanks for the feedbacks Colin and Matthias. > > I agree with you regarding getting topics and partitions via AdminClient, > just curious how much the overhead would be. Would it be lighter, or > heavier? We may not want to list topics in regul

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-12 Thread Matthias J. Sax
Note that `KafkaConsumer` refreshed it's metadata every 5 minutes by default anyway... (parameter `metadata.max.age.ms`). And of course, you can refresh the metadata you get via AdminClient each time you trigger planning. I cannot quantify the overhead of a single request though. Also, what offset

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-12 Thread Harsha Chintalapani
Hi Jungtaek, Have you looked into this interface https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/MetadataUpdater.java . Right now its not a public interface but does the methods available in this interface work for your needs? . The Defau

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-12 Thread Jungtaek Lim
Thanks for the feedbacks Colin and Matthias. I agree with you regarding getting topics and partitions via AdminClient, just curious how much the overhead would be. Would it be lighter, or heavier? We may not want to list topics in regular intervals - in plan phase we want to know up-to-date inform

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-12 Thread Matthias J. Sax
Thanks for the details Jungtaek! I tend to agree with Colin, that using the AdminClient seems to be the better choice. You can get all topics via `listTopics()` (and you can refresh this information on regular intervals) and match any pattern against the list of available topics in the driver. A

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-12 Thread Gabor Somogyi
Hi Colin, Thanks for your suggestion! Which KIPs are you referring to? BR, G On Mon, Aug 12, 2019 at 5:22 PM Colin McCabe wrote: > Hi, > > If there’s no need to consume records in the Spark driver, then the > Consumer is probably the wrong thing to use. Instead, Spark should use > AdminClient

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-12 Thread Colin McCabe
Hi, If there’s no need to consume records in the Spark driver, then the Consumer is probably the wrong thing to use. Instead, Spark should use AdminClient to find out what partitions exist and where, manage their offsets, and so on. There are some KIPs under discussion now that would add the ne

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-12 Thread Jungtaek Lim
My feeling is that I didn't explain the use case for Spark properly and hence fail to explain the needs. Sorry about this. Spark leverages the single instance of KafkaConsumer in the driver which is registered solely on the consumer group. This is used in the plan phase for each micro-batch to cal

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-12 Thread Gabor Somogyi
Hi Guys, Please see the actual implementation, pretty sure it explains the situation well: https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala To answer one question/assumption which popped up from all of you Spa

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-11 Thread Satish Duggana
Hi Jungtaek, Thanks for the KIP. I have a couple of questions here. Is not Spark using Kafka's consumer group management across multiple consumers? Is Spark using KafkaConsumer#subscribe(Pattern pattern, ConsumerRebalanceListener listener) only to get all the topics for a pattern based subscriptio

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-11 Thread Matthias J. Sax
If am not sure if I fully understand yet. The fact, that Spark does not stores offsets in Kafka but as part of its own checkpoint mechanism seems to be orthogonal. Maybe I am missing something here. As you are using subscribe(), you use Kafka consumer group mechanism, that takes care of the assig

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-11 Thread Jungtaek Lim
Let me elaborate my explanation a bit more. Here we say about Apache Spark, but this will apply for everything which want to control offset of Kafka consumers. Spark is managing the committed offsets and the offsets which should be polled now. Topics and partitions as well. This is required as Spa

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-11 Thread Jungtaek Lim
Sorry I didn't recognize you're also asking it here as well. I'm in favor of describing it in this discussion thread so the discussion itself can go forward. So copying my answer here: We have some use case which we don't just rely on everything what Kafka consumer provides. We want to know curren

Re: [DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-09 Thread Matthias J. Sax
Thanks for the KIP. Can you elaborate a little bit more on the use case for this feature? Why would a consumer need to update it's metadata explicitly? -Matthias On 8/8/19 8:46 PM, Jungtaek Lim wrote: > Hi devs, > > I'd like to initiate discussion around KIP-505, exposing new public method > t

[DISCUSS] KIP-505 : Add new public method to only update assignment metadata in consumer

2019-08-08 Thread Jungtaek Lim
Hi devs, I'd like to initiate discussion around KIP-505, exposing new public method to only update assignment metadata in consumer. `poll(0)` has been misused as according to Kafka doc it doesn't guarantee that it doesn't pull any records, and new method `poll(Duration)` doesn't have same semanti