That is a good point-- we should get KIP-396 voted on. I will review it today.
best,
Colin
On Tue, Aug 13, 2019, at 05:58, Gabor Somogyi wrote:
> I've had a look at KIP-396 and so far only 1 binding vote has arrived. Hope
> others will consider it a good solution...
>
> G
>
>
On Tue, Au…
Sorry for being late.
It seems like I found a case which requires a method to update Consumer
metadata. In short, kafka-console-consumer.sh behaves differently since
2.1.0 due to the lack of this functionality.
https://issues.apache.org/jira/browse/KAFKA-8789
https://github.com/apache/kafka/pull/7206
I've had a look at KIP-396 and so far only 1 binding vote has arrived. Hope
others will consider it a good solution...
G
On Tue, Aug 13, 2019 at 11:52 AM Gabor Somogyi
wrote:
> I've had concerns about calling AdminClient.listTopics because on big clusters
> I've seen OOMs caused by too many TopicPartitions.
I've had concerns about calling AdminClient.listTopics because on big clusters
I've seen OOMs caused by too many TopicPartitions.
On the other hand, this problem already exists in the actual implementation
because, as Colin said, the Consumer is doing the same on the client side. All
in all this part is fine.
I've chec…
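For context, a minimal Java sketch (assuming Kafka's public AdminClient API; the bootstrap address and class name are placeholders, not from the thread) of enumerating topic-partitions in the driver without a Consumer:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartition;
    import java.util.*;

    public class ListTopicPartitions {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            try (AdminClient admin = AdminClient.create(props)) {
                // One round trip for topic names, one for the partition layout.
                Set<String> topics = admin.listTopics().names().get();
                Map<String, TopicDescription> descriptions = admin.describeTopics(topics).all().get();
                List<TopicPartition> partitions = new ArrayList<>();
                for (TopicDescription d : descriptions.values()) {
                    d.partitions().forEach(p -> partitions.add(new TopicPartition(d.name(), p.partition())));
                }
                System.out.println("Known partitions: " + partitions);
            }
        }
    }

Note that describeTopics materializes metadata for every partition at once, which is the memory concern raised above.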
So overall, AdminClient covers what's necessary to retrieve up-to-date
topic-partitions, whereas KIP-396 will cover what's necessary to retrieve
offsets (EARLIEST, LATEST, timestamp) per partition.
Gabor, could you please chime in if I'm missing something? I'd like to
double-check on this.
Assumi…
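For reference, KIP-396 was eventually adopted and shipped as Admin#listOffsets in Kafka 2.5; a minimal sketch of the per-partition offset lookup described above, assuming an already-created AdminClient:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.common.TopicPartition;
    import java.util.Collections;

    public class OffsetLookup {
        // EARLIEST shown; OffsetSpec.latest() and OffsetSpec.forTimestamp(ts) work the same way.
        static long earliestOffset(AdminClient admin, TopicPartition tp) throws Exception {
            return admin.listOffsets(Collections.singletonMap(tp, OffsetSpec.earliest()))
                    .partitionResult(tp)
                    .get()
                    .offset();
        }
    }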
On Tue, Aug 13, 2019 at 10:01 AM Colin McCabe wrote:
> On Mon, Aug 12, 2019, at 14:54, Jungtaek Lim wrote:
> > Thanks for the feedback, Colin and Matthias.
> >
> > I agree with you regarding getting topics and partitions via AdminClient,
> > just curious how much the overhead would be. Would it be lighter, or heavier?…
For now Spark needs to know the exact offsets for EARLIEST and LATEST per
partition so that it can handle users' queries on EARLIEST/LATEST and write
exact offsets in the checkpoint. I guess Spark would also want to validate
known offsets, but I guess that could be covered by knowing the range of
available offsets…
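A sketch of the Consumer-side equivalents of those per-partition lookups (beginningOffsets, endOffsets, offsetsForTimes); the consumer, partition, and timestamp arguments are placeholders:

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;
    import java.util.Collections;
    import java.util.Map;

    public class ConsumerOffsetRange {
        static void printRange(KafkaConsumer<?, ?> consumer, TopicPartition tp, long timestampMs) {
            Map<TopicPartition, Long> earliest = consumer.beginningOffsets(Collections.singleton(tp));
            Map<TopicPartition, Long> latest = consumer.endOffsets(Collections.singleton(tp));
            // First offset whose record timestamp is >= timestampMs (null if none).
            Map<TopicPartition, OffsetAndTimestamp> byTime =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, timestampMs));
            System.out.println(earliest.get(tp) + " .. " + latest.get(tp) + ", byTime: " + byTime.get(tp));
        }
    }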
Hi Harsha,
I'm not sure what exactly the class is doing, but if I can't get all the
necessary information from that class, I would end up calling others and
going back to the same issue. And skimming the class, it seems to be a
complicated one (end-user unfriendly, as it's designed to be used
internally)…
On Mon, Aug 12, 2019, at 14:54, Jungtaek Lim wrote:
> Thanks for the feedback, Colin and Matthias.
>
> I agree with you regarding getting topics and partitions via AdminClient,
> just curious how much the overhead would be. Would it be lighter, or
> heavier? We may not want to list topics at regular intervals…
Note that `KafkaConsumer` refreshes its metadata every 5 minutes by
default anyway... (parameter `metadata.max.age.ms`). And of course, you
can refresh the metadata you get via AdminClient each time you trigger
planning. I cannot quantify the overhead of a single request though.
Also, what offset…
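A small sketch of the setting mentioned above (the 60-second value is an arbitrary example, not a recommendation):

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import java.util.Properties;

    public class ConsumerProps {
        static Properties withFasterMetadataRefresh() {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            // metadata.max.age.ms defaults to 300000 (5 minutes).
            props.put(ConsumerConfig.METADATA_MAX_AGE_CONFIG, "60000");
            return props;
        }
    }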
Hi Jungtaek,
Have you looked into this interface?
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/MetadataUpdater.java
Right now it's not a public interface, but do the methods available in this
interface work for your needs? The Defau…
Thanks for the feedback, Colin and Matthias.
I agree with you regarding getting topics and partitions via AdminClient,
just curious how much the overhead would be. Would it be lighter, or
heavier? We may not want to list topics at regular intervals - in the plan
phase we want to know up-to-date information…
Thanks for the details, Jungtaek!
I tend to agree with Colin that using the AdminClient seems to be the
better choice.
You can get all topics via `listTopics()` (and you can refresh this
information at regular intervals) and match any pattern against the list
of available topics in the driver.
A…
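A sketch of that driver-side pattern matching (the helper name is hypothetical):

    import org.apache.kafka.clients.admin.AdminClient;
    import java.util.Set;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class TopicPatternResolver {
        // Resolve a subscription pattern against the live topic list, client-side.
        static Set<String> matchingTopics(AdminClient admin, Pattern pattern) throws Exception {
            return admin.listTopics().names().get().stream()
                    .filter(t -> pattern.matcher(t).matches())
                    .collect(Collectors.toSet());
        }
    }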
Hi Colin,
Thanks for your suggestion! Which KIPs are you referring to?
BR,
G
On Mon, Aug 12, 2019 at 5:22 PM Colin McCabe wrote:
> Hi,
>
> If there’s no need to consume records in the Spark driver, then the
> Consumer is probably the wrong thing to use. Instead, Spark should use
> AdminClient to find out what partitions exist and where, manage their offsets, and so on…
Hi,
If there’s no need to consume records in the Spark driver, then the Consumer is
probably the wrong thing to use. Instead, Spark should use AdminClient to find
out what partitions exist and where, manage their offsets, and so on. There are
some KIPs under discussion now that would add the ne…
My feeling is that I didn't explain the use case for Spark properly and
hence failed to explain the needs. Sorry about this.
Spark leverages a single instance of KafkaConsumer in the driver, which is
the sole member registered in the consumer group. This is used in the plan
phase for each micro-batch to calculate…
Hi Guys,
Please see the actual implementation, pretty sure it explains the situation
well:
https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala
To answer one question/assumption which popped up from all of you: Spark…
Hi Jungtaek,
Thanks for the KIP. I have a couple of questions here.
Is Spark not using Kafka's consumer group management across multiple
consumers?
Is Spark using KafkaConsumer#subscribe(Pattern pattern,
ConsumerRebalanceListener listener) only to get all the topics for a
pattern-based subscription?…
I am not sure if I fully understand yet.
The fact that Spark does not store offsets in Kafka but as part of its
own checkpoint mechanism seems to be orthogonal. Maybe I am missing
something here.
As you are using subscribe(), you use the Kafka consumer group mechanism,
which takes care of the assignment…
Let me elaborate my explanation a bit more. Here we talk about Apache Spark,
but this applies to everything which wants to control the offsets of Kafka
consumers.
Spark is managing the committed offsets and the offsets which should be
polled next. Topics and partitions as well. This is required as Spark…
Sorry, I didn't recognize you're asking it here as well. I'm in favor of
describing it in this discussion thread so the discussion itself can go
forward. So copying my answer here:
We have some use cases where we don't just rely on everything the Kafka
consumer provides. We want to know the curren…
Thanks for the KIP.
Can you elaborate a little bit more on the use case for this feature?
Why would a consumer need to update its metadata explicitly?
-Matthias
On 8/8/19 8:46 PM, Jungtaek Lim wrote:
> Hi devs,
>
> I'd like to initiate discussion around KIP-505, exposing a new public method
> to only update assignment metadata in the consumer…
Hi devs,
I'd like to initiate discussion around KIP-505, exposing a new public method
to only update assignment metadata in the consumer.
`poll(0)` has been misused, as according to the Kafka docs it doesn't
guarantee that it won't pull any records, and the new method `poll(Duration)`
doesn't have the same semanti…
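To illustrate the misuse described above, a sketch of the historical idiom (illustrative only; not the API the KIP proposes, and the pattern is a placeholder):

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.util.regex.Pattern;

    public class PollZeroIdiom {
        static void updateAssignmentViaPollZero(KafkaConsumer<String, String> consumer) {
            consumer.subscribe(Pattern.compile("events-.*")); // placeholder pattern
            // Historical idiom: poll(0) called purely to trigger an assignment-metadata
            // update. Deprecated, and per the javadoc it may still return records.
            consumer.poll(0);
            // The replacement, poll(Duration.ZERO), returns immediately and does not
            // guarantee the metadata update completed; that is the gap KIP-505 targets.
        }
    }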