This access is needed in order to determine the number of partitions for a
topic. If you already know the number of partitions, you can provide the list
of partitions manually, which avoids accessing the Kafka cluster on the client.
See example usage at [1].
It would be feasible to avoid this access with some deterministic assignment
of partitions to workers for each input split. We might look into it if
'withTopicPartitions()' does not help in most of these cases.
[1]:
https://github.com/apache/beam/blob/60e0a22ea95921636c392b5aae77cb48196dd700/sdks/java/io/kafka/src/test/java/org/apache/beam/sdk/io/kafka/KafkaIOTest.java#L380
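A rough sketch of the 'withTopicPartitions()' approach (the broker address,
topic name, and partition count below are illustrative; check the partition
count against your actual topic):

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(KafkaIO.<Long, String>read()
        // Brokers only need to be reachable from the workers, not the client.
        .withBootstrapServers("broker-1:9092")
        // Listing partitions explicitly (instead of withTopic()) means the
        // client does not contact the brokers at submission time to discover
        // the partition count.
        .withTopicPartitions(Arrays.asList(
            new TopicPartition("my_topic", 0),
            new TopicPartition("my_topic", 1),
            new TopicPartition("my_topic", 2)))
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withoutMetadata());

    p.run();
  }
}
```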
On Wed, Sep 19, 2018 at 11:59 AM Juan Carlos Garcia
wrote:
> Hi folks, we have a pipeline for Dataflow, and our Google Cloud environment
> has a private network (where the pipeline should run; this network
> interconnects via an IPsec tunnel to an AWS environment where the Kafka
> brokers are running).
>
> We have found that in order to be able to submit the pipeline we have to
> do it from a machine that has access to the Kafka brokers.
>
> Is there a way to avoid that?
>
> Why can't KafkaIO defer the communication with the brokers until the
> pipeline is on the worker node?
>
> Thanks and regards,
> JC
>