Re: KafkaIO needs access to the brokers even before the pipeline reach the worker

2018-09-19 Thread Raghu Angadi
This access is needed to determine the number of partitions for a
topic. If you already know the number of partitions, you can provide the
list of partitions manually, which avoids accessing the Kafka cluster from
the client. See the example usage at [1].

It is feasible to avoid this access with some deterministic assignment of
partitions on the workers for each input split. We might look into it if
'withTopicPartitions()' does not help in most of these cases.
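
To make the "deterministic assignment" idea concrete, here is a minimal
sketch in plain Java. It is not KafkaIO code: the class and method names are
hypothetical, and round-robin is only one possible rule. It assumes the number
of splits is fixed at submission time, and that each worker looks up the
partition count itself at run time and then computes its own share.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of KafkaIO.
final class SplitAssignment {
    private SplitAssignment() {}

    /**
     * Returns the partition ids that the split at splitIndex would read,
     * using round-robin: partition p goes to split (p % numSplits).
     * numPartitions is assumed to be discovered on the worker.
     */
    static List<Integer> partitionsForSplit(int numPartitions, int numSplits, int splitIndex) {
        if (numPartitions < 0 || numSplits <= 0 || splitIndex < 0 || splitIndex >= numSplits) {
            throw new IllegalArgumentException("invalid split parameters");
        }
        List<Integer> assigned = new ArrayList<>();
        // Step through the partitions owned by this split.
        for (int p = splitIndex; p < numPartitions; p += numSplits) {
            assigned.add(p);
        }
        return assigned;
    }
}
```

With 5 partitions and 2 splits, split 0 would get partitions 0, 2 and 4, and
split 1 would get 1 and 3. Every worker reaches the same answer independently,
so the client never has to contact the brokers.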

[1]:
https://github.com/apache/beam/blob/60e0a22ea95921636c392b5aae77cb48196dd700/sdks/java/io/kafka/src/test/java/org/apache/beam/sdk/io/kafka/KafkaIOTest.java#L380

On Wed, Sep 19, 2018 at 11:59 AM Juan Carlos Garcia wrote:

> Hi folks, we have a pipeline for Dataflow, and our Google Cloud environment
> has a private network where the pipeline should run. This network
> interconnects via IPsec with an AWS environment where the Kafka brokers
> are running.
>
> We have found that in order to be able to submit the pipeline we have to
> do it from a machine that has access to the Kafka brokers.
>
> Is there a way to avoid that?
>
> Why can't KafkaIO defer communication with the brokers until the
> pipeline is on the worker node?
>
> Thanks and regards,
> JC
>


KafkaIO needs access to the brokers even before the pipeline reach the worker

2018-09-19 Thread Juan Carlos Garcia
Hi folks, we have a pipeline for Dataflow, and our Google Cloud environment
has a private network where the pipeline should run. This network
interconnects via IPsec with an AWS environment where the Kafka brokers
are running.

We have found that in order to be able to submit the pipeline we have to do
it from a machine that has access to the Kafka brokers.

Is there a way to avoid that?

Why can't KafkaIO defer communication with the brokers until the
pipeline is on the worker node?

Thanks and regards,
JC