Dataflow performs its initial splitting of UnboundedSources at pipeline
construction time. This is a limitation of Dataflow.

You could try using Flex Templates[1], which ensure that pipeline construction
happens on a GCE VM inside Google Cloud; that may satisfy your security
requirements.
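
For reference, launching a job from a Flex Template looks roughly like the
following; the job name, template location, region, and parameters are
placeholders that depend on how you package your template:

  gcloud dataflow flex-template run "kafka-read-job" \
      --template-file-gcs-location gs://my-bucket/templates/kafka-job.json \
      --region us-central1 \
      --parameters bootstrapServers=kafka.internal.example.com:9092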

Otherwise you would have to remove this split call and encode the unsplit
source, since to my knowledge there is no way to disable this behavior
without changing the code:
https://github.com/apache/beam/blob/238c33c36c219b0c2a6c6281962cf84fbbebb242/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/internal/CustomSources.java#L87

Note that Dataflow performance will then be limited to a single machine per
Kafka unbounded source. To limit the impact of removing the splitting, you
could apply multiple Kafka unbounded sources, each reading from a distinct
topic and/or topic partition, and then flatten those PCollections together
(this is effectively what the initial splitting does); see the sketch below.
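
As a rough illustration, here is a minimal sketch of that manual approach
using KafkaIO, assuming a two-partition topic; the broker address, topic
name, and class name are placeholders, and Dataflow pipeline options are
omitted for brevity:

import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualKafkaSplitExample {

  // One unsplit Kafka source pinned to a single partition of "my-topic".
  // Broker address, topic name, and partition count are placeholders.
  private static KafkaIO.Read<String, String> readPartition(int partition) {
    return KafkaIO.<String, String>read()
        .withBootstrapServers("kafka.internal.example.com:9092")
        .withTopicPartitions(
            Collections.singletonList(new TopicPartition("my-topic", partition)))
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class);
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(); // Dataflow options omitted for brevity.

    // Apply one source per partition, then flatten them into a single
    // PCollection -- roughly what the initial splitting would otherwise do.
    PCollection<KafkaRecord<String, String>> part0 =
        p.apply("ReadPartition0", readPartition(0));
    PCollection<KafkaRecord<String, String>> part1 =
        p.apply("ReadPartition1", readPartition(1));

    PCollection<KafkaRecord<String, String>> allRecords =
        PCollectionList.of(part0).and(part1)
            .apply("FlattenKafkaSources", Flatten.pCollections());

    // ... downstream transforms on allRecords ...

    p.run();
  }
}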

1:
https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates

On Fri, May 15, 2020 at 2:09 AM Dmytro Mykhailov <
[email protected]> wrote:

> Hello everybody,
>
> I have a secure Kafka instance on-premises that I would like to read from
> using a Dataflow job. The idea is to allow connections to Kafka only
> from Dataflow machines.
>
> But when I run the job with the Dataflow runner, it still attempts to connect
> to the Kafka instance. Naturally this fails and the job is not deployed.
>
> I suspect that there is some kind of verification step that the job
> compiler performs locally before deployment. Is there more
> documentation on this topic? Is there a way to control what happens
> during the verification process?
>
> Regards,
> Dmytro
>
