On Mon, Jun 8, 2020 at 2:06 PM Chad Dombrova <chad...@gmail.com> wrote:

> Even when running portably, Dataflow still has its own implementation of
>> PubSubIO that is switched out for Python's "implementation." (It's actually
>> built into the same layer that provides the shuffle/group-by-key
>> implementation.) However, if you use the external Java PubSubIO, Dataflow may
>> not recognize it and will continue to use that implementation even on Dataflow.
>>
>
> That's great, actually, as we still have some headaches around using the
> Java PubSubIO transform: it requires a custom build of the Java Beam API
> and SDK container to add missing dependencies and to properly handle data
> conversions between Python and Java.
>
> Next question: when using Dataflow + Portability, can we specify our own
> Docker container for the Beam Python SDK when using the Docker executor?
>

Yes, you should be able to do that.
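
For the Docker executor specifically, the portable pipeline options can be
pointed at your own image. A rough sketch from the Python side (the image
path below is just a placeholder, and exact option names can vary by Beam
version):

    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder image location; any registry the workers can pull from should work.
    CUSTOM_SDK_IMAGE = 'gcr.io/my-project/beam-python-custom:2.22.0'

    options = PipelineOptions([
        '--environment_type=DOCKER',
        '--environment_config=' + CUSTOM_SDK_IMAGE,
    ])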


>
> We have two reasons to do this:
> 1) we have some environments that cannot be bootstrapped on top of the
> stock Beam SDK image
> 2) we have a somewhat modified version of the Beam SDK (changes which we
> eventually hope to contribute back, but won't be able to for at least a few
> months).
>
> If yes, what are the restrictions around custom SDK images?  e.g. must be
> the same version of Beam, must be on a registry accessible to Dataflow,
> etc...
>

- It needs to be built as described here:
https://beam.apache.org/documentation/runtime/environments/
- Use the flag --workerHarnessContainerImage=[location of container image]
(the image needs to be accessible to the Dataflow VMs).

There are no other limitations, but this is a not-yet-tested/supported path,
so you might run into issues.
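
For what it's worth, here is a sketch of what that could look like when
launching from the Python SDK (project, bucket, and image names are
placeholders; I believe the Python spelling of the flag is the snake_case
one, and the runner v2 experiment may be needed to get the portable path on
Dataflow):

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-project',                # placeholder
        '--region=us-central1',                # placeholder
        '--temp_location=gs://my-bucket/tmp',  # placeholder
        # Custom SDK container built per the environments doc above and pushed
        # to a registry the Dataflow VMs can pull from:
        '--worker_harness_container_image=gcr.io/my-project/beam-python-custom:2.22.0',
        # Likely needed to get the portable execution path on Dataflow:
        '--experiments=use_runner_v2',
    ])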


>
> thanks
> -chad
>
