On Mon, Jun 8, 2020 at 2:06 PM Chad Dombrova <chad...@gmail.com> wrote:
>> Even when running portably, Dataflow still has its own implementation of
>> PubSubIO that is switched out for Python's "implementation." (It's
>> actually built into the same layer that provides the shuffle/group-by-key
>> implementation.) However, if you used the external Java PubSubIO, Dataflow
>> may not recognize this and may continue to use that implementation even
>> when running on Dataflow.
>
> That's great, actually, as we still have some headaches around using the
> Java PubSubIO transform: it requires a custom build of the Java Beam API
> and SDK container to add missing dependencies and to properly deal with
> data conversions between Python and Java.
>
> Next question: when using Dataflow + portability, can we specify our own
> Docker container for the Beam Python SDK when using the Docker executor?

Yes, you should be able to do that.

> We have two reasons to do this:
> 1) we have some environments that cannot be bootstrapped on top of the
> stock Beam SDK image
> 2) we have a somewhat modified version of the Beam SDK (changes which we
> eventually hope to contribute back, but won't be able to for at least a
> few months)
>
> If yes, what are the restrictions around custom SDK images? e.g. must it
> be the same version of Beam, must it be on a registry accessible to
> Dataflow, etc.?

- It needs to be built as described here:
  https://beam.apache.org/documentation/runtime/environments/
- Use the flag --workerHarnessContainerImage=[location of container image]
  (images need to be accessible to the Dataflow VMs).

There are no other limitations, but this is a not-yet-tested/supported
path, so you might run into issues.

> thanks
> -chad
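PS: to make the swap at the top concrete, the Python "implementation" in
question is the stock beam.io.ReadFromPubSub transform. A minimal sketch
(project/subscription names here are made up):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Streaming pipeline; the names below are placeholders.
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         # On Dataflow this read is swapped out for the runner-native
         # Pub/Sub source; on other runners the Python code itself runs.
         | beam.io.ReadFromPubSub(
               subscription='projects/my-project/subscriptions/my-sub')
         | beam.Map(lambda msg: msg.decode('utf-8'))  # payloads are bytes
         | beam.Map(print))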
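And a minimal sketch of the custom-container flow under the steps above.
The image names, tag, and the extra package are made up, and keep in mind
this is the not-yet-tested path; note also that --workerHarnessContainerImage
is the Java SDK's spelling, while the Python SDK's equivalent flag is
--worker_harness_container_image:

    # Dockerfile: extend the released SDK image rather than building from
    # scratch (base image tag should match your Beam version).
    FROM apache/beam_python3.7_sdk:2.22.0
    # Hypothetical internal dependency missing from the stock image.
    RUN pip install --no-cache-dir my-internal-package

    # Build and push somewhere the Dataflow VMs can pull from:
    docker build -t gcr.io/my-project/beam-python-custom:2.22.0 .
    docker push gcr.io/my-project/beam-python-custom:2.22.0

    # Launch pointing the workers at the custom image (plus whatever
    # portability/experiment flags you already pass):
    python my_pipeline.py \
        --runner=DataflowRunner \
        --project=my-project \
        --region=us-central1 \
        --temp_location=gs://my-bucket/tmp \
        --streaming \
        --worker_harness_container_image=gcr.io/my-project/beam-python-custom:2.22.0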