> Even when running portably, Dataflow still has its own implementation of
> PubSubIO that is switched out for Python's "implementation." (It's actually
> built into the same layer that provides the shuffle/group-by-key
> implementation.) However, if you used the external Java PubSubIO it may not
> recognize this and continue to use that implementation even on Dataflow.
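(For context, the Python "implementation" being swapped out here is just the SDK's own ReadFromPubSub transform; a minimal sketch of that native path, with a hypothetical project/topic name:)

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions

# Pub/Sub reads are unbounded, so the pipeline must run in streaming mode.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # Per the quoted explanation, on Dataflow this step is the one that
        # gets replaced by the runner's internal Pub/Sub source; an external
        # Java PubSubIO expansion would not be recognized and replaced.
        | "ReadPubSub" >> ReadFromPubSub(topic="projects/my-project/topics/my-topic")  # hypothetical topic
        | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
    )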
That's great news, actually, as we still have some headaches around the Java PubSubIO transform: it requires a custom build of the Java Beam API and SDK container to add missing dependencies and to properly handle data conversions between Python and Java.

Next question: when using Dataflow with portability, can we specify our own Docker container for the Beam Python SDK when using the Docker executor? We have two reasons to do this:

1) some of our environments cannot be bootstrapped on top of the stock Beam SDK image
2) we run a somewhat modified version of the Beam SDK (changes we eventually hope to contribute back, but won't be able to for at least a few months)

If yes, what are the restrictions on custom SDK images? e.g. must it be the same version of Beam, must it live on a registry accessible to Dataflow, etc.?

thanks
-chad
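P.S. For concreteness, a minimal sketch of the kind of launch we have in mind, assuming Beam's sdk_container_image pipeline option and Dataflow Runner v2; every project, region, bucket, and image name below is a hypothetical placeholder:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # hypothetical
    region="us-central1",                # hypothetical
    temp_location="gs://my-bucket/tmp",  # hypothetical
    experiments=["use_runner_v2"],       # portable Runner v2
    # Presumably the image must be pullable by the Dataflow workers (e.g.
    # pushed to Artifact Registry in the same project) and would be built
    # FROM the matching apache/beam_python*_sdk base image so the Beam
    # version inside matches the SDK launching the job.
    sdk_container_image="us-docker.pkg.dev/my-project/repo/beam-custom:latest",  # hypothetical
)

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)

This is only illustrative of what we're asking about, not a claim about what Dataflow currently supports.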