On Wed, Dec 28, 2022 at 7:25 PM Byron Ellis via dev <dev@beam.apache.org> wrote:
> Thanks for the tips, folks! Took a bit of doing, but I got Java -> Python
> -> Java working without Docker being involved in the process (getting it
> working with Docker being involved wasn't so bad... though it didn't do
> what I wanted with respect to collecting results). Removing Docker appears
> to let me collect the results back on the Java side via Beam SQL's
> TestTable, which then lets me inspect the results for test validation
> purposes.
>

Great! FWIW, commands for running locally with DirectRunner are also
documented here:
https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#run-the-java-pipeline

Thanks,
Cham

>
> In case anyone else is feeling similarly foolish, here's what ended up
> working:
>
> https://github.com/byronellis/beam/blob/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/java/org/apache/beam/sdk/extensions/spd/StructuredPipelineExecutionTest.java
>
> It ain't pretty, but it gets the job done.
>
> Best,
> B
>
> On Wed, Dec 28, 2022 at 10:42 AM Robert Bradshaw <rober...@google.com> wrote:
>
>> On Wed, Dec 28, 2022 at 10:09 AM Byron Ellis <byronel...@google.com> wrote:
>> >
>> > On Wed, Dec 28, 2022 at 9:49 AM Robert Bradshaw <rober...@google.com> wrote:
>> >>
>> >> On Wed, Dec 28, 2022 at 4:56 AM Danny McCormick via dev
>> >> <dev@beam.apache.org> wrote:
>> >> >
>> >> > > Given the increasing importance of multi language pipelines, it
>> >> > > does seem that we should expand the capabilities of the
>> >> > > DirectRunner or just go all in on FlinkRunner for testing and
>> >> > > local / small scale development
>> >> >
>> >> > +1 - anecdotally I've found local testing of multi-language
>> >> > pipelines to be tricky, and have had multiple conversations with
>> >> > others who have run into similar challenges in multiple contexts
>> >> > (both users and people working on the project).
>> >>
>> >> I generally do all my testing against the Python runner which works
>> >> well.
>> >> This is, of course, more natural for Python pipelines using
>> >> other languages, but when I was working on typescript which uses
>> >> cross-language even more heavily I just made it auto-start the python
>> >> runner just like the expansion services are auto-started which works
>> >> quite well. (The auto-started runner is just a plain-old portable
>> >> runner speaking the runner API, so no additional support is required
>> >> on the source side once it's started. And if you're already trying to
>> >> use dataframes and/or ML, you need to have Python available anyway.)
>> >>
>> >> We could consider bundling it as a docker image to reduce the required
>> >> dependency set, but we'd have to solve the docker-in-docker issue to
>> >> do that.
>> >>
>> >> I really think it's important to make cross-language a first-class
>> >> citizen--the end user should not care most of the time whether the
>> >> pipelines they use are native or not.
>> >
>> >
>> > Thanks! That's helpful. In this case getting the Python runner to
>> > auto-start sounds like the most straightforward option for testing.
>> > After all it's explicitly to provide Python initiated from Java so
>> > Python is already going to be around and running (and in fact the test
>> > auto-starts the Python expansion service already to get the graph in
>> > the first place) and the deps are already going to be there.
>>
>> Yep.
>>
>> > I'm personally on the fence about Docker in these sorts of situations.
>> > Yes, it makes life easier for the most part but gets complicated
>> > quickly. It's also not an option for everyone.
>>
>> For sure. I think it'd be good to have various alternative packaging
>> of expansion services as different people will have different setups
>> (e.g. a Crostini Go developer is more likely to have docker than java,
>> but it's probably just the opposite for a java developer on windows).
>> This is what I did for the yaml thing.
>> Note that nominally docker is
>> required for running a cross-language pipeline, so that makes it a
>> more natural option there. (Technically, at least for development, you
>> can have the host SDK process vend itself as a worker in LOOPBACK
>> mode, and if you pass the directEmbedDockerPython=true option to the
>> portable python runner it will inline the Python operations rather
>> than firing up a docker worker for those (assuming, of course, the
>> versions match).)
>>
>> > I'll give things a shot and report back (if you have an example of
>> > auto-starting the Python runner that'd be cool too---if I get inspired
>> > I might try to add that to the Python extensions in Java since right
>> > now they don't actually appear to be exercising the runner itself
>> > based on the TODOs)
>>
>> In typescript the runner is started up as
>>
>>     PythonService.forModule(
>>         "apache_beam.runners.portability.local_job_service_main",
>>         ["--port", "{{PORT}}"])
>>
>> which is very similar to how the expansion service is started up
>>
>>     PythonService.forModule(
>>         "apache_beam.runners.portability.expansion_service_main",
>>         ["--fully_qualified_name_glob=*", "--port", "{{PORT}}"])
>>
>> Here {{PORT}} is auto-populated and can be retrieved to instantiate
>> the gRPC connection (or, in your case, passed to the portable runner
>> as the endpoint).
>>
>> Java is very similar, one does
>>
>>     PythonService service =
>>         new PythonService(
>>             "apache_beam.runners.portability.expansion_service_main", ...);
>>     AutoCloseable running = service.start();
>>     ...
>>
>> which should be easily adaptable to starting up a runner. On first use
>> this service automatically creates a virtual environment with Beam
>> (and other dependencies) installed. (I don't know what the analogue is
>> for Go, but it shouldn't be that different...)
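
For anyone following along, the {{PORT}} substitution mentioned above can be
sketched in plain Java without any Beam classes. This is only an illustration
under assumptions: PortTemplate, findFreePort, fill, and command are
hypothetical names, the "python" executable name is assumed to be on the PATH,
and Beam's own PythonService may well do this differently. Only the module
name and the --port flag come from the thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Beam. */
final class PortTemplate {
  static final String PLACEHOLDER = "{{PORT}}";

  /** Asks the OS for a currently free TCP port by binding to port 0. */
  static int findFreePort() {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Returns a copy of args with every {{PORT}} occurrence replaced by port. */
  static List<String> fill(List<String> args, int port) {
    List<String> out = new ArrayList<>();
    for (String arg : args) {
      out.add(arg.replace(PLACEHOLDER, Integer.toString(port)));
    }
    return out;
  }

  /** Builds "python -m <module> <args...>" with the port filled in. */
  static List<String> command(String module, List<String> args, int port) {
    List<String> cmd = new ArrayList<>(Arrays.asList("python", "-m", module));
    cmd.addAll(fill(args, port));
    return cmd;
  }
}
```

The resulting list could be handed to new ProcessBuilder(cmd).inheritIO().start(),
and localhost:<port> would then be the endpoint given to the portable runner.
Note that the port is only known to be free at the moment it is probed, so a
real implementation would likely also wait for the service to come up before
connecting.
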
>>
>> The one difficulty with auto-started services is that the release
>> artifacts are not necessarily available for a dev repo the same way
>> they are with a released version. IIRC, we fall back to the previous
>> release in that case. To compensate, and have faster iteration/easier
>> testing, one can set an environment variable BEAM_SERVICE_OVERRIDES
>> where one specifies an existing venv, jar, or address to use for a
>> specific service, see
>>
>> https://github.com/apache/beam/blob/release-2.43.0/sdks/typescript/src/apache_beam/utils/service.ts#L432
>>
>> This works for Python and typescript; I don't remember if I
>> implemented it for Java and I don't think it's yet in Go.
>>
>> Hopefully this is enough pointers to get started. It'd be great to get
>> Java up to snuff.
>>
>> References:
>>
>> https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/runners/universal.ts#L34
>>
>> https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/python.ts#L60
>>
>> https://github.com/apache/beam/blob/release-2.43.0/sdks/java/extensions/python/src/main/java/org/apache/beam/sdk/extensions/python/PythonExternalTransform.java#L466
>>
>> >> > On Wed, Dec 28, 2022 at 7:50 AM Sachin Agarwal via dev
>> >> > <dev@beam.apache.org> wrote:
>> >> >>
>> >> >> Given the increasing importance of multi language pipelines, it
>> >> >> does seem that we should expand the capabilities of the
>> >> >> DirectRunner or just go all in on FlinkRunner for testing and
>> >> >> local / small scale development
>> >> >>
>> >> >> On Wed, Dec 28, 2022 at 12:47 AM Robert Burke <rob...@frantil.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> Probably either on Flink, or the Python Portable runner at this
>> >> >>> juncture.
>> >> >>>
>> >> >>> On Tue, Dec 27, 2022, 8:40 PM Byron Ellis via dev
>> >> >>> <dev@beam.apache.org> wrote:
>> >> >>>>
>> >> >>>> Hi all,
>> >> >>>>
>> >> >>>> I spent some more time adding things to my dbt-for-Beam clone
>> >> >>>> (https://github.com/apache/beam/pull/24670) and actually made a
>> >> >>>> fair amount of progress, including starting to add in the profile
>> >> >>>> support so I can start to run it against real workloads (though
>> >> >>>> at the moment only the "test" connector is properly configured).
>> >> >>>> More interestingly, though, is adding in support for Python
>> >> >>>> Dataframe external transforms... which expands properly, but then
>> >> >>>> (unsurprisingly) hangs if you try to actually run the pipeline
>> >> >>>> with Java's TestPipeline.
>> >> >>>>
>> >> >>>> I was wondering how people go about testing Java/Python hybrid
>> >> >>>> pipelines locally? The Java<->Python tests don't seem to actually
>> >> >>>> execute a pipeline, but I was hoping that maybe the direct runner
>> >> >>>> could be set up properly to do that?
>> >> >>>>
>> >> >>>> Best,
>> >> >>>> B