Re: Testing Multilanguage Pipelines?

Chamikara Jayalath via dev Tue, 10 Jan 2023 04:17:17 -0800

On Wed, Dec 28, 2022 at 7:25 PM Byron Ellis via dev <dev@beam.apache.org>
wrote:


> Thanks for the tips, folks! Took a bit of doing, but I got Java -> Python
> -> Java working without Docker being involved in the process (getting it
> working with Docker being involved wasn't so bad... though it didn't do
> what I wanted with respect to collecting results). Removing Docker appears
> to let me collect the results back on the Java side via Beam SQL's
> TestTable, which then lets me inspect the results for test validation
> purposes.
>

Great!
FWIW, commands for running locally with DirectRunner are also documented
here:
https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#run-the-java-pipeline

Thanks,
Cham


>
> In case anyone else is feeling similarly foolish, here's what ended up
> working:
>
>
> https://github.com/byronellis/beam/blob/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/java/org/apache/beam/sdk/extensions/spd/StructuredPipelineExecutionTest.java
>
> It ain't pretty, but it gets the job done.
>
> Best,
> B
>
>
>
> On Wed, Dec 28, 2022 at 10:42 AM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Wed, Dec 28, 2022 at 10:09 AM Byron Ellis <byronel...@google.com>
>> wrote:
>> >
>> > On Wed, Dec 28, 2022 at 9:49 AM Robert Bradshaw <rober...@google.com>
>> wrote:
>> >>
>> >> On Wed, Dec 28, 2022 at 4:56 AM Danny McCormick via dev
>> >> <dev@beam.apache.org> wrote:
>> >> >
>> >> > > Given the increasing importance of multi language pipelines, it
>> does seem that we should expand the capabilities of the DirectRunner or
>> just go all in on FlinkRunner for testing and local / small scale
>> development
>> >> >
>> >> > +1 - annecdotally I've found local testing of multi-language
>> pipelines to be tricky, and have had multiple conversations with others who
>> have run into similar challenges in multiple contexts (both users and
>> people working on the project).
>> >>
>> >> I generally do all my testing against the Python runner which works
>> >> well. This is, of course, more natural for Python pipelines using
>> >> other languages, but when I was working on typescript which uses
>> >> cross-language even more heavily I just made it auto-start the python
>> >> runner just like the expansion services are auto-started which works
>> >> quite well. (The auto-started runner is just a plain-old portable
>> >> runner speaking the runner API, so no additional support is required
>> >> on the source side once it's started. And if you're already trying to
>> >> use dataframes and/or ML, you need to have Python available anyway.)
>> >>
>> >> We could consider bundling it as a docker image to reduce the required
>> >> dependency set, but we'd have to solve the docker-in-docker issue to
>> >> do that.
>> >>
>> >> I really think it's important to make cross-language a first-class
>> >> citizen--the end use should not care most of the time whether the
>> >> pipelines they use are native or not.
>> >
>> >
>> > Thanks! That's helpful. In this case getting the Python runner to
>> auto-start sounds like the most straightforward option for testing. After
>> all it's explicitly to provide Python initiated from Java so Python is
>> already going to be around and running (and in fact the test auto-starts
>> the Python expansion service already to get the graph in the first place)
>> and the deps are already going to be there.
>>
>> Yep.
>>
>> > I'm personally on the fence about Docker in these sorts of situations.
>> Yes, it makes life easier for the most part but gets complicated quickly.
>> It's also not an option for everyone.
>>
>> For sure. I think it'd be good to have various alternative packaging
>> of expansion services as different people will have different setups
>> (e.g. a Crostini Go developer is more likely to have docker than java,
>> but it's probably just the opposite for a java developer on windows).
>> This is what I did for the yaml thing. Note that nominally docker is
>> required for running a cross-language pipeline, so that makes it a
>> more natural option there. (Technically, at least for development, you
>> can have the host SDK process vend itself as a worker in LOOPBACK
>> mode, and if you pass the directEmbedDockerPython=true option to the
>> portable python runner it will inline the Python operations rather
>> than firing up a docker worker for those (assuming, of course, the
>> versions match.)
>>
>> > I'll give things a shot and report back (if you have an example of
>> auto-starting the Python runner that'd be cool too---if I get inspired I
>> might try to add that to the Python extensions in Java since right now they
>> don't actually appear to be exercising the runner itself based on the TODOs)
>>
>> In typescript the runner is started up as
>>
>>
>> PythonService.forModule("apache_beam.runners.portability.local_job_service_main",
>> ["--port", "{{PORT}}"])
>>
>> which is very similar to how the expansion service is started up
>>
>>
>>  
>> PythonService.forModule("apache_beam.runners.portability.expansion_service_main",
>> ["--fully_qualified_name_glob=*", "--port", "{{PORT}}"])
>>
>> Here {{PORT}} is auto-populated and can be retrieved to instantiate
>> the GRPC connection (or, in your case, passed to the portable runner
>> as the endpoint).
>>
>> Java is very similar, one does
>>
>>     PythonService service =
>>         new
>> PythonService("apache_beam.runners.portability.expansion_service_main",
>> ...);
>>     AutoCloseable running = service.start();
>>     ...
>>
>> which should be easily adaptable to starting up a runner. On first use
>> this service automatically creates a virtual environment with Beam
>> (and other dependencies) installed. (I don't know what the analogue is
>> for Go, but it shouldn't be that different...)
>>
>> The one difficulty with auto-started service is that the release
>> artifacts are not necessarily available for a dev repo the same way
>> they are with a released version. IIRC, we fall back to the previous
>> release in that case. To compensate, and have faster iteration/easier
>> testing, one can set an environment variable BEAM_SERVICE_OVERRIDES
>> where one specifies an existing venv, jar, or address to use for a
>> specific service, see
>>
>> https://github.com/apache/beam/blob/release-2.43.0/sdks/typescript/src/apache_beam/utils/service.ts#L432
>> . This works for Python and typescript; I don't remember if I
>> implemented it for Java and I don't think it's yet in Go.
>>
>> Hopefully this is enough pointers to get started. It'd be great to get
>> Java up to snuff.
>>
>> References:
>>
>> https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/runners/universal.ts#L34
>>
>> https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/python.ts#L60
>>
>> https://github.com/apache/beam/blob/release-2.43.0/sdks/java/extensions/python/src/main/java/org/apache/beam/sdk/extensions/python/PythonExternalTransform.java#L466
>>
>> >> > On Wed, Dec 28, 2022 at 7:50 AM Sachin Agarwal via dev <
>> dev@beam.apache.org> wrote:
>> >> >>
>> >> >> Given the increasing importance of multi language pipelines, it
>> does seem that we should expand the capabilities of the DirectRunner or
>> just go all in on FlinkRunner for testing and local / small scale
>> development
>> >> >>
>> >> >> On Wed, Dec 28, 2022 at 12:47 AM Robert Burke <rob...@frantil.com>
>> wrote:
>> >> >>>
>> >> >>> Probably either on Flink, or the Python Portable runner at this
>> juncture.
>> >> >>>
>> >> >>> On Tue, Dec 27, 2022, 8:40 PM Byron Ellis via dev <
>> dev@beam.apache.org> wrote:
>> >> >>>>
>> >> >>>> Hi all,
>> >> >>>>
>> >> >>>> I spent some more time adding things to my dbt-for-Beam clone (
>> https://github.com/apache/beam/pull/24670) and actually made a fair
>> amount of progress, including starting to add in the profile support so I
>> can start to run it against real workloads (though at the moment only the
>> "test" connector is properly configured). More interestingly, though, is
>> adding in support for Python Dataframe external transforms... which expands
>> properly, but then (unsurprisingly) hangs if you try to actually run the
>> pipeline with Java's TestPipeline.
>> >> >>>>
>> >> >>>> I was wondering how people go about testing Java/Python hybrid
>> pipelines locally? The Java<->Python tests don't seem to actually execute a
>> pipeline, but I was hoping that maybe the direct runner could be set up
>> properly to do that?
>> >> >>>>
>> >> >>>> Best,
>> >> >>>> B
>>
>

Re: Testing Multilanguage Pipelines?

Reply via email to