While one does want to watch out for expensive per-record operations, this
may still be preferable to (and cheaper than) setting up a server and
making RPC requests. It depends on the nature of the operation. If
executing the Perl script is (say) 100ms of "startup" for 1ms of actually
processing $DATA, then you'll be wasting a lot of cycles and a server may
be the way to go, but if it's 1ms of startup for 100ms of processing $DATA
then the startup cost won't matter at all.
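
As a rough illustration of the per-element approach, here is a minimal
sketch in Python. The names run_script and RunScriptPerElement, the default
command, and the 60 second timeout are made up for this example, and it
assumes perl and myscript.pl are already present on the worker. The import
guard is only there so the helper can be tried on a machine without Beam
installed.

```python
import subprocess
import sys

try:
    import apache_beam as beam
    DoFnBase = beam.DoFn
except ImportError:  # lets the helper be tried without Beam installed
    DoFnBase = object


def run_script(command, data, timeout=60):
    """Run `command + [data]` as a child process and return its stdout."""
    result = subprocess.run(
        command + [data],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,  # assumed limit; tune for the real script
        check=True,       # raise if the script exits non-zero
    )
    return result.stdout


class RunScriptPerElement(DoFnBase):
    """Hypothetical DoFn that pays the process startup cost per element."""

    def __init__(self, command=("perl", "myscript.pl")):
        super().__init__()
        self._command = list(command)

    def process(self, element):
        yield run_script(self._command, element)
```

Timing run_script on a few representative inputs is probably the quickest
way to find out which of the two regimes above you are in.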

If the startup cost is prohibitive, you could also start up a local
"server" on the worker in startBundle (or even setUp), shut it down in
finishBundle, and communicate with it in your processElement. (In the
Python SDK these are start_bundle, setup, finish_bundle, and process.)
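
A sketch of that local "server" pattern in Python, under some loud
assumptions: the script has a mode (here the made-up myscript_server.pl)
that reads one request per line on stdin and writes exactly one flushed
line of output per request, and elements are strings without embedded
newlines. The class name LineServerDoFn is also made up.

```python
import subprocess
import sys

try:
    import apache_beam as beam
    DoFnBase = beam.DoFn
except ImportError:  # lets the class be tried without Beam installed
    DoFnBase = object


class LineServerDoFn(DoFnBase):
    """Hypothetical DoFn that keeps one child process alive per bundle."""

    def __init__(self, command=("perl", "myscript_server.pl")):
        super().__init__()
        self._command = list(command)
        self._proc = None  # created on the worker, not at construction

    def start_bundle(self):
        # Pay the startup cost once per bundle instead of once per element.
        self._proc = subprocess.Popen(
            self._command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            universal_newlines=True,
            bufsize=1,  # line-buffered text pipes
        )

    def process(self, element):
        # One line in, one line out (assumed protocol).
        self._proc.stdin.write(element + "\n")
        self._proc.stdin.flush()
        yield self._proc.stdout.readline().rstrip("\n")

    def finish_bundle(self):
        if self._proc is not None:
            self._proc.stdin.close()  # EOF tells the child to exit
            self._proc.wait()
            self._proc.stdout.close()
            self._proc = None
```

If bundles turn out to be small, moving the Popen call to setup and the
shutdown to teardown would amortize the startup over more elements.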

The other bit is actually shipping your Perl script (and, more tricky, its
dependencies). Currently that's very runner-dependent, and typically you
end up packing it as data in your jars and then trying to unpack/install it
on the workers at runtime. One of the goals of
https://beam.apache.org/contribute/portability/ is to make this easier.
Specifically, you can set up your worker environment as a Docker container
with everything you need, and this will get used as the environment in which
your DoFns are executed.


On Wed, Oct 24, 2018 at 6:48 AM Sobhan Badiozamany <
sobhan.badiozam...@leovegas.com> wrote:

> Hi Nima,
>
> I think the answer depends on the use-case, but what you suggest is on the
> list of practices that hurt scalability of pipelines as it will be an
> example of “Expensive Per-Record Operations”, look it up here:
>
> https://cloud.google.com/blog/products/gcp/writing-dataflow-pipelines-with-scalability-in-mind
>
> Cheers,
> Sobi
>
> Sent from my iPhone
>
> On Oct 23, 2018, at 23:35, Nima Mousavi <nima.mous...@gmail.com> wrote:
>
> Hi,
>
> We have a dataflow pipeline written in Apache python beam, and are
> wondering if we can run a third party code (written in perl) in the
> pipeline. We basically want to run
>
> perl myscript.pl $DATA
>
> for every DATA in a PCollection passed to a DoFn
>
> and write the result back into Bigquery.  We could have setup a server for
> myscript.pl, and send HTTP/RPC request to the server from each worker
> instead. But we are wondering if it is possible to run the script directly
> inside the Beam worker? Or even through a docker container packaging our
> perl script? If yes, how? what do you think of this approach? Any caveat we
> should be aware of?
>
> Thanks!
>
>
