Yes, this is feasible and has been done by others.
You can launch any process from within an Apache Beam DoFn using standard
process libraries (e.g. ProcessBuilder in Java, subprocess in Python, ...).
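
A rough sketch of what such a DoFn could look like in Java (hypothetical
names: "my-decoder" stands for your external program and each input element
is a local path to an already-downloaded file; writing to the program's
stdin works the same way via process.getOutputStream()):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.transforms.DoFn;

public class RunExternalToolFn extends DoFn<String, String> {

  @ProcessElement
  public void processElement(@Element String localPath, OutputReceiver<String> out)
      throws IOException, InterruptedException {
    // Launch the external program; it opens the file and writes records to stdout.
    ProcessBuilder pb = new ProcessBuilder("my-decoder", localPath);
    pb.redirectErrorStream(true); // merge stderr into stdout for simpler handling
    Process process = pb.start();

    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        out.output(line); // emit each stdout record for downstream processing
      }
    }

    int exitCode = process.waitFor();
    if (exitCode != 0) {
      throw new IOException("my-decoder exited with code " + exitCode);
    }
  }
}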

The trickier question is how to ensure that the environment the worker is
executing in has the program installed; this is somewhat runner dependent.
Some solutions used in the past have been:
* for runners where you manage the worker pool: preinstall the process on
all workers
* for runners where you have a lot of permissions on the worker pool:
install the process on demand on workers during DoFn setup (see the sketch
after this list)
* for any runner: build a statically linked version of the process, ship
it with your pipeline, and run that
* for runners that support custom containers: extend the Apache Beam Docker
worker container and install your application there
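
For the install-on-demand option, a minimal sketch could run the install
once per DoFn instance in @Setup. This assumes the worker image lets you
install packages and that "my-decoder" (hypothetical name) is available
from a package repository; both of those are runner dependent:

import java.io.IOException;
import org.apache.beam.sdk.transforms.DoFn;

public class InstallToolFn extends DoFn<String, String> {

  @Setup
  public void setup() throws IOException, InterruptedException {
    // Runs once per DoFn instance; assumes the worker allows package
    // installation and that "my-decoder" is installable this way
    // (both hypothetical, runner dependent).
    Process install = new ProcessBuilder("apt-get", "install", "-y", "my-decoder")
        .inheritIO()
        .start();
    if (install.waitFor() != 0) {
      throw new IOException("Failed to install my-decoder on the worker");
    }
  }

  @ProcessElement
  public void processElement(@Element String localPath, OutputReceiver<String> out) {
    // invoke my-decoder as in the earlier sketch
  }
}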


On Thu, Apr 9, 2020 at 4:21 AM Tadas Šubonis <[email protected]>
wrote:

> Hi all,
>
> I am processing tens of TBs of data in the form of several-hundred-MB
> binary files. At the moment, I am using a custom-made Java Queue-Worker
> system to process these files, but I would like to give Apache Beam a go.
>
> The problem is that files need to be read with an external program and the
> whole communication is done via stdio. The system basically needs to
> download a binary file (from a list of many), open it with a specified
> program, read its data using stdio, and return the results for further
> processing (the remainder of the processing happens in the program
> that called the external one).
>
> Is it feasible to migrate this flow to Apache Beam? What would it take to
> make Beam call the external program and communicate via stdio?
>
> In the end, I would like to have a pipeline that is easier to monitor and
> that can reschedule failed steps more easily (automatically).
>
> --
>
> Kind Regards,
> Tadas Šubonis
>
