Hi all,

I am processing tens of TBs of data stored as binary files of several
hundred MB each. At the moment, I am using a custom-made Java queue-worker
system to process these files, but I would like to give Apache Beam a go.

The problem is that the files need to be read with an external program, and
all communication with it is done via stdio. The system basically needs to
download a binary file (from a list of many), open it with the specified
program, read its data over stdio, and return the results for further
processing (the rest of the processing happens in the program that called
the external one).
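
To make the per-file step concrete, here is a minimal sketch of roughly what
a worker does once the file is on local disk. The class and method names
(ExternalReader, readWithExternalProgram) are made up for illustration, and
I am assuming the external program takes its arguments on the command line
and writes line-oriented text to stdout. The download step is left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: runs an external program and collects its stdout.
public class ExternalReader {

    // Starts the command (program name followed by its arguments),
    // reads everything it writes to stdout, and returns it line by line.
    public static List<String> readWithExternalProgram(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Let stderr go to the parent's stderr so it cannot fill up a pipe
        // and block the child while we are only draining stdout.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        // Assumption: nothing needs to be sent on stdin in this sketch,
        // so close it to signal end-of-input to the child.
        process.getOutputStream().close();

        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(),
                        StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }

        // Treat a non-zero exit code as a failure so the caller can retry.
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException(
                    "External program exited with code " + exitCode);
        }
        return lines;
    }
}
```

I would expect to call something like
readWithExternalProgram(List.of("reader-program", "/tmp/input.bin")) once
per file, where both the program name and the path are placeholders. My
guess is that in Beam this would live inside a DoFn, but I have not tried it.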

Is it feasible to migrate this flow to Apache Beam? What would it take to
make Beam call the external program and communicate with it via stdio?

In the end, I would like to have a pipeline that is easier to monitor and
that reschedules failed steps automatically.

-- 

Kind Regards,
Tadas Šubonis
