I meant what is the status for Dataflow runner. On Fri, Oct 26, 2018 at 12:35 PM Nima Mousavi <[email protected]> wrote:
> Thanks everyone. > > @Robert: packaging and portability of the perl script (and the perl > itself) is the biggest caveat in the proposed approach (here > <https://cloud.google.com/blog/products/gcp/running-external-libraries-with-cloud-dataflow-for-grid-computing-workloads>). > Portability API is exactly what we want, but what is its status? When > (roughly) can we expect to have it available? > > On Wed, Oct 24, 2018 at 7:32 PM Reza Rokni <[email protected]> wrote: > >> Hi, >> >> Not directly connected ( its for java sdk ) but some of the concepts in >> these materials maybe useful: >> >> >> https://cloud.google.com/blog/products/gcp/running-external-libraries-with-cloud-dataflow-for-grid-computing-workloads >> >> >> https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/subprocess >> >> >> https://cloud.google.com/solutions/running-external-binaries-beam-grid-computing >> >> >> >> >> On 25 October 2018 at 04:23, Jeff Klukas <[email protected]> wrote: >> >>> Another option here would be to make the perl script operate on batches. >>> Your DoFn could then store the records to a buffer rather than outputting >>> them and then periodically flush the buffer, sending records through the >>> perl script and sending to output. >>> >>> On Wed, Oct 24, 2018 at 3:03 PM Robert Bradshaw <[email protected]> >>> wrote: >>> >>>> While one does want to watch out for expensive per-record operations, >>>> this may still be preferable to (and cheaper than) setting up a server and >>>> making RPC requests. It depends on the nature of the operation. If >>>> executing the perl script is (say) 100ms of "startup" for 1ms of actually >>>> processing $DATA, then you'll be wasting a lot of cycles and a server may >>>> be the way to go, but if it's 1ms of startup for 100ms of processing $DATA >>>> than this startup cost won't matter at all. >>>> >>>> If the startup cost is prohibitive, you could also start up a local >>>> "server" on the worker in startBundle (or even setUp), and shut it down in >>>> finishBundle, and communicate with it in your processElement. >>>> >>>> The other bit is actually shipping your perl script (and, more tricky, >>>> its dependencies). Currently that's very runner-dependent, and typically >>>> you end up packing it as data in your jars and then trying to >>>> unpack/install it on the workers at runtime. One of the goals of >>>> https://beam.apache.org/contribute/portability/ is to make this >>>> easier, specifically, you can set up your worker environment as a docker >>>> container with everything you need and this will get used as the >>>> environment in which your DoFns are executed. >>>> >>>> >>>> On Wed, Oct 24, 2018 at 6:48 AM Sobhan Badiozamany < >>>> [email protected]> wrote: >>>> >>>>> Hi Nima, >>>>> >>>>> I think the answer depends on the use-case, but what you suggest is on >>>>> the list of practices that hurt scalability of pipelines as it will be an >>>>> example of “Expensive Per-Record Operations”, look it up here: >>>>> >>>>> https://cloud.google.com/blog/products/gcp/writing-dataflow-pipelines-with-scalability-in-mind >>>>> >>>>> Cheers, >>>>> Sobi >>>>> >>>>> Sent from my iPhone >>>>> >>>>> On Oct 23, 2018, at 23:35, Nima Mousavi <[email protected]> >>>>> wrote: >>>>> >>>>> Hi, >>>>> >>>>> We have a dataflow pipeline written in Apache python beam, and are >>>>> wondering if we can run a third party code (written in perl) in the >>>>> pipeline. We basically want to run >>>>> >>>>> perl myscript.pl $DATA >>>>> >>>>> for every DATA in a PCollection passed to a DoFn >>>>> >>>>> and write the result back into Bigquery. We could have setup a server >>>>> for myscript.pl, and send HTTP/RPC request to the server from each >>>>> worker instead. But we are wondering if it is possible to run the script >>>>> directly inside the Beam worker? Or even through a docker container >>>>> packaging our perl script? If yes, how? what do you think of this >>>>> approach? >>>>> Any caveat we should be aware of? >>>>> >>>>> Thanks! >>>>> >>>>> >> >> >> -- >> >> This email may be confidential and privileged. If you received this >> communication by mistake, please don't forward it to anyone else, please >> erase all copies and attachments, and please let me know that it has gone >> to the wrong person. >> >> The above terms reflect a potential business arrangement, are provided >> solely as a basis for further discussion, and are not intended to be and do >> not constitute a legally binding obligation. No legally binding obligations >> will be created, implied, or inferred until an agreement in final form is >> executed in writing by all parties involved. >> >
