Re: Is it possible to run a perl script in Dataflow worker?

Nima Mousavi Fri, 26 Oct 2018 09:50:57 -0700

I meant: what is the status for the Dataflow runner?

On Fri, Oct 26, 2018 at 12:35 PM Nima Mousavi <[email protected]>
wrote:


> Thanks everyone.
>
> @Robert: packaging and portability of the perl script (and perl
> itself) is the biggest caveat in the proposed approach (here
> <https://cloud.google.com/blog/products/gcp/running-external-libraries-with-cloud-dataflow-for-grid-computing-workloads>).
> The Portability API is exactly what we want, but what is its status? When
> (roughly) can we expect to have it available?
>
> On Wed, Oct 24, 2018 at 7:32 PM Reza Rokni <[email protected]> wrote:
>
>> Hi,
>>
>> Not directly connected (it's for the Java SDK), but some of the concepts in
>> these materials may be useful:
>>
>>
>> https://cloud.google.com/blog/products/gcp/running-external-libraries-with-cloud-dataflow-for-grid-computing-workloads
>>
>>
>> https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/subprocess
>>
>>
>> https://cloud.google.com/solutions/running-external-binaries-beam-grid-computing
>>
>>
>>
>>
>> On 25 October 2018 at 04:23, Jeff Klukas <[email protected]> wrote:
>>
>>> Another option here would be to make the perl script operate on batches.
>>> Your DoFn could then store the records to a buffer rather than outputting
>>> them and then periodically flush the buffer, sending records through the
>>> perl script and sending to output.
>>>
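The batching idea above can be sketched in Python as follows. This is only a sketch under stated assumptions: it assumes the perl script can read several records from stdin, one per line, and print exactly one result line per record, which the thread does not confirm. The helper name `run_script_on_batch` is hypothetical, and records must not contain newlines.

```python
import subprocess


def run_script_on_batch(batch, command=("perl", "myscript.pl")):
    """Send a whole batch through one invocation of the script.

    Assumption: the script reads one record per line on stdin and writes
    one result line per record on stdout, in the same order.
    """
    payload = "".join(f"{record}\n" for record in batch)
    result = subprocess.run(
        list(command),
        input=payload,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script fails
    )
    outputs = result.stdout.splitlines()
    if len(outputs) != len(batch):
        raise ValueError(
            f"expected {len(batch)} result lines, got {len(outputs)}"
        )
    # Pair each input record with its result.
    return list(zip(batch, outputs))


# Possible use in a Beam Python pipeline (not executed here):
#   records
#   | beam.BatchElements(min_batch_size=10, max_batch_size=500)
#   | beam.FlatMap(run_script_on_batch)
```

Using `beam.BatchElements` instead of a hand-written buffer avoids having to emit leftover records from `finish_bundle`, at the cost of less control over when the flush happens.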
>>> On Wed, Oct 24, 2018 at 3:03 PM Robert Bradshaw <[email protected]>
>>> wrote:
>>>
>>>> While one does want to watch out for expensive per-record operations,
>>>> this may still be preferable to (and cheaper than) setting up a server and
>>>> making RPC requests. It depends on the nature of the operation. If
>>>> executing the perl script is (say) 100ms of "startup" for 1ms of actually
>>>> processing $DATA, then you'll be wasting a lot of cycles and a server may
>>>> be the way to go, but if it's 1ms of startup for 100ms of processing $DATA,
>>>> then this startup cost won't matter at all.
>>>>
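To make the trade-off above concrete, here is a small worked calculation using the two illustrative cases from the message (100ms startup for 1ms of work, and the reverse). The function name `overhead_fraction` is hypothetical, and the numbers are the message's examples rather than measurements.

```python
def overhead_fraction(startup_ms, work_ms):
    """Fraction of each invocation's wall time spent on startup."""
    return startup_ms / (startup_ms + work_ms)


# 100ms of startup for 1ms of processing: almost all time is overhead.
slow_start = overhead_fraction(100, 1)   # 100 / 101

# 1ms of startup for 100ms of processing: overhead is negligible.
fast_start = overhead_fraction(1, 100)   # 1 / 101

print(f"{slow_start:.1%} vs {fast_start:.1%}")
```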
>>>> If the startup cost is prohibitive, you could also start up a local
>>>> "server" on the worker in startBundle (or even setUp), and shut it down in
>>>> finishBundle, and communicate with it in your processElement.
>>>>
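One way to read the local "server" suggestion above is a long-lived child process that is started once and then fed records over stdin/stdout. The sketch below assumes a hypothetical wrapper script, `myscript_server.pl`, that loops over stdin and prints one flushed result line per input line; neither that script nor the `ScriptWorker` class comes from the thread. In the Beam Python SDK, `start()` would be called from a DoFn's `setup` or `start_bundle`, `call()` from `process`, and `stop()` from `teardown` or `finish_bundle`.

```python
import subprocess


class ScriptWorker:
    """Keeps one script process alive and exchanges one line per record."""

    def __init__(self, command=("perl", "myscript_server.pl")):
        # Hypothetical script: reads a line, writes a line, flushes, repeats.
        self.command = list(command)
        self.proc = None

    def start(self):
        self.proc = subprocess.Popen(
            self.command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered on our side of the pipes
        )

    def call(self, record):
        self.proc.stdin.write(f"{record}\n")
        self.proc.stdin.flush()
        reply = self.proc.stdout.readline()
        if not reply:
            raise RuntimeError("script exited before replying")
        return reply.rstrip("\n")

    def stop(self):
        if self.proc is not None:
            self.proc.stdin.close()  # EOF tells the script to exit
            self.proc.wait(timeout=30)
            self.proc.stdout.close()
            self.proc = None
```

This pays the interpreter startup cost once per bundle (or once per DoFn instance) instead of once per record, but it only works if the script can be adapted to run as a loop.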
>>>> The other bit is actually shipping your perl script (and, more tricky,
>>>> its dependencies). Currently that's very runner-dependent, and typically
>>>> you end up packing it as data in your jars and then trying to
>>>> unpack/install it on the workers at runtime. One of the goals of
>>>> https://beam.apache.org/contribute/portability/ is to make this
>>>> easier: specifically, you can set up your worker environment as a docker
>>>> container with everything you need and this will get used as the
>>>> environment in which your DoFns are executed.
>>>>
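For readers on later Beam Python releases, the container-based setup described above is typically selected through pipeline flags. The sketch below only builds the flag list; `--sdk_container_image` is the option name used by recent Beam Python SDKs (older releases used a different flag), and the image, project, and bucket names in the usage comment are placeholders. Whether this was available on the Dataflow runner at the time of this thread is exactly the open question being asked.

```python
def custom_container_flags(image, project, region, temp_location):
    """Build Dataflow pipeline flags that point workers at a custom image.

    Assumption: the image is based on a Beam SDK image and has perl plus
    the script's dependencies installed.
    """
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        f"--sdk_container_image={image}",
    ]


# Possible use (placeholder values, not executed here):
#   flags = custom_container_flags(
#       "gcr.io/my-project/beam-perl:latest",
#       "my-project", "us-central1", "gs://my-bucket/tmp")
#   options = PipelineOptions(flags)
```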
>>>>
>>>> On Wed, Oct 24, 2018 at 6:48 AM Sobhan Badiozamany <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Nima,
>>>>>
>>>>> I think the answer depends on the use case, but what you suggest is on
>>>>> the list of practices that hurt the scalability of pipelines, as it would be an
>>>>> example of “Expensive Per-Record Operations”. See:
>>>>>
>>>>> https://cloud.google.com/blog/products/gcp/writing-dataflow-pipelines-with-scalability-in-mind
>>>>>
>>>>> Cheers,
>>>>> Sobi
>>>>>
>>>>> On Oct 23, 2018, at 23:35, Nima Mousavi <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We have a Dataflow pipeline written with the Apache Beam Python SDK, and are
>>>>> wondering if we can run third-party code (written in perl) in the
>>>>> pipeline. We basically want to run
>>>>>
>>>>> perl myscript.pl $DATA
>>>>>
>>>>> for every DATA in a PCollection passed to a DoFn
>>>>>
>>>>> and write the result back into BigQuery. We could set up a server
>>>>> for myscript.pl and send HTTP/RPC requests to the server from each
>>>>> worker instead. But we are wondering if it is possible to run the script
>>>>> directly inside the Beam worker? Or even through a docker container
>>>>> packaging our perl script? If yes, how? What do you think of this approach?
>>>>> Any caveats we should be aware of?
>>>>>
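The per-record invocation asked about above (`perl myscript.pl $DATA`) could be sketched in Python roughly as follows. It assumes perl and `myscript.pl` are already present on the worker, which is the packaging problem discussed elsewhere in the thread, and that the script prints its result to stdout. The helper name `run_script` and the timeout value are illustrative.

```python
import subprocess


def run_script(data, command=("perl", "myscript.pl"), timeout=60):
    """Run `command data` once and return whatever it printed to stdout."""
    result = subprocess.run(
        [*command, data],   # argument list, so no shell quoting is needed
        capture_output=True,
        text=True,
        check=True,         # raise CalledProcessError on a non-zero exit
        timeout=timeout,    # avoid hanging a worker on a stuck script
    )
    return result.stdout


# Possible use in a Beam Python pipeline (not executed here):
#   results = records | beam.Map(run_script)
```

Each call starts a new perl process, so this form is only reasonable when the script's startup time is small relative to its processing time.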
>>>>> Thanks!
>>>>>
>>>>>
>>
>>
>
