Thanks everyone.

@Robert: packaging and portability of the perl script (and of the Perl
interpreter itself) is the biggest caveat in the proposed approach (here
<https://cloud.google.com/blog/products/gcp/running-external-libraries-with-cloud-dataflow-for-grid-computing-workloads>).
The Portability API is exactly what we want, but what is its status? When
(roughly) can we expect it to be available?

On Wed, Oct 24, 2018 at 7:32 PM Reza Rokni <[email protected]> wrote:

> Hi,
>
> Not directly connected (it's for the Java SDK), but some of the concepts in
> these materials may be useful:
>
>
> https://cloud.google.com/blog/products/gcp/running-external-libraries-with-cloud-dataflow-for-grid-computing-workloads
>
>
> https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/subprocess
>
>
> https://cloud.google.com/solutions/running-external-binaries-beam-grid-computing
>
>
>
>
> On 25 October 2018 at 04:23, Jeff Klukas <[email protected]> wrote:
>
>> Another option here would be to make the perl script operate on batches.
>> Your DoFn could then buffer records rather than outputting them
>> immediately, and periodically flush the buffer, sending each batch through
>> the perl script and emitting the results.
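>>
>> A rough, untested sketch of that idea (assuming the Python SDK, and that
>> myscript.pl accepts a whole batch of records as command-line arguments --
>> both assumptions on my part):
>>
>> import subprocess
>>
>> import apache_beam as beam
>> from apache_beam.transforms.window import GlobalWindow
>> from apache_beam.utils.windowed_value import WindowedValue
>>
>> class BatchedPerlDoFn(beam.DoFn):
>>     def __init__(self, batch_size=100):
>>         self._batch_size = batch_size
>>
>>     def start_bundle(self):
>>         self._buffer = []
>>
>>     def process(self, element):
>>         self._buffer.append(element)
>>         if len(self._buffer) >= self._batch_size:
>>             for result in self._flush():
>>                 yield result
>>
>>     def finish_bundle(self):
>>         # Flush whatever is left over at the end of the bundle; values
>>         # emitted here must be explicitly windowed (global window assumed).
>>         for result in self._flush():
>>             yield WindowedValue(result, GlobalWindow().max_timestamp(),
>>                                 [GlobalWindow()])
>>
>>     def _flush(self):
>>         if not self._buffer:
>>             return
>>         # One perl startup per batch instead of one per record.
>>         out = subprocess.check_output(['perl', 'myscript.pl'] + self._buffer)
>>         self._buffer = []
>>         for line in out.decode('utf-8').splitlines():
>>             yield line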
>>
>> On Wed, Oct 24, 2018 at 3:03 PM Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> While one does want to watch out for expensive per-record operations,
>>> this may still be preferable to (and cheaper than) setting up a server and
>>> making RPC requests. It depends on the nature of the operation. If
>>> executing the perl script is (say) 100ms of "startup" for 1ms of actually
>>> processing $DATA, then you'll be wasting a lot of cycles and a server may
>>> be the way to go, but if it's 1ms of startup for 100ms of processing $DATA
>>> then the startup cost won't matter at all.
>>>
>>> If the startup cost is prohibitive, you could also start up a local
>>> "server" on the worker in startBundle (or even setUp), shut it down in
>>> finishBundle, and communicate with it from your processElement.
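>>>
>>> For instance, a minimal sketch of that pattern (untested; myserver.pl is a
>>> hypothetical long-running perl process that reads one record per line from
>>> stdin and writes one result line per record to stdout):
>>>
>>> import subprocess
>>>
>>> import apache_beam as beam
>>>
>>> class PerlServerDoFn(beam.DoFn):
>>>     def start_bundle(self):
>>>         # Pay the perl startup cost once per bundle, not once per element.
>>>         self._proc = subprocess.Popen(
>>>             ['perl', 'myserver.pl'],
>>>             stdin=subprocess.PIPE, stdout=subprocess.PIPE)
>>>
>>>     def process(self, element):
>>>         self._proc.stdin.write((element + '\n').encode('utf-8'))
>>>         self._proc.stdin.flush()
>>>         yield self._proc.stdout.readline().decode('utf-8').rstrip('\n')
>>>
>>>     def finish_bundle(self):
>>>         self._proc.stdin.close()
>>>         self._proc.wait()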
>>>
>>> The other bit is actually shipping your perl script (and, trickier, its
>>> dependencies). Currently that's very runner-dependent, and typically you
>>> end up packing it as data in your jars and then trying to unpack/install
>>> it on the workers at runtime. One of the goals of
>>> https://beam.apache.org/contribute/portability/ is to make this easier:
>>> specifically, you will be able to set up your worker environment as a
>>> docker container with everything you need, and that container will be
>>> used as the environment in which your DoFns are executed.
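>>>
>>> Once that lands, the usage should be roughly along these lines (purely a
>>> speculative sketch on my part; the runner, flag names, and image are
>>> illustrative and may well change as the portability work evolves):
>>>
>>> from apache_beam.options.pipeline_options import PipelineOptions
>>>
>>> options = PipelineOptions([
>>>     '--runner=PortableRunner',
>>>     '--environment_type=DOCKER',
>>>     # A custom worker image with perl and its dependencies baked in
>>>     # (hypothetical image name).
>>>     '--environment_config=gcr.io/my-project/beam-python-perl:latest',
>>> ])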
>>>
>>>
>>> On Wed, Oct 24, 2018 at 6:48 AM Sobhan Badiozamany <
>>> [email protected]> wrote:
>>>
>>>> Hi Nima,
>>>>
>>>> I think the answer depends on the use case, but what you suggest is on
>>>> the list of practices that hurt the scalability of pipelines, as it is an
>>>> example of “Expensive Per-Record Operations”; see:
>>>>
>>>> https://cloud.google.com/blog/products/gcp/writing-dataflow-pipelines-with-scalability-in-mind
>>>>
>>>> Cheers,
>>>> Sobi
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Oct 23, 2018, at 23:35, Nima Mousavi <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We have a Dataflow pipeline written in Python with Apache Beam, and we
>>>> are wondering whether we can run third-party code (written in perl) in
>>>> the pipeline. We basically want to run
>>>>
>>>> perl myscript.pl $DATA
>>>>
>>>> for every DATA in a PCollection passed to a DoFn, and write the results
>>>> back into BigQuery. We could set up a server for myscript.pl and send an
>>>> HTTP/RPC request to it from each worker instead, but we are wondering if
>>>> it is possible to run the script directly inside the Beam worker, or even
>>>> through a docker container packaging our perl script. If yes, how? What
>>>> do you think of this approach? Any caveats we should be aware of?
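>>>>
>>>> To make it concrete, we are picturing something like this (rough sketch;
>>>> it assumes myscript.pl takes a single record as its argument and prints
>>>> the result on stdout):
>>>>
>>>> import subprocess
>>>>
>>>> import apache_beam as beam
>>>>
>>>> class RunPerlDoFn(beam.DoFn):
>>>>     def process(self, element):
>>>>         # Spawns one perl process per record -- simple, but pays the
>>>>         # interpreter startup cost on every element.
>>>>         out = subprocess.check_output(['perl', 'myscript.pl', element])
>>>>         yield out.decode('utf-8').strip()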
>>>>
>>>> Thanks!
>>>>
>>>>
