Hi everyone and thanks for all the replies and suggestions!

Stateful DoFns seem like something that could do the trick. I'll give them a
try and let you know if I have any particular feedback (compared to the
example I shared previously).
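
As a rough illustration of the behaviour in question, the sketch below uses
plain Python rather than the Beam API. It assumes jobs are keyed by repository
and routed by a stable hash of that key, which stands in for what a grouping
operation or per-key state provides; in Beam the runner, not user code, decides
where each key is processed. The names NUM_WORKERS, worker_for, run, and the
repository and task names are all hypothetical.

```python
import zlib
from collections import defaultdict

NUM_WORKERS = 3  # hypothetical worker count, chosen only for illustration


def worker_for(repo: str) -> int:
    """Stable hash, so the same repository always maps to the same worker."""
    return zlib.crc32(repo.encode("utf-8")) % NUM_WORKERS


def run(jobs):
    """Process (repo, task) pairs and track which repositories get downloaded.

    Returns a pair: downloads per repository, and the set of repositories
    cached on each worker.
    """
    cache = defaultdict(set)      # worker id -> repos already on its local disk
    downloads = defaultdict(int)  # repo -> number of times it was downloaded
    for repo, _task in jobs:
        worker = worker_for(repo)
        if repo not in cache[worker]:
            cache[worker].add(repo)  # stands in for cloning the repository
            downloads[repo] += 1
        # The task itself would run here against the worker's local copy.
    return dict(downloads), dict(cache)


jobs = [
    ("repo-a", "count-lines"),
    ("repo-b", "count-lines"),
    ("repo-a", "list-authors"),
    ("repo-c", "count-lines"),
    ("repo-b", "list-authors"),
]
downloads, cache = run(jobs)
print(downloads)
```

Because every job for a given repository lands on the same worker, each
repository is downloaded once however many tasks refer to it. The workers do
not accept or reject anything themselves; the keying does the routing.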


Thanks,
Ana


On Wed, 8 Sept 2021 at 05:14, Thanh Phan Truong <th...@quod.ai> wrote:

> Hi Ana,
>
> I faced an issue where multiple workers needed to access files from
> downloaded repositories. In my experience, you could try an NFS disk so
> that multiple workers can share the same disk. Performance is slower, so you
> could copy the files to local disk for git operations.
>
> For a Flink on K8S cluster, setting up an NFS disk is quite easy. You can
> also use AWS EBS or another AWS disk that supports ReadWriteMany.
>
> Best,
>
> Thanh
> On Sep 8 2021, at 12:12 am, Ana Markovic <am2...@york.ac.uk> wrote:
>
> Hi Jan,
>
> Thanks for the fast reply! I came across an example that I wanted to
> recreate in Beam, and I'm sharing the link below. Generally speaking, nodes
> keep their favourite words and accept only jobs that involve those
> favourites. This is a simple example but could be beneficial in processing
> large pieces of data (for example, software repositories), where nodes
> could work on the repositories they already processed (and have some files
> already downloaded) and avoid downloading unnecessary repository contents
> if another node already has them. This could be enabled by allowing nodes
> to check their internal state and decide if they want to accept/reject a
> certain repository as a job. I know that the "more complicated" example
> might be far-fetched, but I wanted to give you more context on what I'd
> want to know about Beam.
>
> Thanks for all the insights!
>
> Best,
> Ana
>
> [1]
> https://github.com/crossflowlabs/crossflow/tree/master/org.crossflow.tests/src/org/crossflow/tests/opinionated
>
>
> On Tue, 7 Sept 2021 at 13:57, Jan Lukavský <je...@seznam.cz> wrote:
>
> Hi Ana,
>
> in general, worker nodes do not share any state, and cannot themselves
> decide which work to accept and which to reject. How the work is
> distributed to downstream processing is defined by a runner, not the Beam
> model. On the other hand, what you ask for might possibly be accomplished
> using a grouping operation - either a GroupByKey or a stateful DoFn might
> help you with that. Can you further describe your intent?
>
> Best,
>
>  Jan
>
> On 9/7/21 12:32 PM, Ana Markovic wrote:
>
> To whom it may concern,
>
> I've been looking into polyglot data processing frameworks recently, and I
> read Beam's documentation as well as developed a few examples to get some
> hands-on experience. I've been wondering, and I haven't found this in the
> documentation, is there a way to set up worker nodes so they are
> "opinionated" or "smart" in the sense that they can decide for themselves
> which jobs they will perform? For instance, in a word count example, an
> opinionated worker node could decide to monitor occurrences of a specific
> word only if it's among the node's favourite words.
>
> I hope I explained it well, but please let me know if more details are
> needed to answer this question.
>
> Thanks in advance,
> Ana
>
> --
> Best,
> Ana
>
>
