Hi everyone, and thanks for all the replies and suggestions! Stateful DoFns seem like something that could do the trick. I'll give them a try and let you know if I have any particular feedback (compared to the example I shared previously).

Thanks,
Ana

On Wed, 8 Sept 2021 at 05:14, Thanh Phan Truong <th...@quod.ai> wrote:

> Hi Ana,
>
> I faced the issue where multiple workers need to access file from
> downloaded repositories. From my experiences you could try NFS disk, so
> that multiple workers can share the same disk. Performance is slower so you
> could try to copy it into local disk for git operations.
>
> For a Flink on K8S cluster, setting an NFS disk is quite easy, you can
> also use AWS EBS or AWS disk that support ReadWriteMany.
>
> Best,
>
> Thanh
>
> On Sep 8 2021, at 12:12 am, Ana Markovic <am2...@york.ac.uk> wrote:
>
> Hi Jan,
>
> Thanks for the fast reply! I came across an example that I wanted to
> recreate in Beam, and I'm sharing the link below. Generally speaking, nodes
> keep their favourite words and accept only jobs that involve those
> favourites. This is a simple example but could be beneficial in processing
> large pieces of data (for example, software repositories), where nodes
> could work on the repositories they already processed (and have some files
> already downloaded) and avoid downloading unnecessary repository contents
> if another node already has them. This could be enabled by allowing nodes
> to check their internal state and decide if they want to accept/reject a
> certain repository as a job. I know that the "more complicated" example
> might be a far fetch, but I wanted to give you more context on what I'd
> want to know about Beam.
>
> Thanks for all the insights!
>
> Best,
> Ana
>
> [1]
> https://github.com/crossflowlabs/crossflow/tree/master/org.crossflow.tests/src/org/crossflow/tests/opinionated
>
> On Tue, 7 Sept 2021 at 13:57, Jan Lukavský <je...@seznam.cz> wrote:
>
> Hi Ana,
>
> in general, worker nodes do not share any state, and cannot themselves
> decide which work to accept and which to reject. How the work is
> distributed to downstream processing is defined by a runner, not the Beam
> model. On the other hand, what you ask for might be possibly accomplished
> using a grouping operation - either a GroupByKey or a stateful DoFn might
> help you with that. Can you further describe your intent?
>
> Best,
>
> Jan
>
> On 9/7/21 12:32 PM, Ana Markovic wrote:
>
> To whom this may concern,
>
> I've been looking into polyglot data processing frameworks recently, and I
> read Beam's documentation as well as developed a few examples to get some
> hands-on experience. I've been wondering, and I haven't found this in the
> documentation, is there a way to set up worker nodes so they are
> "opinionated" or "smart" in a sense that they can decide for themselves
> which jobs they will perform? For example, in a word count example, an
> opinionated worker node could only decide to monitor occurrences of a
> specific word if it's among the node's favourite words.
>
> I hope I explained it well, but please let me know if more details are
> needed to answer this question.
>
> Thankful in advance,
> Ana
>
> --
> Best,
> Ana
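
The grouping idea Jan describes (work is routed by key, and state is kept per key rather than chosen by the worker) can be illustrated with a small simulation. This is a plain-Python sketch, not Beam API: the names `NUM_WORKERS`, `worker_for`, and `run`, the CRC32-based routing, and the job list are all hypothetical and only stand in for what a runner might do after a GroupByKey or in front of a stateful DoFn.

```python
import zlib
from collections import defaultdict

NUM_WORKERS = 3  # hypothetical cluster size, for illustration only


def worker_for(key: str) -> int:
    """Deterministically map a key to a worker index.

    This stands in for the runner's key-based routing; a real runner
    decides this itself and the details are not part of the Beam model.
    """
    return zlib.crc32(key.encode("utf-8")) % NUM_WORKERS


def run(jobs):
    """Process (repo, task) jobs, keeping per-key state on the owning worker."""
    # state[worker][repo] -> number of jobs already handled for that repo
    state = defaultdict(dict)
    log = []
    for repo, task in jobs:
        worker = worker_for(repo)
        seen = state[worker].get(repo, 0)
        # First job for a key pays the download cost; later ones reuse it.
        action = "download" if seen == 0 else "reuse"
        state[worker][repo] = seen + 1
        log.append((repo, task, worker, action))
    return log


jobs = [
    ("repo-a", "count-lines"),
    ("repo-b", "count-lines"),
    ("repo-a", "find-todos"),
    ("repo-b", "find-todos"),
    ("repo-a", "list-authors"),
]

log = run(jobs)
for entry in log:
    print(entry)
```

Because every job for a given repository lands on the same worker, each repository is downloaded once and reused afterwards, which is the effect the "opinionated worker" example is after, achieved by keying the data instead of letting workers accept or reject jobs.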