Can you explain a bit more? Where are these data sets coming from?

On Mon, May 24, 2021 at 3:55 PM Stephan Hoyer <[email protected]> wrote:
> I'm not concerned with key-dependent topologies, which I didn't even think
> was possible to express in Beam.
>
> It's more that I already wrote a PTransform for processing a *single* 1 TB
> dataset. Now I want to write a single PTransform that effectively runs the
> original PTransform in groups over ~20 such datasets (ideally without
> needing to know that number 20 ahead of time).
>
> On Mon, May 24, 2021 at 3:30 PM Reuven Lax <[email protected]> wrote:
>
>> Is the issue that you have a different topology depending on the key?
>>
>> On Mon, May 24, 2021 at 2:49 PM Stephan Hoyer <[email protected]> wrote:
>>
>>> Exactly, my use-case has another nested GroupByKey to apply per key. But
>>> even if it could be done in a streaming fashion, it's way too much data
>>> (1 TB) to process on a single worker in a reasonable amount of time.
>>>
>>> On Mon, May 24, 2021 at 2:46 PM Kenneth Knowles <[email protected]> wrote:
>>>
>>>> I was thinking there was some non-trivial topology (such as further
>>>> GBKs) within the logic to be applied to each key group.
>>>>
>>>> Kenn
>>>>
>>>> On Mon, May 24, 2021 at 2:38 PM Brian Hulette <[email protected]> wrote:
>>>>
>>>>> Isn't it possible to read the grouped values produced by a GBK from an
>>>>> Iterable and yield results as you go, without needing to collect all
>>>>> of each input into memory? Perhaps I'm misunderstanding your use-case.
>>>>>
>>>>> Brian
>>>>>
>>>>> On Mon, May 24, 2021 at 10:41 AM Kenneth Knowles <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I'm just pinging this thread because I think it is an interesting
>>>>>> problem and don't want it to slip by.
>>>>>>
>>>>>> I bet a lot of users have gone through the tedious conversion you
>>>>>> describe. Of course, it may often not be possible if you are using a
>>>>>> library transform. There are a number of aspects of the Beam model
>>>>>> that are designed a specific way explicitly *because* we need to
>>>>>> assume that a large number of composites in your pipeline are not
>>>>>> modifiable by you. Most closely related: this is why windowing is
>>>>>> carried along implicitly rather than passed as a parameter to GBK;
>>>>>> making it a parameter would require every transform to expose how it
>>>>>> uses GBK under the hood and to plumb an extra key/WindowFn through
>>>>>> every API. Instead, we have this way to implicitly add a second key
>>>>>> to any transform :-)
>>>>>>
>>>>>> So in addition to being tedious for you, it would be good to have a
>>>>>> better solution.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Fri, May 21, 2021 at 7:18 PM Stephan Hoyer <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I'd like to write a Beam PTransform that applies an *existing* Beam
>>>>>>> transform to each set of grouped values, separately, and combines
>>>>>>> the results. Is anything like this possible with Beam using the
>>>>>>> Python SDK?
>>>>>>>
>>>>>>> Here are the closest things I've come up with:
>>>>>>> 1. If each set of *inputs* to my transform fits into memory, I could
>>>>>>> use GroupByKey followed by FlatMap.
>>>>>>> 2. If each set of *outputs* from my transform fits into memory, I
>>>>>>> could use CombinePerKey.
>>>>>>> 3. If I knew the static number of groups ahead of time, I could use
>>>>>>> Partition, followed by applying my transform multiple times,
>>>>>>> followed by Flatten.
>>>>>>>
>>>>>>> In my scenario, none of these holds true. For example, currently I
>>>>>>> have ~20 groups of values, with each group holding ~1 TB of data.
>>>>>>> My custom transform simply shuffles this TB of data around, so each
>>>>>>> set of outputs is also 1 TB in size.
>>>>>>>
>>>>>>> In my particular case, it seems my options are either to relax these
>>>>>>> constraints or to manually convert each step of my existing
>>>>>>> transform to apply per key. This conversion process is tedious but
>>>>>>> very straightforward, e.g., the GroupByKey and ParDo that my
>>>>>>> transform is built out of just need to deal with an expanded key.
>>>>>>>
>>>>>>> I wonder, could this be something built into Beam itself, e.g., as
>>>>>>> TransformPerKey? The PTransforms that result from combining other
>>>>>>> Beam transforms (e.g., _ChainPTransform in Python) are private, so
>>>>>>> this seems like something that would need to exist in Beam itself,
>>>>>>> if it could exist at all.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stephan
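
To make option 3 from Stephan's list concrete, here is a minimal sketch of
the Partition / apply / Flatten pattern. MyTransform, NUM_GROUPS, and the toy
input are hypothetical stand-ins for the real transform and datasets; the
pattern only works because NUM_GROUPS is fixed at pipeline-construction time,
which is exactly the constraint Stephan wants to avoid.

import apache_beam as beam

class MyTransform(beam.PTransform):
    # Hypothetical stand-in for the existing single-dataset transform.
    def expand(self, pcoll):
        return pcoll | beam.Map(lambda kv: kv)

NUM_GROUPS = 20  # must be known when the pipeline is constructed

with beam.Pipeline() as p:
    keyed = p | beam.Create([(i % NUM_GROUPS, i) for i in range(100)])

    # Split into a static number of PCollections, one per group.
    parts = keyed | beam.Partition(lambda kv, n: kv[0] % n, NUM_GROUPS)

    # Apply the existing transform once per part (unique labels required).
    transformed = [part | f'Apply{i}' >> MyTransform()
                   for i, part in enumerate(parts)]

    # Merge the per-group results back into one PCollection.
    merged = tuple(transformed) | beam.Flatten()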
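
And a sketch of the manual "expanded key" conversion Stephan describes,
assuming a toy FirstValuePerKey transform (built from GroupByKey + FlatMap)
stands in for the real one: each step is rewritten to carry a (group, key)
compound key in place of the bare key.

import apache_beam as beam

class FirstValuePerKey(beam.PTransform):
    # Hypothetical single-dataset transform built from GBK + FlatMap:
    # keeps one arbitrary value per key.
    def expand(self, pcoll):  # input elements: (key, value)
        return (pcoll
                | beam.GroupByKey()
                | beam.FlatMap(lambda kv: [(kv[0], next(iter(kv[1])))]))

class FirstValuePerKeyPerGroup(beam.PTransform):
    # The same logic converted by hand to run separately per group: the
    # key is expanded to a (group, key) compound key before the GBK and
    # unpacked again afterwards.
    def expand(self, pcoll):  # input elements: (group, (key, value))
        return (pcoll
                | beam.Map(lambda gkv: ((gkv[0], gkv[1][0]), gkv[1][1]))
                | beam.GroupByKey()  # groups by (group, key)
                | beam.FlatMap(
                    lambda kv: [(kv[0][0],
                                 (kv[0][1], next(iter(kv[1]))))]))

Every GroupByKey and keyed ParDo in the original has to be touched this way,
which is the tedium in question, and also exactly the implicit second key
that windowing already provides, per Kenn's point.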
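
For completeness, a sketch of what Brian suggests: on runners that supply the
post-GBK values as a lazy Iterable (Dataflow does this for large groups), a
DoFn can stream over a group and yield as it goes, so the inputs never need
to fit in memory at once. The running-total example is hypothetical, and as
Stephan notes, this doesn't help once the per-group logic needs its own
GroupByKey.

import apache_beam as beam

class RunningTotals(beam.DoFn):
    # Streams over the Iterable produced by GroupByKey without
    # materializing it: roughly constant memory per group.
    def process(self, element):
        key, values = element  # `values` may be lazily paged in
        total = 0
        for v in values:       # note: no list(values) anywhere
            total += v
            yield key, total

with beam.Pipeline() as p:
    _ = (p
         | beam.Create([('a', 1), ('a', 2), ('b', 3)])
         | beam.GroupByKey()
         | beam.ParDo(RunningTotals()))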
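
Finally, Kenn's windowing analogy in code form: a composite we cannot modify
(the LibraryCount below is a hypothetical stand-in for any library transform)
can be made to operate per window simply by applying WindowInto upstream,
because the window rides along as an implicit second key through every
internal GroupByKey. A TransformPerKey would need to thread an extra explicit
key through composites in much the same way.

import apache_beam as beam
from apache_beam.transforms import window

class LibraryCount(beam.PTransform):
    # Stand-in for an unmodifiable composite; it groups by key
    # internally without exposing that fact.
    def expand(self, pcoll):
        return pcoll | beam.combiners.Count.PerElement()

with beam.Pipeline() as p:
    _ = (p
         | beam.Create([('a', 10), ('b', 70), ('a', 130)])
         | beam.Map(lambda kv: window.TimestampedValue(kv[0], kv[1]))
         | beam.WindowInto(window.FixedWindows(60))
         # LibraryCount is unchanged, yet its internal grouping now
         # happens per 60-second window as well as per element.
         | LibraryCount())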
