Hi Vilhelm,

The Python SDK currently does not support stateful processing. We should update the capability matrix to show this. I filed https://issues.apache.org/jira/browse/BEAM-2687 to track this feature. Feel free to follow it there or, better, make it happen. As far as I know, nobody is actively working on it, and it is unlikely to be supported in the short term.
Thank you,
Ahmet

On Tue, Jul 25, 2017 at 3:49 AM, Vilhelm von Ehrenheim <[email protected]> wrote:
> Hi!
> Is there any way to do stateful processing in the Python Beam SDK?
>
> I am trying to train an LSHForest for approximate nearest neighbor search.
> Using the scikit-learn implementation it is possible to do partial fits, so
> I can gather up mini batches and fit the model on those in sequence using a
> ParDo. However, to my understanding, there is no way for me to control how
> many bundles the ParDo will execute over, so the training makes little sense
> and I will end up with a lot of different models rather than one.
>
> Another approach would be to create a CombineFn whose accumulator is trained
> on the incoming values, but there is no intuitive way to combine models in
> `merge_accumulators`, so I don't think that will fit either.
>
> Does it make sense to pass the whole PCollection as a list in a side input
> and train the model that way? In that case, how should I chop the
> PCollection into batches that I can loop over in a nice way? If I read the
> whole set I'll most likely run out of memory.
>
> I've found that stateful processing exists in the Java SDK, but it seems to
> be missing in Python still.
>
> Any help/ideas are greatly appreciated.
>
> Thanks,
> Vilhelm von Ehrenheim
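
[Editor's note: a minimal, untested sketch of the CombineFn variant discussed in the quoted message. Instead of merging partially trained models in `merge_accumulators`, the accumulator holds the raw feature vectors (merging is just list concatenation) and the model is fit once in `extract_output`. The batch size, LSHForest settings, and the `feature_vectors` input are illustrative assumptions, not part of the original thread.]

import apache_beam as beam
from sklearn.neighbors import LSHForest


class TrainLSHForestFn(beam.CombineFn):
    """Accumulates feature vectors and fits a single LSHForest at the end."""

    def create_accumulator(self):
        # The accumulator is a plain list of feature vectors.
        return []

    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        # Merging is trivial because accumulators hold raw vectors,
        # not partially trained models.
        merged = []
        for acc in accumulators:
            merged.extend(acc)
        return merged

    def extract_output(self, accumulator):
        # Fit one model over the accumulated data in mini batches.
        model = LSHForest()
        batch_size = 1000  # illustrative value
        for start in range(0, len(accumulator), batch_size):
            model.partial_fit(accumulator[start:start + batch_size])
        return model


# Illustrative usage: yields a single-element PCollection containing the model.
# with beam.Pipeline() as p:
#     trained = (p
#                | beam.Create(feature_vectors)
#                | beam.CombineGlobally(TrainLSHForestFn()))

Note that this still materializes the accumulated vectors on the worker that runs `extract_output`, so it shares the memory concern raised in the quoted message and is only a stopgap until stateful processing lands in the Python SDK.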
