Hi! Is there any way to do stateful processing in the Python Beam SDK? I am trying to train an LSHForest for approximate nearest neighbor search. The scikit-learn implementation supports `partial_fit`, so my first idea was to gather mini-batches and fit the model on them in sequence inside a ParDo. However, as far as I understand, I have no control over how many bundles the ParDo will execute over, so the training makes little sense: I end up with a lot of different models rather than one (rough sketch below).
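Roughly what I mean, as a sketch rather than working code (`BATCH_SIZE` and the model settings are placeholders I made up):

```python
import apache_beam as beam
from apache_beam.transforms.window import GlobalWindow
from apache_beam.utils.timestamp import MIN_TIMESTAMP
from apache_beam.utils.windowed_value import WindowedValue
from sklearn.neighbors import LSHForest


class TrainLSHForestFn(beam.DoFn):
    """Buffers elements into mini-batches and trains via partial_fit.

    The buffer and the model only live for the duration of one bundle,
    and I can't control how the runner splits the input into bundles,
    so every bundle ends up emitting its own independently trained model.
    """

    BATCH_SIZE = 1000  # placeholder

    def start_bundle(self):
        self._model = LSHForest()
        self._batch = []

    def process(self, element):
        self._batch.append(element)
        if len(self._batch) >= self.BATCH_SIZE:
            self._model.partial_fit(self._batch)
            self._batch = []

    def finish_bundle(self):
        if self._batch:
            self._model.partial_fit(self._batch)
        # One model per bundle -- this is exactly the problem.
        yield WindowedValue(self._model, MIN_TIMESTAMP, [GlobalWindow()])
```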
Another approach would be a CombineFn whose accumulator is trained incrementally, but there is no intuitive way to combine two models in `merge_accumulators`, so I don't think that fits either (see the P.S. below for the skeleton I had in mind). Does it make sense to pass the whole PCollection as a list in a side input and train the model that way? In that case, how should I chop the PCollection into batches that I can loop over in a nice way? If I read the whole set at once I will most likely run out of memory.

I've found that stateful processing exists in the Java SDK, but it seems to still be missing in Python. Any help/ideas are greatly appreciated.

Thanks,
Vilhelm von Ehrenheim
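P.S. For concreteness, here is the CombineFn shape I was considering (again just a sketch; `BATCH_SIZE` is a placeholder), with the part I'm stuck on marked:

```python
import apache_beam as beam
from sklearn.neighbors import LSHForest


class TrainLSHForestCombineFn(beam.CombineFn):
    """Accumulator is (model, buffer); trains the model via partial_fit."""

    BATCH_SIZE = 1000  # placeholder

    def create_accumulator(self):
        return LSHForest(), []

    def add_input(self, accumulator, element):
        model, batch = accumulator
        batch.append(element)
        if len(batch) >= self.BATCH_SIZE:
            model.partial_fit(batch)
            batch = []
        return model, batch

    def merge_accumulators(self, accumulators):
        # This is where it breaks down: there is no meaningful way to
        # merge two independently trained LSHForest models into one.
        raise NotImplementedError

    def extract_output(self, accumulator):
        model, batch = accumulator
        if batch:
            model.partial_fit(batch)
        return model
```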
