Hi Vilhelm,

The Python SDK currently does not support stateful processing. We should
update the capability matrix to reflect this. I filed
https://issues.apache.org/jira/browse/BEAM-2687 to track this feature. Feel
free to follow it there or, better yet, make it happen. As far as I know,
nobody is actively working on it, and it is unlikely to be supported in the
short term.

Thank you,
Ahmet

On Tue, Jul 25, 2017 at 3:49 AM, Vilhelm von Ehrenheim <
[email protected]> wrote:

> Hi!
> Is there any way to do stateful processing in Python Beam SDK?
>
> I am trying to train an LSHForest for approximate nearest neighbor search.
> Using the scikit-learn implementation it is possible to do partial fits, so
> I can gather up mini-batches and fit the model on those in sequence using a
> ParDo. However, to my understanding, there is no way for me to control how
> many bundles the ParDo will execute over, so the training makes little
> sense and I will end up with a lot of different models rather than one.
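> 
> Roughly what I mean (just a sketch; the batch size is an arbitrary
> placeholder):
> 
>     import apache_beam as beam
>     from sklearn.neighbors import LSHForest
> 
>     class TrainPerBundle(beam.DoFn):
>         # Fits a fresh LSHForest per bundle -- hence many models, not one.
>         def start_bundle(self):
>             self._model = LSHForest()
>             self._batch = []
> 
>         def process(self, element, batch_size=100):
>             self._batch.append(element)
>             if len(self._batch) >= batch_size:
>                 self._model.partial_fit(self._batch)
>                 self._batch = []
> 
>         def finish_bundle(self):
>             if self._batch:
>                 self._model.partial_fit(self._batch)
>             # self._model now holds one model per bundle; collecting these
>             # into a single model is exactly the part I am missing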
>
> Another approach would be to create a CombineFn that accumulates values by
> training the model on them, but there is no intuitive way to combine models
> in `merge_accumulators`, so I don't think that will fit either.
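> 
> Something like this is roughly what I have in mind (again just a sketch,
> with an arbitrary batch size), and the merge step is where I get stuck:
> 
>     import apache_beam as beam
>     from sklearn.neighbors import LSHForest
> 
>     class LSHForestFn(beam.CombineFn):
>         def create_accumulator(self):
>             return LSHForest(), []  # (model, pending mini-batch)
> 
>         def add_input(self, accumulator, element):
>             model, batch = accumulator
>             batch.append(element)
>             if len(batch) >= 100:
>                 model.partial_fit(batch)
>                 batch = []
>             return model, batch
> 
>         def merge_accumulators(self, accumulators):
>             # No obvious way to merge two partially fitted LSHForests.
>             raise NotImplementedError()
> 
>         def extract_output(self, accumulator):
>             model, batch = accumulator
>             if batch:
>                 model.partial_fit(batch)
>             return model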
>
> Does it make sense to pass the whole PCollection as a list in a side input
> and train the model that way? In that case, how should I chop the
> PCollection into batches that I can loop over in a nice way? If I read the
> whole set at once I'll most likely run out of memory.
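> 
> Concretely, something like this is what I am picturing (a sketch with
> made-up names; the side-input list still has to fit in memory, which is
> my worry):
> 
>     import apache_beam as beam
>     from sklearn.neighbors import LSHForest
> 
>     def train_on_all(_, points, batch_size=100):
>         # 'points' is the whole PCollection as a side-input list; loop
>         # over it in fixed-size chunks so each partial_fit stays small.
>         model = LSHForest()
>         for start in range(0, len(points), batch_size):
>             model.partial_fit(points[start:start + batch_size])
>         yield model
> 
>     with beam.Pipeline() as p:
>         points = p | 'Read' >> beam.Create([[0.1, 0.2], [0.3, 0.4]])
>         model = (p
>                  | 'Seed' >> beam.Create([None])
>                  | 'Train' >> beam.FlatMap(
>                      train_on_all, beam.pvalue.AsList(points)))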
>
> I've found that stateful processing exists in the Java SDK, but it seems
> to be missing in Python still.
>
> Any help/ideas are greatly appreciated.
>
> Thanks,
> Vilhelm von Ehrenheim
>
