Hello there,
I'm currently working on moving parts of my company's architecture toward
something more cloud native. I ended up using Beam, as it's quite
straightforward to use with our current stack.

I have some pipelines working already, but I need to move some of them
to splittable DoFns (SDFs) to increase parallelism.
I tried to implement an SDF with a custom RestrictionTracker according to
the documentation here https://beam.apache.org/documentation/programming-guide/  and
the code here
https://github.com/apache/beam/blob/8e217ea0d1f383ef5033ef507b14d01edf9c67e6/sdks/python/apache_beam/io/iobase.py
but a few things were surprising and some help would be welcome.
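
For reference, the overall shape of what I wrote is roughly the following
(a simplified sketch: the names and the offset-based restriction are
placeholders for my real code):

import apache_beam as beam
from apache_beam.io.restriction_trackers import OffsetRange, OffsetRestrictionTracker
from apache_beam.transforms.core import RestrictionProvider

class MyRestrictionProvider(RestrictionProvider):
    def initial_restriction(self, element):
        # One restriction covering the whole element (placeholder logic).
        return OffsetRange(0, len(element))

    def create_tracker(self, restriction):
        # My real code returns a custom RestrictionTracker here.
        return OffsetRestrictionTracker(restriction)

    def restriction_size(self, element, restriction):
        return restriction.size()

class MySplittableDoFn(beam.DoFn):
    def process(
        self,
        element,
        tracker=beam.DoFn.RestrictionParam(MyRestrictionProvider())):
        position = tracker.current_restriction().start
        while tracker.try_claim(position):
            yield element[position]
            position += 1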


   1. First, I had to delete the compiled Cython common.pdx file as it was
   preventing Beam from finding my restriction param (reinstalling apache-beam
   through pip didn't solve the problem).
   2. It looks like with an SDF the setup and teardown methods of my DoFn are
   never called.
   3. check_done is called on my restriction tracker before the process
   method is called on my DoFn. This is odd, as the documentation states that
   check_done should raise an error if there is unclaimed work (and no work
   is claimed if process hasn't been called yet); a rough sketch of my
   tracker follows this list.
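
To make point 3 concrete, my tracker's check_done is modelled on
OffsetRestrictionTracker and looks roughly like this (again a simplified
sketch using an offset-style restriction, not my exact code):

from apache_beam.io.iobase import RestrictionTracker
from apache_beam.io.restriction_trackers import OffsetRange

class MyRestrictionTracker(RestrictionTracker):
    def __init__(self, offset_range):
        self._range = offset_range
        self._last_claimed = None  # nothing claimed until try_claim succeeds

    def current_restriction(self):
        return self._range

    def try_claim(self, position):
        if position < self._range.stop:
            self._last_claimed = position
            return True
        return False

    def check_done(self):
        # Per the documentation, raise if part of the restriction was never
        # claimed -- which is exactly the state it is in when it runs before
        # process() has claimed anything.
        if self._last_claimed is None or self._last_claimed < self._range.stop - 1:
            raise ValueError(
                'Restriction %s is not done (last claimed position: %s)'
                % (self._range, self._last_claimed))

    def try_split(self, fraction_of_remainder):
        return None  # splitting not supported in this sketch

    # current_progress() and is_bounded() omitted for brevity.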


For the record, I am using a GroupIntoBatches later in the pipeline, which
looks like this (a rough code sketch follows the list):

   - beam.Create (very small initial set)
   - beam.ParDo (regular DoFn)
   - beam.ParDo (SDF)
   - beam.ParDo (regular DoFn)
   - beam.GroupIntoBatches
   - beam.ParDo (regular DoFn)
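
In code the pipeline is roughly shaped like this (the transform names,
batch size and stub DoFns are placeholders, and MySplittableDoFn refers to
the sketch above):

import apache_beam as beam

class PrepareFn(beam.DoFn):  # placeholder for the first regular DoFn
    def process(self, element):
        yield element

class EnrichFn(beam.DoFn):  # placeholder; emits (key, value) pairs
    def process(self, element):  # because GroupIntoBatches expects keyed input
        yield ('key', element)

class WriteFn(beam.DoFn):  # placeholder for the final regular DoFn
    def process(self, keyed_batch):
        yield keyed_batch

with beam.Pipeline() as p:
    (p
     | 'Seed' >> beam.Create(['a', 'b'])  # very small initial set
     | 'Prepare' >> beam.ParDo(PrepareFn())
     | 'Expand' >> beam.ParDo(MySplittableDoFn())
     | 'Enrich' >> beam.ParDo(EnrichFn())
     | 'Batch' >> beam.GroupIntoBatches(batch_size=100)
     | 'Write' >> beam.ParDo(WriteFn()))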


Any help would be greatly appreciated


-- 
Julien CHATY-CAPELLE

Full stack developer/Integrator
Deepomatic
