Hello there, I'm currently working on moving parts of my company's architecture toward something more cloud native. I ended up using Beam as it's quite straightforward to use with our current stack.
I have some pipelines working at this time, but I need to move some of them to SDF to increase parallelism. I tried to implement an SDF with a custom RestrictionTracker according to the doc here https://beam.apache.org/documentation/programming-guide/ and the code here https://github.com/apache/beam/blob/8e217ea0d1f383ef5033ef507b14d01edf9c67e6/sdks/python/apache_beam/io/iobase.py but a few things were surprising and some help would be welcome:

1. First, I had to delete the compiled Cython common.pxd file, as it was preventing Beam from finding my restriction param (reinstalling apache-beam through pip didn't solve the problem).
2. It looks like with an SDF the setup and teardown methods of my DoFn are never called.
3. check_done is called on my restriction tracker before the process method is called on my DoFn. This is weird, as the documentation states that check_done should raise an error if there is unclaimed work (and no work is claimed if process hasn't been called yet).

For the record, I am using a GroupIntoBatches later in the pipeline, which looks like this:

- beam.Create (very small initial set)
- beam.ParDo (regular DoFn)
- beam.ParDo (SDF)
- beam.ParDo (regular DoFn)
- beam.GroupIntoBatches
- beam.ParDo (regular DoFn)

Any help would be greatly appreciated. I have included a stripped-down sketch of my SDF and tracker below, after my signature.

--
Julien CHATY-CAPELLE
Full stack developer/Integrator
Deepomatic
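
P.S. To make the setup concrete, here is a stripped-down sketch of the kind of SDF and custom RestrictionTracker I am describing. This is not my real code: the restriction is just a placeholder OffsetRange and the names are made up, but the structure (a RestrictionProvider, a custom tracker, and a process method taking a RestrictionParam) is the same as what I am running.

import apache_beam as beam
from apache_beam.io.iobase import RestrictionProgress, RestrictionTracker
from apache_beam.io.restriction_trackers import OffsetRange
from apache_beam.transforms.core import RestrictionProvider


class MyRestrictionTracker(RestrictionTracker):
    """Toy tracker over an OffsetRange, claiming one offset at a time."""

    def __init__(self, restriction):
        self._restriction = restriction
        self._last_claimed = None  # last successfully claimed offset

    def current_restriction(self):
        return self._restriction

    def try_claim(self, position):
        if self._restriction.start <= position < self._restriction.stop:
            self._last_claimed = position
            return True
        return False

    def check_done(self):
        # Per the docs, this should raise if some work in the restriction
        # was never claimed; this is the method I see being called before
        # process() has run.
        if (self._last_claimed is None
                or self._last_claimed < self._restriction.stop - 1):
            raise ValueError(
                'Restriction %s is not done (last claimed offset: %s)' %
                (self._restriction, self._last_claimed))

    def try_split(self, fraction_of_remainder):
        # Unsplittable in this sketch, to keep things simple.
        return None

    def is_bounded(self):
        return True

    def current_progress(self):
        total = self._restriction.stop - self._restriction.start
        completed = (0 if self._last_claimed is None
                     else self._last_claimed - self._restriction.start + 1)
        return RestrictionProgress(completed=completed,
                                   remaining=total - completed)


class MyRestrictionProvider(RestrictionProvider):

    def initial_restriction(self, element):
        return OffsetRange(0, 10)  # placeholder: 10 units of work per element

    def create_tracker(self, restriction):
        return MyRestrictionTracker(restriction)

    def restriction_size(self, element, restriction):
        return restriction.size()


class MySplittableDoFn(beam.DoFn):

    def setup(self):
        # Issue 2: I never see this (or teardown) being invoked for the SDF.
        pass

    def process(
            self,
            element,
            tracker=beam.DoFn.RestrictionParam(MyRestrictionProvider())):
        restriction = tracker.current_restriction()
        for position in range(restriction.start, restriction.stop):
            if not tracker.try_claim(position):
                return
            yield (element, position)

    def teardown(self):
        pass

If there is something obviously wrong in that structure (for example in how the RestrictionParam is declared), that may well explain what I am seeing.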
