Many of my Beam pipelines start with partitioning over some large, statically known number of inputs that could be created from a list of sequential integers.
In Python, these sequential integers can be represented compactly with a range() object, which stores only its start, stop, and step. However, beam.Create() always loads its arguments into memory as a Python tuple. I suspect this will become problematic for really big ranges (e.g., in the tens of millions of elements). I am thinking it would make sense either to add a special optimization to beam.Create() so it recognizes range() objects, or to add a dedicated transform like beam.Range() for creating a range of inputs.

Any thoughts?

Cheers,
Stephan

P.S. I initially looked into writing this as a trivial exercise for trying out the Splittable DoFn API, but despite reading the docs several times I could not figure out how it is supposed to work. A self-contained, complete example would go a long way toward making that API accessible to new users.
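P.P.S. To make the memory argument concrete, here is a quick illustration of why range() is so cheap compared to a materialized tuple, along with a sketch of one possible workaround in the meantime: create only chunk start points with beam.Create(), then expand each chunk lazily on the workers. The helper names and chunk size below are illustrative, not actual Beam API:

```python
import sys

# range() is a lazy sequence: it stores only start, stop, and step, so its
# memory footprint is constant regardless of length.
big = range(100_000_000)
print(sys.getsizeof(big))    # ~48 bytes on CPython

# beam.Create(range(n)) materializes the values as a tuple; even a tuple of
# just 1000 small ints takes ~8 KB, and it grows linearly with n.
small = tuple(range(1_000))
print(sys.getsizeof(small))

# Possible workaround: distribute chunk boundaries, not elements.
def chunk_starts(n, chunk_size):
    """Yield the start of each chunk covering [0, n)."""
    return range(0, n, chunk_size)

def expand_chunk(start, n, chunk_size):
    """Lazily yield the integers in one chunk."""
    return range(start, min(start + chunk_size, n))

# In a pipeline this might look like (sketch, untested):
#   p | beam.Create(chunk_starts(n, chunk_size))
#     | beam.FlatMap(expand_chunk, n=n, chunk_size=chunk_size)
```

With this shape, beam.Create() only ever holds n / chunk_size integers in memory, and each worker expands at most chunk_size elements at a time.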