Many of my Beam pipelines start with partitioning over some large,
statically known number of inputs that could be created from a list of
sequential integers.

In Python, these sequential integers can be efficiently represented with a
range() object, which stores only the start, stop, and step. However,
beam.Create() always loads its arguments into memory, as a Python tuple. I
suspect this will start to become problematic for really big ranges (e.g.,
in the tens of millions).
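To illustrate the difference (a small standalone snippet, nothing
Beam-specific): a range() object takes constant space no matter how many
integers it describes, while materializing even a modest range as a tuple
costs orders of magnitude more.

```python
import sys

# A range object stores only start/stop/step, so its size is constant
# regardless of how many integers it represents.
r = range(10_000_000)

# Materializing even a tiny fraction of that as a tuple already costs
# kilobytes; tens of millions of elements would cost hundreds of MB.
t = tuple(range(1000))

print(sys.getsizeof(r))  # a few dozen bytes
print(sys.getsizeof(t))  # several kilobytes for just 1000 elements
```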

I am thinking that it would make sense to either add a special optimization
to beam.Create() to recognize range() objects, or to add a dedicated
transform like beam.Range() for creating a range of inputs.
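For concreteness, here is a rough sketch of the splitting logic such a
transform might use: describe the overall range as a handful of small
sub-ranges that workers can expand lazily. (beam.Range() and split_range()
are hypothetical names, not an existing Beam API.)

```python
# Hypothetical helper: partition range(start, stop) into num_shards
# contiguous sub-ranges. Each sub-range is cheap to describe (just three
# ints), so only the expanded elements ever need to exist on a worker.
def split_range(start, stop, num_shards):
    total = stop - start
    shards = []
    for i in range(num_shards):
        lo = start + i * total // num_shards
        hi = start + (i + 1) * total // num_shards
        shards.append(range(lo, hi))
    return shards
```

A beam.Range() transform could then be little more than
Create(shard descriptions) followed by a FlatMap that yields each
sub-range's elements.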

Any thoughts?

Cheers,
Stephan

P.S. I initially looked into writing this as a trivial case for trying out
the Splittable DoFn API, but despite reading the docs several times, I could
not figure out how it is supposed to work. A self-contained, complete
example would really help make that API accessible to new users.
