Combine admits many more execution plans than stateful ParDo:
- "Combiner lifting" (also called "mapper-side combine"), in which the CombineFn is
used to reduce data before shuffling. This is tremendous in batch, but can
still matter in streaming.
- Hot key fanout & recombine. This is important in both batch and streaming; both knobs are illustrated in the sketch below.
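As a concrete illustration, here is a minimal Python sketch (the data and the fanout value are made up). The runner can lift the CombineFn ahead of the shuffle on its own; with_hot_key_fanout adds an explicit intermediate recombine for skewed keys:

import apache_beam as beam

with beam.Pipeline() as p:
    totals = (
        p
        | beam.Create([("a", 1), ("a", 2), ("b", 3)])  # toy input
        # Combiner lifting: the runner may pre-sum per bundle before
        # the shuffle, since this is a CombineFn-backed transform.
        | beam.CombinePerKey(sum)
          # Hot key fanout: split hot keys 16 ways, combine the pieces,
          # then recombine the partial results after the shuffle.
          .with_hot_key_fanout(16)
        | beam.Map(print))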
I would phrase it more optimistically :)
While there is no way to generically apply a PTransform
elementwise, Beam does have a pattern / best practice for developing IO
connectors that can be applied elementwise - it's called "readAll", and some
IOs provide this, e.g. TextIO.readAll() and JdbcIO.readAll().
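Those are the Java transforms; the Python SDK has an analogue of the same "readAll" shape, ReadAllFromText, which consumes a PCollection of file patterns rather than rooting the read at the pipeline. A minimal sketch (the bucket paths are made up):

import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline() as p:
    lines = (
        p
        # File patterns arrive as ordinary elements of a PCollection.
        | beam.Create(["gs://my-bucket/a/*.txt", "gs://my-bucket/b/*.txt"])
        # ReadAllFromText maps a PCollection of patterns to a
        # PCollection of the lines in the matched files.
        | ReadAllFromText())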
1) You're looking for SplittableDoFn[1]. It is still in development, and
converting all of the IO connectors that exist today to consume a
PCollection of resources is yet to come.
There are some limited use cases covered already, like FileIO.match[2], and
if these fit your use case you can use them now.
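FileIO.match is the Java side; for a Python view of the same idea, fileio.MatchAll and fileio.ReadMatches together expand a PCollection of file patterns element-wise. A minimal sketch, with a made-up pattern:

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    contents = (
        p
        | beam.Create(["gs://my-bucket/logs/*.csv"])  # made-up pattern
        | fileio.MatchAll()     # expand each pattern into file metadata
        | fileio.ReadMatches()  # open each matched file
        | beam.Map(lambda readable_file: readable_file.read_utf8()))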
Hello,
I wanted to write Beam code which expands an incoming `PCollection`,
element-wise, by use of existing IO components. An example would be a
`PCollection` holding arbitrary paths to data that I
want to load via `HadoopFormatIO.Read`, which is a
`PTransform<PBegin, PCollection<KV<K, V>>>`.
Problem is, I cannot apply such a transform per element, since it starts
from PBegin rather than from a PCollection.
Hi,
Getting messages from Pub/Sub and then saving them into hourly or other
interval files on GCS does not work on Cloud Dataflow. The job only writes
the files when I shut it down. Is this not yet supported in the
Python SDK, or am I doing something wrong?
Here is a snippet of my code:
p = beam.Pipeline(...)
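For reference, a minimal sketch of the pipeline shape being described, assuming an SDK version where fileio.WriteToFiles supports streaming writes; the topic, bucket path, and window size are placeholders:

import apache_beam as beam
from apache_beam import window
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # streaming mode is required
with beam.Pipeline(options=options) as p:
    (p
     # Topic name is a placeholder.
     | beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
     | beam.Map(lambda payload: payload.decode("utf-8"))  # messages arrive as bytes
     # Hourly fixed windows let the sink finalize one set of files per window.
     | beam.WindowInto(window.FixedWindows(60 * 60))
     | fileio.WriteToFiles(
         path="gs://my-bucket/output/",
         sink=lambda dest: fileio.TextSink()))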
Hi Jitendra,
I'm not sure yet, but I think this may be an Avro dependency problem. See
[1] for a better explanation than I can provide (especially the "Update" in
the accepted answer). [2] is the place that is throwing the exception. Is
org.apache.avro.JsonProperties.NULL_VALUE present in your runtime classpath?