Re: Performance of stateful DoFn vs CombineByKey

2019-03-14 Thread Kenneth Knowles
Combine admits many more execution plans than stateful ParDo: - "Combiner lifting" or "mapper-side combine", in which the CombineFn is used to reduce data before shuffling. This is tremendous in batch, but can still matter in streaming. - Hot key fanout & recombine. This is important in both bat
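The snippet above is truncated, but the "combiner lifting" idea it names can be sketched in plain Python with no Beam dependency (`SumFn` and `combine_with_lifting` are illustrative names, not Beam API): the runner applies the CombineFn locally on each mapper, shuffles only the small per-shard accumulators, and merges them on the reducer side.

```python
# A minimal CombineFn-style interface, mirroring Beam's
# create_accumulator / add_input / merge_accumulators / extract_output.
class SumFn:
    def create_accumulator(self):
        return 0

    def add_input(self, acc, value):
        return acc + value

    def merge_accumulators(self, accs):
        return sum(accs)

    def extract_output(self, acc):
        return acc


def combine_with_lifting(fn, shards):
    """Pre-combine each shard locally ("mapper-side combine"),
    then merge only the per-shard accumulators after the shuffle."""
    lifted = []
    for shard in shards:  # each shard lives on one worker
        acc = fn.create_accumulator()
        for v in shard:
            acc = fn.add_input(acc, v)
        # one small accumulator per shard crosses the shuffle,
        # instead of every raw element
        lifted.append(acc)
    return fn.extract_output(fn.merge_accumulators(lifted))


result = combine_with_lifting(SumFn(), [[1, 2, 3], [4, 5], [6]])
print(result)  # 21
```

A stateful ParDo cannot be rewritten this way by the runner, which is one reason Combine admits more execution plans.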

Re: "Dynamic" read / expand of PCollection element by IOs

2019-03-14 Thread Eugene Kirpichov
I would phrase it more optimistically :) While there is no way to generically apply a PTransform elementwise, Beam does have a pattern / best practice for developing IO connectors that can be applied elementwise - it's called "readAll" and some IOs provide this, e.g. TextIO.readAll(), JdbcIO.readAl
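The "readAll" pattern described above can be sketched without Beam: instead of fixing the source at pipeline-construction time, a readAll-style transform consumes a collection of read specifications and flat-maps each element into records. In this dependency-free sketch the "filesystem" is faked with a dict and `read_all` is an illustrative name, standing in for transforms like `TextIO.readAll()` (Java) or `ReadAllFromText` (Python).

```python
# Fake "filesystem" standing in for real files; in Beam, a readAll-style
# transform performs this expansion against real storage.
FILES = {
    "gs://bucket/a.txt": ["a1", "a2"],
    "gs://bucket/b.txt": ["b1"],
}

def read_all(paths):
    """Elementwise expansion: each incoming path element
    is flat-mapped into the records it points at."""
    for path in paths:  # the input is itself a (P)collection
        for record in FILES[path]:
            yield record

records = list(read_all(["gs://bucket/a.txt", "gs://bucket/b.txt"]))
print(records)  # ['a1', 'a2', 'b1']
```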

Re: "Dynamic" read / expand of PCollection element by IOs

2019-03-14 Thread Lukasz Cwik
1) You're looking for SplittableDoFn[1]. It is still in development, and a conversion of all the IO connectors that exist today to be able to consume a PCollection of resources is yet to come. There are some limited use cases that exist already, like FileIO.match[2], and if these fit your usecase
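The FileIO.match idea mentioned above treats file patterns as data rather than configuration, so the set of files to read can itself be computed by the pipeline. A stdlib-only sketch of the match-then-read chain (`match_then_read` is an illustrative name; the glob over a temp directory stands in for Beam's filesystem matching):

```python
import glob
import os
import tempfile

# Create a few files so the pattern has something to match.
d = tempfile.mkdtemp()
for name in ("part-0.txt", "part-1.txt"):
    with open(os.path.join(d, name), "w") as f:
        f.write(name + "\n")

def match_then_read(patterns):
    """Mirror of a match -> read chain: expand each pattern
    element into file names, then into their contents."""
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                yield f.read().strip()

lines = list(match_then_read([os.path.join(d, "part-*.txt")]))
print(lines)  # ['part-0.txt', 'part-1.txt']
```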

"Dynamic" read / expand of PCollection element by IOs

2019-03-14 Thread Jozef Vilcek
Hello, I wanted to write Beam code which expands an incoming `PCollection<>`, element-wise, by use of existing IO components. An example could be to have a `PCollection` which holds arbitrary paths to data, which I want to load via `HadoopFormatIO.Read`, which is of `PTransform>`. Problem is, I

python streaming writing hourly avro files

2019-03-14 Thread Marc Matt
Hi, Getting messages from Pub/Sub and then saving them into hourly (or other interval) files on GCS does not work on Cloud Dataflow. The job only writes the files when I shut it down. Is this not yet supported by the Python SDK, or am I doing something wrong? Here is a snippet of my code: p = be
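A common cause of "files only appear at shutdown" in streaming is writing from the global window: records must first be assigned to fixed (e.g. hourly) windows so each window can close and emit its file, e.g. `beam.WindowInto(beam.window.FixedWindows(3600))` before the write. The window assignment itself can be sketched in plain Python (`hourly_window` and `bucket_by_hour` are illustrative names, not Beam API):

```python
HOUR = 3600

def hourly_window(timestamp):
    """Assign a unix timestamp to its hour-aligned window start,
    mimicking what FixedWindows(3600) does per element."""
    return timestamp - timestamp % HOUR

def bucket_by_hour(events):
    """Group (timestamp, payload) events into hourly buckets; each
    bucket corresponds to one output file of the streaming write."""
    buckets = {}
    for ts, payload in events:
        buckets.setdefault(hourly_window(ts), []).append(payload)
    return buckets

events = [(10, "a"), (3599, "b"), (3600, "c"), (7300, "d")]
buckets = bucket_by_hour(events)
print(sorted(buckets))  # [0, 3600, 7200]
```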

Re: java.lang.NoSuchFieldError: NULL_VALUE while reading parquet files

2019-03-14 Thread Ɓukasz Gajowy
Hi Jitendra, I'm not sure yet but I think this may be an Avro dependency problem. See [1] for a better explanation than I can provide (especially the "Update" in the accepted answer). [2] is the place that is throwing the exception. Is org.apache.avro.JsonProperties.NULL_VALUE present in your runt