Re: Set parallelism for each operator

2020-04-29 Thread Eleanore Jin
Thanks all for the information!

Eleanore

On Wed, Apr 29, 2020 at 6:36 PM Ankur Goenka wrote:
> Beam does support parallelism for the job which applies to all the transforms in the job when executing on Flink using the "--parallelism" flag.
>
> From the usecase you mentioned, Kafka read operations [...]

Re: Set parallelism for each operator

2020-04-29 Thread Ankur Goenka
Beam does support parallelism for the job, which applies to all the transforms in the job when executing on Flink, using the "--parallelism" flag. From the use case you mentioned, Kafka read operations will be over-parallelised, but it should be ok as they will only have a small memory impact [...]

Re: possible bug in AvroUtils

2020-04-29 Thread Luke Cwik
+dev +Brian Hulette +Reuven Lax

On Wed, Apr 29, 2020 at 4:21 AM Paolo Tomeo wrote:
> Hi all,
>
> I think the method AvroUtils.toBeamSchema has an unexpected side effect. I found out that, if you invoke it and then you run a pipeline of GenericRecords containing a timestamp (I tried with [...]

Re: Set parallelism for each operator

2020-04-29 Thread Luke Cwik
Beam doesn't expose such a thing directly, but the FlinkRunner may be able to take some pipeline options to configure this.

On Wed, Apr 29, 2020 at 5:51 PM Eleanore Jin wrote:
> Hi Kyle,
>
> I am using Flink Runner (v1.8.2)
>
> Thanks!
> Eleanore
>
> On Wed, Apr 29, 2020 at 10:33 AM Kyle Weaver [...]

Re: Set parallelism for each operator

2020-04-29 Thread Eleanore Jin
Hi Kyle,

I am using Flink Runner (v1.8.2)

Thanks!
Eleanore

On Wed, Apr 29, 2020 at 10:33 AM Kyle Weaver wrote:
> Which runner are you using?
>
> On Wed, Apr 29, 2020 at 1:32 PM Eleanore Jin wrote:
>> Hi all,
>>
>> I just wonder: can Beam allow setting parallelism for each operator (PTransform) [...]

Re: Set parallelism for each operator

2020-04-29 Thread Kyle Weaver
Which runner are you using?

On Wed, Apr 29, 2020 at 1:32 PM Eleanore Jin wrote:
> Hi all,
>
> I just wonder: can Beam allow setting parallelism for each operator (PTransform) separately? Flink provides such a feature.
>
> The use case I have is: the source is Kafka topics, which have fewer partitions [...]

Set parallelism for each operator

2020-04-29 Thread Eleanore Jin
Hi all,

I just wonder: can Beam allow setting parallelism for each operator (PTransform) separately? Flink provides such a feature.

The use case I have: the source is Kafka topics, which have fewer partitions, while we have a heavy PTransform and would like to scale it with more parallelism.

Thanks a lot!

Re: Beam + Flink + Docker - Write to host system

2020-04-29 Thread Kyle Weaver
> This seems to have worked, as the output file is created on the host system. However the pipeline silently fails, and the output file remains empty.

Have you checked the SDK container logs? They are most likely to contain relevant failure information.

> I don't know if this is a result of me re[...]

Re: Running Nexmark for Flink Streaming

2020-04-29 Thread Maximilian Michels
@Sruthi: You'll have to implement your own factory. If you include the mentioned dependency, e.g. if you use Gradle:

    runtimeOnly "org.apache.flink:flink-statebackend-rocksdb_2.11:$flink_version"

Then create the factory:

    options.setStateBackend(o -> new RocksDBStateBackend("file://..."));

Al[...]

Re: Beam pipeline parallel steps

2020-04-29 Thread Marco Mistroni
Many thanks

On Wed, Apr 29, 2020, 5:56 PM André Rocha Silva <a.si...@portaltelemedicina.com.br> wrote:
> Hey
>
> You simply use the output PCollection from one to as many pipes as you want. E.g.:
>
>     p = beam.Pipeline(options=pipeline_options)
>
>     data = (
>         p
>         | 'Get data' >> beam.io.ReadFromText[...]

Re: Beam pipeline parallel steps

2020-04-29 Thread André Rocha Silva
Hey

You simply use the output PCollection from one to as many pipes as you want. E.g.:

    p = beam.Pipeline(options=pipeline_options)

    data = (
        p
        | 'Get data' >> beam.io.ReadFromText(user_options.input_file)
    )

    output1 = (
        data
        | 'Transform 1' >> beam.ParDo(Transf1())
        | 'Write transform 1 results' >> be[...]

Re: Beam pipeline parallel steps

2020-04-29 Thread Alex Van Boxel
Yes, this is not a problem. We do it regularly in our streaming pipelines, as a single stream doesn't have enough load for running on Dataflow. So we run different streams in parallel in a single Beam pipeline. Data-wise the streams have nothing to do with each other, but transformation-wise the[...]

Beam pipeline parallel steps

2020-04-29 Thread Marco Mistroni
Hi all

Is it possible in Beam to create a pipeline where two tasks can run in parallel, as opposed to sequentially? A simple use case would be: step 3 generates some data out of which I produce, e.g., 3 completely different outcomes (e.g. 3 different files stored in a bucket).

Thanks
Marco

Beam + Flink + Docker - Write to host system

2020-04-29 Thread Robbe Sneyders
Hi all, We're working on a project where we're limited to one big development machine for now. We want to start developing data processing pipelines in Python, which should eventually be ported to a currently unknown setup on a separate cluster or cloud, so we went with Beam for its portability.

RE: Apache beam job on Flink checkpoint size growing over time

2020-04-29 Thread Stephen.Hesketh
Hi Max,

The ingestion is covering EOD processing from a Kafka source, so we get a lot of data from 5pm-8pm and outside of that time we get no data. The checkpoint is just storing the Kafka offset for restart. Sounds like during the period of no data there could be an open buffer. I would have [...]

possible bug in AvroUtils

2020-04-29 Thread Paolo Tomeo
Hi all,

I think the method AvroUtils.toBeamSchema has an unexpected side effect. I found out that, if you invoke it and then you run a pipeline of GenericRecords containing a timestamp (I tried with logical-type timestamp-millis), Beam converts such timestamp from long to org.joda.time.DateTime. [...]