Re: How windowing is implemented on Flink runner

2024-06-12 Thread Robert Bradshaw via user
Beam implements Windowing itself (via state and timers) rather than deferring to Flink's implementation. On Wed, Jun 12, 2024 at 11:55 AM Ruben Vargas wrote: > > Hello guys > > May be a silly question, > > But in the Flink runner, the window implementation uses the Flink > windowing? Does that me

Re: Paralalelism of a side input

2024-06-12 Thread Robert Bradshaw via user
uming resources in some way? I'm assuming may be is not > significant. That is correct, but the resources consumed by an idle operator should be negligible. > Thanks. > > El El vie, 7 de jun de 2024 a la(s) 3:56 p.m., Robert Bradshaw via user > escribió: >> >> Y

Re: Paralalelism of a side input

2024-06-07 Thread Robert Bradshaw via user
You can always limit the parallelism by assigning a single key to every element and then doing a grouping or reshuffle[1] on that key before processing the elements. Even if the operator parallelism for that step is technically, say, eight, your effective parallelism will be exactly one. [1] http

Re: Any recomendation for key for GroupIntoBatches

2024-04-15 Thread Robert Bradshaw via user
On Fri, Apr 12, 2024 at 1:39 PM Ruben Vargas wrote: > On Fri, Apr 12, 2024 at 2:17 PM Jaehyeon Kim wrote: > > > > Here is an example from a book that I'm reading now and it may be > applicable. > > > > JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100 > > PYTHON - ord(id[0]) % 100 > or abs(hash(

Re: Hot update in dataflow without lossing messages

2024-04-15 Thread Robert Bradshaw via user
Are you draining[1] your pipeline or simply canceling it and starting a new one? Draining should close open windows and attempt to flush all in-flight data before shutting down. For PubSub you may also need to read from subscriptions rather than topics to ensure messages are processed by either one

Re: Fails to run two multi-language pipelines locally?

2024-03-08 Thread Robert Bradshaw via user
it begins >> to create an error. My speculation is the containers don't recognise each >> other and get killed by the Flink task manager. I see containers are kept >> created and killed. >> >> Does every multi-language pipeline runs in a separate container? >>

Re: [Question] Python Streaming Pipeline Support

2024-03-08 Thread Robert Bradshaw via user
The Python Local Runner has limited support for streaming pipelines. For the time being would recommend using Dataflow or Flink (the latter can be run locally) to try out streaming pipelines. On Fri, Mar 8, 2024 at 2:11 PM Puertos tavares, Jose J (Canada) via user wrote: > > Hello Hu: > > > > No

Re: Fails to run two multi-language pipelines locally?

2024-03-06 Thread Robert Bradshaw via user
t the flink runner. A flink cluster > is started locally. > > On Thu, 7 Mar 2024 at 12:13, Robert Bradshaw via user > wrote: >> >> Streaming portable pipelines are not yet supported on the Python local >> runner. >> >> On Wed, Mar 6, 2024 at 5:03 PM Jaeh

Re: Fails to run two multi-language pipelines locally?

2024-03-06 Thread Robert Bradshaw via user
Streaming portable pipelines are not yet supported on the Python local runner. On Wed, Mar 6, 2024 at 5:03 PM Jaehyeon Kim wrote: > > Hello, > > I use the python SDK and my pipeline reads messages from Kafka and transforms > via SQL. I see two containers are created but it seems that they don't

Re: Roadmap of Calcite support on Beam SQL?

2024-03-04 Thread Robert Bradshaw via user
There is no longer a huge amount of active development going on here, but implementing a missing function seems like an easy contribution (lots of examples to follow). Otherwise, definitely worth filing a feature request as a useful signal for prioritization. On Mon, Mar 4, 2024 at 4:33 PM Jaehyeo

Re: ParDo(DoFn) with multiple context.output vs FlatMapElements

2024-01-26 Thread Robert Bradshaw via user
There is no difference; FlatMapElements is implemented in terms of a DoFn that invokes context.output multiple times. And, yes, Dataflow will fuse consecutive operations automatically. So if you have something like ... -> DoFnA -> DoFnB -> GBK -> DoFnC -> ... Dataflow will fuse DoFnA and DoFnB to

Re: Downloading and executing addition jar file when using Python API

2024-01-24 Thread Robert Bradshaw via user
On Wed, Jan 24, 2024 at 10:48 AM Mark Striebeck wrote: > > If point beam to the local jar, will beam start and also stop the expansion > service? Yes it will. > Thanks > Mark > > On Wed, 24 Jan 2024 at 08:30, Robert Bradshaw via user > wrote: >> >&g

Re: Downloading and executing addition jar file when using Python API

2024-01-24 Thread Robert Bradshaw via user
You can also manually designate a replacement jar to be used rather than fetching the jar from maven, either as a pipeline option or (as of the next release) as an environment variable. The format is a json mapping from gradle targets (which is how we identify these jars) to local files (or urls).

Re: TypeError: '_ConcatSequence' object is not subscriptable

2024-01-22 Thread Robert Bradshaw via user
This is probably because you're trying to index into the result of the GroupByKey in your AnalyzeSession as if it were a list. All that is promised is that it is an iterable. If it is large enough to merit splitting over multiple fetches, it won't be a list. (If you need to index, explicitly conver

Re: Does withkeys transform enforce a reshuffle?

2024-01-19 Thread Robert Bradshaw via user
Reshuffle is perfectly fine to use if the goal is just to redistribute work. It's only deprecated as a "checkpointing" mechanism. On Fri, Jan 19, 2024 at 9:44 AM Danny McCormick via user wrote: > > For runners that support Reshuffle, it should be safe to use. Its been > "deprecated" for 7 years,

Re: How to debug ArtifactStagingService ?

2024-01-05 Thread Robert Bradshaw via user
Nothing problematic is standing out for me in those logs. A job service and artifact staging service is spun up to allow the job (and its artifacts) to be submitted, then they are shut down. What are the actual errors that you are seeing? On Wed, Jan 3, 2024 at 7:39 AM Lydian wrote: > > > Hi, > >

Re: Dataflow not able to find a module specified using extra_package

2023-12-19 Thread Robert Bradshaw via user
And should it be a list of strings, rather than a string? On Tue, Dec 19, 2023 at 10:10 AM Anand Inguva via user wrote: > Can you try passing `extra_packages` instead of `extra_package` when > passing pipeline options as a dict? > > On Tue, Dec 19, 2023 at 12:26 PM Sumit Desai via user < > user@

Re: Streaming management exception in the sink target.

2023-12-05 Thread Robert Bradshaw via user
Currently error handling is implemented on sinks in an ad-hoc basis (if at all) but John (cc'd) is looking at improving things here. On Mon, Dec 4, 2023 at 10:25 AM Juan Romero wrote: > > Hi guys. I want to ask you about how to deal with the scenario when the > target sink (eg: jdbc, kafka, bigq

Re: [QUESTION] Why no auto labels?

2023-10-20 Thread Robert Bradshaw via user
/org/apache/beam/sdk/Pipeline.java#L630 > > On Fri, Oct 13, 2023 at 1:32 PM Joey Tran wrote: >> >> >> >> On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw wrote: >>> >>> On Fri, Oct 13, 2023 at 10:08 AM Joey Tran >>> wrote: >>>&

Re: Advanced Composite Transform Documentation

2023-10-19 Thread Robert Bradshaw via user
On Thu, Oct 19, 2023 at 2:00 PM Joey Tran wrote: > > For the python SDK, is there somewhere where we document more "advance" > composite transform operations? I'm not sure, but https://beam.apache.org/documentation/programming-guide/ is the canonical palace information like this should probaby b

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Robert Bradshaw via user
something we try to avoid though. Another option is to make the suffix a uuid rather than a single counter. (This would still have issues with the first application possibly getting mixed up with a "different" first application unless it was always appended.) > On Fri, Oct 13, 202

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Robert Bradshaw via user
nflight messages). At least with the old, intersecting names we can detect this problem rather than silently give corrupt data. On Fri, Oct 13, 2023 at 7:15 AM Joey Tran wrote: > For posterity: https://github.com/apache/beam/pull/28984 > > On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw >

Re: [QUESTION] Why no auto labels?

2023-10-10 Thread Robert Bradshaw via user
va? I imagine that if it's a toggle it'd be >> a very sticky toggle since it'd be easy for PTransforms to accidentally >> rely on it. >> >> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw >> wrote: >> >>> Huh. This used to be a hard err

Re: [QUESTION] Why no auto labels?

2023-10-05 Thread Robert Bradshaw via user
] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW > > On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user < > user@beam.apache.org> wrote: > >> BeamJava and BeamPython have the exact same behavior: transform names >> within must be distinct [1]. This

Re: [QUESTION] Why no auto labels?

2023-10-04 Thread Robert Bradshaw via user
BeamJava and BeamPython have the exact same behavior: transform names within must be distinct [1]. This is because we do not necessarily know at pipeline construction time if the pipeline will be streaming or batch, or if it will be updated in the future, so the decision was made to impose this res

Re: UDF/UADF over complex structures

2023-09-28 Thread Robert Bradshaw via user
Yes, for sure. This is one of the areas Beam excels vs. more simple tools like SQL. You can write arbitrary code to iterate over arbitrary structures in the typical Java/Python/Go/Typescript/Scala/[pick your language] way. In the Beam nomenclature. UDFs correspond to DoFns and UDAFs correspond to C

Re: [Question] Side Input pattern

2023-09-15 Thread Robert Bradshaw via user
> > > Is that assumption correct? > > > > El El vie, 15 de septiembre de 2023 a la(s) 10:59, Robert Bradshaw via > user escribió: > >> Beam will block on side inputs until at least one value is available (or >> the watermark has advanced such that we can be sur

Re: [Question] Side Input pattern

2023-09-15 Thread Robert Bradshaw via user
Beam will block on side inputs until at least one value is available (or the watermark has advanced such that we can be sure one will never become available, which doesn't really apply to the global window case). After that, workers generally cache the side input value (for performance reasons) but

Re: "Decorator" pattern for PTramsforms

2023-09-15 Thread Robert Bradshaw via user
rDo, extract it's >> dofn, wrap it, and return a new ParDo >> >> On Fri, Sep 15, 2023, 11:53 AM Robert Bradshaw via user < >> user@beam.apache.org> wrote: >> >>> +1 to looking at composite transforms. You could even have a composite >>> trans

Re: "Decorator" pattern for PTramsforms

2023-09-15 Thread Robert Bradshaw via user
+1 to looking at composite transforms. You could even have a composite transform that takes another transform as one of its construction arguments and whose expand method does pre- and post-processing to the inputs/outputs before/after applying the transform in question. (You could even implement t

Re: Options for visualizing the pipeline DAG

2023-09-01 Thread Robert Bradshaw via user
(As an aside, I think all of these options would make for a great blog post if anyone is interested in authoring one of those...) On Fri, Sep 1, 2023 at 9:26 AM Robert Bradshaw wrote: > You can also use Python's RenderRunner, e.g. > > python -m apache_beam.examples.wordcount --

Re: Options for visualizing the pipeline DAG

2023-09-01 Thread Robert Bradshaw via user
You can also use Python's RenderRunner, e.g. python -m apache_beam.examples.wordcount --output out.txt \ --runner=apache_beam.runners.render.RenderRunner \ --render_output=pipeline.svg This also has an interactive mode, triggered by passing --port=N (where 0 can be used to pick an unuse

Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Robert Bradshaw via user
On Thu, Aug 24, 2023 at 12:58 PM Chamikara Jayalath wrote: > > > On Thu, Aug 24, 2023 at 12:27 PM Robert Bradshaw > wrote: > >> I would like to figure out a way to get the stream-y interface to work, >> as I think it's more natural overall. >> >>

Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Robert Bradshaw via user
change > to make and maybe even less typing for the user. I was originally thinking > side inputs and metrics would happen outside the loop, but I think you want > a class and not a closure at that point for sanity. > > On Thu, Aug 24, 2023 at 12:02 PM Robert Bradshaw > wrote: &

Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Robert Bradshaw via user
but not >> distributed in the same was as, say, Beam SQL... but it would allow for SQL >> statements on individual files with projection pushdown supported for >> things like Parquet which could have some cool and performant data lake >> applications. I'll probably do a couple of the si

Re: [Request for Feedback] Swift SDK Prototype

2023-08-23 Thread Robert Bradshaw via user
Neat. Nothing like writing and SDK to actually understand how the FnAPI works :). I like the use of groupBy. I have to admit I'm a bit mystified by the syntax for parDo (I don't know swift at all which is probably tripping me up). The addition of external (cross-language) transforms could let you

Re: Getting Started With Implementing a Runner

2023-07-24 Thread Robert Bradshaw via user
interest. On Fri, Jul 21, 2023 at 7:25 AM Joey Tran wrote: > > Could you let me know when you update it? I would be interested in rereading > after the rewrite. > > Thanks! > Joey > > On Fri, Jul 14, 2023 at 4:38 PM Robert Bradshaw wrote: >> >> I'm taki

Re: Growing checkpoint size with Python SDF for reading from Redis streams

2023-07-20 Thread Robert Bradshaw via user
Your SDF looks fine. I wonder if there is an issue with how Flink is implementing SDFs (e.g. not garbage collecting previous remainders). On Tue, Jul 18, 2023 at 5:43 PM Nimalan Mahendran wrote: > > Hello, > > I am running a pipeline built in the Python SDK that reads from a Redis > stream via a

Re: Getting Started With Implementing a Runner

2023-07-14 Thread Robert Bradshaw via user
he-runner-api-protos> docs > page which implied to me that they'd be safe to use. I'll check out the > bundle_processor. Thanks! > > On Mon, Jul 10, 2023 at 1:07 PM Robert Bradshaw > wrote: > >> On Sun, Jul 9, 2023 at 9:22 AM Joey Tran >> wrote: >

Re: Pandas 2 Timeline Estimate

2023-07-12 Thread Robert Bradshaw via user
Contributions welcome! I don't think we're at the point we can stop supporting Pandas 1.x though, so we'd have to do it in such a way as to support both. On Wed, Jul 12, 2023 at 4:53 PM XQ Hu via user wrote: > https://github.com/apache/beam/issues/27221#issuecomment-1603626880 > > This tracks th

Re: Getting Started With Implementing a Runner

2023-07-10 Thread Robert Bradshaw via user
Also portability gives you more flexibility when it >> comes to choosing an SDK to define the pipeline and will allow you to >> execute transforms in any SDK via cross-language. >> >> Thanks, >> Cham >> >> On Fri, Jun 23, 2023 at 1:57 PM Robert Bradshaw via

Re: Getting Started With Implementing a Runner

2023-06-23 Thread Robert Bradshaw via user
ical "SDK Worker" (which in turn invokes the actual DoFns). This latter may be inlined (e.g. if it's 100% Python on both sides). See, for example, https://github.com/apache/beam/blob/v2.48.0/sdks/python/apache_beam/runners/portability/fn_api_runner/worker_handlers.py#L350 > On Fr

Re: Getting Started With Implementing a Runner

2023-06-23 Thread Robert Bradshaw via user
ur compute infrastructure. (If you're not doing streaming, this is much more straightforward than all the bundler scheduler stuff that currently exists in that code). > > > > > > On Fri, Jun 23, 2023 at 12:17 PM Alexey Romanenko < > aromanenko@gmail.com> wrote: > &

Re: Getting Started With Implementing a Runner

2023-06-23 Thread Robert Bradshaw via user
On Fri, Jun 23, 2023, 7:37 AM Alexey Romanenko wrote: > If Beam Runner Authoring Guide is rather high-level for you, then, at > fist, I’d suggest to answer two questions for yourself: > - Am I going to implement a portable runner or native one? > The answer to this should be portable, as non-por

Re: [Dataflow][Stateful] Bypass Dataflow Overrides?

2023-05-25 Thread Robert Bradshaw via user
> [1] https://gist.github.com/egalpin/162a04b896dc7be1d0899acf17e676b3 > > On Thu, May 25, 2023 at 2:25 PM Robert Bradshaw via user < > user@beam.apache.org> wrote: > >> The GbkBeforeStatefulParDo is an implementation detail used to send all >> elements with the same key to the same

Re: [Dataflow][Stateful] Bypass Dataflow Overrides?

2023-05-25 Thread Robert Bradshaw via user
The GbkBeforeStatefulParDo is an implementation detail used to send all elements with the same key to the same worker (so that they can share state, which is itself partitioned by worker). This does cause a global barrier in batch pipelines. On Thu, May 25, 2023 at 2:15 PM Evan Galpin wrote: > H

Re: [EXTERNAL] Re: Vulnerabilities in Transitive dependencies

2023-05-02 Thread Robert Bradshaw via user
Generally these types of vulnerabilities are only exploitable when processing untrusted data and/or exposing a public service to the internet. This is not the typical use of Beam (especially the latter), but that's not to say Beam can't be used in this way. That being said, it's preferable to simpl

Re: How Beam Pipeline Handle late events

2023-04-24 Thread Robert Bradshaw via user
On Fri, Apr 21, 2023 at 3:37 AM Pavel Solomin wrote: > > Thank you for the information. > > I'm assuming you had a unique ID in records, and you observed some IDs > missing in Beam output comparing with Spark, and not just some duplicates > produced by Spark. > > If so, I would suggest to create

Re: [Question] - Time series - cumulative sum in right order with python api in a batch process

2023-04-24 Thread Robert Bradshaw via user
You are correct in that the data may arrive in an unordered way. However, once a window finishes, you are guaranteed to have seen all the data up to that point (modulo late data) and can then confidently compute your ordered cumulative sum. You could do something like this: def cumulative_sums(ke

Re: Avoid using docker when I use a external transformation

2023-04-18 Thread Robert Bradshaw via user
Docker is not necessary to expand the transform (indeed, by default it should just pull the Jar and invokes that directly to start the expansion service), but it is used as the environment in which to execute the expanded transform. It would be in theory possible to run the worker without docker a

Re: Is there any way to set the parallelism of operators like group by, join?

2023-04-18 Thread Robert Bradshaw via user
;>>> On Mon, Apr 17, 2023 at 8:08 AM Reuven Lax wrote: >>>>> >>>>>> Are you running on the Dataflow runner? If so, Dataflow - unlike >>>>>> Spark and Flink - dynamically modifies the parallelism as the operator >>>>>> runs, so there

Re: Is there any way to set the parallelism of operators like group by, join?

2023-04-15 Thread Robert Bradshaw via user
What are you trying to achieve by setting the parallelism? On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang wrote: > Thanks Reuven, what I mean is to set the parallelism in operator level. > And the input size of the operator is unknown at compiling stage if it is > not a source > operator, > > Here'

Re: Message guarantees

2023-04-14 Thread Robert Bradshaw via user
That is correct. On Tue, Apr 11, 2023 at 5:44 AM Hans Hartmann wrote: > > Hello, > > i'm wondering if Apache Beam is using the message guarantees of the > execution engines, that the pipeline is running on. > > So if i use the SparkRunner the consistency guarantees are exactly-once? > > Have a g

Re: Why is FlatMap different from composing Flatten and Map?

2023-03-15 Thread Robert Bradshaw via user
On Mon, Mar 13, 2023 at 11:33 AM Godefroy Clair wrote: > Hi, > I am wondering about the way `Flatten()` and `FlatMap()` are implemented > in Apache Beam Python. > In most functional languages, FlatMap() is the same as composing > `Flatten()` and `Map()` as indicated by the name, so Flatten() and

Re: Deduplicate usage

2023-03-02 Thread Robert Bradshaw via user
Whenever state is used, the runner will arrange such that the same keys will all go to the same worker, which often involves injecting a shuffle-like operation if the keys are spread out among many workers in the input. (An alternative implementation could involve storing the state in a distributed

Re: OpenJDK8 / OpenJDK11 container deprecation

2023-02-07 Thread Robert Bradshaw via user
Seams reasonable to me. On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user wrote: > > As per [1], the JDK8 and JDK11 containers that Apache Beam uses have stopped > being built and supported since July 2022. I have filed [2] to track the > resolution of this issue. > > Based upon [1], almost eve

Re: How to submit beam python pipeline to GKE flink cluster

2023-02-03 Thread Robert Bradshaw via user
You should be able to omit the environment_type and environment_config variables and they will be populated automatically. For running locally, the flink_master parameter is not needed either (one will be started up automatically). On Fri, Feb 3, 2023 at 12:51 PM Talat Uyarer via user wrote: > >

Re: Dataflow and mounting large data sets

2023-01-30 Thread Robert Bradshaw via user
I'm also not sure it's part of the contract that the containerization technology we use will always have these capabilities. On Mon, Jan 30, 2023 at 10:53 AM Chad Dombrova wrote: > > Hi Valentyn, > >> >> Beam SDK docker containers on Dataflow VMs are currently launched in >> privileged mode. > >

Re: Dataflow and mounting large data sets

2023-01-30 Thread Robert Bradshaw via user
Different idea: is it possible to serve this data via another protocol (e.g. sftp) rather than requiring a mount? On Mon, Jan 30, 2023 at 9:26 AM Chad Dombrova wrote: > > Hi Robert, > I know very little about the FileSystem classes, but I don’t think it’s > possible for a process running in dock

Re: Dataflow and mounting large data sets

2023-01-30 Thread Robert Bradshaw via user
If it's your input/output data, presumably you could implement a https://beam.apache.org/releases/javadoc/2.3.0/org/apache/beam/sdk/io/FileSystem.html for nfs. (I don't know what all that would entail...) On Mon, Jan 30, 2023 at 9:04 AM Chad Dombrova wrote: > > Hi Israel, > Thanks for responding.

Re: [Python] Heterogeneous TaggedOutput Type Hints

2021-09-09 Thread Robert Bradshaw
bc > > It seemed to run fine both on DirectRunner and PortableRunner (embed mode), > but Dataflow v2 runner raised an error at runtime seemingly associated with > the Shuffle service? I have job IDs and trace links if those are helpful as > well. > > Thanks, > Eva

Re: GroupIntoShards not sending bytes further when dealing with huge amount of data

2021-06-14 Thread Robert Bradshaw
Awesome, thanks! On Mon, Jun 14, 2021 at 5:36 PM Evan Galpin wrote: > > I’ll try to create something as small as possible from the pipeline I > mentioned 👍 I should have time this week to do so. > > Thanks, > Evan > > On Mon, Jun 14, 2021 at 18:09 Robert Bradshaw wro

Re: GroupIntoShards not sending bytes further when dealing with huge amount of data

2021-06-14 Thread Robert Bradshaw
ink > > It doesn't seem to matter if there are 0 messages in a subscription or 50k > messages at startup. The rate of new messages however is very low. Not sure > if those are helpful details, let me know if there's anything else specific > which would help. > > On M

Re: GroupIntoShards not sending bytes further when dealing with huge amount of data

2021-06-14 Thread Robert Bradshaw
+1, we'd really like to get to the bottom of this, so clear instructions on a pipeline/conditions that can reproduce it would be great. On Mon, Jun 14, 2021 at 7:34 AM Jan Lukavský wrote: > > Hi Eddy, > > you are probably hitting a not-yet discovered bug in SDF implementation in > FlinkRunner th

Re: Issues running Kafka streaming pipeline in Python

2021-06-04 Thread Robert Bradshaw
Glad you were able to figure it out. Maybe it's moot with runner v2 becoming the default, but we really should give a clearer error in this case. On Wed, Jun 2, 2021 at 8:16 PM Chamikara Jayalath wrote: > > Great :) > > On Wed, Jun 2, 2021 at 8:15 PM Alex Koay wrote: >> >> Finally figured out t

Re: Is there a way (seetings) to limit the number of element per worker machine

2021-06-02 Thread Robert Bradshaw
On Wed, Jun 2, 2021 at 11:18 AM Vincent Marquez wrote: > > On Wed, Jun 2, 2021 at 11:11 AM Robert Bradshaw wrote: >> >> If you want to control the total number of elements being processed >> across all workers at a time, you can do this by assigning random keys >

Re: Is there a way (seetings) to limit the number of element per worker machine

2021-06-02 Thread Robert Bradshaw
pipelines (google > does not allow more than 25 dataflow pipelines per region) with 10 elements > each, I am launching the next 20 pipelines. > > This is ofcourse missing the benefit of serverless. > > Any idea, how to work around this? > > Best, > Eila > > >

Re: Is there a way (seetings) to limit the number of element per worker machine

2021-05-17 Thread Robert Bradshaw
Note that workers generally process one element per thread at a time. The number of threads defaults to the number of cores of the VM that you're using. On Mon, May 17, 2021 at 10:18 AM Brian Hulette wrote: > What type of files are you reading? If they can be split and read by > multiple workers

Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-13 Thread Robert Bradshaw
y1" that are sorted by "key2". The >>> downstreaming process, for example, will make a rolling window with size N >>> that reads N records together at one time. But note, the rolling window >>> will not cross different "key1". >>> >>

Re: Question on printing out a PCollection

2021-04-30 Thread Robert Bradshaw
Sorry, no Java versions of this stuff (though it may be possible to use cross-language to invoke your Java pipeline from Python and get the benefits that way). On Fri, Apr 30, 2021 at 11:30 AM Tao Li wrote: > > Thanks @Ning Kang. > > @Robert Bradshaw I assume you are referring

Re: Question on printing out a PCollection

2021-04-30 Thread Robert Bradshaw
You can also use interactive Beam's collect, to get the PCollection as a Dataframe, and then print it or do whatever else with it as you like. On Fri, Apr 30, 2021 at 10:24 AM Ning Kang wrote: > > Hi Tao, > > The `show()` API works with any IPython notebook runtimes, including Colab, > Jupyter L

Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-04-20 Thread Robert Bradshaw
u, Robert and Brian. >>>>> >>>>> I'd like to try this out. I am trying to distribute my dataset to >>>>> nodes, sort each partition by some key and then store each partition to >>>>> its >>>>> own file. >>>

Re: Beam Dataframe - sort and grouping

2021-04-02 Thread Robert Bradshaw
Thanks for trying this out. Better support for groupby (e.g. https://github.com/apache/beam/pull/13843 , https://github.com/apache/beam/pull/13637) will be available in the next Beam release (2.29, in progress, but you could try out head if you want). Note, however, that Beam PCollections are by d

Re: [Question] Need to write a pipeline in Go consuming events from Kafka

2021-03-29 Thread Robert Bradshaw
On Wed, Mar 24, 2021 at 4:24 AM Đức Trần Tiến wrote: > > And the last question: Could I write that pipeline in Java and invoke that > pipeline from Go? :D > That is exactly the story we're trying to pursue for getting the large set of Java connectors available to Go: https://cloud.google.com/bl

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Robert Bradshaw
>> Am I reading this wrong? >> >> Kenn >> >> On Wed, Mar 24, 2021 at 4:35 PM Alex Amato wrote: >> >>> How about a PCollection containing every element which was successfully >>> written? >>> Basically the same things which were passed into it

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Robert Bradshaw
nd just add new > funtionality. Though, we need to follow the same pattern for user API and > maybe even naming for this feature across different IOs (like we have for > "readAll()” methods). > > > > I agree that we have to avoid returning PDone for such cases. > >

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Robert Bradshaw
Returning PDone is an anti-pattern that should be avoided, but changing it now would be backwards incompatible. PRs to add non-PDone returning variants (probably as another option to the builders) that compose well with Wait, etc. would be welcome. On Wed, Mar 24, 2021 at 11:14 AM Alexey Romanenko

Re: [DISCUSS] Drop support for Flink 1.8 and 1.9

2021-03-12 Thread Robert Bradshaw
Do we now support 1.8 through 1.12? Unless there are specific objections, makes sense to me. On Fri, Mar 12, 2021 at 8:29 AM Alexey Romanenko wrote: > +1 too but are there any potential objections for this? > > On 12 Mar 2021, at 11:21, David Morávek wrote: > > +1 > > D. > > On Thu, Mar 11, 20

Re: Overwrite support from ParquetIO

2021-01-27 Thread Robert Bradshaw
Fortunately making deleting files idempotent is much easier than writing them :). But one needs to handle the case of concurrent execution as well as sequential re-execution due to possible zombie workers. On Wed, Jan 27, 2021 at 5:04 PM Reuven Lax wrote: > Keep in mind thatt DoFns might be reex

Re: Is there an array explode function/transform?

2021-01-14 Thread Robert Bradshaw
t of all the different array > elements? > > On Thu, Jan 14, 2021 at 11:25 AM Robert Bradshaw > wrote: > >> I think it makes sense to allow specifying more than one, if desired. >> This is equivalent to just stacking multiple Unnests. (Possibly one could >> even hav

Re: Is there an array explode function/transform?

2021-01-14 Thread Robert Bradshaw
ly could be a top-level transform. Should it automatically >>>>> unnest all arrays, or just the fields specified? >>>>> >>>>> We do have to define the semantics for nested arrays as well. >>>>> >>>>> On Wed, Jan 13, 2021 at 1

Re: Is there an array explode function/transform?

2021-01-13 Thread Robert Bradshaw
Ah, thanks for the clarification. UNNEST does sound like what you want here, and would likely make sense as a top-level relational transform as well as being supported by SQL. On Wed, Jan 13, 2021 at 10:53 AM Tao Li wrote: > @Kyle Weaver sure thing! So the input/output > definition for the Flat

Re: Help measuring upcoming performance increase in flink runner on production systems

2020-12-21 Thread Robert Bradshaw
I agree. Borrowing the mutation detection from the direct runner as an intermediate point sounds like a good idea. On Mon, Dec 21, 2020 at 8:57 AM Kenneth Knowles wrote: > I really think we should make a plan to make this the default. If you test > with the DirectRunner it will do mutation check

Re: is apache beam go sdk supported by spark runner?

2020-11-25 Thread Robert Bradshaw
Yes, it should be for batch (just like for Python). There is ongoing work to make it work for Streaming as well. On Sat, Nov 21, 2020 at 2:57 PM Meriem Sara wrote: > > Hello everyone. I am trying to use apache beam with Golang to execute a data > processing workflow using apache Spark. However,

Re: Support for Flink 1.11

2020-10-16 Thread Robert Bradshaw
Support for Flink 1.11 is https://issues.apache.org/jira/browse/BEAM-10612 . It has been implemented and will be included in the next release (Beam 2.25). In the meantime, you could try building yourself from head. On Fri, Oct 16, 2020 at 4:39 AM Kishor Joshi wrote: > > Hi Team, > > Since the bea

Re: Upload third party runtime dependencies for expanding transform like KafkaIO.Read in Python Portable Runner

2020-10-02 Thread Robert Bradshaw
:177) > at > org.apache.beam.runners.fnexecution.control.FnApiControlClient$ResponseStreamObserver.onNext(FnApiControlClient.java:157) > > On Fri, Oct 2, 2020 at 5:14 PM Robert Bradshaw wrote: >> >> Could you clarify a bit exactly what you're trying to

Re: [DISCUSS] Deprecation of AWS SDK v2 IO connectors

2020-09-15 Thread Robert Bradshaw
> should > deprecate a v1 IO ONLY when we have full feature parity in the v2 version. > I think we don't have a replacement for AWSv1 S3 IO so that one should not > be > deprecated. > > On Tue, Sep 15, 2020 at 6:07 PM Robert Bradshaw > wrote: > > > > The

Re: Info needed - pmc mailing list

2020-08-25 Thread Robert Bradshaw
Try priv...@beam.apache.org. On Tue, Aug 25, 2020 at 6:18 AM D, Anup (Nokia - IN/Bangalore) wrote: > > Hi, > > > > We would like to know if there is a way to reach out to members of the pmc > group. > > We tried sending email to p...@beam.apache.org but it got bounced. > > > > Thanks > > Anup

Re: Out-of-orderness of window results when testing stateful operators with TextIO

2020-08-24 Thread Robert Bradshaw
As for the question of writing tests in the face of non-determinism, you should look into TestStream. MyStatefulDoFn still needs to be updated to not assume an ordering. (This can be done by setting timers that provide guarantees that (modulo late data) one has seen all data up to a certain timesta

Re: Resource Consumption increase With TupleTag

2020-08-19 Thread Robert Bradshaw
Is this 2kps coming out of Filter1 + 2kps coming out of Filter2 (which would be 4kps total), or only 2kps coming out of KafkaIO and MessageExtractor? Though it /shouldn't/ matter, due to sibling fusion, there's a chance things are getting fused poorly and you could write Filter1 and Filter2 instea

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-17 Thread Robert Bradshaw
I checked Java, it looks like the way things are structured we do not have that bug there. On Mon, Aug 17, 2020 at 3:31 PM Robert Bradshaw wrote: > > +1 > > Thanks, Eugene, for finding and fixing this! > > FWIW, most use of Python from the Python Portable Runner used the >

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-17 Thread Robert Bradshaw
+1 Thanks, Eugene, for finding and fixing this! FWIW, most use of Python from the Python Portable Runner used the embedded environment (this is the default direct runner), so dependencies are already present. On Mon, Aug 17, 2020 at 3:19 PM Daniel Oliveira wrote: > > Normally I'd say not to che

Re: Scio 0.9.3 released

2020-08-05 Thread Robert Bradshaw
Thanks for the update! On Wed, Aug 5, 2020 at 11:46 AM Neville Li wrote: > > Hi all, > > We just released Scio 0.9.3. This bumps Beam SDK to 2.23.0 and includes a lot > of improvements & bug fixes. > > Cheers, > Neville > > https://github.com/spotify/scio/releases/tag/v0.9.3 > > "Petrificus Tota

Re: ReadFromKafka returns error - RuntimeError: cannot encode a null byte[]

2020-07-22 Thread Robert Bradshaw
On Sat, Jul 18, 2020 at 12:08 PM Chamikara Jayalath wrote: > > > On Fri, Jul 17, 2020 at 10:04 PM ayush sharma <1705ay...@gmail.com> wrote: > >> Thank you guys for the reply. I am really stuck and could not proceed >> further. >> Yes, the previous trial published message had null key. >> But when

Re: Testing Apache Beam pipelines / python SDK

2020-07-21 Thread Robert Bradshaw
ditionally, what is the best way to test writing to BigQuery? > I have seen this file > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/bigtableio_it_test.py > but it appears it writes to real big query? > > kind regards > Marco > > &g

Re: Testing Apache Beam pipelines / python SDK

2020-07-17 Thread Robert Bradshaw
test_pipeline.get_full_options_as_args(**extra_opts)) > > print(result) > > Basically, i would expect a PCollection as result of the pipeline, and i > would be testing the content of the PCollection > > Running this results in this messsage > > IT is skipped b

Re: Testing Apache Beam pipelines / python SDK

2020-07-13 Thread Robert Bradshaw
You can use apache_beam.testing.util.assert_that to write tests of Beam pipelines. This is what Beam uses for its tests, e.g. https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/util_test.py#L80 On Mon, Jul 13, 2020 at 2:36 PM Sofia’s World wrote: > > Hi all > i was wo

Re: Not able to see WordCount output in docker /tmp/...

2020-07-07 Thread Robert Bradshaw
Does it work when you write to a distributed filesystem? (One issue with Docker is that the manager and each of their workers have their own local filesystem.) On Tue, Jul 7, 2020 at 2:17 PM Avijit Saha wrote: > > While trying to run the Beam WordCount example on Flink runner using Job > Manage

Re: Understanding combiner's distribution logic

2020-07-01 Thread Robert Bradshaw
On Tue, Jun 30, 2020 at 3:32 PM Julien Phalip wrote: > Thanks Luke! > > One part I'm still a bit unclear about is how exactly the PreCombine stage > works. In particular, I'm wondering how it can perform the combination > before the GBK. Is it because it can already compute the combination on > a

Re: PaneInfo showing UNKOWN State

2020-05-26 Thread Robert Bradshaw
To clarify, PaneInfo is supported on the FnAPI local runner, but not on the bundle based one. Unfortunately, Streaming is not supported on the FnAPI one (yet), but work there is ongoing. On Tue, May 26, 2020 at 11:46 AM Pablo Estrada wrote: > Hi Jayadeep, > Unfortunately, it seems that PaneInfo

  1   2   3   >