Re: AvroIO.to(DynamicAvroDestinations) deprecated?

2022-09-13 Thread Steve Niemitz
make these features available. > > I hope this helps, > John > > On Mon, Sep 12, 2022 at 3:38 PM Steve Niemitz wrote: > >> We're trying to do some semi-advanced custom logic (custom writers and >> schemas per destination) with AvroIO, and want to use >> DynamicAvroDes

AvroIO.to(DynamicAvroDestinations) deprecated?

2022-09-12 Thread Steve Niemitz
We're trying to do some semi-advanced custom logic (custom writers and schemas per destination) with AvroIO, and want to use DynamicAvroDestinations to accomplish this. However, AvroIO.to(DynamicAvroDestinations) is deprecated, but there doesn't seem to be any other way to accomplish what we want

Re: Breaking change for FileIO WriteDynamic in Beam 2.34?

2022-04-06 Thread Steve Niemitz
Without the full logs it's hard to say, but I've definitely seen that error in the past when the worker disks are full. ApplianceShuffleWriter needs to extract a native library to a temp location, and if the disk is full that'll fail, resulting in the NoClassDefFoundError. On Wed, Apr 6, 2022 at

Re: "Slowly updating global window side inputs" example buggy?

2022-02-22 Thread Steve Niemitz
t; Pavel Solomin >> >> Tel: +351 962 950 692 <+351%20962%20950%20692> | Skype: pavel_solomin | >> Linkedin <https://www.linkedin.com/in/pavelsolomin> >> >> >> >> >> >> On Tue, 22 Feb 2022 at 18:46, Steve Niemitz wrote: >>

"Slowly updating global window side inputs" example buggy?

2022-02-22 Thread Steve Niemitz
We had a team try to use the "slowly updating global window side inputs" pattern (on dataflow) to update some metadata in their pipeline every minute, but surprisingly ran into errors that the side input PCollection contained more than one element, [1] although this only manifested intermittently.

Testing a jvm pipeline on the portability framework locally

2022-01-18 Thread Steve Niemitz
If I have a (jvm) pipeline, is there a simple way (ie DirectRunner) to run it locally but using the portability framework? I'm running into a lot of weird bugs running it on dataflow (v2) and want to be able to run it locally for a faster debug loop.

Re: Some questions about external tables in BeamSQL

2022-01-13 Thread Steve Niemitz
Thanks for the quick responses! Mine are inline as well. On Thu, Jan 13, 2022 at 9:01 PM Brian Hulette wrote: > I added some responses inline. Also adding dev@ since this is getting > into SQL internals. > > On Thu, Jan 13, 2022 at 10:29 AM Steve Niemitz > wrote: > >>

Some questions about external tables in BeamSQL

2022-01-13 Thread Steve Niemitz
I've been playing around with CREATE EXTERNAL TABLE (using a custom TableProvider as well) w/ BeamSQL and really love it. I have a few questions though that I've accumulated as I've been using it I wanted to ask. - I'm a little confused about the need to define columns in the CREATE EXTERNAL

Re: Using CREATE EXTERNAL TABLE in a pipeline?

2022-01-12 Thread Steve Niemitz
via > SqlTransform(..).withDdlString(). There are some usage examples here: > https://beam.apache.org/releases/javadoc/2.35.0/org/apache/beam/sdk/extensions/sql/SqlTransform.html > > Brian > > On Wed, Jan 12, 2022 at 1:54 PM Steve Niemitz > wrote: > >> Is it possible

Using CREATE EXTERNAL TABLE in a pipeline?

2022-01-12 Thread Steve Niemitz
Is it possible to use CREATE EXTERNAL TABLE + a select statement in a SqlTransform to act as a source for a pipeline? The functionality seems really useful, but it seems like it only works from the JDBC context? Am I missing anything?

Re: ImportError: __import__ not found on python job

2021-12-08 Thread Steve Niemitz
hat was going on? > > [1] https://github.com/pybind/pybind11/issues/2557 > [2] https://github.com/apache/beam/pull/5071 > > On Wed, Dec 8, 2021 at 5:30 AM Steve Niemitz wrote: > >> Yeah, I can't imagine this is a "normal" problem. >> >> I'm on linux w/

Re: ImportError: __import__ not found on python job

2021-12-08 Thread Steve Niemitz
; > What OS and Python version are you using? Does your script come with a `if > __name__ == '__main__': `? > > On Tue, Dec 7, 2021 at 6:58 PM Steve Niemitz wrote: > >> I have a fairly simple python word count job (although the packaging is a >> little m

ImportError: __import__ not found on python job

2021-12-07 Thread Steve Niemitz
I have a fairly simple python word count job (although the packaging is a little more complicated) that I'm trying to run. (note: I'm explicitly NOT using save_main_session.) In it is a method to tokenize the incoming text to words, and I used something similar to how the wordcount example

Re: ExpansionService...as a service?

2021-10-06 Thread Steve Niemitz
n services as > a service) is another level that could be interesting to explore > (though more complicated, e.g. now you're hosting third party jars and > running arbitrary code). > > > On Wed, Oct 6, 2021 at 10:02 AM Steve Niemitz > wrote: > >> > >> I no

Re: ExpansionService...as a service?

2021-10-06 Thread Steve Niemitz
s > that you were noticing? > > On Wed, Oct 6, 2021 at 9:13 AM Steve Niemitz wrote: > > > > cool, thanks for the info. I might be the first to try then :) > > > > On Wed, Oct 6, 2021 at 12:00 PM Luke Cwik wrote: > >> > >> I believe that was one o

Re: ExpansionService...as a service?

2021-10-06 Thread Steve Niemitz
users so I don't believe what you > describe has been done. > > On Wed, Oct 6, 2021 at 8:37 AM Ahmet Altay wrote: > >> /cc @Chamikara Jayalath @Robert Bradshaw >> >> >> On Wed, Oct 6, 2021 at 6:36 AM Steve Niemitz wrote: >> >>> Has anyone

ExpansionService...as a service?

2021-10-06 Thread Steve Niemitz
Has anyone ever tried hosting a long-running expansion service as a real "service", the intent being that users don't need to run it locally, and can instead connect to the shared one when expanding pipelines? Looking around the code I already see a few assumptions that it will only live for a

Re: KafkaIO with DirectRunner is creating tons of connections to Kafka Brokers

2021-05-24 Thread Steve Niemitz
Out of curiosity, does adding the "--experiments=use_deprecated_read" argument fix things? (note, this flag was broken in beam 2.29 on the direct runner and didn't do anything, so you'd need to test on 2.28 or 2.30) On Mon, May 24, 2021 at 4:44 AM Sozonoff Serge wrote: > Hi. > > OK thanks.

Re: Extremely Slow DirectRunner

2021-05-12 Thread Steve Niemitz
Thanks, > Evan > > On Wed, May 12, 2021 at 17:12 Steve Niemitz wrote: > >> Yeah, sorry my email was confusing. use_deprecated_reads is broken on >> the DirectRunner in 2.29. >> >> The behavior you describe is exactly the behavior I ran into as well when >>

Re: Extremely Slow DirectRunner

2021-05-12 Thread Steve Niemitz
2, 2021 at 1:35 PM Evan Galpin wrote: > >> I just tried with v2.29.0 and use_deprecated_read but unfortunately I >> observed slow behavior again. Is it possible that use_deprecated_read is >> broken in 2.29.0 as well? >> >> Thanks, >> Evan >> >>

Re: Extremely Slow DirectRunner

2021-05-12 Thread Steve Niemitz
use case at least) regardless of use_deprecated_read setting. > > Thanks, > Evan > > > On Wed, May 12, 2021 at 2:47 PM Steve Niemitz wrote: > >> use_deprecated_read was broken in 2.19 on the direct runner and didn't do >> anything. [1] I don't think the fix is in 2.20 ei

Re: Extremely Slow DirectRunner

2021-05-12 Thread Steve Niemitz
use_deprecated_read was broken in 2.19 on the direct runner and didn't do anything. [1] I don't think the fix is in 2.20 either, but will be in 2.21. [1] https://github.com/apache/beam/pull/14469 On Wed, May 12, 2021 at 1:41 PM Evan Galpin wrote: > I forgot to also mention that in all tests I

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Steve Niemitz
This has been a common problem I've run into with lots of built-in IOs, I've generally submitted PRs for them to add support for emitting something once writed are completed. On Wed, Mar 24, 2021 at 1:04 PM Vincent Marquez wrote: > > *~Vincent* > > > On Wed, Mar 24, 2021 at 10:01 AM Reuven Lax

Re: Querying Dataflow job status via Java SDK

2020-10-12 Thread Steve Niemitz
he google client api wrappers (I'm not sure > if I know what they are.) > > Thank you! > > On Mon, Oct 12, 2020 at 11:04 AM Steve Niemitz > wrote: > >> We use the Dataflow API [1] directly, via the google api client wrappers >> (both python and java), pretty extensiv

Re: Querying Dataflow job status via Java SDK

2020-10-12 Thread Steve Niemitz
We use the Dataflow API [1] directly, via the google api client wrappers (both python and java), pretty extensively. It works well and doesn't require a dependency on beam. [1] https://cloud.google.com/dataflow/docs/reference/rest On Mon, Oct 12, 2020 at 1:56 PM Luke Cwik wrote: > It is your

Re: Building Dataflow Worker

2020-06-15 Thread Steve Niemitz
I think you want the "legacy-worker" target instead: ./gradlew -Ppublishing -PnoSigning :runners:google-cloud-dataflow-java:worker:legacy-worker:shadowJar That's what I've always used at least. On Mon, Jun 15, 2020 at 4:57 PM Luke Cwik wrote: > I noticed that you are not using the gradle

Re: daily dataflow job failing today

2020-02-12 Thread Steve Niemitz
avro-python3 1.9.2 was released on pypi 4 hours ago, and added pycodestyle as a dependency, probably related? On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik wrote: > +dev > > There was recently an update to add autoformatting to the Python SDK[1]. > > I'm seeing this during testing of a PR as well.

Re: Stability of Timer.withOutputTimestamp

2020-02-06 Thread Steve Niemitz
adjustments and you do not require Dataflow pipeline update > compatibility. > > Kenn > > On Wed, Feb 5, 2020 at 10:31 AM Luke Cwik wrote: > >> +Reuven Lax >> >> On Wed, Feb 5, 2020 at 7:33 AM Steve Niemitz wrote: >> >>> Also, as a fo

Re: Stability of Timer.withOutputTimestamp

2020-02-05 Thread Steve Niemitz
, 2020 at 10:01 AM Steve Niemitz wrote: > I noticed that Timer.withOutputTimestamp has landed in 2.19, but I didn't > see any mention of it in the release notes. > > Is this feature considered stable (specifically on dataflow)? >

Stability of Timer.withOutputTimestamp

2020-02-05 Thread Steve Niemitz
I noticed that Timer.withOutputTimestamp has landed in 2.19, but I didn't see any mention of it in the release notes. Is this feature considered stable (specifically on dataflow)?

Re: [FYI] Rephrasing the 'lull'/processing stuck logs

2020-01-09 Thread Steve Niemitz
One other nice enhancement around this would be if a transform could indicate that it was executing a "slow" operation. A good example is writing in BigQueryIO, it's very reasonable/normal for a load job to run for more than 5 minutes, and the "stuck" message can be confusing to users. The

Re: Memory profiling on Dataflow with java

2019-11-18 Thread Steve Niemitz
If you go the port forwarding route, you need to use a SOCKS proxy as well as forwarding the JMX port because of how JMX works. For example, I SSH into a worker with: ssh *-D -L :127.0.0.1: * and then launch eg, jvisualvm with: jvisualvm

Re: Questions about the bundled PubsubIO read implementation

2019-07-10 Thread Steve Niemitz
Oh, one other important thing I forgot to mention is that I can't reproduce (the empty message issue at least) locally on the DirectRunner. On Wed, Jul 10, 2019 at 6:04 PM Steve Niemitz wrote: > Thanks for making JIRAs for these, I was going to, I just wanted to do a > sanity check

Re: Questions about the bundled PubsubIO read implementation

2019-07-10 Thread Steve Niemitz
rk tracking: https://issues.apache.org/jira/browse/BEAM-7717 > > You reproduced these with the original PubsubIO? > > Kenn > > On Mon, Jul 8, 2019 at 10:38 AM Steve Niemitz wrote: > >> I was trying to use the bundled PubsubIO.Read implementation in beam on >> datafl

Questions about the bundled PubsubIO read implementation

2019-07-08 Thread Steve Niemitz
I was trying to use the bundled PubsubIO.Read implementation in beam on dataflow (using --experiments=enable_custom_pubsub_source to prevent dataflow from overriding it with its own implementation) and ran into some interesting issues. I was curious if people have any experience with these. I'd

Performance of Wait.on (and side inputs in general) in dataflow

2019-05-24 Thread Steve Niemitz
Hi everyone. I've been debugging a streaming job (on dataflow) for a little while now, and seem to have tracked it down to the Wait.on transform that I'm using. Some background: our pipeline takes in ~250,000 messages/sec from pubsub, aggregates them in an hourly window, and then emits the

Is it safe to cache the value of a singleton view (with a global window) in a DoFn?

2019-05-04 Thread Steve Niemitz
I have a singleton view in a global window that is read from a DoFn. I'm curious if its "correct" to cache that value from the view, or if I need to read it every time. As a (simplified) example, if I were to generate the view as such: input.getPipeline

Re: Performance of stateful DoFn vs CombineByKey

2019-03-18 Thread Steve Niemitz
when Combine is so similar? > https://beam.apache.org/blog/2017/02/13/stateful-processing.html#how-does-stateful-processing-fit-into-the-beam-model > > And here's a slide with the same idea but side-by-side illustrations: > https://s.apache.org/ffsf-2017-beam-state#slide=id.g1dbf0d46d2_0

Performance of stateful DoFn vs CombineByKey

2019-03-12 Thread Steve Niemitz
Hi all. I'm curious if anyone has done any comparison of the performance of a pipeline that uses CombineByKey, vs one that uses a stateful DoFn with combining state. [1] More specifically, if I had a pipeline that had a CombineByKey configured with early firings every N minutes, and I replaced

Re: Stateful processing : @OnWindowExpiration DoFn annotation

2019-02-26 Thread Steve Niemitz
;); > } > } > > > > Le 26 févr. 2019 à 00:49, Kenneth Knowles a écrit : > > Sorry you hit this issue. > > Implementation of this feature has been marked in progress [1] for a > while. It looks to be stalled so I unassigned the ticket. There is not any > expli

Re: Stateful processing : @OnWindowExpiration DoFn annotation

2019-02-25 Thread Steve Niemitz
I've noticed this doesn't seem to work either. The workaround is to just schedule an event-time timer at the end of the window + allowed lateness. The built-in GroupIntoBatches transform [1] does just this, I suspect to work around the issue as well. [1]

Re: Some questions about ensuring correctness with windowing and triggering

2019-02-19 Thread Steve Niemitz
, 2019 at 4:34 PM Kenneth Knowles wrote: > > > On Wed, Feb 13, 2019 at 3:11 PM Robert Bradshaw > wrote: > >> On Wed, Feb 13, 2019 at 11:39 PM Steve Niemitz >> wrote: >> >>> >>> On Wed, Feb 13, 2019 at 5:01 PM Robert Bradshaw >>> w

Re: Some questions about ensuring correctness with windowing and triggering

2019-02-13 Thread Steve Niemitz
On Wed, Feb 13, 2019 at 5:01 PM Robert Bradshaw wrote: > On Wed, Feb 13, 2019 at 5:07 PM Steve Niemitz wrote: > >> Thanks again for the answers so far! I really appreciate it. As for my >> specific use-case, we're using Bigtable as the final sink, and I'd prefer >>

Re: Some questions about ensuring correctness with windowing and triggering

2019-02-13 Thread Steve Niemitz
of the sample beam projects use early firings, but there's never any mention that the output may be out-of-order. On Wed, Feb 13, 2019 at 3:11 AM Robert Bradshaw wrote: > On Tue, Feb 12, 2019 at 7:38 PM Steve Niemitz > wrote: > > > > wow, thats super unexpected and dangerou

Re: Some questions about ensuring correctness with windowing and triggering

2019-02-12 Thread Steve Niemitz
eason about. > > > On Tue, Feb 12, 2019 at 7:19 PM Steve Niemitz wrote: > >> Also to clarify here (I re-read this and realized it could be slightly >> unclear). My question is only about in-order delivery of panes. ie: will >> pane P always be delivered before P+1. >

Re: Some questions about ensuring correctness with windowing and triggering

2019-02-12 Thread Steve Niemitz
P0, P2, P1, but then always PLast). On Tue, Feb 12, 2019 at 11:48 AM Steve Niemitz wrote: > Are you also saying also that even in the first example (Source -> > CombineByKey (Sum) -> Sink) there's no guarantee that events would be > delivered in-order from the Combine -> Sink

Re: Some questions about ensuring correctness with windowing and triggering

2019-02-12 Thread Steve Niemitz
mation that sort the elements and then dispatch them sorted out. >> >> Or uses the Sorter extension for this: >> >> https://github.com/apache/beam/tree/master/sdks/java/extensions/sorter >> >> Steve Niemitz schrieb am Di., 12. Feb. 2019, 16:31: >> >&

Some questions about ensuring correctness with windowing and triggering

2019-02-12 Thread Steve Niemitz
Hi everyone, I have some questions I want to ask about how windowing, triggering, and panes work together, and how to ensure correctness throughout a pipeline. Lets assume I have a very simple streaming pipeline that looks like: Source -> CombineByKey (Sum) -> Sink Given fixed windows of 1 hour,

Re: Using gRPC with PubsubIO?

2019-01-02 Thread Steve Niemitz
Something to consider: if you're running in Dataflow, the entire Pubsub read step becomes a noop [1], and the underlying streaming implementation itself handles reading from pubsub (either windmill or the streaming engine). [1]

Re: Join PCollection Data with HBase Large Data - Suggestion Requested

2018-12-04 Thread Steve Niemitz
ut further details my suggestion is a guess. > > Also. the implementation for state storage is Runner dependent but I am > aware of users storing very large amounts (>> 1 TiB) within state on > Dataflow and in general scales very well with the number of keys and > windows. > &

Re: Join PCollection Data with HBase Large Data - Suggestion Requested

2018-12-04 Thread Steve Niemitz
We have a similar use case, except with BigtableIO instead of HBase. We ended up building a custom transform that was basically PCollection[ByteString] -> PCollection[BigtableRow], and would fetch rows from Bigtable based on the input, however it's tricky to get right because of batching, etc.

Re:

2017-12-01 Thread Steve Niemitz
I do something almost exactly like this, but with BigtableIO instead. I have a pull request open here [1] (which reminds me I need to finish this up...). It would really be nice for most IOs to support something like this. Essentially you do a GroupByKey (or some CombineFn) on the output from

Re: Slack channel

2017-08-16 Thread Steve Niemitz
cy > for having a bot which sends invites out automatically. > > On Wed, Aug 16, 2017 at 10:18 AM, Apache Enthu <apacheen...@gmail.com> > wrote: > >> Please could you add me too? >> >> Thanks, >> Almas >> >> On 16 Aug 2017 22:41, "Steve Niemi

Re: Slack channel

2017-08-16 Thread Steve Niemitz
I'll jump on this thread as well, can I get an invite too? Also, has anyone though of making this self service? The apache mesos slack has this set up [1]. [1] https://mesos-slackin.herokuapp.com On Aug 16, 2017 1:08 PM, "Griselda Cuevas" wrote: > Hi Manu, I'd like to piggy