Re: Job execution starts very slowly on Flink server.

2024-05-07 Thread Steve Niemitz via dev
This sounds about right to me. When I was working on getting flink running this was one of the biggest pain points, the upload/download process is painfully slow. We had implemented a jar cache in flink to improve this (and stopped using fat jars) https://github.com/twitter-forks/flink/commit/01f

Re: [ANNOUNCE] New committer: Steven Niemitz

2022-07-21 Thread Steve Niemitz via dev
Thanks everyone! On Thu, Jul 21, 2022 at 2:23 AM Moritz Mack wrote: > Congrats, Steven! > > > > On 21.07.22, 05:25, "Evan Galpin" wrote: > > > > Congrats! Well deserved! On Wed, Jul 20, 2022 at 15:⁠​17 Chamikara > Jayalath via dev wrote:⁠​ Congrats, Steve! On > Wed, Jul 20, 2022, 9:⁠​16 AM Aus

Re: Null PCollection errors in v2.40 unit tests

2022-06-14 Thread Steve Niemitz
I had brought up a weird issues I was having with AutoValue awhile ago that looks actually very similar to this: https://lists.apache.org/thread/0sbkykop2gsw71jpf3ln6forbnwq3j4o I never got to the bottom of it, but `--rerun-tasks` always fixes it for me. On Tue, Jun 14, 2022 at 5:11 PM Danny McC

Re: Timer bug in 2.37 around output timestamps?

2022-04-04 Thread Steve Niemitz
> get this in faster we should definitely move forward with the more-targeted > PR. > > On Mon, Apr 4, 2022 at 6:59 AM Steve Niemitz wrote: > >> Oh I had forgotten about that thread, good point, it is very related to >> this. I agree we should fix this, to force the conv

Re: Timer bug in 2.37 around output timestamps?

2022-04-04 Thread Steve Niemitz
ataflow's timer implementation, where resetting a timer with a new output timestamp doesn't clear the existing one, resulting in multiple timers being set and firing. On Mon, Apr 4, 2022 at 4:47 AM Jan Lukavský wrote: > On 4/2/22 16:41, Steve Niemitz wrote: > > I've du

Re: Timer bug in 2.37 around output timestamps?

2022-04-02 Thread Steve Niemitz
. [1] https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/SimpleDoFnRunner.java#L212 On Sat, Apr 2, 2022 at 10:41 AM Steve Niemitz wrote: > I've dug into this some more and have a couple observations/questions: > - I added some lo

Re: Timer bug in 2.37 around output timestamps?

2022-04-02 Thread Steve Niemitz
the _output_ watermark is held in the stateful DoFn. Is there a reason the input watermark is used here, rather than the actual output timestamp of the timer (or the output watermark)? On Fri, Apr 1, 2022 at 8:12 PM Steve Niemitz wrote: > While I do agree the symptoms are similar, I don

Re: Timer bug in 2.37 around output timestamps?

2022-04-01 Thread Steve Niemitz
tween elements of that bundle causing some timestamps to be “invalid”. > > [1] > https://lists.apache.org/thread/10db7l9bhnhmo484myps723sfxtjwwmv > [2] > https://lists.apache.org/thread/mtwtno2o88lx3zl12jlz7o5w1lcgm2db > > - Evan > > On Fri, Apr 1, 2022 at 18:35 Steve Niem

Re: Timer bug in 2.37 around output timestamps?

2022-04-01 Thread Steve Niemitz
gt; > 1: https://issues.apache.org/jira/browse/BEAM-12931 > > > On Fri, Apr 1, 2022 at 2:17 PM Steve Niemitz wrote: > >> I'm unclear how the timer would have even ever been set to output at that >> timestamp though. The output timestamp falls into the next win

Re: Timer bug in 2.37 around output timestamps?

2022-04-01 Thread Steve Niemitz
[1] https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/SimpleDoFnRunner.java#L1284 On Fri, Apr 1, 2022 at 5:11 PM Reuven Lax wrote: > There is a long-standing bug with processing timestamps. > > On Fri, Apr 1, 2022 at 2:01 PM Steve Niemitz

Timer bug in 2.37 around output timestamps?

2022-04-01 Thread Steve Niemitz
We have a job that uses processing time timers, and just upgraded from 2.33 to 2.37. Sporadically we've started seeing jobs fail with this error: java.lang.IllegalArgumentException: Cannot output with timestamp 2022-04-01T19:19:59.999Z. Output timestamps must be no earlier than the output timesta

Re: Possible 2.36.0 regression with CoGbkResult

2022-03-24 Thread Steve Niemitz
rue, but next() == null. > >> > >> - Claire > >> > >> On Tue, Mar 22, 2022 at 6:13 PM Robert Bradshaw > wrote: > >>> > >>> BEAM-13541 is as far as I'm aware the only major change that has > >>> happened in this code rece

Re: Weird AutoValue bug when compiling GenerateSequence.Builder

2022-03-24 Thread Steve Niemitz
if I watch the file it looks correct for a minute, then changes after sdks-java-core builds. On Thu, Mar 24, 2022 at 8:02 PM Steve Niemitz wrote: > ./gradlew --rerun-tasks fixed it. I had tried a clean many times (and > deleted the generated file even), who knows what was actually cached

Re: Weird AutoValue bug when compiling GenerateSequence.Builder

2022-03-24 Thread Steve Niemitz
./gradlew --rerun-tasks fixed it. I had tried a clean many times (and deleted the generated file even), who knows what was actually cached that was wrong... On Thu, Mar 24, 2022 at 7:57 PM Steve Niemitz wrote: > 2.37 is on 1.8.2, I also tried 1.9.0 (which is what's on master) and ra

Re: Weird AutoValue bug when compiling GenerateSequence.Builder

2022-03-24 Thread Steve Niemitz
every time I build I end up with the wrong generated builder, I had the correct one at some point, but now it's back to always being wrong. On Thu, Mar 24, 2022 at 7:03 PM Reuven Lax wrote: > Did we pick up a new version of AutoValue? > > On Thu, Mar 24, 2022 at 3:01 PM Steve Niemi

Weird AutoValue bug when compiling GenerateSequence.Builder

2022-03-24 Thread Steve Niemitz
Seemingly randomly when I build beam (I'm building 2.37 right now), the AutoValue builder for GenerateSequence seems to ignore the @Nullable attribute on many of the fields, resulting in an AutoValue builder that enforces all fields are set. This breaks building eg GenerateSequence.from(...).to(..

Re: Possible 2.36.0 regression with CoGbkResult

2022-03-24 Thread Steve Niemitz
t; loads CoGbkResult iterables - in one of our internal pipielines that > started failing, I added some logging to CoGbkResult and found that the > TagIterable’s iterator would return hasNext() == true, but next() == null. > >>>> > >>>> - Claire > >>

Re: Possible 2.36.0 regression with CoGbkResult

2022-03-22 Thread Steve Niemitz
or 2.35.0... > It does seem to depend on the number of elements in the group. > > > On Tue, 22 Mar 2022 at 22:27, Steve Niemitz wrote: > >> Your email was actually what made me notice this! :D >> >> I haven't been able to reproduce the NPE you found (also on 2.37

Re: Possible 2.36.0 regression with CoGbkResult

2022-03-22 Thread Steve Niemitz
Your email was actually what made me notice this! :D I haven't been able to reproduce the NPE you found (also on 2.37) but that certainly doesn't mean it's not a bug. On Tue, Mar 22, 2022 at 5:23 PM Niel Markwick wrote: > I have also seen this with Java beam 2.36.0 and 2.37.0, again with larg

Re: CoGbkResult size sampling performance issues

2022-03-22 Thread Steve Niemitz
low/worker/GroupingShuffleReader.java#L452 On Tue, Mar 22, 2022 at 4:18 PM Robert Bradshaw wrote: > If we're not using ElementByteSizeObservable for our CoGBK outputs, > that's the right fix to make. > > On Tue, Mar 22, 2022 at 12:20 PM Steve Niemitz > wrote: > > > > A

Re: CoGbkResult size sampling performance issues

2022-03-22 Thread Steve Niemitz
com/apache/beam/blob/9a36490de8129359106baf37e1d0b071e19e303a/sdks/java/core/src/main/java/org/apache/beam/sdk/util/common/ElementByteSizeObservableIterable.java#L54 On Tue, Mar 22, 2022 at 3:07 PM Steve Niemitz wrote: > Oh that's interesting I didn't know it even had that optim

Re: CoGbkResult size sampling performance issues

2022-03-22 Thread Steve Niemitz
erver.java#L29 > 2: > https://github.com/apache/beam/blob/9a36490de8129359106baf37e1d0b071e19e303a/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/IterableLikeCoder.java#L192 > > > > On Wed, Mar 16, 2022 at 4:09 PM Steve Niemitz wrote: > >> The thread a couple

CoGbkResult size sampling performance issues

2022-03-16 Thread Steve Niemitz
The thread a couple days ago about CoGroupByKey being possibly broken in beam 2.36.0 [1] had an interesting thing in it that had me thinking. The CoGbkResultCoder doesn't override getEncodedElementByteSize (nor does it know how to actually compute the size in any efficient way), so sampling a CoGb

Re: GCS staging location when uploading artifacts from python for dataflow

2022-01-15 Thread Steve Niemitz
> with Dataflow prime for you and if not you can control what the Python > container is doing. > > On Mon, Dec 13, 2021 at 12:28 PM Steve Niemitz > wrote: > >> I spent a bunch of time on this over the last week and arrived at >> something that works. I had to try a f

Re: Some questions about external tables in BeamSQL

2022-01-13 Thread Steve Niemitz
Thanks for the quick responses! Mine are inline as well. On Thu, Jan 13, 2022 at 9:01 PM Brian Hulette wrote: > I added some responses inline. Also adding dev@ since this is getting > into SQL internals. > > On Thu, Jan 13, 2022 at 10:29 AM Steve Niemitz > wrote: > >>

Re: Default output timestamp of processing-time timers

2021-12-14 Thread Steve Niemitz
ent time timer, right? Correct, the main thing I'm trying to solve is having to recalculate an output timestamp using the same logic that the timer itself is using to set its firing timestamp. On Tue, Dec 14, 2021 at 4:10 PM Kenneth Knowles wrote: > > > On Tue, Dec 7, 2021 at 7:27 AM

Re: GCS staging location when uploading artifacts from python for dataflow

2021-12-13 Thread Steve Niemitz
7;t want it to be the default. On Tue, Dec 7, 2021 at 12:49 PM Steve Niemitz wrote: > I noticed that the python dataflow runner appends some "uniqueness" (the > timestamp) [1] to the staging directory when staging artifacts for a > dataflow job. This is very suboptimal be

GCS staging location when uploading artifacts from python for dataflow

2021-12-07 Thread Steve Niemitz
I noticed that the python dataflow runner appends some "uniqueness" (the timestamp) [1] to the staging directory when staging artifacts for a dataflow job. This is very suboptimal because it makes caching artifacts between job runs useless. The jvm runner doesn't do this, is there a good reason t

Default output timestamp of processing-time timers

2021-12-07 Thread Steve Niemitz
If I have a processing time timer, is there any way to automatically set the output timestamp to the timer firing timestamp (similar to how event-time timers work). A common use case would be to do something like: timer.offset(X).align(Y).setRelative() but have the output timestamp be the firing

Re: UTF-8 passthrough with beam Rows

2021-11-30 Thread Steve Niemitz
e to pass through columns without conversion. The former API > was 'Object getValue(...)' which would return either the base type or > logical type. That approach was fragile (relied on Object) and broken by > other performance work so it was removed: > https://github.com/apache/b

Re: UTF-8 passthrough with beam Rows

2021-11-30 Thread Steve Niemitz
p track of the original encoded bytes in the Row object. Unlike Avro we >>> have a lot of control over creating new Row objects, since we discourage >>> users from manipulating Rows directly. So we could potentially make sure >>> some sidecar tracking encoded bytes alwa

Re: UTF-8 passthrough with beam Rows

2021-11-30 Thread Steve Niemitz
ncoded String? > > [1] > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/RowWithGetters.java > > On Tue, Nov 30, 2021 at 8:17 AM Reuven Lax wrote: > >> I'm intrigued - how do you imagine doing this in RowCoder? >>

Re: UTF-8 passthrough with beam Rows

2021-11-30 Thread Steve Niemitz
gs. This would be a breaking change though for existing pipelines so it'd have to be opt-in. On Tue, Nov 30, 2021 at 11:17 AM Reuven Lax wrote: > I'm intrigued - how do you imagine doing this in RowCoder? > > On Tue, Nov 30, 2021 at 7:49 AM Steve Niemitz > wrote: > >

UTF-8 passthrough with beam Rows

2021-11-30 Thread Steve Niemitz
A common use case we're running into with beam rows is something like: - Read data from source X - Convert to Row - Encode row (generally for xlang) In cases like this, I've noticed that we spend a significant (30%+) amount of time just decoding and re-encoding strings. Avro has a nice solution t

Re: Portable artifact retrieval using URLs

2021-11-02 Thread Steve Niemitz
ifacts that you need. > This will help speed up how fast the workers start in Dataflow in addition > to not having to workaround the fact that only FILE is supported. > > On Tue, Nov 2, 2021 at 8:03 AM Steve Niemitz wrote: > >> We're working on running an "expansion s

Portable artifact retrieval using URLs

2021-11-02 Thread Steve Niemitz
We're working on running an "expansion service as a service" for xlang transforms. One of the things we'd really like is to serve the actual required artifacts to the client (submitting the pipeline) from our blobstore rather than streaming it through the artifact retrieval API (GetArtifact). I h

Re: Running python tests in an IDE (IntelliJ, PyCharm)

2021-10-19 Thread Steve Niemitz
ata > s vetaksund...@google.com > > > > On Tue, Oct 19, 2021 at 2:13 PM Steve Niemitz wrote: > >> Has anyone gotten the beam python sdk tests running in an IDE (I'm >> specifically trying with IntelliJ/PyCharm)? I didn't see anything on the >> wiki about it and haven't gotten it working yet. >> >

Running python tests in an IDE (IntelliJ, PyCharm)

2021-10-19 Thread Steve Niemitz
Has anyone gotten the beam python sdk tests running in an IDE (I'm specifically trying with IntelliJ/PyCharm)? I didn't see anything on the wiki about it and haven't gotten it working yet.

Re: xlang compatibility issues with Rows

2021-10-12 Thread Steve Niemitz
ython support for additional types, see BEAM-7996 [1] >>> > - Regarding cross-language testing, see tests for beam:coder:row:v1 in >>> standard_coders.yaml [2] >>> > >>> > Brian >>> > >>> > [1] https://issues.apache.org/jira/browse

Re: xlang compatibility issues with Rows

2021-10-12 Thread Steve Niemitz
t;> [1] https://issues.apache.org/jira/browse/BEAM-7996 >> [2] >> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml >> >> >> On Tue, Oct 12, 2021 at 1:55 PM Steve Niemitz >> wrote: >>

Re: xlang compatibility issues with Rows

2021-10-12 Thread Steve Niemitz
rpreted as that by the that an >>>> empty byte array indicating no nils. This is what go implements in the >>>> coder.WriteRowHeader and coder.ReadRowHeader functions. >>>> >>>> But that strikes me as a special case for fully populated rows, not a &g

Re: xlang compatibility issues with Rows

2021-10-12 Thread Steve Niemitz
x wrote: > >> Do you think that BitSetCoder is incorrect? >> >> On Tue, Oct 12, 2021 at 1:27 PM Steve Niemitz >> wrote: >> >>> Yeah I believe they're all bugs/missing features in the python >>> implementation. The nullable BitSet one is arguab

Re: xlang compatibility issues with Rows

2021-10-12 Thread Steve Niemitz
e same bug there, in which case that's two languages doing it "wrong" and one doing it "right". :P On Tue, Oct 12, 2021 at 4:20 PM Reuven Lax wrote: > These are bugs in Python, correct? > > On Tue, Oct 12, 2021 at 1:18 PM Steve Niemitz wrote: > >> It seems

xlang compatibility issues with Rows

2021-10-12 Thread Steve Niemitz
It seems like there's a good amount of incompatibility between java and python wrt beam Rows. For example the following are unsupported in python (that I've noticed so far) - BYTE - INT16 - OneOf Additionally, it seems like nullable fields don't really work correctly, the java BitSetCoder won't e

Re: Force SDF to run on every task the JAR is loaded on?

2021-10-07 Thread Steve Niemitz
g-a-pipeline#workers>, > so I assume I can always ignore the warning there and mess around with the > instance group settings in a totally unsupported way if I have to. > > Thanks for the help! > > > On Thu, Oct 7, 2021 at 2:15 PM Steve Niemitz wrote: > >> hm

Re: Force SDF to run on every task the JAR is loaded on?

2021-10-07 Thread Steve Niemitz
t;> constructs before the SDF to add more http listeners with different >> configurations. Also, once there is a runner that supports dynamic >> splitting you would be able to scale up and down based upon incoming >> requests and might be able to reduce the burden/load on

Re: Force SDF to run on every task the JAR is loaded on?

2021-10-07 Thread Steve Niemitz
unrelated to the actual question, but iirc dataflow workers have iptables rules that drop all inbound traffic (other than a few exceptions). In any case, do you actually need the server part to be "inside" the pipeline? Could you just use a JvmInitalizer to launch the http server and do the pubsu

Re: Support for avro when writing to BQ using the streaming write API

2021-08-05 Thread Steve Niemitz
o the same when wrapping protobufs > and AuttoValue, though I'm not entirely sure about Avro. > > > On Thu, Aug 5, 2021 at 6:42 AM Steve Niemitz wrote: > >> We currently have a lot of jobs that write to BQ using avro (using >> withAvroFormatFunction), and would like

Support for avro when writing to BQ using the streaming write API

2021-08-05 Thread Steve Niemitz
We currently have a lot of jobs that write to BQ using avro (using withAvroFormatFunction), and would like to start experimenting with the streaming write API. I see two options for how to do this: - Implement support for avro -> protobuf like TableRow and beam Row do - Add something like withProt

Re: Streaming side-input performance in dataflow

2021-07-28 Thread Steve Niemitz
that is missing from the cache. [1] https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingSideInputFetcher.java#L238 On Thu, Jul 22, 2021 at 2:41 PM Kenneth Knowles wrote: > On Thu, Jul 22, 2021 a

Re: BigQueryIO SchemaUpdateOptions incompatible with temp tables?

2021-07-27 Thread Steve Niemitz
I think the same problem was recently fixed in python [1]. It'd be great to fix this in java though, we hit this a bunch, I've never had enough time to fix it. [1] https://github.com/apache/beam/pull/14113 On Tue, Jul 27, 2021 at 4:19 PM Chamikara Jayalath wrote: > I don't have a lot of contex

Re: Streaming side-input performance in dataflow

2021-07-22 Thread Steve Niemitz
nging the cache to cache tombstones for cleared cells is another possibility as well. On Thu, Jul 22, 2021 at 2:33 AM Reuven Lax wrote: > So you're saying there's a bug in the caching logic that prevents the > side-input cache from working? > > On Wed, Jul 21, 2021 at 7:0

Streaming side-input performance in dataflow

2021-07-21 Thread Steve Niemitz
I had opened a jira years ago [1] about this, but would like to actually fix it for real now, given that our users have started using streaming more and more. There's more detail in the jira, but basically side inputs in streaming pipelines on dataflow lead to pretty bad performance because they r

Re: Extremely Slow DirectRunner

2021-05-12 Thread Steve Niemitz
, > Evan > > On Wed, May 12, 2021 at 17:12 Steve Niemitz wrote: > >> Yeah, sorry my email was confusing. use_deprecated_reads is broken on >> the DirectRunner in 2.29. >> >> The behavior you describe is exactly the behavior I ran into as well when >> read

Re: Extremely Slow DirectRunner

2021-05-12 Thread Steve Niemitz
2, 2021 at 1:35 PM Evan Galpin wrote: > >> I just tried with v2.29.0 and use_deprecated_read but unfortunately I >> observed slow behavior again. Is it possible that use_deprecated_read is >> broken in 2.29.0 as well? >> >> Thanks, >> Evan >> >>

Re: Extremely Slow DirectRunner

2021-05-12 Thread Steve Niemitz
case at least) regardless of use_deprecated_read setting. > > Thanks, > Evan > > > On Wed, May 12, 2021 at 2:47 PM Steve Niemitz wrote: > >> use_deprecated_read was broken in 2.19 on the direct runner and didn't do >> anything. [1] I don't think the fix is

Re: Extremely Slow DirectRunner

2021-05-12 Thread Steve Niemitz
use_deprecated_read was broken in 2.19 on the direct runner and didn't do anything. [1] I don't think the fix is in 2.20 either, but will be in 2.21. [1] https://github.com/apache/beam/pull/14469 On Wed, May 12, 2021 at 1:41 PM Evan Galpin wrote: > I forgot to also mention that in all tests I

Re: User-related questions in dev@ list

2021-03-10 Thread Steve Niemitz
As a frequent emailer of dev@, I'll admit that it's often very difficult to figure out if I should be emailing user@ or dev@, and typically just chose dev@ because it seems more likely to get an answer there. Having clearer guidelines around what is a "dev" topic would be very useful to better gui

Re: Question about Schema Logical Types and RowWithGetters

2021-03-01 Thread Steve Niemitz
n-internal? On Mon, Mar 1, 2021 at 2:54 PM Reuven Lax wrote: > RowWithGetters is used as an internal detail in Beam, so it special cases > internal Beam-provided logical types. If we want it to work well with user > logical types, we might need to redesign it a bit. > > On Mon, Mar

Question about Schema Logical Types and RowWithGetters

2021-03-01 Thread Steve Niemitz
I'm working on a little library to wrap domain objects into Rows (using RowWithGetters) and am trying to implement a new LogicalType. I'm struggling to understand how it's supposed to interact with RowWithGetters however. Let's say that my LogicalType has a java type of MyLogicalType.Value, and b

Problems building latest beam source

2021-01-25 Thread Steve Niemitz
I ran into issues resolving org.pentaho:pentaho-aggdesigner-algorithm:5.1.5-jhyde, it looks like the spring repo hosting it has locked the artifact, I get a 401 unauthorized trying to download it. Has anyone else run into this? I assume many people have the artifact cached locally and so haven't

Re: Possible issue with bounded Read translation using SDF

2020-12-18 Thread Steve Niemitz
I think this actually the same problem as I reported w/ the PubsubIO [1], but in the bounded case. The BoundedSourceAsSDFWrapper closes (and then re-creates) the underlying source each time it checkpoints, and the default behavior is to checkpoint very frequently. [1] https://lists.apache.org/thr

XLang pipelines in dataflow pull from docker.io by default?

2020-12-18 Thread Steve Niemitz
I'm playing around with xlang portable pipelines in dataflow and noticed that it tries to pull the java harness (beam_java8_sdk:2.25.0) from docker.io. This is problematic because our VPC prevents access to external hosts. I was able to fix the problem by passing in --sdk_harness_container_image

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-17 Thread Steve Niemitz
they shouldn't be > static, e.g. it might be preferable to start emitting results right away, > but use larger batches for the steady state if there are performance > benefits.) > >>> > >>> That being said, it sounds like there's something deeper going on

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-16 Thread Steve Niemitz
eems like the 1 second limit kicks in before the output >> reaches 100 elements. >> >> > >> >> > I think the original purpose for DirectRunner to use a small limit >> on issuing checkpoint requests is for exercising SDF better in a small data >>

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-11 Thread Steve Niemitz
s/C9H0YNP3P/p1607057900393900 >> > >> >> > >> On Thu, Dec 3, 2020 at 1:56 PM Boyuan Zhang >> wrote: >> > >> >> > >>> Hi Steve, >> > >>> >> > >>> I think the major performance regression comes from >> > >>> OutputAn

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-04 Thread Steve Niemitz
s. > > [1] > https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/OutputAndTimeBoundedSplittableProcessElementInvoker.java > > On Thu, Dec 3, 2020 at 9:40 AM Steve Niemitz wrote: > >> I have a pipeline that reads from pubsu

Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-03 Thread Steve Niemitz
I have a pipeline that reads from pubsub, does some aggregation, and writes to various places. Previously, in older versions of beam, when running this in the DirectRunner, messages would go through the pipeline almost instantly, making it very easy to debug locally, etc. However, after upgrading

Re: Create External Transform with WindowFn

2020-11-30 Thread Steve Niemitz
alright, thank you. Is BEAM-10507 the jira to watch for any progress on that? On Mon, Nov 30, 2020 at 12:55 PM Boyuan Zhang wrote: > Hi Steve, > > Unfortunately I don't think there is a workaround before we have the > change that Cham mentions. > > On Mon, Nov 30, 2020 a

Re: Create External Transform with WindowFn

2020-11-30 Thread Steve Niemitz
I'm trying to write an xlang transform that uses Reshuffle internally, and ran into this as well. Is there any workaround to this for now (other than removing the reshuffle), or do I just need to wait for what Chamikara mentioned? I noticed the same issue was mentioned in the SnowflakeIO.Read PR

Re: Issues building/running 2.25 on java 8

2020-11-06 Thread Steve Niemitz
ommit/7fec038bf9e3861462744ba5522208a4b9d15b85#diff-0435a83a413ec063bf7e682cadcd56776cd18fc878f197cc99a65fc231ef2047 On Fri, Nov 6, 2020 at 6:27 PM Steve Niemitz wrote: > yeah, I built it via: > JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64 ./gradlew --no-daemon > -Ppublishing -PnoSigning publishMavenJavaPublicati

Re: Issues building/running 2.25 on java 8

2020-11-06 Thread Steve Niemitz
https://issues.apache.org/jira/browse/BEAM-11080) > > On Fri, Nov 6, 2020 at 3:13 PM Steve Niemitz wrote: > >> I'm trying out 2.25 (built from source, using java 8), and running into >> this error, both on the direct runner and dataflow: >> >

Issues building/running 2.25 on java 8

2020-11-06 Thread Steve Niemitz
I'm trying out 2.25 (built from source, using java 8), and running into this error, both on the direct runner and dataflow: Caused by: java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer; at com.google.protobuf.NioByteString.copyToInternal(NioByteString.java:112) at co

Re: 2.22.0 Release Update

2020-05-26 Thread Steve Niemitz
On Tue, May 26, 2020 at 12:20 PM Brian Hulette wrote: > > On Mon, May 25, 2020 at 1:30 PM Steve Niemitz wrote: > >> I tried to validate the release branch on dataflow, but it looks like >> there's no tag yet for the 2.22 docker containers [1] >> > > I hav

Re: 2.22.0 Release Update

2020-05-25 Thread Steve Niemitz
I tried to validate the release branch on dataflow, but it looks like there's no tag yet for the 2.22 docker containers [1] I'm also fairly concerned with the artifact staging regressions introduced in BEAM-9383 [2] [1] https://console.cloud.google.com/gcr/images/cloud-dataflow/GLOBAL/v1beta3/bea

Re: [VOTE] Release 2.21.0, release candidate #1

2020-05-19 Thread Steve Niemitz
https://issues.apache.org/jira/browse/BEAM-10015 was marked as 2.21 but isn't in the RC1 tag. It's marked as P1, and seems like the implication is that without the fix, pipelines can produce incorrect data. Is this a blocker? On Tue, May 19, 2020 at 4:51 PM Kyle Weaver wrote: > Hi everyone, >

Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Steve Niemitz
those initial splits > but helps a lot with worker rebalancing since a few workers are usually > stragglers and will need some help at the end of a pipeline. > > On Fri, May 15, 2020 at 1:54 PM Steve Niemitz wrote: > >> Thanks for the replies so far. I should have specifically m

Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Steve Niemitz
do today but give up the composability and any splitting support for > unbounded SDFs that will come later. > > > > Finally, there is a way to get limited support for dynamic splitting of > bounded and unbounded SDFs for other runners using the composability of > SDFs and the li

Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Steve Niemitz
I'm going to be writing a new IO (in java) for reading files in a custom format, and want to make it splittable. It seems like I have a choice between the "legacy" source API, and newer experimental SDF API. Is there any guidance on which I should use? I can likely tolerate some API churn as wel

Re: Dataflow Streaming ValidatesRunner

2020-04-09 Thread Steve Niemitz
re looking to have the first API stable version using the portability > framework with 2.21.0 which should mean that tests that run outside of > Google will be possible. > > > On Thu, Apr 9, 2020 at 7:17 AM Steve Niemitz wrote: > >> I was trying to run a @ValidatesRunner test

Dataflow Streaming ValidatesRunner

2020-04-09 Thread Steve Niemitz
I was trying to run a @ValidatesRunner test for the streaming dataflow runner, but I actually can't find any way to run them in streaming. It looks like all the tests are set up using the Create transform, which generates a batch pipeline. Are there actually no @ValidatesRunner tests for the stre

Re: [VOTE] Release 2.20.0, release candidate #1

2020-04-06 Thread Steve Niemitz
g, no test/validation > gave a signal that something relevant was broken. Plus that fix didn't > include a test. > > I will hesitate to say such a fix is critical for a release, unless there > is something to test or validate it. > > > -Rui > > On Mon, Apr 6, 20

Re: [VOTE] Release 2.20.0, release candidate #1

2020-04-06 Thread Steve Niemitz
> > On Mon, Apr 6, 2020 at 2:38 PM Steve Niemitz wrote: > >> I can confirm that the artifact on maven central [1] does not have the >> change in it either, I disassembled it with javap. >> >> [1] >> https://repository.apache.org/content/repositories/orgapac

Re: [VOTE] Release 2.20.0, release candidate #1

2020-04-06 Thread Steve Niemitz
I can confirm that the artifact on maven central [1] does not have the change in it either, I disassembled it with javap. [1] https://repository.apache.org/content/repositories/orgapachebeam-1100/org/apache/beam/beam-runners-core-java/2.20.0/beam-runners-core-java-2.20.0.jar On Mon, Apr 6, 2020 a

Re: ZetaSQL to Calcite translation layer

2020-03-26 Thread Steve Niemitz
to solve your case? If > not, why? > > > [1]: > https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-extensions-sql-zetasql/2.17.0 > > > > > -Rui > > On Thu, Mar 26, 2020 at 1:23 PM Steve Niemitz wrote: > >> Oh I think I actually remembe

Re: ZetaSQL to Calcite translation layer

2020-03-26 Thread Steve Niemitz
add a set of core SQL interfaces that only depend on Calcite and > then split our ZetaSQL translator into a piece that only depends on those > interfaces, Calcite, and ZetaSQL. > > Andrew > > On Thu, Mar 26, 2020 at 12:41 PM Steve Niemitz > wrote: > >> The ZetaSQL to calc

ZetaSQL to Calcite translation layer

2020-03-26 Thread Steve Niemitz
The ZetaSQL to calcite translation layer that is bundled with beam seems generally useful in cases other than for beam. In fact, we're using (essentially a fork of) it internally outside of beam right now (and I've fixed a bunch of bugs in it). Has there ever been any thought about splitting into

Re: [PROPOSAL] Preparing for Beam 2.20.0 release

2020-03-23 Thread Steve Niemitz
https://issues.apache.org/jira/browse/BEAM-9557 should also be a blocker IMO, as it breaks existing pipelines. On Mon, Mar 23, 2020 at 1:17 PM Maximilian Michels wrote: > Just mentioning that we have discovered > https://issues.apache.org/jira/browse/BEAM-9566 and > https://issues.apache.org/jir

BEAM-9557 - error setting processing time timers past end-of-window

2020-03-19 Thread Steve Niemitz
I just opened this bug because I was testing out the 2.20 release and ran into it. It seems like it's now an error to set a processing time timer that happens to fire after the end of a window. In the past this worked fine (and I think it would just not fire). This seems like a pretty bad regres

Re: daily dataflow job failing today

2020-02-12 Thread Steve Niemitz
avro-python3 1.9.2 was released on pypi 4 hours ago, and added pycodestyle as a dependency, probably related? On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik wrote: > +dev > > There was recently an update to add autoformatting to the Python SDK[1]. > > I'm seeing this during testing of a PR as well.

Re: Dropping late data in DirectRunner

2020-01-03 Thread Steve Niemitz
I do agree that the direct runner doesn't drop late data arriving at a stateful DoFn (I just tested as well). However, I believe this is consistent with other runners. I'm fairly certain (at least last time I checked) that at least Dataflow will also only drop late data at GBK operations, and NOT

Re: real real-time beam

2019-11-25 Thread Steve Niemitz
If you have a pipeline that looks like Input -> GroupByKey -> ParDo, while it is not guaranteed, in practice the sink will observe the trigger firings in order (per key), since it'll be fused to the output of the GBK operation (in all runners I know of). There have been a couple threads about trig

Re: Triggers still finish and drop all data

2019-11-08 Thread Steve Niemitz
withSummarizer(combineFn) > .withPredicate(summary -> ...) > > Something like that? The complexity is not much less than just writing a > stateful DoFn directly, but the boilerplate is much less. > > Kenn > > On Thu, Nov 7, 2019 at 2:02 PM Steve Niemitz wrote: > >> Inter

Re: Triggers still finish and drop all data

2019-11-07 Thread Steve Niemitz
Interestingly enough, we just had a use case come up that I think could have been solved by finishing triggers. Basically, we want to emit a notification when a certain threshold is reached (in this case, we saw at least N elements for a given key), and then never notify again within that window.

Re: using avro instead of json for BigQueryIO.Write

2019-09-27 Thread Steve Niemitz
lo Estrada wrote: > Thanks for offering to work on this! It would be awesome to have it. I can > say that we don't have that for Python ATM. > > On Mon, Sep 16, 2019 at 10:56 AM Steve Niemitz > wrote: > >> Our experience has actually been that avro is more efficient than

Re: using avro instead of json for BigQueryIO.Write

2019-09-16 Thread Steve Niemitz
en an avro version would be much > more efficient. I believe Parquet files would be even more efficient if you > wanted to go that path, but there might be more code to write (as we > already have some code in the codebase to convert between TableRows and > Avro). > > Reuven > >

using avro instead of json for BigQueryIO.Write

2019-09-16 Thread Steve Niemitz
Has anyone investigated using avro rather than json to load data into BigQuery using BigQueryIO (+ FILE_LOADS)? I'd be interested in enhancing it to support this, but I'm curious if there's any prior work here.

Re: Bucketed histogram metrics in beam. Anyone currently looking into this?

2019-07-12 Thread Steve Niemitz
I've been doing some experiments in my own fork of the Dataflow worker using HdrHistogram [1] to record histograms. I export them to our own stats collector, not Stackdriver, but have been having good success with them. The problem is that the dataflow worker metrics implementation is totally dif

Re: Accumulating mode implies that panes are processed in order?

2019-06-26 Thread Steve Niemitz
There was a thread about this a few months ago as well: https://lists.apache.org/thread.html/20d11046d26174969ef44a781e409a1cb9f7c736e605fa40fdf98397@%3Cuser.beam.apache.org%3E On Wed, Jun 26, 2019 at 4:02 PM Robert Bradshaw wrote: > There is no promise that panes will arrive in order (especial

Re: Java DirectRunner BagState Problem

2019-04-10 Thread Steve Niemitz
This sounds a lot like what I had reported in https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-6813 (sorry for the mobile link). On Wed, Apr 10, 2019 at 6:59 PM Anton Kedin wrote: > Hi dev@, > > I am debugging a flaky test and observing an interesting problem with a > BagState. T

  1   2   >