This sounds about right to me. When I was working on getting Flink running,
this was one of the biggest pain points: the upload/download process is
painfully slow. We had implemented a jar cache in Flink to improve this
(and stopped using fat jars)
https://github.com/twitter-forks/flink/commit/01f
Thanks everyone!
On Thu, Jul 21, 2022 at 2:23 AM Moritz Mack wrote:
> Congrats, Steven!
>
>
>
> On 21.07.22, 05:25, "Evan Galpin" wrote:
>
>
>
> Congrats! Well deserved!
>
> On Wed, Jul 20, 2022 at 15:17 Chamikara Jayalath via dev wrote:
>
>> Congrats, Steve!
>>
>> On Wed, Jul 20, 2022, 9:16 AM Aus
I had brought up a weird issue I was having with AutoValue a while ago that
actually looks very similar to this:
https://lists.apache.org/thread/0sbkykop2gsw71jpf3ln6forbnwq3j4o
I never got to the bottom of it, but `--rerun-tasks` always fixes it for me.
On Tue, Jun 14, 2022 at 5:11 PM Danny McC
> get this in faster we should definitely move forward with the more-targeted
> PR.
>
> On Mon, Apr 4, 2022 at 6:59 AM Steve Niemitz wrote:
>
>> Oh, I had forgotten about that thread; good point, it is very related to
>> this. I agree we should fix this, to force the conv
ataflow's timer implementation, where resetting a timer
with a new output timestamp doesn't clear the existing one, resulting in
multiple timers being set and firing.
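The reset semantics being described can be illustrated with a toy model (this is illustrative plain Java, not Beam or Dataflow code; the class, method names, and timer-ID scheme are all hypothetical): setting a timer under the same timer ID should replace any pending firing, whereas the behavior described above acts as if each set added a new one.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of per-key timer semantics (illustrative, not Beam/Dataflow
// code): setting a timer under the same timer ID should replace any pending
// firing. The bug described above behaves as if each set() added a second
// pending entry instead, so the timer fires multiple times.
public class TimerRegistry {
    // timerId -> pending firing timestamp (millis)
    private final Map<String, Long> pending = new HashMap<>();

    // Expected semantics: overwrite the previous firing for this ID.
    public void set(String timerId, long fireAtMillis) {
        pending.put(timerId, fireAtMillis);
    }

    public int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        TimerRegistry timers = new TimerRegistry();
        timers.set("cleanup", 1_000L);
        timers.set("cleanup", 2_000L); // reset with a new output timestamp
        // Exactly one firing should remain pending; the buggy behavior
        // described above would leave two.
        System.out.println(timers.pendingCount());
    }
}
```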
On Mon, Apr 4, 2022 at 4:47 AM Jan Lukavský wrote:
> On 4/2/22 16:41, Steve Niemitz wrote:
>
> I've du
.
[1]
https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/SimpleDoFnRunner.java#L212
On Sat, Apr 2, 2022 at 10:41 AM Steve Niemitz wrote:
> I've dug into this some more and have a couple observations/questions:
> - I added some lo
the _output_ watermark is
held in the stateful DoFn.
Is there a reason the input watermark is used here, rather than the actual
output timestamp of the timer (or the output watermark)?
On Fri, Apr 1, 2022 at 8:12 PM Steve Niemitz wrote:
> While I do agree the symptoms are similar, I don
tween elements of that bundle causing some timestamps to be “invalid”.
>
> [1]
> https://lists.apache.org/thread/10db7l9bhnhmo484myps723sfxtjwwmv
> [2]
> https://lists.apache.org/thread/mtwtno2o88lx3zl12jlz7o5w1lcgm2db
>
> - Evan
>
> On Fri, Apr 1, 2022 at 18:35 Steve Niem
> 1: https://issues.apache.org/jira/browse/BEAM-12931
>
>
> On Fri, Apr 1, 2022 at 2:17 PM Steve Niemitz wrote:
>
>> I'm unclear how the timer would ever even have been set to output at that
>> timestamp though. The output timestamp falls into the next win
[1]
https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/SimpleDoFnRunner.java#L1284
On Fri, Apr 1, 2022 at 5:11 PM Reuven Lax wrote:
> There is a long-standing bug with processing timestamps.
>
> On Fri, Apr 1, 2022 at 2:01 PM Steve Niemitz
We have a job that uses processing time timers, and just upgraded from 2.33
to 2.37. Sporadically we've started seeing jobs fail with this error:
java.lang.IllegalArgumentException: Cannot output with timestamp
2022-04-01T19:19:59.999Z. Output timestamps must be no earlier than the
output timesta
rue, but next() == null.
> >>
> >> - Claire
> >>
> >> On Tue, Mar 22, 2022 at 6:13 PM Robert Bradshaw
> wrote:
> >>>
> >>> BEAM-13541 is as far as I'm aware the only major change that has
> >>> happened in this code rece
if I watch the file it looks correct for a minute, then changes
after sdks-java-core builds.
On Thu, Mar 24, 2022 at 8:02 PM Steve Niemitz wrote:
> ./gradlew --rerun-tasks fixed it. I had tried a clean many times (and
> deleted the generated file even), who knows what was actually cached
./gradlew --rerun-tasks fixed it. I had tried a clean many times (and
deleted the generated file even), who knows what was actually cached that
was wrong...
On Thu, Mar 24, 2022 at 7:57 PM Steve Niemitz wrote:
> 2.37 is on 1.8.2, I also tried 1.9.0 (which is what's on master) and ra
every time I build I end up with the wrong generated builder; I
had the correct one at some point, but now it's back to always being wrong.
On Thu, Mar 24, 2022 at 7:03 PM Reuven Lax wrote:
> Did we pick up a new version of AutoValue?
>
> On Thu, Mar 24, 2022 at 3:01 PM Steve Niemi
Seemingly randomly, when I build Beam (I'm building 2.37 right now), the
AutoValue builder for GenerateSequence seems to ignore the @Nullable
attribute on many of the fields, resulting in a builder that requires all
fields to be set.
This breaks building e.g. GenerateSequence.from(...).to(..
> loads CoGbkResult iterables - in one of our internal pipelines that
> started failing, I added some logging to CoGbkResult and found that the
> TagIterable’s iterator would return hasNext() == true, but next() == null.
> >>>>
> >>>> - Claire
> >>
or 2.35.0...
> It does seem to depend on the number of elements in the group.
>
>
> On Tue, 22 Mar 2022 at 22:27, Steve Niemitz wrote:
>
>> Your email was actually what made me notice this! :D
>>
>> I haven't been able to reproduce the NPE you found (also on 2.37
Your email was actually what made me notice this! :D
I haven't been able to reproduce the NPE you found (also on 2.37) but that
certainly doesn't mean it's not a bug.
On Tue, Mar 22, 2022 at 5:23 PM Niel Markwick wrote:
> I have also seen this with Java beam 2.36.0 and 2.37.0, again with larg
low/worker/GroupingShuffleReader.java#L452
On Tue, Mar 22, 2022 at 4:18 PM Robert Bradshaw wrote:
> If we're not using ElementByteSizeObservable for our CoGBK outputs,
> that's the right fix to make.
>
> On Tue, Mar 22, 2022 at 12:20 PM Steve Niemitz
> wrote:
> >
> > A
com/apache/beam/blob/9a36490de8129359106baf37e1d0b071e19e303a/sdks/java/core/src/main/java/org/apache/beam/sdk/util/common/ElementByteSizeObservableIterable.java#L54
On Tue, Mar 22, 2022 at 3:07 PM Steve Niemitz wrote:
> Oh that's interesting I didn't know it even had that optim
erver.java#L29
> 2:
> https://github.com/apache/beam/blob/9a36490de8129359106baf37e1d0b071e19e303a/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/IterableLikeCoder.java#L192
>
>
>
> On Wed, Mar 16, 2022 at 4:09 PM Steve Niemitz wrote:
>
>> The thread a couple
The thread a couple days ago about CoGroupByKey being possibly broken in
beam 2.36.0 [1] had an interesting thing in it that had me thinking. The
CoGbkResultCoder doesn't override getEncodedElementByteSize (nor does it
know how to actually compute the size in any efficient way), so sampling a
CoGb
> with Dataflow prime for you and if not you can control what the Python
> container is doing.
>
> On Mon, Dec 13, 2021 at 12:28 PM Steve Niemitz
> wrote:
>
>> I spent a bunch of time on this over the last week and arrived at
>> something that works. I had to try a f
Thanks for the quick responses! Mine are inline as well.
On Thu, Jan 13, 2022 at 9:01 PM Brian Hulette wrote:
> I added some responses inline. Also adding dev@ since this is getting
> into SQL internals.
>
> On Thu, Jan 13, 2022 at 10:29 AM Steve Niemitz
> wrote:
>
>>
ent time timer, right?
Correct, the main thing I'm trying to solve is having to recalculate an
output timestamp using the same logic that the timer itself is using to set
its firing timestamp.
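The duplicated arithmetic being described can be sketched in plain Java. This is a hypothetical helper mirroring what `timer.offset(X).align(Y).setRelative()` computes, not the Beam API itself, and it assumes align rounds the target up to the next multiple of the period:

```java
// Hypothetical mirror of timer.offset(X).align(Y).setRelative() firing-time
// arithmetic (not Beam code): shift "now" by the offset, then round up to
// the next multiple of the alignment period. Assumption: align rounds up,
// never down.
public class TimerMath {
    static long firingTime(long nowMillis, long offsetMillis, long alignMillis) {
        long target = nowMillis + offsetMillis;
        long remainder = target % alignMillis;
        return remainder == 0 ? target : target + (alignMillis - remainder);
    }

    public static void main(String[] args) {
        // now=1500ms, offset=1000ms, aligned to 1s boundaries
        System.out.println(firingTime(1_500L, 1_000L, 1_000L));
    }
}
```

This is exactly the computation a user has to repeat today to set a matching output timestamp, which is the redundancy the email is pointing at.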
On Tue, Dec 14, 2021 at 4:10 PM Kenneth Knowles wrote:
>
>
> On Tue, Dec 7, 2021 at 7:27 AM
't want it to be the default.
On Tue, Dec 7, 2021 at 12:49 PM Steve Niemitz wrote:
> I noticed that the python dataflow runner appends some "uniqueness" (the
> timestamp) [1] to the staging directory when staging artifacts for a
> dataflow job. This is very suboptimal be
I noticed that the python dataflow runner appends some "uniqueness" (the
timestamp) [1] to the staging directory when staging artifacts for a
dataflow job. This is very suboptimal because it makes caching artifacts
between job runs useless.
The jvm runner doesn't do this, is there a good reason t
If I have a processing time timer, is there any way to automatically set
the output timestamp to the timer firing timestamp (similar to how
event-time timers work)?
A common use case would be to do something like:
timer.offset(X).align(Y).setRelative()
but have the output timestamp be the firing
e to pass through columns without conversion. The former API
> was 'Object getValue(...)' which would return either the base type or
> logical type. That approach was fragile (relied on Object) and broken by
> other performance work so it was removed:
> https://github.com/apache/b
p track of the original encoded bytes in the Row object. Unlike Avro we
>>> have a lot of control over creating new Row objects, since we discourage
>>> users from manipulating Rows directly. So we could potentially make sure
>>> some sidecar tracking encoded bytes alwa
ncoded String?
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/RowWithGetters.java
>
> On Tue, Nov 30, 2021 at 8:17 AM Reuven Lax wrote:
>
>> I'm intrigued - how do you imagine doing this in RowCoder?
>>
gs. This would be a breaking change
though for existing pipelines so it'd have to be opt-in.
On Tue, Nov 30, 2021 at 11:17 AM Reuven Lax wrote:
> I'm intrigued - how do you imagine doing this in RowCoder?
>
> On Tue, Nov 30, 2021 at 7:49 AM Steve Niemitz
> wrote:
>
>
A common use case we're running into with Beam Rows is something like:
- Read data from source X
- Convert to Row
- Encode row (generally for xlang)
In cases like this, I've noticed that we spend a significant (30%+) amount
of time just decoding and re-encoding strings.
Avro has a nice solution t
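The idea floated earlier in this thread, keeping the original encoded bytes alongside the decoded value, can be sketched as follows. This is an illustrative stand-in (the class and method names are made up, not Beam's RowCoder): re-encoding an untouched string becomes a cache hit instead of a second UTF-8 encode.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch (not Beam code) of carrying the original encoded
// bytes alongside the decoded value: re-encoding an unmodified string just
// hands back the cached wire bytes instead of re-running UTF-8 encoding.
public class CachedString {
    private final String value;
    private byte[] encoded; // cached wire bytes, reused on re-encode

    public CachedString(String value) {
        this.value = value;
    }

    // Decoder path: keep the bytes we already paid to read.
    public static CachedString fromEncoded(byte[] bytes) {
        CachedString s = new CachedString(new String(bytes, StandardCharsets.UTF_8));
        s.encoded = bytes;
        return s;
    }

    public byte[] encode() {
        if (encoded == null) {
            encoded = value.getBytes(StandardCharsets.UTF_8);
        }
        return encoded;
    }

    public String get() {
        return value;
    }

    public static void main(String[] args) {
        CachedString s = CachedString.fromEncoded("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(s.get());
    }
}
```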
ifacts that you need.
> This will help speed up how fast the workers start in Dataflow in addition
> to not having to workaround the fact that only FILE is supported.
>
> On Tue, Nov 2, 2021 at 8:03 AM Steve Niemitz wrote:
>
>> We're working on running an "expansion s
We're working on running an "expansion service as a service" for xlang
transforms. One of the things we'd really like is to serve the actual
required artifacts to the client (submitting the pipeline) from our
blobstore rather than streaming it through the artifact retrieval API
(GetArtifact).
I h
ata
> s vetaksund...@google.com
>
>
>
> On Tue, Oct 19, 2021 at 2:13 PM Steve Niemitz wrote:
>
>> Has anyone gotten the beam python sdk tests running in an IDE (I'm
>> specifically trying with IntelliJ/PyCharm)? I didn't see anything on the
>> wiki about it and haven't gotten it working yet.
>>
>
Has anyone gotten the beam python sdk tests running in an IDE (I'm
specifically trying with IntelliJ/PyCharm)? I didn't see anything on the
wiki about it and haven't gotten it working yet.
ython support for additional types, see BEAM-7996 [1]
>>> > - Regarding cross-language testing, see tests for beam:coder:row:v1 in
>>> standard_coders.yaml [2]
>>> >
>>> > Brian
>>> >
>>> > [1] https://issues.apache.org/jira/browse
t;> [1] https://issues.apache.org/jira/browse/BEAM-7996
>> [2]
>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>
>>
>> On Tue, Oct 12, 2021 at 1:55 PM Steve Niemitz
>> wrote:
>>
rpreted as meaning that an
>>>> empty byte array indicates no nils. This is what Go implements in the
>>>> coder.WriteRowHeader and coder.ReadRowHeader functions.
>>>>
>>>> But that strikes me as a special case for fully populated rows, not a
x wrote:
>
>> Do you think that BitSetCoder is incorrect?
>>
>> On Tue, Oct 12, 2021 at 1:27 PM Steve Niemitz
>> wrote:
>>
>>> Yeah I believe they're all bugs/missing features in the python
>>> implementation. The nullable BitSet one is arguab
e
same bug there, in which case that's two languages doing it "wrong" and one
doing it "right". :P
On Tue, Oct 12, 2021 at 4:20 PM Reuven Lax wrote:
> These are bugs in Python, correct?
>
> On Tue, Oct 12, 2021 at 1:18 PM Steve Niemitz wrote:
>
>> It seems
It seems like there's a good amount of incompatibility between Java and
Python with respect to Beam Rows. For example, the following are unsupported
in Python (that I've noticed so far):
- BYTE
- INT16
- OneOf
Additionally, it seems like nullable fields don't really work correctly;
the Java BitSetCoder won't e
g-a-pipeline#workers>,
> so I assume I can always ignore the warning there and mess around with the
> instance group settings in a totally unsupported way if I have to.
>
> Thanks for the help!
>
>
> On Thu, Oct 7, 2021 at 2:15 PM Steve Niemitz wrote:
>
>> hm
>> constructs before the SDF to add more http listeners with different
>> configurations. Also, once there is a runner that supports dynamic
>> splitting you would be able to scale up and down based upon incoming
>> requests and might be able to reduce the burden/load on
Unrelated to the actual question, but IIRC Dataflow workers have iptables
rules that drop all inbound traffic (other than a few exceptions).
In any case, do you actually need the server part to be "inside" the
pipeline? Could you just use a JvmInitializer to launch the http server and
do the pubsu
o the same when wrapping protobufs
> and AutoValue, though I'm not entirely sure about Avro.
>
>
> On Thu, Aug 5, 2021 at 6:42 AM Steve Niemitz wrote:
>
>> We currently have a lot of jobs that write to BQ using avro (using
>> withAvroFormatFunction), and would like
We currently have a lot of jobs that write to BQ using avro (using
withAvroFormatFunction), and would like to start experimenting with the
streaming write API.
I see two options for how to do this:
- Implement support for avro -> protobuf like TableRow and beam Row do
- Add something like withProt
that is missing from the cache.
[1]
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingSideInputFetcher.java#L238
On Thu, Jul 22, 2021 at 2:41 PM Kenneth Knowles wrote:
> On Thu, Jul 22, 2021 a
I think the same problem was recently fixed in Python [1]. It'd be great
to fix this in Java too; we hit this a bunch, but I've never had enough time
to fix it.
[1] https://github.com/apache/beam/pull/14113
On Tue, Jul 27, 2021 at 4:19 PM Chamikara Jayalath
wrote:
> I don't have a lot of contex
nging the cache to cache tombstones for cleared cells is
another possibility as well.
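The tombstone idea can be sketched with an ordinary map (this is an illustrative toy, not the Dataflow worker's actual side-input cache; the class and field names are invented): a cleared cell is stored as `Optional.empty()`, so a later read is a cache hit meaning "known empty" rather than a miss that forces a backend fetch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy sketch of caching tombstones for cleared cells (not the real worker
// cache): clear() stores Optional.empty() instead of removing the entry, so
// subsequent reads of a cleared cell are cache hits, not backend fetches.
public class TombstoneCache {
    private final Map<String, Optional<String>> cache = new HashMap<>();
    int backendFetches = 0;

    public Optional<String> get(String key) {
        Optional<String> cached = cache.get(key);
        if (cached != null) {
            return cached; // hit, including "known empty" tombstones
        }
        backendFetches++;
        Optional<String> fetched = Optional.empty(); // stand-in backend lookup
        cache.put(key, fetched);
        return fetched;
    }

    public void clear(String key) {
        cache.put(key, Optional.empty()); // tombstone, not removal
    }

    public static void main(String[] args) {
        TombstoneCache cache = new TombstoneCache();
        cache.clear("a");
        cache.get("a"); // served from the tombstone, no fetch
        System.out.println(cache.backendFetches);
    }
}
```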
On Thu, Jul 22, 2021 at 2:33 AM Reuven Lax wrote:
> So you're saying there's a bug in the caching logic that prevents the
> side-input cache from working?
>
> On Wed, Jul 21, 2021 at 7:0
I had opened a jira years ago [1] about this, but would like to actually
fix it for real now, given that our users have started using streaming more
and more.
There's more detail in the jira, but basically side inputs in streaming
pipelines on dataflow lead to pretty bad performance because they r
,
> Evan
>
> On Wed, May 12, 2021 at 17:12 Steve Niemitz wrote:
>
>> Yeah, sorry my email was confusing. use_deprecated_reads is broken on
>> the DirectRunner in 2.29.
>>
>> The behavior you describe is exactly the behavior I ran into as well when
>> read
2, 2021 at 1:35 PM Evan Galpin wrote:
>
>> I just tried with v2.29.0 and use_deprecated_read but unfortunately I
>> observed slow behavior again. Is it possible that use_deprecated_read is
>> broken in 2.29.0 as well?
>>
>> Thanks,
>> Evan
>>
>>
case at least) regardless of use_deprecated_read setting.
>
> Thanks,
> Evan
>
>
> On Wed, May 12, 2021 at 2:47 PM Steve Niemitz wrote:
>
>> use_deprecated_read was broken in 2.29 on the direct runner and didn't do
>> anything. [1] I don't think the fix is
use_deprecated_read was broken in 2.29 on the direct runner and didn't do
anything. [1] I don't think the fix is in 2.30 either, but it will be in 2.31.
[1] https://github.com/apache/beam/pull/14469
On Wed, May 12, 2021 at 1:41 PM Evan Galpin wrote:
> I forgot to also mention that in all tests I
As a frequent emailer of dev@, I'll admit that it's often very difficult to
figure out if I should be emailing user@ or dev@, and I typically just choose
dev@ because it seems more likely to get an answer there. Having clearer
guidelines around what is a "dev" topic would be very useful to better
gui
n-internal?
On Mon, Mar 1, 2021 at 2:54 PM Reuven Lax wrote:
> RowWithGetters is used as an internal detail in Beam, so it special cases
> internal Beam-provided logical types. If we want it to work well with user
> logical types, we might need to redesign it a bit.
>
> On Mon, Mar
I'm working on a little library to wrap domain objects into Rows (using
RowWithGetters) and am trying to implement a new LogicalType. I'm
struggling to understand how it's supposed to interact with RowWithGetters
however.
Let's say that my LogicalType has a java type of MyLogicalType.Value, and
b
I ran into issues
resolving org.pentaho:pentaho-aggdesigner-algorithm:5.1.5-jhyde, it looks
like the spring repo hosting it has locked the artifact, I get a 401
unauthorized trying to download it.
Has anyone else run into this? I assume many people have the artifact
cached locally and so haven't
I think this is actually the same problem as I reported w/ the PubsubIO [1],
but in the bounded case. The BoundedSourceAsSDFWrapper closes (and then
re-creates) the underlying source each time it checkpoints, and the default
behavior is to checkpoint very frequently.
[1]
https://lists.apache.org/thr
I'm playing around with xlang portable pipelines in dataflow and noticed
that it tries to pull the java harness (beam_java8_sdk:2.25.0) from
docker.io. This is problematic because our VPC prevents access to external
hosts. I was able to fix the problem by passing in
--sdk_harness_container_image
they shouldn't be
> static, e.g. it might be preferable to start emitting results right away,
> but use larger batches for the steady state if there are performance
> benefits.)
> >>>
> >>> That being said, it sounds like there's something deeper going on
eems like the 1 second limit kicks in before the output
>> reaches 100 elements.
>> >> >
>> >> > I think the original purpose for DirectRunner to use a small limit
>> on issuing checkpoint requests is for exercising SDF better in a small data
>>
s/C9H0YNP3P/p1607057900393900
>> > >>
>> > >> On Thu, Dec 3, 2020 at 1:56 PM Boyuan Zhang
>> wrote:
>> > >>
>> > >>> Hi Steve,
>> > >>>
>> > >>> I think the major performance regression comes from
>> > >>> OutputAn
s.
>
> [1]
> https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/OutputAndTimeBoundedSplittableProcessElementInvoker.java
>
> On Thu, Dec 3, 2020 at 9:40 AM Steve Niemitz wrote:
>
>> I have a pipeline that reads from pubsu
I have a pipeline that reads from pubsub, does some aggregation, and writes
to various places. Previously, in older versions of Beam, when running
this in the DirectRunner, messages would go through the pipeline almost
instantly, making it very easy to debug locally, etc.
However, after upgrading
Alright, thank you. Is BEAM-10507 the jira to watch for any progress on
that?
On Mon, Nov 30, 2020 at 12:55 PM Boyuan Zhang wrote:
> Hi Steve,
>
> Unfortunately I don't think there is a workaround before we have the
> change that Cham mentions.
>
> On Mon, Nov 30, 2020 a
I'm trying to write an xlang transform that uses Reshuffle internally, and
ran into this as well. Is there any workaround to this for now (other than
removing the reshuffle), or do I just need to wait for what Chamikara
mentioned? I noticed the same issue was mentioned in the SnowflakeIO.Read
PR
ommit/7fec038bf9e3861462744ba5522208a4b9d15b85#diff-0435a83a413ec063bf7e682cadcd56776cd18fc878f197cc99a65fc231ef2047
On Fri, Nov 6, 2020 at 6:27 PM Steve Niemitz wrote:
> yeah, I built it via:
> JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64 ./gradlew --no-daemon
> -Ppublishing -PnoSigning publishMavenJavaPublicati
https://issues.apache.org/jira/browse/BEAM-11080)
>
> On Fri, Nov 6, 2020 at 3:13 PM Steve Niemitz wrote:
>
>> I'm trying out 2.25 (built from source, using java 8), and running into
>> this error, both on the direct runner and dataflow:
>>
>
I'm trying out 2.25 (built from source, using java 8), and running into
this error, both on the direct runner and dataflow:
Caused by: java.lang.NoSuchMethodError:
java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
at com.google.protobuf.NioByteString.copyToInternal(NioByteString.java:112)
at co
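This particular `NoSuchMethodError` usually comes from a Java 8/9 bytecode mismatch rather than from Beam itself: `ByteBuffer.position(int)` as a covariant override returning `ByteBuffer` only exists since Java 9, so code compiled on JDK 9+ without `--release 8` references a method that is missing on a Java 8 runtime. A minimal sketch of the standard source-level workaround (the helper name is invented for illustration):

```java
import java.nio.Buffer;
import java.nio.ByteBuffer;

// Casting to Buffer pins the call to Buffer.position(int), which exists on
// both Java 8 and 9+, so the compiled bytecode never references the Java
// 9-only ByteBuffer.position(int) covariant override.
public class BufferCompat {
    static ByteBuffer advance(ByteBuffer buf, int newPosition) {
        ((Buffer) buf).position(newPosition); // resolves against java.nio.Buffer
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        System.out.println(advance(buf, 4).position());
    }
}
```

Building with `--release 8` (rather than just `-source`/`-target`) avoids the problem entirely by compiling against the Java 8 class library.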
On Tue, May 26, 2020 at 12:20 PM Brian Hulette wrote:
>
> On Mon, May 25, 2020 at 1:30 PM Steve Niemitz wrote:
>
>> I tried to validate the release branch on dataflow, but it looks like
>> there's no tag yet for the 2.22 docker containers [1]
>>
>
> I hav
I tried to validate the release branch on dataflow, but it looks like
there's no tag yet for the 2.22 docker containers [1]
I'm also fairly concerned with the artifact staging regressions introduced
in BEAM-9383 [2]
[1]
https://console.cloud.google.com/gcr/images/cloud-dataflow/GLOBAL/v1beta3/bea
https://issues.apache.org/jira/browse/BEAM-10015 was marked as 2.21 but
isn't in the RC1 tag. It's marked as P1, and seems like the implication is
that without the fix, pipelines can produce incorrect data. Is this a
blocker?
On Tue, May 19, 2020 at 4:51 PM Kyle Weaver wrote:
> Hi everyone,
>
those initial splits
> but helps a lot with worker rebalancing since a few workers are usually
> stragglers and will need some help at the end of a pipeline.
>
> On Fri, May 15, 2020 at 1:54 PM Steve Niemitz wrote:
>
>> Thanks for the replies so far. I should have specifically m
do today but give up the composability and any splitting support for
> unbounded SDFs that will come later.
> >
> > Finally, there is a way to get limited support for dynamic splitting of
> bounded and unbounded SDFs for other runners using the composability of
> SDFs and the li
I'm going to be writing a new IO (in Java) for reading files in a custom
format, and want to make it splittable. It seems like I have a choice
between the "legacy" source API and the newer experimental SDF API. Is there
any guidance on which I should use? I can likely tolerate some API churn
as wel
re looking to have the first API stable version using the portability
> framework with 2.21.0 which should mean that tests that run outside of
> Google will be possible.
>
>
> On Thu, Apr 9, 2020 at 7:17 AM Steve Niemitz wrote:
>
>> I was trying to run a @ValidatesRunner test
I was trying to run a @ValidatesRunner test for the streaming dataflow
runner, but I actually can't find any way to run them in streaming. It
looks like all the tests are set up using the Create transform,
which generates a batch pipeline.
Are there actually no @ValidatesRunner tests for the stre
g, no test/validation
> gave a signal that something relevant was broken. Plus that fix didn't
> include a test.
>
> I will hesitate to say such a fix is critical for a release, unless there
> is something to test or validate it.
>
>
> -Rui
>
> On Mon, Apr 6, 20
>
> On Mon, Apr 6, 2020 at 2:38 PM Steve Niemitz wrote:
>
>> I can confirm that the artifact on maven central [1] does not have the
>> change in it either, I disassembled it with javap.
>>
>> [1]
>> https://repository.apache.org/content/repositories/orgapac
I can confirm that the artifact on maven central [1] does not have the
change in it either, I disassembled it with javap.
[1]
https://repository.apache.org/content/repositories/orgapachebeam-1100/org/apache/beam/beam-runners-core-java/2.20.0/beam-runners-core-java-2.20.0.jar
On Mon, Apr 6, 2020 a
to solve your case? If
> not, why?
>
>
> [1]:
> https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-extensions-sql-zetasql/2.17.0
>
>
>
>
> -Rui
>
> On Thu, Mar 26, 2020 at 1:23 PM Steve Niemitz wrote:
>
>> Oh I think I actually remembe
add a set of core SQL interfaces that only depend on Calcite and
> then split our ZetaSQL translator into a piece that only depends on those
> interfaces, Calcite, and ZetaSQL.
>
> Andrew
>
> On Thu, Mar 26, 2020 at 12:41 PM Steve Niemitz
> wrote:
>
>> The ZetaSQL to calc
The ZetaSQL to calcite translation layer that is bundled with beam seems
generally useful in cases other than for beam. In fact, we're using
(essentially a fork of) it internally outside of beam right now (and I've
fixed a bunch of bugs in it).
Has there ever been any thought about splitting into
https://issues.apache.org/jira/browse/BEAM-9557 should also be a blocker
IMO, as it breaks existing pipelines.
On Mon, Mar 23, 2020 at 1:17 PM Maximilian Michels wrote:
> Just mentioning that we have discovered
> https://issues.apache.org/jira/browse/BEAM-9566 and
> https://issues.apache.org/jir
I just opened this bug because I was testing out the 2.20 release and ran
into it.
It seems like it's now an error to set a processing time timer that happens
to fire after the end of a window. In the past this worked fine (and I
think it would just not fire).
This seems like a pretty bad regres
avro-python3 1.9.2 was released on pypi 4 hours ago, and added pycodestyle
as a dependency, probably related?
On Wed, Feb 12, 2020 at 1:03 PM Luke Cwik wrote:
> +dev
>
> There was recently an update to add autoformatting to the Python SDK[1].
>
> I'm seeing this during testing of a PR as well.
I do agree that the direct runner doesn't drop late data arriving at a
stateful DoFn (I just tested as well).
However, I believe this is consistent with other runners. I'm fairly
certain (at least last time I checked) that at least Dataflow will also
only drop late data at GBK operations, and NOT
If you have a pipeline that looks like Input -> GroupByKey -> ParDo, while
it is not guaranteed, in practice the sink will observe the trigger firings
in order (per key), since it'll be fused to the output of the GBK operation
(in all runners I know of).
There have been a couple threads about trig
withSummarizer(combineFn)
> .withPredicate(summary -> ...)
>
> Something like that? The complexity is not much less than just writing a
> stateful DoFn directly, but the boilerplate is much less.
>
> Kenn
>
> On Thu, Nov 7, 2019 at 2:02 PM Steve Niemitz wrote:
>
>> Inter
Interestingly enough, we just had a use case come up that I think could
have been solved by finishing triggers.
Basically, we want to emit a notification when a certain threshold is
reached (in this case, we saw at least N elements for a given key), and
then never notify again within that window.
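The use case being described, count elements per key and notify exactly once when a threshold is reached, can be sketched without Beam's state API. This is a plain-Java illustration of the state-machine logic (the class is hypothetical, and per-window scoping is elided for brevity):

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch (not Beam's state API) of the once-per-window threshold
// notification: count elements per key and emit true exactly once, the
// first time the count reaches the threshold, then stay silent.
public class ThresholdNotifier {
    private final int threshold;
    private final Map<String, Integer> counts = new HashMap<>();
    private final Map<String, Boolean> notified = new HashMap<>();

    public ThresholdNotifier(int threshold) {
        this.threshold = threshold;
    }

    // Returns true exactly once per key, when the threshold is first reached.
    public boolean onElement(String key) {
        int count = counts.merge(key, 1, Integer::sum);
        if (count >= threshold && !notified.getOrDefault(key, false)) {
            notified.put(key, true);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ThresholdNotifier notifier = new ThresholdNotifier(3);
        for (int i = 0; i < 5; i++) {
            System.out.println(notifier.onElement("key"));
        }
    }
}
```

In Beam terms this would live in a stateful DoFn with a count state and a "notified" flag state, cleared at end of window, which matches the "boilerplate but straightforward" framing in the reply above.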
lo Estrada wrote:
> Thanks for offering to work on this! It would be awesome to have it. I can
> say that we don't have that for Python ATM.
>
> On Mon, Sep 16, 2019 at 10:56 AM Steve Niemitz
> wrote:
>
>> Our experience has actually been that avro is more efficient than
en an avro version would be much
> more efficient. I believe Parquet files would be even more efficient if you
> wanted to go that path, but there might be more code to write (as we
> already have some code in the codebase to convert between TableRows and
> Avro).
>
> Reuven
>
>
Has anyone investigated using avro rather than json to load data into
BigQuery using BigQueryIO (+ FILE_LOADS)?
I'd be interested in enhancing it to support this, but I'm curious if
there's any prior work here.
I've been doing some experiments in my own fork of the Dataflow worker
using HdrHistogram [1] to record histograms. I export them to our own
stats collector, not Stackdriver, but have been having good success with
them.
The problem is that the dataflow worker metrics implementation is totally
dif
There was a thread about this a few months ago as well:
https://lists.apache.org/thread.html/20d11046d26174969ef44a781e409a1cb9f7c736e605fa40fdf98397@%3Cuser.beam.apache.org%3E
On Wed, Jun 26, 2019 at 4:02 PM Robert Bradshaw wrote:
> There is no promise that panes will arrive in order (especial
This sounds a lot like what I had reported in
https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-6813
(sorry for the mobile link).
On Wed, Apr 10, 2019 at 6:59 PM Anton Kedin wrote:
> Hi dev@,
>
> I am debugging a flaky test and observing an interesting problem with a
> BagState. T