Re: Hazelcast Jet Runner

2019-05-28 Thread Kenneth Knowles
On Mon, May 27, 2019 at 4:05 PM Reza Rokni wrote: > "Many APIs that have been in place for years and are used by most Beam > users are still marked Experimental." > > Should there be a formal process in place to start 'graduating' features > out of @Experimental? Perhaps even target an upcoming

Re: Hazelcast Jet Runner

2019-05-28 Thread Kenneth Knowles
On Mon, May 27, 2019 at 3:44 PM Reuven Lax wrote: > We generally use Experimental for two different things, which leads to > confusion. > 1. Features that work stably, but where we think we might still make > some changes to the API. > 2. New features that we think might not yet be stable. >

Timer support in Flink

2019-05-28 Thread Reza Rokni
Hi Flink experts, I am getting ready to push a PR around a utility class for timeseries join left.timestamp match to closest right.timestamp where right.timestamp <= left.timestamp. It makes very heavy use of Event.Time timers and has to do some manual DoFn cache work to get around some O(heavy)
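
For readers unfamiliar with the join described in this message, the matching rule (each left element is paired with the closest right element whose timestamp is at or before its own) can be sketched in plain Python. This is only an in-memory illustration of the rule, not the utility class from the PR: the function name `asof_join` and the `(timestamp, value)` tuple layout are assumptions made for the example, and it uses neither Beam timers nor state.

```python
from bisect import bisect_right


def asof_join(left, right):
    """Pair each left element with the latest right element at or before it.

    Both inputs are lists of (timestamp, value) tuples (a hypothetical layout).
    Returns a list of (left_timestamp, left_value, right_match), where
    right_match is a (timestamp, value) tuple or None if nothing qualifies.
    """
    right_sorted = sorted(right, key=lambda item: item[0])
    right_timestamps = [ts for ts, _ in right_sorted]

    joined = []
    for ts, value in left:
        # Index of the first right timestamp strictly greater than ts.
        idx = bisect_right(right_timestamps, ts)
        match = right_sorted[idx - 1] if idx > 0 else None
        joined.append((ts, value, match))
    return joined


result = asof_join(
    [(5, "a"), (1, "b"), (10, "c")],
    [(2, "x"), (5, "y"), (7, "z")],
)
```

A left element with no earlier right element gets `None`. A streaming version would also have to decide how long to wait for late right-side data, which this sketch does not address.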

Re: Proposal: Portability SDKHarness Docker Image Release with Beam Version Release.

2019-05-28 Thread Ahmet Altay
Could we first figure out the process (where to push, how to push, permissions needed, how to validate etc.) as part of the snapshots and update the release guide based on that? On Tue, May 28, 2019 at 2:43 AM Robert Bradshaw wrote: > In the future (read, next release) the SDK will likely have r

Re: How do I debug failing runners:google-cloud-dataflow-java:examples:verifyFnApiWorker task in presubmit

2019-05-28 Thread Alex Amato
PR link https://github.com/apache/beam/pull/8416 On Tue, May 28, 2019 at 4:25 PM Alex Amato wrote: > I've had a lingering PR for about a month now. I'm trying to get this > passing presubmits and submitted, but I don't have enough output from the > failing task to debug this. > > I think it's

How do I debug failing runners:google-cloud-dataflow-java:examples:verifyFnApiWorker task in presubmit

2019-05-28 Thread Alex Amato
I've had a lingering PR for about a month now. I'm trying to get this passing presubmits and submitted, but I don't have enough output from the failing task to debug this. I think it's from a wordcount timeout, but I don't know how to get more info. I don't think it's a dataflow job with any lin

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-28 Thread Ankur Goenka
Open cherry pick PRs for spark runner https://github.com/apache/beam/pull/8705 https://github.com/apache/beam/pull/8706 On Tue, May 28, 2019 at 3:42 PM Valentyn Tymofieiev wrote: > Yes, looking into that. > > On Tue, May 28, 2019 at 3:37 PM Ankur Goenka wrote: > >> Valentyn, Can you please send

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-28 Thread Valentyn Tymofieiev
Yes, looking into that. On Tue, May 28, 2019 at 3:37 PM Ankur Goenka wrote: > Valentyn, Can you please send the cherry pick PR for > https://issues.apache.org/jira/browse/BEAM-7439 > > On Tue, May 28, 2019 at 3:04 PM Ankur Goenka wrote: > >> Sure, I will cherry pick those PRs. >> >> On Tue, May

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-28 Thread Ankur Goenka
Hi All, In the meantime, please validate RC1 to catch any other issues. Thanks, Ankur On Tue, May 28, 2019 at 3:37 PM Ankur Goenka wrote: > Valentyn, Can you please send the cherry pick PR for > https://issues.apache.org/jira/browse/BEAM-7439 > > On Tue, May 28, 2019 at 3:04 PM Ankur Goenka wr

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-28 Thread Ankur Goenka
Valentyn, Can you please send the cherry pick PR for https://issues.apache.org/jira/browse/BEAM-7439 On Tue, May 28, 2019 at 3:04 PM Ankur Goenka wrote: > Sure, I will cherry pick those PRs. > > On Tue, May 28, 2019 at 2:19 PM Kyle Weaver wrote: > >> Hi Ankur, >> >> It's not a blocker, but I'd

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-28 Thread Ankur Goenka
Sure, I will cherry pick those PRs. On Tue, May 28, 2019 at 2:19 PM Kyle Weaver wrote: > Hi Ankur, > > It's not a blocker, but I'd like to see > https://github.com/apache/beam/pull/8558 and > https://github.com/apache/beam/pull/8569 be included so TFX examples can > be run without errors on the

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-28 Thread Kyle Weaver
Hi Ankur, It's not a blocker, but I'd like to see https://github.com/apache/beam/pull/8558 and https://github.com/apache/beam/pull/8569 be included so TFX examples can be run without errors on the 2.13.0 Spark runner ( https://github.com/tensorflow/tfx/pull/84). Kyle Weaver | Software Engineer |

Re: [DISCUSS] Autoformat python code with Black

2019-05-28 Thread Ahmet Altay
I am in the same boat with Robert, I am in favor of autoformatters but I am not familiar with this one. My concerns are: - The product is clearly marked as beta with a big warning. - It looks like mostly a single person project. For the same reason I also strongly prefer not using a fork for a spec

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-28 Thread Ankur Goenka
Thanks for the validation. I have marked fixed version of https://issues.apache.org/jira/browse/BEAM-7406 https://issues.apache.org/jira/browse/BEAM-6380 to be 2.13.0 and will cherry pick the associated commits to the jira. On Tue, May 28, 2019 at 11:19 AM Lukasz Cwik wrote: > I would also sug

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-28 Thread Lukasz Cwik
I would also suggest to get https://github.com/apache/beam/pull/8668 in to 2.13.0 since it fixes a logging setup issue on Dataflow (BEAM-7406). On Tue, May 28, 2019 at 10:22 AM Chamikara Jayalath wrote: > I would also like to get https://github.com/apache/beam/pull/8661 in to > 2.13.0 that fixes

Re: Measuring element sizes in benchmarks

2019-05-28 Thread Łukasz Gajowy
Alexey, sorry for the confusion then. Let me explain this better once more: 1. IO tests: In IO tests we do not use the Synthetic Sources that generate the records. We use a GenerateSequence class that generates a sequence of long values and then map it to some records to finally write that to a

Re: [DISCUSS] Portability representation of schemas

2019-05-28 Thread Lukasz Cwik
I like the concept of expressing type coercion as a wrapper coder which says that this language treats this type as Foo. This seems to be useful in general for cross language pipelines since it is much more likely that two languages will understand an encoding but may want to express the type withi

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-28 Thread Chamikara Jayalath
I would also like to get https://github.com/apache/beam/pull/8661 in to 2.13.0 that fixes https://issues.apache.org/jira/browse/BEAM-6380. It's not a new issue but has affected a number of users. - Cham On Tue, May 28, 2019 at 9:31 AM Valentyn Tymofieiev wrote: > Thanks, Juta Staes, for reporti

Re: [DISCUSS] Portability representation of schemas

2019-05-28 Thread Brian Hulette
On Sun, May 26, 2019 at 1:25 PM Reuven Lax wrote: > > > On Fri, May 24, 2019 at 11:42 AM Brian Hulette > wrote: > >> *tl;dr:* SchemaCoder represents a logical type with a base type of Row >> and we should think about that. >> >> I'm a little concerned that the current proposals for a portable >>

Re: Measuring element sizes in benchmarks

2019-05-28 Thread Robert Burke
The Go SDK doesn't yet have these counters implemented or published (sampling elements & counting between DoFns, etc). On Tue, May 28, 2019, 9:08 AM Alexey Romanenko wrote: > On 28 May 2019, at 17:31, Łukasz Gajowy wrote: > > > I'm not quite following what these sizes are needed for--aren't the

Re: Definition of Unified model

2019-05-28 Thread Reuven Lax
A slightly larger concern: it also will force users to create stateful DoFns everywhere to generate these sequence numbers. If I have a ParDo that is not a simple 1:1 transform (i.e. not MapElements), then the ParDo will need to generate its own sequence numbers for ordering, and the only safe way

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-28 Thread Valentyn Tymofieiev
Thanks, Juta Staes, for reporting this issue. On Tue, May 28, 2019, 9:19 AM Valentyn Tymofieiev wrote: > -1. > I would like us to fix > https://issues.apache.org/jira/browse/BEAM-7439 for 2.13.0. It is a > regression that happened in 2.12.0, but was not caught by existing tests. > > Thanks, > Va

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-28 Thread Valentyn Tymofieiev
-1. I would like us to fix https://issues.apache.org/jira/browse/BEAM-7439 for 2.13.0. It is a regression that happened in 2.12.0, but was not caught by existing tests. Thanks, Valentyn On Wed, May 22, 2019, 4:30 PM Ankur Goenka wrote: > Hi everyone, > > Please review and vote on the release ca

Re: Measuring element sizes in benchmarks

2019-05-28 Thread Alexey Romanenko
On 28 May 2019, at 17:31, Łukasz Gajowy wrote: > > I'm not quite following what these sizes are needed for--aren't the > benchmarks already tuned to be specific, known sizes? > > Maybe I wasn't clear enough. Such metric is useful mostly in IO tests - > different IOs generate records of differen

Re: Measuring element sizes in benchmarks

2019-05-28 Thread Łukasz Gajowy
I'm not quite following what these sizes are needed for--aren't the benchmarks already tuned to be specific, known sizes? Maybe I wasn't clear enough. Such metric is useful mostly in IO tests - different IOs generate records of different size. It would be ideal for us to have a universal way to ge

Re: Measuring element sizes in benchmarks

2019-05-28 Thread Robert Bradshaw
I'm not quite following what these sizes are needed for--aren't the benchmarks already tuned to be specific, known sizes? I agree that this can be expensive; especially for benchmarking purposes a 5x overhead means you're benchmarking the sizing code, not the pipeline itself. Beam computes estimat

Re: Definition of Unified model

2019-05-28 Thread Jan Lukavský
Hi Reuven, > It also gets awkward with Flatten - the sequence number is no longer enough, you must also encode which side of the flatten each element came from. That is a generic need. Even if you read data from Kafka, the offsets are comparable only inside single partition. So, for Kafka to
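
The point about Flatten and Kafka partitions can be shown with a small plain-Python sketch: once elements from several sources are merged, a bare sequence number is ambiguous, so each element carries a `(source_id, sequence_number)` key and ordering is only recovered within one source. The function names and the tuple layout are hypothetical, chosen for the example. They are not Beam APIs.

```python
from collections import defaultdict


def flatten_tagged(**sources):
    """Merge several sources, tagging each element with its origin.

    Each keyword argument maps a source id to a list of (sequence, value)
    tuples. The result is a list of ((source_id, sequence), value) pairs.
    """
    merged = []
    for source_id, elements in sources.items():
        for seq, value in elements:
            merged.append(((source_id, seq), value))
    return merged


def per_source_order(merged):
    """Recover ordered values per source; sequences from different
    sources are never compared with each other."""
    by_source = defaultdict(list)
    for (source_id, seq), value in merged:
        by_source[source_id].append((seq, value))
    return {
        source_id: [value for _, value in sorted(items, key=lambda item: item[0])]
        for source_id, items in by_source.items()
    }


merged = flatten_tagged(a=[(1, "a1"), (0, "a0")], b=[(0, "b0")])
ordered = per_source_order(merged)
```

Here both sources contain an element with sequence number 0, which is why the source id has to be part of the key.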

Re: Definition of Unified model

2019-05-28 Thread Reuven Lax
Sequence metadata does have the disadvantage that users can no longer use the types coming from the source. You must create a new type that contains a sequence number (unless Beam provides this). It also gets awkward with Flatten - the sequence number is no longer enough, you must also encode which

Re: [DISCUSS] Autoformat python code with Black

2019-05-28 Thread Katarzyna Kucharczyk
This sounds really good. A lot of Jenkins jobs failures are caused by lint problems. I think it would be great to have something similar to Spotless in Java SDK (I heard there is problem with configuring Black with IntelliJ). On Mon, May 27, 2019 at 10:52 PM Robert Bradshaw wrote: > I'm generall

Measuring element sizes in benchmarks

2019-05-28 Thread Łukasz Gajowy
Hi all, part of our work while creating benchmarks for Beam is to collect the total data size (bytes) that was put into the testing pipeline. We need that in load tests of core Beam operations (to see how big the load really was) and IO tests (to calculate throughput). The "not so good" way we're do
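
As a rough illustration of the cost trade-off discussed in this thread, the sketch below estimates the total encoded size by encoding only every N-th element and scaling the average up to the full count. It is a minimal plain-Python sketch under stated assumptions: `estimate_total_bytes`, `encode`, and `sample_every` are hypothetical names, and this is not how Beam or its load tests compute sizes.

```python
def estimate_total_bytes(elements, encode, sample_every=10):
    """Estimate total encoded bytes by encoding one element in `sample_every`.

    `encode` is any callable returning a bytes-like object for an element.
    Returns 0 for an empty input.
    """
    count = 0
    sampled = 0
    sampled_bytes = 0
    for element in elements:
        if count % sample_every == 0:
            sampled += 1
            sampled_bytes += len(encode(element))
        count += 1
    if sampled == 0:
        return 0
    # Scale the average sampled size up to the full element count.
    return round(sampled_bytes / sampled * count)


records = ["x" * 8] * 100
estimate = estimate_total_bytes(records, lambda s: s.encode("utf-8"))
```

With uniform record sizes the estimate is exact. With skewed sizes it is only an approximation, and the sampling rate trades accuracy for overhead.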

Re: Definition of Unified model

2019-05-28 Thread Jan Lukavský
As I understood it, Kenn was supporting the idea that sequence metadata is preferable to FIFO. I was trying to point out that it should even provide the same functionality as FIFO, plus one more important property: reproducibility and the ability to be persisted and reused the same way in batch and st

Re: DISCUSS: Sorted MapState API

2019-05-28 Thread Robert Bradshaw
On Fri, May 24, 2019 at 6:57 PM Kenneth Knowles wrote: > > On Fri, May 24, 2019 at 9:51 AM Kenneth Knowles wrote: >> >> On Fri, May 24, 2019 at 8:14 AM Reuven Lax wrote: >>> >>> Some great comments! >>> >>> Aljoscha: absolutely this would have to be implemented by runners to be >>> efficient. W

Re: Proposal: Portability SDKHarness Docker Image Release with Beam Version Release.

2019-05-28 Thread Robert Bradshaw
In the future (read, next release) the SDK will likely have reference to the containers, so this will have to be part of the release. But I agree for 2.13 it should be more about figuring out the process and not necessarily holding back. On Mon, May 27, 2019 at 7:42 PM Ankur Goenka wrote: > > +1

Re: Definition of Unified model

2019-05-28 Thread Robert Bradshaw
Huge +1 to all Kenn said. Jan, batch sources can have orderings too, just like Kafka. I think it's reasonable (for both batch and streaming) that if a source has an ordering that is an important part of the data, it should preserve this ordering into the data itself (e.g. as sequence numbers, offs