Re: [DISCUSS] Next steps for update of Avro dependency in Beam

2022-05-13 Thread Etienne Chauchot

Hi,

Thanks Alexey for bringing this topic up.

I'd be in favor of 3

Best

Etienne

On 12/05/2022 at 23:21, Brian Hulette wrote:
Regarding Option (3) "but keep and shade Avro for “core” needs as 
v.1.8.2 (still have an issue with CVEs)"


Do we actually need to keep avro in core for any reason? I thought we 
only had it in core for AvroCoder, schema support, and IOs, which I 
think are all reasonable to separate out into an extension (this would 
be comparable to the protobuf extension). To confirm I just grepped 
for files in core that import avro:


❯ grep -liIrn 'import org\.apache\.avro' sdks/java/core/src/main
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroByteBuddyUtils.java
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ConvertHelpers.java
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/AvroRecordSchema.java
sdks/java/core/src/main/java/org/apache/beam/sdk/io/DynamicAvroDestinations.java
sdks/java/core/src/main/java/org/apache/beam/sdk/io/SerializableAvroCodecFactory.java
sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSink.java
sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java
sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSchemaIOProvider.java
sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java
sdks/java/core/src/main/java/org/apache/beam/sdk/io/ConstantAvroDestination.java
sdks/java/core/src/main/java/org/apache/beam/sdk/coders/AvroGenericCoder.java
sdks/java/core/src/main/java/org/apache/beam/sdk/coders/AvroCoder.java

Brian

On Thu, May 12, 2022 at 2:08 PM Robert Bradshaw wrote:


Keeping avro in our public (core) API and as an internal dependency
seems to be a recurring pain point. I would be all for pulling it out
(basically option 3) and subsequently updating our internal version
(hopefully no backwards compatibility issues here) and letting the
extension live with a variety of versions (insofar as this is
feasible).

On Thu, May 12, 2022 at 10:29 AM Alexey Romanenko wrote:
>
> Hi everyone,
>
> Sorry in advance for a long email.
> TL;DR: Let’s discuss the next steps to update Avro dependency in
Beam.
>
> I’d like to come back to this old and quite sensitive topic here
which is the Apache Avro version update in Beam. Over time, we have
already had several discussions on this (for example [1]) but
without any concrete resolutions in the end, iirc.
>
> As we all know, Beam still depends on the quite old Avro version
1.8.2 and there have been some attempts to bump it to more recent ones.
One of the main reasons to bump the Avro version, imho, is that the
Avro 1.8.2 dependency brings several CVEs [2], while the latest Avro
1.11.0 brings only one [3].
>
> At the same time, this update will introduce some incompatible
changes that Avro has between versions, and this may affect Beam
users and potentially transitive dependencies when using Beam with
other projects that use Avro as well:
> - Avro completely moved to java.time.* instead of
org.joda.time.*. So, we need to adjust date/time conversions
from/to Beam schema accordingly since Beam schemas still use
joda.time (see the sketch after this list). It will require users to
regenerate already generated Java code with avro-compiler (if any),
otherwise it won’t compile;
> - Some minor changes in Avro dependencies and user API;
> - Something else?
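
For illustration, a minimal sketch of the joda-time/java.time bridging
the first point implies (TimeBridge and its methods are hypothetical
helpers, not Beam or Avro API):

// Hypothetical helpers bridging Avro's java.time values and the
// joda-time values that Beam schemas still expect.
import java.time.Instant;

class TimeBridge {
  // java.time -> joda, e.g. before setting a DATETIME field on a Beam Row
  static org.joda.time.Instant toJoda(Instant instant) {
    return new org.joda.time.Instant(instant.toEpochMilli());
  }

  // joda -> java.time, e.g. before writing values back to an Avro record
  static Instant toJavaTime(org.joda.time.Instant instant) {
    return Instant.ofEpochMilli(instant.getMillis());
  }
}
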
>
> I know that here, on the list, we have people from the Avro
community who are much more experienced in this than me - so,
please correct me if I say something wrong or not 100% correct.
>
>
> In Beam, we also performed several attempts to update Avro - for
example, [4], [5], [6] and others.
>
> To make such an update easier in the future, we also discussed
moving the Avro dependency out of core Beam [7] and there was an
attempt to do that [8], but finally this PR was closed with a
resolution that it’s not actually needed and we may just want to
test Beam with different Avro versions [9].
>
> The latest work on this was a PR to support several versions of
Avro in Beam (1.8.x and 1.9.x) [10], which still introduces some
breaking changes for users, iirc.
>
> So, it seems that we are a bit stuck on this topic, though, imho,
we need to decide how to move forward, mostly because of the CVEs in
old Avro versions and future Avro updates in Beam.
>
> The potential options (as I can see them):
>
> 1) Bump the Avro dependency to the latest one (1.11.0) or the
most recent one possible
> - Pros:
> - latest/recent Avro dependency;
> - potentially easy to update in the future;
> - Cons:
> - breaking change for users;
> - potential issues with other projects that use Avro (e.g.
Apache Spark).
>
> 2) Support different Avro

Remove support for Elasticsearch 5 and 6 ?

2022-03-25 Thread Etienne Chauchot

Hi all,

Elastic no longer supports Elasticsearch 5 and 6 [1]. We are in the
middle of removing long-overdue Elasticsearch 2 support, but maybe it
is time to remove support for ES5 and ES6 from Beam as well.


WDYT ?

[1] https://endoflife.date/elasticsearch


Best

Etienne



Re: [ANNOUNCE] New committer: Moritz Mack

2022-03-11 Thread Etienne Chauchot

Congrats Moritz ! Well deserved !

Etienne

On 10/03/2022 at 19:44, Sachin Agarwal wrote:

Congratulations Moritz!

On Thu, Mar 10, 2022 at 10:44 AM Alexey Romanenko wrote:


Hi everyone,

Please join me and the rest of the Beam PMC in welcoming a new
committer: Moritz Mack

Moritz started to contribute to Beam quite recently and has done so
very actively since then. He has contributed a lot to the whole set of
Beam AWS IO connectors for the Java SDK (S3, Kinesis, SNS, SQS, etc.)
[1] and, actually, became one of the main people responsible for
that part of Beam.

In consideration of his contributions, the Beam PMC trusts him
with the responsibilities of a Beam committer [2].

Thank you for your contributions, Moritz!

-Alexey, on behalf of the Apache Beam PMC

[1] https://github.com/apache/beam/pulls?q=is:pr+author:mosche
[2]

https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


[ANNOUNCE] New committer: Evan Galpin

2022-03-10 Thread Etienne Chauchot

Hi all,

Please join me and the rest of the Beam PMC in welcoming 
a new committer: Evan Galpin


Since joining the Beam community, Evan has made many contributions to
IOs, mainly Elasticsearch, but also to SDK transforms. He has also
given support on the mailing list and tested releases.


Considering these highlighted contributions and the rest, the Beam PMC 
trusts Evan with the responsibilities of a Beam committer [1].


Thank you, Evan, for your contributions.

[1] 
https://beam.apache.org/contribute/become-a-committer/#what-are-the-traits-of-an-apache-beam-committer


Etienne, on behalf of the Apache Beam PMC


Re: Intro

2021-10-22 Thread Etienne Chauchot

Welcome onboard Moritz !

Best

Etienne

On 22/10/2021 15:52, Moritz Mack wrote:


Hi all,

I’m very much looking forward to starting to contribute to Beam and
just want to briefly introduce myself.


My name is Moritz (mosche) and I’m working together with Alexey and 
Etienne. Having worked mostly with Spark in the past, I’m excited to 
dive deeper into Beam 😊


Looking forward to working with all of you!

Kind regards from Munich,

Moritz







Spark Structured Streaming runner migrated to Spark 3

2021-08-05 Thread Etienne Chauchot

Hi all,

Just to let you know that the Spark Structured Streaming runner was
migrated to Spark 3.


Enjoy !

Etienne



Re: Spark Structured Streaming Runner Roadmap

2021-08-03 Thread Etienne Chauchot

Hi,

Sorry for the late answer: the streaming mode in the Spark Structured
Streaming runner is stuck because of the Spark Structured Streaming
framework's implementation of watermarks on the Apache Spark side. See


https://echauchot.blogspot.com/2020/11/watermark-architecture-proposal-for.html

best

Etienne

On 20/05/2021 20:37, Yu Zhang wrote:

Hi Beam Community,

Would there be any roadmap for Spark Structured Runner to support streaming and 
Splittable DoFn API? Like the specific timeline or release version.

Thanks,
Yu


Re: [VOTE] Vendored Dependencies Release Byte Buddy 1.11.0

2021-05-20 Thread Etienne Chauchot
+1 (binding) on releasing vendored bytebuddy for testing in 
https://github.com/apache/beam/pull/14824


Etienne

On 19/05/2021 23:43, Kai Jiang wrote:

+1 (non-binding)

On Wed, May 19, 2021 at 12:23 PM Jan Lukavský wrote:


+1 (non-binding)

verified correct shading.

 Jan

On 5/19/21 8:53 PM, Ismaël Mejía wrote:

This release is only to publish the vendored dependency
artifacts. We need those to integrate it and be able to verify
whether it causes problems or not. The PR for this is already
opened but it needs the artifacts of this vote to be run.
https://github.com/apache/beam/pull/14824


For ref there is a document on how to release and validate
releases of Beam's vendored dependencies that can be handy to
anyone wishing to help validate:
https://s.apache.org/beam-release-vendored-artifacts


On Wed, May 19, 2021 at 8:45 PM Tyson Hamilton
<tyso...@google.com> wrote:

I'd like to help, but I don't know how to determine whether
this upgrade is going to cause problems or not. Are there
tests I should look at, or some validation I should perform?

On Wed, May 19, 2021 at 11:29 AM Ismaël Mejía
<ieme...@gmail.com> wrote:

Kind reminder, the vote is ongoing

On Mon, May 17, 2021 at 5:32 PM Ismaël Mejía
<ieme...@gmail.com> wrote:

Please review the release of the following artifacts
that we vendor:
 * beam-vendor-bytebuddy-1_11_0

Hi everyone,
Please review and vote on the release candidate #1
for the version 0.1, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide
specific comments)

The complete staging area is available for your
review, which includes:
* the official Apache source release to be deployed
to dist.apache.org [1],
which is signed with the key with fingerprint
3415631729E15B33051ADB670A9DAF6713B86349 [2],
* all artifacts to be deployed to the Maven Central
Repository [3],
* commit hash
"d93c591deb21237ddb656583d7ef7a4debba" [4],

The vote will be open for at least 72 hours. It is
adopted by majority approval, with at least 3 PMC
affirmative votes.

Thanks,
Release Manager

[1]
https://dist.apache.org/repos/dist/dev/beam/vendor/

[2]
https://dist.apache.org/repos/dist/release/beam/KEYS

[3]

https://repository.apache.org/content/repositories/orgapachebeam-1166/


[4]

https://github.com/apache/beam/commit/d93c591deb21237ddb656583d7ef7a4debba





Re: [DISCUSS] Drop support for Flink 1.8 and 1.9

2021-03-15 Thread Etienne Chauchot

Hi,

+1 on drop

Etienne

On 12/03/2021 20:39, Ismaël Mejía wrote:

Do we now support 1.8 through 1.12?

Yes, and that's clearly too much given that the Flink community only
supports the two latest releases.
It also hits us because we run tests for all those versions on precommit.

On Fri, Mar 12, 2021 at 7:27 PM Robert Bradshaw  wrote:

Do we now support 1.8 through 1.12?

Unless there are specific objections, makes sense to me.

On Fri, Mar 12, 2021 at 8:29 AM Alexey Romanenko wrote:

+1 too but are there any potential objections for this?

On 12 Mar 2021, at 11:21, David Morávek  wrote:

+1

D.

On Thu, Mar 11, 2021 at 8:33 PM Ismaël Mejía  wrote:

+user


Should we add a warning or something to 2.29.0?

Sounds like a good idea.




On Thu, Mar 11, 2021 at 7:24 PM Kenneth Knowles  wrote:

Should we add a warning or something to 2.29.0?

On Thu, Mar 11, 2021 at 10:19 AM Ismaël Mejía  wrote:

Hello,

We have been supporting older versions of Flink beyond what we had agreed in previous
discussions, where we said we would be supporting only the latest three releases
[1].

I would like to propose that for Beam 2.30.0 we stop supporting Flink 1.8 and
1.9 [2].  I prepared a PR for this [3] but of course I wanted to bring the
subject here (and to user@) for your attention and in case someone has a
different opinion or reason to still support the older versions.

WDYT?

Regards,
Ismael

[1] 
https://lists.apache.org/thread.html/rfb5ac9d889d0e3f4400471de3c25000a15352bde879622c899d97581%40%3Cdev.beam.apache.org%3E
[2] https://issues.apache.org/jira/browse/BEAM-11948
[3] https://github.com/apache/beam/pull/14203




Re: Unit tests vs. Integration Tests

2021-01-15 Thread Etienne Chauchot
Big +1 on using testcontainers rather than embedded real backends. That 
is what we plan to use for ES refactoring.


I'm a strong believer that mocks are useless as replacements for complex
backends. Testing things like IOs against mocks is an almost certain
failure because mocks cannot be representative of the complex backend.
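
As an illustration, a minimal sketch of a test against a
Testcontainers-managed Elasticsearch (the image tag and test body are
just examples, not the planned ES refactoring itself):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;
import org.junit.ClassRule;
import org.junit.Test;
import org.testcontainers.elasticsearch.ElasticsearchContainer;

public class ElasticsearchContainerSmokeTest {
  // One real ES instance per test class, started/stopped by JUnit.
  @ClassRule
  public static final ElasticsearchContainer ES =
      new ElasticsearchContainer(
          "docker.elastic.co/elasticsearch/elasticsearch:7.9.3");

  @Test
  public void clusterIsReachable() throws Exception {
    try (RestClient client =
        RestClient.builder(HttpHost.create(ES.getHttpHostAddress())).build()) {
      // A real request against a real backend, no mocked responses.
      client.performRequest(new Request("GET", "/_cluster/health"));
    }
  }
}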


Etienne

On 08/12/2020 00:36, Kenneth Knowles wrote:



On Fri, Dec 4, 2020 at 3:29 PM Brian Hulette wrote:




On Wed, Dec 2, 2020 at 5:50 PM Brian Hulette <bhule...@google.com> wrote:

I guess another question I should ask - is :test supposed to
only run unit tests? I've been assuming so since many modules
have separate :integrationTest tasks for *IT tests.

On Wed, Dec 2, 2020 at 4:15 PM Kyle Weaver
<kcwea...@google.com> wrote:

> DicomIOTest, FhirIOTest,
HL7v2IOTest, org.apache.beam.sdk.extensions.ml.*IT

Looking only at the former three tests, I don't see any
reason they can't mock their API clients, especially
since all they expect the server to do is send back an error.


Fair point, that wouldn't be *too* much trouble. More than
just re-classifying them as integration tests though :)


Can we write the tests to be agnostic as to whether the service is 
mocked, faked, or real? Then we can run them in various modes.


I'm a strong believer in "fake don't mock" to the extent that I think 
mocking these might be counterproductive. On a case-by-case basis, 
perhaps one can write a tiny in-memory version that has the 
functionality needed rather than exact scripted responses. This will 
interact well with the idea of making the test unaware of whether it 
is running against the real service or not.


This way we can run a fast (or at least hermetic) version as a unit 
test and once in a while sanity check it against the real service.


(In a perfect world, owners of services would always ship a local 
testing fake and own keeping it up to date... testcontainers is one 
approach but also in-memory fakes are great)


Kenn

> This seems like something that is easy to get wrong
without some automation to help. Could we run the :test
targets on Jenkins using the sandbox command or docker to
block network access?

That's a great idea. Are we planning on integrating the
"standardized developer build environment" mentioned in
the original post into our CI somehow?


I was thinking it could be good to use it in CI somehow to
make sure it doesn't get out of date, but all I had in mind
was running some minimal set of tasks. Using it in this way
would obviously be even better.


I filed https://issues.apache.org/jira/browse/BEAM-11404 to track
this idea.


On Wed, Dec 2, 2020 at 4:03 PM Andrew Pilloud
<apill...@google.com> wrote:

We have a large number of tests that run pipelines on
the Direct Runner or various local runners, but don't
require network access, so I don't think the
distinction is clear. I do agree that requiring a
remote service falls on the integration test side.

This seems like something that is easy to get wrong
without some automation to help. Could we run the
:test targets on Jenkins using the sandbox command or
docker to block network access?

On Wed, Dec 2, 2020 at 3:38 PM Brian Hulette
<bhule...@google.com> wrote:

Sorry I should've included the list of tests here.
So far we've run into:
DicomIOTest, FhirIOTest,
HL7v2IOTest, org.apache.beam.sdk.extensions.ml.*IT

Note the latter are called IT, but that package's
build.gradle has a line to scoop ITs into the
:test task (addressing in [1]).

All of these tests are actually running pipelines
so I think they'd be difficult to mock.

[1] https://github.com/apache/beam/pull/13444

On Wed, Dec 2, 2020 at 3:28 PM Kyle Weaver
<kcwea...@google.com> wrote:

> Should we (do we) require unit tests to be
hermetic?

We should. Unit tests are hermetic by
definition. That begs the definition of
hermetic, but clearly the internet is not.

> Personally I think these tests should be
classified as integration tests (renamed to
*IT, and run with the :integrationTest task)

I'm 

Re: Combine with multiple outputs case Sample and the rest

2021-01-15 Thread Etienne Chauchot

Hi all,

Regarding leveraging the ParDo part of Combine (Combine <=> GBK + ParDo)
to have multiple outputs, please note that most of the time Combine is
translated by the runners into a native Combine (in the target
technology) and not a GBK + ParDo.


Regarding using the stateful DoFn I agree with Kenn, with the small
exception that stateful DoFn is not supported in streaming mode with
the Spark runner.


But I guess, Ismaël, that the use case is batch mode.
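
For what it's worth, a minimal sketch of the side input + multi-output
routing discussed below (input is assumed to be a PCollection<String>;
with duplicate elements it would need the uniquification Robert mentions):

// Tags for the two outputs: the sampled elements and the rest.
final TupleTag<String> sampledTag = new TupleTag<String>() {};
final TupleTag<String> restTag = new TupleTag<String>() {};

// The sample is small, so it fits in a side input.
final PCollectionView<Iterable<String>> sampleView =
    input.apply(Sample.fixedSizeGlobally(100)).apply(View.asSingleton());

PCollectionTuple outputs = input.apply(
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        boolean inSample = false;
        for (String s : c.sideInput(sampleView)) {
          if (s.equals(c.element())) {
            inSample = true;
            break;
          }
        }
        if (inSample) {
          c.output(c.element());          // goes to sampledTag
        } else {
          c.output(restTag, c.element()); // goes to restTag
        }
      }
    }).withSideInputs(sampleView).withOutputTags(sampledTag, TupleTagList.of(restTag)));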

Best

Etienne

On 05/01/2021 15:00, Kenneth Knowles wrote:
Perhaps something based on stateful DoFn so there is a simple decision 
point at which each element is either sampled or not so it can be 
output to one PCollection or the other. Without doing a little 
research, I don't recall if this is doable in the way you need.


Kenn

On Wed, Dec 23, 2020 at 3:12 PM Ismaël Mejía wrote:


Thanks for the answer Robert. Producing a combiner with two lists as
outputs was one idea I was considering too but I was afraid of
OutOfMemory issues. I had not thought much about the consequences on
combining state, thanks for pointing that. For the particular sampling
use case it might not be an issue, or am I missing something?

I am still curious if for Sampling there could be another approach to
achieve the same goal of producing the same result (uniform sample +
the rest) but without the issues of combining.

On Mon, Dec 21, 2020 at 7:23 PM Robert Bradshaw
<rober...@google.com> wrote:
>
> There are two ways to emit multiple outputs: either to multiple
distinct PCollections (e.g. withOutputTags) or multiple (including
0) outputs to a single PCollection (the difference between Map and
FlatMap). In full generality, one can always have a CombineFn that
outputs lists (say *) followed by a DoFn that emits
to multiple places based on this result.
>
> One other cons of emitting multiple values from a CombineFn is
that they are used in other contexts as well, e.g. combining
state, and trying to make sense of a multi-outputting CombineFn in
that context is trickier.
>
> Note that for Sample in particular, it works as a CombineFn
because we throw most of the data away. If we kept most of the
data, it likely wouldn't fit into one machine to do the final
sampling. The idea of using a side input to filter after the fact
should work well (unless there's duplicate elements, in which case
you'd have to uniquify them somehow to filter out only the "right"
copies).
>
> - Robert
>
>
>
> On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>
>> I had a question today from one of our users about Beam’s Sample
>> transform (a Combine with an internal top-like function to
produce a
>> uniform sample of size n of a PCollection). They wanted to
obtain also
>> the rest of the PCollection as an output (the non sampled
elements).
>>
>> My suggestion was to use the sample (since it was little) as a side
>> input and then reprocess the collection to filter its elements,
>> however I wonder if this is the ‘best’ solution.
>>
>> I was thinking also if Combine is essentially GbK + ParDo why
we don’t
>> have a Combine function with multiple outputs (maybe an
evolution of
>> CombineWithContext). I know this sounds weird and I have
probably not
>> thought much about issues or the performance of the translation
but I
>> wanted to see what others thought, does this make sense, do you see
>> some pros/cons or other ideas.
>>
>> Thanks,
>> Ismaël



Re: ElasticsearchIO.Write() dynamic ES indices

2021-01-15 Thread Etienne Chauchot

+1

What is not supported yet is wildcard indexes (the index* pattern). That
will be supported when the refactoring (using high-level ES objects
rather than low-level REST String objects) is done.
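
For reference, a rough sketch of that approach (the "esIndex" field name
comes from the question below):

// Route each document to the index named in its "esIndex" field.
// withIndexFn takes a FieldValueExtractFn, i.e. JsonNode -> String.
ElasticsearchIO.Write write =
    ElasticsearchIO.write()
        .withConnectionConfiguration(config)
        .withIndexFn(doc -> doc.get("esIndex").asText());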


Best

Etienne

On 13/01/2021 16:45, Brian Hulette wrote:
It looks like you should be able to accomplish this with withIndexFn 
[1]. Does that work for you?


Brian

[1] 
https://beam.apache.org/releases/javadoc/2.27.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.Write.html#withIndexFn-org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.Write.FieldValueExtractFn-


On Wed, Jan 13, 2021 at 1:15 AM Arif Alili wrote:


Hi all,

I am writing to Elasticsearch using Beam (Google Dataflow). At the
moment I pass index name to ElasticsearchIO.Write() as variable
when deploying the dataflow. Instead of having one static index, I
want to insert to multiple Elasticsearch indices, I have the index
name on Pub/Sub topic, so I want to read the current row that is
being written and set the Elasticsearch index based on that value.

My current code is:
ConnectionConfiguration config =
ConnectionConfiguration.create(
options().getNodeAddresses().split(","),
options().getIndex(),
options().getDocumentType())
   .withUsername(options().getUsername())
   .withPassword(options().getPassword());

I need to change "options().getIndex()" line to get the index from
Pub/Sub topic instead. Pub/Sub topic field where index name is
stored is: esIndex, so I tried replacing that line with "doc ->
doc.get("esIndex")" but I see the next error: "java.lang.String is
not a functional interface".

Does anyone know how to set the Elasticsearch index reading
Pub/Sub topic that is currently being written?

Best,
Arif



Re: [blog about Beam]

2020-11-12 Thread Etienne Chauchot

Hi Pablo,

Thanks for reading these.

Etienne

On 10/11/2020 20:09, Pablo Estrada wrote:

Thanks Etienne!
I read your post on why we can't have multiple aggregations in Spark 
Streaming. It was informative. Thanks for writing these!

Best
-P.

On Tue, Nov 10, 2020 at 3:39 AM Etienne Chauchot <echauc...@apache.org> wrote:


Hi all,

In case anyone is interested, I started a blog [1] this year about
big
data technologies. There are 8 articles so far and they are mainly
related to Beam even if some are related to Spark (but with the
knowledge acquired while working on the Beam Spark runner).

I just published the first blog article of a series of articles about
tricky use cases of Apache Beam. This one is about incremental join;
the next one will be about custom combine and the last of the series
about custom window.

[1] https://echauchot.blogspot.com/

Best,

Etienne








[Beam Spark Structured Streaming runner]

2020-11-05 Thread Etienne Chauchot

Hi all,

In case anyone wanted some details about the new Beam runner based on 
Spark Structured Streaming framework, here are 2 talks I gave at the 
ApacheCon this year and last year about this subject.


https://www.youtube.com/watch?v=oEehQwOEFvg

https://www.youtube.com/watch?v=_dCmV1ZW3M4

Best,

Etienne



Re: Shutting down Perfkit Explorer

2020-09-25 Thread Etienne Chauchot

Thanks Kamil !

Etienne

On 25/09/2020 16:10, Kamil Wasilewski wrote:

They have been migrated: http://metrics.beam.apache.org

The website doesn't support HTTPS, so if you can't access it, you may 
need to add an exception to your browser extension.


On Fri, Sep 25, 2020 at 3:49 PM Etienne Chauchot <echauc...@apache.org> wrote:


Hi all,

I'm coming a bit after the battle, but how should we access the
dashboards now (load tests, nexmark etc...)?

Are the dashboards lost or have they migrated to another environment ?

Thanks

Etienne

On 24/09/2020 13:51, Robert Burke wrote:

LGTM
Good clear message.

On Thu, Sep 24, 2020, 4:47 AM Kamil Wasilewski
<kamil.wasilew...@polidea.com> wrote:

The message has been updated:
https://apache-beam-testing.appspot.com/

On Thu, Sep 24, 2020 at 12:07 PM Kamil Wasilewski
<kamil.wasilew...@polidea.com> wrote:

I'm not sure if such a jira issue exists. I'm also not
convinced that we need a new one. New jira means there is
an action to be taken, now or in the future. Our goal is
only to make sure no one would disable the app engine
accidentally.

Putting some information instead of "hello world" is a
good idea. I will update it today. Thank you Tobiasz for
the link, it will be very helpful.


On Thu, Sep 24, 2020 at 11:32 AM Tobiasz Kędzierski
<tobiasz.kedzier...@polidea.com> wrote:

I agree with Robert, putting some information instead
of "hello world" could prevent disabling GAE and
connected problems in the future.
If a relevant Jira issue exists it would be a great
idea to put a link to it.
What do you think about adding a link to GCP
documentation mentioning this dependency [1] ?

BR Tobiasz

[1]

https://cloud.google.com/datastore/docs/reference/libraries#dependency_on_application



On Wed, Sep 23, 2020 at 5:33 PM Robert Burke
<rob...@frantil.com> wrote:

Perhaps instead of "hello world" the message
could refer to a jira about the Datastore IT tests?
I suspect if we don't we'll just have a repeat of
"we shut down app engine since it was just
running a hello world, and the Datastore tests died".

On Wed, Sep 23, 2020, 8:14 AM Kamil Wasilewski
<kamil.wasilew...@polidea.com> wrote:

An error message that Udi sent is pretty self
explanatory. Disabling Google App Engine
caused Datastore to be not accessible too.
What's interesting, it doesn't make any
difference if the application on GAE is
actually using Datastore or not. GAE must be
simply turned on and that's the only requirement.

I replaced Perfkit Explorer with a simple
"hello world" running on Python 3.8. I ran
Datastore IT tests and they passed, so I
think this solves the problem. If something
goes wrong, let me know!
Sorry for the inconvenience and thanks Tyson for
the rescue.

On Tue, Sep 22, 2020 at 11:25 PM Udi Meiri
<eh...@google.com> wrote:

Thanks, Tyson!

On Tue, Sep 22, 2020 at 11:11 AM Tyson
Hamilton <tyso...@google.com> wrote:

I re-enabled the AppEngine app. Today
that app has both the required
datastore app and the perfkit app
baked into the container image. What
should happen, is that the perfkit
app is removed from that image, but
the datastore related stuff remains
functional.

On Tue, Sep 22, 2020 at 10:37 AM Udi
Meiri <eh...@google.com> wrote:

Is it possible to create a simple
"hello world" application instead?

On Tue, Se

Re: Shutting down Perfkit Explorer

2020-09-25 Thread Etienne Chauchot

Hi all,

I'm coming a bit after the battle, but how should we access the 
dashboards now (load tests, nexmark etc...)?


Are the dashboards lost or have they migrated to another environment ?

Thanks

Etienne

On 24/09/2020 13:51, Robert Burke wrote:

LGTM
Good clear message.

On Thu, Sep 24, 2020, 4:47 AM Kamil Wasilewski
<kamil.wasilew...@polidea.com> wrote:


The message has been updated: https://apache-beam-testing.appspot.com/

On Thu, Sep 24, 2020 at 12:07 PM Kamil Wasilewski
<kamil.wasilew...@polidea.com> wrote:

I'm not sure if such a jira issue exists. I'm also not
convinced that we need a new one. New jira means there is an
action to be taken, now or in the future. Our goal is only to
make sure no one would disable the app engine accidentally.

Putting some information instead of "hello world" is a good
idea. I will update it today. Thank you Tobiasz for the link,
it will be very helpful.


On Thu, Sep 24, 2020 at 11:32 AM Tobiasz Kędzierski
<tobiasz.kedzier...@polidea.com> wrote:

I agree with Robert, putting some information instead of
"hello world" could prevent disabling GAE and connected
problems in the future.
If a relevant Jira issue exists it would be a great idea
to put a link to it.
What do you think about adding a link to GCP documentation
mentioning this dependency [1] ?

BR Tobiasz

[1]

https://cloud.google.com/datastore/docs/reference/libraries#dependency_on_application



On Wed, Sep 23, 2020 at 5:33 PM Robert Burke
<rob...@frantil.com> wrote:

Perhaps instead of "hello world" the message could
refer to a jira about the Datastore IT tests?
I suspect if we don't we'll just have a repeat of "we
shut down app engine since it was just running a hello
world, and the Datastore tests died".

On Wed, Sep 23, 2020, 8:14 AM Kamil Wasilewski
<kamil.wasilew...@polidea.com> wrote:

An error message that Udi sent is pretty self
explanatory. Disabling Google App Engine caused
Datastore to be not accessible too. What's
interesting, it doesn't make any difference if the
application on GAE is actually using Datastore or
not. GAE must be simply turned on and that's the
only requirement.

I replaced Perfkit Explorer with a simple "hello
world" running on Python 3.8. I ran Datastore IT
tests and they passed, so I think this solves the
problem. If something goes wrong, let me know!
Sorry for the inconvenience and thanks Tyson for the
rescue.

On Tue, Sep 22, 2020 at 11:25 PM Udi Meiri
<eh...@google.com> wrote:

Thanks, Tyson!

On Tue, Sep 22, 2020 at 11:11 AM Tyson
Hamilton <tyso...@google.com> wrote:

I re-enabled the AppEngine app. Today that
app has both the required datastore app
and the perfkit app baked into the
container image. What should happen, is
that the perfkit app is removed from that
image, but the datastore related stuff
remains functional.

On Tue, Sep 22, 2020 at 10:37 AM Udi Meiri
<eh...@google.com> wrote:

Is it possible to create a simple
"hello world" application instead?

On Tue, Sep 22, 2020 at 10:35 AM Udi
Meiri <eh...@google.com> wrote:

Disabling this broke our Datastore
ITs. Apparently you must have an
application for Datastore to work.
From the Datastore dashboard:
The project apache-beam-testing
does not exist or it does not
contain an active Cloud Datastore
or Cloud Firestore database.
Please visit
http://console.cloud.google.com to
create a project or

https://console.cloud.google.com/datastore/setup?project=apache-be

Re: Chronically flaky tests

2020-08-04 Thread Etienne Chauchot

Hi all,

+1 on pinging the assigned person.

For the flakes I know of (ESIO and CassandraIO), they are due to the 
load of the CI server. These IOs are tested using real embedded backends 
because those backends are complex and we need relevant tests.


Countermeasures have been taken (retries inside the tests that are
sensitive to load, ranges of acceptable numbers, calls to internal
backend mechanisms to force a refresh in case load prevented the
backend from doing so...).


I recently got pinged by Ahmet (thanks to him!) about a flakiness that I
had not seen. This seems to me the correct way to go. Systematically
retrying tests with a CI mechanism or disabling tests seems to me a risky
workaround that just gets the problem off our minds.


Etienne

On 20/07/2020 20:58, Brian Hulette wrote:
> I think we are missing a way for checking that we are making 
progress on P1 issues. For example, P0 issues block releases and this 
obviously results in fixing/triaging/addressing P0 issues at least 
every 6 weeks. We do not have a similar process for flaky tests. I do 
not know what would be a good policy. One suggestion is to ping 
(email/slack) assignees of issues. I recently missed a flaky issue 
that was assigned to me. A ping like that would have reminded me. And 
if an assignee cannot help/does not have the time, we can try to find 
a new assignee.


Yeah I think this is something we should address. With the new jira 
automation at least assignees should get an email notification after 
30 days because of a jira comment like [1], but that's too long to let 
a test continue to flake. Could Beam Jira Bot ping every N days for 
P1s that aren't making progress?


That wouldn't help us with P1s that have no assignee, or are assigned 
to overloaded people. It seems we'd need some kind of dashboard or 
report to capture those.


[1] 
https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918


On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay wrote:


Another idea, could we change our "Retest X" phrases with "Retest
X (Reason)" phrases? With this change a PR author will have to
look at failed test logs. They could catch new flakiness
introduced by their PR, file a JIRA for a flakiness that was not
noted before, or ping an existing JIRA issue/raise its severity.
On the downside this will require PR authors to do more.

On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <tyso...@google.com> wrote:

Adding retries can be beneficial in two ways, unblocking a PR,
and collecting metrics about the flakes.


Makes sense. I think we will still need to have a plan to remove
retries similar to re-enabling disabled tests.


If we also had a flaky test leaderboard that showed which
tests are the most flaky, then we could take action on them.
Encouraging someone from the community to fix the flaky test
is another issue.

The test status matrix of tests that is on the GitHub landing
page could show flake level to communicate to users which
modules are losing a trustable test signal. Maybe this shows
up as a flake % or a code coverage % that decreases due to
disabled flaky tests.


+1 to a dashboard that will show a "leaderboard" of flaky tests.


I didn't look for plugins, just dreaming up some options.




On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <lc...@google.com> wrote:

What do other Apache projects do to address this issue?

On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay
<al...@google.com> wrote:

I agree with the comments in this thread.
- If we are not re-enabling tests back again or we do
not have a plan to re-enable them again, disabling
tests only provides us temporary relief until
eventually users find issues instead of disabled tests.
- I feel similarly about retries. It is reasonable to
add retries for reasons we understand. Adding retries
to avoid flakes is similar to disabling tests. They
might hide real issues.

I think we are missing a way for checking that we are
making progress on P1 issues. For example, P0 issues
block releases and this obviously results in
fixing/triaging/addressing P0 issues at least every 6
weeks. We do not have a similar process for flaky
tests. I do not know what would be a good policy. One
suggestion is to ping (email/slack) assignees of
issues. I recently missed a flaky issue that was
assigned to me. A ping like that would have reminded
me. And if an assignee cannot help/does not ha

Re: [ANNOUNCE] New PMC Member: Alexey Romanenko

2020-06-19 Thread Etienne Chauchot

Congrats Alexey !

Well deserved !

Etienne

On 17/06/2020 16:30, Gleb Kanterov wrote:

Congratulations! Thanks for your hard work

On Wed, Jun 17, 2020 at 1:11 PM Alexey Romanenko
<aromanenko@gmail.com> wrote:


Thank you Ismaël and everybody!
Happy to be a part of Beam community!


On 17 Jun 2020, at 09:31, Jan Lukavský <je...@seznam.cz> wrote:

Congrats Alexey!

On 6/17/20 9:22 AM, Reza Rokni wrote:

Congratulations!

On Wed, Jun 17, 2020 at 2:48 PM Michał Walenia
<michal.wale...@polidea.com> wrote:

Congratulations!

On Tue, Jun 16, 2020 at 11:45 PM Rui Wang <ruw...@google.com> wrote:

Congrats!


-Rui

On Tue, Jun 16, 2020 at 2:42 PM Ankur Goenka
<goe...@google.com> wrote:

Congratulations Alexey!

On Tue, Jun 16, 2020 at 2:41 PM Thomas Weise
<t...@apache.org> wrote:

Congratulations!


On Tue, Jun 16, 2020 at 1:27 PM Valentyn
Tymofieiev <valen...@google.com> wrote:

Congratulations!

On Tue, Jun 16, 2020 at 11:41 AM Ahmet Altay
<al...@google.com> wrote:

Congratulations!

On Tue, Jun 16, 2020 at 10:05 AM Pablo
Estrada <pabl...@google.com> wrote:

Yooohooo! Thanks for all your
contributions and hard work Alexey!:)

On Tue, Jun 16, 2020, 8:57 AM Ismaël
Mejía <ieme...@gmail.com> wrote:

Please join me and the rest of
Beam PMC in welcoming Alexey
Romanenko as our
newest PMC member.

Alexey has significantly
contributed to the project in
different ways: new
features and improvements in the
Spark runner(s) as well as
maintenance of
multiple IO connectors including
some of our most used ones
(Kafka and
Kinesis/Aws). Alexey is also
quite active helping new
contributors and our user
community in the mailing lists /
slack and Stack overflow.

Congratulations Alexey!  And
thanks for being a part of Beam!

Ismaël








Re: Is org.apache.beam.sdk.transforms.FlattenTest.testFlattenMultipleCoders supposed to be supported ?

2020-06-17 Thread Etienne Chauchot

Hi,

I forgot about this subject and came across this thread lately, so I
tested again:


- what the new Spark runner does (even in local mode), and I guess all
the other runners do:

    - encodes PC1 with the user-specified
NullableCoder(BigEndianLongCoder) to be able to pass data over the network

    - encodes PC2 with the user-specified VarLongCoder to be able
to pass data over the network

    - unions the 2 collections

    - decodes using the specified output coder
NullableCoder(VarLongCoder)


=> There, when the output coder comes across an element encoded with
VarLongCoder, it fails with a compatibility exception because, as Robert
said, the coders are not compatible and neither are the elements encoded
with them.
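
In code, the failing scenario is roughly the following sketch (not the
exact test code):

// Two inputs with heterogeneous but Long-compatible coders.
PCollection<Long> pc1 = p.apply("Create1",
    Create.of(1L, (Long) null)
        .withCoder(NullableCoder.of(BigEndianLongCoder.of())));
PCollection<Long> pc2 = p.apply("Create2",
    Create.of(2L).withCoder(VarLongCoder.of()));

PCollection<Long> flattened =
    PCollectionList.of(pc1).and(pc2)
        .apply(Flatten.pCollections())
        // Once real serialization happens (on a cluster), this output
        // coder cannot decode pc2's VarLong-encoded bytes.
        .setCoder(NullableCoder.of(VarLongCoder.of()));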


As a consequence I have some remarks:

=> The current Spark runner (RDD) performs no control of the coders (at
least for flatten), so the test passes only because it is local and so
no serialization happens. But such a pipeline with heterogeneous
coders will fail on a real cluster just like the above. For Flink, an
exception will be raised saying that the input coders are different, so
it relies on the user to fix his pipeline.


=> I agree with Robert that such a problem should be dealt with on the
SDK/user side. The runner should not change the user-specified output
coder to one compatible with the input coders, or change the
user-specified input coders to one compatible with the output coder. I
think this problem is not only for flatten but for all transforms, as a
given PCollection could be assigned a coder that is not compatible with
the coder of the previous step of the pipeline.


=> As a consequence, I propose to let the CoderException be raised and
to expect the exception in the test. But for the local Spark RDD runner
used in tests, which will not try to serialize, the exception will not
be thrown. For the local Direct runner it will not be either, because it
does not serialize, and it seems the enforceEncodability parameter does
not lead to an exception in that case.


WDYT?

Etienne

On 20/12/2019 19:40, Robert Bradshaw wrote:

The problem here is that T and Nullable<T> are two different types,
but are not distinguished as such in the Java type system (and hence
are treated interchangeably there), modulo certain cases where one can
use a @Nullable annotation. They also have incompatible encodings.

In my view, it is the SDKs job to ensure the output type (and
corresponding coder) of a Flatten is a least upper bound of those of
its inputs. It could do this by being restrictive on the Coders of its
inputs (e.g. requiring them to be equal), being intelligent (actually
able to compute least upper bounds), or placing the burden on the user
(e.g. requiring the user to specify it, possibly only when there is
ambiguity about what coder to choose).

On the other hand, I don't think that in the portability protos we
should require that all coders to a flatten be equal, only that the
output coder be sufficiently powerful to encode the union of all
possible elements (equivalently, it is a valid graph transformation to
swap out all input coder for the given output coder). Runners should
and must do recodings as necessary depending on their flatten
implementations (which often will be as simple requesting the a
consistent coding on the output ports, but may involve injecting
identity operations to do (possibly virtual) recoding in some cases,
e.g.

pc1[int]
 \
   Flatten -- pc12[int]
 /
pc2[int]
 \
   Flatten -- pc23[Optional[int]]
 /
pc3[Optional[int]]


On Thu, Dec 19, 2019 at 3:09 PM Luke Cwik  wrote:

I'm pretty sure that Flatten with different coders is well defined.
input: List<PCollection<T>>
output: PCollection<T>

When flatten is executed using T vs encoded(T), transcoding can be optimized because the coder 
for the output PCollection is assumed to be able to encode all T's. The DirectRunner 
specifically does this transcoding check on elements to help pipeline authors catch this kind 
of error. Alternatively an SDK could require a method like "bool canEncode(T)" on 
coders which could be very cheap to ensure that values could be transcoded (this would work for 
many but not all value types). When the execution is occurring on encoded(T), then the bytes 
need to be transcoded somehow since the downstream transform is expected to get an encoding 
compatible with output PCollections encoding.

For the example, flattening Nullable<Long> and Long would be valid since the
output PCollection accepts all the supported input types.

I believe all runners need to transcode if they are operating on encoded(T) 
when the input PCollection coder is not the same as the output PCollection 
coder. If they are operating on T's, then it's optional since it's a choice
between performance and debuggability.


On Wed, Dec 11, 2019 at 3:47 AM Etienne Chauchot  wrote:

Ok,

Thanks Kenn.

The Flatten javadoc says that by default the coder of the output should

Re: Add options to CassandraIO

2020-05-14 Thread Etienne Chauchot

Hi Nathan,

Thanks for raising this, and thanks for the PR proposal.

I would recommend (as was done in other IOs such as ElasticsearchIO)
the third solution: you could add a method called
withConnectTimeout(Integer) to both the Read and Write builders of the
IO (there is no common conf object in this IO). Indeed, Beam has a sort
of "no knob" philosophy to reduce to the minimum the conf parameters
available to the users; hence the encapsulation.
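
A rough sketch of what the encapsulated option could look like (a
proposal, not existing CassandraIO API; method and field names are
hypothetical):

// Hypothetical builder method on CassandraIO.Read (same idea for Write):
public Read<T> withConnectTimeout(Integer timeoutMs) {
  checkArgument(timeoutMs != null && timeoutMs > 0,
      "connect timeout must be > 0, but was: %s", timeoutMs);
  return builder().setConnectTimeout(timeoutMs).build();
}

// Internally, applied to the driver when the Cluster is built:
SocketOptions socketOptions = new SocketOptions();
socketOptions.setConnectTimeoutMillis(spec.connectTimeout());
Cluster cluster = Cluster.builder()
    .addContactPoints(hosts)
    .withSocketOptions(socketOptions)
    .build();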


Best

Etienne

On 14/05/2020 00:34, Nathan Fisher wrote:

Hi all,

I frequently test pipelines over a VPN link. As a result, the default
SocketOptions configuration results in timeout exceptions. I would
like the ability to tweak the timeouts, which requires the ability to
get at Cassandra's Cluster.Builder and set a custom socket option:


If I were to raise an issue/PR what would be preference of these options?

- expose only SocketOptions as a setter on the builder.
- allow setting the Cassandra Cluster.Builder on the IO builder.
- encapsulate the socketoptions behind additional methods on the beam 
IO builder.


Regards,
Nathan
--
Nathan Fisher
 w: http://junctionbox.ca/


Re: A new reworked Elasticsearch 7+ IO module

2020-04-09 Thread Etienne Chauchot

Hi Kenn,

The user does not specify the targeted backendVersion (at least in the
current version of the IO); it is transparent to him: the IO detects the
version with a REST call and adapts its behavior. But, anyway, I agree,
we need to put at least a WARN if the detected version is 2. As the new
IO will not be compatible with ES v2 (because the ES classes differ too
much to have a common production basis), the only option in the new IO
is to reject it completely if the version is 2, IMHO.
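
In code, that could look roughly like the following sketch
(getBackendVersion and the messages are placeholders, not the IO's
actual code):

// Sketch only: reject ES v2 outright, warn on EOL versions.
int backendVersion = getBackendVersion(connectionConfiguration); // REST call
checkArgument(backendVersion != 2,
    "Elasticsearch v2 is no longer supported by ElasticsearchIO");
if (backendVersion < 6) {
  LOG.warn("Elasticsearch v{} is EOL at Elastic, consider upgrading", backendVersion);
}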


Best

Etienne

On 06/03/2020 18:49, Kenneth Knowles wrote:
Since the user provides backendVersion, here are some possible levels 
of things to add in expand() based on that (these are extra niceties 
beyond the agreed number of releases to remove)


 - WARN for backendVersion < n
 - reject for backendVersion < n with opt-in pipeline option to keep 
it working one more version (gets their attention and indicates urgency)

 - reject completely

Kenn

On Fri, Mar 6, 2020 at 2:26 AM Etienne Chauchot <echauc...@apache.org> wrote:


Hi all,

it's been 3 weeks since the survey on ES versions the users use.

The survey received very few responses: only 9 responses for now
(multiple versions possible of course). The responses are the
following:

ES2: 0 clients, ES5: 1, ES6: 5, ES7: 8

It tends to go toward a drop of ES2 support but for now it is
still not very representative.

I'm cross-posting to @users to let you know that I'm closing the
survey within 1 or 2 weeks. So please respond if you're using ESIO.

Best

    Etienne

On 13/02/2020 12:37, Etienne Chauchot wrote:


Hi Cham, thanks for your comments !

I just sent an email to user ML with a survey link to count ES
uses per version:


https://lists.apache.org/thread.html/rc8185afb8af86a2a032909c13f569e18bd89e75a5839894d5b5d4082%40%3Cuser.beam.apache.org%3E

Best

Etienne

On 10/02/2020 19:46, Chamikara Jayalath wrote:



On Thu, Feb 6, 2020 at 8:13 AM Etienne Chauchot
<echauc...@apache.org> wrote:

Hi,

please see my comments inline

On 06/02/2020 16:24, Alexey Romanenko wrote:

Please, see my comments inline.


On 6 Feb 2020, at 10:50, Etienne Chauchot
<echauc...@apache.org> wrote:



1. regarding version support: ES v2 has not been
maintained by Elastic since 2018/02, so we plan
to remove it from the IO. In the past we
already retired versions (like Spark 1.6 for
instance).



My only concern here is that there might be users who
use the existing module who might not be able to
easily upgrade the Beam version if we remove it. But
given that V2 is 5 versions behind the latest release
this might be OK.


It seems we have a consensus on this.
I think there should be another general discussion on the
long term support of our prefered tool IO modules.


=> yes, consensus, let's drop ESV2


We had (and still have) a similar problem with KafkaIO to
support different versions of Kafka, especially very old
version 0.9. We raised this question on user@ and it
appears that there are users who for some reasons still use
old Kafka versions. So, before dropping a support of any ES
versions, I’d suggest to ask it user@ and see if any people
will be affected by this.

Yes we can do a survey among users but the question is,
should we support an ES version that is no more supported by
Elastic themselves ?


+1 for asking in the user list. I guess this is more about
whether users need this specific version that we hope to drop
support for. Whether we need to support unsupported versions is
a more generic question that should prob. be addressed in the
dev list. (and I personally don't think we should unless there's
a large enough user base for a given version).


2. regarding the user: the aim is to unlock
some new features (listed by Ludovic) and give
the user more flexibility on his request. For
that, it requires to use high level java ES
client in place of the low level REST client
(that was used because it is the only one
compatible with all ES versions). We plan to
replace the API (json document in and out) by
more complete standard ES objects that contain
de request logic (insert/update, doc routing
etc...) and the data. There are already IOs
like SpannerIO that use similar objects in
input PCollection rather than pure POJOs.



Won't this be a breaking change for all users ?

Re: A new reworked Elasticsearch 7+ IO module

2020-03-31 Thread Etienne Chauchot

Hi all,

The survey regarding Elasticsearch support in Beam is now closed.

Here are the results after 38 days (users using each version):

ES v2: 0

ES v5: 1

ES v6: 5

ES v7: 8

So, the new version of ElasticsearchIO after the refactoring discussed
in this thread will no longer support Elasticsearch v2.


Regards

Etienne Chauchot.


On 06/03/2020 11:26, Etienne Chauchot wrote:


Hi all,

it's been 3 weeks since the survey on ES versions the users use.

The survey received very few responses: only 9 responses for now 
(multiple versions possible of course). The responses are the following:


ES2: 0 clients, ES5: 1, ES6: 5, ES7: 8

It tends to go toward a drop of ES2 support but for now it is still 
not very representative.


I'm cross-posting to @users to let you know that I'm closing the 
survey within 1 or 2 weeks. So please respond if you're using ESIO.


Best

Etienne

On 13/02/2020 12:37, Etienne Chauchot wrote:


Hi Cham, thanks for your comments !

I just sent an email to user ML with a survey link to count ES uses 
per version:


https://lists.apache.org/thread.html/rc8185afb8af86a2a032909c13f569e18bd89e75a5839894d5b5d4082%40%3Cuser.beam.apache.org%3E

Best

Etienne

On 10/02/2020 19:46, Chamikara Jayalath wrote:



On Thu, Feb 6, 2020 at 8:13 AM Etienne Chauchot
<echauc...@apache.org> wrote:


Hi,

please see my comments inline

On 06/02/2020 16:24, Alexey Romanenko wrote:

Please, see my comments inline.


On 6 Feb 2020, at 10:50, Etienne Chauchot
<echauc...@apache.org> wrote:



1. regarding version support: ES v2 has not been
maintained by Elastic since 2018/02, so we plan to
remove it from the IO. In the past we already
retired versions (like Spark 1.6 for instance).



My only concern here is that there might be users who use
the existing module who might not be able to easily
upgrade the Beam version if we remove it. But given that
V2 is 5 versions behind the latest release this might be OK.


It seems we have a consensus on this.
I think there should be another general discussion on the
long term support of our prefered tool IO modules.


=> yes, consensus, let's drop ESV2


We had (and still have) a similar problem with KafkaIO to
support different versions of Kafka, especially very old
version 0.9. We raised this question on user@ and it appears
that there are users who for some reasons still use old Kafka
versions. So, before dropping a support of any ES versions, I’d
suggest to ask it user@ and see if any people will be affected
by this.

Yes we can do a survey among users but the question is, should
we support an ES version that is no more supported by Elastic
themselves ?


+1 for asking in the user list. I guess this is more about whether 
users need this specific version that we hope to drop support for. 
Whether we need to support unsupported versions is a more generic 
question that should prob. be addressed in the dev list. (and I 
personally don't think we should unless there's a large enough user 
base for a given version).



2. regarding the user: the aim is to unlock some
new features (listed by Ludovic) and give the user
more flexibility on his request. For that, it
requires to use high level java ES client in place
of the low level REST client (that was used because
it is the only one compatible with all ES
versions). We plan to replace the API (json
document in and out) by more complete standard ES
objects that contain the request logic
(insert/update, doc routing etc...) and the data.
There are already IOs like SpannerIO that use
similar objects in input PCollection rather than
pure POJOs.



Won't this be a breaking change for all users ? IMO using
POJOs in PCollections is safer since we have to worry
about changes to the underlying client library API.
Exception would be when underlying client library offers
a backwards compatibility guarantee that we can rely on
for the foreseeable future (for example, BQ TableRow).


Agreed but actually, there will be POJOs in order to abstract
Elasticsearch's version support. The following third point
explains this.


=> indeed it will be a breaking change, hence this email to
get a consensus on that. Also I think our wrappers of ES
request objects will offer a backward compatible as the
underlying objects


I just want to remind that according to what we agreed some
time ago on dev@ (at least, for IOs), all breaking user API
changes have to be added along with deprecation of old API that
could be removed after 3 consecutive Beam releases. In this
case, users will have a

Re: A new reworked Elasticsearch 7+ IO module

2020-03-06 Thread Etienne Chauchot

Hi all,

it's been 3 weeks since the survey on ES versions the users use.

The survey received very few responses: only 9 responses for now 
(multiple versions possible of course). The responses are the following:


ES2: 0 clients, ES5: 1, ES6: 5, ES7: 8

It tends to go toward a drop of ES2 support but for now it is still not 
very representative.


I'm cross-posting to @users to let you know that I'm closing the survey 
within 1 or 2 weeks. So please respond if you're using ESIO.


Best

Etienne

On 13/02/2020 12:37, Etienne Chauchot wrote:


Hi Cham, thanks for your comments !

I just sent an email to user ML with a survey link to count ES uses 
per version:


https://lists.apache.org/thread.html/rc8185afb8af86a2a032909c13f569e18bd89e75a5839894d5b5d4082%40%3Cuser.beam.apache.org%3E

Best

Etienne

On 10/02/2020 19:46, Chamikara Jayalath wrote:



On Thu, Feb 6, 2020 at 8:13 AM Etienne Chauchot <echauc...@apache.org> wrote:


Hi,

please see my comments inline

On 06/02/2020 16:24, Alexey Romanenko wrote:

Please, see my comments inline.


On 6 Feb 2020, at 10:50, Etienne Chauchot <echauc...@apache.org> wrote:



1. regarding version support: ES v2 has not been
maintained by Elastic since 2018/02, so we plan to
remove it from the IO. In the past we already
retired versions (like Spark 1.6 for instance).



My only concern here is that there might be users who use
the existing module who might not be able to easily
upgrade the Beam version if we remove it. But given that
V2 is 5 versions behind the latest release this might be OK.


It seems we have a consensus on this.
I think there should be another general discussion on the long
term support of our prefered tool IO modules.


=> yes, consensus, let's drop ESV2


We had (and still have) a similar problem with KafkaIO to
support different versions of Kafka, especially very old version
0.9. We raised this question on user@ and it appears that there
are users who for some reasons still use old Kafka versions. So,
before dropping a support of any ES versions, I’d suggest to ask
it user@ and see if any people will be affected by this.

Yes we can do a survey among users but the question is, should we
support an ES version that is no more supported by Elastic
themselves ?


+1 for asking in the user list. I guess this is more about whether 
users need this specific version that we hope to drop support for. 
Whether we need to support unsupported versions is a more generic 
question that should prob. be addressed in the dev list. (and I 
personally don't think we should unless there's a large enough user 
base for a given version).



2. regarding the user: the aim is to unlock some new
features (listed by Ludovic) and give the user more
flexibility on his request. For that, it requires to
use high level java ES client in place of the low
level REST client (that was used because it is the
only one compatible with all ES versions). We plan
to replace the API (json document in and out) by
more complete standard ES objects that contain de
request logic (insert/update, doc routing etc...)
and the data. There are already IOs like SpannerIO
that use similar objects in input PCollection rather
than pure POJOs.



Won't this be a breaking change for all users ? IMO using
POJOs in PCollections is safer since we have to worry
about changes to the underlying client library API.
Exception would be when underlying client library offers a
backwards compatibility guarantee that we can rely on for
the foreseeable future (for example, BQ TableRow).


Agreed but actually, there will be POJOs in order to abstract
Elasticsearch's version support. The following third point
explains this.


=> indeed it will be a breaking change, hence this email to get
a consensus on that. Also I think our wrappers of ES request
objects will offer a backward compatible as the underlying objects


I just want to recall that, according to what we agreed some time
ago on dev@ (at least for IOs), all breaking user API changes
have to be added along with deprecation of the old API, which can be
removed after 3 consecutive Beam releases. This way, users
will have time to move to the new API smoothly.


We are mostly discussing the target architecture of the new module
here, but the process of deprecation is important to recall, I
agree. When I say the DTOs are backward compatible above, I mean
between per-version sub-modules inside the new module. Anyway, sure,
for some time, both modules (the old REST-based one that supports v2-7
   

Re: Beam Emitted Metrics Reference

2020-03-02 Thread Etienne Chauchot

Hi,

There is a doc about metrics here: 
https://beam.apache.org/documentation/programming-guide/#metrics


You can also export the metrics to sinks (a REST HTTP endpoint and 
Graphite); see the MetricsOptions class for configuration.
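
For illustration, a minimal sketch of declaring a user-defined metric
in a DoFn; the namespace and metric name are illustrative. Such
metrics are what the sinks above push:

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch only: a user-defined counter declared in a DoFn.
// The namespace ("my-transform") and name ("processed") are illustrative.
static class CountingFn extends DoFn<String, String> {
  private final Counter processed = Metrics.counter("my-transform", "processed");

  @ProcessElement
  public void processElement(ProcessContext c) {
    processed.inc(); // incremented per element, aggregated by the runner
    c.output(c.element());
  }
}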


Still, there is no doc for the export on the website; I'll add some

Best

Etienne

On 28/02/2020 01:07, Pablo Estrada wrote:

Hi Daniel!
I think +Alex Amato  had tried to have an 
inventory of metrics at some point.

Other than that, I don't think we have a document outlining them.

Can you talk about what you plan to do with them? Do you plan to 
export them somehow? Do you plan to add your own?

Best
-P.

On Thu, Feb 27, 2020 at 11:33 AM Daniel Chen wrote:


Hi all,

I have some questions about the reference for the framework metrics
emitted by Beam. I would like to leverage these metrics to allow
better monitoring of my Beam jobs, but cannot find any reference
describing the complete set of emitted metrics.

Do we have this information documented anywhere?

Thanks,
Daniel



Re: GroupIntoBatches not Working properly for Direct Runner Java

2020-03-02 Thread Etienne Chauchot

Hi,

+1 to what Kenn asked: your pipeline is in streaming mode and GIB 
preserves windowing; the elements are buffered until one of these 
conditions is true: batch size reached, or end of window. In your case I 
think it is the second one.
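
For illustration, a minimal sketch of the pipeline shape described in
the original report; the topic name and the key extraction are
assumptions:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// Sketch only: the topic name and the key extraction are illustrative.
Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());
pipeline
    .apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
    .apply(MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
        .via(s -> KV.of(s.substring(0, 1), s)))
    // 1-second fixed windows: GroupIntoBatches never buffers across window boundaries.
    .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardSeconds(1))))
    // A batch is emitted when either 5 elements have arrived for a key
    // or the window ends, whichever comes first.
    .apply(GroupIntoBatches.<String, String>ofSize(5));
pipeline.run();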


Best

Etienne

On 28/02/2020 19:15, Kenneth Knowles wrote:

What are the timestamps on the elements?

On Fri, Feb 28, 2020 at 8:36 AM Vasu Gupta wrote:


Edit: Issue is on Direct Runner (not Direction Runner - mistyped)
Issue Details:
Input data: 7 key-value Packets like: a-1, a-4, b-3, c-5, d-1,
e-4, e-5
Batch Size: 5
Expected output: a-1,4, b-3, c-5, d-1, e-4,5
Getting packets with irregular sizes like a-1, b-5, e-4,5 OR a-1,4,
c-5, etc.
But I always get the correct number of packets with BATCH_SIZE = 1

On 2020/02/27 20:40:16, Kenneth Knowles <k...@apache.org> wrote:
> Can you share some more details? What is the expected output and
what
> output are you seeing?
>
> On Thu, Feb 27, 2020 at 9:39 AM Vasu Gupta <dev.vasugu...@gmail.com> wrote:
>
> > Hey folks, I am using Apache beam Framework in Java with
Direction Runner
> > for local testing purposes. When using GroupIntoBatches with
batch size 1
> > it works perfectly fine i.e. the output of the transform is
consistent and
> > as expected. But when using with batch size > 1 the output
Pcollection has
> > less data than it should be.
> >
> > Pipeline flow:
> > 1. A Transform for reading from pubsub
> > 2. Transform for making a KV out of the data
> > 3. A Fixed Window transform of 1 second
> > 4. Applying GroupIntoBatches transform
> > 5. And last, Logging the resulting Iterables.
> >
> > Weird thing is that batch_size > 1 works great when running on
> > DataflowRunner but not with DirectRunner. I think the issue
might be with
> > Timer Expiry since GroupIntoBatches uses BagState internally.
> >
> > Any help will be much appreciated.
> >
>



Re: [ANNOUNCE] New committer: Alex Van Boxel

2020-02-20 Thread Etienne Chauchot

Congrats Alex !

Well deserved !

Etienne

On 20/02/2020 12:23, Michał Walenia wrote:

Congratulations!

On Thu, Feb 20, 2020 at 2:31 AM Chamikara Jayalath <chamik...@google.com> wrote:


Congrats Alex!

On Wed, Feb 19, 2020 at 7:21 AM Ryan Skraba <r...@skraba.com> wrote:

Congratulations Alex!

On Wed, Feb 19, 2020 at 9:52 AM Katarzyna Kucharczyk <ka.kucharc...@gmail.com> wrote:

Great news! Congratulations, Alex! 🎉

On Wed, Feb 19, 2020 at 9:14 AM Reza Rokni <r...@google.com> wrote:

Fantastic news! Congratulations :-)

On Wed, 19 Feb 2020 at 07:54, jincheng sun <sunjincheng...@gmail.com> wrote:

Congratulations!
Best,
Jincheng


Robin Qiu <robi...@google.com> wrote on Wed, Feb 19, 2020 at 05:52:

Congratulations, Alex!

On Tue, Feb 18, 2020 at 1:48 PM Valentyn Tymofieiev <valen...@google.com> wrote:

Congratulations!

On Tue, Feb 18, 2020 at 10:38 AM Alex Van Boxel <a...@vanboxel.be> wrote:

Thank you everyone!

 _/
_/ Alex Van Boxel


On Tue, Feb 18, 2020 at 7:05 PM <je...@seznam.cz> wrote:

Congrats Alex!
Jan


On 18 Feb 2020 at 18:46, Thomas Weise <t...@apache.org> wrote:

Congratulations!


On Tue, Feb 18, 2020 at 8:33 AM Ismaël Mejía <ieme...@gmail.com> wrote:

Congrats Alex! Well done!

On Tue, Feb 18, 2020 at 5:10 PM Gleb Kanterov <g...@spotify.com> wrote:

Congratulations!

On Tue, Feb 18, 2020 at 5:02 PM Brian Hulette <bhule...@google.com> wrote:

Congratulations Alex! Well deserved!

On Tue, Feb 18, 2020 at 7:49 AM Pablo Estrada <pabl...@google.com> wrote:

Hi everyone,

Please join me and the rest of the Beam PMC in welcoming a new
committer: Alex Van Boxel

Alex has contributed to Beam in many ways - as an organizer for Beam
Summit and meetups - and also with the Protobuf extensions for schemas.

In consideration of his contributions, the Beam PMC
 

Re: big data blog

2020-02-13 Thread Etienne Chauchot

Hi all,

I just sent the link to the blog articles on @ApacheBeam twitter as Kenn 
suggested.


Etienne

On 10/02/2020 10:01, Etienne Chauchot wrote:


Yes sure,

Here is the link to the spreadsheet for review of the tweet: 
https://docs.google.com/spreadsheets/d/1mz36njTtn1UJwDF50GdqyZVbX_F0n_A6eMYcxsktpSM/edit#gid=1413052381


thanks all for your encouragement !

Best

Etienne

On 08/02/2020 08:09, Kenneth Knowles wrote:
Nice! Yes, I think we should promote Beam articles that are 
insightful from a longtime contributor.


Etienne - can you add twitter announcements/retweets to the social 
media spreadsheet when you write new articles?


Kenn

On Fri, Feb 7, 2020 at 5:44 PM Ahmet Altay <al...@google.com> wrote:


Cool, thank you. Would it make sense to promote Beam related
posts on our twitter channel?

On Fri, Feb 7, 2020 at 2:47 PM Pablo Estrada <pabl...@google.com> wrote:

Very nice. Thanks for sharing Etienne!

On Fri, Feb 7, 2020 at 2:19 PM Reuven Lax <re...@google.com> wrote:

Cool!

On Fri, Feb 7, 2020 at 7:24 AM Etienne Chauchot <echauc...@apache.org> wrote:

Hi all,

FYI, I just started a blog around big data
technologies and for now it
is focused on Beam.

https://echauchot.blogspot.com/

Feel free to comment, suggest or anything.

Etienne



Re: A new reworked Elasticsearch 7+ IO module

2020-02-13 Thread Etienne Chauchot

Hi Cham, thanks for your comments !

I just sent an email to user ML with a survey link to count ES uses per 
version:


https://lists.apache.org/thread.html/rc8185afb8af86a2a032909c13f569e18bd89e75a5839894d5b5d4082%40%3Cuser.beam.apache.org%3E

Best

Etienne

On 10/02/2020 19:46, Chamikara Jayalath wrote:



On Thu, Feb 6, 2020 at 8:13 AM Etienne Chauchot <echauc...@apache.org> wrote:


Hi,

please see my comments inline

On 06/02/2020 16:24, Alexey Romanenko wrote:

Please, see my comments inline.


On 6 Feb 2020, at 10:50, Etienne Chauchot <echauc...@apache.org> wrote:



1. regarding version support: ES v2 has not been
maintained by Elastic since 2018/02, so we plan to
remove it from the IO. In the past we already retired
versions (like Spark 1.6, for instance).



My only concern here is that there might be users who use
the existing module who might not be able to easily upgrade
the Beam version if we remove it. But given that V2 is 5
versions behind the latest release this might be OK.


It seems we have a consensus on this.
I think there should be another general discussion on the
long-term support of our preferred tools' IO modules.


=> yes, consensus, let's drop ESV2


We had (and still have) a similar problem with KafkaIO to support
different versions of Kafka, especially the very old version 0.9. We
raised this question on user@ and it appears that there are users
who, for various reasons, still use old Kafka versions. So, before
dropping support for any ES versions, I'd suggest asking on
user@ to see if any people will be affected by this.

Yes, we can do a survey among users, but the question is: should we
support an ES version that is no longer supported by Elastic
themselves?


+1 for asking on the user list. I guess this is more about whether 
users need this specific version that we hope to drop support for. 
Whether we need to support unsupported versions is a more generic 
question that should probably be addressed on the dev list (and I 
personally don't think we should, unless there's a large enough user 
base for a given version).



2. regarding the user: the aim is to unlock some new
features (listed by Ludovic) and give the user more
flexibility over their requests. For that, it requires
using the high-level Java ES client in place of the low-level
REST client (which was used because it is the
only one compatible with all ES versions). We plan to
replace the API (JSON documents in and out) with more
complete standard ES objects that contain the request
logic (insert/update, doc routing, etc.) and the
data. There are already IOs like SpannerIO that use
similar objects in the input PCollection rather than pure
POJOs.



Won't this be a breaking change for all users? IMO using
POJOs in PCollections is safer, since otherwise we have to worry
about changes to the underlying client library API. An
exception would be when the underlying client library offers a
backwards-compatibility guarantee that we can rely on for the
foreseeable future (for example, BQ TableRow).


Agreed but actually, there will be POJOs in order to abstract
Elasticsearch's version support. The following third point
explains this.


=> indeed it will be a breaking change, hence this email to get
a consensus on that. Also, I think our wrappers of ES request
objects will be as backward compatible as the underlying objects


I just want to recall that, according to what we agreed some time
ago on dev@ (at least for IOs), all breaking user API changes
have to be added along with deprecation of the old API, which can be
removed after 3 consecutive Beam releases. This way, users
will have time to move to the new API smoothly.


We are mostly discussing the target architecture of the new module
here, but the process of deprecation is important to recall, I
agree. When I say the DTOs are backward compatible above, I mean
between per-version sub-modules inside the new module. Anyway, sure,
for some time, both modules (the old REST-based one that supports v2-7
and the new one that supports v5-7) will coexist, and the old one will
receive the deprecation annotations.


+1 for supporting both versions for at least three minor versions to 
give users time to migrate. Also, we should try to produce a warning 
for users who use the deprecated versions.


Thanks,
Cham

Best

Etienne






Re: big data blog

2020-02-10 Thread Etienne Chauchot

Yes sure,

Here is the link to the spreadsheet for review of the tweet: 
https://docs.google.com/spreadsheets/d/1mz36njTtn1UJwDF50GdqyZVbX_F0n_A6eMYcxsktpSM/edit#gid=1413052381


thanks all for your encouragement !

Best

Etienne

On 08/02/2020 08:09, Kenneth Knowles wrote:
Nice! Yes, I think we should promote Beam articles that are insightful 
from a longtime contributor.


Etienne - can you add twitter announcements/retweets to the social 
media spreadsheet when you write new articles?


Kenn

On Fri, Feb 7, 2020 at 5:44 PM Ahmet Altay <al...@google.com> wrote:


Cool, thank you. Would it make sense to promote Beam related posts
on our twitter channel?

On Fri, Feb 7, 2020 at 2:47 PM Pablo Estrada <pabl...@google.com> wrote:

Very nice. Thanks for sharing Etienne!

On Fri, Feb 7, 2020 at 2:19 PM Reuven Lax <re...@google.com> wrote:

Cool!

On Fri, Feb 7, 2020 at 7:24 AM Etienne Chauchot <echauc...@apache.org> wrote:

Hi all,

FYI, I just started a blog around big data
technologies and for now it
is focused on Beam.

https://echauchot.blogspot.com/

Feel free to comment, suggest or anything.

Etienne



big data blog

2020-02-07 Thread Etienne Chauchot

Hi all,

FYI, I just started a blog around big data technologies and for now it 
is focused on Beam.


https://echauchot.blogspot.com/

Feel free to comment, suggest or anything.

Etienne



Re: A new reworked Elasticsearch 7+ IO module

2020-02-06 Thread Etienne Chauchot

Hi,

please see my comments inline

On 06/02/2020 16:24, Alexey Romanenko wrote:

Please, see my comments inline.

On 6 Feb 2020, at 10:50, Etienne Chauchot <echauc...@apache.org> wrote:



1. regarding version support: ES v2 has not been maintained
by Elastic since 2018/02, so we plan to remove it from the
IO. In the past we already retired versions (like Spark
1.6, for instance).



My only concern here is that there might be users who use the
existing module who might not be able to easily upgrade the Beam
version if we remove it. But given that V2 is 5 versions behind
the latest release this might be OK.


It seems we have a consensus on this.
I think there should be another general discussion on the long-term 
support of our preferred tools' IO modules.


=> yes, consensus, let's drop ESV2

We had (and still have) a similar problem with KafkaIO to support 
different versions of Kafka, especially the very old version 0.9. We 
raised this question on user@ and it appears that there are users who, 
for various reasons, still use old Kafka versions. So, before dropping 
support for any ES versions, I'd suggest asking on user@ to see if any 
people will be affected by this.
Yes, we can do a survey among users, but the question is: should we 
support an ES version that is no longer supported by Elastic themselves?



2. regarding the user: the aim is to unlock some new
features (listed by Ludovic) and give the user more
flexibility over their requests. For that, it requires using
the high-level Java ES client in place of the low-level REST
client (which was used because it is the only one
compatible with all ES versions). We plan to replace the
API (JSON documents in and out) with more complete standard
ES objects that contain the request logic (insert/update,
doc routing, etc.) and the data. There are already IOs
like SpannerIO that use similar objects in the input
PCollection rather than pure POJOs.



Won't this be a breaking change for all users? IMO using POJOs
in PCollections is safer, since otherwise we have to worry about
changes to the underlying client library API. An exception
would be when the underlying client library offers a backwards-
compatibility guarantee that we can rely on for the
foreseeable future (for example, BQ TableRow).


Agreed but actually, there will be POJOs in order to abstract 
Elasticsearch's version support. The following third point explains 
this.


=> indeed it will be a breaking change, hence this email to get a 
consensus on that. Also, I think our wrappers of ES request objects 
will be as backward compatible as the underlying objects


I just want to recall that, according to what we agreed some time ago 
on dev@ (at least for IOs), all breaking user API changes have to be 
added along with deprecation of the old API, which can be removed after 
3 consecutive Beam releases. This way, users will have time to 
move to the new API smoothly.


We are mostly discussing the target architecture of the new module here, 
but the process of deprecation is important to recall, I agree. When I 
say the DTOs are backward compatible above, I mean between per-version 
sub-modules inside the new module. Anyway, sure, for some time, both 
modules (the old REST-based one that supports v2-7 and the new one that 
supports v5-7) will coexist, and the old one will receive the deprecation 
annotations.
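
For illustration, a minimal sketch of what that deprecation marking on
the old module could look like; the javadoc wording is an assumption:

// Sketch only: illustrative deprecation markers for the old REST-based module.
/**
 * @deprecated Use the new per-version Elasticsearch IO module instead.
 *     Kept for 3 consecutive Beam releases before removal, per the dev@ agreement.
 */
@Deprecated
public class ElasticsearchIO {
  // ... existing REST-based implementation kept as-is during the transition ...
}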


Best

Etienne






Re: A new reworked Elasticsearch 7+ IO module

2020-02-06 Thread Etienne Chauchot

Hi,

Thanks all for your comments, my comments are inline

On 06/02/2020 09:47, Ludovic Boutros wrote:

Hi all,

First, thank you all for your answers, and especially Etienne for your 
time, advice and kindness :)

@Jean-Baptiste, any help on this module is welcome of course.

@Chamikara Jayalath, my answers are inline.

Have a good day !

Ludovic

On Wed, Feb 5, 2020 at 20:15, Chamikara Jayalath <chamik...@google.com> wrote:




On Wed, Feb 5, 2020 at 6:35 AM Etienne Chauchot <echauc...@apache.org> wrote:

Still, there is something I don't agree with: that IOs can be
tested with mocks. We don't really test IO behavior with mocks:
there are always special behaviors that cannot be reproduced in
mocks (split, load, corner cases, etc.). There were IOs in
the past that were tested using mocks and that turned out to be
nonfunctional.

Regarding ITests, we have very few compared to UTests, and they
are not as closely observed as UTests.

Etienne

On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:

Hi,

We talked in the past about multiple/single module.

IMHO the preferred goal is always to have a single module.
However, it's tricky when we have such differences, including
in the user-facing API. So, I would go with a module per
version, or use a specified version for a target Beam release.

For the tests, we should distinguish UTests from ITests. UTests
can be done with mocks; the purpose is really to test the IO
behavior. Then we can have ITests using a concrete ES instance.

Anyway, I’m OK with the proposal and I would like to work on
this IO (I have other improvements coming on other IOs
anyway) with you guys (and Ludovic especially).

Regards
JB


On 5 Feb 2020, at 10:44, Etienne Chauchot <echauc...@apache.org> wrote:

Hi all,

We had a long discussion with Ludovic about this IO. I'd
like to share it with you to keep you informed and also gather
your opinions.

1. regarding version support: ES v2 has not been
maintained by Elastic since 2018/02, so we plan to
remove it from the IO. In the past we already retired
versions (like Spark 1.6, for instance).



My only concern here is that there might be users who use the
existing module who might not be able to easily upgrade the Beam
version if we remove it. But given that V2 is 5 versions behind
the latest release this might be OK.


It seems we have a consensus on this.
I think there should be another general discussion on the long-term 
support of our preferred tools' IO modules.


=> yes, consensus, let's drop ESV2


2. regarding the user: the aim is to unlock some new
features (listed by Ludovic) and give the user more
flexibility over their requests. For that, it requires
using the high-level Java ES client in place of the low-level
REST client (which was used because it is the
only one compatible with all ES versions). We plan to
replace the API (JSON documents in and out) with more
complete standard ES objects that contain the request
logic (insert/update, doc routing, etc.) and the data.
There are already IOs like SpannerIO that use similar
objects in the input PCollection rather than pure POJOs.



Won't this be a breaking change for all users? IMO using POJOs in
PCollections is safer, since otherwise we have to worry about changes
to the underlying client library API. An exception would be when the
underlying client library offers a backwards-compatibility guarantee
that we can rely on for the foreseeable future (for example, BQ TableRow).


Agreed but actually, there will be POJOs in order to abstract 
Elasticsearch's version support. The following third point explains this.



=> indeed it will be a breaking change, hence this email to get a 
consensus on that. Also, I think our wrappers of ES request objects will 
be as backward compatible as the underlying objects




3. regarding multiple/single module: the aim is to have only
one production codebase to ease maintenance. The problem is
that using the high-level client makes the code dependent on an
ES lib version. We would like to make this invisible to the
user: they should select only one jar, and the IO should decide
which lib to use behind the scenes. We are thinking about using
one module with sub-modules per version, and using relocation,
wrappers and a factory that detects the version the IO
actually points to in order to instantiate the correct client
version. It would also require having DTOs in the IO
because the high-level ES Java objects are not exactly the
same across the ES versions

Re: A new reworked Elasticsearch 7+ IO module

2020-02-05 Thread Etienne Chauchot
Still, there is something I don't agree with: that IOs can be tested with 
mocks. We don't really test IO behavior with mocks: there are always 
special behaviors that cannot be reproduced in mocks (split, load, 
corner cases, etc.). There were IOs in the past that were tested using 
mocks and that turned out to be nonfunctional.


Regarding ITests, we have very few compared to UTests, and they are not 
as closely observed as UTests.


Etienne

On 05/02/2020 11:32, Jean-Baptiste Onofre wrote:

Hi,

We talked in the past about multiple/single module.

IMHO the preferred goal is always to have a single module. However, 
it's tricky when we have such differences, including in the user-facing 
API. So, I would go with a module per version, or use a specified 
version for a target Beam release.


For the tests, we should distinguish UTests from ITests. UTests can be 
done with mocks; the purpose is really to test the IO behavior. Then we 
can have ITests using a concrete ES instance.


Anyway, I’m OK with the proposal and I would like to work on this IO 
(I have other improvements coming on other IOs anyway) with you guys 
(and Ludovic especially).


Regards
JB

On 5 Feb 2020, at 10:44, Etienne Chauchot <echauc...@apache.org> wrote:


Hi all,

We had a long discussion with Ludovic about this IO. I'd like to 
share it with you to keep you informed and also gather your opinions.


1. regarding version support: ES v2 has not been maintained by Elastic 
since 2018/02, so we plan to remove it from the IO. In the past we 
already retired versions (like Spark 1.6, for instance).


2. regarding the user: the aim is to unlock some new features (listed 
by Ludovic) and give the user more flexibility over their requests. For 
that, it requires using the high-level Java ES client in place of the 
low-level REST client (which was used because it is the only one 
compatible with all ES versions). We plan to replace the API (JSON 
documents in and out) with more complete standard ES objects that 
contain the request logic (insert/update, doc routing, etc.) and the 
data. There are already IOs like SpannerIO that use similar objects 
in the input PCollection rather than pure POJOs.


3. regarding multiple/single module: the aim is to have only one 
production codebase to ease maintenance. The problem is that using the 
high-level client makes the code dependent on an ES lib version. We 
would like to make this invisible to the user: they should select only 
one jar, and the IO should decide which lib to use behind the scenes. We 
are thinking about using one module with sub-modules per version, and 
using relocation, wrappers and a factory that detects the version the 
IO actually points to in order to instantiate the correct client version. 
It would also require having DTOs in the IO because the high-level ES 
Java objects are not exactly the same across the ES versions.


4. regarding tests: the aim is always to target real ES backends to 
have relevant tests (for reasons I already explained in another 
thread). The problem is that the es-test-framework used today is 
version-dependent and a pain to use. We plan on using test containers 
per version (an approach validated by an ES dev advocate) and launching 
them as part of the UTests. Obviously we will launch only one container 
at a time per version and run all the tests with it, to avoid paying 
the launch cost too often. And the tests will be shipped in per-version 
sub-modules and not in dedicated test modules as they are now.


WDYT ?

Best !

Etienne

On 30/01/2020 17:55, Alexey Romanenko wrote:
I second this question. We have a similar (maybe a bit less 
painful) issue for KafkaIO, and it would be useful to have a general 
strategy for how to deal with such cases.


On 24 Jan 2020, at 21:54, Kenneth Knowles <k...@apache.org> wrote:


Would it make sense to have different version-specialized 
connectors with a common core library and common API package?


On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath <chamik...@google.com> wrote:


Thanks for the contribution. I agree with Alexey that we should
try to add any new features brought in with the new PR into the
existing connector instead of trying to maintain two
implementations.

Thanks,
Cham

On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko <aromanenko@gmail.com> wrote:

Hi Ludovic,

Thank you for working on this and sharing the details with
us. This is really a great job!

As I recall, we already have some support for Elasticsearch 7
in the current ElasticsearchIO (afaik, at least they are
compatible), thanks to Zhong Chen and Etienne Chauchot, who
worked on adding this [1][2]; it should be
released in Beam 2.19.

Do you think you can leverage this in your work on
adding new Elasticsearch 7 features? IMHO, supporting two
different related IOs can be quite a tough task and 

Re: A new reworked Elasticsearch 7+ IO module

2020-02-05 Thread Etienne Chauchot

Hi all,

We had a long discussion with Ludovic about this IO. I'd like to share 
with you to keep you informed and also gather your opinions


1. regarding version support: ES v2 has not been maintained by Elastic 
since 2018/02, so we plan to remove it from the IO. In the past we 
already retired versions (like Spark 1.6, for instance).


2. regarding the user: the aim is to unlock some new features (listed by 
Ludovic) and give the user more flexibility over their requests. For that, 
it requires using the high-level Java ES client in place of the low-level 
REST client (which was used because it is the only one compatible with all 
ES versions). We plan to replace the API (JSON documents in and out) with 
more complete standard ES objects that contain the request logic 
(insert/update, doc routing, etc.) and the data. There are already IOs 
like SpannerIO that use similar objects in the input PCollection rather 
than pure POJOs.
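
For illustration, the kind of request-bearing elements this could mean,
using the ES 7 high-level client's request objects instead of raw JSON
strings; index names, ids and payloads are made up:

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.common.xcontent.XContentType;

// Sketch only: ES 7 high-level client request objects that carry both
// the data and the request logic (operation type, routing, ...).
IndexRequest insert = new IndexRequest("my-index")
    .id("doc-1")
    .routing("shard-key") // doc routing travels with the request
    .source("{\"field\":\"value\"}", XContentType.JSON);

UpdateRequest update = new UpdateRequest("my-index", "doc-2")
    .doc("{\"field\":\"new-value\"}", XContentType.JSON);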


3. regarding multiple/single module: the aim is to have only one 
production codebase to ease maintenance. The problem is that using the 
high-level client makes the code dependent on an ES lib version. We would 
like to make this invisible to the user: they should select only one jar, 
and the IO should decide which lib to use behind the scenes. We are 
thinking about using one module with sub-modules per version, and using 
relocation, wrappers and a factory that detects the version the IO 
actually points to in order to instantiate the correct client version. It 
would also require having DTOs in the IO because the high-level ES Java 
objects are not exactly the same across the ES versions.
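
A minimal sketch of such a version-detecting factory; the wrapper types
(EsClientWrapper, Es5/6/7ClientWrapper) and the parseMajorVersion helper
are hypothetical, and only the low-level RestClient calls are real ES
client API:

import java.io.IOException;
import java.util.List;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

// Hypothetical wrapper hierarchy: one implementation per ES major version,
// each living in its own sub-module and shading its own client lib.
interface EsClientWrapper {
  void bulkWrite(List<String> requests) throws IOException;
}

class EsClientWrapperFactory {
  static EsClientWrapper forCluster(RestClient lowLevelClient) throws IOException {
    // The root endpoint reports e.g. {"version":{"number":"7.9.3", ...}}.
    Response response = lowLevelClient.performRequest(new Request("GET", "/"));
    int major = parseMajorVersion(EntityUtils.toString(response.getEntity()));
    switch (major) {
      case 7: return new Es7ClientWrapper(lowLevelClient); // hypothetical
      case 6: return new Es6ClientWrapper(lowLevelClient); // hypothetical
      case 5: return new Es5ClientWrapper(lowLevelClient); // hypothetical
      default:
        throw new IllegalArgumentException("Unsupported ES major version: " + major);
    }
  }
}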


4. regarding tests: the aim is always to target real ES backends to have 
relevant tests (for reasons I already explained in another thread). The 
problem is that the es-test-framework used today is version-dependent and 
a pain to use. We plan on using test containers per version (an approach 
validated by an ES dev advocate) and launching them as part of the UTests. 
Obviously we will launch only one container at a time per version and run 
all the tests with it, to avoid paying the launch cost too often. And the 
tests will be shipped in per-version sub-modules and not in dedicated test 
modules as they are now.
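
For illustration, a minimal sketch of that per-version container setup
with Testcontainers; the image tag is illustrative (Testcontainers'
MIT license is category A, so it can be used in ASF projects):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.testcontainers.elasticsearch.ElasticsearchContainer;

// Sketch only: one real ES container per version sub-module, reused
// across all of that sub-module's tests to amortize the startup cost.
try (ElasticsearchContainer es =
    new ElasticsearchContainer("docker.elastic.co/elasticsearch/elasticsearch:7.9.3")) {
  es.start();
  RestClient client =
      RestClient.builder(HttpHost.create(es.getHttpHostAddress())).build();
  // ... run the IO tests against this real backend ...
}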


WDYT ?

Best !

Etienne

On 30/01/2020 17:55, Alexey Romanenko wrote:
I second this question. We have a similar (maybe a bit less 
painful) issue for KafkaIO, and it would be useful to have a general 
strategy for how to deal with such cases.


On 24 Jan 2020, at 21:54, Kenneth Knowles <k...@apache.org> wrote:


Would it make sense to have different version-specialized connectors 
with a common core library and common API package?


On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath <chamik...@google.com> wrote:


Thanks for the contribution. I agree with Alexey that we should
try to add any new features brought in with the new PR into the
existing connector instead of trying to maintain two
implementations.

Thanks,
Cham

On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko <aromanenko@gmail.com> wrote:

Hi Ludovic,

Thank you for working on this and sharing the details with
us. This is really a great job!

As I recall, we already have some support for Elasticsearch 7
in the current ElasticsearchIO (afaik, at least they are
compatible), thanks to Zhong Chen and Etienne Chauchot, who
worked on adding this [1][2]; it should be released
in Beam 2.19.

Do you think you can leverage this in your work on adding
new Elasticsearch 7 features? IMHO, supporting two different
related IOs can be quite a tough task, and I'd rather raise my
hand to add new functionality into the existing IO than
create a new one, if possible.

[1] https://issues.apache.org/jira/browse/BEAM-5192
[2] https://github.com/apache/beam/pull/10433


On 22 Jan 2020, at 19:23, Ludovic Boutros <boutr...@gmail.com> wrote:

Dear all,

I have written a completely reworked Elasticsearch 7+ IO module.
It can be found here:

https://github.com/ludovic-boutros/beam/tree/fresh-reworked-elasticsearch-io-v7/sdks/java/io/elasticsearch7

This is quite advanced WIP work, but I'm quite a new user of
Apache Beam and I would like to get some help on this :)

I can create a JIRA issue now, but I prefer to wait for your
wise advice first.

_Why a new module ?_

The current module was compliant with Elasticsearch 2.x, 5.x
and 6.x. This seems like a good point, but so many things
have changed since Elasticsearch 2.x.



Probably this is not correct anymore due to
https://github.com/apache/beam/pull/10433 ?


Elasticsearch 7.x is now only partially supported (document types
are removed, OCC, updates...).

A fresh new

Re: A new reworked Elasticsearch 7+ IO module

2020-01-30 Thread Etienne Chauchot

Hi Ludovic,

First of all thanks for your work.

Then, please be aware that the current ES IO on master supports ES7 
already and will be part of Beam 2.19.


I understand that your approach enables many new features, which is great!

For the record, the current ES module was designed to have only one 
production codebase (only the test modules differ among the versions 
because they use an embedded ES):


- there are some if(version) branches, but not that many.

- we use the low-level REST client because it is the only one that is 
compatible with all the versions of ES.


My only concern is reducing the maintenance burden. Having two modules, 
(v2, v5, v6, v7) and (v7), looks difficult to maintain. It can be 
done for a limited period of time by only doing bug fixes on the first 
module as you suggest, but pretty quickly we would need to have only 
one module. You mentioned that you tried supporting these new features 
in the current module (which would not be maintainable due to using 
different ES classes), but have you tried supporting v2, v5, v6 and v7 of 
ES in your new module with the high-level client, to enable the new 
features and still support all the versions with one production 
codebase? Would that be feasible without a spaghetti plate? Because at 
some point, as you mention, we will end up retiring the old module (at a 
major version) and thus we would still need to support older versions of ES.


Regarding the question on MIT license, it is a category A license that 
can be included in ASF projects.


Best,

Etienne


On 25/01/2020 14:23, Ludovic Boutros wrote:

Hi all,

First, thank you for your great answers.
I thank Zhong Chen and Etienne Chauchot for their great job on this too !

Alexey and Chamikara, I understand your point of view.
Actually, I have the same as much as possible.

But in this case, my goal was to be able to do all the following 
things in a Beam pipeline with Elasticsearch:


- be compliant with Elasticsearch 7.x (as the current one now) ;
- be able to retrieve errors in order to do some processing on them ;
- be able to (atomic) update (with scripts) or delete documents ;
- be able to manage Optimistic Concurrency Control ;
- be able to manage document versioning ;
- be able to test the module easily, even with SSL and so on (I'm 
using Testcontainers, is the MIT Licence compliant with the Apache one 
?) ;

- and even more.

I can assure you that I first tried to implement all these functions 
in the current module, but, because it tries to be compliant with 
Elasticsearch 2.x-7.x (already a big challenge ;)), the result 
would have been like a spaghetti plate with quite a lot of 
"if-then-else" blocks everywhere.


And finally, I really don't think this is the way to go.

I'm playing a lot with another Apache project, Apache Camel.
And I think Apache Camel is the Master of Component Management :)

What they're doing in this case is exactly what I would like to 
achieve here: they keep the old one and implement a new one which 
supports newer versions, precisely in order to keep the module 
maintainable.


Both modules support Elasticsearch 7.x, but Elasticsearch 8 will be 
released in a few months with document type removal and more breaking 
changes. It will become harder and harder to maintain.


What I would like to propose is:
- as Kenneth said, we can keep both modules (and I think Elasticsearch 
deserves it ;)) ;
- current users can keep using the old one and migrate (or not) to 
the new one ;
- evolutions and migrations should be done on the new one, which 
directly uses the Elasticsearch classes ;

- only bug fixes are done on the old one.

Apache Camel switches modules only much later, basically only 
on major releases (you can check the release notes of the 3.0 release; 
see the MongoDB part).


And again, this is a work in progress; I will gladly keep 
improving it (more documentation, for instance).
I'm really happy to share this with you, and I will take 
each remark into account.


Thank you again and have a good week-end,

Ludovic.




On Fri, Jan 24, 2020 at 21:54, Kenneth Knowles <k...@apache.org> wrote:


Would it make sense to have different version-specialized
connectors with a common core library and common API package?

On Fri, Jan 24, 2020 at 11:52 AM Chamikara Jayalath <chamik...@google.com> wrote:

Thanks for the contribution. I agree with Alexey that we
should try to add any new features brought in with the new PR
into the existing connector instead of trying to maintain two
implementations.

Thanks,
Cham

On Fri, Jan 24, 2020 at 9:01 AM Alexey Romanenko <aromanenko@gmail.com> wrote:

Hi Ludovic,

Thank you for working on this and sharing the details with
us. This is really a great job!

As I recall, we alrea

[Spark Structured Streaming runner] perfs and encoders

2019-12-23 Thread Etienne Chauchot

Hi all,

good news !

I did some refactoring of the encoders to improve maintainability and 
replace as much string-generated code as possible with compiled code, 
and the perf results are awesome!


Best

Etienne




Re: Is org.apache.beam.sdk.transforms.FlattenTest.testFlattenMultipleCoders supposed to be supported ?

2019-12-11 Thread Etienne Chauchot

Ok,

Thanks Kenn.

The Flatten javadoc says that by default the coder of the output should 
be the coder of the first input. But in the test, it sets the output 
coder to something different. While waiting for a consensus on this model 
point and a common implementation in the runners, I'll just exclude this 
test as other runners do.


Etienne

On 11/12/2019 04:46, Kenneth Knowles wrote:
It is a good point. Nullable(VarLong) and VarLong are two different 
types, with least upper bound that is Nullable(VarLong). BigEndianLong 
and VarLong are two different types, with no least upper bound in the 
"coders" type system. Yet we understand that the values they encode 
are equal. I do not think it is clearly formalized anywhere what the 
rules are (corollary: they have not been thought about carefully).


I think both possibilities are reasonable:

1. Make the rule that Flatten only accepts inputs with identical 
coders. This will be sometimes annoying, requiring vacuous "re-encode" 
noop ParDos (they will be fused away on maybe all runners).
2. Define types as the domain of values, and Flatten accepts sets of 
PCollections with the same domain of values. Runners must "do whatever 
it takes" to respect the coders on the collection.
2a. For very simple cases, Flatten takes the least upper bound of the 
input types. The output coder of Flatten has to be this least upper 
bound. For example, a non-nullable output coder would be an error.


Very interesting and nuanced problem. Flatten just became quite an 
interesting transform, for me :-)


Kenn

On Tue, Dec 10, 2019 at 12:37 AM Etienne Chauchot <echauc...@apache.org> wrote:


Hi all,

I have a question about the testFlattenMultipleCoders test:

This test uses 2 collections

1. long and null data encoded using NullableCoder(BigEndianLongCoder)

2. long data encoded using VarlongCoder

It then flattens the 2 collections and sets the coder of the resulting
collection to NullableCoder(VarlongCoder)

Most runners translate flatten as a simple union of the 2 PCollections
without any re-encoding. As a result, all the runners exclude this test
from the test set because of coder issues. For example, Flink raises an
exception if the type of elements in PCollection1 is different from the
type of elements in PCollection2 in the flatten translation. By contrast,
the direct runner and the spark (RDD-based) runner do not exclude this
test, simply because they don't need to serialize elements, so they
don't even call the coders.

That means that having an output PCollection of the flatten with
heterogeneous coders is not really tested, so it is not really
supported.

Should we drop this test case (which is effectively exercised by no
runner), or should we force each runner to re-encode?

Best

Etienne





Is org.apache.beam.sdk.transforms.FlattenTest.testFlattenMultipleCoders supposed to be supported ?

2019-12-10 Thread Etienne Chauchot

Hi all,

I have a question about the testFlattenMultipleCoders test:

This test uses 2 collections

1. long and null data encoded using NullableCoder(BigEndianLongCoder)

2. long data encoded using VarlongCoder

It then flattens the 2 collections and sets the coder of the resulting 
collection to NullableCoder(VarlongCoder).
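
A minimal sketch of that scenario, assuming p is the test pipeline
(the values are illustrative):

import org.apache.beam.sdk.coders.BigEndianLongCoder;
import org.apache.beam.sdk.coders.NullableCoder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// Sketch only: the two inputs use different coders, and the output
// coder differs from both.
PCollection<Long> nullableLongs = p.apply("WithNulls",
    Create.of(1L, (Long) null)
        .withCoder(NullableCoder.of(BigEndianLongCoder.of())));
PCollection<Long> plainLongs = p.apply("Plain",
    Create.of(2L, 3L).withCoder(VarLongCoder.of()));

PCollection<Long> flattened = PCollectionList.of(nullableLongs).and(plainLongs)
    .apply(Flatten.pCollections());
flattened.setCoder(NullableCoder.of(VarLongCoder.of()));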


Most runners translate flatten as a simple union of the 2 PCollections 
without any re-encoding. As a result, all the runners exclude this test 
from the test set because of coder issues. For example, Flink raises an 
exception if the type of elements in PCollection1 is different from the 
type of elements in PCollection2 in the flatten translation. By contrast, 
the direct runner and the spark (RDD-based) runner do not exclude this 
test, simply because they don't need to serialize elements, so they don't 
even call the coders.


That means that having an output PCollection of the flatten with 
heterogeneous coders is not really tested, so it is not really supported.


Should we drop this test case (which is effectively exercised by no 
runner), or should we force each runner to re-encode?


Best

Etienne





Re: concurrent PRs

2019-11-28 Thread Etienne Chauchot

Hi all,

FYI, I closed the most recent one (with explanation and a sorry message):

https://github.com/apache/beam/pull/10025

Etienne

On 26/11/2019 17:06, Robert Bradshaw wrote:

On Tue, Nov 26, 2019 at 6:15 AM Etienne Chauchot  wrote:

Hi guys,

I wanted your opinion about something:

I have 2 concurrent PRs that do the same thing:

https://github.com/apache/beam/pull/10010

https://github.com/apache/beam/pull/10025

The first one is a bit better because it addresses a deprecation that
the other does not address. Other than that, they are the same. The first
one is the older (by one day), but the second one is the one that received
reviews.

I guess the problem is that there were 3 duplicate tickets for the
Elasticsearch 7 upgrade (because people do not search for existing
tickets before opening new ones). As a result, concurrent PRs were
submitted despite the PR link on JIRA. I removed the duplicates but I
need to close one of the PRs.

The question is: which one do you think should be closed?

Are there (summary) pros and cons that you're looking for feedback on?
Otherwise, I think you could make the call. (It's a good reminder to
search for issues on JIRA before filing a new one, though.)


concurrent PRs

2019-11-26 Thread Etienne Chauchot

Hi guys,

I wanted your opinion about something:

I have 2 concurrent PRs that do the same thing:

https://github.com/apache/beam/pull/10010

https://github.com/apache/beam/pull/10025

The first one is a bit better because it addresses a deprecation that 
the other does not address. Other than that, they are the same. The first 
one is the older (by one day), but the second one is the one that received 
reviews.


I guess the problem is that there were 3 duplicate tickets for the 
Elasticsearch 7 upgrade (because people do not search for existing 
tickets before opening new ones). As a result, concurrent PRs were 
submitted despite the PR link on JIRA. I removed the duplicates but I 
need to close one of the PRs.


The question is: which one do you think should be closed?

Thanks for your opinions, guys

Etienne





Re: [spark structured streaming runner] available on master

2019-11-20 Thread Etienne Chauchot
Forgot to say: thanks to everyone for their contributions to this, 
especially Alexey, Ryan and Ismael.


Etienne

On 20/11/2019 17:12, Etienne Chauchot wrote:

Hi all,

I'm glad to announce that the new Spark runner based on the Spark 
Structured Streaming framework has been merged into master!


It is not based on RDD/DStream API. See 
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html


It is still experimental, its coverage of the Beam model is partial:

- the runner passes 95% of the ValidatesRunner tests in batch mode.

- It does not have support for streaming yet (waiting for 
multi-aggregation support in the Spark Structured Streaming framework 
from the Spark community)


- The runner can execute Nexmark; PerfKit dashboards are yet to come

- Some things are not wired up yet:

    - Beam Schemas not wired up

    - Optional features of the model not implemented: state API, 
timer API, splittable DoFn API, …


I will submit a PR to update the capability matrix in the coming days.

Best

Etienne




[spark structured streaming runner] available on master

2019-11-20 Thread Etienne Chauchot

Hi all,

I'm glad to announce that the new Spark runner based on the Spark 
Structured Streaming framework has been merged into master!


It is not based on RDD/DStream API. See 
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html


It is still experimental, its coverage of the Beam model is partial:

- the runner passes 95% of the ValidatesRunner tests in batch mode.

- It does not have support for streaming yet (waiting for 
multi-aggregation support in the Spark Structured Streaming framework 
from the Spark community)


- The runner can execute Nexmark; PerfKit dashboards are yet to come

- Some things are not wired up yet:

    - Beam Schemas not wired up

    - Optional features of the model not implemented: state API, timer 
API, splittable DoFn API, …


I will submit a PR to update the capability matrix in the coming days.

Best

Etienne




Re: [spark structured streaming runner] merge to master?

2019-11-13 Thread Etienne Chauchot

OK for 1 jar with the 2 runners then.
I'll add the banner to the logs and the Experimental annotation in the 
code and in the javadocs.
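
For illustration, a minimal sketch of what such a banner could look
like; the wording and placement are assumptions, not actual runner code:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: an experimental-status warning the runner could log at startup.
class ExperimentalBanner {
  private static final Logger LOG = LoggerFactory.getLogger(ExperimentalBanner.class);

  static void logExperimentalWarning() {
    LOG.warn(
        "The SparkStructuredStreamingRunner is experimental: batch coverage is"
            + " partial and streaming is not supported yet. Consider the stable"
            + " SparkRunner for production pipelines.");
  }
}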


Thanks for your opinions guys !

Etienne

On 08/11/2019 18:50, Kenneth Knowles wrote:
On Thu, Nov 7, 2019 at 5:32 PM Etienne Chauchot <echauc...@apache.org> wrote:

>
> Hi guys
>
> @Kenn,
>
> I just wanted to mention that I did answer your question on 
dependencies here: 
https://lists.apache.org/thread.html/5a85caac41e796c2aa351d835b3483808ebbbd4512b480940d494439@%3Cdev.beam.apache.org%3E


Ah, sorry! In that case there is no problem at all.


> I'm not in favor of having the 2 runners in one jar, the point about 
having 2 jars was to:

>
> - avoid making promises to users on a work-in-progress runner (make 
it explicit with a different jar)

> - avoid confusion for them (why are there 2 pipeline options? etc)
>
> If the community believes that there is no confusion or wrong 
promises with the one jar solution, we could leave the 2 runners in 
one jar.

>
> Maybe we could start a vote on that?

It seems unanimous among others to have one jar. There were some 
suggestions of how to avoid promises and confusion, like Ryan's most 
recent email. Did any of the ideas sound good to you?


Kenn


I have no objection to putting the experimental runner alongside the
stable, mature runner. We have some precedent with the portable
Spark runner, and that's worked out pretty well -- at least, I haven't
heard any complaints from confused users!

That being said:

1.  It really should be marked @Experimental in the code *and* clearly
warned in API (javadoc) and documentation.

2.  Ideally, I'd like to see a warning banner in the logs when it's
used, pointing to the stable SparkRunner and/or documentation on the
current known issues.

All my best, Ryan






> regarding jars:
>
> I don't like 3 jars either.
>
>
> Etienne
>
> On 31/10/2019 02:06, Kenneth Knowles wrote:
>
> Very good points. We definitely ship a lot of code/features in
very early stages, and there seems to be no problem.
>
> I intend mostly to leave this judgment to people like you who
know better about Spark users.
>
> But I do think 1 or 2 jars is better than 3. I really don't like
"3 jars" and I did give two reasons:
>
> 1. diamond deps where things overlap
> 2. figuring out which thing to depend on
>
> Both are annoying for users. I am not certain if it could lead
to a real unsolvable situation. This is just a Java ecosystem
problem so I feel qualified to comment.
>
> I did also ask if there were major dependency differences
between the two that could cause problems for users. This question
was dropped and no one cared to comment, so I assume it is not an
issue. So then I favor having just 1 jar with both runners.
>
> Kenn
>
> On Wed, Oct 30, 2019 at 2:46 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>
>> I am still a bit lost about why we are discussing options
without giving any
>> arguments or reasons for the options? Why is 2 modules better
than 3 or 3 better
>> than 2, or even better, what forces us to have something
different than a single
>> module?
>>
>> What are the reasons for wanting to have separate jars? If the
issue is that the
>> code is unfinished or not passing the tests, the impact for end
users is minimal
>> because they cannot accidentally end up running the new runner,
and if they
>> decide to do so we can warn them it is at their own risk and
not ready for
>> production in the documentation + runner.
>>
>> If the fear is that new code may end up being intertwined with
the classic and
>> portable runners and have some side effects. We have the
ValidatesRunner +
>> Nexmark in the CI to cover this so again I do not see what is
the problem that
>> requires modules to be separate.
>>
>> If the issue is being uncomfortable about having in-progress
code in released
>> artifacts we have been doing this in Beam forever, for example
most of the work
>> on portability and Schema/SQL, and all of those were still part
of artifacts
>> long time before they were ready for prime use, so I still
don't see why this
>> case is different to require different artifacts.
>>
>> I have the impression we are trying to solve a non-issue by
adding a lot of
>> artificial complexity (in particular to the users), or am I
missing somethin

Re: [spark structured streaming runner] merge to master?

2019-11-07 Thread Etienne Chauchot

Hi guys

@Kenn,

I just wanted to mention that I did answer your question on 
dependencies here: 
https://lists.apache.org/thread.html/5a85caac41e796c2aa351d835b3483808ebbbd4512b480940d494439@%3Cdev.beam.apache.org%3E


regarding jars:

I don't like 3 jars either.

I'm not in favor of having the 2 runners in one jar; the point about 
having 2 jars was to:


- avoid making promises to users on a work-in-progress runner (make it 
explicit with a different jar)


- avoid confusion for them (why are there 2 pipeline options? etc)

If the community believes that there is no confusion or wrong promises 
with the one jar solution, we could leave the 2 runners in one jar.


Maybe we could start a vote on that?

Etienne

On 31/10/2019 02:06, Kenneth Knowles wrote:
Very good points. We definitely ship a lot of code/features in very 
early stages, and there seems to be no problem.


I intend mostly to leave this judgment to people like you who know 
better about Spark users.


But I do think 1 or 2 jars is better than 3. I really don't like "3 
jars" and I did give two reasons:


1. diamond deps where things overlap
2. figuring out which thing to depend on

Both are annoying for users. I am not certain if it could lead to a 
real unsolvable situation. This is just a Java ecosystem problem so I 
feel qualified to comment.


I did also ask if there were major dependency differences between the 
two that could cause problems for users. This question was dropped and 
no one cared to comment, so I assume it is not an issue. So then I 
favor having just 1 jar with both runners.


Kenn

On Wed, Oct 30, 2019 at 2:46 PM Ismaël Mejía <ieme...@gmail.com> wrote:


I am still a bit lost about why we are discussing options without
giving any
arguments or reasons for the options? Why is 2 modules better than
3 or 3 better
than 2, or even better, what forces us to have something different
than a single
module?

What are the reasons for wanting to have separate jars? If the
issue is that the
code is unfinished or not passing the tests, the impact for end
users is minimal
because they cannot accidentally end up running the new runner,
and if they
decide to do so we can warn them it is at their own risk and not
ready for
production in the documentation + runner.

If the fear is that new code may end up being intertwined with the
classic and
portable runners and have some side effects. We have the
ValidatesRunner +
Nexmark in the CI to cover this so again I do not see what is the
problem that
requires modules to be separate.

If the issue is being uncomfortable about having in-progress code
in released
artifacts we have been doing this in Beam forever, for example
most of the work
on portability and Schema/SQL, and all of those were still part of
artifacts
long time before they were ready for prime use, so I still don't
see why this
case is different to require different artifacts.

I have the impression we are trying to solve a non-issue by adding
a lot of
artificial complexity (in particular to the users), or am I
missing something
else?

On Wed, Oct 30, 2019 at 7:40 PM Kenneth Knowles <k...@apache.org> wrote:
>
> Oh, I mean that we ship just 2 jars.
>
> And since Spark users always build an uber jar, they can still
depend on both of ours and be able to switch runners with a flag.
>
> I really dislike projects shipping overlapping jars. It is
confusing and causes major diamond dependency problems.
>
> Kenn
>
> On Wed, Oct 30, 2019 at 11:12 AM Alexey Romanenko <aromanenko@gmail.com> wrote:
>>
>> Yes, agree, two jars included in uber jar will work in the
similar way. Though having 3 jars looks still quite confusing for me.
>>
>> On 29 Oct 2019, at 23:54, Kenneth Knowles <k...@apache.org> wrote:
>>
>> Is it just as easy to have two jars and build an uber jar with
both included? Then the runner can still be toggled with a flag.
>>
>> Kenn
>>
>> On Tue, Oct 29, 2019 at 9:38 AM Alexey Romanenko <aromanenko@gmail.com> wrote:
>>>
>>> Hmm, I don’t think that jar size should play a big role
comparing to the whole size of shaded jar of users job. Even more,
I think it will be quite confusing for users to choose which jar
to use if we will have 3 different ones for similar purposes.
Though, let’s see what others think.
>>>
>>> On 29 Oct 2019, at 15:32, Etienne Chauchot <echauc...@apache.org> wrote:
>>>
>>> Hi Alexey,
>>>
>>> Thanks for your opinion !
>

Re: [spark structured streaming runner] merge to master?

2019-10-29 Thread Etienne Chauchot

Hi Alexey,

Thanks for your opinion !

Comments inline

Etienne

On 28/10/2019 17:34, Alexey Romanenko wrote:

Let me share some of my thoughts on this.


    - shall we filter out the package name from the release?

As long as the new runner is not ready to be used in production (or, at 
least, ready to be used for beta testing, in which case users should be 
clearly warned about that), I believe we need to filter out its classes 
from the published jar to avoid confusion.

Yes that is what I think also


    - should we release 2 jars: one for the old and one for the new ?

    - should we release 3 jars: one for the old, one for the new and 
one for both ?


Once the new runner is released, I think we need to provide only 
one single jar and allow users to switch between the different Spark 
runners with a CLI option.
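
A minimal sketch of that switch, assuming both runners end up in the
same beam-runners-spark artifact (class names as in the Beam code base;
args is the main() argument array):

import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.runners.spark.structuredstreaming.SparkStructuredStreamingRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Sketch only: the runner is just a pipeline option, so one jar can carry both.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
options.setRunner(SparkRunner.class); // classic RDD/DStream-based runner
// options.setRunner(SparkStructuredStreamingRunner.class); // new runner
Pipeline pipeline = Pipeline.create(options);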


I would vote for 3 jars: one for the new, one for the old, and one for 
both. Indeed, in some cases, users look very closely at the size of 
jars. This solution meets all use cases.



    - should we create a special entry to the capability matrix ?

Sure, since it has its own unique characteristics and implementation, 
but again, only once the new runner is "officially released".

+1



On 28 Oct 2019, at 10:27, Etienne Chauchot <echauc...@apache.org> wrote:


Hi guys,

Any opinions on the point2 communication to users ?

Etienne

On 24/10/2019 15:44, Etienne Chauchot wrote:


Hi guys,

I'm glad to announce that the PR for the merge to master of the new 
runner based on the Spark Structured Streaming framework has been 
submitted:

https://github.com/apache/beam/pull/9866


1. Regarding the status of the runner:

-the runner passes 93% of the ValidatesRunner tests in batch mode.

-Streaming mode is barely started (waiting for 
multi-aggregation support in the Spark Structured Streaming framework 
from the Spark community)


-Runner can execute Nexmark

-Some things are not wired up yet

  -Beam Schemas not wired with Spark Schemas

  -Optional features of the model not implemented: state API, timer 
API, splittable DoFn API, …



2. Regarding the communication to users:

- for reasons explained by Ismael: the runner is in the same module 
as the "older" one. But it is in a different sub-package and both 
runners share the same build.


- How should we communicate to users:

    - shall we filter out the package name from the release?

    - should we release 2 jars: one for the old and one for the new ?

    - should we release 3 jars: one for the old, one for the new and 
one for both ?


    - should we create a special entry to the capability matrix ?

WDYT ?

Best

Etienne


On 23/10/2019 19:11, Mikhail Gryzykhin wrote:

+1 to merge.

It is worth keeping things in master with an explicitly marked status. 
It will make the effort more visible to users and easier to get 
feedback on.


--Mikhail

On Wed, Oct 23, 2019 at 8:36 AM Etienne Chauchot <echauc...@apache.org> wrote:


Hi guys,

The new Spark runner now supports Beam coders and passes 93% of
the batch ValidatesRunner tests (+4%). I think it is time to
merge it to master. I will submit a PR in the coming days.

Next steps: support schemas and thus better leverage the Catalyst
optimizer (among other things, optimizations based on data), and port
the performance optimizations that were done in the current runner.

Best

Etienne

On 11/10/2019 22:48, Pablo Estrada wrote:

+1 for merging : )

On Fri, Oct 11, 2019 at 12:43 PM Robert Bradshaw <rober...@google.com> wrote:

Sounds like a good plan to me.

    On Fri, Oct 11, 2019 at 6:20 AM Etienne Chauchot <echauc...@apache.org> wrote:

Comments inline

On 10/10/2019 23:44, Ismaël Mejía wrote:

+1

The earlier we get to master the better, to encourage not only code
contributions but, just as importantly, early user feedback.


The question is: do we keep the "old" Spark runner for a while or not 
(or just keep it at a previous version/tag in git)?

It is still too early to even start discussing when to remove the
classical runner, given that the new runner is still a WIP. However,
the overall goal is that this runner becomes the de-facto one once the
VR tests and the performance become at least equal to the classical
runner. In the meantime, the best for users is that they co-exist;
let's not forget that the other runner has already been battle tested
for more than 3 years and has had lots of improvements in the last
year.


+1 on what Ismael says: no removal soon.

The plan I had in mind at first (the one I showed at ApacheCon) was this, 
but I'm proposing to move the first gray label to before the red box.

[inline image of the merge plan not preserved in the archive]


Re: [spark structured streaming runner] merge to master?

2019-10-11 Thread Etienne Chauchot

Hi Kenn,

Comments inline

On 10/10/2019 21:02, Kenneth Knowles wrote:

+1

I think our experiences with things that go to master early have been 
very good. So I am in favor ASAP. We can exclude it from releases 
easily until it is ready for end users.
I have mixed emotions on exclusion: I'd like users to try it for 
feedback, but I don't want to confuse them or make promises on feature 
completeness. How do you think we should proceed?


I have the same question as Robert - how much is modifications and how 
much is new? I notice it is in a subdirectory of the 
beam-runners-spark module.

cf my answers to Robert.


I did not see any major changes to dependencies but I will also ask if 
it has major version differences so that you might want a separate 
artifact?


There is no deps change at all; the 2 runners are in sync on 
deps/versions. The trick is that Spark 2.4.x provides both RDD/DStream 
and Structured Streaming.



Etienne



Kenn

On Thu, Oct 10, 2019 at 11:50 AM Robert Bradshaw wrote:


On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot wrote:
>
> Hi guys,
>
> You probably know that there has been work ongoing for several months
> developing a new Spark runner based on the Spark Structured Streaming
> framework. This work is located in a feature branch here:
> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>
> To attract more contributors and get some user feedback, we think it is
> time to merge it to master. Before doing so, some steps need to be achieved:
>
> - finish the work on spark Encoders (that allow calling Beam coders)
> because, right now, the runner is in an unstable state (some transforms
> use the new way of doing ser/de and some use the old one, making the
> pipeline incoherent with respect to serialization)
>
> - clean history: the history contains commits from November 2018, so
> there is a good amount of work, thus a substantial number of commits.
> They were already squashed, but not those from September 2019 onward.

I don't think the number of commits should be an issue--we shouldn't
just squash years worth of history away. (OTOH, if this is a case of
this branch containing lots of little, irrelevant commits that would
have normally been squashed away in the normal review process we do
for the main branch, then, yes, some cleanup could be nice.)

> Regarding status:
>
> - the runner passes 89% of the validates runner tests in batch
mode. We
> hope to pass more with the new Encoders
>
> - Streaming mode is barely started (waiting for the
multi-aggregations
> support in spark SS framework from the Spark community)
>
> - Runner can execute Nexmark
>
> - Some things are not wired up yet
>
>      - Beam Schemas not wired with Spark Schemas
>
>      - Optional features of the model not implemented: state
api, timer
> api, splittable doFn api, …
>
> WDYT, can we merge it to master once the 2 steps are done ?

I think that as long as it sits parallel to the existing runner, and
is clearly marked with its status, it makes sense to me. How many
changes does it make to the existing codebase (as opposed to adding new
code)?



Re: [spark structured streaming runner] merge to master?

2019-10-11 Thread Etienne Chauchot

Hi Robert, comments inline:

On 10/10/2019 20:49, Robert Bradshaw wrote:

On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot  wrote:

Hi guys,

You probably know that there has been work ongoing for several months
developing a new Spark runner based on the Spark Structured Streaming
framework. This work is located in a feature branch here:
https://github.com/apache/beam/tree/spark-runner_structured-streaming

To attract more contributors and get some user feedback, we think it is
time to merge it to master. Before doing so, some steps need to be achieved:

- finish the work on spark Encoders (that allow calling Beam coders)
because, right now, the runner is in an unstable state (some transforms
use the new way of doing ser/de and some use the old one, making the
pipeline incoherent with respect to serialization)

- clean history: the history contains commits from November 2018, so
there is a good amount of work, thus a substantial number of commits.
They were already squashed, but not those from September 2019 onward.



I don't think the number of commits should be an issue--we shouldn't
just squash years worth of history away. (OTOH, if this is a case of
this branch containing lots of little, irrelevant commits that would
have normally been squashed away in the normal review process we do
for the main branch, then, yes, some cleanup could be nice.)

+1. Yes, the tiny ones were already squashed, except those from September until now.



Regarding status:

- the runner passes 89% of the validates runner tests in batch mode. We
hope to pass more with the new Encoders

- Streaming mode is barely started (waiting for the multi-aggregations
support in spark SS framework from the Spark community)

- Runner can execute Nexmark

- Some things are not wired up yet

  - Beam Schemas not wired with Spark Schemas

  - Optional features of the model not implemented:  state api, timer
api, splittable doFn api, …

WDYT, can we merge it to master once the 2 steps are done ?

I think that as long as it sits parallel to the existing runner, and
is clearly marked with its status, it makes sense to me. How many
changes does it make to the existing codebase (as opposed to adding new
code)?


the 2 runners live in the same module but in different java packages. 
The new runner was re-coded from scratch (cf. explanations here: 
https://www.slideshare.net/EtienneChauchot/etienne-chauchot-spark-structured-streaming-runner). 



There is no shared code between the runners (except beam sdk and core of 
course :) ).


Etienne




Re: [spark structured streaming runner] merge to master?

2019-10-11 Thread Etienne Chauchot
I think that it is important to also provide in the build a jar that 
only contains the old runner, for people who want to ship only one.


Etienne

On 10/10/2019 15:40, Alexey Romanenko wrote:

+1 for merging this new runner too (even if it’s not 100% ready for the moment), 
provided it doesn’t break/fail/affect other tests and Jenkins jobs. I 
mean, it should be transparent for other Beam components.

Also, since it won’t be officially “released” right after merging, we need to 
clearly warn users that it’s not ready to use in production.


On 10 Oct 2019, at 15:25, Ryan Skraba  wrote:

Merging to master sounds like a really good idea, even if it is not
feature-complete yet.

It's already a pretty big accomplishment getting it to the current
state (great job all!).  Merging it into master would give it a pretty
good boost for visibility and encouraging some discussion about where
it's going.

I don't think there's any question about removing the RDD-based
(a.k.a. old/legacy/stable) spark runner yet!

All my best, Ryan


On Thu, Oct 10, 2019 at 2:47 PM Jean-Baptiste Onofré  wrote:

+1

As the runner seems almost "equivalent" to the one we have, it makes sense.

Question is: do we keep the "old" spark runner for a while or not (or
just keep on previous version/tag on git) ?

Regards
JB

On 10/10/2019 09:39, Etienne Chauchot wrote:

Hi guys,

You probably know that there has been work ongoing for several months
developing a new Spark runner based on the Spark Structured Streaming
framework. This work is located in a feature branch here:
https://github.com/apache/beam/tree/spark-runner_structured-streaming

To attract more contributors and get some user feedback, we think it is
time to merge it to master. Before doing so, some steps need to be
achieved:

- finish the work on spark Encoders (that allow calling Beam coders)
because, right now, the runner is in an unstable state (some transforms
use the new way of doing ser/de and some use the old one, making the
pipeline incoherent with respect to serialization)

- clean history: the history contains commits from November 2018, so
there is a good amount of work, thus a substantial number of commits.
They were already squashed, but not those from September 2019 onward.

Regarding status:

- the runner passes 89% of the validates runner tests in batch mode. We
hope to pass more with the new Encoders

- Streaming mode is barely started (waiting for the multi-aggregations
support in spark SS framework from the Spark community)

- Runner can execute Nexmark

- Some things are not wired up yet

- Beam Schemas not wired with Spark Schemas

- Optional features of the model not implemented:  state api, timer
api, splittable doFn api, …

WDYT, can we merge it to master once the 2 steps are done ?

Best

Etienne


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[spark structured streaming runner] merge to master?

2019-10-10 Thread Etienne Chauchot

Hi guys,

You probably know that there has been work ongoing for several months 
developing a new Spark runner based on the Spark Structured Streaming 
framework. This work is located in a feature branch here: 
https://github.com/apache/beam/tree/spark-runner_structured-streaming


To attract more contributors and get some user feedback, we think it is 
time to merge it to master. Before doing so, some steps need to be achieved:


- finish the work on spark Encoders (that allow calling Beam coders) 
because, right now, the runner is in an unstable state (some transforms 
use the new way of doing ser/de and some use the old one, making the 
pipeline incoherent with respect to serialization)
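
For readers following along: at the byte level, this Encoders work boils down 
to routing Spark's ser/de through a Beam Coder. A minimal sketch of that 
bridge follows (the helper class is hypothetical; the actual work builds full 
Spark ExpressionEncoders on top of this idea):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.beam.sdk.coders.Coder;

    // Hypothetical helper: ser/de through a Beam Coder so that all
    // transforms agree on the wire format.
    class BeamCoderBridge {
      static <T> byte[] encode(Coder<T> coder, T value) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        coder.encode(value, out); // Beam coders write to a plain OutputStream
        return out.toByteArray();
      }

      static <T> T decode(Coder<T> coder, byte[] bytes) throws IOException {
        return coder.decode(new ByteArrayInputStream(bytes));
      }
    }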


- clean history: the history contains commits from November 2018, so 
there is a good amount of work, thus a substantial number of commits. 
They were already squashed, but not those from September 2019 onward.


Regarding status:

- the runner passes 89% of the validates runner tests in batch mode. We 
hope to pass more with the new Encoders


- Streaming mode is barely started (waiting for the multi-aggregations 
support in spark SS framework from the Spark community)


- Runner can execute Nexmark

- Some things are not wired up yet

    - Beam Schemas not wired with Spark Schemas

    - Optional features of the model not implemented:  state api, timer 
api, splittable doFn api, …


WDYT, can we merge it to master once the 2 steps are done ?

Best

Etienne



Re: Feedback on how we use Apache Beam in my company

2019-10-09 Thread Etienne Chauchot

Very nice !

Thanks

ccing dev list

Etienne

On 09/10/2019 16:55, Pierre Vanacker wrote:


Hi Apache Beam community,

We’ve been working with Apache Beam in production for a few years now 
in my company (Dailymotion).


If you’re interested in knowing how we use Apache Beam in combination 
with Google Dataflow, we shared this experience in the following 
article: 
https://medium.com/dailymotion/realtime-data-processing-with-apache-beam-and-google-dataflow-at-dailymotion-7d1b994dc816


Thanks to the developers for your great work !

Regards,

Pierre



Re: Cassandra flaky on Jenkins?

2019-09-19 Thread Etienne Chauchot
Hi all,

I just created a PR [1] that tries to fix the flakiness of 
CassandraIOTest (underlying ticket: 
https://jira.apache.org/jira/browse/BEAM-8025, which was assigned to me). We will 
see with the test repetitions whether it is still flaky.

JB, I don't know if my PR will also fix the ticket 
https://issues.apache.org/jira/browse/BEAM-7355 assigned to you, or 
if the tickets are the same/related. I hope it does.

[1] https://github.com/apache/beam/pull/9614

Best, Etienne

Le mercredi 04 septembre 2019 à 16:27 +0200, Jean-Baptiste Onofré a écrit :
> Thanks David,
> it makes sense, it gives me time to investigate and fix.
> Regards
> JB
>
> On 04/09/2019 15:01, David Morávek wrote:
> > Hi, temporarily disabling the test until BEAM-8025
> > <https://jira.apache.org/jira/browse/BEAM-8025> is resolved (marking it as
> > blocker for 2.16), so we can unblock ongoing pull requests.
> > Best, D.
> >
> > On Tue, Sep 3, 2019 at 3:57 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > > Hi Max,
> > > yup, I'm starting the investigation.
> > > I keep you posted.
> > > Regards
> > > JB
> > >
> > > On 03/09/2019 15:34, Maximilian Michels wrote:
> > > > The newest incarnation of this is here:
> > > > https://jira.apache.org/jira/browse/BEAM-8025
> > > > Would be good if you could take a look JB.
> > > > Thanks, Max
> > > >
> > > > On 03.09.19 15:32, David Morávek wrote:
> > > > > yes, that looks similar. example:
> > > > > https://github.com/apache/beam/pull/9464
> > > > > D.
> > > > >
> > > > > On 3 Sep 2019, at 15:18, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > > > > > Thanks David,
> > > > > > the build is running on my machine to see if I can reproduce locally.
> > > > > > It sounds like https://issues.apache.org/jira/browse/BEAM-7355 right?
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > > > On 03/09/2019 15:11, David Morávek wrote:
> > > > > > > I'm running into these failures too. D.
> > > > > > > Sent from my iPhone
> > > > > > >
> > > > > > > On 3 Sep 2019, at 14:34, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > > > > > > > Hi,
> > > > > > > > Let me take a look. Do you always have this issue on Jenkins or
> > > > > > > > randomly?
> > > > > > > > Regards
> > > > > > > > JB
> > > > > > > >
> > > > > > > > On 03/09/2019 14:19, Alex Van Boxel wrote:
> > > > > > > > > Hi, is it only me that is bumping on the flaky Cassandra on
> > > > > > > > > Jenkins? I like to get my PR approved but I can't get past the
> > > > > > > > > Cassandra error...
> > > > > > > > > * org.apache.beam.sdk.io.cassandra.CassandraIOTest.classMethod
> > > > > > > > > https://builds.apache.org/job/beam_PreCommit_Java_Phrase/1300/testReport/junit/org.apache.beam.sdk.io.cassandra/CassandraIOTest/classMethod/
> > > > > > > > > _/ Alex Van Boxel
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> 
> 


Re: Pointers on Contributing to Structured Streaming Spark Runner

2019-09-19 Thread Etienne Chauchot
Hi Rahul and Xinyu,

I just added you to the list of guests in the meeting. Time is 5pm GMT+2. 
That being said, for some reason the last meeting scheduled was on 08/28. 
Ismael initially created the meeting; I do not have the rights to add a new 
date. Ismael, can you add a date? I suggest 09/25. WDYT?

Best, Etienne
Le jeudi 19 septembre 2019 à 00:49 +0530, rahul patwari a écrit :
> Hi, 
> I would love to join the call. 
> Can you also share the meeting invitation with me?
> 
> Thanks,
> Rahul
> On Wed 18 Sep, 2019, 11:48 PM Xinyu Liu,  wrote:
> > Alexey and Etienne: I'm very happy to join the sync-up meeting. Please 
> > forward the meeting info to me. I am based in
> > California, US and hopefully the time will work :).
> > Thanks,
> > Xinyu
> > On Wed, Sep 18, 2019 at 6:39 AM Etienne Chauchot  
> > wrote:
> > > Hi Xinyu,
> > > Thanks for offering help ! My comments are inline:
> > > Le vendredi 13 septembre 2019 à 12:16 -0700, Xinyu Liu a écrit :
> > > > Hi, Etienne,
> > > > The slides are very informative! Thanks for sharing the details about 
> > > > how the Beam API are mapped into Spark
> > > > Structural Streaming. 
> > > 
> > > Thanks !
> > > > We (LinkedIn) are also interested in trying the new SparkRunner to run 
> > > > Beam pipelines in batch, and contribute to
> > > > it too. From my understanding, seems the functionality on batch side is 
> > > > mostly complete and covers quite a large
> > > > percentage of the tests (a few missing pieces like state and timer in 
> > > > ParDo and SDF). 
> > > 
> > > Correct, it passes 89% of the tests, but more than SDF, state, and 
> > > timer is missing; there is also ongoing 
> > > encoders work that I would like to commit/push before merging.
> > > > If so, is it possible to merge the new runner sooner into master so 
> > > > it's much easier for us to pull it in (we
> > > > have an internal fork) and contribute back?
> > > 
> > > Sure, see my other mail on this thread. As Alexey mentioned, please join 
> > > the sync meeting we have, the more the
> > > merrier !
> > > > Also curious about the scheme part in the runner. Seems we can leverage 
> > > > the schema-aware work in PCollection and
> > > > translate from Beam schema to Spark, so it can be optimized in the 
> > > > planner layer. It will be great to hear back
> > > > your plans on that.
> > > 
> > > Well, it is not designed yet but, if you remember my talk, we need to 
> > > store beam windowing information with the
> > > data itself, so ending up having a dataset of WindowedValue. One lead that 
> > > was discussed is to store it as a Spark
> > > schema such as this:
> > > 1. field1: binary data for beam windowing information (cannot be mapped 
> > > to fields because beam windowing info is a 
> > > complex structure)
> > > 2. fields of data as defined in the Beam schema if there is one 
> > > 
> > > > Congrats on this great work!
> > > Thanks !
> > > Best,
> > > Etienne
> > > > Thanks,
> > > > Xinyu
> > > > On Wed, Sep 11, 2019 at 6:02 PM Rui Wang  wrote:
> > > > > Hello Etienne,
> > > > > Your slide mentioned that streaming mode development is blocked 
> > > > > because Spark lacks supporting multiple-
> > > > > aggregations in its streaming mode but design is ongoing. Do you have 
> > > > > a link or something else to their design
> > > > > discussion/doc?
> > > > > 
> > > > > 
> > > > > -Rui  
> > > > > On Wed, Sep 11, 2019 at 5:10 PM Etienne Chauchot 
> > > > >  wrote:
> > > > > > Hi Rahul,
> > > > > > Sure, and great! Thanks for proposing! If you want details, here is 
> > > > > > the presentation I did 30 mins ago at ApacheCon. You will find the 
> > > > > > video on YouTube shortly, but in the meantime, here are my 
> > > > > > presentation slides.
> > > > > > And here is the structured streaming branch. I'll be happy to 
> > > > > > review your PRs, thanks!
> > > > > > https://github.com/apache/beam/tree/spark-runner_structured-streaming
> > > > > > Best, Etienne
> > > > > > Le mercredi 11 septembre 2019 à 16:37 +0530, rahul patwari a écrit :
> > > > > > > Hi Etienne,
> > > > > > > 
> > > > > > > I came to know about the work going on in Structured Streaming 
> > > > > > > Spark Runner from Apache Beam Wiki - Works
> > > > > > > in Progress.
> > > > > > > I have contributed to BeamSql earlier. And I am working on 
> > > > > > > supporting PCollectionView in BeamSql.
> > > > > > > 
> > > > > > > I would love to understand the Runner's side of Apache Beam and 
> > > > > > > contribute to the Structured Streaming
> > > > > > > Spark Runner.
> > > > > > > 
> > > > > > > Can you please point me in the right direction?
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Rahul


Re: Pointers on Contributing to Structured Streaming Spark Runner

2019-09-18 Thread Etienne Chauchot
Hi Xinyu,
Thanks for offering help ! My comments are inline:
Le vendredi 13 septembre 2019 à 12:16 -0700, Xinyu Liu a écrit :
> Hi, Etienne,
> The slides are very informative! Thanks for sharing the details about how the 
> Beam API are mapped into Spark
> Structural Streaming. 

Thanks !
> We (LinkedIn) are also interested in trying the new SparkRunner to run Beam 
> pipelines in batch, and contribute to it
> too. From my understanding, seems the functionality on batch side is mostly 
> complete and covers quite a large
> percentage of the tests (a few missing pieces like state and timer in ParDo 
> and SDF). 

Correct, it passes 89% of the tests, but more than SDF, state, and 
timer is missing; there is also ongoing encoders 
work that I would like to commit/push before merging.
> If so, is it possible to merge the new runner sooner into master so it's much 
> easier for us to pull it in (we have an
> internal fork) and contribute back?

Sure, see my other mail on this thread. As Alexey mentioned, please join the 
sync meeting we have, the more the merrier
!
> Also curious about the scheme part in the runner. Seems we can leverage the 
> schema-aware work in PCollection and
> translate from Beam schema to Spark, so it can be optimized in the planner 
> layer. It will be great to hear back your
> plans on that.

Well, it is not designed yet but, if you remember my talk, we need to store 
beam windowing information with the data
itself, so ending up having a dataset of WindowedValue. One lead that was 
discussed is to store it as a Spark schema such
as this:
1. field1: binary data for beam windowing information (cannot be mapped to 
fields because beam windowing info is a 
complex structure)
2. fields of data as defined in the Beam schema if there is one 
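
A sketch of that layout in Spark SQL types (the windowing column is the idea 
under discussion; the user fields are made up for illustration):

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    // Hypothetical schema: one opaque binary column for the Beam windowing
    // metadata, followed by the fields of the user's Beam schema.
    class WindowedValueSchemaSketch {
      static final StructType SCHEMA = new StructType(new StructField[] {
          new StructField("beamWindowingInfo", DataTypes.BinaryType, false, Metadata.empty()),
          new StructField("userId", DataTypes.LongType, false, Metadata.empty()),
          new StructField("name", DataTypes.StringType, true, Metadata.empty())
      });
    }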

> Congrats on this great work!
Thanks !
Best,
Etienne
> Thanks,
> Xinyu
> On Wed, Sep 11, 2019 at 6:02 PM Rui Wang  wrote:
> > Hello Etienne,
> > Your slide mentioned that streaming mode development is blocked because 
> > Spark lacks supporting multiple-aggregations 
> > in its streaming mode but design is ongoing. Do you have a link or 
> > something else to their design discussion/doc?
> > 
> > 
> > -Rui  
> > On Wed, Sep 11, 2019 at 5:10 PM Etienne Chauchot  
> > wrote:
> > > Hi Rahul,
> > > Sure, and great! Thanks for proposing! If you want details, here is the 
> > > presentation I did 30 mins ago at ApacheCon. You will find the video on 
> > > YouTube shortly, but in the meantime, here are my presentation slides.
> > > And here is the structured streaming branch. I'll be happy to review your 
> > > PRs, thanks!
> > > https://github.com/apache/beam/tree/spark-runner_structured-streaming
> > > Best, Etienne
> > > Le mercredi 11 septembre 2019 à 16:37 +0530, rahul patwari a écrit :
> > > > Hi Etienne,
> > > > 
> > > > I came to know about the work going on in Structured Streaming Spark 
> > > > Runner from Apache Beam Wiki - Works in
> > > > Progress.
> > > > I have contributed to BeamSql earlier. And I am working on supporting 
> > > > PCollectionView in BeamSql.
> > > > 
> > > > I would love to understand the Runner's side of Apache Beam and 
> > > > contribute to the Structured Streaming Spark
> > > > Runner.
> > > > 
> > > > Can you please point me in the right direction?
> > > > 
> > > > Thanks,
> > > > Rahul


Re: Pointers on Contributing to Structured Streaming Spark Runner

2019-09-18 Thread Etienne Chauchot
Hi Rui,

Thanks for proposing to contribute to this new runner! Here are the pointers:

- SS runner branch: 
https://github.com/apache/beam/tree/spark-runner_structured-streaming
- Spark design doc for multiple watermarks support: 
https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#t#
There is also a good discussion in this Spark PR: https://github.com/apache/spark/pull/23576

As Alexey mentioned in this thread, the SS runner feature branch will be merged 
into master when the runner is in good shape. I think we will not wait for the 
streaming part, as it requires a deep change in the Spark core plus the 
implementation of the streaming part of the Beam runner, so it would take too 
long. IMHO we need to get the batch mode of the new runner into a stable state 
(ongoing encoders work, fixing the bad perf of the 2 Nexmark queries, ...) 
before merging.

Best, Etienne
Le mercredi 11 septembre 2019 à 18:02 -0700, Rui Wang a écrit :
> Hello Etienne,
> Your slide mentioned that streaming mode development is blocked because Spark 
> lacks supporting multiple-aggregations
> in its streaming mode but design is ongoing. Do you have a link or something 
> else to their design discussion/doc?
> 
> 
> -Rui  
> On Wed, Sep 11, 2019 at 5:10 PM Etienne Chauchot  wrote:
> > Hi Rahul,
> > Sure, and great! Thanks for proposing! If you want details, here is the 
> > presentation I did 30 mins ago at ApacheCon. You will find the video on 
> > YouTube shortly, but in the meantime, here are my presentation slides.
> > And here is the structured streaming branch. I'll be happy to review your 
> > PRs, thanks!
> > https://github.com/apache/beam/tree/spark-runner_structured-streaming
> > Best, Etienne
> > Le mercredi 11 septembre 2019 à 16:37 +0530, rahul patwari a écrit :
> > > Hi Etienne,
> > > 
> > > I came to know about the work going on in Structured Streaming Spark 
> > > Runner from Apache Beam Wiki - Works in
> > > Progress.
> > > I have contributed to BeamSql earlier. And I am working on supporting 
> > > PCollectionView in BeamSql.
> > > 
> > > I would love to understand the Runner's side of Apache Beam and 
> > > contribute to the Structured Streaming Spark
> > > Runner.
> > > 
> > > Can you please point me in the right direction?
> > > 
> > > Thanks,
> > > Rahul


[Off for 3 weeks]

2019-07-19 Thread Etienne Chauchot
Hi guys,

Just to let you know, I'll be off for 3 weeks starting tonight.

See you when I get back

Etienne


Re: [Current spark runner] Combine globally translation is risky and not very performant

2019-07-01 Thread Etienne Chauchot
Hi Jan,

The collect call is before the extraction, so it is collecting a value 
per accumulator to the Spark driver; see the 
sparkCombineFn.extractOutput(maybeAccumulated.get()) call implementation. So 
potentially more than one value per window.

For the new spark runner, what I'm using is a native combine that all happens on 
the dataset side (the equivalent of an RDD, to simplify), so it is all in parallel.
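
To illustrate what "native" means here, a self-contained toy (a trivial sum 
standing in for a translated Beam CombineFn, not the runner's actual code): 
the aggregation runs distributed inside the Dataset engine, with nothing 
collected to the driver before output extraction.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.expressions.Aggregator;

    public class NativeCombineSketch {
      // Toy aggregator: zero/reduce/merge/finish mirror a Beam CombineFn's
      // createAccumulator/addInput/mergeAccumulators/extractOutput.
      static class SumFn extends Aggregator<Long, Long, Long> {
        @Override public Long zero() { return 0L; }
        @Override public Long reduce(Long acc, Long in) { return acc + in; }
        @Override public Long merge(Long a, Long b) { return a + b; }
        @Override public Long finish(Long acc) { return acc; }
        @Override public Encoder<Long> bufferEncoder() { return Encoders.LONG(); }
        @Override public Encoder<Long> outputEncoder() { return Encoders.LONG(); }
      }

      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[2]").appName("combine-sketch").getOrCreate();
        Dataset<Long> nums = spark.range(1, 101); // 1..100
        long total = nums.select(new SumFn().toColumn()).first();
        System.out.println(total); // 5050
        spark.stop();
      }
    }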
Etienne 

Le jeudi 27 juin 2019 à 15:13 +0200, Jan Lukavský a écrit :
> Hi Etienne,
> I saw that too while working on solving [1]. It seems a little weird and I 
> was a little tempted to changed it to
> something roughly equivalent to Combine.perKey with single key. But, actually 
> the Combine.globally should be rather
> small, right? There will be single value for each window. And even if we 
> change it to Combine.perKey with single key I
> think the problem of potential OOM will be just moved to some worker. Or 
> would you see some other option?
> Jan
> [1] https://issues.apache.org/jira/browse/BEAM-7574
> On 6/27/19 11:43 AM, Etienne Chauchot wrote:
> Hi guys,
> FYI, while I'm working on the combine translation for the new spark runner 
> POC, I saw something that does not seem right 
> in the current runner: https://issues.apache.org/jira/browse/BEAM-7647
> Best, Etienne


[Current spark runner] Combine globally translation is risky and not very performant

2019-06-27 Thread Etienne Chauchot
Hi guys,

FYI, while I'm working on the combine translation for the new spark runner POC, 
I saw something that does not seem right 
in the current runner: https://issues.apache.org/jira/browse/BEAM-7647

Best,
Etienne


Re: Congrats to Beam's first 6 Google Open Source Peer Bonus recipients!

2019-05-09 Thread Etienne Chauchot
Congrats ! 
Etienne
Le lundi 06 mai 2019 à 22:28 -0700, Joana Filipa Bernardo Carrasqueira a écrit :
> Thank you for your work in the community and Congratulations!! :) 
> 
> On Thu, May 2, 2019 at 9:44 PM Ankur Goenka  wrote:
> > Congratulations and thank you for making Beam awesome! 
> > From: Chamikara Jayalath 
> > Date: Thu, May 2, 2019, 4:03 PM
> > To: dev
> > 
> > > Congratulations!
> > > On Thu, May 2, 2019 at 10:28 AM Udi Meiri  wrote:
> > > > Congrats everyone!
> > > > On Thu, May 2, 2019 at 9:55 AM Ahmet Altay  wrote:
> > > > > Congratulations!
> > > > > 
> > > > > On Thu, May 2, 2019 at 9:54 AM Yifan Zou  wrote:
> > > > > > Congratulations! Well deserved!
> > > > > > On Thu, May 2, 2019 at 9:37 AM Rui Wang  wrote:
> > > > > > > Congratulations!
> > > > > > > 
> > > > > > > -Rui  
> > > > > > > On Thu, May 2, 2019 at 8:23 AM Michael Luckey 
> > > > > > >  wrote:
> > > > > > > > Congrats! Well deserved!
> > > > > > > > On Thu, May 2, 2019 at 3:29 PM Alexey Romanenko 
> > > > > > > >  wrote:
> > > > > > > > > Congrats! 
> > > > > > > > > 
> > > > > > > > > > On 2 May 2019, at 10:06, Gleb Kanterov  
> > > > > > > > > > wrote:
> > > > > > > > > > 
> > > > > > > > > > Congratulations! Well deserved!
> > > > > > > > > > 
> > > > > > > > > > On Thu, May 2, 2019 at 10:00 AM Ismaël Mejía 
> > > > > > > > > >  wrote:
> > > > > > > > > > > Congrats everyone !
> > > > > > > > > > > 
> > > > > > > > > > > On Thu, May 2, 2019 at 9:14 AM Robert Bradshaw 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > > Congratulation, and thanks for all the great 
> > > > > > > > > > > > contributions each one of you has made to Beam! 
> > > > > > > > > > > > On Thu, May 2, 2019 at 5:51 AM Ruoyun Huang 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > Congratulations everyone!  Well deserved! 
> > > > > > > > > > > > > On Wed, May 1, 2019 at 8:38 PM Kenneth Knowles 
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > Congrats! All well deserved!
> > > > > > > > > > > > > > Kenn
> > > > > > > > > > > > > > On Wed, May 1, 2019 at 8:09 PM Reza Rokni 
> > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > > Congratulations! 
> > > > > > > > > > > > > > > On Thu, 2 May 2019 at 10:53, Connell O'Callaghan 
> > > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > > > Well done - congratulations to you all!!! Rose 
> > > > > > > > > > > > > > > > thank you for sharing this news!!!
> > > > > > > > > > > > > > > > On Wed, May 1, 2019 at 19:45 Rose Nguyen 
> > > > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > > > > Matthias Baetens, Lukazs Gajowy, Suneel 
> > > > > > > > > > > > > > > > > Marthi, Maximilian Michels, Alex Van Boxel,
> > > > > > > > > > > > > > > > > and Thomas Weise:
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Thank you for your exceptional contributions 
> > > > > > > > > > > > > > > > > to Apache Beam.👏 I'm looking forward t
> > > > > > > > > > > > > > > > > o seeing this project grow and for more folks 
> > > > > > > > > > > > > > > > > to contribute and be recognized!
> > > > > > > > > > > > > > > > > Everyone can read more about this award on 
> > > > > > > > > > > > > > > > > the Google Open Source blog: 
> > > > > > > > > > > > > > > > > https://opensource.googleblog.com/2019/04/google-open-source-peer-bonus-winners.html
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Cheers,   
> > > > > > > > > > > > > > > > > -- 
> > > > > > > > > > > > > > > > > Rose Thị Nguyễn
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > -- 
> > > > > > > > > > > > > > > This email may be confidential and privileged. If 
> > > > > > > > > > > > > > > you received this communication by
> > > > > > > > > > > > > > > mistake, please don't forward it to anyone else, 
> > > > > > > > > > > > > > > please erase all copies and attachments,
> > > > > > > > > > > > > > > and please let me know that it has gone to the 
> > > > > > > > > > > > > > > wrong person. 
> > > > > > > > > > > > > > > The above terms reflect a potential business 
> > > > > > > > > > > > > > > arrangement, are provided solely as a basis
> > > > > > > > > > > > > > > for further discussion, and are not intended to 
> > > > > > > > > > > > > > > be and do not constitute a legally binding
> > > > > > > > > > > > > > > obligation. No legally binding obligations will 
> > > > > > > > > > > > > > > be created, implied, or inferred until an
> > > > > > > > > > > > > > > agreement in final form is executed in writing by 
> > > > > > > > > > > > > > > all parties involved.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > -- 
> > > > > > > > > > > > > Ruoyun  Huang
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > -- 
> > > > > > > > > > Cheers,Gleb
> 
> 


Re: [ANNOUNCE] New committer announcement: Boyuan Zhang

2019-05-09 Thread Etienne Chauchot
Congrats !
Etienne
Le vendredi 12 avril 2019 à 15:53 -0700, Thomas Weise a écrit :
> Congrats!
> 
> On Thu, Apr 11, 2019 at 6:03 PM Reuven Lax  wrote:
> > Congratulations Boyuan!
> > On Thu, Apr 11, 2019 at 4:53 PM Ankur Goenka  wrote:
> > > Congrats Boyuan!
> > > On Thu, Apr 11, 2019 at 4:52 PM Mark Liu  wrote:
> > > > Congrats Boyuan!
> > > > On Thu, Apr 11, 2019 at 9:53 AM Alexey Romanenko 
> > > >  wrote:
> > > > > > since early 2018
> > > > > 
> > > > > > 100+ pull requests
> > > > > 
> > > > > 
> > > > > 
> > > > > Wow, this is impressive! Great job, congrats!
> > > > > 
> > > > > 
> > > > > 
> > > > > > On 11 Apr 2019, at 15:08, Maximilian Michels  
> > > > > > wrote:
> > > > > 
> > > > > > 
> > > > > 
> > > > > > Great work! Congrats.
> > > > > 
> > > > > > 
> > > > > 
> > > > > > On 11.04.19 13:41, Robert Bradshaw wrote:
> > > > > 
> > > > > >> Congratulations!
> > > > > 
> > > > > >> On Thu, Apr 11, 2019 at 12:29 PM Michael Luckey 
> > > > > >>  wrote:
> > > > > 
> > > > > >>> 
> > > > > 
> > > > > >>> Congrats and welcome, Boyuan
> > > > > 
> > > > > >>> 
> > > > > 
> > > > > >>> On Thu, Apr 11, 2019 at 12:27 PM Tim Robertson 
> > > > > >>>  wrote:
> > > > > 
> > > > >  
> > > > > 
> > > > >  Many congratulations Boyuan!
> > > > > 
> > > > >  
> > > > > 
> > > > >  On Thu, Apr 11, 2019 at 10:50 AM Łukasz Gajowy 
> > > > >   wrote:
> > > > > 
> > > > > > 
> > > > > 
> > > > > > Congrats Boyuan! :)
> > > > > 
> > > > > > 
> > > > > 
> > > > > > śr., 10 kwi 2019 o 23:49 Chamikara Jayalath 
> > > > > >  napisał(a):
> > > > > 
> > > > > >> 
> > > > > 
> > > > > >> Congrats Boyuan!
> > > > > 
> > > > > >> 
> > > > > 
> > > > > >> On Wed, Apr 10, 2019 at 11:14 AM Yifan Zou 
> > > > > >>  wrote:
> > > > > 
> > > > > >>> 
> > > > > 
> > > > > >>> Congratulations Boyuan!
> > > > > 
> > > > > >>> 
> > > > > 
> > > > > >>> On Wed, Apr 10, 2019 at 10:49 AM Daniel Oliveira 
> > > > > >>>  wrote:
> > > > > 
> > > > >  
> > > > > 
> > > > >  Congrats Boyuan!
> > > > > 
> > > > >  
> > > > > 
> > > > >  On Wed, Apr 10, 2019 at 10:20 AM Rui Wang 
> > > > >   wrote:
> > > > > 
> > > > > > 
> > > > > 
> > > > > > So well deserved!
> > > > > 
> > > > > > 
> > > > > 
> > > > > > -Rui
> > > > > 
> > > > > > 
> > > > > 
> > > > > > On Wed, Apr 10, 2019 at 10:12 AM Pablo Estrada 
> > > > > >  wrote:
> > > > > 
> > > > > >> 
> > > > > 
> > > > > >> Well deserved : ) congrats Boyuan!
> > > > > 
> > > > > >> 
> > > > > 
> > > > > >> On Wed, Apr 10, 2019 at 10:08 AM Aizhamal Nurmamat kyzy 
> > > > > >>  wrote:
> > > > > 
> > > > > >>> 
> > > > > 
> > > > > >>> Congratulations Boyuan!
> > > > > 
> > > > > >>> 
> > > > > 
> > > > > >>> On Wed, Apr 10, 2019 at 9:52 AM Ruoyun Huang 
> > > > > >>>  wrote:
> > > > > 
> > > > >  
> > > > > 
> > > > >  Thanks for your contributions and congratulations Boyuan!
> > > > > 
> > > > >  
> > > > > 
> > > > >  On Wed, Apr 10, 2019 at 9:00 AM Kenneth Knowles 
> > > > >   wrote:
> > > > > 
> > > > > > 
> > > > > 
> > > > > > Hi all,
> > > > > 
> > > > > > 
> > > > > 
> > > > > > Please join me and the rest of the Beam PMC in 
> > > > > > welcoming a new committer: Boyuan Zhang.
> > > > > 
> > > > > > 
> > > > > 
> > > > > > Boyuan has been contributing to Beam since early 2018. 
> > > > > > She has proposed 100+ pull requests
> > > > > across a wide range of topics: bug fixes, integration tests, build 
> > > > > improvements, metrics features, release
> > > > > automation. Two big picture things to highlight are 
> > > > > building/releasing Beam Python wheels and managing the
> > > > > donation of the Beam Dataflow Java Worker, including help with I.P. 
> > > > > clearance.
> > > > > 
> > > > > > 
> > > > > 
> > > > > > In consideration of Boyuan's contributions, the Beam 
> > > > > > PMC trusts Boyuan with the responsibilities
> > > > > of a Beam committer [1].
> > > > > 
> > > > > > 
> > > > > 
> > > > > > Thank you, Boyuan, for your contributions.
> > > > > 
> > > > > > 
> > > > > 
> > > > > > Kenn
> > > > > 
> > > > > > 
> > > > > 
> > > > > > [1] 
> > > > > > https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
> > > > > 
> > > > >  
> > > > > 
> > > > >  
> > > > > 
> > > > >  
> > > > > 
> > > > >  --
> > > > > 
> > > > >  
> > > > > 
> > > > >  Ruoyun  Huang
> > > > > 
> > > > >  
> > > > > 
> > > > > 
> > > > > 
> 
> 


Re: [ANNOUNCE] New committer announcement: Mark Liu

2019-05-09 Thread Etienne Chauchot
Congrats !
Le lundi 25 mars 2019 à 10:55 -0700, Chamikara Jayalath a écrit :
> Congrats Mark!
> 
> On Mon, Mar 25, 2019 at 10:50 AM Alexey Romanenko  
> wrote:
> > Congratulations, Mark!
> > 
> > > On 25 Mar 2019, at 18:36, Mark Liu  wrote:
> > > 
> > > Thank you all! It's a great pleasure to work on Beam!
> > > Mark
> > > 
> > > On Mon, Mar 25, 2019 at 10:18 AM Robin Qiu  wrote:
> > > > Congratulations, Mark!
> > > > 
> > > > On Mon, Mar 25, 2019 at 9:31 AM Udi Meiri  wrote:
> > > > > Congrats Mark!
> > > > > 
> > > > > On Mon, Mar 25, 2019 at 9:24 AM Ahmet Altay  wrote:
> > > > > > Congratulations, Mark! 🎉
> > > > > > 
> > > > > > On Mon, Mar 25, 2019 at 7:24 AM Tim Robertson 
> > > > > >  wrote:
> > > > > > > Congratulations Mark!
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > On Mon, Mar 25, 2019 at 3:18 PM Michael Luckey 
> > > > > > >  wrote:
> > > > > > > > Nice! Congratulations, Mark.
> > > > > > > > 
> > > > > > > > On Mon, Mar 25, 2019 at 2:42 PM Katarzyna Kucharczyk 
> > > > > > > >  wrote:
> > > > > > > > > Congratulations, Mark! 🎉
> > > > > > > > > 
> > > > > > > > > On Mon, Mar 25, 2019 at 11:24 AM Gleb Kanterov 
> > > > > > > > >  wrote:
> > > > > > > > > > Congratulations!
> > > > > > > > > > 
> > > > > > > > > > On Mon, Mar 25, 2019 at 10:23 AM Łukasz Gajowy 
> > > > > > > > > >  wrote:
> > > > > > > > > > > Congrats! :)
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > pon., 25 mar 2019 o 08:11 Aizhamal Nurmamat kyzy 
> > > > > > > > > > >  napisał(a):
> > > > > > > > > > > > Congratulations, Mark!
> > > > > > > > > > > > On Sun, Mar 24, 2019 at 23:18 Pablo Estrada 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > Yeaah  Mark! : ) Congrats : D
> > > > > > > > > > > > > On Sun, Mar 24, 2019 at 10:32 PM Yifan Zou 
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > Congratulations Mark!
> > > > > > > > > > > > > > On Sun, Mar 24, 2019 at 10:25 PM Connell 
> > > > > > > > > > > > > > O'Callaghan  wrote:
> > > > > > > > > > > > > > > Well done congratulations Mark!!! 
> > > > > > > > > > > > > > > On Sun, Mar 24, 2019 at 10:17 PM Robert Burke 
> > > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > > > Congratulations Mark! 🎉
> > > > > > > > > > > > > > > > On Sun, Mar 24, 2019, 10:08 PM Valentyn 
> > > > > > > > > > > > > > > > Tymofieiev  wrote:
> > > > > > > > > > > > > > > > > Congratulations, Mark!
> > > > > > > > > > > > > > > > > Thanks for your contributions, in particular 
> > > > > > > > > > > > > > > > > for your efforts to parallelize test
> > > > > > > > > > > > > > > > > execution for Python SDK and increase the 
> > > > > > > > > > > > > > > > > speed of Python precommit checks. 
> > > > > > > > > > > > > > > > > On Sun, Mar 24, 2019 at 9:40 PM Kenneth 
> > > > > > > > > > > > > > > > > Knowles  wrote:
> > > > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > > > > Please join me and the rest of the Beam PMC 
> > > > > > > > > > > > > > > > > > in welcoming a new committer: Mark Liu.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Mark has been contributing to Beam since 
> > > > > > > > > > > > > > > > > > late 2016! He has proposed 100+ pull
> > > > > > > > > > > > > > > > > > requests. Mark was instrumental in 
> > > > > > > > > > > > > > > > > > expanding test and infrastructure coverage,
> > > > > > > > > > > > > > > > > > especially for Python. In consideration of 
> > > > > > > > > > > > > > > > > > Mark's contributions, the Beam PMC trusts
> > > > > > > > > > > > > > > > > > Mark with the responsibilities of a Beam 
> > > > > > > > > > > > > > > > > > committer [1].
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Thank you, Mark, for your contributions.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Kenn
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > [1] 
> > > > > > > > > > > > > > > > > > https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
> > > > > > > > > > > > -- 
> > > > > > > > > > > > 
> > > > > > > > > > > > Aizhamal Nurmamat kyzy
> > > > > > > > > > > > Open Source Program Manager
> > > > > > > > > > > > 646-355-9740 Mobile
> > > > > > > > > > > > 601 North 34th Street, Seattle, WA 98103
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 


Structured streaming based spark runner.

2019-04-30 Thread Etienne Chauchot
Hi guys, 
As part of the ongoing work on the spark runner POC based on the structured streaming 
framework, I sketched up a design doc [1] 
to share context and design principles.
Feel free to comment.

[1] https://s.apache.org/spark-structured-streaming-runner
Etienne


Re: CVE audit gradle plugin

2019-04-26 Thread Etienne Chauchot
Hi all,

Just to let you know, you can now check the vulnerabilities in libraries by 
running gradlew audit --info. It is a 
separate task that is not in the dependencies of the build (the normal build 
will not fail if vulnerabilities are 
found). When you run it, it gives an output similar to mvn dependency:tree, with 
a red vulnerability arrow, and the audit 
task fails if vulnerabilities are found. If there are none, it succeeds.

For now, there is no more than that, but it could be included into Jenkins. As we 
did not agree on something, I did not do 
the integration.

WDYT?
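
For reference, wiring it up looks roughly like this (plugin id and version as I 
recall them from the plugin's README; please double-check against 
https://github.com/OSSIndex/ossindex-gradle-plugin before copying):

    // build.gradle sketch (hypothetical wiring).
    plugins {
      id 'net.ossindex.audit' version '0.4.11'
    }
    // This adds an `audit` task; it is not attached to `build`, so run it
    // explicitly: ./gradlew audit --info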
Etienne

Le mercredi 24 avril 2019 à 15:56 +0200, Etienne Chauchot a écrit :
> Hi all,
> FYI I just submitted a PR [1] to add the CVE audit plugin to the build 
> as an optional task: gradlew audit --info.
> [1] https://github.com/apache/beam/pull/8388
> Etienne
> Le mardi 23 avril 2019 à 17:25 +0200, Etienne Chauchot a écrit :
> > Hi,
> > should I merge my branch 
> > https://github.com/echauchot/beam/tree/cve_audit_plugin to master to 
> > include this tool
> > in the build system then? It will not fail the build but will add an audit task 
> > to it.
> > Etienne
> > Le vendredi 19 avril 2019 à 10:54 -0700, Lukasz Cwik a écrit :
> > >  Common Vulnerabilities and Exposures (CVE)
> > > 
> > > On Fri, Apr 19, 2019 at 10:33 AM Robert Burke  wrote:
> > > > Ah! What's CVE stand for then?
> > > > 
> > > > Re the PR: Sadly, it's more complicated than that, which I'll explain 
> > > > in the PR. Otherwise it would have been
> > > > done already. It's not too bad if the time is put in though.
> > > > On Fri, 19 Apr 2019 at 10:17, Lukasz Cwik  wrote:
> > > > > Robert, I believe what is being suggested is a tool that integrates 
> > > > > into CVE reports automatically and tells
> > > > > us if we have a dependency with a security issue (not just whether 
> > > > > there is a newer version). Also, there is a
> > > > > sweet draft PR to add Go modules[1].
> > > > > 1: https://github.com/apache/beam/pull/8354
> > > > > On Fri, Apr 19, 2019 at 10:12 AM Robert Burke  
> > > > > wrote:
> > > > > > If we move to Go Modules, the go.mod file specifies direct 
> > > > > > dependencies and versions, and the go.sum file
> > > > > > includes checksums of the full transitive set of dependencies. 
> > > > > > There's likely going to be a tool for
> > > > > > detecting if an update is possible, if one doesn't exist in the go 
> > > > > > tooling already.
> > > > > > On Fri, 19 Apr 2019 at 09:44, Lukasz Cwik  wrote:
> > > > > > > This seems worthwhile IMO.
> > > > > > > Ahmet, Pyup[1] is free for open source projects and has an API 
> > > > > > > that allows for dependency checking. They
> > > > > > > can scan Github repos automatically it seems but it may not be 
> > > > > > > compatible with how Apache permissions with
> > > > > > > Github work. I'm not sure if there is such a thing for Go.
> > > > > > > 
> > > > > > > 1: https://pyup.io/
> > > > > > > 
> > > > > > > On Fri, Apr 19, 2019 at 2:31 AM Ismaël Mejía  
> > > > > > > wrote:
> > > > > > > > I want to bring this subject back, any chance we can get this 
> > > > > > > > running
> > > > > > > > 
> > > > > > > > in our main repo maybe on a weekly basis like we do for the 
> > > > > > > > dependency
> > > > > > > > 
> > > > > > > > reports. It looks totally worth it.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On Fri, Mar 1, 2019 at 2:05 AM Ahmet Altay  
> > > > > > > > wrote:
> > > > > > > > 
> > > > > > > > >
> > > > > > > > 
> > > > > > > > > Thank you, I agree this is very important. Does anyone know a 
> > > > > > > > > similar tool for python and go?
> > > > > > > > 
> > > > > > > > >
> > > > > > > > 
> > > > > > > > > On Thu, Feb 28, 2019 at 8:26 AM Etienne Chauchot 
> > &

Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-26 Thread Etienne Chauchot
Hi,
Thanks for all your work and patience Andrew !

PS: as a side note, there were 5 binding votes (I voted +1)

Etienne

Le jeudi 25 avril 2019 à 11:16 -0700, Andrew Pilloud a écrit :
> I reran the Nexmark tests, each runner passed. I compared the numbers
> on the direct runner to the dashboard and they are where they should
> be.
> 
> With that, I'm happy to announce that we have unanimously approved this 
> release.
> 
> There are 8 approving votes, 4 of which are binding:
> * Jean-Baptiste Onofré
> * Lukasz Cwik
> * Maximilian Michels
> * Ahmet Altay
> 
> There are no disapproving votes.
> 
> Thanks everyone!
> 


Re: CVE audit gradle plugin

2019-04-24 Thread Etienne Chauchot
Hi all,

FYI I just submitted a PR [1] to add the CVE audit plugin to the build 
as an optional task: gradlew audit --info.
[1] https://github.com/apache/beam/pull/8388
Etienne
Le mardi 23 avril 2019 à 17:25 +0200, Etienne Chauchot a écrit :
> Hi,
> should I merge my branch 
> https://github.com/echauchot/beam/tree/cve_audit_plugin to master to include 
> this tool in
> the build system then? It will not fail the build but will add an audit task to it.
> Etienne
> Le vendredi 19 avril 2019 à 10:54 -0700, Lukasz Cwik a écrit :
> >  Common Vulnerabilities and Exposures (CVE)
> > 
> > On Fri, Apr 19, 2019 at 10:33 AM Robert Burke  wrote:
> > > Ah! What's CVE stand for then?
> > > 
> > > Re the PR: Sadly, it's more complicated than that, which I'll explain in 
> > > the PR. Otherwise it would have been done
> > > already. It's not too bad if the time is put in though.
> > > On Fri, 19 Apr 2019 at 10:17, Lukasz Cwik  wrote:
> > > > Robert, I believe what is being suggested is a tool that integrates 
> > > > into CVE reports automatically and tells us
> > > > if we have a dependency with a security issue (not just whether there 
> > > > is a newer version). Also, there is a
> > > > sweet draft PR to add Go modules[1].
> > > > 1: https://github.com/apache/beam/pull/8354
> > > > On Fri, Apr 19, 2019 at 10:12 AM Robert Burke  
> > > > wrote:
> > > > > If we move to Go Modules, the go.mod file specifies direct 
> > > > > dependencies and versions, and the go.sum file
> > > > > includes checksums of the full transitive set of dependencies. 
> > > > > There's likely going to be a tool for detecting
> > > > > if an update is possible, if one doesn't exist in the go tooling 
> > > > > already.
> > > > > On Fri, 19 Apr 2019 at 09:44, Lukasz Cwik  wrote:
> > > > > > This seems worthwhile IMO.
> > > > > > Ahmet, Pyup[1] is free for open source projects and has an API that 
> > > > > > allows for dependency checking. They can
> > > > > > scan Github repos automatically it seems but it may not be 
> > > > > > compatible with how Apache permissions with
> > > > > > Github work. I'm not sure if there is such a thing for Go.
> > > > > > 
> > > > > > 1: https://pyup.io/
> > > > > > 
> > > > > > On Fri, Apr 19, 2019 at 2:31 AM Ismaël Mejía  
> > > > > > wrote:
> > > > > > > I want to bring this subject back, any chance we can get this 
> > > > > > > running
> > > > > > > 
> > > > > > > in our main repo maybe on a weekly basis like we do for the 
> > > > > > > dependency
> > > > > > > 
> > > > > > > reports. It looks totally worth it.
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > On Fri, Mar 1, 2019 at 2:05 AM Ahmet Altay  
> > > > > > > wrote:
> > > > > > > 
> > > > > > > >
> > > > > > > 
> > > > > > > > Thank you, I agree this is very important. Does anyone know a 
> > > > > > > > similar tool for python and go?
> > > > > > > 
> > > > > > > >
> > > > > > > 
> > > > > > > > On Thu, Feb 28, 2019 at 8:26 AM Etienne Chauchot 
> > > > > > > >  wrote:
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> Hi guys,
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> I came by this [1] gradle plugin that is a client to the 
> > > > > > > >> Sonatype OSS Index CVE database.
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> I have set it up here in a branch [2], though the cache is not 
> > > > > > > >> configured and the number of requests is
> > > > > > > limited. It can be run with "gradle --info audit"
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> It could be nice to have something like this to track the CVEs 
> > > > > > > >> in the libs we use. I know we have been
> > > > > > > spammed by libs upgrade automatic requests in the past but CVE 
> > > > > > > are more important IMHO.
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> This plugin is in BSD-3-Clause which is compatible with Apache 
> > > > > > > >> V2 licence [3]
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> WDYT ?
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> Etienne
> > > > > > > 
> > > > > > > >>
> > > > > > > 
> > > > > > > >> [1] https://github.com/OSSIndex/ossindex-gradle-plugin
> > > > > > > 
> > > > > > > >> [2] https://github.com/echauchot/beam/tree/cve_audit_plugin
> > > > > > > 
> > > > > > > >> [3] https://www.apache.org/legal/resolved.html
> > > > > > > 


Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-24 Thread Etienne Chauchot
Reuven, 
Nexmark tests are indeed run as PostCommits (each commit on master). I guess we 
have been flooded with jenkins
notification emails.
Etienne
Le mardi 23 avril 2019 à 15:24 -0700, Reuven Lax a écrit :
> I mistakenly though that Java PostCommit would run these tests, and I merged 
> based on PostCommit passing. That's how
> the bug got into master.
> On Tue, Apr 23, 2019 at 3:21 PM Kenneth Knowles  wrote:
> > What can we do to make this part of day-to-day workflow instead of finding 
> > out during release validation? Was this
> > just a failing test that was missed?
> > Kenn
> > On Tue, Apr 23, 2019 at 3:02 PM Andrew Pilloud  wrote:
> > > It looks like Java Nexmark tests are on the validation sheet but we've 
> > > missed it the last few releases. Thanks for
> > > checking it Etienne! Does the current release process require everything 
> > > to be tested before making the release
> > > final?
> > > I fully agree with you on point 2. All of these issues were in RC1 and 
> > > could have been fixed for RC2.
> > > 
> > > Andrew
> > > On Tue, Apr 23, 2019 at 2:58 PM Ahmet Altay  wrote:
> > > > Thank you Andrew. I will suggest two improvements to the release 
> > > > process:
> > > > 1. We can include benchmarks in the 
> > > > validation sheet ("Apache Beam Release Acceptance Criteria"). They are 
> > > > used as part of the validation process and 
> > > > we can ensure that we check those for each release.
> > > > 2. For RC validation, we can continue to exhaustively validate each RC 
> > > > even after the first -1 vote. Otherwise 
> > > > we end up not discovering all issues in a given RC and find them in a 
> > > > successive RC, increasing the number of 
> > > > iterations required.
> > > > 
> > > > 
> > > > On Tue, Apr 23, 2019 at 2:11 PM Andrew Pilloud  
> > > > wrote:
> > > > > Please consider the vote for RC4 canceled. I'll quickly follow up 
> > > > > with a new RC.
> > > > > Thanks for the complete testing everyone!
> > > > > Andrew
> > > > > On Tue, Apr 23, 2019 at 2:06 PM Reuven Lax  wrote:
> > > > > > -1 
> > > > > > we need to cherry pick pr/8325 and pr/8385 to fix the above issue
> > > > > > On Tue, Apr 23, 2019 at 1:48 PM Andrew Pilloud 
> > > > > >  wrote:
> > > > > > > I believe the breakage of Nexmark on Dataflow is 
> > > > > > > https://issues.apache.org/jira/browse/BEAM-7002, which
> > > > > > > went in before the release was cut. It looks like this might be a 
> > > > > > > release blocker based on the fix: 
> > > > > > > https://github.com/apache/beam/pull/8325.
> > > > > > > 
> > > > > > > The degraded performance is after the release is cut, so we 
> > > > > > > should be good there.
> > > > > > > 
> > > > > > > Andrew
> > > > > > > On Tue, Apr 23, 2019 at 8:44 AM Ismaël Mejía  
> > > > > > > wrote:
> > > > > > > > Etienne RC1 vote happened in 04/03 and there have not been any 
> > > > > > > > cherry
> > > > > > > > 
> > > > > > > > picks on the spark runner afterwards so if there is a commit 
> > > > > > > > that
> > > > > > > > 
> > > > > > > > degraded performance around 04/10 it is not part of the release 
> > > > > > > > we are
> > > > > > > > 
> > > > > > > > voting, so please consider reverting your -1.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > However the issue you are reporting looks important, from a 
> > > > > > > > quick look
> > > > > > > > 
> > > > > > > > I am guessing it could be related to BEAM-5775 that was merged 
> > > > > > > > on
> > > > > > > > 
> > > > > > > > 12/04 however the performance regressions started happening 
> > > > > > > > since
> > > > > > > > 
> > > > > > > > 09/04 so it could be unrelated. Maybe it could be due to 
> > > > > > > > ch

Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-24 Thread Etienne Chauchot
Hi,

I agree that checking Nexmark should be a mandatory task of the release 
process; I think it is already mentioned in 
the spreadsheet. Indeed, it detects both functional and performance regressions, 
and across the whole Beam model scope. The only 
things lacking in Nexmark are with 2 runners:

- gearpump is not integrated 
because streaming pipelines (there is no batch 
mode) never end (the timeout in awaitTermination is not respected)
- the spark runner in 
streaming mode is not integrated for the 
same reason

Etienne

Le mardi 23 avril 2019 à 15:02 -0700, Andrew Pilloud a écrit :
> It looks like Java Nexmark tests are on the validation sheet but we've missed 
> it the last few releases. Thanks for
> checking it Etienne! Does the current release process require everything to 
> be tested before making the release final?
> I fully agree with you on point 2. All of these issues were in RC1 and could 
> have been fixed for RC2.
> 
> Andrew
> On Tue, Apr 23, 2019 at 2:58 PM Ahmet Altay  wrote:
> > Thank you Andrew. I will suggest two improvements to the release process:
> > 1. We can include benchmarks in the 
> > validation sheet ("Apache Beam Release Acceptance Criteria"). They are used as 
> > part of the validation process and we 
> > can ensure that we check those for each release.
> > 2. For RC validation, we can continue to exhaustively validate each RC even 
> > after the first -1 vote. Otherwise we 
> > end up not discovering all issues in a given RC and find them in a 
> > successive RC, increasing the number of 
> > iterations required.
> > 
> > 
> > On Tue, Apr 23, 2019 at 2:11 PM Andrew Pilloud  wrote:
> > > Please consider the vote for RC4 canceled. I'll quickly follow up with a 
> > > new RC.
> > > Thanks for the complete testing everyone!
> > > Andrew
> > > On Tue, Apr 23, 2019 at 2:06 PM Reuven Lax  wrote:
> > > > -1 
> > > > we need to cherry pick pr/8325 and pr/8385 to fix the above issue
> > > > On Tue, Apr 23, 2019 at 1:48 PM Andrew Pilloud  
> > > > wrote:
> > > > > I believe the breakage of Nexmark on Dataflow is 
> > > > > https://issues.apache.org/jira/browse/BEAM-7002, which went
> > > > > in before the release was cut. It looks like this might be a release 
> > > > > blocker based on the fix: 
> > > > > https://github.com/apache/beam/pull/8325.
> > > > > 
> > > > > The degraded performance is after the release is cut, so we should be 
> > > > > good there.
> > > > > 
> > > > > Andrew
> > > > > On Tue, Apr 23, 2019 at 8:44 AM Ismaël Mejía  
> > > > > wrote:
> > > > > > Etienne, the RC1 vote happened on 04/03 and there have not been any cherry picks on the Spark runner afterwards, so if there is a commit that degraded performance around 04/10 it is not part of the release we are voting on; please consider reverting your -1.
> > > > > > 
> > > > > > However, the issue you are reporting looks important. From a quick look I am guessing it could be related to BEAM-5775, which was merged on 12/04; however, the performance regressions started happening on 09/04, so it could be unrelated. Maybe it is due to changes in our infrastructure, maybe the change in the workers; to be tracked, but definitely not a release blocker, at least for the Spark runner.
> > > > > > 
> > > > > > On Tue, Apr 23, 2019 at 5:12 PM Etienne Chauchot  wrote:
> > > > > > > Hi guys ,

Re: CVE audit gradle plugin

2019-04-23 Thread Etienne Chauchot
Hi, should I then merge my branch https://github.com/echauchot/beam/tree/cve_audit_plugin to master to include this tool in the build system? It will not fail the build but will add an audit task to it.
Etienne

Le vendredi 19 avril 2019 à 10:54 -0700, Lukasz Cwik a écrit :
>  Common Vulnerabilities and Exposures (CVE)
> 
> On Fri, Apr 19, 2019 at 10:33 AM Robert Burke  wrote:
> > Ah! What's CVE stand for then?
> > 
> > Re the PR: Sadly, it's more complicated than that, which I'll explain in 
> > the PR. Otherwise it would have been done
> > already. It's not too bad if the time is put in though.
> > On Fri, 19 Apr 2019 at 10:17, Lukasz Cwik  wrote:
> > > Robert, I believe what is being suggested is a tool that integrates with CVE reports automatically and tells us if we have a dependency with a security issue (not just whether there is a newer version). Also, there is a sweet draft PR to add Go modules [1].
> > > 1: https://github.com/apache/beam/pull/8354
> > > On Fri, Apr 19, 2019 at 10:12 AM Robert Burke  wrote:
> > > > If we move to Go Modules, the go.mod file specifies direct dependencies 
> > > > and versions, and the go.sum file
> > > > includes checksums of the full transitive set of dependencies. There's 
> > > > likely going to be a tool for detecting
> > > > if an update is possible, if one doesn't exist in the go tooling 
> > > > already.
> > > > On Fri, 19 Apr 2019 at 09:44, Lukasz Cwik  wrote:
> > > > > This seems worthwhile IMO.
> > > > > Ahmet, Pyup [1] is free for open source projects and has an API that allows for dependency checking. They can scan GitHub repos automatically, it seems, but it may not be compatible with how Apache permissions work with GitHub. I'm not sure if there is such a thing for Go.
> > > > > 
> > > > > 1: https://pyup.io/
> > > > > 
> > > > > On Fri, Apr 19, 2019 at 2:31 AM Ismaël Mejía  
> > > > > wrote:
> > > > > > I want to bring this subject back. Any chance we can get this running in our main repo, maybe on a weekly basis like we do for the dependency reports? It looks totally worth it.
> > > > > > 
> > > > > > On Fri, Mar 1, 2019 at 2:05 AM Ahmet Altay  wrote:
> > > > > > > Thank you, I agree this is very important. Does anyone know a similar tool for python and go?
> > > > > > > On Thu, Feb 28, 2019 at 8:26 AM Etienne Chauchot  wrote:
> > > > > > >> Hi guys,
> > > > > > >> I came by this [1] gradle plugin that is a client to the Sonatype OSS Index CVE database.
> > > > > > >> I have set it up here in a branch [2], though the cache is not configured and the number of requests is limited. It can be run with "gradle --info audit".
> > > > > > >> It could be nice to have something like this to track the CVEs in the libs we use. I know we have been spammed by automatic lib-upgrade requests in the past, but CVEs are more important IMHO.
> > > > > > >> This plugin is under the BSD-3-Clause license, which is compatible with the Apache v2 license [3].
> > > > > > >> WDYT?
> > > > > > >> Etienne
> > > > > > >> [1] https://github.com/OSSIndex/ossindex-gradle-plugin
> > > > > > >> [2] https://github.com/echauchot/beam/tree/cve_audit_plugin
> > > > > > >> [3] https://www.apache.org/legal/resolved.html


Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-23 Thread Etienne Chauchot
Hi guys, I will vote -1 (binding) on this RC (although the degradation predates the RC4 cut date). I took a look at the Nexmark graphs for the 3 major runners:
- There seem to be functional regressions on Dataflow: https://apache-beam-testing.appspot.com/explore?dashboard=5647201107705856 . 13 queries fail in batch mode starting on 04/17.
- There is a performance degradation (+200%) in the Spark runner starting on 04/10 for all the queries: https://apache-beam-testing.appspot.com/explore?dashboard=5138380291571712

Sorry Andrew for the added work
Etienne
Le lundi 22 avril 2019 à 12:21 -0700, Andrew Pilloud a écrit :
> I signed the wheel files and updated the build process to not require giving Travis Apache credentials. (You should probably change your password if you haven't already.)
> Andrew
> On Mon, Apr 22, 2019 at 12:18 PM Ahmet Altay  wrote:
> > +1 (binding)
> > 
> > Verified the python 2 wheel files with quick start examples.
> > On Mon, Apr 22, 2019 at 11:26 AM Ahmet Altay  wrote:
> > > I built the wheel files. They are in the usual place along with other 
> > > python artifacts. I will test them a bit and
> > > update here. Could someone else please try the wheel files as well?
> > > Andrew, could you sign and hash the wheel files? 
> > > On Mon, Apr 22, 2019 at 10:11 AM Ahmet Altay  wrote:
> > > > I verified:
> > > > - signatures and hashes
> > > > - the Python streaming quickstart guide
> > > > 
> > > > I would like to verify the wheel files before voting. Please let us 
> > > > know when they are ready. Also, if you need
> > > > help with building wheel files I can help/build.
> > > > 
> > > > Ahmet
> > > > On Mon, Apr 22, 2019 at 3:33 AM Maximilian Michels  wrote:
> > > > > +1 (binding)
> > > > > 
> > > > > Found a minor bug while testing, but not a blocker: 
> > > > > https://jira.apache.org/jira/browse/BEAM-7128
> > > > > 
> > > > > Thanks,
> > > > > Max
> > > > > 
> > > > > On 20.04.19 23:02, Pablo Estrada wrote:
> > > > > > +1
> > > > > > Ran SQL postcommit, and Dataflow Portability Java validatesrunner tests.
> > > > > > -P.
> > > > > > 
> > > > > > On Wed, Apr 17, 2019 at 1:38 AM Jean-Baptiste Onofré  wrote:
> > > > > > +1 (binding)
> > > > > > Quickly checked with beam-samples.
> > > > > > Regards
> > > > > > JB
> > > > > > 
> > > > > > On 16/04/2019 00:50, Andrew Pilloud wrote:
> > > > > >  > Hi everyone,
> > > > > >  >
> > > > > >  > Please review and vote on the release candidate #4 for the version 2.12.0, as follows:
> > > > > >  >
> > > > > >  > [ ] +1, Approve the release
> > > > > >  > [ ] -1, Do not approve the release (please provide specific comments)
> > > > > >  >
> > > > > >  > The complete staging area is available for your review, which includes:
> > > > > >  > * JIRA release notes [1],
> > > > > >  > * the official Apache source release to be deployed to dist.apache.org [2], which is signed with the key with fingerprint 9E7CEC0661EFD610B632C610AE8FE17F9F8AE3D4 [3],
> > > > > >  > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > > >  > * source code tag "v2.12.0-RC4" [5],
> > > > > >  > * website pull request listing the release [6], publishing the API reference manual [7], and the blog post [8].
> > > > > >  > * Java artifacts were built with Gradle/5.2.1 and OpenJDK/Oracle JDK 1.8.0_181.
> > > > > >  > * Python artifacts are deployed along with the source release to dist.apache.org [2].
> > > > > >  > * Validation sheet with a tab for the 2.12.0 release to help with validation [9].
> > > > > >  >
> > > > > >  > The vote will be open for at least 72 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes.
> > > > > >  >
> > > > > >  > Thanks,
> > > > > >  > Andrew
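As a side note for validators following this thread: checking the signatures and hashes that the votes above refer to typically comes down to a few standard commands (a sketch; the file names are illustrative, use whatever is actually staged on dist.apache.org):

    # Sketch: verify a staged source release (file names illustrative).
    gpg --import KEYS                  # KEYS file from dist.apache.org
    gpg --verify apache-beam-2.12.0-source-release.zip.asc \
        apache-beam-2.12.0-source-release.zip
    sha512sum -c apache-beam-2.12.0-source-release.zip.sha512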

Re: [PROPOSAL] commit granularity in master

2019-04-04 Thread Etienne Chauchot
Brian, it is good that you automated commit quality checks, thanks. But I don't agree with reducing the commit history of a PR to only one commit; I was only referring to meaningless commits such as fixup, checkstyle, or spotless ones. I prefer not to squash everything, and only squash meaningless commits, because:
- sometimes small related fixes to different parts (with different JIRAs) are done in the same PR, and they should stay separate commits because they deal with different problems;
- more importantly, keeping commits relatively small but still isolable makes it easier to track bugs/regressions (among other things during bisect sessions) than if the commit were big.

Etienne

Le vendredi 22 mars 2019 à 09:38 -0700, Brian Hulette a écrit :
> It sounds like maybe we've already reached a consensus that committers just 
> need to be more vigilant about squashing
> fixup commits, and hopefully automate it as much as possible. But I just 
> thought I'd also point out how the arrow
> project handles this as a point of reference, since it's kind of interesting. 
> 
> They've written a merge_arrow_pr.py script [1], which committers run to merge 
> a PR. It enforces that the PR has an
> associated JIRA in the title, squashes the entire PR into a single commit, 
> and closes the associated JIRA with the
> appropriate fix version.
> 
> As a result, the commit granularity is equal to the granularity of PRs, JIRAs 
> are always linked to PRs, and the commit
> history on master is completely linear (they also have to force push master 
> after releases in order to maintain this,
> which is the subject of much consternation and debate).
> 
> The simplicity of 1 PR = 1 commit is a nice way to avoid the manual 
> intervention required to squash fixup commits and
> enforce that every commit has passed CI, but it does have down-sides as 
> Etienne already pointed out.
> 
> Brian
> 
> [1] https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py
> 
> 
> On Fri, Mar 22, 2019 at 7:46 AM Mikhail Gryzykhin 
>  wrote:
> > I agree with keeping the history clean.
> > Although, small commits like "address PR comments" are useful during the review process: they allow the reviewer to see only the new changes, not review the whole diff again. Best to squash them before/on merge though.
> > On Fri, Mar 22, 2019, 07:34 Ismaël Mejía  wrote:
> > > > I like the extra delimitation the brackets give, worth the two extra characters to me. More importantly, it's nice to have consistency, and the only way to be consistent with the past is to leave them there.
> > > 
> > > My point with the brackets is that we are getting close to 10K issues, so we will then have 3 chars less; it probably does not change much, but still.
> > > 
> > > On Fri, Mar 22, 2019 at 3:19 PM Robert Bradshaw  wrote:
> > > > On Fri, Mar 22, 2019 at 3:02 PM Ismaël Mejía  wrote:
> > > > >
> > > > > It is good to remind committers of their responsibility for the 'cleanliness' of the merged code. GitHub sadly does not have an easy interface to do this and it has to be done manually in many cases; sadly I have seen many committers just merge code with multiple 'fixup' style commits by clicking GitHub's merge button. Maybe it is time to find a way to automatically detect these cases and disallow the merge, or maybe we should reconsider the policy altogether if there are people who don't see the value of this.
> > > >
> > > > I agree about keeping our history clean and useful, and think those four points summarize things well (but a clarification on fixup commits would be good).
> > > >
> > > > +1 to an automated check that detects extraneous commits - anything the person hitting the merge button would easily see before doing the merge.
> > > >
> > > > > I would like to propose a small modification to th
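Riffing on the automated-check idea above, a first approximation is a pre-merge script that flags noisy commit subjects on a PR branch (a sketch; the subject patterns and the base branch are assumptions, not an agreed policy):

    # Sketch: list commits on a PR branch whose subject looks like review noise.
    git log --format='%h %s' origin/master..HEAD \
      | grep -iE 'fixup|address (review )?comments|spotless|checkstyle' \
      && echo 'Found fixup-style commits: please squash before merging.'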

Re: [PROPOSAL] commit granularity in master

2019-03-22 Thread Etienne Chauchot
Thanks Alexey for pointing this out. I did not know about these 4 points in the guide; I agree with them as well. I would just add: "Avoid keeping formatting commits such as checkstyle or spotless fixes in the history." If it is OK, I'll submit a PR to add this point.

Le vendredi 22 mars 2019 à 11:33 +0100, Alexey Romanenko a écrit :
> Etienne, thanks for bringing this topic.
> 
> I think it was already discussed several times before, and we finally came to what we have in the current Committer guide, "Granularity of changes" [1].
> Personally, I completely agree with the 4 rules presented there. The main concern is that all committers should follow them as well; otherwise we still sometimes end up with a bunch of small commits with inexpressive messages (I believe they were added during the review process and were not squashed before merging).
> 
> In my opinion, the most important rule is that every commit should be atomic in terms of added/fixed functionality, and rolling it back should not break the master branch.
> 
> [1] 
> https://beam.apache.org/contribute/committer-guide/#pull-request-review-objectives
> 
> 
> > On 22 Mar 2019, at 10:16, Etienne Chauchot  wrote:
> > 
> > Hi all,
> > It has already been discussed partially, but I would like us to agree on the commit granularity that we want in our history.
> > Some features were squashed to only one commit, which seems too coarse to me for a big feature.
> > On the contrary, I see PRs with very small commits such as "apply spotless" or "fix checkstyle".
> > 
> > IMHO a good commit size is an isolable portion of a feature, such as "implement Read part of Kudu IO" or "reduce concurrency in Test A". Such a granularity makes it easy to isolate problems (with git bisect, for example) and to roll back only a part if necessary.
> > WDYT about:
> > - squashing non-meaningful commits such as "apply review comments" (and rather stating what they do and grouping them if needed), or "apply spotless" or "fix checkstyle"
> > - trying to stick to a commit size as described above
> > 
> > => and of course updating the contribution guide at the end?
> > 
> > Best
> > Etienne


[PROPOSAL] commit granularity in master

2019-03-22 Thread Etienne Chauchot
Hi all,
It has already been discussed partially, but I would like us to agree on the commit granularity that we want in our history.
Some features were squashed to only one commit, which seems too coarse to me for a big feature.
On the contrary, I see PRs with very small commits such as "apply spotless" or "fix checkstyle".

IMHO a good commit size is an isolable portion of a feature, such as "implement Read part of Kudu IO" or "reduce concurrency in Test A". Such a granularity makes it easy to isolate problems (with git bisect, for example) and to roll back only a part if necessary.
WDYT about:
- squashing non-meaningful commits such as "apply review comments" (and rather stating what they do and grouping them if needed), or "apply spotless" or "fix checkstyle"
- trying to stick to a commit size as described above

=> and of course updating the contribution guide at the end?

Best
Etienne
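For contributors, the squashing proposed above maps naturally onto git's fixup workflow, roughly as follows (a sketch, assuming each review fix targets an earlier commit on the branch):

    # Mark a review fix as a fixup of the commit it amends.
    git commit --fixup=<sha-of-commit-being-fixed>
    # Before merge, fold all fixups into their target commits.
    git rebase -i --autosquash origin/master
    # Update the PR branch.
    git push --force-with-lease origin <my-feature-branch>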


[spark runner dataset POC] WordCount works!

2019-03-21 Thread Etienne Chauchot
Hi guys,

We are glad to announce that the spark runner POC that was re-written from 
scratch using the structured-streaming
framework and the dataset API can now run WordCount !

It is still embryonic. For now it only runs in batch mode and there is no fancy 
stuff like state, timer, SDF, metrics, 
... but it is still a major step forward ! 

Streaming support work has just started.

You can find the branch here:  
https://github.com/apache/beam/tree/spark-runner_structured-streaming

Enjoy,

Etienne
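For context, the kind of pipeline the POC can now execute is the classic Beam WordCount. A minimal sketch of such a pipeline is below (this is illustrative user code, not the POC's own test code; the input/output paths are placeholders):

    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class MinimalWordCount {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("/tmp/input.txt"))
            // Split each line into words.
            .apply(FlatMapElements.into(TypeDescriptors.strings())
                .via(line -> Arrays.asList(line.split("[^\\p{L}]+"))))
            // Count occurrences of each word.
            .apply(Count.perElement())
            // Format each word/count pair as a line of text.
            .apply(MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
            .apply("WriteCounts", TextIO.write().to("/tmp/wordcounts"));

        p.run().waitUntilFinish();
      }
    }

Selecting the new runner would then only be a matter of passing the appropriate --runner pipeline option once it is registered.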




Re: [PROPOSAL] Preparing for Beam 2.12.0 release

2019-03-18 Thread Etienne Chauchot
Sounds great, thanks for volunteering to do the release.
Etienne
Le mercredi 13 mars 2019 à 12:08 -0700, Andrew Pilloud a écrit :
> Hello Beam community!
> Beam 2.12 release branch cut date is March 27th according to the release 
> calendar [1]. I would like to volunteer
> myself to do this release. I intend to cut the branch as planned on March 
> 27th and cherrypick fixes if needed.
> 
> If you have release-blocking issues for 2.12, please mark their "Fix Version" as 2.12.0. Kenn created a 2.13.0 release in JIRA in case you would like to move any non-blocking issues to that version.
> 
> Does this sound reasonable?
> 
> Andrew
> 
> [1] 
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles


Re: JIRA hygiene

2019-03-18 Thread Etienne Chauchot
Well, I agree, but a contributor might not have the rights on JIRA and, more importantly, might be unable to choose a target version for the ticket. Targeting the ticket at the correct version requires knowing the release cut date, which is not the date of the commit that the release tag points to in the case of cherry picks. That seems a bit complicated for a one-time contributor. This is why I proposed that the committer/reviewer do the JIRA closing.
Etienne

Le mercredi 13 mars 2019 à 17:08 -0700, Ahmet Altay a écrit :
> I agree with defining the workflow for closing JIRAs. Would not contributor 
> be in a better position to close JIRAs or
> keep it open? It would make sense for the committer to ask about this but I 
> think contributor (presumably the person
> who is the assignee of the JIRA) could be the responsible party for updating 
> their JIRAs. On the other hand, I
> understand the argument that committer could do this at the time of merging 
> and fill a gap in the process.
> On Wed, Mar 13, 2019 at 4:59 PM Michael Luckey  wrote:
> > Hi,
> > 
> > definitely +1 to properly establishing a workflow to maintain JIRA status. Naively I'd think the reporter should close the issue, as she is the one to confirm whether the reported issue is fixed or not. But for obvious reasons that will not work here, so, although it puts another burden on committers, you are probably right that the committer is the best choice to ensure that the ticket gets promoted, whether it is resolved or it is clarified what's still to be done.
> > Looking at the current state, we seem to have tons of issues with merged PRs, which makes it unnecessarily difficult for anyone trying to find an existing JIRA issue to work on to decide whether to look into one or not. From my personal experience, it is somewhat frustrating to go through open issues, select one and, after investing some (or even more) time to first understand the problem and then the PR, realise that nothing is left to be done anymore. Or not to know what was left out and for what reason. But of course, this is another issue which we definitely need to invest time into - Kenn already asked for our support here.
> > 
> > thx,
> > 
> > michel
> > On Tue, Mar 12, 2019 at 11:30 AM Etienne Chauchot  
> > wrote:
> > > Hi Thomas,
> > > I agree, the committer that merges a PR should close the ticket. And, if 
> > > needed, he could discuss with the author
> > > (inside the PR) to assess if the PR covers the ticket scope.
> > > This is the rule I apply to myself when I merge a PR (even though it has happened that I forgot to close one or two tickets :) ).
> > > Etienne
> > > 
> > > Le lundi 11 mars 2019 à 14:17 -0700, Thomas Weise a écrit :
> > > > JIRA probably deserves a separate discussion. It is messy. We also have examples of tickets being referenced by users that were not closed, although the feature was long implemented or the issue fixed.
> > > > 
> > > > There is no clear ownership in our workflow.
> > > > 
> > > > A while ago I proposed in another context to make resolving JIRA part 
> > > > of committer duty. I would like to bring
> > > > this up for discussion again:
> > > > 
> > > > https://github.com/apache/beam/pull/7129#discussion_r236405202
> > > > 
> > > > Thomas
> > > > 
> > > > 
> > > > On Mon, Mar 11, 2019 at 1:47 PM Ahmet Altay  wrote:
> > > > > I agree this is a good idea. I used the same technique for 2.11 blog 
> > > > > post (JIRA release notes -> editorialized
> > > > > list + diffed the dependencies).
> > > > > 
> > > > > On Mon, Mar 11, 2019 at 1:40 PM Kenneth Knowles  
> > > > > wrote:
> > > > > > That is a good idea. The blog post is probably the main avenue 
> > > > > > where folks will find out about new features
> > > > > > or big fixes.
> > > > > > When I did 2.10.0 I just used the automated Jira release notes and 
> > > > > > pulled out significant things based on my
> > > > > > judgment. I would also suggest that our Jira hygiene could be 
> > > > > > significantly improved to make this process
> > > > > > more effective.
> > > > > > 
> > > > > 
> > > > > +1 to improving JIRA

Re: New Contributor

2019-03-13 Thread Etienne Chauchot
Welcome ! 
Etienne
Le mardi 05 mars 2019 à 16:28 -0800, Daniel Oliveira a écrit :
> Welcome to Beam Boris!
> 
> On Tue, Mar 5, 2019 at 2:03 PM Mikhail Gryzykhin  wrote:
> > Welcome to the community!
> > --Mikhail
> > Have feedback? 
> > 
> > 
> > On Tue, Mar 5, 2019 at 1:53 PM Ruoyun Huang  wrote:
> > > Welcome Boris! 
> > > 
> > > On Tue, Mar 5, 2019 at 1:34 PM Ahmet Altay  wrote:
> > > > Welcome Boris!
> > > > 
> > > > On Mon, Mar 4, 2019 at 5:40 PM Ismaël Mejía  wrote:
> > > > > Done, welcome!
> > > > > 
> > > > > 
> > > > > 
> > > > > On Tue, Mar 5, 2019 at 1:25 AM Boris Shkolnik  wrote:
> > > > > > Hi,
> > > > > > My name is Boris Shkolnik. I am a committer in the Hadoop and Samza Apache projects.
> > > > > > I would like to contribute to beam.
> > > > > > Could you please add me to the beam project.
> > > > > > My user name is boryas @apache.org
> > > > > > Thanks,
> > > > > > -Boris.
> > > 
> > > 


Re: New contributor to Beam

2019-03-13 Thread Etienne Chauchot
Welcome !
Etienne
Le lundi 11 mars 2019 à 14:15 -0700, Kenneth Knowles a écrit :
> Welcome!
> 
> On Mon, Mar 11, 2019 at 12:22 PM Melissa Pashniak  
> wrote:
> > Welcome!
> > 
> > 
> > 
> > On Mon, Mar 11, 2019 at 12:16 PM Suneel Marthi  
> > wrote:
> > > Welcome Aizhamal
> > > 
> > > Sent from my iPhone
> > > On Mar 11, 2019, at 2:08 PM, Rose Nguyen  wrote:
> > > 
> > > > Welcome, Aizhamal!
> > > > 
> > > > On Mon, Mar 11, 2019 at 10:55 AM Ahmet Altay  wrote:
> > > > > Welcome!
> > > > > On Fri, Mar 8, 2019 at 3:25 PM Ismaël Mejía  wrote:
> > > > > > Done, welcome !
> > > > > > 
> > > > > > On Fri, Mar 8, 2019 at 11:03 PM Aizhamal Nurmamat kyzy  wrote:
> > > > > > > Hello everyone!
> > > > > > > My name is Aizhamal and I would like to start contributing to Beam. Can anyone add me as a contributor for Beam's Jira issue tracker? I would like to create and assign tickets.
> > > > > > > My jira username is aizhamal.
> > > > > > > Thanks and excited to be part of this community!
> > > > > > > Aizhamal


Re: hi from DevRel land

2019-03-13 Thread Etienne Chauchot
Welcome onboard Reza
Etienne
Le mardi 12 mars 2019 à 09:10 -0700, Rose Nguyen a écrit :
> Welcome, Reza! Really excited to have you!
> 
> On Tue, Mar 12, 2019 at 9:00 AM Reza Ardeshir Rokni  wrote:
> > Thanx folks! 
> > 
> > Oppsy on the link, here it is again:
> > 
https://stackoverflow.com/questions/54422510/how-to-solve-duplicate-values-exception-when-i-create-pcollectionviewmapstring/54623618#54623618
> > 
> > On Tue, 12 Mar 2019 at 23:32, Teja MVSR  wrote:
> > > Hi Reza,
> > > I am also interested to contribute towards documentation. Please let me 
> > > know if I can be of any help.
> > > 
> > > Thanks and Regards,
> > > Teja.
> > > On Tue, Mar 12, 2019, 11:30 AM Kenneth Knowles  wrote:
> > > > This is great news.
> > > > 
> > > > For the benefit of the list, I want to say how nice it has been when I 
> > > > have had a chance to work with you. I've
> > > > learned a great deal real and complex use cases through those 
> > > > opportunities. I'm really excited that you'll be
> > > > helping out Beam in this new role.
> > > > 
> > > > Kenn
> > > > On Tue, Mar 12, 2019 at 7:21 AM Valentyn Tymofieiev 
> > > >  wrote:
> > > > > Hi Reza!
> > > > > Welcome to Beam. Very nice to have you onboard. Btw, the link seems 
> > > > > broken.
> > > > > 
> > > > > Thanks,
> > > > > Valentyn
> > > > > On Tue, Mar 12, 2019 at 6:04 AM Reza Ardeshir Rokni 
> > > > >  wrote:
> > > > > > Hi Folks,
> > > > > > Just wanted to say hi to the good folks in the Beam community in my 
> > > > > > new capacity as a Developer advocate for
> > > > > > Beam/Dataflow @ Google. :-)
> > > > > > 
> > > > > > At the moment I am working on a couple of blogs around the Timer 
> > > > > > and State API as well as some work on
> > > > > > general patterns that I hope to contribute as documentation to the 
> > > > > > Beam site. An example of the patterns can
> > > > > > be seen here:  LINK
> > > > > > 
> > > > > > Hope to be adding many more in 2019 and really looking forward to 
> > > > > > being able to contribute to Beam in anyway
> > > > > > that I can!
> > > > > > 
> > > > > > Cheers
> > > > > > Reza
> > > > > > 
> > > > > > 
> 
> 


Re: Apache Beam Newsletter - February/March 2019

2019-03-13 Thread Etienne Chauchot
tion, rather than editing
> > > > > > > > > > > > the past 'published' newsletter.  Put another way, save 
> > > > > > > > > > > > editing the past for corrections (typos,
> > > > > > > > > > > > things being incorrect).  Else, I imagine that I'm 
> > > > > > > > > > > > unlikely to catch a great announcement that
> > > > > > > > > > > > warranted being in the newsletter in the first place.  
> > > > > > > > > > > > This certainly works better with a
> > > > > > > > > > > > regular/frequent release cadence, like we arrived at 
> > > > > > > > > > > > for version releases (then, if something
> > > > > > > > > > > > misses one cut, it is not too big a deal, as the next 
> > > > > > > > > > > > release is coming soon).  
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > On Wed, Mar 6, 2019 at 12:50 PM Melissa Pashniak 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > For step #2 (publishing onto the website), I think it 
> > > > > > > > > > > > > would be good to stay consistent with
> > > > > > > > > > > > > our existing workflows if possible. Rather than using 
> > > > > > > > > > > > > an external tool, what about: 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > After a google doc newsletter draft is ready, convert 
> > > > > > > > > > > > > it into a standard markdown file and put
> > > > > > > > > > > > > it into our GitHub repo, perhaps in a new newsletter 
> > > > > > > > > > > > > directory in the website community
> > > > > > > > > > > > > directory [1]. These would be listed for browsing on 
> > > > > > > > > > > > > a Newsletters page as mentioned in step
> > > > > > > > > > > > > #4. People can then just open a PR to add missing 
> > > > > > > > > > > > > things to the pages later, and the
> > > > > > > > > > > > > newsletter will be automatically updated on the 
> > > > > > > > > > > > > website through our standard website workflow.
> > > > > > > > > > > > > It also avoids the potential issue of the source 
> > > > > > > > > > > > > google docs disappearing in the future, as
> > > > > > > > > > > > > they are stored in a community location.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > [1] 
> > > > > > > > > > > > > https://github.com/apache/beam/tree/master/website/src/community
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Wed, Mar 6, 2019 at 10:36 AM Rose Nguyen 
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > I think that would be a great idea to change 
> > > > > > > > > > > > > > formats to help with distribution. I'm open to
> > > > > > > > > > > > > > suggestions! I'm currently using a Google doc to 
> > > > > > > > > > > > > > collect and edit, then copy/paste sending
> > > > > > > > > > > > > > the newsletter out directly, based on an 
> > > > > > > > > > > > > > interpretation of this discussion.
> > > > > > > > > > > > > > How about this doc->website->Beam site workflow?:
> > > > > > > > > > > > > > The same usual newsletter [CALL FOR ITEMS] where 

Re: JIRA hygiene

2019-03-12 Thread Etienne Chauchot
Hi Thomas,
I agree, the committer that merges a PR should close the ticket. And, if 
needed, he could discuss with the author
(inside the PR) to assess if the PR covers the ticket scope.
This is the rule I apply to myself when I merge a PR (even though it has happened that I forgot to close one or two tickets :) ).
Etienne

Le lundi 11 mars 2019 à 14:17 -0700, Thomas Weise a écrit :
> JIRA probably deserves a separate discussion. It is messy. We also have examples of tickets being referenced by users that were not closed, although the feature was long implemented or the issue fixed.
> 
> There is no clear ownership in our workflow.
> 
> A while ago I proposed in another context to make resolving JIRA part of 
> committer duty. I would like to bring this up
> for discussion again:
> 
> https://github.com/apache/beam/pull/7129#discussion_r236405202
> 
> Thomas
> 
> 
> On Mon, Mar 11, 2019 at 1:47 PM Ahmet Altay  wrote:
> > I agree this is a good idea. I used the same technique for 2.11 blog post 
> > (JIRA release notes -> editorialized list
> > + diffed the dependencies).
> > 
> > On Mon, Mar 11, 2019 at 1:40 PM Kenneth Knowles  wrote:
> > > That is a good idea. The blog post is probably the main avenue where 
> > > folks will find out about new features or big
> > > fixes.
> > > When I did 2.10.0 I just used the automated Jira release notes and pulled 
> > > out significant things based on my
> > > judgment. I would also suggest that our Jira hygiene could be 
> > > significantly improved to make this process more
> > > effective.
> > > 
> > 
> > +1 to improving JIRA notes as well. Oftentimes issues are closed with no real comment on what happened, whether it was resolved or not. It becomes an exercise in reading the linked PRs to figure out what happened.
> >  
> > > Kenn
> > > On Mon, Mar 11, 2019 at 1:04 PM Thomas Weise  wrote:
> > > > Ahmet, thanks managing the release!
> > > > I have a suggestion (not specific to only this release): 
> > > > 
> > > > The release blogs could be more useful to users. In this case, we have 
> > > > a long list of dependency updates on the
> > > > top, but probably the improvements and features section should come 
> > > > first. I was also very surprised to find
> > > > "Portable Flink runner support for running cross-language transforms." 
> > > > mentioned, since that is only being
> > > > worked on now. On the other hand, there are probably items that we miss.
> > > > 
> > > > Since this can only be addressed by more eyes, I suggest that going 
> > > > forward the blog pull request is included
> > > > and reviewed as part of the release vote.
> > > > 
> > > > Also, we should make announcing the release on Twitter part of the 
> > > > process.
> > 
> > This is actually part of the release process 
> > (https://beam.apache.org/contribute/release-guide/#social-media). I
> > missed it for 2.11. I will send an announcement on Twitter shortly. 
> > > > Thanks,
> > > > Thomas
> > > > 
> > > > 
> > > > On Mon, Mar 11, 2019 at 10:46 AM Ahmet Altay  wrote:
> > > > > I updated the JIRAs for these two PRs to set the fix version 
> > > > > correctly as 2.12.0. That should fix the release
> > > > > notes issue.
> > > > > 
> > > > > On Mon, Mar 11, 2019 at 10:44 AM Ahmet Altay  wrote:
> > > > > > Hi Etienne,
> > > > > > 
> > > > > > I cut the release branch on 2/14 at [1] (on Feb 14, 2019, 3:52 PM 
> > > > > > PST -- github timestamp). Release tag, as
> > > > > > you pointed out, points to a commit on Feb 25, 2019 11:48 PM PST. 
> > > > > > And that is a commit on the release
> > > > > > branch. 
> > > > > > 
> > > > > > After cutting the release branch, I only merged cherry picks from 
> > > > > > master to the release branch if a JIRA was
> > > > > > tagged as a release blocker and there was a PR to fix that specific 
> > > > > > issue. In case of these two PRs, they
> > > > > > were merged at Feb 20 and Feb 18 respectively. They were not 
> > > > > > included in the branch cut and I did not cherry
> > > > > > pick them either. I apologize if I missed a request to cherry pick 
> > > > > > these PRs.
>

Re: [VOTE] Release 2.11.0, release candidate #2

2019-03-12 Thread Etienne Chauchot
Ahmet, yes, in a comment on this ticket, https://issues.apache.org/jira/browse/BEAM-6292, I supposed there were cherry picks. Thanks for the confirmation!
No problem about not having cherry-picked these PRs; they were not release blockers. It is just that these features were announced for 2.11 in the release notes. My bad, I did not know about the release cut date of 02/14; I should have targeted them at 2.12 then.
Thanks
Etienne
Le lundi 11 mars 2019 à 10:44 -0700, Ahmet Altay a écrit :
> Hi Etienne,
> 
> I cut the release branch on 2/14 at [1] (on Feb 14, 2019, 3:52 PM PST -- 
> github timestamp). Release tag, as you
> pointed out, points to a commit on Feb 25, 2019 11:48 PM PST. And that is a 
> commit on the release branch. 
> 
> After cutting the release branch, I only merged cherry picks from master to 
> the release branch if a JIRA was tagged as
> a release blocker and there was a PR to fix that specific issue. In case of 
> these two PRs, they were merged at Feb 20
> and Feb 18 respectively. They were not included in the branch cut and I did 
> not cherry pick them either. I apologize
> if I missed a request to cherry pick these PRs.
> 
> Does this answer your question?
> 
> Ahmet
> 
> [1] 
> https://github.com/apache/beam/commit/a103edafba569b2fd185b79adffd91aaacb790f0
> On Mon, Mar 11, 2019 at 1:50 AM Etienne Chauchot  wrote:
> > @Ahmet sorry I did not have time to check the 2.11 release, but a fellow Beam contributor drew my attention to something: the 2.11 release tag points to a commit of 02/26, and this [1] PR was merged on 02/20 and that [2] PR was merged on 02/18. So both commits should be in the released code, but they are not.
> > [1] https://github.com/apache/beam/pull/7348
> > [2] https://github.com/apache/beam/pull/7751
> > So at least for those 2 features the release notes do not match the content of the release. Is there a real problem or did I miss something?
> > Etienne
> > Etienne
> > Le lundi 04 mars 2019 à 11:42 -0800, Ahmet Altay a écrit :
> > > Thank you for the additional votes and validations.
> > > Update: Binaries are pushed. Website updates are blocked on an issue that is preventing beam-site changes from being synced to the Beam website (INFRA-17953). I am waiting for that to be resolved before sending an announcement.
> > > On Mon, Mar 4, 2019 at 3:00 AM Robert Bradshaw  
> > > wrote:
> > > > I see the vote has passed, but +1 (binding) from me as well.
> > > > 
> > > > On Mon, Mar 4, 2019 at 11:51 AM Jean-Baptiste Onofré  wrote:
> > > > > +1 (binding)
> > > > > Tested with beam-samples.
> > > > > Regards
> > > > > JB
> > > > > 
> > > > > On 26/02/2019 10:40, Ahmet Altay wrote:
> > > > > > Hi everyone,
> > > > > >
> > > > > > Please review and vote on the release candidate #2 for the version 2.11.0, as follows:
> > > > > >
> > > > > > [ ] +1, Approve the release
> > > > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > > > >
> > > > > > The complete staging area is available for your review, which includes:
> > > > > > * JIRA release notes [1],
> > > > > > * the official Apache source release to be deployed to dist.apache.org [2], which is signed with the key with fingerprint 64B84A5AD91F9C20F5E9D9A7D62E71416096FA00 [3],
> > > > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > > > * source code tag "v2.11.0-RC2" [5],
> > > > > > * website pull request listing the release [6] and publishing the API reference manual [7].
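Questions like the one above, whether a given commit is contained in a release tag, can be answered mechanically with git (a sketch, with placeholder commit ids):

    # List the tags that contain a given commit.
    git tag --contains <commit-sha>
    # Or test ancestry against one release tag explicitly.
    git merge-base --is-ancestor <commit-sha> v2.11.0-RC2 \
      && echo contained || echo not contained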

Re: [ANNOUNCE] New committer announcement: Raghu Angadi

2019-03-11 Thread Etienne Chauchot
Congrats ! Well deserved
Etienne
Le lundi 11 mars 2019 à 13:22 +0100, Alexey Romanenko a écrit :
> My congratulations, Raghu!
> 
> > On 8 Mar 2019, at 10:39, Łukasz Gajowy  wrote:
> > 
> > Congratulations! :)
> > pt., 8 mar 2019 o 10:16 Gleb Kanterov  napisał(a):
> > > Congratulations!
> > > On Thu, Mar 7, 2019 at 11:52 PM Michael Luckey  
> > > wrote:
> > > > Congrats Raghu!
> > > > On Thu, Mar 7, 2019 at 8:06 PM Mark Liu  wrote:
> > > > > Congrats!
> > > > > On Thu, Mar 7, 2019 at 10:45 AM Rui Wang  wrote:
> > > > > > Congrats Raghu!
> > > > > > 
> > > > > > -Rui
> > > > > > On Thu, Mar 7, 2019 at 10:22 AM Thomas Weise  
> > > > > > wrote:
> > > > > > > Congrats!
> > > > > > > 
> > > > > > > On Thu, Mar 7, 2019 at 10:11 AM Tim Robertson 
> > > > > > >  wrote:
> > > > > > > > Congrats Raghu
> > > > > > > > 
> > > > > > > > On Thu, Mar 7, 2019 at 7:09 PM Ahmet Altay  
> > > > > > > > wrote:
> > > > > > > > > Congratulations!
> > > > > > > > > 
> > > > > > > > > On Thu, Mar 7, 2019 at 10:08 AM Ruoyun Huang 
> > > > > > > > >  wrote:
> > > > > > > > > > Thank you Raghu for your contribution! 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > On Thu, Mar 7, 2019 at 9:58 AM Connell O'Callaghan 
> > > > > > > > > >  wrote:
> > > > > > > > > > > Congratulation Raghu!!! Thank you for sharing Kenn!!! 
> > > > > > > > > > > On Thu, Mar 7, 2019 at 9:55 AM Ismaël Mejía 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > > Congrats !
> > > > > > > > > > > > Le jeu. 7 mars 2019 à 17:09, Aizhamal Nurmamat kyzy 
> > > > > > > > > > > >  a écrit :
> > > > > > > > > > > > > Congratulations, Raghu!!!
> > > > > > > > > > > > > On Thu, Mar 7, 2019 at 08:07 Kenneth Knowles 
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > Please join me and the rest of the Beam PMC in 
> > > > > > > > > > > > > > welcoming a new committer: Raghu Angadi
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Raghu has been contributing to Beam since early 
> > > > > > > > > > > > > > 2016! He has continuously improved KafkaIO
> > > > > > > > > > > > > > and supported on the user@ list but his community 
> > > > > > > > > > > > > > contributions are even more extensive,
> > > > > > > > > > > > > > including reviews, dev@ list discussions, 
> > > > > > > > > > > > > > improvements and ideas across SqsIO, FileIO,
> > > > > > > > > > > > > > PubsubIO, and the Dataflow and Samza runners. In 
> > > > > > > > > > > > > > consideration of Raghu's contributions, the
> > > > > > > > > > > > > > Beam PMC trusts Raghu with the responsibilities of 
> > > > > > > > > > > > > > a Beam committer [1].
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Thank you, Raghu, for your contributions.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Kenn
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > [1] 
> > > > > > > > > > > > > > https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
> > > > > > > > > > 
> > > > > > > > > > -- 
> > > > > > > > > > Ruoyun  Huang
> > > > > > > > > > 
> > > > > > > > > > 
> > > 
> > > -- 
> > > Cheers,Gleb


Re: [VOTE] Release 2.11.0, release candidate #2

2019-03-11 Thread Etienne Chauchot
@Ahmet sorry I did not have time to check the 2.11 release, but a fellow Beam contributor drew my attention to something: the 2.11 release tag points to a commit of 02/26, and this [1] PR was merged on 02/20 and that [2] PR was merged on 02/18. So both commits should be in the released code, but they are not.
[1] https://github.com/apache/beam/pull/7348
[2] https://github.com/apache/beam/pull/7751
So at least for those 2 features the release notes do not match the content of the release. Is there a real problem or did I miss something?
Etienne
Le lundi 04 mars 2019 à 11:42 -0800, Ahmet Altay a écrit :
> Thank you for the additional votes and validations.
> Update: Binaries are pushed. Website updates are blocked on an issue that is preventing beam-site changes from being synced to the Beam website (INFRA-17953). I am waiting for that to be resolved before sending an announcement.
> On Mon, Mar 4, 2019 at 3:00 AM Robert Bradshaw  wrote:
> > I see the vote has passed, but +1 (binding) from me as well.
> > 
> > On Mon, Mar 4, 2019 at 11:51 AM Jean-Baptiste Onofré  wrote:
> > > +1 (binding)
> > > Tested with beam-samples.
> > > Regards
> > > JB
> > > 
> > > On 26/02/2019 10:40, Ahmet Altay wrote:
> > > > Hi everyone,
> > > >
> > > > Please review and vote on the release candidate #2 for the version 2.11.0, as follows:
> > > >
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > > The complete staging area is available for your review, which includes:
> > > > * JIRA release notes [1],
> > > > * the official Apache source release to be deployed to dist.apache.org [2], which is signed with the key with fingerprint 64B84A5AD91F9C20F5E9D9A7D62E71416096FA00 [3],
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > * source code tag "v2.11.0-RC2" [5],
> > > > * website pull request listing the release [6] and publishing the API reference manual [7].
> > > > * Python artifacts are deployed along with the source release to dist.apache.org [2].
> > > > * Validation sheet with a tab for the 2.11.0 release to help with validation [8].
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes.
> > > >
> > > > Thanks,
> > > > Ahmet
> > > >
> > > > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344775
> > > > [2] https://dist.apache.org/repos/dist/dev/beam/2.11.0/
> > > > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > > > [4] https://repository.apache.org/content/repositories/orgapachebeam-1064/
> > > > [5] https://github.com/apache/beam/tree/v2.11.0-RC2
> > > > [6] https://github.com/apache/beam/pull/7924
> > > > [7] https://github.com/apache/beam-site/pull/587
> > > > [8] https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=542393513
> > > 
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com


Re: [BEAM-6759] CassandraIOTest failing in presubmit in multiple PRs

2019-03-06 Thread Etienne Chauchot
Hi guys, as I introduced the embedded Cassandra backend in the IO tests, I'll fix this issue. It is very common (as discussed) for embedded backends to cause flakiness, but it is the price to pay for more relevant tests :)
Etienne
Le lundi 04 mars 2019 à 11:28 +0100, Maximilian Michels a écrit :
> Hey Alex,
> This is a duplicate: https://issues.apache.org/jira/browse/BEAM-6722
> There is a pending PR, though we haven't agreed on merging it. Would be nice 
> to fix this.
> Cheers,Max
> On 01.03.19 20:09, Alex Amato wrote:
> https://issues.apache.org/jira/browse/BEAM-6759
> 
> Hi, I have seen this test failing in presubmit in multiple PRs, and it does not seem to be related to the changes. Any ideas why this is failing at the moment?
> CassandraIOTest - scans
> 
https://builds.apache.org/job/beam_PreCommit_Java_Commit/4586/testReport/junit/org.apache.beam.sdk.io.cassandra/CassandraIOTest/classMethod/
> https://scans.gradle.com/s/btppkeky63a5g/console-log?task=:beam-sdks-java-io-cassandra:test#L7
> java.lang.NullPointerException at
> org.cassandraunit.utils.EmbeddedCassandraServerHelper.dropKeyspacesWithNativeDriver(EmbeddedCassandraServerHelper.java
> :285) at 
> org.cassandraunit.utils.EmbeddedCassandraServerHelper.dropKeyspaces(EmbeddedCassandraServerHelper.java:281)
> at
> org.cassandraunit.utils.EmbeddedCassandraServerHelper.cleanEmbeddedCassandra(EmbeddedCassandraServerHelper.java:193)
> at 
> org.apache.beam.sdk.io.cassandra.CassandraIOTest.stopCassandra(CassandraIOTest.java:129)
>  at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at
> java.lang.reflect.Method.invoke(Method.java:498) at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>  at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>  at
> org.junit.internal.runners.statements.RunAfters.invokeMethod(RunAfters.java:46)
>  at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33) at
> org.junit.runners.ParentRunner.run(ParentRunner.java:396) at
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110)
>  at
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
>  at
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
>  at
> org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassPro
> cessor.java:62) at
> org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
>  at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at
> java.lang.reflect.Method.invoke(Method.java:498) at
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
>  at
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
>  at
> org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
>  at
> org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
>  at
> com.sun.proxy.$Proxy2.processTestClass(Unknown Source) at
> org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:118)
>  at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at
> java.lang.reflect.Method.invoke(Method.java:498) at
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
>  at
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
>  at
> org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObje
> ctConnection.java:175) at
> org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObje
> ctConnection.java:157) at 
> org.gradle.internal.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:404)
>  at
> org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
>  at
> org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
>  at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at
> java.util.concurr
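One plausible hardening of the teardown, assuming the NPE comes from cleaning up an embedded server that never fully started, is sketched below (illustrative only, not the actual fix that was merged):

    import org.cassandraunit.utils.EmbeddedCassandraServerHelper;
    import org.junit.AfterClass;

    public class CassandraIOTest {
      @AfterClass
      public static void stopCassandra() {
        try {
          EmbeddedCassandraServerHelper.cleanEmbeddedCassandra();
        } catch (NullPointerException e) {
          // The embedded server may not have been fully initialized
          // (e.g. startup failed or timed out). Swallow the cleanup
          // failure so it does not mask the real test result.
        }
      }
    }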

Re: Apache Beam Newsletter - February/March 2019

2019-03-06 Thread Etienne Chauchot
Hi, I would add to "What's been done":
Work on CassandraIO (Etienne Chauchot, Mathieu Blanchard, Frank Shahar): refactorings, bug fixes, a new where clause, a security fix.
Etienne
Le lundi 04 mars 2019 à 18:36 +0100, Suneel Marthi a écrit :
> Is this the final draft? - we had 2 beam talks at Big Data Tech Warsaw last 
> Wednesday - I can send the updates
> offline.
> On Mon, Mar 4, 2019 at 6:16 PM Rose Nguyen  wrote:
> > February-March 2019 | Newsletter
> > 
> > What’s been done
> > 
> > 
> > Apache Beam 2.10.0 released (by: many contributors)
> > Download the release here. See the blog post for more details.
> > Apache Beam awarded the 2019 Technology of the Year Award!
> > InfoWorld just awarded Beam the 2019 Technology of the Year Award. See this article for more details.
> > Kettle Beam 0.5 released with support for Flink (by: Matt Casters)
> > Kettle now supports Apache Flink as well as Cloud Dataflow and Spark. See Matt's blog for more details.
> > 
> > 
> > What we’re working on...
> > 
> > 
> > Apache Beam 2.11.0 release (by: many contributors)
> > 
> > 
> > Hive Metastore Table provider for SQL (by: Anton Kedin)
> > Support for plugging table providers through the Beam SQL API to allow obtaining table schemas from external sources. See the PR for more details.
> > User Defined Coders for the Beam Go SDK (by: Robert Burke)
> > Working on expanding the variety of user-defined types that can be a member of a PCollection in the Go SDK. See BEAM-3306 for more details.
> > Python 3 (by: Ahmet Altay, Robert Bradshaw, Charles Chen, Mark Liu, Robbe Sneyders, Juta Staes, Valentyn Tymofieiev)
> > Beam 2.11.0 is the first release offering partial Python 3 support. Many thanks to all contributors who helped to reach this milestone. IO availability on Python 3 is currently limited and only Python 3.5 has been tested extensively. Stay tuned on BEAM-1251 for more details.
> > Notebooks for quickstarts and custom I/O (by: David Cavazos)
> > Adding IPython notebooks and snippets. See [BEAM-6557] for more details.
> > 
> > 
> >   New members
> > 
> > 
> > New PMC member!
> > Etienne Chauchot, Nantes, France
> > New Committers!
> > Gleb Kanterov, Stockholm, Sweden
> > Michael Luckey
> > New Contributors!
> > Kyle Weaver, San Francisco, CA - would like to help begin implementing portability support for the Spark runner
> > Tanay Tummapalli, Delhi, India - would like to contribute to Open Source this summer as part of Google Summer of Code
> > Brian Hulette, Seattle, WA - contributing to Beam Portability
> > Michał Walenia, Warsaw, Poland - working on integration and load testing
> > Daniel Chen, San Francisco, CA - working on the Beam Samza runner
> > 
> >   Talks & meetups
> > 
> > 
> > 
> > Plugin Machine Intelligence and Apache Beam with Pentaho - Feb 7 @ London
> > Watch the How to Run Kettle on Apache Beam video here. See event details here.
> > Beam @Lyft / Streaming, TensorFlow and use-cases - Feb 7 @ San Francisco, CA
> > Organized by Thomas Weise and Austin Bennet, with speakers Tyler Akidau, Robert Crowe, Thomas Weise and Amar Pai. See event details here and the slides for these presentations: Overview of Apache Beam and TensorFlow Transform (TFX) with Apache Beam, Python Streaming Pipelines with Beam on Flink, Dynamic pricing of Lyft rides using streaming.
> > Flink meetup - Feb 21 @ Seattle, WA
> > Speakers from Alibaba, Google, and Uber gave talks about Apache Flink with Hive, Tensorflow, Beam, and AthenaX. See event details here and presentations here.
> > Beam Summit Europe 2019 - June 19-20 @ Berlin
> > Beam Summit Europe 2019 will take place in Berlin on June 19-20. Speaker CfP and other details to follow soon! Twitter announcement!
> > 
> >   Resources
> > 
> > 
> > Apache Jira Beginner’s Guide (by:  Daniel Oliveira)
> > A guide to introduce Beam contributors to the basics of using the Apache 
> > Jira for Beam development. Feedback
> > welcomed!
> > An approach to community building from Apache Beam (by: Kenn Knowles)
> > The Apache Software Foundation has published committer guidelines to help 
> > Beam's community building work.See the
> > post on the ASF blog.
> > Exploring Beam SQL on Google Cloud Platform (by: Graham Polley)
> > “In this article, I’ll dive into this new feature of Beam, and see how it 
> > works by using a pipeline to read a data
> > file from GCS, transform it, and then perform a basic calculation on the 
> > values contained in the file”.See article
> > and full source code.
> > 
> > 
> > Until Next Time!-- 
> > Rose Thị Nguyễn


CVE audit gradle plugin

2019-02-28 Thread Etienne Chauchot
Hi guys,

I came by this [1] gradle plugin that is a client to the Sonatype OSS Index CVE 
database.

I have set it up here in a branch [2], though the cache is not configured and 
the number of requests is limited. It can
be run with "gradle --info audit"

It could be nice to have something like this to track the CVEs in the libs we use. I know we have been spammed by automatic lib-upgrade requests in the past, but CVEs are more important IMHO.

This plugin is under the BSD-3-Clause license, which is compatible with the Apache v2 license [3].

WDYT ?

Etienne

[1] https://github.com/OSSIndex/ossindex-gradle-plugin
[2] https://github.com/echauchot/beam/tree/cve_audit_plugin
[3] https://www.apache.org/legal/resolved.html
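For reference, wiring the plugin into a build looks roughly like this (a sketch based on memory of the plugin's README; the plugin id, version, and failOnError property are assumptions to verify against [1] - only the "gradle --info audit" invocation is confirmed above):

    // build.gradle - sketch only; verify plugin id/version against [1].
    plugins {
        // Resolves declared dependencies and queries the Sonatype
        // OSS Index for known CVEs, exposing an "audit" task.
        id 'net.ossindex.audit' version '0.4.11'
    }

    audit {
        // Report vulnerabilities without failing the build (assumed option).
        failOnError = false
    }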


Re: Signing off

2019-02-15 Thread Etienne Chauchot
Thank you for your contributions Scott ! Your new project seems very fun. Enjoy 
!
Etienne
Le vendredi 15 février 2019 à 15:01 +0100, Ismaël Mejía a écrit :
> Your work and willingness to make Beam better will be missed. Good luck with the next phase!
> On Fri, Feb 15, 2019 at 1:39 PM Łukasz Gajowy  wrote:
> 
> Good luck!
> pt., 15 lut 2019 o 11:24 Alexey Romanenko  
> napisał(a):
> 
> Good luck, Scott, with your new adventure!
> On 15 Feb 2019, at 11:22, Maximilian Michels  wrote:
> Thank you for your contributions Scott. Best of luck!
> On 15.02.19 10:48, Michael Luckey wrote:
Hi Scott, yes, thanks for all your time and all the best!
michel

On Fri, Feb 15, 2019 at 5:47 AM Kenneth Knowles  wrote:
> +1
> Thanks for the contributions to community & code, and enjoy the new chapter!
> Kenn
> On Thu, Feb 14, 2019 at 3:25 PM Thomas Weise  wrote:
> > Hi Scott,
> > Thank you for the many contributions to Beam and best of luck with the new endeavor!
> > Thomas
> > On Thu, Feb 14, 2019 at 10:37 AM Scott Wegner  wrote:
> > > I wanted to let you all know that I've decided to pursue a new adventure in my career, which will take me away from Apache Beam development.
> > > It's been a fun and fulfilling journey. Apache Beam has been my first significant experience working in open source. I'm inspired observing how the community has come together to deliver something great.
> > > Thanks for everything. If you're curious what's next: I'll be working on Federated Learning at Google: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
> > > Take care,
> > > Scott
> > > Got feedback? tinyurl.com/swegner-feedback
> 


Re: [VOTE] Release 2.10.0, release candidate #3

2019-02-08 Thread Etienne Chauchot
Thanks Robert!
Etienne 
Le vendredi 08 février 2019 à 16:42 +0100, Robert Bradshaw a écrit :
> +1 (binding)
> 
> I have verified that the artifacts and their checksums/signatures look good, 
> and also checked the Python wheels
> against simple pipelines. 
> On Fri, Feb 8, 2019 at 4:29 PM Etienne Chauchot  wrote:
> > Hi, I did the same visual checks of Nexmark that I did on RC2, for both functional regressions (output size) and performance regressions (execution time), on all the runners/modes for the RC3 cut date (02/06), and I saw no regression except the one that I already mentioned (the end-of-October perf degradation on Q7 in Spark batch mode), but it was already in the previous version.
> > Though I did not have time to check the artifacts. +1 (binding) provided that the artifacts are correct.
> > Etienne
> > Le jeudi 07 février 2019 à 10:25 -0800, Scott Wegner a écrit :
> > > +1
> > > I validated running:
> > > * Java Quickstart (Direct)
> > > * Java Quickstart (Apex local)
> > > * Java Quickstart (Flink local)
> > > * Java Quickstart (Spark local)
> > > * Java Quickstart (Dataflow)
> > > * Java Mobile Game (Dataflow) 
> > > 
> > > On Wed, Feb 6, 2019 at 2:28 PM Kenneth Knowles  wrote:
> > > > Hi everyone,
> > > > 
> > > > Please review and vote on the release candidate #3 for the version 
> > > > 2.10.0, as follows:
> > > > 
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > > 
> > > > The complete staging area is available for your review, which includes:
> > > > * JIRA release notes [1],
> > > > * the official Apache source release to be deployed to dist.apache.org 
> > > > [2], which is signed with the key with
> > > > fingerprint 6ED551A8AE02461C [3],
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > * source code tag "v2.10.0-RC3" [5],
> > > > * website pull request listing the release [6] and publishing the API 
> > > > reference manual [7].
> > > > * Python artifacts are deployed along with the source release to the 
> > > > dist.apache.org [2].
> > > > * Validation sheet with a tab for 2.10.0 release to help with validation [8].
> > > > 
> > > > The vote will be open for at least 72 hours. It is adopted by majority 
> > > > approval, with at least 3 PMC
> > > > affirmative votes.
> > > > 
> > > > Thanks,
> > > > Kenn
> > > > 
> > > > [1] 
> > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344540
> > > > [2] https://dist.apache.org/repos/dist/dev/beam/2.10.0/
> > > > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > > > [4] 
> > > > https://repository.apache.org/content/repositories/orgapachebeam-1058/
> > > > [5] https://github.com/apache/beam/tree/v2.10.0-RC3
> > > > [6] https://github.com/apache/beam/pull/7651/files
> > > > [7] https://github.com/apache/beam-site/pull/586
> > > > [8] 
> > > > https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=2053422529
> > > 
> > > 


Re: [VOTE] Release 2.10.0, release candidate #3

2019-02-08 Thread Etienne Chauchot
Hi, I did the same visual checks of Nexmark that I did on RC2, for both functional regressions (output size) and performance regressions (execution time), on all the runners/modes for the RC3 cut date (02/06), and I saw no regression except the one that I already mentioned (the end-of-October perf degradation on Q7 in Spark batch mode), but it was already in the previous version.
Though I did not have time to check the artifacts. +1 (binding) provided that the artifacts are correct.
Etienne
Le jeudi 07 février 2019 à 10:25 -0800, Scott Wegner a écrit :
> +1
> I validated running:
> * Java Quickstart (Direct)
> * Java Quickstart (Apex local)
> * Java Quickstart (Flink local)
> * Java Quickstart (Spark local)
> * Java Quickstart (Dataflow)
> * Java Mobile Game (Dataflow) 
> 
> On Wed, Feb 6, 2019 at 2:28 PM Kenneth Knowles  wrote:
> > Hi everyone,
> > 
> > Please review and vote on the release candidate #3 for the version 2.10.0, 
> > as follows:
> > 
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> > 
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org [2], 
> > which is signed with the key with
> > fingerprint 6ED551A8AE02461C [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v2.10.0-RC3" [5],
> > * website pull request listing the release [6] and publishing the API 
> > reference manual [7].
> > * Python artifacts are deployed along with the source release to the 
> > dist.apache.org [2].
> > * Validation sheet with a tab for 2.10.0 release to help with validation [8].
> > 
> > The vote will be open for at least 72 hours. It is adopted by majority 
> > approval, with at least 3 PMC
> > affirmative votes.
> > 
> > Thanks,
> > Kenn
> > 
> > [1] 
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344540
> > [2] https://dist.apache.org/repos/dist/dev/beam/2.10.0/
> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > [4] https://repository.apache.org/content/repositories/orgapachebeam-1058/
> > [5] https://github.com/apache/beam/tree/v2.10.0-RC3
> > [6] https://github.com/apache/beam/pull/7651/files
> > [7] https://github.com/apache/beam-site/pull/586
> > [8] 
> > https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=2053422529
> 
> 


Re: Another another new contributor! :)

2019-02-07 Thread Etienne Chauchot
Hi,
Help much appreciated! And welcome!
Etienne
Le jeudi 07 février 2019 à 15:44 +0800, Reza Ardeshir Rokni a écrit :
> Welcome!
> On Tue, 5 Feb 2019 at 23:34, Kenneth Knowles  wrote:
> > Welcome Kyle!
> > On Tue, Feb 5, 2019 at 4:34 AM Maximilian Michels  wrote:
> > > Welcome Kyle! Excited to see the Spark Runner moving towards portability!
> > > 
> > > 
> > > 
> > > On 05.02.19 01:14, Connell O'Callaghan wrote:
> > > > Welcome Kyle!
> > > > On Mon, Feb 4, 2019 at 3:18 PM Ahmet Altay wrote:
> > > > Welcome!
> > > > On Mon, Feb 4, 2019 at 3:13 PM Rui Wang wrote:
> > > > Welcome!
> > > > -Rui
> > > > On Mon, Feb 4, 2019 at 2:50 PM Kyle Weaver wrote:
> > > > Hello Beam developers,
> > > > My name is Kyle Weaver (alias "ibzib" on Github/Slack). Like Brian, I recently switched roles at Google (I previously worked on Prow, Kubernetes' CI system). My goal in the coming weeks is to help begin implementing portability support for the Spark runner. I look forward to collaborating with all of you!
> > > > Kyle
> > > > Kyle Weaver | Software Engineer | kcwea...@google.com | +1650203


Re: [VOTE] Release 2.10.0, release candidate #1

2019-02-06 Thread Etienne Chauchot
Hi,
I just fixed both (one was not a bug but an error in test code) in this PR [1].
[1] https://github.com/apache/beam/pull/7751
Etienne
Le mardi 05 février 2019 à 17:37 +0100, Etienne Chauchot a écrit :
> Hi guys,
> I just found 2 bugs while replacing the mock in CassandraIO by a proper instance:
> https://issues.apache.org/jira/browse/BEAM-6592
> https://issues.apache.org/jira/browse/BEAM-6591
> I don't think they are release blockers because they have been there since CassandraIO's first version. One of them is quite tricky; IMHO I don't think we should wait for the fix before the release.
> Etienne
> Le mercredi 30 janvier 2019 à 10:01 -0800, Chamikara Jayalath a écrit :
> > FYI, created another blocker: 
> > https://issues.apache.org/jira/browse/BEAM-6552
> > 
> > Thanks,
> > Cham
> > On Tue, Jan 29, 2019 at 4:38 PM Ahmet Altay  wrote:
> > > -1, I ran into a new blocking issue: 
> > > https://issues.apache.org/jira/browse/BEAM-6545
> > > On Tue, Jan 29, 2019 at 4:08 PM Kenneth Knowles  wrote:
> > > > I have done this in the least vulnerable way I can think of. I have 
> > > > filed 
> > > > https://issues.apache.org/jira/browse/BEAM-6544 as a blocker to fix the 
> > > > release process.
> > > > Kenn
> > > > On Tue, Jan 29, 2019 at 3:07 PM Kenneth Knowles  wrote:
> > > > > Yes, the instructions for building the wheels include inputting my ASF credentials into Travis-CI. I've been trying to understand why, and what I can do instead.
> > > > > (The release guide says that the release script builds the binaries, 
> > > > > but from what I can tell it does not.
> > > > > This makes sense because the instructions are highly manual too.)
> > > > > Kenn
> > > > > On Tue, Jan 29, 2019 at 12:38 AM Robert Bradshaw 
> > > > >  wrote:
> > > > > > The artifacts and signatures look good. But we're missing Python wheels.
> > > > > > 
> > > > > > On Tue, Jan 29, 2019 at 6:08 AM Kenneth Knowles wrote:
> > > > > > > Ah, I did not close the staging repository. Thanks for letting me know. Try now.
> > > > > > > Kenn
> > > > > > > On Mon, Jan 28, 2019 at 2:31 PM Ismaël Mejía wrote:
> > > > > > >> I think there is an issue, [4] does not open?
> > > > > > >> On Mon, Jan 28, 2019 at 6:24 PM Kenneth Knowles wrote:
> > > > > > >> > Hi everyone,
> > > > > > >> > Please review and vote on the release candidate #1 for the version 2.10.0, as follows:
> > > > > > >> > [ ] +1, Approve the release
> > > > > > >> > [ ] -1, Do not approve the release (please provide specific comments)
> > > > > > >> > The complete staging area is available for your review, which includes:
> > > > > > >> > * JIRA release notes [1],
> > > > > > >> > * the official Apache source release to be deployed to dist.apache.org [2], which is signed with the key

Re: [DISCUSSION] UTests and embedded backends

2019-02-06 Thread Etienne Chauchot
Hi guys, I just submitted the PR: https://github.com/apache/beam/pull/7751. It contains refactorings, test improvements/fixes and production-code fixes.
I wanted to give a little feedback, because replacing the mock with a real instance allowed me to:
- improve the tests: fix bad tests
- add a missing split test
- and, more importantly, discover and fix a bug in the production code of the split.
=> So I would love it if we all agreed to avoid mocks when possible. Of course, as mentioned, sometimes mocks cannot be avoided, e.g. for hosted backends.
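
To make the idea concrete, here is a minimal sketch of what such an 
embedded-backend test looks like with cassandra-unit's JUnit rule. The CQL 
script, keyspace, table and expected count below are made-up placeholders for 
illustration, not the actual CassandraIO test code:

    import static org.junit.Assert.assertEquals;

    import org.cassandraunit.CassandraCQLUnit;
    import org.cassandraunit.dataset.cql.ClassPathCQLDataSet;
    import org.junit.Rule;
    import org.junit.Test;

    public class EmbeddedCassandraTest {

      // Spins up an embedded Cassandra and loads the given CQL script
      // into a fresh keyspace before each test.
      @Rule
      public CassandraCQLUnit cassandra =
          new CassandraCQLUnit(new ClassPathCQLDataSet("scientists.cql", "beam_test"));

      @Test
      public void readsFromARealInstance() {
        // The rule exposes a live session bound to the embedded cluster,
        // so the test exercises the real driver code path, not a mock.
        long count = cassandra.session
            .execute("SELECT COUNT(*) FROM scientist")
            .one()
            .getLong(0);
        assertEquals(10L, count);
      }
    }

Nothing Beam-specific here; the point is only that the test talks to a real 
(if embedded) server rather than a mock, which is what surfaces real bugs.
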
Etienne
Le lundi 28 janvier 2019 à 11:16 +0100, Etienne Chauchot a écrit :
> Guys,
> I will try using mocks where I see it is needed. As there is a current PR opened on Cassandra, I will take this opportunity to add the embedded Cassandra server (https://github.com/jsevellec/cassandra-unit) to the UTests. Ticket was opened a while ago: https://issues.apache.org/jira/browse/BEAM-4164
> Etienne
> Le mardi 22 janvier 2019 à 09:26 +0100, Robert Bradshaw a écrit :
> > On Mon, Jan 21, 2019 at 10:42 PM Kenneth Knowles wrote:
> > Robert - you meant this as a mostly-automatic thing that we would engineer, yes?
> > Yes, something like TestPipeline that buffers up the pipelines and then executes on class teardown (details TBD).
> > A lighter-weight fake, like using something in-process sharing a Java interface (versus today a locally running service sharing an RPC interface) is still much better than a mock.
> > +1
> > 
> > Kenn
> > On Mon, Jan 21, 2019 at 7:17 AM Jean-Baptiste Onofré wrote:
> > Hi,
> > it makes sense to use embedded backend when:
> > 1. it's possible to easily embed the backend
> > 2. when the backend is "predictable".
> > If it's easy to embed and the backend behavior is predictable, then it makes sense. In other cases, we can fall back to mock.
> > Regards
> > JB
> > On 21/01/2019 10:07, Etienne Chauchot wrote:
> > Hi guys,
> > Lately I have been fixing various Elasticsearch flakiness issues in the UTests by: introducing timeouts, countdown latches, force refresh, embedded cluster size decrease ...
> > These flakiness issues are due to the embedded Elasticsearch not coping well with the jenkins overload. Still, IMHO I believe that having an embedded backend for UTests is a lot better than mocks. Even if they are less tolerant to load, I prefer having UTests 100% representative of a real backend and adding countermeasures to protect against jenkins overload.
> > WDYT ?
> > Etienne
> > 
> > 
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com


Re: [VOTE] Release 2.10.0, release candidate #2

2019-02-06 Thread Etienne Chauchot
Hi, I checked Nexmark on both output size (functional regression detection) and run time (performance regression detection). The only thing I see is a performance regression on query 7 (side input + fanout) in the Spark runner, but this regression has been there since the previous release cut. Indeed, 2.9 was cut on 18/12/06 and the perf regression started on 18/10/05. I don't think it is a blocker, then.
Also, I see this ticket tagged as blocker: https://issues.apache.org/jira/browse/BEAM-3261. It is a very old ticket. Should we target it for later?
Etienne
Le mercredi 06 février 2019 à 11:26 +0100, Jean-Baptiste Onofré a écrit :
> +1 (binding)
> Quickly tested on beam-samples.
> Regards
> JB
> On 05/02/2019 23:57, Kenneth Knowles wrote:
> Hi everyone,
> Please review and vote on the release candidate #2 for the version 2.10.0, as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org [2], which is signed with the key with fingerprint 6ED551A8AE02461C [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.10.0-RC2" [5],
> * website pull request listing the release [6] and publishing the API reference manual [7].
> * Python artifacts are deployed along with the source release to the dist.apache.org [2].
> * Validation sheet with a tab for 2.10.0 release to help with validation [8].
> The vote will be open for at least 72 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes.
> Thanks,
> Kenn
> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344540
> [2] https://dist.apache.org/repos/dist/dev/beam/2.10.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1057/
> [5] https://github.com/apache/beam/tree/v2.10.0-RC2
> [6] https://github.com/apache/beam/pull/7651/files
> [7] https://github.com/apache/beam-site/pull/586
> [8] https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=2053422529
> 


Re: [VOTE] Release 2.10.0, release candidate #1

2019-02-05 Thread Etienne Chauchot
Hi guys,
I just found 2 bugs while replacing the mock in CassandraIO by a proper 
instance:
https://issues.apache.org/jira/browse/BEAM-6592
https://issues.apache.org/jira/browse/BEAM-6591
I don't think they are release blockers because they have been there since CassandraIO's first version. One of them is quite tricky; IMHO I don't think we should wait for the fix before the release.
Etienne
Le mercredi 30 janvier 2019 à 10:01 -0800, Chamikara Jayalath a écrit :
> FYI, created another blocker: https://issues.apache.org/jira/browse/BEAM-6552
> 
> Thanks,
> Cham
> On Tue, Jan 29, 2019 at 4:38 PM Ahmet Altay  wrote:
> > -1, I ran into a new blocking issue: 
> > https://issues.apache.org/jira/browse/BEAM-6545
> > On Tue, Jan 29, 2019 at 4:08 PM Kenneth Knowles  wrote:
> > > I have done this in the least vulnerable way I can think of. I have filed 
> > > https://issues.apache.org/jira/browse/BEAM-6544 as a blocker to fix the 
> > > release process.
> > > Kenn
> > > On Tue, Jan 29, 2019 at 3:07 PM Kenneth Knowles  wrote:
> > > > Yes, the instructions for building the wheels include inputting my ASF credentials into Travis-CI. I've been trying to understand why, and what I can do instead.
> > > > (The release guide says that the release script builds the binaries, 
> > > > but from what I can tell it does not. This
> > > > makes sense because the instructions are highly manual too.)
> > > > Kenn
> > > > On Tue, Jan 29, 2019 at 12:38 AM Robert Bradshaw  
> > > > wrote:
> > > > > The artifacts and signatures look good. But we're missing Python wheels.
> > > > > 
> > > > > On Tue, Jan 29, 2019 at 6:08 AM Kenneth Knowles wrote:
> > > > > > Ah, I did not close the staging repository. Thanks for letting me know. Try now.
> > > > > > Kenn
> > > > > > On Mon, Jan 28, 2019 at 2:31 PM Ismaël Mejía wrote:
> > > > > >> I think there is an issue, [4] does not open?
> > > > > >> On Mon, Jan 28, 2019 at 6:24 PM Kenneth Knowles wrote:
> > > > > >> > Hi everyone,
> > > > > >> > Please review and vote on the release candidate #1 for the version 2.10.0, as follows:
> > > > > >> > [ ] +1, Approve the release
> > > > > >> > [ ] -1, Do not approve the release (please provide specific comments)
> > > > > >> > The complete staging area is available for your review, which includes:
> > > > > >> > * JIRA release notes [1],
> > > > > >> > * the official Apache source release to be deployed to dist.apache.org [2], which is signed with the key with fingerprint 6ED551A8AE02461C [3],
> > > > > >> > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > > >> > * source code tag "v2.10.0-RC1" [5],
> > > > > >> > * website pull request listing the release [6] and publishing the API reference manual [7].
> > > > > >> > * Python artifacts are deployed along with the source release to the dist.apache.org [2].
> > > > > >> > * Validation sheet with a tab for 2.10.0 release to help with validation [8].
> > > > > >> > The vote will be open for at least 72 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes.
> > > > > >> > Thanks,
> > > > > >> > Kenn
> > > > > >> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344540
> > > > > >> > [2] https://dist.apache.org/repos/dist/dev/beam/2.10.0/
> > > > > >> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > > > > >> > [4] https://repository.apache.org/content/repositories/orgapachebeam-1056/
> > > > > >> > [5] https://github.com/apache/beam/tree/v2.10.0-RC1
> > > > > >> > [6] https://github.com/apache/beam/pull/7651/files
> > > > > >> > [7] https://github.com/apache/beam-site/pull/585
> > > > > >> > [8] https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=2053422529


Re: BEAM-6324 / #7340: "I've pretty much given up on the PR being merged. I use my own fork for my projects"

2019-01-31 Thread Etienne Chauchot
I also missed the sentence Kenn mentioned. I think it is worth highlighting.
Thanks for your PR around that, Łukasz!
Etienne
Le mercredi 30 janvier 2019 à 11:03 +0100, Łukasz Gajowy a écrit :
> Wow. I missed the sentence. Judging from the fact that others also proposed 
> adding it, I think it might need some
> care. I proposed a PR here: https://github.com/apache/beam/pull/7670
> 
> Łukasz
> śr., 30 sty 2019 o 00:39 Kenneth Knowles  napisał(a):
> > On Mon, Jan 28, 2019 at 5:25 AM Łukasz Gajowy  wrote:
> > > IMHO, I don't think committers spend time watching new PRs coming up, but they more likely act when pinged. So, we may need some automation in case a contributor does not use the GitHub reviewer proposal. Auto reviewer assignment seems too much, but modifying the PR template to add a sentence such as "please pick a reviewer from the proposed list" could be enough.
> > > WDYT ?
> > > 
> > > and
> > > 
> > > (1) A sentence in the PR template suggesting adding a reviewer. (easy)
> > > 
> > > 
> > > +100! Let's improve the message in the PR template. It costs nothing and 
> > > can help a lot. I'd say it should be in
> > > bold letters as this is super important.
> > > 
> > 
> > There is already a message. Is it unclear? Can you rephrase it to something 
> > better?
> > Kenn 
> > > Maybe it is also worth reconsidering whether auto reviewer assignment (or at least some form of it) is really a bad idea. We could assign committers (the most "hardcore" option, maybe too much) or ping them in emails/GitHub comments if there's inactivity in a pull request (the softer option, but it requires a bot to be implemented). The way I see this is that such an auto-assigned reviewer could always say "I have lots on my plate" but suggest someone else to take care of the PR. This way the problem that nobody is mentioned by the PR author is completely gone. Other than that, such an approach feels efficient to me because it's more "in person" (similar to what Robert said).
> > > 
> > > It's certainly disheartening as a
> > > reviewer to put time into reviewing a PR and then the author doesn't
> > > bother to even respond, or (as has happened to me) be told "hey, this
> > > wasn't ready for review yet."
> > > 
> > > As for "this wasn't ready for review" - there are sometimes situations 
> > > that require a PR to be opened before they
> > > are actually completed (especially when working with Jenkins jobs). Given 
> > > that there might be misunderstandings
> > > authors of such commits should give a clear message saying "do not merge 
> > > yet" or "not ready for review" in title
> > > or comments or even close such PR and reopen until the change is ready. 
> > > It's all about giving a clear signal to
> > > others. 
> > > 
> > > Maybe we should mention it in guidelines/PR message too to avoid 
> > > situations like this?
> > > 
> > > Łukasz
> > > 
> > > 
> > > 
> > > pon., 28 sty 2019 o 11:30 Robert Bradshaw  
> > > napisał(a):
> > > > On Mon, Jan 28, 2019 at 10:37 AM Etienne Chauchot 
> > > >  wrote:
> > > > >
> > > > > Sure, it's a pity that this PR went unnoticed, and I think it is a combination of factors (the PR date around Christmas, the fact that the author forgot - AFAIK - to ping a reviewer in either the PR or the ML).
> > > > >
> > > > > I agree with Rui's proposal to enhance the visibility of the "how to get a review" process.
> > > > >
> > > > > IMHO, I don't think committers spend time watching new PRs coming up, but they more likely act when pinged. So, we may need some automation in case a contributor does not use the GitHub reviewer proposal. Auto reviewer assignment seems too much, but modifying the PR template to add a sentence such as "please pick a reviewer from the proposed list" could be enough.
> > > > > WDYT ?
> > > > 
> > > > 
> > > > +1
> > > > 
>

Re: [ANNOUNCE] New PMC member: Etienne Chauchot

2019-01-29 Thread Etienne Chauchot
Thanks for the warm welcome guys !
Etienne
Le lundi 28 janvier 2019 à 14:28 +0100, Łukasz Gajowy a écrit :
> Thanks for your great work and congratulations! :)
> 
> pon., 28 sty 2019 o 12:01 Gleb Kanterov  napisał(a):
> > Congratulations Etienne! 
> > 
> > 
> > 
> > On Mon, Jan 28, 2019 at 11:36 AM Maximilian Michels  wrote:
> > > Congrats Etienne! It's been great to work with you.
> > > 
> > > 
> > > 
> > > On 26.01.19 07:16, Ismaël Mejía wrote:
> > > > Congratulations Etienne!
> > > > Le sam. 26 janv. 2019 à 06:42, Reuven Lax <re...@google.com> a écrit :
> > > > Welcome!
> > > > On Fri, Jan 25, 2019 at 9:30 PM Pablo Estrada <pabl...@google.com> wrote:
> > > > Congrats Etienne :)
> > > > On Fri, Jan 25, 2019, 9:24 PM Trần Thành Đạt <dattran.v...@gmail.com> wrote:
> > > > Congratulations Etienne!
> > > > On Sat, Jan 26, 2019 at 12:08 PM Thomas Weise <t...@apache.org> wrote:
> > > > Congrats, félicitations!
> > > > On Fri, Jan 25, 2019 at 3:06 PM Scott Wegner <sc...@apache.org> wrote:
> > > > Congrats Etienne!
> > > > On Fri, Jan 25, 2019 at 2:34 PM Tim <timrobertson...@gmail.com> wrote:
> > > > Congratulations Etienne!
> > > > 
> > > > Tim
> > > > 
> > > > > On 25 Jan 2019, at 23:00, Kenneth Knowles <k...@apache.org> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > Please join me and the rest of the Beam PMC in welcoming Etienne Chauchot to join the PMC.
> > > > >
> > > > > Etienne introduced himself to dev@ in September of 2017 and over the years has contributed to Beam in many ways - connectors, performance, design discussion, talks, code reviews, and I'm sure I cannot list them all. He already has a major impact on the direction of Beam.
> > > > >
> > > > > Thanks for being a part of Beam, Etienne!
> > > > >
> > > > > Kenn
> > > > 
> > > > --
> > > > Got feedback? tinyurl.com/swegner-feedback <https://tinyurl.com/swegner-feedback>


Re: [DISCUSSION] UTests and embedded backends

2019-01-28 Thread Etienne Chauchot
Hi Robert,
Yes, this is something I really believe in: the test coverage offered by embedded instances is worth some temporary flakiness (due to resource over-consumption).
I also deeply agree with your point on maintenance: some mocks could hide bugs in production code that would cost a lot in the long term.
Etienne

Le lundi 28 janvier 2019 à 11:44 +0100, Robert Bradshaw a écrit :
> I strongly agree with your original assessment "IMHO I believe that having embedded backend for UTests are a lot better than mocks." Mocks are sometimes necessary, but in my experience they are often an expensive (in production and maintenance) way to get what amounts to low true coverage.
> On Mon, Jan 28, 2019 at 11:16 AM Etienne Chauchot  
> wrote:
> 
> Guys,
> I will try using mocks where I see it is needed. As there is a current PR opened on Cassandra, I will take this opportunity to add the embedded Cassandra server (https://github.com/jsevellec/cassandra-unit) to the UTests. Ticket was opened a while ago: https://issues.apache.org/jira/browse/BEAM-4164
> Etienne
> Le mardi 22 janvier 2019 à 09:26 +0100, Robert Bradshaw a écrit :
> On Mon, Jan 21, 2019 at 10:42 PM Kenneth Knowles  wrote:
> 
> Robert - you meant this as a mostly-automatic thing that we would engineer, 
> yes?
> 
> Yes, something like TestPipeline that buffers up the pipelines and
> then executes on class teardown (details TBD).
> 
> A lighter-weight fake, like using something in-process sharing a Java 
> interface (versus today a locally running
> service sharing an RPC interface) is still much better than a mock.
> 
> +1
> 
> 
> Kenn
> 
> On Mon, Jan 21, 2019 at 7:17 AM Jean-Baptiste Onofré  
> wrote:
> 
> Hi,
> 
> it makes sense to use embedded backend when:
> 
> 1. it's possible to easily embed the backend
> 2. when the backend is "predictable".
> 
> If it's easy to embed and the backend behavior is predictable, then it
> makes sense.
> In other cases, we can fallback to mock.
> 
> Regards
> JB
> 
> On 21/01/2019 10:07, Etienne Chauchot wrote:
> Hi guys,
> 
> Lately I have been fixing various Elasticsearch flakiness issues in the
> UTests by: introducing timeouts, countdown latches, force refresh,
> embedded cluster size decrease ...
> 
> These flakiness issues are due to the embedded Elasticsearch not coping
> well with the jenkins overload. Still, IMHO I believe that having an
> embedded backend for UTests is a lot better than mocks. Even if they
> are less tolerant to load, I prefer having UTests 100% representative of
> a real backend and adding countermeasures to protect against jenkins overload.
> 
> WDYT ?
> 
> Etienne
> 
> 
> 
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: [DISCUSSION] UTests and embedded backends

2019-01-28 Thread Etienne Chauchot
Guys,
I will try using mocks where I see it is needed. As there is a current PR opened on Cassandra, I will take this opportunity to add the embedded Cassandra server (https://github.com/jsevellec/cassandra-unit) to the UTests. Ticket was opened a while ago: https://issues.apache.org/jira/browse/BEAM-4164
Etienne
Le mardi 22 janvier 2019 à 09:26 +0100, Robert Bradshaw a écrit :
> On Mon, Jan 21, 2019 at 10:42 PM Kenneth Knowles  wrote:
> 
> Robert - you meant this as a mostly-automatic thing that we would engineer, 
> yes?
> Yes, something like TestPipeline that buffers up the pipelines and then executes on class teardown (details TBD).
> A lighter-weight fake, like using something in-process sharing a Java interface (versus today a locally running service sharing an RPC interface) is still much better than a mock.
> +1
> 
> Kenn
> On Mon, Jan 21, 2019 at 7:17 AM Jean-Baptiste Onofré wrote:
> 
> Hi,
> it makes sense to use embedded backend when:
> 1. it's possible to easily embed the backend
> 2. when the backend is "predictable".
> If it's easy to embed and the backend behavior is predictable, then it makes sense. In other cases, we can fall back to mock.
> Regards
> JB
> On 21/01/2019 10:07, Etienne Chauchot wrote:
> Hi guys,
> Lately I have been fixing various Elasticsearch flakiness issues in the UTests by: introducing timeouts, countdown latches, force refresh, embedded cluster size decrease ...
> These flakiness issues are due to the embedded Elasticsearch not coping well with the jenkins overload. Still, IMHO I believe that having an embedded backend for UTests is a lot better than mocks. Even if they are less tolerant to load, I prefer having UTests 100% representative of a real backend and adding countermeasures to protect against jenkins overload.
> WDYT ?
> Etienne
> 
> 
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: BEAM-6324 / #7340: "I've pretty much given up on the PR being merged. I use my own fork for my projects"

2019-01-28 Thread Etienne Chauchot
Sure, it's a pity that this PR went unnoticed, and I think it is a combination of factors (the PR date around Christmas, the fact that the author forgot - AFAIK - to ping a reviewer in either the PR or the ML).
I agree with Rui's proposal to enhance the visibility of the "how to get a review" process.
IMHO, I don't think committers spend time watching new PRs coming up, but they more likely act when pinged. So, we may need some automation in case a contributor does not use the GitHub reviewer proposal. Auto reviewer assignment seems too much, but modifying the PR template to add a sentence such as "please pick a reviewer from the proposed list" could be enough. WDYT ?

Also, I started to review the PR on Friday (thanks Kenn for pinging me).
Etienne
Le vendredi 25 janvier 2019 à 10:21 -0800, Rui Wang a écrit :
> We have code contribution guidelines [1] with useful tips on getting a PR reviewed and merged. But I guess they are hidden away in the Beam website, so new contributors are likely to miss them. In order to make the guidance easy to find and read for new contributors, we could
> 
> a. Move item number 5 from [1] to a separate section and name it "Tips to get your PR reviewed and merged"
> b. Put the link in the GitHub pull request template, so when a contributor creates their first PR, they see the link (or even paste text from the contribution guide). It would be a good chance for new contributors to read what's in the pull request template.
> 
> 
> -Rui
> 
> [1] https://beam.apache.org/contribute/#make-your-change
> On Fri, Jan 25, 2019 at 9:24 AM Alexey Romanenko  
> wrote:
> > For sure, it's a pity that this PR has not been addressed for a long time (I guess we probably have other ones like this) but, as I can see from this PR's history, a review was not requested explicitly by the author (and this is one of our recommendations for code contribution [1]).
> > What are the options to improve this:
> > 
> > 1) Make it clearer for new contributors that they need to ask for a review explicitly (with the help of the recommendations already provided in the top-right corner of the PR page)
> > 2) Create a bot (like the "stale" bot that we have) to check for non-addressed PRs that are older than, say, 7 days, and send a notification to the dev@ (or a dedicated, see n.3) mailing list if they are starving for review.
> > 3) (Optionally) Create a new mailing list called pr@ for newly coming and non-addressed PRs
> > 
> > [1] https://beam.apache.org/contribute/#make-your-change
> > 
> > 
> > > On 25 Jan 2019, at 17:50, Ismaël Mejía  wrote:
> > > 
> > > The fact that this happened is a real pity. However, it is clearly an
> > > exception and not the rule. Very few PRs have gone a long time without
> > > review. Can we somehow automatically send a notification if a PR has
> > > no assigned reviewer, or if it has not been reviewed after some time,
> > > as Tim suggested?
> > > 
> > > On Fri, Jan 25, 2019 at 9:43 AM Tim Robertson  
> > > wrote:
> > > > Thanks Kenn
> > > > 
> > > > I tend to think that timing is the main contributing factor as you note 
> > > > on the Jira - it slipped down with no
> > > > reminders / bumps sent on any channels that I can see.
> > > > 
> > > > Would something that alerts the dev@ list of PRs that have not received 
> > > > any attention after N days be helpful
> > > > perhaps?
> > > > Even if that only prompts one of us to comment on the PR that it has been acknowledged, that would likely be enough to engage the contributor - they would hopefully then ping the individual if it slips again for a long time.
> > > > 
> > > > Next week will be the first in 2019 I'll be able to work on Beam, but I'll comment on that PR now too, as it's missing tests.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > On Fri, Jan 25, 2019 at 7:27 AM Kenneth Knowles  wrote:
> > > > > The subject line is a quote from BEAM-6324*
> > > > > 
> > > > > This makes me sad. I hope/expect it is a failure to route a pull 
> > > > > request to the right reviewer. I am less sad
> > > > > about the functionality than the sentiment and how a contributor is 
> > > > > being discouraged.
> > > > > 
> > > > > Does anyone have ideas that could help?
> > > > > 
> > > > > Kenn
> > > > > 
> > > > > *https://issues.apache.org/jira/browse/BEAM-6324


Re: [ANNOUNCE] New committer announcement: Gleb Kanterov

2019-01-25 Thread Etienne Chauchot
Congrats Gleb and welcome onboard!
Etienne
Le vendredi 25 janvier 2019 à 10:39 +0100, Alexey Romanenko a écrit :
> Congrats to Gleb and welcome on board!
> 
> > On 25 Jan 2019, at 09:22, Tim Robertson  wrote:
> > 
> > Welcome Gleb and congratulations!
> > 
> > On Fri, Jan 25, 2019 at 8:06 AM Kenneth Knowles  wrote:
> > > Hi all,
> > > Please join me and the rest of the Beam PMC in welcoming a new committer: 
> > > Gleb Kanterov
> > > 
> > > Gleb started contributing to Beam and quickly dove deep, doing some 
> > > sensitive fixes to schemas, also general build
> > > issues, Beam SQL, Avro, and more. In consideration of Gleb's technical 
> > > and community contributions, the Beam PMC
> > > trusts Gleb with the responsibilities of a Beam committer [1].
> > > 
> > > Thank you, Gleb, for your contributions.
> > > 
> > > Kenn
> > > 
> > > [1] 
> > > https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

