Re: ByteBuddy DoFnInvokers Write Up

2024-01-11 Thread Ismaël Mejía
Neat! I remember spending a long time trying to decipher the DoFnInvoker
behavior, so this will definitely be helpful.

It might be a good idea to add the link to the Design Documents list for
future reference:
https://cwiki.apache.org/confluence/display/BEAM/Design+Documents

On Wed, Jan 10, 2024 at 9:15 PM Robert Burke  wrote:

> That's neat! Thanks for writing that up!
>
> On Wed, Jan 10, 2024, 11:12 AM John Casey via dev 
> wrote:
>
>> The team at Google recently held an internal hackathon, and my hack
>> involved modifying how our ByteBuddy DoFnInvokers work. My hack didn't end
>> up going anywhere, but I learned a lot about how our code generation works.
>> It turns out we have no documentation or design docs about our code
>> generation, so I wrote up what I learned.
>>
>> Please take a look, and let me know if I got anything wrong, or if you
>> are looking for more detail.
>>
>> s.apache.org/beam-bytebuddy-dofninvoker
>>
>> John
>>
>


Re: Lakehouse Formats with IO/Integration --> Hudi? Iceberg?

2023-11-07 Thread Ismaël Mejía
For Iceberg there has been a long-open issue and some WIP for a sink:
https://github.com/apache/beam/issues/20327



On Tue, Nov 7, 2023 at 2:08 AM Austin Bennett  wrote:

> Beam Devs,
>
> I was looking through GH Issues and online more generally and hadn't seen
> much...  Has anyone written any Beam IO or other integration for writing to
> [ or reading from ] either Hudi or Iceberg?
>
> Any experience that can be shared [ on list, else feel free to message me
> off list and I'll also share back what is OK to be shared ]?
>
> Thanks!
> Austin
>


Re: [ANNOUNCE] New PMC Member: Alex Van Boxel

2023-10-05 Thread Ismaël Mejía
Congratulations Alex, well deserved!

On Wed, Oct 4, 2023 at 11:59 PM Chamikara Jayalath 
wrote:

> Congrats Alex!
>
> On Wed, Oct 4, 2023 at 1:43 AM Jan Lukavský  wrote:
>
>> Congrats Alex!
>> On 10/4/23 10:29, Alexey Romanenko wrote:
>>
>> Congrats Alex, very well deserved!
>>
>> —
>> Alexey
>>
>> On 4 Oct 2023, at 00:38, Austin Bennett 
>>  wrote:
>>
>> Thanks for all you do, @Alex Van Boxel  !
>>
>> On Tue, Oct 3, 2023 at 12:50 PM Ahmed Abualsaud via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congratulations!
>>>
>>> On Tue, Oct 3, 2023 at 3:48 PM Byron Ellis via dev 
>>> wrote:
>>>
 Congrats!

 On Tue, Oct 3, 2023 at 12:40 PM Danielle Syse via dev <
 dev@beam.apache.org> wrote:

> Congratulations Alex!! Definitely well deserved!
>
> On Tue, Oct 3, 2023 at 2:57 PM Ahmet Altay via dev <
> dev@beam.apache.org> wrote:
>
>> Congratulations Alex! Well deserved!
>>
>> On Tue, Oct 3, 2023 at 11:54 AM Ritesh Ghorse via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congratulations Alex!
>>>
>>> On Tue, Oct 3, 2023 at 2:54 PM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congrats Alex, this is well deserved!

 On Tue, Oct 3, 2023 at 2:50 PM Jack McCluskey via dev <
 dev@beam.apache.org> wrote:

> Congrats, Alex!
>
> On Tue, Oct 3, 2023 at 2:49 PM XQ Hu via dev 
> wrote:
>
>> Congratulations, Alex!
>>
>> On Tue, Oct 3, 2023 at 2:40 PM Kenneth Knowles 
>> wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming Alex
>>> Van Boxel  as our newest PMC member.
>>>
>>> Alex has been with Beam since 2016, very early in the life of
>>> the project. Alex has contributed code, design ideas, and perhaps 
>>> most
>>> importantly been a huge part of organizing Beam Summits, and of 
>>> course
>>> presenting at them as well. Alex really brings the ASF community 
>>> spirit to
>>> Beam.
>>>
>>> Congratulations Alex and thanks for being a part of Apache Beam!
>>>
>>> Kenn, on behalf of the Beam PMC (which now includes Alex)
>>>
>>
>>


Re: [ANNOUNCE] New PMC Member: Robert Burke

2023-10-05 Thread Ismaël Mejía
Congratulations Robert, well deserved! Long live Go!

On Wed, Oct 4, 2023 at 11:58 PM Chamikara Jayalath 
wrote:

> Congrats Rebo!
>
> On Wed, Oct 4, 2023 at 1:42 AM Jan Lukavský  wrote:
>
>> Congrats Robert!
>> On 10/4/23 10:29, Alexey Romanenko wrote:
>>
>> Congrats Robert, very well deserved!
>>
>> —
>> Alexey
>>
>> On 4 Oct 2023, at 00:39, Austin Bennett 
>>  wrote:
>>
>> Thanks for all you do @Robert Burke  !
>>
>> On Tue, Oct 3, 2023 at 12:53 PM Ahmed Abualsaud <
>> ahmedabuals...@apache.org> wrote:
>>
>>> Congrats Rebo!
>>>
>>> On 2023/10/03 18:39:47 Kenneth Knowles wrote:
>>> > Hi all,
>>> >
>>> > Please join me and the rest of the Beam PMC in welcoming Robert Burke <
>>> > lostl...@apache.org> as our newest PMC member.
>>> >
>>> > Robert has been a part of the Beam community since 2017. He is our
>>> resident
>>> > Gopher, producing the Go SDK and most recently the local, portable,
>>> Prism
>>> > runner. Robert has presented on Beam many times, having written not
>>> just
>>> > core Beam code but quite interesting pipelines too :-)
>>> >
>>> > Congratulations Robert and thanks for being a part of Apache Beam!
>>> >
>>> > Kenn, on behalf of the Beam PMC (which now includes Robert)
>>> >
>>>
>>
>>


Re: [ANNOUNCE] New PMC Member: Valentyn Tymofieiev

2023-10-05 Thread Ismaël Mejía
Congratulations Valentyn, well deserved!

On Wed, Oct 4, 2023 at 11:58 PM Chamikara Jayalath 
wrote:

> Congrats Valentyn!
>
> On Wed, Oct 4, 2023 at 1:42 AM Jan Lukavský  wrote:
>
>> Congrats Valentyn!
>> On 10/4/23 10:26, Alexey Romanenko wrote:
>>
>> Congrats Valentyn, very well deserved!
>>
>> —
>> Alexey
>>
>> On 4 Oct 2023, at 00:39, Austin Bennett 
>>  wrote:
>>
>> Thanks for everything @Valentyn Tymofieiev  !
>>
>> On Tue, Oct 3, 2023 at 12:53 PM Ahmed Abualsaud <
>> ahmedabuals...@apache.org> wrote:
>>
>>> Congrats Valentyn!
>>>
>>> On 2023/10/03 18:39:49 Kenneth Knowles wrote:
>>> > Hi all,
>>> >
>>> > Please join me and the rest of the Beam PMC in welcoming Valentyn
>>> > Tymofieiev  as our newest PMC member.
>>> >
>>> > Valentyn has been contributing to Beam since 2017. Notable highlights
>>> > include his work on the Python SDK and also in our container
>>> management.
>>> > Valentyn also is involved in many discussions around Beam's
>>> infrastructure
>>> > and community processes. If you look through Valentyn's history, you
>>> will
>>> > see an abundance of the most critical maintenance work that is the
>>> beating
>>> > heart of any project.
>>> >
>>> > Congratulations Valentyn and thanks for being a part of Apache Beam!
>>> >
>>> > Kenn, on behalf of the Beam PMC (which now includes Valentyn)
>>> >
>>>
>>
>>


Re: Automatic signing of releases

2023-08-24 Thread Ismaël Mejía
Ah excellent, I was not aware that was the case; great to know we are
already ahead!

On Thu, Aug 24, 2023 at 4:54 PM Danny McCormick via dev 
wrote:

> Hey Ismael,
>
> We've actually already been doing this since May! I started a thread for
> this here https://lists.apache.org/thread/mw9dbbdjtkqlvs0mmrh452z3jsf68sct and
> it's in an Actions workflow here -
> https://github.com/apache/beam/blob/0c6ef7dd4788d13b3785d4e06d4552907c7200e3/.github/workflows/build_release_candidate.yml#L67
>
> Thanks for calling this out though, it's definitely nice that ASF has
> published some more formal documentation/process around this. Previously I
> had to ask the VP of Security for special permission to do this 🙃
>
> Thanks,
> Danny
>
> On Thu, Aug 24, 2023 at 10:48 AM Ismaël Mejía  wrote:
>
>> Hi,
>>
>> I just saw an interesting change on the ASF side that could be of
>> interest for Beam releases.
>>
>> The ASF now allows signing of releases by automated infrastructure.
>> https://issues.apache.org/jira/browse/LEGAL-647
>>
>> This is a good step for automation that I remember we discussed at the
>> beginning of the project, so maybe someone has some cycles to do the
>> work and avoid errors/manual work in the future.
>>
>> Regards,
>> Ismaël
>>
>>


Automatic signing of releases

2023-08-24 Thread Ismaël Mejía
Hi,

I just saw an interesting change on the ASF side that could be of interest
for Beam releases.

The ASF now allows signing of releases by automated infrastructure.
https://issues.apache.org/jira/browse/LEGAL-647

This is a good step for automation that I remember we discussed at the
beginning of the project, so maybe someone has some cycles to do the
work and avoid errors/manual work in the future.

Regards,
Ismaël


Re: FOSDEM 2023 is back as in person event

2022-10-21 Thread Ismaël Mejía
Hi Aizhamal,

You might be interested in this thread where the ASF people are also
discussing FOSDEM participation.
https://lists.apache.org/thread/kv4fhldmc9mo6v5lwtkwqtwg97l64lx1

It seems the call for devrooms is closed, so maybe it is too late for
Beam, but we have had talks about Beam in the past as part of the Big
Data track, so it may be worth participating there.

Best,
Ismaël

On Mon, Oct 17, 2022 at 9:06 PM Aizhamal Nurmamat kyzy
 wrote:
>
> Hi Beam community!
>
> FOSDEM 2023  is back as an in person event! I have 
> heard only great things about the event where thousands of developers get 
> together to talk all about open source!
>
> Is anyone from the Beam community planning to attend? The event takes place 
> in Brussels on February 4 & 5, 2023. I believe it is also free to attend but 
> don't quote me on this.
>
> As an open source project we can also have
> - a stand for free https://fosdem.org/2023/news/2022-09-26-stands-cfp/
> - a Devroom https://fosdem.org/2023/news/2022-09-29-call_for_devrooms/
>
> Anyone interested?
>


Re: [DISCUSS] Jenkins -> GitHub Actions ?

2022-10-21 Thread Ismaël Mejía
+1. GitHub Actions is more intuitive and easier to modify and test for everyone.
Beam also wins because that makes it one less system to maintain.

Regards,
Ismaël

On Wed, Oct 19, 2022 at 5:50 PM Danny McCormick via dev
 wrote:
>
> Thanks for kicking this conversation off. I'm +1 on migrating, but only once 
> we've found a specific replacement for easy observability (which workflows 
> have been failing lately, and how often) and trigger phrases (for retries and 
> workflows that aren't automatically kicked off but should be run for extra 
> validation, e.g. postcommits). Until we have viable replacements, I don't 
> think we should make the move. Publishing nightly snapshots is eventually 
> also a must to fully migrate, but probably doesn't need to block us from 
> making progress here.
>
> With those caveats, the reason that I'm +1 on moving is that our Jenkins 
> reliability has been rough. Since I joined the project in January, I can 
> think of 3 different incidents that significantly harmed our ability to do 
> work.
>
> 1. Jenkins triggers cause multi-day outage - this led to a multi-day code 
> freeze, and we lost our trigger functionality for days afterwards. 
> Investigating/restoring our state ate up a pretty full week for me.
> 2. Jenkins plugin cause multi-day outage - this led to multiple days of 
> Jenkins downtime before eventually being resolved by Infra.
> 3. Cert issues cause many workers to go down - I don't have a thread for this 
> because I handled most of the investigation the day of, but many of our 
> workers went down for around a day and nobody noticed until queue time 
> reached 6+ hours for each workflow.
>
> There may be others that I'm overlooking.
>
> GitHub Actions isn't a magic bullet to fix these problems, but it minimizes 
> the amount of infra that we're maintaining ourselves, increases the isolation 
> between workflows (catastrophic failure is less likely), has uptime 
> guarantees, and is more likely to receive investment going forward (we're 
> likely to get increasing benefits over time for free). We've also done a lot 
> of exploration in this area already, so we're not starting from scratch.
>
> Thanks,
> Danny
>
> On Wed, Oct 19, 2022 at 11:32 AM Kenneth Knowles  wrote:
>>
>> Hi all,
>>
>> As you probably noticed, there's a lot of work going on around adding more 
>> GitHub Actions workflows.
>>
>> Can we fully migrate to GitHub Actions? Similar to our GitHub Issues 
>> migration (but less user-facing) it would bring us on to "default" 
>> infrastructure that more people understand and is maintained by GitHub.
>>
>> So far we have hit some serious roadblocks. It isn't just a simple 
>> migration. We have to weigh doing the work to get there.
>>
>> I started a document with a table of the things we get from Jenkins that we 
>> need to be sure to have for GitHub Actions before we could think about 
>> migrating:
>>
>> https://s.apache.org/beam-jenkins-to-gha
>>
>> Can you please help me by adding things that we get from Jenkins, and if you 
>> know how to get them from GitHub Actions add that too.
>>
>> Thanks!
>>
>> Kenn


Re: unvendoring bytebuddy

2022-03-17 Thread Ismaël Mejía
+1

Probably worth checking whether we have dependencies that rely on Byte
Buddy that could produce conflicts, but I doubt it.
My only worry was ASM leaking into the classpath, but it seems that
Byte Buddy already shades ASM so that should not be an issue.


Ismaël

On Thu, Mar 17, 2022 at 5:09 PM Liam Miller-Cushon  wrote:
>
> Hello,
>
> I wanted to raise the possibility of using the upstream version of bytebuddy 
> in beam, instead of vendoring it, from BEAM-14117.
>
> Vendoring the bytebuddy dep was introduced in BEAM-1019:
>
>> We encountered backward incompatible changes in bytebuddy during upgrading 
>> to Mockito 2.0. Shading bytebuddy helps to address them and future issues.
>
>
> Vendoring makes it harder to upgrade the bytebuddy version, and upgrading 
> bytebuddy is one of the first things that needs to happen to support new Java 
> versions (e.g. BEAM-14065, BEAM-12241).
>
> Vendoring or shading bytebuddy is discouraged by the upstream owners of the 
> library, see e.g. https://github.com/assertj/assertj-core/issues/2470 where 
> assertj was migrated off a shaded version:
>
>> As Byte Buddy retains compatibility, not shading the library would allow 
>> running recent JVMs without an update of assertj but only BB. Other 
>> libraries like Mockito or Hibernate do not shade BB and there are no known 
>> issues with this approach.
>
>
> Does anyone have additional context about the issues encountered during the 
> mockito 2.0 upgrade, or concerns with trying to unvendor bytebuddy?
>
> Thanks,
> Liam


Re: Beam job details not available on Spark History Server

2022-02-24 Thread Ismaël Mejía
Hello Jozef, this change was not introduced in the PR you referenced;
that PR was just a refactor.
The conflicting change was added in [1] via [2] starting on Beam 2.29.0.

It is not clear to me why this was done, but maybe Kyle Weaver or
someone else has better context.

Let's continue the discussion on the issue. For me the right solution
is to remove the extra event logging listener, but I might be missing
something. In any case, the runner does not seem to be the best place
to deal with this logic; this seems error-prone.

[1] https://github.com/apache/beam/pull/13743/
[2] https://github.com/apache/beam/commit/291ced166af

On Wed, Feb 23, 2022 at 3:41 PM Jozef Vilcek  wrote:
>
> I would like to discuss a problem I am facing upgrading Beam 2.24.0 -> 2.33.0.
>
> Running Beam batch jobs on SparkRunner with Spark 2.4.4 stopped showing me 
> job details on Spark History Server. Problem is that there are 2 event 
> logging listeners running and they step on each other. More details in [1]. 
> One is run by Spark itself, the other is started by Beam, which was added by 
> MR [2].
>
> My first question is towards understanding why there is Spark's event logging 
> listener started manually within Beam next to the one started by Spark 
> Context internally?
>
>
> [1] https://issues.apache.org/jira/browse/BEAM-13981
> [2] https://github.com/apache/beam/pull/14409
>
>


Re: Are runners-spark2 and beam-site branches obsolete?

2022-02-16 Thread Ismaël Mejía
**beam-site** is there for legacy reasons; I suppose we can remove it
without any consequence. Most of the history is in the other repo and
the actual site is in the master branch.

**runners-spark2** I think we can go ahead and remove it. This was the
in-progress work of Amit Sela, who has not been active in the project
for a while. We already have support for Spark 2 and even for Spark 3
with both the old and the new Spark APIs, so most of the features of
that branch should be covered, and if they are not they would be quite
hard to migrate to the current codebase (e.g. the portable runner was
not around, nor were the multiple runner implementations).

The other runners are arguably useful for reference but not based on
the new Runner/Fn APIs and clearly unmaintained; they are based on
Beam versions 4-5 years old, so I would also think we should maybe
remove them, or hold a vote for removal of the branches. It is not that
the branches are a problem, but they are probably not that useful either.

Ismaël



On Fri, Feb 4, 2022 at 5:52 PM Kenneth Knowles  wrote:
>
> Hi all,
>
> I was cleaning up branches on the main repo like a lot of "patch" and 
> "revert" branches that GitHub creates through the UI. I noticed the following 
> branches that represent real work and I was thinking about adding them to the 
> list of "protected" branches so they cannot be accidentally deleted:
>
>  * beam-site
>  * runners-spark2
>  * tez-runner
>  * mr-runner
>  * jstorm-runner
>
> Now, I know that the last three are very very old but I think it is harmless 
> to keep them around. But I wonder if the top two are entirely obsolete now. 
> They have not been updated in a very long time. Does anyone know?
>
> Kenn


Re: Re: [Question][Contribution] Python SDK ByteKeyRange

2022-02-15 Thread Ismaël Mejía
Oh, I forgot to also add the link to the tests that cover most of those
unexpected cases:
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTrackerTest.java


On Tue, Feb 15, 2022 at 10:17 AM Ismaël Mejía  wrote:

> Great idea, please take a look at the Java ByteKeyRestrictionTracker
> implementation for consistency [1]
> I remember we had to deal with lots of corner cases, so it is probably
> worth a look.
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTracker.java
>
>
> On Mon, Feb 14, 2022 at 6:39 PM Robert Bradshaw 
> wrote:
>
>> +1 to being forward looking and making restriction trackers.
>> Hopefully the restriction tracker and existing range tracker could share
>> 90% of their code.
>>
>> On Mon, Feb 14, 2022 at 9:36 AM Sami Niemi  wrote:
>>
>>> Hello Robert,
>>>
>>>
>>>
>>> Beam has documented only OffsetRangeTracker [1] for new SDF API. Since
>>> Beam is moving away from Source API, I thought it would be nice to develop
>>> IO connectors by using new SDFs. For this I need to create restriction
>>> tracker that follows new SDF API.
>>>
>>>
>>>
>>> So I propose adding ByteKeyRange as new restriction class and
>>> ByteKeyRestrictionTracker as new restriction tracker class. In my
>>> implementation I’ve also used ByteKey class which are given to restriction.
>>>
>>>
>>>
>>>1.
>>>
>>> https://github.com/apache/beam/blob/7eb7fd017a43353204eb8037603409dda7e0414a/sdks/python/apache_beam/io/restriction_trackers.py#L76
>>>
>>>
>>>
>>> On 2022/02/11 18:27:23 Robert Bradshaw wrote:
>>>
>>> > Hi Sam! Glad to hear you're willing to contribute.
>>>
>>> >
>>>
>>> > Though the name is a bit different, I'm wondering if this is already
>>>
>>> > present as LexicographicKeyRangeTracker.
>>>
>>> >
>>> https://github.com/apache/beam/blob/release-2.35.0/sdks/python/apache_beam/io/range_trackers.py#L349
>>>
>>> >
>>>
>>> > On Fri, Feb 11, 2022 at 9:54 AM Ahmet Altay  wrote:
>>>
>>> > >
>>>
>>> > > Hi Sami. Thank you for your interest.
>>>
>>> > >
>>>
>>> > > Adding people who might be able to comment: @Chamikara Jayalath
>>> @Lukasz Cwik
>>>
>>> > >
>>>
>>> > > On Thu, Feb 10, 2022 at 8:38 AM Sami Niemi  wrote:
>>>
>>> > >>
>>>
>>> > >> Hello,
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >> I noticed that Python SDK only has implementation for
>>> OffsetRangeTracker and OffsetRange while Java also has ByteKeyRange and
>>> -Tracker.
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >> I have currently created simple implementations of following Python
>>> classes:
>>>
>>> > >>
>>>
>>> > >> ByteKey
>>>
>>> > >> ByteKeyRange
>>>
>>> > >> ByteKeyRestrictionTracker
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >> I would like to make contribution and make these available in
>>> Python SDK in addition to OffsetRange and -Tracker. I would like to hear
>>> any thoughts about this and should I make a contribution.
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >> Thank you,
>>>
>>> > >>
>>>
>>> > >> Sami Niemi
>>>
>>> >
>>>
>>> *SAMI NIEMI*
>>> Data Engineer
>>> +358 50 412 2115 <+358%2050%204122115>
>>> sami.ni...@solita.fi
>>>
>>>
>>>
>>> *SOLITA*
>>> Eteläesplanadi 8
>>> 00130 Helsinki
>>> solita.fi <https://www.solita.fi>
>>>
>>>
>>>
>>
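To make the proposed contribution concrete, here is a minimal, self-contained sketch of what a lexicographic byte-key restriction tracker could look like. The class and method names mirror the proposal but are hypothetical, not the actual Beam classes; it implements only the claim/checkpoint path and deliberately skips the fraction-based splitting (interpolating between byte strings is exactly the kind of corner case the Java tracker linked above handles):

```python
class ByteKeyRange:
    """Half-open range [start, stop) over byte-string keys, ordered
    lexicographically. The real Beam classes also treat an empty stop key
    as 'unbounded'; this sketch does not."""

    def __init__(self, start: bytes, stop: bytes):
        if not start < stop:
            raise ValueError("start must sort strictly before stop")
        self.start, self.stop = start, stop


class ByteKeyRestrictionTracker:
    """Tracks how much of a ByteKeyRange has been processed."""

    def __init__(self, byte_key_range: ByteKeyRange):
        self._range = byte_key_range
        self._last_claimed = None  # last successfully claimed key

    def current_restriction(self) -> ByteKeyRange:
        return self._range

    def try_claim(self, key: bytes) -> bool:
        # Keys must be claimed in strictly increasing order within the range.
        if self._last_claimed is not None and key <= self._last_claimed:
            raise ValueError("keys must be claimed in increasing order")
        if key < self._range.start:
            raise ValueError("key sorts before the start of the restriction")
        if key >= self._range.stop:
            return False  # past the end: the DoFn should stop processing
        self._last_claimed = key
        return True

    def try_split(self, fraction_of_remainder: float):
        # Checkpoint only (fraction 0): the primary keeps [start, k + b'\x00')
        # where k is the last claimed key (appending b'\x00' yields the
        # smallest key sorting strictly after k); the residual gets the rest.
        # Arbitrary fractions would require interpolating between byte
        # strings and are not implemented in this sketch.
        if fraction_of_remainder != 0 or self._last_claimed is None:
            return None
        split_key = self._last_claimed + b"\x00"
        if split_key >= self._range.stop:
            return None  # nothing left to hand back
        primary = ByteKeyRange(self._range.start, split_key)
        residual = ByteKeyRange(split_key, self._range.stop)
        self._range = primary
        return primary, residual


tracker = ByteKeyRestrictionTracker(ByteKeyRange(b"a", b"c"))
assert tracker.try_claim(b"a") and tracker.try_claim(b"b")
primary, residual = tracker.try_split(0)
assert (primary.start, primary.stop) == (b"a", b"b\x00")
assert (residual.start, residual.stop) == (b"b\x00", b"c")
assert not tracker.try_claim(b"c")  # now outside the shrunk primary
```

The `k + b'\x00'` trick for the split point is one reason byte-key trackers are subtler than offset trackers: there is no "+1" on byte strings, only a smallest strictly-greater key.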


Re: Re: [Question][Contribution] Python SDK ByteKeyRange

2022-02-15 Thread Ismaël Mejía
Great idea, please take a look at the Java ByteKeyRestrictionTracker
implementation for consistency [1]
I remember we had to deal with lots of corner cases, so it is probably
worth a look.

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTracker.java


On Mon, Feb 14, 2022 at 6:39 PM Robert Bradshaw  wrote:

> +1 to being forward looking and making restriction trackers. Hopefully the
> restriction tracker and existing range tracker could share 90% of their
> code.
>
> On Mon, Feb 14, 2022 at 9:36 AM Sami Niemi  wrote:
>
>> Hello Robert,
>>
>>
>>
>> Beam has documented only OffsetRangeTracker [1] for new SDF API. Since
>> Beam is moving away from Source API, I thought it would be nice to develop
>> IO connectors by using new SDFs. For this I need to create restriction
>> tracker that follows new SDF API.
>>
>>
>>
>> So I propose adding ByteKeyRange as new restriction class and
>> ByteKeyRestrictionTracker as new restriction tracker class. In my
>> implementation I’ve also used ByteKey class which are given to restriction.
>>
>>
>>
>>1.
>>
>> https://github.com/apache/beam/blob/7eb7fd017a43353204eb8037603409dda7e0414a/sdks/python/apache_beam/io/restriction_trackers.py#L76
>>
>>
>>
>> On 2022/02/11 18:27:23 Robert Bradshaw wrote:
>>
>> > Hi Sam! Glad to hear you're willing to contribute.
>>
>> >
>>
>> > Though the name is a bit different, I'm wondering if this is already
>>
>> > present as LexicographicKeyRangeTracker.
>>
>> >
>> https://github.com/apache/beam/blob/release-2.35.0/sdks/python/apache_beam/io/range_trackers.py#L349
>>
>> >
>>
>> > On Fri, Feb 11, 2022 at 9:54 AM Ahmet Altay  wrote:
>>
>> > >
>>
>> > > Hi Sami. Thank you for your interest.
>>
>> > >
>>
>> > > Adding people who might be able to comment: @Chamikara Jayalath
>> @Lukasz Cwik
>>
>> > >
>>
>> > > On Thu, Feb 10, 2022 at 8:38 AM Sami Niemi  wrote:
>>
>> > >>
>>
>> > >> Hello,
>>
>> > >>
>>
>> > >>
>>
>> > >>
>>
>> > >> I noticed that Python SDK only has implementation for
>> OffsetRangeTracker and OffsetRange while Java also has ByteKeyRange and
>> -Tracker.
>>
>> > >>
>>
>> > >>
>>
>> > >>
>>
>> > >> I have currently created simple implementations of following Python
>> classes:
>>
>> > >>
>>
>> > >> ByteKey
>>
>> > >> ByteKeyRange
>>
>> > >> ByteKeyRestrictionTracker
>>
>> > >>
>>
>> > >>
>>
>> > >>
>>
>> > >> I would like to make contribution and make these available in Python
>> SDK in addition to OffsetRange and -Tracker. I would like to hear any
>> thoughts about this and should I make a contribution.
>>
>> > >>
>>
>> > >>
>>
>> > >>
>>
>> > >> Thank you,
>>
>> > >>
>>
>> > >> Sami Niemi
>>
>> >
>>
>> *SAMI NIEMI*
>> Data Engineer
>> +358 50 412 2115 <+358%2050%204122115>
>> sami.ni...@solita.fi
>>
>>
>>
>> *SOLITA*
>> Eteläesplanadi 8
>> 00130 Helsinki
>> solita.fi 
>>
>>
>>
>


Re: Developing on an M1 Mac

2022-02-09 Thread Ismaël Mejía
Thanks for pointing this out Robert, I somehow had it in my mind that it
was not official until 1.18 but I forgot to double-check. For info, I
was able to build the Go SDK container without any trouble, so I assume
most of the things are 'ready'.

It would be great to take a look at fixing the hardcoded arch uses in
paths and elsewhere (grep -i -R 'amd64'). Don't hesitate to ping me if
you need help with some test; I have easy access to AWS Graviton too.
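For anyone picking this up, the grep above can also be sketched as a small cross-platform script; this is just an illustrative helper (not anything in the Beam repo) that walks a tree and reports lines hardcoding a CPU architecture:

```python
import pathlib


def find_hardcoded_arch(root: str, needles=("amd64", "x86_64")):
    """Rough Python equivalent of `grep -i -R 'amd64'`: walk a source tree
    and collect (path, line number, line) for lines that hardcode an arch."""
    hits = []
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")  # skip undecodable bytes
        except OSError:
            continue  # unreadable file (permissions, special files)
        for lineno, line in enumerate(text.splitlines(), start=1):
            if any(n in line.lower() for n in needles):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Running it over, say, the SDK container directories would give a checklist of paths to parameterize before multi-arch images can work.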

Of course the Beam community has never stated ARM64 compatibility as a
target, but with the increasing rise of this arch on the server side
and now on the Mac, it would probably be worth it.

In case we decide to do it, GitHub Actions does not support ARM64 yet,
so other projects validate ARM64 through emulation via qemu, but this
seems to be quite slow.



On Tue, Feb 8, 2022 at 3:58 PM Robert Burke  wrote:
>
> Go supports ARM64 on Darwin since 1.16, which is the minimum version of Go we 
> currently support.
>
> See https://go.dev/blog/ports
>
> There are definitely some hardcoded paths we'd need to adjust to build boot 
> containers though.
>
> Go 1.18 improves things, and since it has the initial run of Go Generics, 
> we'll likely move to support it pretty quickly.
>
> On Tue, Feb 8, 2022, 6:18 AM Jarek Potiuk  wrote:
>>
>> Just for your information: Thanks to that change - i will soon be adding ARM 
>> support for Apache Airflow - including building and publishing the images 
>> and running our tests (using self-hosted runners).
>> As soon as I get it I will be able to share the code/experiences with you.
>>
>> J
>>
>> On Tue, Feb 8, 2022 at 2:50 PM Ismaël Mejía  wrote:
>>>
>>> For awareness: with the just-released Beam 2.36.0, Beam works out of the
>>> box for development on a Mac M1.
>>>
>>> I tried Java and Python pipelines with success running locally on both
>>> Flink/Spark runner.
>>> I found one issue using zstd and created [1] that was merged today,
>>> with this the sdks:core tests and Spark runner tests fully pass.
>>>
>>> I would say 2.36.0 is the first good-enough release for someone
>>> working on a Mac M1 or ARM64 processor.
>>>
>>> There are still some missing steps to have full ARM64 support [apart from
>>> testing it :)]
>>>
>>> 1. In theory we could run x86 docker images on ARM but those would be
>>> emulated and so way slower, so it is probably better to support 'native'
>>> CPUs via multi-architecture docker images [2].
>>> BEAM-11704 Support Beam docker images on ARM64
>>>
>>> I could create the runners images from master, for the SDK containers
>>> there are some issues with hardcoded paths [2] and virtualenv that
>>> probably will be solved once we move to venv, and we will need to
>>> upgrade our release process to include multiarch images (for user
>>> friendliness).
>>>
>>> Also golang only supports officially ARM64 starting with version
>>> 1.18.0 so we need to move up to that version.
>>>
>>> Anyway Beam is in a waaay better shape for ARM64 now than 1y ago when
>>> I created the initial JIRAs.
>>>
>>> Ismaël
>>>
>>> [1] https://github.com/apache/beam/pull/16755
>>> [2] https://issues.apache.org/jira/browse/BEAM-11704
>>> [3] 
>>> https://github.com/apache/beam/blob/d1b8e569fd651975f08823a3db49dbee56d491b5/sdks/python/container/Dockerfile#L79
>>>
>>>
>>>
>>>> Could not find protoc-3.14.0-osx-aarch_64.exe
>>> (com.google.protobuf:protoc:3.14.0).
>>>  Searched in the following locations:
>>>  
>>> https://jcenter.bintray.com/com/google/protobuf/protoc/3.14.0/protoc-3.14.0-osx-aarch_64.exe
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jan 12, 2022 at 9:53 PM Luke Cwik  wrote:
>>> >
>>> > The docker container running in an x86 based cloud machine should work 
>>> > pretty well. This is what Apache Beam's Jenkins setup effectively does.
>>> >
>>> > No experience with developing on an ARM based CPU.
>>> >
>>> > On Wed, Jan 12, 2022 at 9:28 AM Jarek Potiuk  wrote:
>>> >>
>>> >> Comment from the side - If you use Docker - experience from Airflow -
>>> >> until we will get ARM images, docker experience is next to unusable
>>> >> (docker filesystem slowness + emulation).
>>> >>
>>> >> J.
>>> >>
>>> >> On Wed, Jan 12, 2022 at 6:21 PM Daniel Collins  
>>> >> wrote:
>>> >> >
>>> >> > I regularly develop on a non-m1 mac using intellij, which mostly works 
>>> >> > out of the box. Are you running into any particular issues building or 
>>> >> > just looking for advice?
>>> >> >
>>> >> > -Daniel
>>> >> >
>>> >> > On Wed, Jan 12, 2022 at 12:16 PM Matt Rudary 
>>> >> >  wrote:
>>> >> >>
>>> >> >> Does anyone do Beam development on an M1 Mac? Any tips to getting 
>>> >> >> things up and running?
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> Alternatively, does anyone have a good “workstation in the cloud” 
>>> >> >> setup?
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> Thanks
>>> >> >>
>>> >> >> Matt


Re: Developing on an M1 Mac

2022-02-08 Thread Ismaël Mejía
For awareness: with the just-released Beam 2.36.0, Beam works out of the
box for development on a Mac M1.

I tried Java and Python pipelines with success running locally on both
Flink/Spark runner.
I found one issue using zstd and created [1] that was merged today,
with this the sdks:core tests and Spark runner tests fully pass.

I would say 2.36.0 is the first good-enough release for someone
working on a Mac M1 or ARM64 processor.

There are still some missing steps to have full ARM64 support [apart from testing it :)]

1. In theory we could run x86 docker images on ARM but those would be
emulated and so way slower, so it is probably better to support 'native'
CPUs via multi-architecture docker images [2].
BEAM-11704 Support Beam docker images on ARM64

I could create the runners images from master, for the SDK containers
there are some issues with hardcoded paths [2] and virtualenv that
probably will be solved once we move to venv, and we will need to
upgrade our release process to include multiarch images (for user
friendliness).

Also golang only officially supports ARM64 starting with version
1.18.0, so we need to move up to that version.

Anyway Beam is in a waaay better shape for ARM64 now than 1y ago when
I created the initial JIRAs.

Ismaël

[1] https://github.com/apache/beam/pull/16755
[2] https://issues.apache.org/jira/browse/BEAM-11704
[3] 
https://github.com/apache/beam/blob/d1b8e569fd651975f08823a3db49dbee56d491b5/sdks/python/container/Dockerfile#L79



> Could not find protoc-3.14.0-osx-aarch_64.exe
> (com.google.protobuf:protoc:3.14.0).
> Searched in the following locations:
> https://jcenter.bintray.com/com/google/protobuf/protoc/3.14.0/protoc-3.14.0-osx-aarch_64.exe





On Wed, Jan 12, 2022 at 9:53 PM Luke Cwik  wrote:
>
> The docker container running in an x86 based cloud machine should work pretty 
> well. This is what Apache Beam's Jenkins setup effectively does.
>
> No experience with developing on an ARM based CPU.
>
> On Wed, Jan 12, 2022 at 9:28 AM Jarek Potiuk  wrote:
>>
>> Comment from the side - If you use Docker - experience from Airflow -
>> until we will get ARM images, docker experience is next to unusable
>> (docker filesystem slowness + emulation).
>>
>> J.
>>
>> On Wed, Jan 12, 2022 at 6:21 PM Daniel Collins  wrote:
>> >
>> > I regularly develop on a non-m1 mac using intellij, which mostly works out 
>> > of the box. Are you running into any particular issues building or just 
>> > looking for advice?
>> >
>> > -Daniel
>> >
>> > On Wed, Jan 12, 2022 at 12:16 PM Matt Rudary  
>> > wrote:
>> >>
>> >> Does anyone do Beam development on an M1 Mac? Any tips to getting things 
>> >> up and running?
>> >>
>> >>
>> >>
>> >> Alternatively, does anyone have a good “workstation in the cloud” setup?
>> >>
>> >>
>> >>
>> >> Thanks
>> >>
>> >> Matt


Re: [ANNOUNCE] Apache Beam 2.36.0 Release

2022-02-08 Thread Ismaël Mejía
Great work Emily and everyone!

I am glad to see that with the dependency updates this is the first
Beam release that works correctly out of the box on ARM64. I tried
some hello-world examples on a Mac M1 with both Java and Python and
they work ok.

Ismaël




On Tue, Feb 8, 2022 at 9:49 AM Jarek Potiuk  wrote:
>
> Thanks a lot for that Emily!
>
> It's been a release we were waiting for at Apache Airflow.
> I believe It will unblock a number of "modernizations" in our pipeline - 
> Python 3.10, ARM support were quite a bit depending on it (mostly through 
> numpy transitive dependency limitation). Great to see this one out!
>
> J.
>
> On Tue, Feb 8, 2022 at 3:39 AM Emily Ye  wrote:
>>
>> The Apache Beam team is pleased to announce the release of version 2.36.0.
>>
>> Apache Beam is an open source unified programming model to define and
>> execute data processing pipelines, including ETL, batch and stream
>> (continuous) processing. See https://beam.apache.org
>>
>> You can download the release here:
>> https://beam.apache.org/get-started/downloads/
>>
>> This release includes bug fixes, features, and improvements detailed
>> on the Beam blog: https://beam.apache.org/blog/beam-2.36.0/
>>
>> Thank you to everyone who contributed to this release, and we hope you
>> enjoy using Beam 2.36.0
>>
>> - Emily, on behalf of the Apache Beam community.


Re: [DISCUSS] propdeps removal and what to do going forward

2022-01-13 Thread Ismaël Mejía
Optional dependencies should not be a major issue.

What matters, to validate that we are not breaking users, is to compare
the generated POM files with the previous (pre-Gradle 7 / 2.35.0)
version and check that what was provided is still provided.

In particular the Hadoop/Spark and Kafka dependencies must be
**provided** as they were. I am not sure of others but those three
matter.

Ismaël
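
For readers less familiar with the Maven side, "still provided" means the
generated POM should keep entries like the one below, so consumers must
declare the dependency themselves to get it at runtime. This is only an
illustration (the coordinates and version property are placeholders, not
copied from Beam's actual build files):

```xml
<!-- A provided-scope dependency: on the module's compile/test classpath,
     but NOT transitive, so consumers must supply it themselves at runtime. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
  <scope>provided</scope>
</dependency>
```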

On Wed, Jan 12, 2022 at 10:55 PM Emily Ye  wrote:
>
> We've chatted offline and have a tentative plan for what to do with these 
> dependencies that are currently marked as compileOnly (instead of provided). 
> Please review the list if possible [1].
>
> Two projects we aren't sure about:
>
> :sdks:java:io:hcatalog
>
> library.java.jackson_annotations
> library.java.jackson_core
> library.java.jackson_databind
> library.java.hadoop_common
> org.apache.hive:hive-exec
> org.apache.hive.hcatalog:hive-hcatalog-core
>
> :sdks:java:io:parquet
>
> library.java.hadoop_client
>
>
> Does anyone have experience with either of these IOs? ccing Chamikara
>
> Thank you,
> Emily
>
>
> [1] 
> https://docs.google.com/spreadsheets/d/1UpeQtx1PoAgeSmpKxZC9lv3B9G1c7cryW3iICfRtG1o/edit?usp=sharing
>
> On Tue, Jan 11, 2022 at 6:38 PM Emily Ye  wrote:
>>
>> As the person volunteering to do fixes for this to unblock Beam 2.36.0, I 
>> created a spreadsheet of the projects with dependencies changed from 
>> provided to compile only [1]. I pre-filled with what I think things should 
>> be, but I don't have very much background in java/maven/gradle 
>> configurations so please give input!
>>
>> Some (mainly hadoop/kafka) I left blank, since I'm not sure - do we keep 
>> them provided because it depends on the user's version?
>>
>> [1] 
>> https://docs.google.com/spreadsheets/d/1UpeQtx1PoAgeSmpKxZC9lv3B9G1c7cryW3iICfRtG1o/edit?usp=sharing
>>
>> On Tue, Jan 11, 2022 at 1:17 PM Luke Cwik  wrote:
>>>
>>> I'm not following what you're trying to say Kenn since provided in maven 
>>> requires the user to explicitly add the dependency themselves to have it 
>>> part of their runtime.
>>>
>>> As per 
>>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#dependency-scope
>>> "
>>> * provided
>>> This is much like compile, but indicates you expect the JDK or a container 
>>> to provide the dependency at runtime. For example, when building a web 
>>> application for the Java Enterprise Edition, you would set the dependency 
>>> on the Servlet API and related Java EE APIs to scope provided because the 
>>> web container provides those classes. A dependency with this scope is added 
>>> to the classpath used for compilation and test, but not the runtime 
>>> classpath. It is not transitive."
>>>
>>> On Tue, Jan 11, 2022 at 11:54 AM Kenneth Knowles  wrote:

 To clarify: "provided" should have been in the test runtime configuration, 
 but not in the shipped runtime configuration (otherwise dep resolution for 
 users would pull in provided deps, which should not happen)

 On Thu, Dec 30, 2021 at 10:05 AM Luke Cwik  wrote:
>
> During the migration to Gradle 7[1] the propdeps plugin was removed[2] 
> since there wasn't a newer version that was compatible with Gradle 7 and 
> a replacement couldn't be found. All existing usages of "provided" were 
> moved to "compileOnly" and "compileOnly" is being mapped to the 
> "provided" maven scope in the generated pom files. This has lead to two 
> issues:
> 1) provided was also part of the runtime configuration, so we are getting 
> a few class not found exceptions when running tests [3]
> 2) the generated pom.xml will have a bunch of compile time only 
> annotations added as a provided dependency in the generated pom files[4]
>
> #1 can be fixed by adding the dependency to both the "compileOnly" and 
> "runtimeOnly" configurations or by adding dependency to the 
> "implementation" configuration
> #2 will make the pom files messier which can lead to confusion for users 
> but shouldn't impact existing uses.
>
> There was a suggestion[4] to completely remove the usage of provided from 
> the generated pom.xml and have all our previously "provided" dependencies 
> declared as "implementation" allowing us to solve both #1 and #2 above.
>
> The largest usage of "provided" in the past was to packages related to 
> the hadoop ecosystem and afterwards it was for packages such as 
> junit/hamcrest/aircompressor in sdks/java/core which aren't required to 
> use the module but can provide additional features if the dependency 
> exists.
>
> What should we migrate if anything to the "implementation" configuration 
> or should we try to recreate what we were doing with the "provided" 
> configuration in the past?
>
> 1: https://issues.apache.org/jira/browse/BEAM-13430
> 2: https://github.com/apache/beam/pull/16308
> 3: https://issues.apache.

Re: Contributor permission for Beam Jira tickets

2021-11-27 Thread Ismaël Mejía
Done, I assigned the issue to you too. Welcome to Beam!

On Sat, Nov 27, 2021 at 12:53 AM Alexander Dahl  wrote:
>
> Hi,
>
> My name is Alex, working as a data engineer at ICA, a large Swedish retailer. 
> At work I'm writing beam code for Google Cloud Dataflow jobs. I want to 
> contribute to the beam project :)
>
> I saw this very small website issue that I can submit a fix for: 
> https://issues.apache.org/jira/browse/BEAM-11943
>
> My Jira username: aleda145
>
> Thanks,
> Alex
>


Re: Performance tests dashboard not working

2021-07-29 Thread Ismaël Mejía
Hi, I came back to check something in the dashboards and they seem to
be failing with an error:

"Templating init failed
NetworkError when attempting to fetch resource."

Can somebody with more access to / knowledge of that infra help me check
what is going on?

Thanks

On Wed, Apr 21, 2021 at 9:41 PM Ismaël Mejía  wrote:
>
> Seems to be a networking issue on my side: they fail on Firefox with
> some weird timeout but they work perfectly on Chrome.
> Thanks for confirming, Andrew
>
> On Wed, Apr 21, 2021 at 6:45 PM Andrew Pilloud  wrote:
> >
> > Looks like it is working now?
> >
> > On Wed, Apr 21, 2021 at 7:34 AM Ismaël Mejía  wrote:
> >>
> >> Following the conversation on the performance regression on Flink
> >> runner I wanted to take a look at the performance dashboards (Nexmark
> >> + Load Tests) but when I open the dashboards it says there is a
> >> connectivity error "NetworkError when attempting to fetch resource.".
> >> Can someone with more knowledge about our CI / dashboards infra please
> >> take a look.
> >>
> >> http://104.154.241.245/d/ahudA_zGz/nexmark?orgId=1


Re: [PROPOSAL] Vendored gRPC 1.36.0 0.2 Release

2021-06-30 Thread Ismaël Mejía
+1 I just merged Tomo's security fix so this should be ready to go

On Wed, Jun 30, 2021 at 12:26 AM Luke Cwik  wrote:
>
> Sounds good, will wait for you PR.
>
> On Tue, Jun 29, 2021 at 2:35 PM Tomo Suzuki  wrote:
>>
>> I have this https://issues.apache.org/jira/browse/BEAM-12422 "Vendored gRPC 
>> 1.36.0 is using a log4j version with security issues".
>> Let me try to fix that.
>>
>> On Tue, Jun 29, 2021 at 3:57 PM Luke Cwik  wrote:
>>>
>>> Yes, but as I suggested we should make the test mandatory if you're using 
>>> Linux to catch the breakage in the future.
>>>
>>> On Tue, Jun 29, 2021 at 12:22 PM Kenneth Knowles  wrote:

 Sounds good. I imagine the tests are set up that way so that users can run 
 tests locally on platforms that do not support epoll. Will the change in 
 shaded deps preserve this?

 Kenn

 On Tue, Jun 29, 2021 at 12:10 PM Luke Cwik  wrote:
>
> I noticed that at some point in time we broke support for epoll (and unix 
> domain sockets) by not relocating and renaming the linux-x86_64 variant.
>
> I filed BEAM-12547[1] and was able to fix the shading to include the 
> correctly renamed epoll JNI library[2]. I was looking to perform a 
> vendored gRPC 1.36.0 release and follow-up by fixing our existing test to 
> catch the breakage by requiring ManagedChannelFactoryTest to run if the 
> architecture is Linux instead of when Epoll is available[3].
>
> Any concerns on the release?
>
> 1: https://issues.apache.org/jira/browse/BEAM-12547
> 2: https://github.com/apache/beam/pull/15091
> 3: 
> https://github.com/apache/beam/blob/5fffad689e3fa7b6f788f3037a02fb215c1864ec/sdks/java/fn-execution/src/test/java/org/apache/beam/sdk/fn/channel/ManagedChannelFactoryTest.java#L48
>>
>>
>>
>> --
>> Regards,
>> Tomo


Re: Beam Contributor Request

2021-06-29 Thread Ismaël Mejía
Hello Jack,

You were added to the contributors group so you can self-assign tickets.
Welcome to Beam!

Ismaël


On Mon, Jun 28, 2021 at 9:35 PM Jack McCluskey 
wrote:

> Hey everyone,
>
> My name is Jack McCluskey and I'd like to be added to the contributor
> list. My Jira username is 'jrmccluskey'. I'll be working on the Go side of
> things.
>
> Thanks,
>
> Jack
>
> --
>
>
> Jack McCluskey
> SWE - DataPLS PLAT/ Beam Go
> RDU
> jrmcclus...@gmail.com
>
>
>


Re: Apache Beam Contributor List Request

2021-06-29 Thread Ismaël Mejía
Hello Marco,

You were added to the contributor group so you can now self-assign JIRA tickets.

Welcome to Beam!

Ismaël

On Mon, Jun 28, 2021 at 6:58 PM Marco Robles Pulido
 wrote:
>
> Hi team,
>
> This is Marco and I would like to be added to the Apache Beam contributor 
> list, my username is 'marroble'.
>
>
> Thanks!
>
> This email and its contents (including any attachments) are being sent to
> you on the condition of confidentiality and may be protected by legal
> privilege. Access to this email by anyone other than the intended recipient
> is unauthorized. If you are not the intended recipient, please immediately
> notify the sender by replying to this message and delete the material
> immediately from your system. Any further use, dissemination, distribution
> or reproduction of this email is strictly prohibited. Further, no
> representation is made with respect to any content contained in this email.


Re: [Proposal] Go SDK Exits Experimental

2021-06-17 Thread Ismaël Mejía
Oops, forgot to ask one question: will this come with revamped
website instructions/docs for golang too?

On Thu, Jun 17, 2021 at 3:21 PM Ismaël Mejía  wrote:
>
> Huge +1
>
> This is definitely something many people have asked about, so it is
> great to see it finally happening.
>
> On Wed, Jun 16, 2021 at 7:56 PM Kenneth Knowles  wrote:
> >
> > +1 awesome
> >
> > On Wed, Jun 16, 2021 at 10:33 AM Robert Burke  wrote:
> >>
> >> Sounds reasonable to me. I agree. We'll aim to get those (Go modules and 
> >> LICENSE issue) done before the 2.32 cut, and certainly before the 2.33 cut 
> >> if release images aren't added to the 2.32 process.
> >>
> >> Regarding Go Generics: at some point in the future, we may want a harder 
> >> break between a newer Generic first API and and the current version, but 
> >> there's no rush. Generics/TypeParameters in Go aren't identical to the 
> >> feature referred to by that term in Java, C++, Rust, etc, so it'll take a 
> >> bit of time for that expertise to develop.
> >>
> >> However, by the current nature of Go, we had to have pretty sophisticated 
> >> reflective analysis to handle DoFns and map them to their graph inputs. 
> >> So, adding new helpers like a KV, emitter, and Iterator types, shouldn't 
> >> be too difficult. Changing Go SDK internals to use generics (like the 
> >> implementation of Stats DoFns like Min, Max, etc) would also be able to be 
> >> made transparently to most users, and certainly any of the framework for 
> >> execution time handling (the "worker's SDK harness") would be able to be 
> >> cleaned up if need be. Finally, adding more sophisticated DoFn 
> >> registration and code generation would be able to replace the optional 
> >> code generator entirely, saving some users a `go generate` step, 
> >> simplifying getting improved execution performance.
> >>
> >> Changing things like making a Type Parameterized PCollection, would be far 
> >> more involved, as would trying to use some kind of Apply format. The lack 
> >> of Method Overrides prevents the apply chaining approach. Or at least 
> >> prevents it from working simply.
> >>
> >> Finally, Go Generics won't be available until Go 1.18, which isn't until 
> >> next year. See https://blog.golang.org/generics-proposal for details.
> >>
> >> Go 1.17 https://tip.golang.org/doc/go1.17 does include a Register calling 
> >> convention, leading to a modest performance improvement across the board.
> >>
> >> Cheers,
> >> Robert Burke
> >>
> >> On 2021/06/15 18:10:46, Robert Bradshaw  wrote:
> >> > +1 to declaring Golang support out of experimental once the Go Modules
> >> > issues are solved. I don't think an SDK needs to support every feature
> >> > to be accepted, especially now that we can do cross-language
> >> > transforms, and Go definitely supports enough to be quite useful. (WRT
> >> > streaming, my understanding is that Go supports the streaming model
> >> > with windows and timestamps, and runs fine on a streaming runner, even
> >> > if more advanced features like state and timers aren't yet available.)
> >> >
> >> > This is a great milestone.
> >> >
> >> > On Tue, Jun 15, 2021 at 10:12 AM Tyson Hamilton  
> >> > wrote:
> >> > >
> >> > > WOW! Big news.
> >> > >
> >> > > I'm supportive of leaving experimental status after Go Modules are 
> >> > > completed and the LICENSE issue is resolved. I don't think that 
> >> > > lacking streaming support is a blocker. The other thing I checked to 
> >> > > see was if there were metrics available on metrics.beam.apache.org, 
> >> > > specifically for measuring code health via post-commit over time, 
> >> > > which there are and the passing test rate is high (Huzzah!). The one 
> >> > > thing that surprised me from your summary is that when Go introduces 
> >> > > generics it won't result in any backwards incompatible changes in 
> >> > > Apache Beam. That's great news, but does it mean there will be a need 
> >> > > to support both non-generic and generic APIs moving forward? It seems 
> >> > > like generics will be introduced in the Go 1.17 release 
> >> > > (optimistically) in August this year.
> >> > >

Re: [Proposal] Go SDK Exits Experimental

2021-06-17 Thread Ismaël Mejía
Huge +1

This is definitely something many people have asked about, so it is
great to see it finally happening.

On Wed, Jun 16, 2021 at 7:56 PM Kenneth Knowles  wrote:
>
> +1 awesome
>
> On Wed, Jun 16, 2021 at 10:33 AM Robert Burke  wrote:
>>
>> Sounds reasonable to me. I agree. We'll aim to get those (Go modules and 
>> LICENSE issue) done before the 2.32 cut, and certainly before the 2.33 cut 
>> if release images aren't added to the 2.32 process.
>>
>> Regarding Go Generics: at some point in the future, we may want a harder 
>> break between a newer Generic first API and and the current version, but 
>> there's no rush. Generics/TypeParameters in Go aren't identical to the 
>> feature referred to by that term in Java, C++, Rust, etc, so it'll take a 
>> bit of time for that expertise to develop.
>>
>> However, by the current nature of Go, we had to have pretty sophisticated 
>> reflective analysis to handle DoFns and map them to their graph inputs. So, 
>> adding new helpers like a KV, emitter, and Iterator types, shouldn't be too 
>> difficult. Changing Go SDK internals to use generics (like the 
>> implementation of Stats DoFns like Min, Max, etc) would also be able to be 
>> made transparently to most users, and certainly any of the framework for 
>> execution time handling (the "worker's SDK harness") would be able to be 
>> cleaned up if need be. Finally, adding more sophisticated DoFn registration 
>> and code generation would be able to replace the optional code generator 
>> entirely, saving some users a `go generate` step, simplifying getting 
>> improved execution performance.
>>
>> Changing things like making a Type Parameterized PCollection, would be far 
>> more involved, as would trying to use some kind of Apply format. The lack of 
>> Method Overrides prevents the apply chaining approach. Or at least prevents 
>> it from working simply.
>>
>> Finally, Go Generics won't be available until Go 1.18, which isn't until 
>> next year. See https://blog.golang.org/generics-proposal for details.
>>
>> Go 1.17 https://tip.golang.org/doc/go1.17 does include a Register calling 
>> convention, leading to a modest performance improvement across the board.
>>
>> Cheers,
>> Robert Burke
>>
>> On 2021/06/15 18:10:46, Robert Bradshaw  wrote:
>> > +1 to declaring Golang support out of experimental once the Go Modules
>> > issues are solved. I don't think an SDK needs to support every feature
>> > to be accepted, especially now that we can do cross-language
>> > transforms, and Go definitely supports enough to be quite useful. (WRT
>> > streaming, my understanding is that Go supports the streaming model
>> > with windows and timestamps, and runs fine on a streaming runner, even
>> > if more advanced features like state and timers aren't yet available.)
>> >
>> > This is a great milestone.
>> >
>> > On Tue, Jun 15, 2021 at 10:12 AM Tyson Hamilton  wrote:
>> > >
>> > > WOW! Big news.
>> > >
>> > > I'm supportive of leaving experimental status after Go Modules are 
>> > > completed and the LICENSE issue is resolved. I don't think that lacking 
>> > > streaming support is a blocker. The other thing I checked to see was if 
>> > > there were metrics available on metrics.beam.apache.org, specifically 
>> > > for measuring code health via post-commit over time, which there are and 
>> > > the passing test rate is high (Huzzah!). The one thing that surprised me 
>> > > from your summary is that when Go introduces generics it won't result in 
>> > > any backwards incompatible changes in Apache Beam. That's great news, 
>> > > but does it mean there will be a need to support both non-generic and 
>> > > generic APIs moving forward? It seems like generics will be introduced 
>> > > in the Go 1.17 release (optimistically) in August this year.
>> > >
>> > >
>> > >
>> > > On Thu, Jun 10, 2021 at 5:04 PM Robert Burke  wrote:
>> > >>
>> > >> Hello Beam Community!
>> > >>
>> > >> I propose we stop calling the Apache Beam Go SDK experimental.
>> > >>
>> > >> This thread is to discuss it as a community, and any conditions that 
>> > >> remain that would prevent the exit.
>> > >>
>> > >> tl;dr;
>> > >> Ask Questions for answers and links! I have both.
>> > >> This entails including it officially in the Release process, removing 
>> > >> the various "experimental" text throughout the repo etc,
>> > >> and otherwise treating it like Python and Java. Some Go specific tasks 
>> > >> around dep versioning.
>> > >>
>> > >> The Go SDK implements the beam model efficiently for most batch tasks, 
>> > >> including basic windowing.
>> > >> Apache Beam Go jobs can execute, and are tested on all Portable runners.
>> > >> The core APIs are not going to change in incompatible ways going 
>> > >> forward.
>> > >> Scalable transforms can be written through SplittableDoFns or via Cross 
>> > >> Language transforms.
>> > >>
>> > >> The SDK isn't 100% feature complete, but keeping it experimental 
>> > >> doesn't help with that any further.
>> 

Re: Multiple architectures support on Beam (ARM)

2021-06-10 Thread Ismaël Mejía
As a follow up on this with the merge of
https://github.com/apache/beam/pull/14832 Beam will be producing python
wheels for AARCH64 starting on Beam 2.32.0!
Also due to the recent version updates (grpc, protobuf and arrow) we should
be pretty close to fully support it without extra compilation.
Seems like the only missing piece is cython
https://github.com/cython/cython/issues/3892

Now the next important step would be to make the docker images multi-arch.
That would be a great contribution if someone is motivated.
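
For anyone wanting to check which of the affected architectures their
environment actually reports (the value that ultimately drives wheel
selection), here is a quick standard-library sketch — nothing Beam-specific:

```python
import platform

# The machine architecture as reported by the interpreter: typically
# "aarch64" on ARM64 Linux, "arm64" on Apple Silicon macOS, and
# "x86_64"/"AMD64" elsewhere. pip uses this (via platform tags) when
# deciding whether a prebuilt wheel matches or a source build is needed.
print(platform.machine(), platform.system())
```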


On Thu, Jan 28, 2021 at 1:47 AM Robert Bradshaw  wrote:

> Cython supports ARM64. The issue here is that we don't have a C++ compiler
> (It's looking for 'cc') available in the container (and grpc, and possibly
> others, don't have wheel files for this platform). I wonder if apt-get
> install build-essential would be sufficient.
>
> On Wed, Jan 27, 2021 at 2:22 PM Ismaël Mejía  wrote:
>
>> Nice to see the interest, I also suppose that devs on Apple macbooks with
>> the
>> new M1 processor will soon request this feature.
>>
>> I ran today some pipelines on ARM64 on classic runners relatively easy
>> which was expected.  We will have issues however for the Java 8 SDK
>> harness
>> because the parent image openjdk:8 is not supported yet for ARM64.
>>
>> I tried to setup a python dev environment and found the first issue. It
>> looks
>> like gRPC does not support arm64 yet [1][2] or am I misreading it?
>>
>> $ pip install -r build-requirements.txt
>>
>> Collecting grpcio-tools==1.30.0
>>   Downloading grpcio-tools-1.30.0.tar.gz (2.1 MB)
>>  || 2.1 MB 21.7 MB/s
>> ERROR: Command errored out with exit status 1:
>>  command: /home/ubuntu/.virtualenvs/beam-dev/bin/python3 -c
>> 'import sys, setuptools, tokenize; sys.argv[0] =
>>
>> '"'"'/tmp/pip-install-3lhad2qc/grpcio-tools_d3562157df5c41db9110e4ccd165c87e/setup.py'"'"';
>>
>> __file__='"'"'/tmp/pip-install-3lhad2qc/grpcio-tools_d3562157df5c41db9110e4ccd165c87e/setup.py'"'"';f=getattr(tokenize,
>> '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"',
>> '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))'
>> egg_info --egg-base /tmp/pip-pip-egg-info-km8agjf4
>>  cwd:
>> /tmp/pip-install-3lhad2qc/grpcio-tools_d3562157df5c41db9110e4ccd165c87e/
>> Complete output (11 lines):
>> Traceback (most recent call last):
>>   File "<string>", line 1, in <module>
>>   File
>> "/tmp/pip-install-3lhad2qc/grpcio-tools_d3562157df5c41db9110e4ccd165c87e/setup.py",
>> line 112, in <module>
>> if check_linker_need_libatomic():
>>   File
>> "/tmp/pip-install-3lhad2qc/grpcio-tools_d3562157df5c41db9110e4ccd165c87e/setup.py",
>> line 73, in check_linker_need_libatomic
>> cc_test = subprocess.Popen(['cc', '-x', 'c++', '-std=c++11', '-'],
>>   File "/usr/lib/python3.8/subprocess.py", line 854, in __init__
>> self._execute_child(args, executable, preexec_fn, close_fds,
>>   File "/usr/lib/python3.8/subprocess.py", line 1702, in
>> _execute_child
>> raise child_exception_type(errno_num, err_msg, err_filename)
>> FileNotFoundError: [Errno 2] No such file or directory: 'cc'
>> 
>> WARNING: Discarding
>>
>> https://files.pythonhosted.org/packages/da/3c/bed275484f6cc262b5de6ceaae36798c60d7904cdd05dc79cc830b880687/grpcio-tools-1.30.0.tar.gz#sha256=7878adb93b0c1941eb2e0bed60719f38cda2ae5568bc0bcaa701f457e719a329
>> (from https://pypi.org/simple/grpcio-tools/). Command errored out with
>> exit status 1: python setup.py egg_info Check the logs for full
>> command output.
>> ERROR: Could not find a version that satisfies the requirement
>> grpcio-tools==1.30.0
>> ERROR: No matching distribution found for grpcio-tools==1.30.0
>>
>> [1] https://pypi.org/project/grpcio-tools/#files
>> [2] https://github.com/grpc/grpc/issues/21283
>>
>> I can imagine also that we will have some struggles with the python
>> harness
>> and all of its dependencies. Does cython already support ARM64?
>>
>> I went and filled some JIRAs to keep track of this:
>>
>> BEAM-11703 Support apa

Re: contributor permission for Beam Jira tickets

2021-06-10 Thread Ismaël Mejía
Hello Pascal,

I added you as a contributor, so you can now self-assign issues if you
want. I assigned BEAM-12471 to you since I saw you opened a PR to fix it.

Best,
Ismaël


On Wed, Jun 9, 2021 at 11:05 PM Pascal Gillet 
wrote:

> Hi,
>
> This is Pascal. I identified some little but nonetheless annoying bugs in
> Beam. Can someone add me as a contributor for Beam's Jira issue
> tracker? I would like to assign tickets to myself.
>
> My JIRA login: pgillet
>
>
> Thanks,
> Pascal
>


Re: Beam SNAPSHOTS not working since friday

2021-06-08 Thread Ismaël Mejía
Just to close this thread: I double-checked and the SNAPSHOTs are published
correctly now.

On Tue, Jun 8, 2021 at 5:31 PM Ismaël Mejía  wrote:

> Thanks for clarifying Brian. So we shall wait for Infra then
>
> On Tue, Jun 8, 2021, 5:15 PM Brian Hulette  wrote:
>
>> You may have already made this connection, but this is likely the same
>> issue discussed in [1][2].
>>
>> [1]
>> https://lists.apache.org/thread.html/r658cdfa643c44a3fa18c226238e537ad221c8f65337f0eab3ad6dad9%40%3Cdev.beam.apache.org%3E
>> [2] https://issues.apache.org/jira/browse/INFRA-21976
>>
>> On Tue, Jun 8, 2021 at 12:53 AM Ismaël Mejía  wrote:
>>
>>> While trying to check on the new 2.32.0-SNAPSHOTs this morning I noticed
>>> that the daily SNAPSHOTs have not been updating since last friday:
>>>
>>> https://ci-beam.apache.org/job/beam_Release_NightlySnapshot/
>>>
>>> https://repository.apache.org/content/groups/snapshots/org/apache/beam/beam-sdks-java-core/
>>>
>>> Can someone please check what is going on and kick the daily generation.
>>> I am also not able to log in into the ci-beam server, not sure if
>>> related.
>>>
>>> Regards,
>>> Ismaël
>>>
>>>


Re: Beam SNAPSHOTS not working since friday

2021-06-08 Thread Ismaël Mejía
Thanks for clarifying, Brian. So we shall wait for Infra then.

On Tue, Jun 8, 2021, 5:15 PM Brian Hulette  wrote:

> You may have already made this connection, but this is likely the same
> issue discussed in [1][2].
>
> [1]
> https://lists.apache.org/thread.html/r658cdfa643c44a3fa18c226238e537ad221c8f65337f0eab3ad6dad9%40%3Cdev.beam.apache.org%3E
> [2] https://issues.apache.org/jira/browse/INFRA-21976
>
> On Tue, Jun 8, 2021 at 12:53 AM Ismaël Mejía  wrote:
>
>> While trying to check on the new 2.32.0-SNAPSHOTs this morning I noticed
>> that the daily SNAPSHOTs have not been updating since last friday:
>>
>> https://ci-beam.apache.org/job/beam_Release_NightlySnapshot/
>>
>> https://repository.apache.org/content/groups/snapshots/org/apache/beam/beam-sdks-java-core/
>>
>> Can someone please check what is going on and kick the daily generation.
>> I am also not able to log in into the ci-beam server, not sure if related.
>>
>> Regards,
>> Ismaël
>>
>>


Beam SNAPSHOTS not working since friday

2021-06-08 Thread Ismaël Mejía
While trying to check on the new 2.32.0-SNAPSHOTs this morning, I noticed
that the daily SNAPSHOTs have not been updated since last Friday:

https://ci-beam.apache.org/job/beam_Release_NightlySnapshot/
https://repository.apache.org/content/groups/snapshots/org/apache/beam/beam-sdks-java-core/

Can someone please check what is going on and kick off the daily generation?
I am also not able to log in to the ci-beam server, not sure if it is related.

Regards,
Ismaël


[DISCUSS] Drop support for Flink 1.10

2021-05-28 Thread Ismaël Mejía
Hello,

With Beam support for Flink 1.13 just merged, it is time to discuss the end
of support for Flink 1.10, following the agreed policy of supporting only the
latest three Flink releases [1].

I would like to propose that, starting with Beam 2.31.0, we stop supporting
Flink 1.10 [2].
I prepared a PR for this [3], but of course I wanted to bring the subject
here (and to user@) for your attention, in case someone has a different
opinion or a reason to keep supporting the older version.

WDYT?

Regards,
Ismael

[1]
https://lists.apache.org/thread.html/rfb5ac9d889d0e3f4400471de3c25000a15352bde879622c899d97581%40%3Cdev.beam.apache.org%3E
[2] https://issues.apache.org/jira/browse/BEAM-12281
[3] https://github.com/apache/beam/pull/14906


Re: [VOTE] Vendored Dependencies Release Byte Buddy 1.11.0

2021-05-20 Thread Ismaël Mejía
I'm happy to announce that we have unanimously approved this release.

There are 7 approving votes, 4 of which are binding:
* Pablo Estrada
* Etienne Chauchot
* Jean-Baptiste Onofre
* Ismaël Mejía

There are no disapproving votes.

Thanks everyone!


On Thu, May 20, 2021 at 9:17 PM Ismaël Mejía  wrote:

>
> +1 (binding)
>


Re: [VOTE] Vendored Dependencies Release Byte Buddy 1.11.0

2021-05-20 Thread Ismaël Mejía
+1 (binding)


Re: [VOTE] Vendored Dependencies Release Byte Buddy 1.11.0

2021-05-19 Thread Ismaël Mejía
This release is only to publish the vendored dependency artifacts. We need
those to integrate it and be able to verify whether it causes problems.
The PR for this is already open, but it needs the artifacts from this vote
to be run:
https://github.com/apache/beam/pull/14824

For reference, there is a document on how to release and validate releases of
Beam's vendored dependencies that can be handy for anyone wishing to help
validate:
https://s.apache.org/beam-release-vendored-artifacts

On Wed, May 19, 2021 at 8:45 PM Tyson Hamilton  wrote:

> I'd like to help, but I don't know how to determine whether this upgrade
> is going to cause problems or not. Are there tests I should look at, or
> some validation I should perform?
>
> On Wed, May 19, 2021 at 11:29 AM Ismaël Mejía  wrote:
>
>> Kind reminder, the vote is ongoing
>>
>> On Mon, May 17, 2021 at 5:32 PM Ismaël Mejía  wrote:
>>
>>> Please review the release of the following artifacts that we vendor:
>>>  * beam-vendor-bytebuddy-1_11_0
>>>
>>> Hi everyone,
>>> Please review and vote on the release candidate #1 for the version 0.1,
>>> as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>> The complete staging area is available for your review, which includes:
>>> * the official Apache source release to be deployed to dist.apache.org
>>> [1], which is signed with the key with fingerprint
>>> 3415631729E15B33051ADB670A9DAF6713B86349 [2],
>>> * all artifacts to be deployed to the Maven Central Repository [3],
>>> * commit hash "d93c591deb21237ddb656583d7ef7a4debba" [4],
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> Release Manager
>>>
>>> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
>>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [3]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1166/
>>> [4]
>>> https://github.com/apache/beam/commit/d93c591deb21237ddb656583d7ef7a4debba
>>>
>>>


Re: [VOTE] Vendored Dependencies Release Byte Buddy 1.11.0

2021-05-19 Thread Ismaël Mejía
Kind reminder, the vote is ongoing

On Mon, May 17, 2021 at 5:32 PM Ismaël Mejía  wrote:

> Please review the release of the following artifacts that we vendor:
>  * beam-vendor-bytebuddy-1_11_0
>
> Hi everyone,
> Please review and vote on the release candidate #1 for the version 0.1, as
> follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
> * the official Apache source release to be deployed to dist.apache.org
> [1], which is signed with the key with fingerprint
> 3415631729E15B33051ADB670A9DAF6713B86349 [2],
> * all artifacts to be deployed to the Maven Central Repository [3],
> * commit hash "d93c591deb21237ddb656583d7ef7a4debba" [4],
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Release Manager
>
> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
> [3] https://repository.apache.org/content/repositories/orgapachebeam-1166/
> [4]
> https://github.com/apache/beam/commit/d93c591deb21237ddb656583d7ef7a4debba
>
>


[VOTE] Vendored Dependencies Release Byte Buddy 1.11.0

2021-05-17 Thread Ismaël Mejía
Please review the release of the following artifacts that we vendor:
 * beam-vendor-bytebuddy-1_11_0

Hi everyone,
Please review and vote on the release candidate #1 for the version 0.1, as
follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* the official Apache source release to be deployed to dist.apache.org [1],
which is signed with the key with fingerprint
3415631729E15B33051ADB670A9DAF6713B86349 [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* commit hash "d93c591deb21237ddb656583d7ef7a4debba" [4],

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Release Manager

[1] https://dist.apache.org/repos/dist/dev/beam/vendor/
[2] https://dist.apache.org/repos/dist/release/beam/KEYS
[3] https://repository.apache.org/content/repositories/orgapachebeam-1166/
[4]
https://github.com/apache/beam/commit/d93c591deb21237ddb656583d7ef7a4debba


Re: [PROPOSAL] Vendored bytebuddy dependency release

2021-05-17 Thread Ismaël Mejía
Now that the PR is merged and we seem to have consensus, modulo attention to
possible performance issues when integrated, I will start the vote.

On Sat, May 15, 2021 at 3:53 AM Reuven Lax  wrote:

> Microbenchmarks are tough for these kinds of changes. In the past, we've had
> changes that increased the time it took to generate bytecode. While this has
> minimal impact on real pipelines (since bytecode is generated on worker
> startup), it has an outsized impact on microbenchmark run time.
>
> On Wed, May 12, 2021 at 5:55 AM Ismaël Mejía  wrote:
>
>> Testing this particular kind of PR for perf would be tricky; I think the
>> easiest thing to notice is whether the runtime of the CI tests differs a lot.
>> I really don't think the generated bytecode with the new version would
>> differ much, but it is for sure something we should pay attention to.
>> And in the worst case, reversing the upgrade should not be that
>> difficult given Beam's well-confined dependency on bytebuddy.
>>
>> Other ideas/comments?
>>
>>
>> On Mon, May 10, 2021 at 7:16 PM Reuven Lax  wrote:
>>
>>> What's the best way to test a PR for perf?
>>>
>>> On Mon, May 10, 2021 at 8:59 AM Kenneth Knowles  wrote:
>>>
>>>> If nothing breaks, and we check perf, then absolutely this seems good.
>>>>
>>>> Kenn
>>>>
>>>> On Mon, May 10, 2021 at 12:38 AM Ismaël Mejía 
>>>> wrote:
>>>>
>>>>> Most issues on the previous migration were related to changes in
>>>>> class-loading behavior on Java 11. It seems Oracle is taking a more
>>>>> backwards-compatible approach in its latest releases, so let's hope everything will go
>>>>> well. In the meantime I tested the upgrade locally and tests are passing 
>>>>> ok
>>>>> so we should be good to go. I opened a PR [1] for the version upgrade and
>>>>> assuming consensus on this proposal I expect we can pass to vote soon.
>>>>>
>>>>> [1] https://github.com/apache/beam/pull/14766
>>>>>
>>>>>
>>>>> On Sun, May 9, 2021 at 6:13 PM Reuven Lax  wrote:
>>>>>
>>>>>> We've had some issues in the past with semantic changes in ByteBuddy
>>>>>> (I think related to new Java versions) that required rewriting code in
>>>>>> Beam.
>>>>>>
>>>>>> On Sat, May 8, 2021 at 10:46 PM Ismaël Mejía 
>>>>>> wrote:
>>>>>>
>>>>>>> What were the issues last time Reuven? I remember that the release
>>>>>>> and upgrade PR were pretty smooth, were there unintended consequences 
>>>>>>> from
>>>>>>> the library changes themselves?
>>>>>>>
>>>>>>>
>>>>>>> On Sun, May 9, 2021 at 12:36 AM Reuven Lax  wrote:
>>>>>>>
>>>>>>>> Sounds good. Based on previous experience though, this might be a
>>>>>>>> difficult upgrade to do.
>>>>>>>>
>>>>>>>> On Sat, May 8, 2021 at 12:57 AM Ismaël Mejía 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The version of bytebuddy Beam is vendoring (1.10.8) is already 16
>>>>>>>>> months old and
>>>>>>>>> it is not compatible with more recent versions of Java. I would
>>>>>>>>> like to propose
>>>>>>>>> that we upgrade it [1] to the most recent version (1.11.0) [2] so
>>>>>>>>> we can benefit
>>>>>>>>> of the latest improvements for Java 16/17 and upgraded ASM.
>>>>>>>>>
>>>>>>>>> If everyone agrees I would like to volunteer as the release
>>>>>>>>> manager for this
>>>>>>>>> upgrade.
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-12241
>>>>>>>>> [2]
>>>>>>>>> https://github.com/raphw/byte-buddy/blob/master/release-notes.md
>>>>>>>>>
>>>>>>>>>


Github Actions requires to approve the CI jobs for new contributors

2021-05-17 Thread Ismaël Mejía
For awareness, GitHub Actions now requires approving the CI runs for new
contributors, so this is a call for committers to pay attention to this and
help when they see new users' PRs so those PRs don't just sit waiting without
running.

https://github.blog/changelog/2021-04-22-github-actions-maintainers-must-approve-first-time-contributor-workflow-runs/


Is implementing DisplayData on Beam Transforms worth?

2021-05-12 Thread Ismaël Mejía
Running a pipeline on Dataflow I noticed the 'display data' of ParquetIO
was not shown in the Dataflow UI. After digging deeper I found that
composite transforms are not shown on Dataflow.

BEAM-366 Support Display Data on Composite Transforms
https://issues.apache.org/jira/browse/BEAM-366

I also noticed that for primitive transforms what is shown is not the
populateDisplayData code extended from PTransform but the
populateDisplayData method code implemented at the parametrizing function
level, concretely the DoFn or Source for the case of IOs.

This of course surprised me because we have been implementing all these
methods in the wrong place (at the PTransform level) for years, ignoring
the function level, so they are not shown in the UI. So I was wondering:

1. Does Google plan to support displaying composite transforms (BEAM-366)
at some point?

2. If (1) is not happening soon, shall we refine all our
populateDisplayData implementations to be done only at the Function level
(DoFn, Source, WindowFn)?

Since Open Source runners (Flink, Spark, etc) do not use DisplayData at all
I suppose we should keep this discussion at the Dataflow level only at this
time.

I don't know how this is modeled for Portable Pipelines: is DisplayData part of
FunctionSpec to support the current use case? I saw that DisplayData is
considered at the PTransform level so this should cover the Composite case,
so I am curious if we are considering the parametrized function level
currently in use correctly for Portable pipelines.


Re: [PROPOSAL] Vendored bytebuddy dependency release

2021-05-12 Thread Ismaël Mejía
Testing this particular kind of PR for perf would be tricky; I think the
easiest thing to notice is whether the runtime of the CI tests differs a lot.
I really don't think the generated bytecode with the new version would
differ much, but it is for sure something we should pay attention to.
And in the worst case, reversing the upgrade should not be that difficult
given Beam's well-confined dependency on bytebuddy.

Other ideas/comments?


On Mon, May 10, 2021 at 7:16 PM Reuven Lax  wrote:

> What's the best way to test a PR for perf?
>
> On Mon, May 10, 2021 at 8:59 AM Kenneth Knowles  wrote:
>
>> If nothing breaks, and we check perf, then absolutely this seems good.
>>
>> Kenn
>>
>> On Mon, May 10, 2021 at 12:38 AM Ismaël Mejía  wrote:
>>
>>> Most issues on the previous migration were related to changes in
>>> class-loading behavior on Java 11. It seems Oracle is taking a more
>>> backwards-compatible approach in its latest releases, so let's hope everything will go
>>> well. In the meantime I tested the upgrade locally and tests are passing ok
>>> so we should be good to go. I opened a PR [1] for the version upgrade and
>>> assuming consensus on this proposal I expect we can pass to vote soon.
>>>
>>> [1] https://github.com/apache/beam/pull/14766
>>>
>>>
>>> On Sun, May 9, 2021 at 6:13 PM Reuven Lax  wrote:
>>>
>>>> We've had some issues in the past with semantic changes in ByteBuddy (I
>>>> think related to new Java versions) that required rewriting code in Beam.
>>>>
>>>> On Sat, May 8, 2021 at 10:46 PM Ismaël Mejía  wrote:
>>>>
>>>>> What were the issues last time Reuven? I remember that the release and
>>>>> upgrade PR were pretty smooth, were there unintended consequences from the
>>>>> library changes themselves?
>>>>>
>>>>>
>>>>> On Sun, May 9, 2021 at 12:36 AM Reuven Lax  wrote:
>>>>>
>>>>>> Sounds good. Based on previous experience though, this might be a
>>>>>> difficult upgrade to do.
>>>>>>
>>>>>> On Sat, May 8, 2021 at 12:57 AM Ismaël Mejía 
>>>>>> wrote:
>>>>>>
>>>>>>> The version of bytebuddy Beam is vendoring (1.10.8) is already 16
>>>>>>> months old and
>>>>>>> it is not compatible with more recent versions of Java. I would like
>>>>>>> to propose
>>>>>>> that we upgrade it [1] to the most recent version (1.11.0) [2] so we
>>>>>>> can benefit
>>>>>>> of the latest improvements for Java 16/17 and upgraded ASM.
>>>>>>>
>>>>>>> If everyone agrees I would like to volunteer as the release manager
>>>>>>> for this
>>>>>>> upgrade.
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-12241
>>>>>>> [2] https://github.com/raphw/byte-buddy/blob/master/release-notes.md
>>>>>>>
>>>>>>>


Re: [DISCUSS] Enable automatic dependency updates with Github's dependabot

2021-05-12 Thread Ismaël Mejía
My apologies Brian, I had not seen your question:

> - Will dependabot work ok with the version ranges that we specify? For
example some Python dependencies have upper bounds for the next major
version, some for the next minor version. Is dependabot smart enough to try
bumping the appropriate version number?

Yes, it does, and we can also explicitly set it to ignore certain versions,
or all versions, for each dependency if we don't want any upgrade PRs for it.

As a follow up on this I received an email from my Beam fork this morning
reporting a CVE issue on one of the website dependencies (it is a moderate
issue since this is a dep for the website generation, so it won't affect
Beam users), but it is a clear example of the utility of dependabot.

So the question is how do we proceed? Do I contact INFRA to enable it for
the main repo? and more concretely how do we deal with these PRs in a
practical sense? Do we rename them and create an associated JIRA for
tracking?

Other opinions?

Ismaël
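
To make the "ignore certain versions" idea above concrete, a hypothetical
`.github/dependabot.yml` fragment could look like the sketch below. This is
illustration only: the directories, ecosystems, and the Hadoop/Netty ignore
entries are assumptions for the example, not Beam's actual configuration.

```yaml
# Hypothetical dependabot configuration sketch; directories and ignore
# rules are illustrative assumptions, not Beam's real setup.
version: 2
updates:
  - package-ecosystem: "gradle"
    directory: "/"
    schedule:
      interval: "weekly"
    ignore:
      # Skip all upgrade PRs for deliberately pinned dependencies.
      - dependency-name: "org.apache.hadoop:*"
      # Or skip only specific version ranges for a dependency.
      - dependency-name: "io.netty:*"
        versions: ["5.x"]
  - package-ecosystem: "pip"
    directory: "/sdks/python"
    schedule:
      interval: "weekly"
```

With such a file committed, dependabot would open upgrade PRs on its schedule
while leaving the listed dependencies (or version ranges) alone.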



On Fri, Apr 16, 2021 at 5:36 PM Brian Hulette  wrote:

> Yeah I can see the advantage in tooling like this for easy upgrades. I
> suspect many of the outdated Python dependencies fall under this category,
> but the toil of creating a PR and verifying it passes tests is enough of a
> barrier that we just haven't done it. Having a bot create the PR and
> trigger CI to verify it would be helpful IMO.
>
> Some questions/concerns I have:
> - I think many python upgrades will still require manual work:
>   - We also have pinned versions for some Python dependencies in
> base_image_requirements.txt [1]
>   - We test with multiple major versions of pyarrow. We'd want to add a
> new test environment [2] when bumping to the next major version
> - Will dependabot work ok with the version ranges that we specify? For
> example some Python dependencies have upper bounds for the next major
> version, some for the next minor version. Is dependabot smart enough to try
> bumping the appropriate version number?
>
> Brian
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>
> [2]
> https://github.com/apache/beam/blob/985e2f095d150261e998f58cf048e48a909d5b2b/sdks/python/tox.ini#L231
>
> On Fri, Apr 16, 2021 at 7:16 AM Ismaël Mejía  wrote:
>
>> Oh forgot to mention one alternative that we do in the Avro project,
>> it is that we don't create issues for the dependabot PRs and then we
>> search all the commits authored by dependabot and include them in the
>> release notes to track dependency upgrades.
>>
>> On Fri, Apr 16, 2021 at 4:02 PM Ismaël Mejía  wrote:
>> >
>> > > Quite often, dependency upgrade to latest versions leads to either
>> compilation errors or failed tests and it should be resolved manually or
>> declined. Having this, maybe I miss something, but I don’t see what kind of
>> advantages automatic upgrade will bring to us except that we don’t need to
>> create a PR manually (which is a not big deal).
>> >
>> > The advantage is exactly that, that we don't have to create and track
>> > dependency updates manually, it will be done by the bot and we will
>> > only have to do the review and guarantee that no issues are
>> > introduced. I forgot to mention but we can create exception rules so
>> > no further upgrades will be proposed for some dependencies e.g.
>> > Hadoop, Netty (Java 11 flavor) etc. I forgot to mention another big
>> > advantage that is the detailed security report that will help us
>> > prioritize dependency upgrades.
>> >
>> > > Regarding another issue - it’s already a problem, imho. Since we have
>> a one Jira per package upgrade now and usually it “accumulates” all package
>> upgrades and it’s not closed once upgrade is done, we don’t have a reliable
>> way to notify in release notes about all dependency upgrades for current
>> release. One of the ways is to mention the package upgrade in CHANGES.md
>> which seems not very reliable because it's quite easy to forget to do. I’d
>> prefer to have a dedicated Jira issue for every upgrade and it will be
>> included into releases notes almost automatically.
>> >
>> > Yes it seems the best for release note tracking to create the issue
>> > and rename the PR title for this, but that would be part of the
>> > review/merge process, so up to the Beam committers to do it
>> > systematically and given how well we respect the commit naming /
>> > squashing rules I am not sure if we will win much by having another
>> > useless rule.
>> >
>> > On Fri, Apr 16, 2021 at 3:24 PM Alexey Romanenko
>> >  wrote:

Re: LGPL-2.1 in beam-vendor-grpc

2021-05-10 Thread Ismaël Mejía
Tomo just confirmed in the ticket that if we update the gRPC vendored
version we won't need the JBoss dependency anymore so we should be good to
go with the upgrade. The open question is if this should be blocking for
the upcoming Beam 2.31.0 release or we can fix it afterwards.


On Mon, May 10, 2021 at 2:46 PM Ismaël Mejía  wrote:

> We have been discussing about updating the vendored dependency in
> BEAM-11227 <https://issues.apache.org/jira/browse/BEAM-11227>, if I
> remember correctly the newer version of gRPC does not require the jboss
> dependency, so probably is the best upgrade path, can you confirm Tomo
> Suzuki
> <https://issues.apache.org/jira/secure/ViewProfile.jspa?name=suztomo> ?
>
> On Mon, May 10, 2021 at 2:33 PM Jarek Potiuk  wrote:
>
>> Also we have very similar discussion about it in
>> https://issues.apache.org/jira/browse/LEGAL-572
>> Just to be clear about the context of it, it's not a legal requirement of
>> Apache Licence, it's Apache Software Foundation policy, that we should not
>> limit our users in using our software. If the LGPL dependency is
>> "optional", it's fine to add such optional dependency. If it is "required"
>> to run your software, then it is not allowed as it limits the users of ASF
>> software in further redistributing the software in the way they want (this
>> is at least my understanding of it).
>>
>> On Mon, May 10, 2021 at 12:58 PM JB Onofré  wrote:
>>
>>> Hi
>>>
>>> You can take a look on
>>>
>>> https://www.apache.org/legal/resolved.html
>>>
>>> Regards
>>> JB
>>>
>>> Le 10 mai 2021 à 12:56, Elliotte Rusty Harold  a
>>> écrit :
>>>
>>> Anyone have a link to the official Apache policy about this? Thanks.
>>>
>>> On Mon, May 10, 2021 at 10:07 AM Jan Lukavský  wrote:
>>>
>>>
>>> Hi,
>>>
>>>
>>> we are bundling dependencies with LGPL-2.1, according to license header
>>>
>>> in META-INF/maven/org.jboss.modules/jboss-modules/pom.xml. I think this
>>>
>>> might be an issue, already reported here: [1]. I created [2] to track it
>>>
>>> on our side.
>>>
>>>
>>>  Jan
>>>
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-22555
>>>
>>>
>>> [2] https://issues.apache.org/jira/browse/BEAM-12316
>>>
>>>
>>>
>>>
>>> --
>>> Elliotte Rusty Harold
>>> elh...@ibiblio.org
>>>
>>>
>>
>> --
>> +48 660 796 129
>>
>


Re: LGPL-2.1 in beam-vendor-grpc

2021-05-10 Thread Ismaël Mejía
We have been discussing about updating the vendored dependency in BEAM-11227
, if I remember correctly
the newer version of gRPC does not require the jboss dependency, so
probably is the best upgrade path, can you confirm Tomo Suzuki
 ?

On Mon, May 10, 2021 at 2:33 PM Jarek Potiuk  wrote:

> Also we have very similar discussion about it in
> https://issues.apache.org/jira/browse/LEGAL-572
> Just to be clear about the context of it, it's not a legal requirement of
> Apache Licence, it's Apache Software Foundation policy, that we should not
> limit our users in using our software. If the LGPL dependency is
> "optional", it's fine to add such optional dependency. If it is "required"
> to run your software, then it is not allowed as it limits the users of ASF
> software in further redistributing the software in the way they want (this
> is at least my understanding of it).
>
> On Mon, May 10, 2021 at 12:58 PM JB Onofré  wrote:
>
>> Hi
>>
>> You can take a look on
>>
>> https://www.apache.org/legal/resolved.html
>>
>> Regards
>> JB
>>
>> Le 10 mai 2021 à 12:56, Elliotte Rusty Harold  a
>> écrit :
>>
>> Anyone have a link to the official Apache policy about this? Thanks.
>>
>> On Mon, May 10, 2021 at 10:07 AM Jan Lukavský  wrote:
>>
>>
>> Hi,
>>
>>
>> we are bundling dependencies with LGPL-2.1, according to license header
>>
>> in META-INF/maven/org.jboss.modules/jboss-modules/pom.xml. I think this
>>
>> might be an issue, already reported here: [1]. I created [2] to track it
>>
>> on our side.
>>
>>
>>  Jan
>>
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-22555
>>
>>
>> [2] https://issues.apache.org/jira/browse/BEAM-12316
>>
>>
>>
>>
>> --
>> Elliotte Rusty Harold
>> elh...@ibiblio.org
>>
>>
>
> --
> +48 660 796 129
>


Re: Upgrading vendored gRPC from 1.26.0 to 1.36.0

2021-05-10 Thread Ismaël Mejía
I just saw that gRPC 1.37.1 is out now (and with aarch64 support for
python!) that made me wonder about this, what is the current status of
upgrading the vendored dependency Tomo?


On Thu, Apr 8, 2021 at 4:16 PM Tomo Suzuki  wrote:

> We observed the cron job of Java Precommit for the master branch started
> timing out often (not always) since upgrading the gRPC version.
> https://github.com/apache/beam/pull/14466#issuecomment-815343974
>
> Exchanged messages with Kenn, I reverted to the change; now the master
> branch uses the vendored gRPC 1.26.
>
>
> On Wed, Mar 31, 2021 at 11:40 AM Kenneth Knowles  wrote:
>
>> Merged. Let's keep an eye for trouble, and I will incorporate to the
>> release branch.
>>
>> Kenn
>>
>> On Wed, Mar 31, 2021 at 6:45 AM Tomo Suzuki  wrote:
>>
>>> Regarding troubleshooting on build timeout, it seems that Docker cache
>>> in Jenkins machines might be playing a role. As I run more "Java
>>> Presubmit", I no longer observe timeouts in the PR.
>>>
>>> Kenn, would you merge the PR?
>>> https://github.com/apache/beam/pull/14295 (all checks green, including
>>> the new Java postcommit checks)
>>>
>>> On Thu, Mar 25, 2021 at 5:24 PM Kenneth Knowles  wrote:
>>>
>>>> Yes, I agree this might be a good idea. This is not the only major
>>>> issue on the release-2.29.0 branch.
>>>>
>>>> The counter argument is that we will be pulling in all the bugs
>>>> introduced to `master` since the branch cut.
>>>>
>>>> As far as effort goes, I have been mostly focused on burning down the
>>>> bugs so I would not lose much work in the release process.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Mar 25, 2021 at 1:42 PM Ismaël Mejía  wrote:
>>>>
>>>>> Precommit has been quite unstable in the last few days, so it's worth
>>>>> checking if something is wrong in the CI.
>>>>>
>>>>> I have a question Kenn. Given that cherry picking this might be a bit
>>>>> big as a change can we just reconsider cutting the 2.29.0 branch again
>>>>> after the updated gRPC version use gets merged and mark the issues
>>>>> already fixed for version 2.30.0 to version 2.29.0 ? Seems like an
>>>>> easier upgrade path (and we will get some nice fixes/improvements like
>>>>> official Spark 3 support for free on the release).
>>>>>
>>>>> WDYT?
>>>>>
>>>>>
>>>>> On Wed, Mar 24, 2021 at 8:06 PM Tomo Suzuki 
>>>>> wrote:
>>>>> >
>>>>> > Update: I observe that Java precommit check is unstable in the PR to
>>>>> upgrade vendored gRPC (compared with an PR with an empty change). There's
>>>>> no constant failures; sometimes it succeeds and other times it faces
>>>>> timeout and flaky test failures.
>>>>> >
>>>>> > https://github.com/apache/beam/pull/14295#issuecomment-806071087
>>>>> >
>>>>> >
>>>>> > On Mon, Mar 22, 2021 at 10:46 AM Tomo Suzuki 
>>>>> wrote:
>>>>> >>
>>>>> >> Thank you for the voting and I see the artifact available in Maven
>>>>> Central. I'll work on the PR to use the published artifact today.
>>>>> >>
>>>>> https://search.maven.org/artifact/org.apache.beam/beam-vendor-grpc-1_36_0/0.1/jar
>>>>> >>
>>>>> >> On Tue, Mar 16, 2021 at 3:07 PM Kenneth Knowles 
>>>>> wrote:
>>>>> >>>
>>>>> >>> Update on this: there are some minor issues and then I'll send out
>>>>> the RC.
>>>>> >>>
>>>>> >>> I think this is worth blocking 2.29.0 release on, so I will do
>>>>> this first. We are still eliminating other blockers from 2.29.0 anyhow.
>>>>> >>>
>>>>> >>> Kenn
>>>>> >>>
>>>>> >>> On Mon, Mar 15, 2021 at 7:17 AM Tomo Suzuki 
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> Hi Beam developers,
>>>>> >>>>
>>>>> >>>> I'm working on upgrading the vendored gRPC 1.36.0
>>>>> >>>> https://issues.apache.org/jira/browse/BEAM-11227 (PR:
>>>>> https://github.com/apache/beam/pull/14028)
>>>>> >>>> Let me know if you have any questions or concerns.
>>>>> >>>>
>>>>> >>>> Background:
>>>>> >>>> Exchanged messages with Ismaël in BEAM-11227, it seems that
>>>>> the ticket created by some automation is a false positive, but it's nice to
>>>>> use an artifact without being marked with CVE.
>>>>> >>>>
>>>>> >>>> Kenn offered to work as the release manager (as in
>>>>> https://s.apache.org/beam-release-vendored-artifacts) of the vendored
>>>>> artifact.
>>>>> >>>>
>>>>> >>>> --
>>>>> >>>> Regards,
>>>>> >>>> Tomo
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Regards,
>>>>> >> Tomo
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Regards,
>>>>> > Tomo
>>>>>
>>>>
>>>
>>> --
>>> Regards,
>>> Tomo
>>>
>>
>
> --
> Regards,
> Tomo
>


Re: [PROPOSAL] Vendored bytebuddy dependency release

2021-05-10 Thread Ismaël Mejía
Most issues on the previous migration were related to changes in
class-loading behavior on Java 11. It seems Oracle is taking a more
backwards-compatible approach in its latest releases, so let's hope everything will go
the meantime I tested the upgrade locally and tests are passing ok so we
should be good to go. I opened a PR [1] for the version upgrade and
assuming consensus on this proposal I expect we can pass to vote soon.

[1] https://github.com/apache/beam/pull/14766


On Sun, May 9, 2021 at 6:13 PM Reuven Lax  wrote:

> We've had some issues in the past with semantic changes in ByteBuddy (I
> think related to new Java versions) that required rewriting code in Beam.
>
> On Sat, May 8, 2021 at 10:46 PM Ismaël Mejía  wrote:
>
>> What were the issues last time Reuven? I remember that the release and
>> upgrade PR were pretty smooth, were there unintended consequences from the
>> library changes themselves?
>>
>>
>> On Sun, May 9, 2021 at 12:36 AM Reuven Lax  wrote:
>>
>>> Sounds good. Based on previous experience though, this might be a
>>> difficult upgrade to do.
>>>
>>> On Sat, May 8, 2021 at 12:57 AM Ismaël Mejía  wrote:
>>>
>>>> The version of bytebuddy Beam is vendoring (1.10.8) is already 16
>>>> months old and
>>>> it is not compatible with more recent versions of Java. I would like to
>>>> propose
>>>> that we upgrade it [1] to the most recent version (1.11.0) [2] so we
>>>> can benefit
>>>> of the latest improvements for Java 16/17 and upgraded ASM.
>>>>
>>>> If everyone agrees I would like to volunteer as the release manager for
>>>> this
>>>> upgrade.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/BEAM-12241
>>>> [2] https://github.com/raphw/byte-buddy/blob/master/release-notes.md
>>>>
>>>>


Re: [PROPOSAL] Vendored bytebuddy dependency release

2021-05-08 Thread Ismaël Mejía
What were the issues last time Reuven? I remember that the release and
upgrade PR were pretty smooth, were there unintended consequences from the
library changes themselves?


On Sun, May 9, 2021 at 12:36 AM Reuven Lax  wrote:

> Sounds good. Based on previous experience though, this might be a
> difficult upgrade to do.
>
> On Sat, May 8, 2021 at 12:57 AM Ismaël Mejía  wrote:
>
>> The version of bytebuddy Beam is vendoring (1.10.8) is already 16 months
>> old and
>> it is not compatible with more recent versions of Java. I would like to
>> propose
>> that we upgrade it [1] to the most recent version (1.11.0) [2] so we can
>> benefit
>> of the latest improvements for Java 16/17 and upgraded ASM.
>>
>> If everyone agrees I would like to volunteer as the release manager for
>> this
>> upgrade.
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-12241
>> [2] https://github.com/raphw/byte-buddy/blob/master/release-notes.md
>>
>>


Re: Lots of branches

2021-05-08 Thread Ismaël Mejía
Big +1

If you want to know if you have accidentally left some branches in the
'apache/beam' origin, this command may help:

git for-each-ref --format='%(authorname) %09 %(refname)' --sort=authorname \
  | grep "origin" | grep -v "release"
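
And once a stale branch is identified, deleting it from the shared remote is
`git push origin --delete <branch>`. Below is a minimal self-contained
demonstration in a throwaway local repository; the branch name
"my-old-cherry-pick" is a placeholder, not a real Beam branch.

```shell
# Demonstration in a throwaway repo: create a leftover remote branch,
# list remote branches, then delete the stale one.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q --bare remote.git
git clone -q remote.git work
cd work
git -c user.email=dev@example.com -c user.name=dev \
  commit -q --allow-empty -m "init"
git push -q origin HEAD:master
git push -q origin HEAD:my-old-cherry-pick   # simulate a forgotten branch
git ls-remote --heads origin                 # shows master and the stale branch
git push -q origin --delete my-old-cherry-pick   # the actual cleanup command
git ls-remote --heads origin                 # stale branch is gone
```

The same `--delete` push works against `apache/beam` for anyone with push
access; the GitHub branches page mentioned below also offers a delete button.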


On Sat, May 8, 2021 at 5:01 AM Daniel Oliveira 
wrote:

> Agreed, it would be a good idea to clean it out a bit. I went and deleted
> my own unnecessary branches.
>
> For anyone who needs it, here's a page listing all your branches in the
> repo: https://github.com/apache/beam/branches/yours
>
> On Fri, May 7, 2021 at 5:36 PM Ahmet Altay  wrote:
>
>> Hello all,
>>
>> Our git repo has lots of branches for cherry picks, patches, reverts etc.
>> I believe these are artifacts of github's easy to use online editor. If you
>> no longer need those, could you please clean them?
>>
>> Have a great weekend!
>> Ahmet
>>
>


Re: Extremely Slow DirectRunner

2021-05-08 Thread Ismaël Mejía
Can you try running the direct runner with the option
`--experiments=use_deprecated_read`?

Seems like an instance of
https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
also reported in
https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E

We should roll back using the SDF wrapper by default because of the
usability and performance issues reported.
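
For reference, the experiment is passed like any other pipeline option. A
hedged sketch of a launch command is below; the jar name and the other options
are placeholder assumptions, only the `--experiments=use_deprecated_read` flag
itself comes from this thread.

```shell
# Sketch only: "my-pipeline-bundled.jar" is a placeholder. The relevant part
# is appending --experiments=use_deprecated_read to the pipeline arguments so
# the runner uses the pre-SDF read path instead of the SDF wrapper.
PIPELINE_ARGS="--runner=DirectRunner --experiments=use_deprecated_read"
echo java -jar my-pipeline-bundled.jar $PIPELINE_ARGS
```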


On Sat, May 8, 2021 at 12:57 AM Evan Galpin  wrote:

> Hi all,
>
> I’m experiencing very slow performance and startup delay when testing a
> pipeline locally. I’m reading data from a Google PubSub subscription as the
> data source, and before each pipeline execution I ensure that data is
> present in the subscription (readable from GCP console).
>
> I’m seeing startup delay on the order of minutes with DirectRunner (5-10
> min). Is that expected? I did find a Jira ticket[1] that at first seemed
> related, but I think it has more to do with BQ than DirectRunner.
>
> I’ve run the pipeline with a debugger connected and confirmed that it’s
> minutes before the first DoFn in my pipeline receives any data. Is there a
> way I can profile the direct runner to see what it’s churning on?
>
> Thanks,
> Evan
>
> [1]
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
>


[PROPOSAL] Vendored bytebuddy dependency release

2021-05-08 Thread Ismaël Mejía
The version of bytebuddy Beam is vendoring (1.10.8) is already 16 months
old and
it is not compatible with more recent versions of Java. I would like to
propose
that we upgrade it [1] to the most recent version (1.11.0) [2] so we can
benefit
of the latest improvements for Java 16/17 and upgraded ASM.

If everyone agrees I would like to volunteer as the release manager for this
upgrade.

[1] https://issues.apache.org/jira/browse/BEAM-12241
[2] https://github.com/raphw/byte-buddy/blob/master/release-notes.md


Re: [PROPOSAL] Upgrade Cassandra driver from 3.x to 4.x in CassandraIO

2021-04-30 Thread Ismaël Mejía
Hello,

My apologies for not having commented on this thread before. Thanks for
bringing this new IO connector!

After reading the proposal I think we need to create this as a new
independent CassandraIO (v4) IO connector different from the existing one
based on v3 for the following reasons:

1. The new API follows a different programming model that is binary
incompatible with existing clients, so if we don't want to break existing
users we need to put it in a different module e.g. `sdks/java/cassandra4/`
(we can keep the same name CassandraIO). The alternative would be to keep
it in the same module and expose it via a different API for example
`CassandraIO.readV4()` but sharing the same module brings the inconvenience
of having to deal with the legacy v3 dependencies and implementation
constraints. (I don't understand yet why the PR still keeps the v3
dependencies, is for this reason?)

2. Being independent will allow us to design a more appropriate API for
Cassandra 4 without the previous constraints, if users want to migrate to
v4 they will have to do the adaptation themselves and we (Beam) will be
free of the responsability of providing API guarantees or a migration path
because both Beam APIs would be independent. Of course the tax we have to
pay on the Beam side is to keep maintaining both versions for a while, but
given how slowly the existing CassandraIO connector has changed in the last
years [1] I would not expect many changes coming on the v3 version and most
new ones to target v4.

3. Be able to run tests with a Cassandra v4 cluster. The current tests in
the PR are still targeting an older version of Cassandra. Of course the
Cassandra v4 client is able to connect to earlier versions of Cassandra [2]
but one big selling point of moving to use the Cassandra v4 dependencies is
to support the most recent version of the Cassandra cluster so it is in our
interest to target it in the tests.

Finally I would suggest to adapt this new version to be based on a DoFn
translation and follow the ReadAll pattern (ideally if possible using
SplittableDoFn but not mandatory). We have been in the process of
refactoring the existing CassandraIO (v3) with Vincent Marquez for some
months and this should be hopefully finished soon so you can take this as a
reference [3], and of course you can contact me (us) in case of questions
on details. (of course this is optional at the moment but good to have).

Regards,
Ismaël

[1]
https://github.com/apache/beam/commits/master/sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java
[2]
https://docs.datastax.com/en/driver-matrix/doc/driver_matrix/common/versionCompatibility.html
[3]
https://github.com/vmarquez/beam/tree/feature/BEAM-9008%2Fcassandraio_readall


On Tue, Apr 20, 2021 at 7:48 AM D, Anup (Nokia - IN/Bangalore) <
anu...@nokia.com> wrote:

> Hi All,
>
>
>
> Satwik and myself have been working together on this.
>
> 4.x has been a major revamp and we have highlighted below major
> differences that were seen during this activity.
>
> Please review and provide feedback.
>
>
>
>1. Package names :
>
> 3.x : com.datastax.cassandra
>
> 4.x : com.datastax.oss
>
> Comment : 4.x is different from 3.x. We think both can co-exist. Please
> see JanusGraph who have included both the packages for reference [1]
>
>
>
>1. Mapping :
>
> 3.x : Default Object Mapper took care of mapping all Entity types at
> runtime - org.apache.beam.sdk.io.cassandra.DefaultObjectMapper
>
> 4.x : Mapper auto-generates helper classes during compile time by
> processing annotations on Mapper,Dao and Entity. Then, use either a
> specific Dao or Generic Dao to access/map classes.[2][3]
>
> Comment : With objective to avoid/limit breaking changes, we could find
> providing a Generic/Base Dao via inheritance has limited breakage.[4]
>
> Impacts :
>
>1. Requires mapperFactoryFunction to be mandatorily supplied that can
>return SpecificDao reference.
>2. @GetEntity is the annotation that maps ResultSet to Entity which
>performs strict column checking among the two. This was not the case in
>3.x. We had posted query to Cassandra community [5]
>
>
>
>1. HadoopFormatIO
>
> Unit test in HadoopFormatIO that interacts with Cassandra failed when
> driver was upgraded to 4.x. Latest Apache Cassandra server still uses 3.x
> Cassandra connector.
>
> There is an open JIRA [6][7]
>
>
>
>1. Load Balancing policy
>
> 3.x : Providing data center name is optional.
>
> 4.x : Load balancing policies have been revamped. Providing data center
> name is mandatory.[8]
>
>
>
>1. Configuration
>
> 3.x : This was done by configuring classes.
>
> 4.x : Along with configuring classes, file-based configuration is
> supported. [9][10]
>
> Comment : We did test loading some part of configuration via file and some
> programmatically. There is no impact as such but this is a new
> complimenting feature .
>
>
>
>1. Driver compatibility
>
> Cassandra 4.5+ drivers are fully 
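For readers new to the 4.x driver, the idioms behind points 2 and 4 above look roughly like this (a sketch, not compilable standalone; it assumes the com.datastax.oss java-driver plus its mapper annotation processor on the classpath, and the class, keyspace and datacenter names are made up):

```java
// 4.x session: the local datacenter is now mandatory when contact
// points are given explicitly (point 4 above).
CqlSession session = CqlSession.builder()
    .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
    .withLocalDatacenter("datacenter1")
    .withKeyspace("beam_ks")
    .build();

// 4.x mapping: annotated Entity/Dao/Mapper types whose helper classes are
// generated at compile time by the annotation processor (point 2 above).
@Entity
class Scientist {
  @PartitionKey private int id;
  private String name;
  // getters/setters elided
}

@Dao
interface ScientistDao {
  @Select
  Scientist findById(int id);
}

@Mapper
interface BeamMapper {
  @DaoFactory
  ScientistDao scientistDao();
}
```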

Re: [DISCUSSION] TPC-DS benchmark via Beam SQL, issues

2021-04-28 Thread Ismaël Mejía
> Not every query can be supported by BeamSQL easily.

I have one related question. Would we be able to apply SQL-specific
optimizations that apply only to batch-only pipelines? Asking this because
I can imagine that covering the full Beam model should constrain the
optimization possibilities, no?
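As a concrete illustration of the ORDER BY limitation discussed further down in this thread, this is what it looks like from the user side (a sketch; it assumes the beam-sdks-java-extensions-sql module and an input PCollection of Rows exposed as PCOLLECTION):

```java
// Rejected by the BeamSQL planner: unbounded ORDER BY is not supported.
// input.apply(SqlTransform.query(
//     "SELECT item_sk, price FROM PCOLLECTION ORDER BY price DESC"));

// Accepted: ORDER BY becomes a bounded Top-N once a LIMIT is attached.
PCollection<Row> top = input.apply(
    SqlTransform.query(
        "SELECT item_sk, price FROM PCOLLECTION ORDER BY price DESC LIMIT 100"));
```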

On Tue, Apr 27, 2021 at 7:25 PM Rui Wang  wrote:

>
>
> On Tue, Apr 27, 2021 at 9:10 AM Alexey Romanenko 
> wrote:
>
>> Hello all,
>>
>> I try to run a Beam implementation [1] of TPC-DS benchmark [2] and I
>> observe that most of the queries don’t pass because of different reasons
>> (see below). I run it with Spark Runner but the issues, I believe, are
>> mostly related to either query parsing or query planning, so we can expect
>> the same with other runners too. For now, only ~22% (23/103) of TPC-DS
>> queries passed successfully via Beam SQL / CalciteSQL.
>>
>> The most common issues are the following ones:
>>
>>1. *“Caused by: java.lang.UnsupportedOperationException: Non
>>equi-join is not supported”*
>>2. *“Caused by: java.lang.UnsupportedOperationException: ORDER BY
>>without a LIMIT is not supported!”*
>>3. *“Caused by: 
>> org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.plan.RelOptPlanner$CannotPlanException:
>>  There
>>are not enough rules to produce a node with desired
>>properties: convention=BEAM_LOGICAL. All the inputs have relevant nodes,
>>however the cost is still infinite.”*
>>4. *“Caused by: 
>> org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorException:
>>  No
>>match found for function signature substr(, ,
>>)”*
>>
>> The full list of query statuses is available here [3]. The generated
>> TPC-DS SQL queries can be found there as well [4].
>>
>
> Not every query can be supported by BeamSQL easily. For example, supporting
> non-equi-joins (BEAM-2194). We had discussions for cause 2 to add the
> limitation that BeamSQL only supports ORDER BY LIMIT (LIMIT is required).
> Cause 3 needs a case-by-case investigation; some might be able to be fixed.
> Cause 4 looks like no such function found in the catalog.
>
>>
>> I’m not very familiar with a current status of ongoing work for Beam SQL,
>> so I’m sorry in advance if my questions will sound naive.
>>
>> Please, guide me on this:
>>
>> 1. Are there any chances that we can resolve, at least, partly the
>> current limitations of the query parsing/planning, mentioned above? Are
>> there any principal blockers among them?
>> 2. Are there any plans or ongoing work related to this?
>> 3. Are there any plans to upgrade vendored Calcite version to more recent
>> one? Should it reduce the number of current limitations or not?
>> 4. Do you think it could be valuable for Beam SQL to run TPC-DS benchmark
>> on a regular basis (as we do for Nexmark, for example) even if not all
>> queries can pass with Beam SQL?
>>
>
> This is definitely valuable for BeamSQL if we have enough resources to run
> such queries regularly.
>
>>
>> I’d appreciate any additional information/docs/details/opinions on this
>> topic.
>>
>> —
>> Alexey
>>
>> [1] https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds
>> [2] http://www.tpc.org/tpcds/
>> [3]
>> https://docs.google.com/spreadsheets/d/1Gya9Xoa6uWwORHSrRqpkfSII4ajYvDpUTt0cNJCRHjE/edit?usp=sharing
>> [4]
>> https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds/src/main/resources/queries
>>
>


Re: Contributor permissions for Beam Jira tickets

2021-04-26 Thread Ismaël Mejía
Done, you are now a contributor and I assigned BEAM-12225 to you.
Welcome to Beam, and don't feel bad about your accents; I have to deal with
the same issues regularly :)

Regards,
Ismaël Mejía


On Mon, Apr 26, 2021 at 6:22 PM Rafal Ochyra  wrote:

> Hi,
>
> I have created the account on Beam Jira to report an issue that I would
> like to work on. Link to created issue:
> https://issues.apache.org/jira/browse/BEAM-12225.
> My username is: Rafał Ochyra. Forgive me for using "ł" and space in
> there... Maybe it would be worth changing it if possible.
> Could you please add me as a contributor in the Beam issue tracker so I
> will be able to assign myself to this task (after triage) and, potentially,
> other tasks in the future?
>
> Best regards,
> Rafał
>
>


Re: Issues and PR names and descriptions (or should we change the contribution guide)

2021-04-22 Thread Ismaël Mejía
I was not referring to author identity but to committer identity, which
matters to know who accepted to merge something. But it seems we are
not really using this much because github is the 'committer' of merge
commits too :S Maybe that is something we can improve as part of this
discussion.

git show --pretty=full COMMITID

On Thu, Apr 22, 2021 at 9:10 PM Valentyn Tymofieiev  wrote:
>
> Author identity is preserved. Here's an output of 'git log'
>
> commit 93ecc1d3a4b997b2490c4439972ffaf09125299f
> Merge: 2e9ee8c005 4e3decbb4e  
> <-- a merge commit that merges 2 commits, 4e3decbb4e and
> its parent. Author history is preserved on 4e3decbb4e
> Author: Ismaël Mejía   
><--  this is the author of merge commit
> Date:   Thu Apr 22 12:46:38 2021 +0200
>
> Merge pull request #14616: [BEAM-12207] Remove log messages about files 
> to stage.<-- Note that message was edited, and does not include a 
> branch, which is nice!
> commit 2e9ee8c0052d96045588e617c9e5de017f30454a
>
>
> commit 28020effca12a18a65799ac7d2d3d520d73072d7
> Author: yoshiki.obata <1285728+lazyl...@users.noreply.github.com>
> Date:   Thu Apr 22 11:57:45 2021 +0900
>
> [BEAM-7372] cleanup codes for py2 from apache_beam/transforms (#14544)
>  <--- 1-commit PR  was squashed-and-merged by me. Author's identity is 
> preserved
>
> On Thu, Apr 22, 2021 at 11:47 AM Ismaël Mejía  wrote:
>>
>> In the past github squash and merge did not preserve the committer
>> identity correctly, is it still the case? If so, we should not be
>> using it.
>> https://github.com/isaacs/github/issues/1368
>>
>> On Thu, Apr 22, 2021 at 8:41 PM Robert Bradshaw  wrote:
>> >
>> > On Thu, Apr 22, 2021 at 11:29 AM Valentyn Tymofieiev  
>> > wrote:
>> >>
>> >> I always squash-and-merge even when there is only 1 commit. This avoids 
>> >> the necessity to edit the commit message to remove not so helpful "Merge 
>> >> pull request xxx" message. Is there any harm to recommend squash by 
>> >> default in the upcoming squash bot even for single commit PRs?
>> >
>> >
>> > Does squash-and-merge in that case preserve the commit as-is if there's 
>> > only one? In that case, there'd be no issues of history. (I opted to not 
>> > comment on 1-commit PRs to be less chatty.)
>> >
>> >>
>> >>
>> >> On Thu, Apr 22, 2021 at 11:19 AM Robert Bradshaw  
>> >> wrote:
>> >>>
>> >>> On Thu, Apr 22, 2021 at 9:33 AM Kenneth Knowles  wrote:
>> >>>>
>> >>>>
>> >>>> On Thu, Apr 22, 2021 at 7:04 AM Alexey Romanenko 
>> >>>>  wrote:
>> >>>>>
>> >>>>> Thanks Ismael for bringing this on the table again. Kind of my 
>> >>>>> “favourite” topic, unfortunately, that I raised a couple of times… Let 
>> >>>>> me share some of my thoughts on this.
>> >>>>>
>> >>>>> First of all, as Beam developers, honestly we have to agree if we care 
>> >>>>> about our commits history or not. If not (or not so much) then 
>> >>>>> probably there is no more things to discuss and we use Git as just 
>> >>>>> Git… It’s not a bad thing, it’s just different but for large projects, 
>> >>>>> like Beam, clear commits history is ultra important, imho.
>> >>>>>
>> >>>>> Well, for now we do care and we clearly mention this in our 
>> >>>>> Contribution Guide. Probably, it sounds only as a recommendation there 
>> >>>>> or not all contributors (especially first-time ones) read this or take 
>> >>>>> this into account or pay attention on this. It’s fine and we always 
>> >>>>> can expect not following our guide because of many different reasons. 
>> >>>>> And this is exactly where Committers have to play their role! I mean 
>> >>>>> that our clear Git history mostly relies on committer's shoulders and, 
>> >>>>> before clicking on Merge button, every committer have (even “must" I’d 
>> >>>>> say) make sure that PR respects all our rules (we have them because of 
>> >>>>> some reasons, right?) and ready to be merged. Nice and correct 
>> >>>>> titles/messages is one this thin

Re: Issues and PR names and descriptions (or should we change the contribution guide)

2021-04-22 Thread Ismaël Mejía
a separate 
>>>>>> PR they don't get done and they don't get reviewed with the same 
>>>>>> priority (extra sad face)
>>>>>>
>>>>>> I know I am in the minority. I tend to have a lot of PRs where 
>>>>>> there are 2-5 fairly independent commits. It is "to aid code review" but 
>>>>>> not in the way you might think: The best size for code review is pretty 
>>>>>> big, compared to the best size for commit. A commit is the unit of 
>>>>>> roll-forward, roll-back, cherry-pick, etc. Brian's point about commits 
>>>>>> not being independently tested is important: this is a tooling issue, 
>>>>>> but not that easy to change. Here is why I am not that worried about it: 
>>>>>> I believe strongly in a "rollback first" policy to restore greenness, 
>>>>>> but also that the rollback change itself must be verified to restore 
>>>>>> greenness. When a multi-commit PR fails, you can easily open a revert of 
>>>>>> the whole PR as well as reverts of individual suspect commits. The CI 
>>>>>> for these will finish around the same time, and if you manage a smaller 
>>>>>> revert, great! Imagine if to revert a PR you had to revert _every_ 
>>>>>> change between HEAD and that PR. It would restore to a known green 
>>>>>> state. Yet we don't do this, because we have technology that makes it 
>>>>>> unnecessary. Ultimately, single large commits with bullet points are 
>>>>>> just an unstructured version of multi-commit PRs. So I favor the 
>>>>>> structure. But people seem to be more likely to write good bullet points 
>>>>>> than to write independent commits. Perhaps because it is easier.
>>>>>>
>>>>>> So at this point, I think I am OK with a 1 commit per PR policy. I think 
>>>>>> the net benefits to our commit history would be good. I have grown tired 
>>>>>> of repeating the conversation. Rebase-and-squash edits commit ids in 
>>>>>> ways that confuses tools, so I do not favor this. Tooling that merges 
>>>>>> one commit at a time (without altering commit id) would also be super 
>>>>>> cool and not that hard. It would prevent intermediate results from 
>>>>>> merging, solving both problems.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 21, 2021 at 1:25 PM Brian Hulette  
>>>>>> wrote:
>>>>>>>
>>>>>>> I'd argue that the history is almost always "most useful" when one PR 
>>>>>>> == one commit on master. Intermediate commits from a PR may be useful 
>>>>>>> to aid code review, but they're not verified by presubmits and thus 
>>>>>>> aren't necessarily independently revertible, so I see little value in 
>>>>>>> keeping them around on master. In fact if you're breaking up a PR into 
>>>>>>> multiple commits to aid code review, it's worth considering if they 
>>>>>>> could/should be separately reviewed and verified PRs.
>>>>>>> We could solve the unwanted commit issue if we have a policy to always 
>>>>>>> "Squash and Merge" PRs with rare exceptions.
>>>>>>>
>>>>>>> I agree jira/PR titles could be better, I'm not sure what we can do 
>>>>>>> about it aside from reminding committers of this responsibility. 
>>>>>>> Perhaps the triage process can help catch poorly titled jiras?
>>>>>>>
>>>>>>> On Wed, Apr 21, 2021 at 11:38 AM Robert Bradshaw  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> +1 to better descriptions for JIRA (and PRs). Thanks for bringing this 
>>>>>>>> up.
>>>>>>>>
>>>>>>>> For merging unwanted commits, can we automate a simple check (e.g. 
>>>>>>>> with github actions)?
>>>>>>>>
>>>>>>>> On Wed, Apr 21, 2021 at 8:00 AM Tomo Suzuki  wrote:
>>>>>>>>>
>>>>>>>>> BEAM-12173 is on me. I'm sorry about that. Re-reading committer guide
>>>>>>>>> [1], I see I was not following this
>>>>>>>>>
>>>>>>>>> > The reviewer should give the LGTM and then request that the author 
>>>>>>>>> > of the pull request rebase, squash, split, etc, the commits, so 
>>>>>>>>> > that the history is most useful
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thank you for the feedback on this matter! (And I don't think we
>>>>>>>>> should change the contribution guide)
>>>>>>>>>
>>>>>>>>> [1] https://beam.apache.org/contribute/committer-guide/
>>>>>>>>>
>>>>>>>>> On Wed, Apr 21, 2021 at 10:35 AM Ismaël Mejía  
>>>>>>>>> wrote:
>>>>>>>>> >
>>>>>>>>> > Hello,
>>>>>>>>> >
>>>>>>>>> > I have noticed an ongoing pattern of carelessness around issues/PR 
>>>>>>>>> > titles and
>>>>>>>>> > descriptions. It is really painful to see more and more examples 
>>>>>>>>> > like:
>>>>>>>>> >
>>>>>>>>> > BEAM-12160 Add TODO for fixing the warning
>>>>>>>>> > BEAM-12165 Fix ParquetIO
>>>>>>>>> > BEAM-12173 avoid intermediate conversion (PR) and BEAM-12173 use
>>>>>>>>> > toMinutes (commit)
>>>>>>>>> >
>>>>>>>>> > In all these cases with just a bit of detail in the title it would 
>>>>>>>>> > be enough to
>>>>>>>>> > make other contributors' or reviewers' lives easier, as well as to 
>>>>>>>>> > have a better
>>>>>>>>> > project history. What astonishes me, apart from the lack of care, is 
>>>>>>>>> > that some of
>>>>>>>>> > those are from Beam committers.
>>>>>>>>> >
>>>>>>>>> > We already have discussed about not paying attention during commit 
>>>>>>>>> > merges where
>>>>>>>>> > some PRs end up merging tons of 'unwanted' fixup commits, and 
>>>>>>>>> > nothing has
>>>>>>>>> > changed so I am wondering if we should maybe just totally remove 
>>>>>>>>> > that rule (for
>>>>>>>>> > commits) and also eventually for titles and descriptions.
>>>>>>>>> >
>>>>>>>>> > Ismaël
>>>>>>>>> >
>>>>>>>>> > [1] https://beam.apache.org/contribute/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards,
>>>>>>>>> Tomo
>>>>>
>>>>>


Re: Performance tests dashboard not working

2021-04-21 Thread Ismaël Mejía
Seems to be a networking issue on my side; they fail on Firefox with
some weird timeout but work perfectly on Chrome.
Thanks for confirming, Andrew.

On Wed, Apr 21, 2021 at 6:45 PM Andrew Pilloud  wrote:
>
> Looks like it is working now?
>
> On Wed, Apr 21, 2021 at 7:34 AM Ismaël Mejía  wrote:
>>
>> Following the conversation on the performance regression on Flink
>> runner I wanted to take a look at the performance dashboards (Nexmark
>> + Load Tests) but when I open the dashboards it says there is a
>> connectivity error "NetworkError when attempting to fetch resource.".
>> Can someone with more knowledge about our CI / dashboards infra please
>> take a look.
>>
>> http://104.154.241.245/d/ahudA_zGz/nexmark?orgId=1


Issues and PR names and descriptions (or should we change the contribution guide)

2021-04-21 Thread Ismaël Mejía
Hello,

I have noticed an ongoing pattern of carelessness around issues/PR titles and
descriptions. It is really painful to see more and more examples like:

BEAM-12160 Add TODO for fixing the warning
BEAM-12165 Fix ParquetIO
BEAM-12173 avoid intermediate conversion (PR) and BEAM-12173 use
toMinutes (commit)

In all these cases, just a bit of detail in the title would be enough to
make other contributors' or reviewers' lives easier, as well as to give the
project a better history. What astonishes me, apart from the lack of care, is
that some of those are from Beam committers.

We have already discussed the lack of attention during commit merges, where
some PRs end up merging tons of 'unwanted' fixup commits, and nothing has
changed, so I am wondering if we should maybe just totally remove that rule (for
commits) and eventually for titles and descriptions too.

Ismaël

[1] https://beam.apache.org/contribute/


Performance tests dashboard not working

2021-04-21 Thread Ismaël Mejía
Following the conversation on the performance regression on Flink
runner I wanted to take a look at the performance dashboards (Nexmark
+ Load Tests) but when I open the dashboards it says there is a
connectivity error "NetworkError when attempting to fetch resource.".
Can someone with more knowledge about our CI / dashboards infra please
take a look.

http://104.154.241.245/d/ahudA_zGz/nexmark?orgId=1


Re: [DISCUSS] Enable automatic dependency updates with Github's dependabot

2021-04-16 Thread Ismaël Mejía
Oh forgot to mention one alternative that we do in the Avro project,
it is that we don't create issues for the dependabot PRs and then we
search all the commits authored by dependabot and include them in the
release notes to track dependency upgrades.

On Fri, Apr 16, 2021 at 4:02 PM Ismaël Mejía  wrote:
>
> > Quite often, dependency upgrade to latest versions leads to either 
> > compilation errors or failed tests and it should be resolved manually or 
> > declined. Having this, maybe I miss something, but I don’t see what kind of 
> > advantages automatic upgrade will bring to us except that we don’t need to 
> > create a PR manually (which is a not big deal).
>
> The advantage is exactly that, that we don't have to create and track
> dependency updates manually, it will be done by the bot and we will
> only have to do the review and guarantee that no issues are
> introduced. I forgot to mention but we can create exception rules so
> no further upgrades will be proposed for some dependencies e.g.
> Hadoop, Netty (Java 11 flavor) etc. I forgot to mention another big
> advantage that is the detailed security report that will help us
> prioritize dependency upgrades.
>
> > Regarding another issue - it’s already a problem, imho. Since we have a one 
> > Jira per package upgrade now and usually it “accumulates” all package 
> > upgrades and it’s not closed once upgrade is done, we don’t have a reliable 
> > way to notify in release notes about all dependency upgrades for current 
> release. One of the ways is to mention the package upgrade in CHANGES.md
> which seems not very reliable because it's quite easy to forget to do. I’d
> > prefer to have a dedicated Jira issue for every upgrade and it will be 
> > included into releases notes almost automatically.
>
> Yes it seems the best for release note tracking to create the issue
> and rename the PR title for this, but that would be part of the
> review/merge process, so up to the Beam committers to do it
> systematically and given how well we respect the commit naming /
> squashing rules I am not sure if we will win much by having another
> useless rule.
>
> On Fri, Apr 16, 2021 at 3:24 PM Alexey Romanenko
>  wrote:
> >
> > Quite often, dependency upgrade to latest versions leads to either 
> > compilation errors or failed tests and it should be resolved manually or 
> > declined. Having this, maybe I miss something, but I don’t see what kind of 
> > advantages automatic upgrade will bring to us except that we don’t need to 
> > create a PR manually (which is a not big deal).
> >
> > Regarding another issue - it’s already a problem, imho. Since we have a one 
> > Jira per package upgrade now and usually it “accumulates” all package 
> > upgrades and it’s not closed once upgrade is done, we don’t have a reliable 
> > way to notify in release notes about all dependency upgrades for current 
> > release. One of the ways is to mention the package upgrade in CHANGES.md
> > which seems not very reliable because it's quite easy to forget to do. I’d
> > prefer to have a dedicated Jira issue for every upgrade and it will be 
> > included into releases notes almost automatically.
> >
> > > On 16 Apr 2021, at 14:15, Ismaël Mejía  wrote:
> > >
> > > Hello,
> > >
> > > Github has a bot that creates automatically Dependency Update PRs and
> > > report security issues called dependabot.
> > >
> > > I was wondering if we should enable it for Beam. I tested it in my
> > > personal Beam fork and it seems to be working well, it created
> > > dependency updates for both Python and JS (website) dependencies.
> > > The bot seems to be having problems to understand our gradle
> > > dependency definitions for Java but that's something we can address in
> > > the future to benefit of the updates. Also it did not propose go-lang
> > > updates (probably for the same reason).
> > >
> > > If the community agrees I will create a ticket for INFRA to enable it.
> > > We might be getting extra PRs (at the beginning) and we have to be
> > > cautious about updates that might have unintended consequences for
> > > example we should not merge non stable dependency updates (those
> > > ending on -rc1 or -beta on Java) that
> > > might be proposed or dependencies that committers are aware we should
> > > not update for example projects where their main stable version is not
> > > the most recent one like Hadoop or dependencies that do not support
> > > our ongoing language target version (e.g. Java 11 only deps).
> > >
> > > Another issue is that these dependency updates might not get a JIRA
> > > associated with them so we need to decide if (1) we create one and
> > > rename/associate the PR with it, or (2) we just decide not to have
> > > JIRAs for dependency updates.
> > >
> > > WDYT? other pros/cons that I can be missing?
> > >
> > > Ismaël
> >


Re: [DISCUSS] Enable automatic dependency updates with Github's dependabot

2021-04-16 Thread Ismaël Mejía
> Quite often, dependency upgrade to latest versions leads to either 
> compilation errors or failed tests and it should be resolved manually or 
> declined. Having this, maybe I miss something, but I don’t see what kind of 
> advantages automatic upgrade will bring to us except that we don’t need to 
> create a PR manually (which is a not big deal).

The advantage is exactly that, that we don't have to create and track
dependency updates manually, it will be done by the bot and we will
only have to do the review and guarantee that no issues are
introduced. I forgot to mention but we can create exception rules so
no further upgrades will be proposed for some dependencies e.g.
Hadoop, Netty (Java 11 flavor), etc. I forgot to mention another big
advantage, which is the detailed security report that will help us
prioritize dependency upgrades.

> Regarding another issue - it’s already a problem, imho. Since we have a one 
> Jira per package upgrade now and usually it “accumulates” all package 
> upgrades and it’s not closed once upgrade is done, we don’t have a reliable 
> way to notify in release notes about all dependency upgrades for current 
> release. One of the ways is to mention the package upgrade in CHANGES.md which
> seems not very reliable because it's quite easy to forget to do. I’d prefer to
> have a dedicated Jira issue for every upgrade and it will be included into 
> releases notes almost automatically.

Yes it seems the best for release note tracking to create the issue
and rename the PR title for this, but that would be part of the
review/merge process, so up to the Beam committers to do it
systematically; and given how well we respect the commit naming /
squashing rules, I am not sure if we will win much by having another
useless rule.

On Fri, Apr 16, 2021 at 3:24 PM Alexey Romanenko
 wrote:
>
> Quite often, dependency upgrade to latest versions leads to either 
> compilation errors or failed tests and it should be resolved manually or 
> declined. Having this, maybe I miss something, but I don’t see what kind of 
> advantages automatic upgrade will bring to us except that we don’t need to 
> create a PR manually (which is a not big deal).
>
> Regarding another issue - it’s already a problem, imho. Since we have a one 
> Jira per package upgrade now and usually it “accumulates” all package 
> upgrades and it’s not closed once upgrade is done, we don’t have a reliable 
> way to notify in release notes about all dependency upgrades for current 
> release. One of the ways is to mention the package upgrade in CHANGES.md which
> seems not very reliable because it's quite easy to forget to do. I’d prefer to
> have a dedicated Jira issue for every upgrade and it will be included into 
> releases notes almost automatically.
>
> > On 16 Apr 2021, at 14:15, Ismaël Mejía  wrote:
> >
> > Hello,
> >
> > Github has a bot that creates automatically Dependency Update PRs and
> > report security issues called dependabot.
> >
> > I was wondering if we should enable it for Beam. I tested it in my
> > personal Beam fork and it seems to be working well, it created
> > dependency updates for both Python and JS (website) dependencies.
> > The bot seems to be having problems to understand our gradle
> > dependency definitions for Java but that's something we can address in
> > the future to benefit of the updates. Also it did not propose go-lang
> > updates (probably for the same reason).
> >
> > If the community agrees I will create a ticket for INFRA to enable it.
> > We might be getting extra PRs (at the beginning) and we have to be
> > cautious about updates that might have unintended consequences for
> > example we should not merge non stable dependency updates (those
> > ending on -rc1 or -beta on Java) that
> > might be proposed or dependencies that committers are aware we should
> > not update for example projects where their main stable version is not
> > the most recent one like Hadoop or dependencies that do not support
> > our ongoing language target version (e.g. Java 11 only deps).
> >
> > Another issue is that these dependency updates might not get a JIRA
> > associated with them so we need to decide if (1) we create one and
> > rename/associate the PR with it, or (2) we just decide not to have
> > JIRAs for dependency updates.
> >
> > WDYT? other pros/cons that I can be missing?
> >
> > Ismaël
>


[DISCUSS] Enable automatic dependency updates with Github's dependabot

2021-04-16 Thread Ismaël Mejía
Hello,

Github has a bot that creates automatically Dependency Update PRs and
report security issues called dependabot.

I was wondering if we should enable it for Beam. I tested it in my
personal Beam fork and it seems to work well; it created
dependency updates for both the Python and JS (website) dependencies.
The bot seems to have problems understanding our gradle
dependency definitions for Java, but that's something we can address in
the future to benefit from the updates. Also it did not propose Go
updates (probably for the same reason).
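For reference, enabling the bot amounts to checking in a .github/dependabot.yml; a sketch along the lines of what I described (the ecosystems, directories, intervals and ignore rules here are illustrative, not a final proposal):

```yaml
# .github/dependabot.yml (illustrative sketch)
version: 2
updates:
  - package-ecosystem: "pip"       # Python SDK dependencies
    directory: "/sdks/python"
    schedule:
      interval: "weekly"
  - package-ecosystem: "npm"       # website dependencies
    directory: "/website"
    schedule:
      interval: "weekly"
  - package-ecosystem: "gradle"    # Java; support for our setup is limited
    directory: "/"
    schedule:
      interval: "weekly"
    ignore:
      # dependencies we deliberately keep back, e.g. Hadoop
      - dependency-name: "org.apache.hadoop:*"
```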

If the community agrees I will create a ticket for INFRA to enable it.
We might get extra PRs (at the beginning) and we have to be
cautious about updates that might have unintended consequences. For
example, we should not merge non-stable dependency updates (those
ending in -rc1 or -beta on Java) that
might be proposed, or dependencies that committers know we should
not update, for example projects whose main stable version is not
the most recent one (like Hadoop) or dependencies that do not support
our current language target version (e.g. Java 11 only deps).

Another issue is that these dependency updates might not get a JIRA
associated with them so we need to decide if (1) we create one and
rename/associate the PR with it, or (2) we just decide not to have
JIRAs for dependency updates.

WDYT? other pros/cons that I can be missing?

Ismaël


Re: Long term support versions of Beam Java

2021-04-16 Thread Ismaël Mejía
As Kenn clearly points out, everyone can do an Apache release of an
earlier version, so this should cover most maintenance fixes for old
versions. So any person (or company) can decide to work on supporting
one version.

The real deal of having an LTS "backed by the community" is that ALL
the community should care about backporting issues and that's exactly
what made the previous LTS trial fail.

Why should I (contributor/maintainer) care about backporting stuff if
users (or my employer) can move rapidly to a more recent version and
get even more benefits? It does not help that
backporting fixes has been absolutely painful in the past due to the
rapid changes of Beam internals + build system + CI runs.

But even if most people (or just one company) are interested on
backporting issues and maintaining a LTS there is still more to
clarify. The devil is in the details: what are the goals or guarantees of
the LTS version, and how does this impact the backporting of issues? Some
already mentioned upgrade and state compatibility as candidates (that
of course will require additional regression tests), but I can think
of more mundane ones, like: can we upgrade dependency versions in the
LTS for reasons other than security to make migration easier? What
happens if changes in master introduce incompatible transitive (or
not) APIs into the LTS version; should they be backported or not? There
are without doubt more details to clarify.

Another aspect of having an LTS version that has not been mentioned is
that all the bugs users would otherwise report early, because they move
versions more regularly, will now be discovered and reported a year
later. This has a negative impact on the project's quality too. We are
not perfect and errors will happen, so the earlier we can find and fix
them the better. It is up to us in the project to make the quality
good enough that users are motivated to upgrade and don't prefer to stay
on older versions just out of fear.

On Tue, Apr 13, 2021 at 2:22 AM Robert Burke  wrote:
>
> I'll note that "binary compatibility" can be substituted for 
> "upgrade compatibility" or "state compatibility".
>
> On Mon, Apr 12, 2021, 5:04 PM Brian Hulette  wrote:
>>
>> > Beam is also multi language, which adjusts concerns. How does GRPC handle 
>> > that? Or protos? (I'm sure there are other examples we can pull from...)
>>
>> I'm not sure those are good projects to look at. They're likely much more 
>> concerned with binary compatibility and there's probably not much change in 
>> the user-facing API.
>> Arrow is another multi-language project but I don't think we can learn much 
>> from it's versioning policy [1], which is much more concerned with binary 
>> compatibility than it is with API compatibility (for now). Perhaps one 
>> lesson is that they track a separate format version and API version. We 
>> could do something similar and have a separate version number for the Beam 
>> model protos. I'm not sure if that's relevant for this discussion or not.
>>
>> Spark may be a reasonable comparison since it provides an API in multiple 
>> languages, but that doesn't seem to have any bearing on it's versioning 
>> policy [2]. It sounds similar to Flink in that every minor release gets 
>> backported bugfixes for 18 months, but releases are slower (~6 months) so 
>> that's not as much of a burden.
>>
>> Brian
>>
>> [1] 
>> https://arrow.apache.org/docs/format/Versioning.html#backward-compatibility
>> [2] https://spark.apache.org/versioning-policy.html
>>
>> On Thu, Apr 8, 2021 at 1:18 PM Robert Bradshaw  wrote:
>>>
>>> Python (again a language) has a slower release cycle, fairly strict 
>>> backwards compatibility stance (with the ability to opt-in before changes 
>>> become the default) and clear ownership for maintenance of each minor 
>>> version until end-of-life (so each could be considered an LTS release). 
>>> https://devguide.python.org/devcycle/
>>>
>>> Cython is more similar to Beam: best-effort compatibility, no LTS, but as a 
>>> code-generater rather than a runtime library a developer is mostly free to 
>>> upgrade at their own cadence regardless of the surrounding ecosystem 
>>> (including downstream projects that take them on as a dependency).
>>>
>>> IIRC, Flink supports the latest N (3?) releases, which are infrequent 
>>> enough to cover about 12-18 months.
>>>
>>> My take is that Beam should be supportive of LTS releases, but we're not in 
>>> a position to commit to it (to the same level we commit to the 6-week 
>>> cut-from-head release cycle). But certain users of Beam (which have a large 
>>> overlap with the Beam community) could make such commitments as it helps 
>>> them (directly or indirectly). Let's give it a try.
>>>
>>>
>>> On Thu, Apr 8, 2021 at 1:00 PM Robert Burke  wrote:

 I don't know about other Apache projects but the Go Programming Language 
 uses a slower release cadence, two releases a year. Only the last two 
 releases are maintained.

Re: [Question] Amazon Neptune I/O connector

2021-04-16 Thread Ismaël Mejía
I had not seen that the query API of Neptune is Gremlin based so this
could be an even more generic IO connector.
That's probably beyond scope because you care most for the write but
interesting anyway.

https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-java.html

On Fri, Apr 16, 2021 at 9:58 AM Ismaël Mejía  wrote:
>
> Hello Gabriel,
>
> Other interesting reference because of the Batch loads API like use +
> Amazon is the unfinished Amazon Redshift connector PR from this ticket
> https://issues.apache.org/jira/browse/BEAM-3032
>
> The reason why that one was not merged into Beam is because it lacked tests.
> You should probably look at how to test Neptune in advance; it seems
> that localstack does not support Neptune (only in the paid version),
> so probably mocking would be the right way.
>
> We would be really interested if you want to contribute the
> NeptuneIO connector to Beam, so don't hesitate to contact us.
>
>
> On Fri, Apr 16, 2021 at 5:41 AM Gabriel Levcovitz  
> wrote:
> >
> > Hi Daniel, Kenneth,
> >
> > Thank you very much for your answers! I'll be looking carefully into the 
> > info you've provided and if we eventually decide it's worth implementing, 
> > I'll get back to you.
> >
> > Best,
> > Gabriel
> >
> >
> > On Thu, Apr 15, 2021 at 2:32 PM Kenneth Knowles  wrote:
> >>
> >>
> >>
> >> On Wed, Apr 14, 2021 at 11:07 PM Daniel Collins  
> >> wrote:
> >>>
> >>> Hi Gabriel,
> >>>
> >>> Write-side adapters for systems tend to be easier than read-side adapters 
> >>> to implement. That being said, looking at the documentation for neptune, 
> >>> it looks to me like there's no direct data load API, only a batch data 
> >>> load from a file on S3? This is usable but perhaps a bit more difficult 
> >>> to work with.
> >>>
> >>> You could implement a write side adapter for neptune (either on your own 
> >>> or as a contribution to beam) by writing a standard DoFn which, in its 
> >>> ProcessElement method, buffers received records in memory, and in its 
> >>> FinishBundle method, writes all collected records to a file on S3, 
> >>> notifies neptune, and waits for neptune to ingest them. You can see 
> >>> documentation on the DoFn API here. Someone else here might have more 
> >>> experience working with microbatch-style APIs like this, and could have 
> >>> more suggestions.
> >>
> >>
> >> In fact, our BigQueryIO connector has a mode of operation that does batch 
> >> loads from files on GCS: 
> >> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
> >>
> >> The connector overall is large and complex, because it is old and mature. 
> >> But it may be helpful as a point of reference.
> >>
> >> Kenn
> >>
> >>>
> >>> A read-side API would likely be only a minimally higher lift. This could 
> >>> be done in a simple loading step (Create with a single element followed 
> >>> by MapElements), although much of the complexity likely lies around how 
> >>> to provide the necessary properties to the cluster construction on the 
> >>> beam worker task, and how to define the query the user would need to 
> >>> execute. I'd also wonder if this could be done in an engine-agnostic way, 
> >>> "TinkerPopIO" instead of "NeptuneIO".
> >>>
> >>> If you'd like to pursue adding such an integration, 
> >>> https://beam.apache.org/contribute/ provides documentation on the 
> >>> contribution process. Contributions to beam are always appreciated!
> >>>
> >>> -Daniel
> >>>
> >>>
> >>>
> >>> On Thu, Apr 15, 2021 at 12:44 AM Gabriel Levcovitz 
> >>>  wrote:
> >>>>
> >>>> Dear Beam Dev community,
> >>>>
> >>>> I'm working on a project where we have a graph database on Amazon 
> >>>> Neptune (https://aws.amazon.com/neptune) and we have data coming from 
> >>>> Google Cloud.
> >>>>
> >>>> So I was wondering if anyone has ever worked with a similar architecture 
> >>>> and has considered developing an Amazon Neptune custom Beam I/O 
> >>>> connector. Is it feasible? Is it worth it?
> >>>>
> >>>> Honestly I'm not that experienced with Apache Beam / Dataflow, so I'm 
> >>>> not sure if something like that would make sense. Currently we're 
> >>>> connecting Beam to AWS Kinesis and AWS S3, and from there, to Neptune.
> >>>>
> >>>> Thank you all very much in advance!
> >>>>
> >>>> Best,
> >>>> Gabriel Levcovitz


Re: [Question] Amazon Neptune I/O connector

2021-04-16 Thread Ismaël Mejía
Hello Gabriel,

Other interesting reference because of the Batch loads API like use +
Amazon is the unfinished Amazon Redshift connector PR from this ticket
https://issues.apache.org/jira/browse/BEAM-3032

The reason why that one was not merged into Beam is because it lacked tests.
You should probably look at how to test Neptune in advance; it seems
that localstack does not support Neptune (only in the paid version),
so probably mocking would be the right way.

We would be really interested if you want to contribute the
NeptuneIO connector to Beam, so don't hesitate to contact us.


On Fri, Apr 16, 2021 at 5:41 AM Gabriel Levcovitz  wrote:
>
> Hi Daniel, Kenneth,
>
> Thank you very much for your answers! I'll be looking carefully into the info 
> you've provided and if we eventually decide it's worth implementing, I'll get 
> back to you.
>
> Best,
> Gabriel
>
>
> On Thu, Apr 15, 2021 at 2:32 PM Kenneth Knowles  wrote:
>>
>>
>>
>> On Wed, Apr 14, 2021 at 11:07 PM Daniel Collins  wrote:
>>>
>>> Hi Gabriel,
>>>
>>> Write-side adapters for systems tend to be easier than read-side adapters 
>>> to implement. That being said, looking at the documentation for neptune, it 
>>> looks to me like there's no direct data load API, only a batch data load 
>>> from a file on S3? This is usable but perhaps a bit more difficult to work 
>>> with.
>>>
>>> You could implement a write side adapter for neptune (either on your own or 
>>> as a contribution to beam) by writing a standard DoFn which, in its 
>>> ProcessElement method, buffers received records in memory, and in its 
>>> FinishBundle method, writes all collected records to a file on S3, notifies 
>>> neptune, and waits for neptune to ingest them. You can see documentation on 
>>> the DoFn API here. Someone else here might have more experience working 
>>> with microbatch-style APIs like this, and could have more suggestions.
>>
>>
>> In fact, our BigQueryIO connector has a mode of operation that does batch 
>> loads from files on GCS: 
>> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
>>
>> The connector overall is large and complex, because it is old and mature. 
>> But it may be helpful as a point of reference.
>>
>> Kenn
>>
>>>
>>> A read-side API would likely be only a minimally higher lift. This could be 
>>> done in a simple loading step (Create with a single element followed by 
>>> MapElements), although much of the complexity likely lies around how to 
>>> provide the necessary properties to the cluster construction on the beam 
>>> worker task, and how to define the query the user would need to execute. 
>>> I'd also wonder if this could be done in an engine-agnostic way, 
>>> "TinkerPopIO" instead of "NeptuneIO".
>>>
>>> If you'd like to pursue adding such an integration, 
>>> https://beam.apache.org/contribute/ provides documentation on the 
>>> contribution process. Contributions to beam are always appreciated!
>>>
>>> -Daniel
>>>
>>>
>>>
>>> On Thu, Apr 15, 2021 at 12:44 AM Gabriel Levcovitz  
>>> wrote:

 Dear Beam Dev community,

 I'm working on a project where we have a graph database on Amazon Neptune 
 (https://aws.amazon.com/neptune) and we have data coming from Google Cloud.

 So I was wondering if anyone has ever worked with a similar architecture 
 and has considered developing an Amazon Neptune custom Beam I/O connector. 
 Is it feasible? Is it worth it?

 Honestly I'm not that experienced with Apache Beam / Dataflow, so I'm not 
 sure if something like that would make sense. Currently we're connecting 
 Beam to AWS Kinesis and AWS S3, and from there, to Neptune.

 Thank you all very much in advance!

 Best,
 Gabriel Levcovitz
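The read-side idea mentioned above (a "Create with a single element followed by MapElements") can also be sketched without Beam: start from one seed element and expand it into query results. Everything here is hypothetical and Beam-free; in a real connector the query function would run a Gremlin traversal against the Neptune/TinkerPop cluster instead of the fake execution shown:

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the read-side pattern: a one-element "Create"
// expanded into results by a "MapElements"-style query function.
class SingleSeedReadSketch {
    static <T, R> List<R> createThenMap(T seed, Function<T, List<R>> query) {
        // "Create" a single-element collection, then expand it via the query.
        return query.apply(seed);
    }

    public static void main(String[] args) {
        List<String> results = createThenMap(
            "g.V().limit(2)",                       // user-supplied query string
            q -> List.of("vertex-1", "vertex-2"));  // fake query execution
        System.out.println(results);
    }
}
```

As the thread notes, the hard parts in practice are providing connection properties to the cluster on each worker and letting the user define the query, not this control flow itself.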


Re: [ANNOUNCE] New committer: Tomo Suzuki

2021-04-02 Thread Ismaël Mejía
Congrats Tomo, so well deserved. It has been a pleasure to work with you!


On Fri, Apr 2, 2021 at 8:29 PM Tyson Hamilton  wrote:

> Congrats!
>
> On Fri, Apr 2, 2021 at 11:02 AM Pablo Estrada  wrote:
>
>> Thank you Tomo! And congrats : )
>>
>> On Fri, Apr 2, 2021 at 10:24 AM Robert Bradshaw 
>> wrote:
>>
>>> Congratulations!
>>>
>>> On Fri, Apr 2, 2021 at 10:19 AM Chamikara Jayalath 
>>> wrote:
>>>
 Congrats Tomo!

 On Fri, Apr 2, 2021 at 9:54 AM Brian Hulette 
 wrote:

> Congratulations Tomo! Well deserved :)
>
> On Fri, Apr 2, 2021 at 9:51 AM Yichi Zhang  wrote:
>
>> Congratulations!
>>
>> On Fri, Apr 2, 2021 at 9:42 AM Ahmet Altay  wrote:
>>
>>> Congratulations! 🎉🎉🎉
>>>
>>> On Fri, Apr 2, 2021 at 9:38 AM Kenneth Knowles 
>>> wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Tomo Suzuki

 Since joining the Beam community in 2019, Tomo has done lots of
 critical work on Beam's dependencies: maintaining the dependency 
 checker
 that files Jiras and sends emails, upgrading dependencies, fixing
 dependency configuration errors, maintaining our linkage checker. Most
 recently, an epic upgrade of gRPC.

 Considering these highlighted contributions and the rest, the Beam
 PMC trusts Tomo with the responsibilities of a Beam committer [1].

 Thank you, Tomo, for your contributions.

 Kenn

 [1] https://beam.apache.org/contribute/become-a-committer
 /#an-apache-beam-committer

>>>


Re: Upgrading vendored gRPC from 1.26.0 to 1.36.0

2021-03-25 Thread Ismaël Mejía
Precommit has been quite unstable in the last few days, so it is worth
checking whether something is wrong in the CI.

I have a question, Kenn. Given that cherry-picking this might be a bit
big as a change, can we just reconsider cutting the 2.29.0 branch again
after the updated gRPC version use gets merged, and mark the issues
already fixed for version 2.30.0 as fixed in version 2.29.0? Seems like an
easier upgrade path (and we will get some nice fixes/improvements, like
official Spark 3 support, for free in the release).

WDYT?


On Wed, Mar 24, 2021 at 8:06 PM Tomo Suzuki  wrote:
>
> Update: I observe that the Java precommit check is unstable in the PR to upgrade 
> vendored gRPC (compared with a PR with an empty change). There are no constant 
> failures; sometimes it succeeds and other times it faces timeouts and flaky 
> test failures.
>
> https://github.com/apache/beam/pull/14295#issuecomment-806071087
>
>
> On Mon, Mar 22, 2021 at 10:46 AM Tomo Suzuki  wrote:
>>
>> Thank you for the voting and I see the artifact available in Maven Central. 
>> I'll work on the PR to use the published artifact today.
>> https://search.maven.org/artifact/org.apache.beam/beam-vendor-grpc-1_36_0/0.1/jar
>>
>> On Tue, Mar 16, 2021 at 3:07 PM Kenneth Knowles  wrote:
>>>
>>> Update on this: there are some minor issues and then I'll send out the RC.
>>>
>>> I think this is worth blocking 2.29.0 release on, so I will do this first. 
>>> We are still eliminating other blockers from 2.29.0 anyhow.
>>>
>>> Kenn
>>>
>>> On Mon, Mar 15, 2021 at 7:17 AM Tomo Suzuki  wrote:

 Hi Beam developers,

 I'm working on upgrading the vendored gRPC 1.36.0
 https://issues.apache.org/jira/browse/BEAM-11227 (PR: 
 https://github.com/apache/beam/pull/14028)
 Let me know if you have any questions or concerns.

 Background:
 Exchanged messages with Ismaël in BEAM-11227; it seems that the ticket 
 created by some automation is a false positive, but it's nice to use an 
 artifact without being marked with CVE.

 Kenn offered to work as the release manager (as in 
 https://s.apache.org/beam-release-vendored-artifacts) of the vendored 
 artifact.

 --
 Regards,
 Tomo
>>
>>
>>
>> --
>> Regards,
>> Tomo
>
>
>
> --
> Regards,
> Tomo


Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Ismaël Mejía
+dev

Since we all agree that we should return something different from
PDone, the real question is what we should return.
As a reminder we had a pretty interesting discussion about this
already in the past but uniformization of our return values has not
happened.
This thread is worth reading for Vincent or anyone who wants to
contribute Write transforms that return.
https://lists.apache.org/thread.html/d1a4556a1e13a661cce19021926a5d0997fbbfde016d36989cf75a07%40%3Cdev.beam.apache.org%3E

> Returning PDone is an anti-pattern that should be avoided, but changing it 
> now would be backwards incompatible.

A periodic reminder: most IOs are still Experimental, so I suppose it is
up to the maintainers to judge if the upgrade to return something
different from PDone is worth it; in that case we can deprecate and remove
the previous signature in a short time (2 releases was the average for
previous cases).


On Wed, Mar 24, 2021 at 10:24 PM Alexey Romanenko
 wrote:
>
> I thought that was said about returning a PCollection of write results as 
> it’s done in other IOs (as I mentioned as examples) that have _additional_ 
> write methods, like “withWriteResults()” etc, that return PTransform<…, 
> PCollection<…>>.
> In this case, we keep backwards compatibility and just add new functionality. 
> Though, we need to follow the same pattern for user API and maybe even naming 
> for this feature across different IOs (like we have for "readAll()” methods).
>
>  I agree that we have to avoid returning PDone for such cases.
>
> On 24 Mar 2021, at 20:05, Robert Bradshaw  wrote:
>
> Returning PDone is an anti-pattern that should be avoided, but changing it 
> now would be backwards incompatible. PRs to add non-PDone returning variants 
> (probably as another option to the builders) that compose well with Wait, 
> etc. would be welcome.
>
> On Wed, Mar 24, 2021 at 11:14 AM Alexey Romanenko  
> wrote:
>>
>> In this way, I think “Wait” PTransform should work for you but, as it was 
>> mentioned before, it doesn’t work with PDone, only with PCollection as a 
>> signal.
>>
>> Since you already adjusted your own writer for that, it would be great to 
>> contribute it back to Beam in the way as it was done for other IOs (for 
>> example, JdbcIO [1] or BigtableIO [2])
>>
>> In general, I think we need to have it for all IOs, at least to use with 
>> “Wait”, because this pattern is quite often required.
>>
>> [1] 
>> https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L1078
>> [2] 
>> https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java#L715
>>
>> On 24 Mar 2021, at 18:01, Vincent Marquez  wrote:
>>
>> No, it only needs to ensure that one record seen on Pubsub has been successfully 
>> written to a database.  So "record by record" is fine, or even "bundle".
>>
>> ~Vincent
>>
>>
>> On Wed, Mar 24, 2021 at 9:49 AM Alexey Romanenko  
>> wrote:
>>>
>>> Do you want to wait for ALL records are written for Cassandra and then 
>>> write all successfully written records to PubSub or it should be performed 
>>> "record by record"?
>>>
>>> On 24 Mar 2021, at 04:58, Vincent Marquez  wrote:
>>>
>>> I have a common use case where my pipeline looks like this:
>>> CassandraIO.readAll -> Aggregate -> CassandraIO.write -> PubSubIO.write
>>>
>>> I do NOT want my pipeline to look like the following:
>>>
>>> CassandraIO.readAll -> Aggregate -> CassandraIO.write
>>>                                  |
>>>                                  -> PubsubIO.write
>>>
>>> Because I need to ensure that only items written to Pubsub have 
>>> successfully finished a (quorum) write.
>>>
>>> Since CassandraIO.write is a PTransform I can't actually use it 
>>> here so I often roll my own 'writer', but maybe there is a recommended way 
>>> of doing this?
>>>
>>> Thanks in advance for any help.
>>>
>>> ~Vincent
>>>
>>>
>>
>
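
The "write returns a signal" idea discussed in this thread can be sketched without Beam: instead of the write stage returning nothing (PDone), it returns the successfully written records, and the publish stage consumes only those. The stand-in functions below are assumptions for illustration, not real CassandraIO/PubsubIO APIs; in real Beam this sequencing is what Wait.on and withWriteResults()-style methods enable:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Beam-free sketch of sequencing a publish after a write
// by having the write stage return its successful records as a "signal".
class WriteThenPublishSketch {
    // Stand-in for CassandraIO.write: returns the records it wrote,
    // rather than returning nothing (the PDone anti-pattern).
    static List<String> writeToStore(List<String> records, List<String> store) {
        store.addAll(records);
        return new ArrayList<>(records); // the "signal" collection
    }

    // Stand-in for PubsubIO.write: only ever sees confirmed writes.
    static List<String> publish(List<String> confirmed) {
        List<String> published = new ArrayList<>();
        for (String r : confirmed) {
            published.add("published:" + r);
        }
        return published;
    }

    public static void main(String[] args) {
        List<String> store = new ArrayList<>();
        // Publish runs strictly after the write, on the write's output.
        List<String> confirmed = writeToStore(List.of("a", "b"), store);
        System.out.println(publish(confirmed));
    }
}
```

The design point is the return type: because writeToStore returns a collection instead of nothing, the publish step can be chained linearly after it, which is exactly what a PDone-returning transform prevents.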


Re: BEAM-11023: tests failing on Spark Structured Streaming runner

2021-03-17 Thread Ismaël Mejía
Actually there are many reasons that could have produced this
regression even if the code of the runner has not changed at all: (1)
those tests weren't enabled before and now are, and they weren't
passing; or (2) the tests were changed; or (3) my principal guess: the
translation strategy of a runners-core library changed and as a side
effect the tests fail in the runner, maybe the SDF/use_deprecated_read
changes.


On Wed, Mar 17, 2021 at 4:44 PM Brian Hulette  wrote:
>
> You can look through the history of the PostCommit [1]. We only keep a couple 
> weeks of history, but it looks like we have one successful run from Sept 10, 
> 2020, marked as "keep forever", that ran on commit 
> 57055262e7a6bff447eef2df1e6efcda754939ca. Is that what you're looking for?
>
> (Somewhat related, I was under the impression that Jenkins always kept the 
> before/after runs around the last state change, but that doesn't seem to be 
> the case as the first failure we have is [3])
>
> Brian
>
> [1] 
> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/
> [2] 
> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/2049/
> [3] 
> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/2098/
>
> On Tue, Mar 16, 2021 at 4:36 PM Fernando Morales Martinez 
>  wrote:
>>
>> Hi team,
>> it is mentioned in this WI that the tests (GroupByKeyTest testLargeKeys100MB 
>> and testGroupByKeyWithBadEqualsHashCode) stopped working around five months 
>> ago.
>> I took a look at the PRs prior to that date and couldn't find a report 
>> stating that they were working.
>>
>> Is there a way to get reports from before June 2020 (the farthest back I was 
>> able to navigate) so I can compare the tests succeeding against them failing?
>>
>> Thanks a lot!
>> - Fernando Morales
>>


Re: Contributor permission for Beam Jira tickets

2021-03-16 Thread Ismaël Mejía
Hello Vitaly, What is your jira id?

On Tue, Mar 16, 2021 at 5:54 PM Vitaly Terentyev
 wrote:
>
> This is Vitaly from Akvelon.
> Could you please add me as a contributor to Beam's Jira issue tracker?
> I would like to assign some tickets for my work.
>
> Best regards,
>
> Vitaly


Re: Null checking in Beam

2021-03-15 Thread Ismaël Mejía
+1

Even if I like the strictness of null checking, I also think that
this adds too much extra time to builds (which I noticed locally
when it was enabled), and I also agree with Jan that the annotations are
really an undesired side effect. For reference, when you auto-complete
some method signatures in IntelliJ on downstream projects
with C-A-v, it generates some extra Checker annotations like @NonNull
and others even if the user isn't using them, which is not desirable.



On Mon, Mar 15, 2021 at 6:04 PM Kyle Weaver  wrote:
>>
>> Big +1 for moving this to separate CI job. I really don't like what 
>> annotations are currently added to the code we ship. Tools like Idea add 
>> these annotations to code they generate when overriding classes and that's 
>> very annoying. Users should not be exposed to internal tools like 
>> nullability checking.
>
>
> I was only planning on moving this to a separate CI job. The job would still 
> be release blocking, so the same annotations would still be required.
>
> I'm not sure which annotations you are concerned about. There are two 
> annotations involved with nullness checking, @SuppressWarnings and @Nullable. 
> @SuppressWarnings has retention policy SOURCE, so it shouldn't be exposed to 
> users at all. @Nullable is not just for internal tooling, it also provides 
> useful information about our APIs to users. The user should not have to guess 
> whether a method argument etc. can be null or not, and for better or worse, 
> these annotations are the standard way of expressing that in Java.


Re: [DISCUSS] Drop support for Flink 1.8 and 1.9

2021-03-12 Thread Ismaël Mejía
> Do we now support 1.8 through 1.12?

Yes, and that's clearly too much given that the Flink community only
supports the two latest releases.
It also hits us because we run tests for all those versions in precommit.

On Fri, Mar 12, 2021 at 7:27 PM Robert Bradshaw  wrote:
>
> Do we now support 1.8 through 1.12?
>
> Unless there are specific objections, makes sense to me.
>
> On Fri, Mar 12, 2021 at 8:29 AM Alexey Romanenko  
> wrote:
>>
>> +1 too but are there any potential objections for this?
>>
>> On 12 Mar 2021, at 11:21, David Morávek  wrote:
>>
>> +1
>>
>> D.
>>
>> On Thu, Mar 11, 2021 at 8:33 PM Ismaël Mejía  wrote:
>>>
>>> +user
>>>
>>> > Should we add a warning or something to 2.29.0?
>>>
>>> Sounds like a good idea.
>>>
>>>
>>>
>>>
>>> On Thu, Mar 11, 2021 at 7:24 PM Kenneth Knowles  wrote:
>>> >
>>> > Should we add a warning or something to 2.29.0?
>>> >
>>> > On Thu, Mar 11, 2021 at 10:19 AM Ismaël Mejía  wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> We have been supporting older versions of Flink that we had agreed in 
>>> >> previous
>>> >> discussions where we said we will be supporting only the latest three 
>>> >> releases
>>> >> [1].
>>> >>
>>> >> I would like to propose that for Beam 2.30.0 we stop supporting Flink 
>>> >> 1.8 and
>>> >> 1.9 [2].  I prepared a PR for this [3] but of course I wanted to bring 
>>> >> the
>>> >> subject here (and to user@) for your attention and in case someone has a
>>> >> different opinion or reason to still support the older versions.
>>> >>
>>> >> WDYT?
>>> >>
>>> >> Regards,
>>> >> Ismael
>>> >>
>>> >> [1] 
>>> >> https://lists.apache.org/thread.html/rfb5ac9d889d0e3f4400471de3c25000a15352bde879622c899d97581%40%3Cdev.beam.apache.org%3E
>>> >> [2] https://issues.apache.org/jira/browse/BEAM-11948
>>> >> [3] https://github.com/apache/beam/pull/14203
>>
>>


Re: [DISCUSS] Drop support for Flink 1.8 and 1.9

2021-03-11 Thread Ismaël Mejía
+user

> Should we add a warning or something to 2.29.0?

Sounds like a good idea.




On Thu, Mar 11, 2021 at 7:24 PM Kenneth Knowles  wrote:
>
> Should we add a warning or something to 2.29.0?
>
> On Thu, Mar 11, 2021 at 10:19 AM Ismaël Mejía  wrote:
>>
>> Hello,
>>
>> We have been supporting older versions of Flink that we had agreed in 
>> previous
>> discussions where we said we will be supporting only the latest three 
>> releases
>> [1].
>>
>> I would like to propose that for Beam 2.30.0 we stop supporting Flink 1.8 and
>> 1.9 [2].  I prepared a PR for this [3] but of course I wanted to bring the
>> subject here (and to user@) for your attention and in case someone has a
>> different opinion or reason to still support the older versions.
>>
>> WDYT?
>>
>> Regards,
>> Ismael
>>
>> [1] 
>> https://lists.apache.org/thread.html/rfb5ac9d889d0e3f4400471de3c25000a15352bde879622c899d97581%40%3Cdev.beam.apache.org%3E
>> [2] https://issues.apache.org/jira/browse/BEAM-11948
>> [3] https://github.com/apache/beam/pull/14203


[DISCUSS] Drop support for Flink 1.8 and 1.9

2021-03-11 Thread Ismaël Mejía
Hello,

We have been supporting older versions of Flink that we had agreed in previous
discussions where we said we will be supporting only the latest three releases
[1].

I would like to propose that for Beam 2.30.0 we stop supporting Flink 1.8 and
1.9 [2].  I prepared a PR for this [3] but of course I wanted to bring the
subject here (and to user@) for your attention and in case someone has a
different opinion or reason to still support the older versions.

WDYT?

Regards,
Ismael

[1] 
https://lists.apache.org/thread.html/rfb5ac9d889d0e3f4400471de3c25000a15352bde879622c899d97581%40%3Cdev.beam.apache.org%3E
[2] https://issues.apache.org/jira/browse/BEAM-11948
[3] https://github.com/apache/beam/pull/14203


Re: [VOTE] Release vendor-calcite-1_26_0 version 0.1, release candidate #1

2021-03-10 Thread Ismaël Mejía
Thanks Brian, I see that there are a lot of reasons to do the upgrade.
Maybe in the future it would be a good idea to explain the main
motivations of an upgrade for vendored dependencies and refer to the
relevant JIRAs.

On Tue, Mar 9, 2021 at 6:23 PM Brian Hulette  wrote:
>
> There are several jiras blocked by a Calcite upgrade. See 
> https://issues.apache.org/jira/browse/BEAM-9379
>
> On Tue, Mar 9, 2021 at 5:17 AM Ismaël Mejía  wrote:
>>
>> Just out of curiosity is there some feature we are expecting from
>> Calcite that pushes this upgrade or is this just catching up for the
>> sake of security improvements + not having old dependencies?
>>
>>
>> On Tue, Mar 9, 2021 at 12:23 AM Ahmet Altay  wrote:
>> >
>> > +1 (binding)
>> >
>> > On Mon, Mar 8, 2021 at 3:21 PM Pablo Estrada  wrote:
>> >>
>> >> +1 (binding) verified signature
>> >>
>> >> On Mon, Mar 8, 2021 at 12:05 PM Kai Jiang  wrote:
>> >>>
>> >>> +1 (non-binding)
>> >>>
>> >>> On Mon, Mar 8, 2021 at 11:37 AM Andrew Pilloud  
>> >>> wrote:
>> >>>>
>> >>>> Hi everyone,
>> >>>> Please review and vote on the release candidate #1 for 
>> >>>> beam-vendor-calcite-1_26_0 version 0.1, as follows:
>> >>>> [ ] +1, Approve the release
>> >>>> [ ] -1, Do not approve the release (please provide specific comments)
>> >>>>
>> >>>>
>> >>>> Reviewers are encouraged to test their own use cases with the release 
>> >>>> candidate, and vote +1 if no issues are found.
>> >>>>
>> >>>> The complete staging area is available for your review, which includes:
>> >>>> * the official Apache source release to be deployed to dist.apache.org 
>> >>>> [1], which is signed with the key with fingerprint 
>> >>>> 9E7CEC0661EFD610B632C610AE8FE17F9F8AE3D4 [2],
>> >>>> * all artifacts to be deployed to the Maven Central Repository [3],
>> >>>> * source code commit "0f52187e344dad9bca4c183fe51151b733b24e35" [4].
>> >>>>
>> >>>> The vote will be open for at least 72 hours. It is adopted by majority 
>> >>>> approval, with at least 3 PMC affirmative votes.
>> >>>>
>> >>>> Thanks,
>> >>>> Andrew
>> >>>>
>> >>>> [1] 
>> >>>> https://dist.apache.org/repos/dist/dev/beam/vendor/beam-vendor-calcite-1_26_0/0.1/
>> >>>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>> >>>> [3] 
>> >>>> https://repository.apache.org/content/repositories/orgapachebeam-1163/
>> >>>> [4] 
>> >>>> https://github.com/apache/beam/tree/0f52187e344dad9bca4c183fe51151b733b24e35


Re: Debezium integration

2021-03-09 Thread Ismaël Mejía
Hello Gunnar,

Thanks for the message and willingness to collaborate. Most connectors
in Beam are named after the target system name + the IO suffix,
e.g. KafkaIO, PubsubIO, KinesisIO, etc., so naming it DebeziumIO makes
sense from the Beam side. So far nobody has requested us to rename a
Beam connector, because it is assumed that if the code of the connector
resides in the Apache project repository, the maintenance comes from
the Beam community.

I don't know what others in the community think; I suppose it is still
possible to rename it since the component has not been released yet.
It will be released in Beam 2.29.0 (branch cut tomorrow).

The real question is what name would be appropriate while still allowing
users to find it easily, especially as a single-word name. Any suggestions? Or
would making it clear in the component documentation + webpage that
this component is not developed by the Debezium project be enough for
the Debezium community, so we (Beam) can use the name?

Best,
Ismaël


On Tue, Mar 9, 2021 at 12:09 PM Gunnar Morling
 wrote:
>
> Hi,
>
> I'm the lead of the Debezium project and just saw that the Apache Beam 
> community is working on integrating Debezium into Beam. That's really 
> exciting. and I'd love to see a demo of this. If there's anything the 
> Debezium community can do in order to help with this, please let us know on 
> the Debezium mailing list [1].
>
> What I'd like to kindly ask for though is finding a name for this component 
> other than "DebeziumIO", as currently used in the README [2]. This may 
> suggest that this is an effort by the Debezium community itself, and we 
> should avoid any potential confusion here. A name like "Apache Beam connector 
> for Debezium" would be fine.
>
> Thanks a lot,
>
> --Gunnar
>
> [1] https://groups.google.com/g/debezium
> [2] https://github.com/apache/beam/tree/master/sdks/java/io/debezium/src
>


Re: [VOTE] Release vendor-calcite-1_26_0 version 0.1, release candidate #1

2021-03-09 Thread Ismaël Mejía
Just out of curiosity is there some feature we are expecting from
Calcite that pushes this upgrade or is this just catching up for the
sake of security improvements + not having old dependencies?


On Tue, Mar 9, 2021 at 12:23 AM Ahmet Altay  wrote:
>
> +1 (binding)
>
> On Mon, Mar 8, 2021 at 3:21 PM Pablo Estrada  wrote:
>>
>> +1 (binding) verified signature
>>
>> On Mon, Mar 8, 2021 at 12:05 PM Kai Jiang  wrote:
>>>
>>> +1 (non-binding)
>>>
>>> On Mon, Mar 8, 2021 at 11:37 AM Andrew Pilloud  wrote:

 Hi everyone,
 Please review and vote on the release candidate #1 for 
 beam-vendor-calcite-1_26_0 version 0.1, as follows:
 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)


 Reviewers are encouraged to test their own use cases with the release 
 candidate, and vote +1 if no issues are found.

 The complete staging area is available for your review, which includes:
 * the official Apache source release to be deployed to dist.apache.org 
 [1], which is signed with the key with fingerprint 
 9E7CEC0661EFD610B632C610AE8FE17F9F8AE3D4 [2],
 * all artifacts to be deployed to the Maven Central Repository [3],
 * source code commit "0f52187e344dad9bca4c183fe51151b733b24e35" [4].

 The vote will be open for at least 72 hours. It is adopted by majority 
 approval, with at least 3 PMC affirmative votes.

 Thanks,
 Andrew

 [1] 
 https://dist.apache.org/repos/dist/dev/beam/vendor/beam-vendor-calcite-1_26_0/0.1/
 [2] https://dist.apache.org/repos/dist/release/beam/KEYS
 [3] https://repository.apache.org/content/repositories/orgapachebeam-1163/
 [4] 
 https://github.com/apache/beam/tree/0f52187e344dad9bca4c183fe51151b733b24e35


Re: Contributor permission for Beam Jira tickets

2021-03-07 Thread Ismaël Mejía
Done, Welcome to Beam!

On Sun, Mar 7, 2021 at 8:09 AM Manav Garg  wrote:
>
> Hi,
>
> This is Manav from Google. I plan on taking up BEAM-4152 for adding session 
> windowing support to the Go SDK. Can someone add me as a contributor for Beam's 
> Jira issue
> tracker? My ASF Jira username would be "manavgarg".
>
> Thanks a lot.
>
> Regards,
> Manav


Re: Migrate S3FileSystem

2021-02-10 Thread Ismaël Mejía
Hello,

Thanks for the contribution Raphael.
I have been a bit busy this week, but I will take a look as soon as I can
probably end of the week/beginning of next. Sorry.

Best,
Ismaël


On Wed, Feb 10, 2021 at 10:39 AM Raphael Sanamyan <
raphael.sanam...@akvelon.com> wrote:

> Hello Ismaël,
>
>
> I have finished the task "Migrate S3FileSystem to AWS SDK for Java 2" and
> made the PR https://github.com/apache/beam/pull/13914. Could you please
> review this PR or suggest somebody who could do it?
>
>
> Thank you,
>
> Raphael.
>
> --
> *От:* Raphael Sanamyan
> *Отправлено:* 29 января 2021 г. 1:44:17
> *Кому:* dev@beam.apache.org
> *Копия:* Ilya Kozyrev
> *Тема:* Re: Migrate S3FileSystem
>
>
> Hello Ismaël,
>
>
> Thank you for such a quick response. If the main task is to adapt the beam
> classes to the new AWS API, then I have no questions and I will start the
> task and send out a PR for the review soon.
>
> Thank you,
> Raphael.
>
> --
> *От:* Ismaël Mejía 
> *Отправлено:* 28 января 2021 г. 15:37:04
> *Кому:* dev
> *Тема:* Re: Migrate S3FileSystem
>
> Hello Raphael,
>
> You don't need to change the version of the SDK because at the moment
> we do support AWS SDK for Java 2, you just have to put the classes in
> the correct module.
>
> https://github.com/apache/beam/tree/master/sdks/java/io/amazon-web-services2
>
>
> The expected outcome is just to reproduce what S3FileSystem.java does
> for the amazon-web-services module; the main task is to adapt the Beam
> classes to the new AWS API.
>
> If more doubts don't hesitate to ask.
>
> Best,
> Ismaël
>
> On Thu, Jan 28, 2021 at 11:38 AM Raphael Sanamyan
>  wrote:
> >
> > Hi, community,
> >
> >
> > I'm going to implement a task "Migrate S3FileSystem to AWS SDK for Java
> 2". I'm planning to change the version of SDK to the new one and to fix
> troubles in case they appear. If anyone has any details of this task, it
> would be nice if you share them, since there are no definite requirements
> and comments at the task's description.
> >
> >
> > Thank you,
> >
> > Raphael.
>


Re: Builds Meeting this Thursday

2021-02-08 Thread Ismaël Mejía
Just for reference and related to this thread. It seems we may end up
also having this queue issue (even if we don't fully move to Github
actions).
"For Apache projects, starting December 2020 we are experiencing a
high strain of GitHub Actions jobs. All Apache projects are sharing
180 jobs and as more projects are using GitHub Actions the job queue
becomes a serious bottleneck."

An interesting document shared recently on builds@ goes deeper on how
the Airflow project is dealing with this:
https://docs.google.com/document/d/1ZZeZ4BYMNX7ycGRUKAXv0s6etz1g-90Onn5nRQQHOfE/edit#

On Mon, Jan 18, 2021 at 1:28 PM Elliotte Rusty Harold
 wrote:
>
> On Mon, Jan 18, 2021 at 10:49 AM Ismaël Mejía  wrote:
> >
> > Thanks for sharing this Pablo. This looks super interesting. We should
> > see if it could make sense to migrate our Jenkins infra to GitHub
> > Actions given that it is free and quickly becoming the new 'standard'.
> > Good point: it is 'free' because we will bring our machines and Google
> > pays :). Bad point: we will become 100% github dependent.
> >
>
> Github actions have a really big advantage over Jenkins: they run on
> forks, not just branches. This is very useful to non-commmiter
> contributors.
>
> On the minus side it's not clear if one can see the logs from the
> integration tests, which is blocking some work in the
> maven-site-plugin:
>
> https://github.com/apache/maven-site-plugin/pull/34#issuecomment-762207488
>
> --
> Elliotte Rusty Harold
> elh...@ibiblio.org


Re: Migrate S3FileSystem

2021-01-28 Thread Ismaël Mejía
Hello Raphael,

You don't need to change the version of the SDK because at the moment
we do support AWS SDK for Java 2, you just have to put the classes in
the correct module.
https://github.com/apache/beam/tree/master/sdks/java/io/amazon-web-services2

The expected outcome is just to reproduce what S3FileSystem.java does
for the amazon-web-services module; the main task is to adapt the Beam
classes to the new AWS API.

If more doubts don't hesitate to ask.

Best,
Ismaël

On Thu, Jan 28, 2021 at 11:38 AM Raphael Sanamyan
 wrote:
>
> Hi, community,
>
>
> I'm going to implement a task "Migrate S3FileSystem to AWS SDK for Java 2". 
> I'm planning to change the version of SDK to the new one and to fix troubles 
> in case they appear. If anyone has any details of this task, it would be nice 
> if you share them, since there are no definite requirements and comments at 
> the task's description.
>
>
> Thank you,
>
> Raphael.


Re: Multiple architectures support on Beam (ARM)

2021-01-27 Thread Ismaël Mejía
>
>>> On Tue, Jan 26, 2021, 10:25 AM Robert Bradshaw  wrote:
>>>>
>>>> +1
>>>>
>>>> I don't think it would be that hard to build and release arm-based docker 
>>>> images. (Perhaps just a matter of changing the docker file to depend on a 
>>>> different base, and doing some cross-compile. That would suss out whether 
>>>> we're inadvertently taking on any incompatible dependencies.)
>>>>
>>>> Theoretically, if one does that and manually specifies the container, it 
>>>> could just work for Python (assuming no wheel files are specified as 
>>>> manual dependencies). For Java, if one builds/deploys an uberjar (on a 
>>>> different architecture), there may be issues in any transitive dependency 
>>>> that has JNI code (us or users). I'd imagine this issue is common to and 
>>>> being explored by many of the other Java big data systems in use; it'd be 
>>>> interesting to know what solutions are out there.
>>>>
>>>> For go, the executable is uploaded directly into the container. We'd 
>>>> probably have to do something fancier like cross-compiling the executable 
>>>> (and making sure the UserFn references, which I think are just pointers 
>>>> into the binary, still work if the launcher is one architecture and the 
>>>> workers another).
>>>>
>>>> Definitely worth exploring.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jan 26, 2021 at 10:09 AM Ismaël Mejía  wrote:
>>>>>
>>>>> I stumbled today on this user request:
>>>>> BEAM-10982 Wheel support for linux aarch64
>>>>>
>>>>> It made me wonder if with the advent of ARM64 processors not only in
>>>>> the client but server side (Graviton and others) if it is worth that
>>>>> we start to think about having support for this architecture on the
>>>>> python installers and in the docker images. It seems that for the
>>>>> latter it should not be that difficult given that our parent images
>>>>> are already multi-arch.
>>>>>
>>>>> Are there some possible issues or binary/platform specific
>>>>> dependencies that impede us from doing this?
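[Editorial note on the Go cross-compiling point quoted above: the toolchain
side of the cross-compile is the straightforward half. A sketch of building a
pipeline binary for a linux/arm64 worker from another host — the output name
and package path are illustrative, not Beam's actual build scripts:]

```shell
# Sketch only: cross-compile a Go pipeline binary for linux/arm64.
# CGO_ENABLED=0 avoids needing an arm64 C cross-toolchain; any cgo
# dependency would reintroduce the JNI-style issues noted for Java above.
GOOS=linux GOARCH=arm64 CGO_ENABLED=0 go build -o pipeline_arm64 ./cmd/pipeline
```

[The open question raised in the thread — whether UserFn references resolve
correctly when the launcher and workers run different architectures — is not
answered by the build step itself.]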


Multiple architectures support on Beam (ARM)

2021-01-26 Thread Ismaël Mejía
I stumbled today on this user request:
BEAM-10982 Wheel support for linux aarch64

It made me wonder if with the advent of ARM64 processors not only in
the client but server side (Graviton and others) if it is worth that
we start to think about having support for this architecture on the
python installers and in the docker images. It seems that for the
latter it should not be that difficult given that our parent images
are already multi-arch.

Are there some possible issues or binary/platform specific
dependencies that impede us from doing this?
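[Editorial note: for the docker images discussed above, multi-arch publishing
is typically done with Docker's buildx tooling, given that the parent images
are already multi-arch. A rough sketch — the builder name, image tag, and
build-context path are placeholders, not Beam's actual release tooling:]

```shell
# Sketch only: build one manifest covering amd64 and arm64 from a
# multi-arch parent image, then push it. Requires a buildx builder
# with both platforms enabled (e.g. via QEMU/binfmt).
docker buildx create --name multiarch --use
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag example.org/beam/python3.8_sdk:latest \
  --push \
  sdks/python/container
```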


Re: [ANNOUNCE] New committer: Piotr Szuberski

2021-01-22 Thread Ismaël Mejía
Congratulations Piotr ! Thanks for all your work !

On Fri, Jan 22, 2021 at 5:33 PM Alexey Romanenko
 wrote:
>
> Hi everyone,
>
> Please join me and the rest of the Beam PMC in welcoming a new committer: 
> Piotr Szuberski .
>
> Piotr started to contribute to Beam about one year ago and he did it very 
> actively since then. He contributed to the different areas, like adding a 
> cross-language functionality to existing IOs, improving ITs and performance 
> tests environment/runtime, he actively worked on dependency updates [1].
>
> In consideration of his contributions, the Beam PMC trusts him with the 
> responsibilities of a Beam committer [2].
>
> Thank you for your contributions, Piotr!
>
> -Alexey, on behalf of the Apache Beam PMC
>
> [1] https://github.com/apache/beam/pulls?q=is%3Apr+author%3Apiotr-szuberski
> [2] 
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>
>


Re: [ANNOUNCE] New PMC Member: Chamikara Jayalath

2021-01-22 Thread Ismaël Mejía
Congrats Cham, well deserved!


On Fri, Jan 22, 2021 at 9:02 AM Michał Walenia 
wrote:

> Congratulations, Cham! Thanks for your work!
>
>
> On Fri, Jan 22, 2021 at 3:13 AM Charles Chen  wrote:
>
>> Congrats Cham!
>>
>> On Thu, Jan 21, 2021, 5:39 PM Chamikara Jayalath 
>> wrote:
>>
>>> Thanks everybody :)
>>>
>>> - Cham
>>>
>>> On Thu, Jan 21, 2021 at 5:22 PM Pablo Estrada 
>>> wrote:
>>>
 Yoohoo Cham : )

 On Thu, Jan 21, 2021 at 5:20 PM Udi Meiri  wrote:

> Congrats Cham!
>
> On Thu, Jan 21, 2021 at 4:25 PM Griselda Cuevas 
> wrote:
>
>> Congratulations Cham!!! Well deserved :)
>>
>> On Thu, 21 Jan 2021 at 15:23, Connell O'Callaghan <
>> conne...@google.com> wrote:
>>
>>> Well done Cham!!! Thank you for all your contributions to date!!!
>>>
>>>
>>> On Thu, Jan 21, 2021 at 3:18 PM Rui Wang  wrote:
>>>
 Congratulations, Cham!

 -Rui

 On Thu, Jan 21, 2021 at 3:15 PM Robert Bradshaw <
 rober...@google.com> wrote:

> Congratulations, Cham!
>
> On Thu, Jan 21, 2021 at 3:13 PM Brian Hulette 
> wrote:
>
>> Great news, congratulations Cham!
>>
>> On Thu, Jan 21, 2021 at 3:08 PM Robin Qiu 
>> wrote:
>>
>>> Congratulations, Cham!
>>>
>>> On Thu, Jan 21, 2021 at 3:05 PM Tyson Hamilton <
>>> tyso...@google.com> wrote:
>>>
 Woo! Congrats Cham!

 On Thu, Jan 21, 2021 at 3:02 PM Robert Burke <
 rob...@frantil.com> wrote:

> Congratulations! That's fantastic news.
>
> On Thu, Jan 21, 2021, 2:59 PM Reza Rokni 
> wrote:
>
>> Congratulations!
>>
>> On Fri, Jan 22, 2021 at 6:58 AM Ankur Goenka <
>> goe...@google.com> wrote:
>>
>>> Congrats Cham!
>>>
>>> On Thu, Jan 21, 2021 at 2:57 PM Ahmet Altay <
>>> al...@google.com> wrote:
>>>
 Hi all,

 Please join me and the rest of Beam PMC in welcoming
 Chamikara Jayalath as our
 newest PMC member.

 Cham has been part of the Beam community from its early
 days and contributed to the project in significant ways, 
 including
 contributing new features and improvements especially related 
 Beam IOs,
 advocating for users, and mentoring new community members.

 Congratulations Cham! And thanks for being a part of Beam!

 Ahmet

>>>
>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>


Re: Making preview (sample) time consistent on Direct runner

2021-01-21 Thread Ismaël Mejía
Thanks Kenn! That sounds like a good and achievable strategy to get
the first/limit results. I will check the code to see if we can reuse
this logic; the extra question is whether we can fit this into the direct
runner for the general use case (not only SQL), maybe via some
PipelineOptions of the runner.

> Note that both of these don't solve the issue that Read + GBK + take(N) would 
> have to do the full Read+GBK for a batch pipeline.

Just to confirm that I understand correctly Robert, you mention this
for example for the case of IOs where we can match 1000s of
`ReadableFiles` and we will necessarily end up distributing and
reading the thousands until we have the take(N) results. You mean we
cannot avoid this.

I was wondering if with SDF we could have a generic solution
(specially now that most translations are based on SDF), maybe some
sort of 'BoundedRestrictionTracker' to deal with the limit and then
stop producing output. Maybe Boyuan, Luke or Robert can have an idea
if this approach is really viable or there can be issues. I am saying
this in the context of finding a solution for all runners.


On Thu, Jan 21, 2021 at 8:34 PM Robert Bradshaw  wrote:
>
> I don't know that SDF vs. BoundedSources changes things here--for both one 
> can implement take(n) by running until one has N elements and then canceling 
> the pipeline.
>
> One could have a more sophisticated First(n) operator that has a "back-edge" 
> to checkpoint/splits the upstream operators once a sufficient number of 
> elements has been observed.
>
> Note that both of these don't solve the issue that Read + GBK + take(N) would 
> have to do the full Read+GBK for a batch pipeline.
>
> On Thu, Jan 21, 2021 at 10:25 AM Kenneth Knowles  wrote:
>>
>> I forgot two things:
>>
>> 1. I totally agree that this is a good opportunity to make Beam more useful. 
>> Different engines sometimes have similar abilities, but making it 
>> available across the runners and xlang transforms, etc, is way cool.
>> 2. You can of course do the same trick for a distributed runner by using a 
>> message queue between the pipeline and the controller program. And 
>> interactive Beam Java, or improving/unifying the concepts between 
>> Python/Java/SQL (Go?) would be great. Not sure how much code can be reused.
>>
>> Kenn
>>
>> On Thu, Jan 21, 2021 at 10:15 AM Kenneth Knowles  wrote:
>>>
>>> I think the approach used in the SQL CLI to implement a LIMIT clause may 
>>> work for some cases. It only works in the same process with the 
>>> DirectRunner. It doesn't sample at the source, because you never know what 
>>> will happen in the query. Instead it counts outputs and then cancels the 
>>> job when it has enough: 
>>> https://github.com/apache/beam/blob/a72460272354747a54449358f5df414be4b6d72c/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamEnumerableConverter.java#L200
>>>
>>> However if your pipeline starts with a read of 1000s of files it may be a 
>>> different pattern for invoking SDF:
>>>
>>> 1. initial splits do not matter much, probably
>>> 2. you want to checkpoint and emit values so that the end output of the 
>>> pipeline can receive them to cancel it; you don't want to read a whole 
>>> restriction like in a batch case
>>>
>>> I don't know the status of this, if it needs special treatment or not. 
>>> There may also be the issue that SDF is more actively developed in portable 
>>> runners and less so in classic runners.
>>>
>>> Kenn
>>>
>>> On Wed, Jan 6, 2021 at 9:05 AM Ismaël Mejía  wrote:
>>>>
>>>> > Those are good points. Do you know if the Interactive Runner has been 
>>>> > tried in those instances? If so, what were the shortcomings?
>>>>
>>>> I am not aware of experiences or shortcomings with the Interactive
>>>> Runner. The issue is that the Interactive runner is based on python
>>>> and all the tools I mention above are Java-based so Python probably
>>>> won't be a valid alternative.
>>>>
>>>> What is concerning for me is that in other similar systems (e.g.
>>>> Spark, Flink) a developer can consistently do a `.take(n)` read from a
>>>> data source and have results in constant time almost independently of
>>>> the size of the targeted data. This allows developers to iterate faster
>>>> and improves the developer experience.
>>>>
>>>> What is not clear for me yet is how we can achieve this in a clean
>>>> way, given all the 'wrappings' we already have in translation time. I
>>>> don't know if there could be a way to override some default
>>>> translation(s) to achieve this. Any ideas maybe?
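[Editorial note: the count-then-cancel approach Kenn describes above for the
SQL CLI's LIMIT — count outputs on the consumer side, cancel the job once
enough have arrived — can be illustrated outside of Beam. This is deliberately
not Beam code, just the control flow, sketched with plain Python threads:]

```python
import queue
import threading

def take_n_with_cancel(produce, n):
    """Collect the first n results from a background 'pipeline', then
    signal cancellation -- the same control flow as counting outputs
    and cancelling the job once enough have been observed."""
    out = queue.Queue()
    cancel = threading.Event()

    def runner():
        # Stands in for the running pipeline pushing results back.
        for element in produce(cancel):
            out.put(element)

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()

    results = [out.get() for _ in range(n)]
    cancel.set()  # "cancel the pipeline" once we have enough output
    worker.join(timeout=5)
    return results

def slow_source(cancel):
    # Stands in for an expensive or unbounded read; it checks the
    # cancellation signal between elements instead of reading all input.
    i = 0
    while not cancel.is_set():
        yield i
        i += 1

print(take_n_with_cancel(slow_source, 5))  # prints [0, 1, 2, 3, 4]
```

[As noted in the thread, this only avoids *producing* more output; a batch
Read + GBK upstream would still run in full before the first element arrives.]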

Re: JIRA access (and hello!)

2021-01-19 Thread Ismaël Mejía
You were added to the Contributors group so you can now self assign issues.
Welcome to the project!

On Tue, Jan 19, 2021 at 3:50 PM David Huntsperger
 wrote:
>
> Hey Beam devs,
>
> I'm a maintainer of the Dataflow documentation, and I'd like to do some work 
> on Beam doc issues as well. May I have permission to assign issues to myself?
>
> I created a JIRA user for myself: David Huntsperger (pcoet). And this is me 
> on GitHub: https://github.com/pcoet
>
> Thanks,
>
> David
>
> P.S. I know we're working on re-architecting the website. I'll watch for code 
> freezes and etc.


Re: Builds Meeting this Thursday

2021-01-18 Thread Ismaël Mejía
Thanks for sharing this Pablo. This looks super interesting. We should
see if it could make sense to migrate our Jenkins infra to GitHub
Actions given that it is free and quickly becoming the new 'standard'.
Good point: it is 'free' because we will bring our machines and Google
pays :). Bad point: we will become 100% github dependent.

Related: In Avro we moved to github actions recently but we preserve a
docker based build workflow as a backup, but the Avro build is really
simple compared to Beam's.

Tobias it seems you did not copy builds@ or gavin in your answer (save
if it is in BCC)

Are there any interesting news after the meeting?

On Thu, Jan 14, 2021 at 9:54 AM Tobiasz Kędzierski
 wrote:
>
> Thanks Gavin for the detailed response.
>
> > I'll ensure it gets looked into. If you could edit the meeting wiki page
> with the above even better :)
>
> Unfortunately I don't have permissions to edit cwiki.
>
> BR
> Tobiasz
>
> On Thu, Jan 14, 2021 at 1:35 AM Pablo Estrada  wrote:
>>
>> Hi all,
>> I've found out about this presentation on Apache Builds. Sharing with dev@ 
>> in case it's of interest to anyone. See the agenda: 
>> https://cwiki.apache.org/confluence/display/INFRA/ASF+Builds+Agenda+2021-01-14
>> Best
>> -P.
>>
>> -- Forwarded message -
>> From: Jarek Potiuk 
>> Date: Tue, Jan 12, 2021 at 1:45 PM
>> Subject: Fwd: Builds Meeting this Thursday
>> To: 
>>
>>
>> For those who are interested in performance/security of Github Actions, 
>> there is a meeting of "builds" for ASF this Thursday - where Brian Douglas 
>> Github's Staff Developers Advocate will be present.
>>
>> J.
>>
>> -- Forwarded message -
>> From: Jarek Potiuk 
>> Date: Tue, Jan 12, 2021 at 10:42 PM
>> Subject: Re: Builds Meeting this Thursday
>> To: , 
>>
>>
>> Added my two topics. Thanks Gavin for this opportunity !
>>
>> On Tue, Jan 12, 2021 at 10:20 PM Gavin McDonald  wrote:
>> Please list on the cwiki meeting page any questions you have
>> for Brian so that I may send them to him ahead of time, if
>> possible.
>>
>> On Tue, Jan 12, 2021 at 10:00 PM Gavin McDonald 
>> wrote:
>>
>> > Hi All,
>> >
>> > Cwiki page -
>> > https://cwiki.apache.org/confluence/display/INFRA/ASF+Builds+Agenda+2021-01-14
>> >
>> >
>> >
>> > On Tue, Jan 12, 2021 at 9:53 PM Gavin McDonald 
>> > wrote:
>>> >
>>> >> Hi All,
>>> >>
>>> >> Sorry for the last minute notice, this Thursday the 14th January
>>> >> at 1700 UTC time will be our next builds@ meeting.
>>> >>
>>> >> Just confirmed a few minutes ago, there will be a guest
>>> >> representing Github on the call to talk about and answer
>>> >> questions around Github Actions.
>>> >>
>>> >> The remainder of the call are self set topics , so whatever
>>> >> else you guys want to cover.
>>> >>
>>> >> We will try Jitsi again, will a fallback url provided.
>>> >> More details to follow
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >>
>>> >> *Gavin McDonald*
>>> >> Systems Administrator
>>> >> ASF Infrastructure Team
>>> >>
>>> >
>>> >
>>> > --
>>> >
>>> > *Gavin McDonald*
>>> > Systems Administrator
>>> > ASF Infrastructure Team
>>> >
>>>
>>>
>>> --
>>>
>>> *Gavin McDonald*
>>> Systems Administrator
>>> ASF Infrastructure Team
>>
>>
>>
>> --
>> +48 660 796 129
>>
>>
>> --
>> +48 660 796 129


Re: [VOTE] Release 2.27.0, release candidate #4

2021-01-07 Thread Ismaël Mejía
> Also I wonder if we now need to clarify both Java 8 and Java 11 versions 
> separately?

You mean for the docker images? Otherwise we should not be using Java
11 at all to produce the artifacts.

On Thu, Jan 7, 2021 at 4:51 PM Valentyn Tymofieiev  wrote:
>
> Noting that announcement does not include the version of the Java compilers 
> used - looks like the release guide still requires it:
>
> * Java artifacts were built with Maven MAVEN_VERSION and OpenJDK/Oracle JDK 
> JDK_VERSION.
>
>
> Could you please add this info to this thread for posterity?
>
> Also I wonder if we now need to clarify both Java 8 and Java 11 versions 
> separately?
>
> Other than that, +1 from me. Ran several mobile gaming pipelines on Direct 
> and Dataflow runners with Python 3.8.
>
> On Thu, Jan 7, 2021 at 12:49 AM Jan Lukavský  wrote:
>>
>> +1 (non-binding).
>>
>> I've validated the RC against my dependent projects (mainly Java SDK, Flink 
>> and DirectRunner).
>>
>> Thanks,
>>
>>  Jan
>>
>> On 1/7/21 2:15 AM, Ahmet Altay wrote:
>>
>> +1 (binding) - validated python quickstarts.
>>
>> Thank you Pablo.
>>
>> On Wed, Jan 6, 2021 at 1:57 PM Pablo Estrada  wrote:
>>>
>>> +1 (binding)
>>> I've built and unit tested existing Dataflow Templates with the new version.
>>> Best
>>> -P.
>>>
>>> On Tue, Jan 5, 2021 at 11:17 PM Pablo Estrada  wrote:

 Hi everyone,
 Please review and vote on the release candidate #4 for the version 2.27.0, 
 as follows:
 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)

 NOTE. What happened to RC #2? I started building RC2 before completing all 
 the cherry-picks, so the tag for RC2 was created on an incorrect commit.

 NOTE. What happened to RC #3? I started building RC3, but a new bug was 
 discovered (BEAM-11569) that required amending the branch. Thus this is 
 now RC4.

 Reviewers are encouraged to test their own use cases with the release 
 candidate, and vote +1
  if no issues are found.

 The complete staging area is available for your review, which includes:
 * JIRA release notes [1],
 * the official Apache source release to be deployed to dist.apache.org 
 [2], which is signed with the key with fingerprint 
 C79DDD47DAF3808F0B9DDFAC02B2D9F742008494 [3],
 * all artifacts to be deployed to the Maven Central Repository [4],
 * source code tag "v2.27.0-RC4" [5],
 * website pull request listing the release [6], publishing the API 
 reference manual [7], and the blog post [8].
 * Python artifacts are deployed along with the source release to the 
 dist.apache.org [2].
 * Validation sheet with a tab for 2.27.0 release to help with validation 
 [9].
 * Docker images published to Docker Hub [10].

 The vote will be open for at least 72 hours, but given the holidays, we 
 will likely extend for a few more days. The release will be adopted by 
 majority approval, with at least 3 PMC affirmative votes.

 Thanks,
 -P.

 [1] 
 https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349380
 [2] https://dist.apache.org/repos/dist/dev/beam/2.27.0/
 [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 [4] https://repository.apache.org/content/repositories/orgapachebeam-1149/
 [5] https://github.com/apache/beam/tree/v2.27.0-RC4
 [6] https://github.com/apache/beam/pull/13602
 [7] https://github.com/apache/beam-site/pull/610
 [8] https://github.com/apache/beam/pull/13603
 [9] 
 https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=194829106
 [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image


Re: Making preview (sample) time consistent on Direct runner

2021-01-06 Thread Ismaël Mejía
> Those are good points. Do you know if the Interactive Runner has been tried 
> in those instances? If so, what were the shortcomings?

I am not aware of experiences or shortcomings with the Interactive
Runner. The issue is that the Interactive runner is based on python
and all the tools I mention above are Java-based so Python probably
won't be a valid alternative.

What is concerning for me is that in other similar systems (e.g.
Spark, Flink) a developer can consistently do a `.take(n)` read from a
data source and have results in constant time almost independently of
the size of the targeted data. This allows developers to iterate faster
and improves the developer experience.

What is not clear for me yet is how we can achieve this in a clean
way, given all the 'wrappings' we already have in translation time. I
don't know if there could be a way to override some default
translation(s) to achieve this. Any ideas maybe?


On Tue, Jan 5, 2021 at 10:26 PM Sam Rohde  wrote:
>
> Hi Ismael,
>
> Those are good points. Do you know if the Interactive Runner has been tried 
> in those instances? If so, what were the shortcomings?
>
> I can also see the use of sampling for a performance benchmarking reason. We 
> have seen others send in known elements which are tracked throughout the 
> pipeline to generate timings for each transform/stage.
>
> -Sam
>
> On Fri, Dec 18, 2020 at 8:24 AM Ismaël Mejía  wrote:
>>
>> Hello,
>>
>> The use of direct runner for interactive local use cases has increased
>> with the years on Beam due to projects like Scio, Kettle/Hop and our
>> own SQL CLI. All these tools have in common one thing, they show a
>> sample of some source input to the user and interactively apply
>> transforms to it to help users build Pipelines more rapidly.
>>
>> If you build a pipeline today to produce this sample using the Beam’s
>> Sample transform from a set of files, the read of the files happens
>> first and then the sample, so the more files or the bigger they are
>> the longer it takes to produce the sample even if the number of
>> elements expected to read is constant.
>>
>> During Beam Summit last year there were some discussions about how we
>> could improve this scenario (and others) but I have the impression no
>> further discussions happened in the mailing list, so I wanted to know
>> if there are some ideas about how we can get direct runner to improve
>> this case.
>>
>> It seems to me that we can still ‘force’ the count with some static
>> field because it is not a distributed case but I don’t know how we can
>> stop reading once we have the number of sampled elements in a generic
>> way, specially now it seems to me a bit harder to do with pure DoFn
>> (SDF) APIs vs old Source ones, but well that’s just a guess.
>>
>> Does anyone have an idea of how could we generalize this and of course
>> if you see the value of such use case, other ideas for improvements?
>>
>> Regards,
>> Ismaël


Re: [VOTE] Release 2.27.0, release candidate #1

2020-12-28 Thread Ismaël Mejía
It seems we are not publishing the latest versions of the Flink Job
Server (Flink 1.11 and 1.12) docker images

These do not exist:

https://hub.docker.com/r/apache/beam_flink1.11_job_server
https://hub.docker.com/r/apache/beam_flink1.12_job_server

but this does and has the good 2.27.0_rc1 tag:
https://hub.docker.com/r/apache/beam_flink1.10_job_server

I wonder if the issue might be related to the fact that we need to
request the repo to be created or if it is something different.

On Thu, Dec 24, 2020 at 5:33 PM Brian Hulette  wrote:
>
> +Boyuan Zhang helped me get to the bottom of the sql_taxi issue. The problem 
> is with the WriteStringsToPubSub API, which is deprecated since 2.7.0, but 
> used in the example. Boyuan has [1] out to fix WriteStringsToPubSub and I 
> just sent [2] to replace WriteStringsToPubSub with WriteToPubSub in example 
> code. Issue is tracked in [3].
>
> [1] https://github.com/apache/beam/pull/13614
> [2] https://github.com/apache/beam/pull/13615
> [3] https://issues.apache.org/jira/browse/BEAM-11524
>
> On Thu, Dec 24, 2020 at 8:26 AM Pablo Estrada  wrote:
>>
>> Alright! Thanks everyone for your validations. I'm cancelling this RC, and 
>> I'll perform cherry picks to prepare the next one.
>>
>> Please update this thread with any other cherry pick requests!
>> -P.
>>
>> On Thu, Dec 24, 2020, 3:17 AM Ismaël Mejía  wrote:
>>>
>>> It might be a good idea to include also:
>>>
>>> [BEAM-11403] Cache UnboundedReader per UnboundedSourceRestriction in
>>> SDF Wrapper DoFn
>>> https://github.com/apache/beam/pull/13592
>>>
>>> So Java development experience is less affected (as with 2.26.0) (There
>>> is a flag to exclude but defaults matter).
>>>
>>> On Thu, Dec 24, 2020 at 2:56 AM Valentyn Tymofieiev  
>>> wrote:
>>> >
>>> > We discovered a regression on CombineFn.from_callable() started in 
>>> > 2.26.0. Even though it's not a regression in 2.27.0, I strongly prefer we 
>>> > fix it in 2.27.0 as it leads to buggy behavior, so I vote -1.
>>> >
>>> > The fix to release branch is in flight: 
>>> > https://github.com/apache/beam/pull/13613.
>>> >
>>> >
>>> >
>>> > On Wed, Dec 23, 2020 at 3:38 PM Brian Hulette  wrote:
>>> >>
>>> >> -1 (non-binding)
>>> >> Good news: I validated a dataframe pipeline on Dataflow which looked 
>>> >> good (with expected performance improvements!)
>>> >> Bad news: I also tried to run the sql_taxi example pipeline (streaming 
>>> >> SQL in python) on Dataflow and ran into PubSub IO related issues. The 
>>> >> example fails in the same way with 2.26.0, but it works in 2.25.0. It's 
>>> >> possible this is a Dataflow bug and not a Beam one, but I'd like to 
>>> >> investigate further to make sure.
>>> >>
>>> >> On Wed, Dec 23, 2020 at 12:25 PM Kyle Weaver  wrote:
>>> >>>
>>> >>> +1 (non-binding) Validated wordcount with Python source + Flink and 
>>> >>> Spark job server jars. Also checked that the ...:sql:udf jar was added 
>>> >>> and includes our cherry-picks. Thanks Pablo :)
>>> >>>
>>> >>> On Wed, Dec 23, 2020 at 12:02 PM Ahmet Altay  wrote:
>>> >>>>
>>> >>>> +1 (binding).
>>> >>>>
>>> >>>> I validated python quickstarts. Thank you Pablo.
>>> >>>>
>>> >>>> On Tue, Dec 22, 2020 at 10:04 PM Jean-Baptiste Onofre 
>>> >>>>  wrote:
>>> >>>>>
>>> >>>>> +1 (binding)
>>> >>>>>
>>> >>>>> Regards
>>> >>>>> JB
>>> >>>>>
>>> >>>>> On 23 Dec 2020, at 06:46, Pablo Estrada  wrote:
>>> >>>>>
>>> >>>>> Hi everyone,
>>> >>>>> Please review and vote on the release candidate #1 for the version 
>>> >>>>> 2.27.0, as follows:
>>> >>>>> [ ] +1, Approve the release
>>> >>>>> [ ] -1, Do not approve the release (please provide specific comments)
>>> >>>>>
>>> >>>>>
>>> >>>>> Reviewers are encouraged to test their own use cases with the release 
>>> >>>>> candidate, and vote +1
>>> >>>>>  if no issues are found.

Re: [VOTE] Release 2.26.0, release candidate #1

2020-12-28 Thread Ismaël Mejía
It seems the tag of the docker image for java8 was not updated after the
release went out; could somebody please fix this?

https://hub.docker.com/r/apache/beam_java8_sdk/tags?page=1&ordering=last_updated


On Sat, Dec 12, 2020 at 7:19 AM Jean-Baptiste Onofre 
wrote:

> +1 (binding)
>
> Sorry for the delay.
>
> Regards
> JB
>
> On 10 Dec 2020, at 17:40, Tyson Hamilton  wrote:
>
> +1 from me. I validated Nexmark performance tests.
>
> On Tue, Dec 8, 2020 at 7:53 PM Robert Burke  wrote:
>
>> I'm +1 on RC1 based on the 7 tests I know I can check successfully.  I'll
>> be trying more tomorrow, but remember that release validation requires the
>> community to validate it meets our standards, and I can't do it alone.
>>
>> Remember you can participate in the release validation by reviewing parts
>> of the documentation being published as well, not just by running the
>> Python and Java artifacts.
>>
>>  If you have contributed new python or java docs into this release,
>> they'll appear in the to be published docs.
>>
>> Cheers,
>> Robert Burke
>> 2.26.0 release manager
>>
>> On Mon, Dec 7, 2020, 6:25 PM Robert Burke  wrote:
>>
>>> Turns out no changes required affecting the dataflow artifacts this time
>>> around, so Dataflow is cleared for testing.
>>>
>>> Cheers.
>>> Robert Burke
>>> 2.26.0 Release Manager
>>>
>>> On Mon, Dec 7, 2020, 6:03 PM Robert Burke  wrote:
>>>

 Robert Burke 
 Thu, Dec 3, 8:01 PM (4 days ago)
 to dev
 Hi everyone,
 Please review and vote on the release candidate #1 for the version
 2.26.0, as follows:
 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)


 Reviewers are encouraged to test their own use cases with the release
 candidate, and vote +1
  if no issues are found.

 The complete staging area is available for your review, which includes:
 * JIRA release notes [1],
 * the official Apache source release to be deployed to dist.apache.org [2],
 which is signed with the key with fingerprint
 A52F5C83BAE26160120EC25F3D56ACFBFB2975E1 [3],
 * all artifacts to be deployed to the Maven Central Repository [4],
 * source code tag "v2.26.0-RC1" [5],
 * website pull request listing the release [6], publishing the API
 reference manual [7], and the blog post [8].
 * Java artifacts were built with Maven 3.6.0 and OpenJDK 1.8.0_275.
 * Python artifacts are deployed along with the source release to the
 dist.apache.org [2].
 * Validation sheet with a tab for 2.26.0 release to help with
 validation [9].
 * Docker images published to Docker Hub [10].

 The vote will be open for at least 72 hours (10th ~6pm PST). It is
 adopted by majority approval, with at least 3 PMC affirmative votes.

 Thanks,
 Robert Burke
 2.26.0 Release Manager

 [1]
 https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12348833
 [2] https://dist.apache.org/repos/dist/dev/beam/2.26.0/
 [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 [4] https://repository.apache.org/content/repositories/orgapachebeam-1144/
 [5] https://github.com/apache/beam/tree/v2.26.0-RC1
 [6] https://github.com/apache/beam/pull/13481
 [7] https://github.com/apache/beam-site/pull/609
 [8] https://github.com/apache/beam/pull/13482
 [9]
 https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=475997301
 [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image

 PS. New Dataflow artifacts likely need to be built and published but
 this doesn't block vetting the remainder of the RC at this time. Thank you
 for your patience.


>


Re: [VOTE] Release 2.27.0, release candidate #1

2020-12-24 Thread Ismaël Mejía
It might be a good idea to include also:

[BEAM-11403] Cache UnboundedReader per UnboundedSourceRestriction in
SDF Wrapper DoFn
https://github.com/apache/beam/pull/13592

So the Java development experience is less affected (as it was with 2.26.0);
there is a flag to exclude it, but defaults matter.

On Thu, Dec 24, 2020 at 2:56 AM Valentyn Tymofieiev  wrote:
>
> We discovered a regression on CombineFn.from_callable() started in 2.26.0. 
> Even though it's not a regression in 2.27.0, I strongly prefer we fix it in 
> 2.27.0 as it leads to buggy behavior, so I vote -1.
>
> The fix to release branch is in flight: 
> https://github.com/apache/beam/pull/13613.
>
>
>
> On Wed, Dec 23, 2020 at 3:38 PM Brian Hulette  wrote:
>>
>> -1 (non-binding)
>> Good news: I validated a dataframe pipeline on Dataflow which looked good 
>> (with expected performance improvements!)
>> Bad news: I also tried to run the sql_taxi example pipeline (streaming SQL 
>> in python) on Dataflow and ran into PubSub IO related issues. The example 
>> fails in the same way with 2.26.0, but it works in 2.25.0. It's possible 
>> this is a Dataflow bug and not a Beam one, but I'd like to investigate 
>> further to make sure.
>>
>> On Wed, Dec 23, 2020 at 12:25 PM Kyle Weaver  wrote:
>>>
>>> +1 (non-binding) Validated wordcount with Python source + Flink and Spark 
>>> job server jars. Also checked that the ...:sql:udf jar was added and 
>>> includes our cherry-picks. Thanks Pablo :)
>>>
>>> On Wed, Dec 23, 2020 at 12:02 PM Ahmet Altay  wrote:

 +1 (binding).

 I validated python quickstarts. Thank you Pablo.

 On Tue, Dec 22, 2020 at 10:04 PM Jean-Baptiste Onofre  
 wrote:
>
> +1 (binding)
>
> Regards
> JB
>
> On 23 Dec 2020, at 06:46, Pablo Estrada  wrote:
>
> Hi everyone,
> Please review and vote on the release candidate #1 for the version 
> 2.27.0, as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> Reviewers are encouraged to test their own use cases with the release 
> candidate, and vote +1
>  if no issues are found.
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org 
> [2], which is signed with the key with fingerprint 
> C79DDD47DAF3808F0B9DDFAC02B2D9F742008494 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.27.0-RC1" [5],
> * website pull request listing the release [6], publishing the API 
> reference manual [7], and the blog post [8].
> * Python artifacts are deployed along with the source release to the 
> dist.apache.org [2].
> * Validation sheet with a tab for 2.27.0 release to help with validation 
> [9].
> * Docker images published to Docker Hub [10].
>
> The vote will be open for at least 72 hours, but given the holidays, we 
> will likely extend for a few more days. The release will be adopted by 
> majority approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> -P.
>
> [1] 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349380
> [2] https://dist.apache.org/repos/dist/dev/beam/2.27.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1145/
> [5] https://github.com/apache/beam/tree/v2.27.0-RC1
> [6] https://github.com/apache/beam/pull/13602
> [7] https://github.com/apache/beam-site/pull/610
> [8] https://github.com/apache/beam/pull/13603
> [9] 
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=194829106
> [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>
>


Re: Combine with multiple outputs case Sample and the rest

2020-12-23 Thread Ismaël Mejía
Thanks for the answer Robert. Producing a combiner with two lists as
outputs was one idea I was considering too but I was afraid of
OutOfMemory issues. I had not thought much about the consequences for
combining state, thanks for pointing that out. For the particular sampling
use case it might not be an issue, or am I missing something?

I am still curious if for Sampling there could be another approach to
achieve the same goal of producing the same result (uniform sample +
the rest) but without the issues of combining.

On Mon, Dec 21, 2020 at 7:23 PM Robert Bradshaw  wrote:
>
> There are two ways to emit multiple outputs: either to multiple distinct 
> PCollections (e.g. withOutputTags) or multiple (including 0) outputs to a 
> single PCollection (the difference between Map and FlatMap). In full 
> generality, one can always have a CombineFn that outputs lists (say  result>*) followed by a DoFn that emits to multiple places based on this 
> result.
>
> One other cons of emitting multiple values from a CombineFn is that they are 
> used in other contexts as well, e.g. combining state, and trying to make 
> sense of a multi-outputting CombineFn in that context is trickier.
>
> Note that for Sample in particular, it works as a CombineFn because we throw 
> most of the data away. If we kept most of the data, it likely wouldn't fit 
> into one machine to do the final sampling. The idea of using a side input to 
> filter after the fact should work well (unless there's duplicate elements, in 
> which case you'd have to uniquify them somehow to filter out only the "right" 
> copies).
>
> - Robert
>
>
>
> On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía  wrote:
>>
>> I had a question today from one of our users about Beam’s Sample
>> transform (a Combine with an internal top-like function to produce a
>> uniform sample of size n of a PCollection). They wanted to obtain also
>> the rest of the PCollection as an output (the non sampled elements).
>>
>> My suggestion was to use the sample (since it was little) as a side
>> input and then reprocess the collection to filter its elements,
>> however I wonder if this is the ‘best’ solution.
>>
>> I was thinking also if Combine is essentially GbK + ParDo why we don’t
>> have a Combine function with multiple outputs (maybe an evolution of
>> CombineWithContext). I know this sounds weird and I have probably not
>> thought much about issues or the performance of the translation but I
>> wanted to see what others thought, does this make sense, do you see
>> some pros/cons or other ideas.
>>
>> Thanks,
>> Ismaël
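The "sample as a side input, then filter" approach discussed above can be modeled outside Beam in a few lines of plain Python. This is only a sketch: reservoir sampling stands in for Beam's Sample transform, elements are keyed by position (enumerate) to handle the duplicate-element caveat Robert mentions, and none of the names below are Beam API.

```python
import random
from collections import Counter

def sample_and_rest(elements, n, seed=17):
    """Uniform reservoir sample of size n plus the non-sampled elements.

    Elements are keyed by their position (enumerate) so duplicate values
    end up in exactly one of the two outputs.
    """
    rng = random.Random(seed)
    indexed = list(enumerate(elements))
    reservoir = []
    for i, item in enumerate(indexed):
        if i < n:
            reservoir.append(item)
        else:  # Algorithm R: keeps every prefix uniformly sampled
            j = rng.randrange(i + 1)
            if j < n:
                reservoir[j] = item
    sampled_positions = {pos for pos, _ in reservoir}
    sample = [value for _, value in reservoir]
    rest = [value for pos, value in indexed if pos not in sampled_positions]
    return sample, rest

data = list(range(100)) + [7] * 5          # duplicates on purpose
sample, rest = sample_and_rest(data, 10)
# every element lands in exactly one of the two outputs
assert Counter(sample) + Counter(rest) == Counter(data)
```

In Beam terms the `rest` computation corresponds to the second pass over the PCollection with the (small) sample as a side input; the positional key is what makes duplicate values filter correctly.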


Making preview (sample) time consistent on Direct runner

2020-12-18 Thread Ismaël Mejía
Hello,

The use of direct runner for interactive local use cases has increased
over the years on Beam due to projects like Scio, Kettle/Hop and our
own SQL CLI. All these tools have in common one thing, they show a
sample of some source input to the user and interactively apply
transforms to it to help users build Pipelines more rapidly.

If you build a pipeline today to produce this sample using the Beam’s
Sample transform from a set of files, the read of the files happens
first and then the sample, so the more files or the bigger they are
the longer it takes to produce the sample even if the number of
elements expected to read is constant.

During Beam Summit last year there were some discussions about how we
could improve this scenario (and others) but I have the impression no
further discussions happened in the mailing list, so I wanted to know
if there are some ideas about how we can get direct runner to improve
this case.

It seems to me that we can still ‘force’ the count with some static
field because it is not a distributed case but I don’t know how we can
stop reading once we have the number of sampled elements in a generic
way, especially since it seems to me a bit harder to do with pure DoFn
(SDF) APIs vs old Source ones, but well that’s just a guess.

Does anyone have an idea of how we could generalize this and, of course,
if you see the value of such a use case, other ideas for improvements?

Regards,
Ismaël
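The "stop reading once we have the sampled elements" idea above can be sketched with lazy iterators in plain Python. This is illustrative only (a real runner would need to cancel distributed reads); the point is that if the read is lazy and the sampler stops pulling after n elements, later files are never touched no matter how many are listed.

```python
from itertools import chain, islice

def lazy_lines(path, opened):
    """Pretend file reader that records which files were actually opened."""
    opened.append(path)
    for i in range(1000):                 # stand-in for reading lines from disk
        yield f"{path}:{i}"

def preview(paths, n):
    """Take the first n elements across all files.

    Because the sources are generators, a file is only opened when the
    previous one is exhausted, so the preview cost depends on n, not on
    the number or size of the input files.
    """
    opened = []
    sources = (lazy_lines(p, opened) for p in paths)
    rows = list(islice(chain.from_iterable(sources), n))
    return rows, opened

rows, opened = preview(["a.csv", "b.csv", "c.csv"], 5)
assert opened == ["a.csv"]                # b.csv and c.csv were never touched
```

The generic problem for the direct runner is exactly propagating this kind of early-termination signal backwards through a pipeline of eager transforms.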


Combine with multiple outputs case Sample and the rest

2020-12-18 Thread Ismaël Mejía
I had a question today from one of our users about Beam’s Sample
transform (a Combine with an internal top-like function to produce a
uniform sample of size n of a PCollection). They wanted to obtain also
the rest of the PCollection as an output (the non sampled elements).

My suggestion was to use the sample (since it was little) as a side
input and then reprocess the collection to filter its elements,
however I wonder if this is the ‘best’ solution.

I was also thinking: if Combine is essentially GbK + ParDo, why don’t we
have a Combine function with multiple outputs (maybe an evolution of
CombineWithContext). I know this sounds weird and I have probably not
thought much about issues or the performance of the translation but I
wanted to see what others thought, does this make sense, do you see
some pros/cons or other ideas.

Thanks,
Ismaël


Possible issue with bounded Read translation using SDF

2020-12-18 Thread Ismaël Mejía
Hello,

I was trying to profile some pipeline using Java's direct runner. It
reads ~30 text files of ~60 MB each (CSV). When I started the profiler it
reported more than 40K instances of TextSource being built, which
really surprised me given the small size of the data being processed.
I wonder if I have found an issue of over-splitting after we moved to
the SDF-based translation that may affect simpler uses.

I have not gone deeper or created a JIRA because I wanted to ask here
first maybe to see if there is a 'valid' explanation for so many
'splits'.

Regards,
Ismaël


Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-17 Thread Ismaël Mejía
The influence of checkpointing on the output of the results should be
minimal, in particular for Direct Runner. What Steve reports
here seems to be something different. Jan, have you or others already
checked the influence of this on Flink who is now using this new
translation path?

I think the argument that the Direct runner is mostly about testing
and not about performance is an argument that plays badly for Beam;
one should not necessarily exclude the other. The Direct runner is our
most used runner, basically every Beam user relies on the direct
runners so every regression or improvement on it affects everyone, but
well that's a subject worth its own thread.

On Thu, Dec 17, 2020 at 10:55 AM Jan Lukavský  wrote:
>
> Hi,
>
> from my point of view the number in DirectRunner are set correctly. Primary 
> purpose of DirectRunner is testing, not performance, so DirectRunner makes 
> intentionally frequent checkpoints to easily exercise potential bugs in user 
> code. It might be possible to make the frequency configurable, though.
>
> Jan
>
> On 12/17/20 12:20 AM, Boyuan Zhang wrote:
>
> It's not a portable execution on DirectRunner so I would expect that outputs 
> from OutputAndTimeBoundedSplittableProcessElementInvoker should be emitted 
> immediately. For SDF execution on DirectRunner, the overhead could come from 
> the SDF expansion, SDF wrapper and the invoker.
>
> Steve, based on your findings, it seems like it takes more time for the SDF 
> pipeline to actually start to read from PubSub and more time to output 
> records. Are you able to tell how much time each part is taking?
>
> On Wed, Dec 16, 2020 at 1:53 PM Robert Bradshaw  wrote:
>>
>> If all it takes is bumping these numbers up a bit, that seems like a 
>> reasonable thing to do ASAP. (I would argue that perhaps they shouldn't be 
>> static, e.g. it might be preferable to start emitting results right away, 
>> but use larger batches for the steady state if there are performance 
>> benefits.)
>>
>> That being said, it sounds like there's something deeper going on here. We 
>> should also verify that this performance impact is limited to the direct 
>> runner.
>>
>> On Wed, Dec 16, 2020 at 1:36 PM Steve Niemitz  wrote:
>>>
>>> I tried changing my build locally to 10 seconds and 10,000 elements but it 
>>> didn't seem to make much of a difference, it still takes a few minutes for 
>>> elements to begin actually showing up to downstream stages from the Pubsub 
>>> read.  I can see elements being emitted from 
>>> OutputAndTimeBoundedSplittableProcessElementInvoker, and bundles being 
>>> committed by ParDoEvaluator.finishBundle, but after that, they seem to just 
>>> kind of disappear somewhere.
>>>
>>> On Wed, Dec 16, 2020 at 4:18 PM Boyuan Zhang  wrote:
>>>>
>>>> Making it as the PipelineOptions was my another proposal but it might take 
>>>> some time to do so. On the other hand, tuning the number into something 
>>>> acceptable is low-hanging fruit.
>>>>
>>>> On Wed, Dec 16, 2020 at 12:48 PM Ismaël Mejía  wrote:
>>>>>
>>>>> It sounds reasonable. I am wondering also on the consequence of these
>>>>> parameters for other runners (where it is every 10 seconds or 1
>>>>> elements) + their own configuration e.g. checkpointInterval,
>>>>> checkpointTimeoutMillis and minPauseBetweenCheckpoints for Flink. It
>>>>> is not clear for me what would be chosen now in this case.
>>>>>
>>>>> I know we are a bit anti knobs but maybe it makes sense to make this
>>>>> configurable via PipelineOptions at least for Direct runner.
>>>>>
>>>>> On Wed, Dec 16, 2020 at 7:29 PM Boyuan Zhang  wrote:
>>>>> >
>>>>> > I agree, Ismael.
>>>>> >
>>>>> > From my current investigation, the performance overhead should majorly 
>>>>> > come from the frequency of checkpoint in 
>>>>> > OutputAndTimeBoundedSplittableProcessElementinvoker[1], which is 
>>>>> > hardcoded in the DirectRunner(every 1 seconds or 100 elements)[2]. I 
>>>>> > believe configuring these numbers on DirectRunner should improve 
>>>>> > reported cases so far. My last proposal was to change the number to 
>>>>> > every 5 seconds or 1 elements. What do you think?
>>>>> >
>>>>> > [1] 
>>>>> > https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/ru
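The cost of the checkpoint frequency discussed in this thread can be illustrated with a small plain-Python model of the "checkpoint after N outputs or T seconds" policy. The numbers and names below are illustrative only, not the DirectRunner implementation; the time bound is set very high so the element bound dominates and the result is deterministic.

```python
import time

def run_with_checkpoints(num_records, max_elements, max_seconds=3600.0):
    """Count how many checkpoints the bounded invoker policy would take.

    Each checkpoint means rescheduling a residual restriction, which is
    the overhead being discussed above; fewer checkpoints means less of it.
    """
    checkpoints = 0
    emitted = 0
    start = time.monotonic()
    for _ in range(num_records):
        emitted += 1
        if emitted >= max_elements or time.monotonic() - start >= max_seconds:
            checkpoints += 1
            emitted = 0
            start = time.monotonic()
    return checkpoints

# Purely illustrative numbers: a 100-element limit forces 100x more
# checkpoints than a 10,000-element limit over the same million records.
assert run_with_checkpoints(1_000_000, 100) == 10_000
assert run_with_checkpoints(1_000_000, 10_000) == 100
```

Making `max_elements` and `max_seconds` configurable via PipelineOptions, as suggested above, would let users trade checkpoint granularity against throughput per runner.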

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-16 Thread Ismaël Mejía
It sounds reasonable. I am also wondering about the consequences of these
parameters for other runners (where it is every 10 seconds or 1
elements) + their own configuration e.g. checkpointInterval,
checkpointTimeoutMillis and minPauseBetweenCheckpoints for Flink. It
is not clear for me what would be chosen now in this case.

I know we are a bit anti knobs but maybe it makes sense to make this
configurable via PipelineOptions at least for Direct runner.

On Wed, Dec 16, 2020 at 7:29 PM Boyuan Zhang  wrote:
>
> I agree, Ismael.
>
> From my current investigation, the performance overhead should majorly come 
> from the frequency of checkpoint in 
> OutputAndTimeBoundedSplittableProcessElementinvoker[1], which is hardcoded in 
> the DirectRunner(every 1 seconds or 100 elements)[2]. I believe configuring 
> these numbers on DirectRunner should improve reported cases so far. My last 
> proposal was to change the number to every 5 seconds or 1 elements. What 
> do you think?
>
> [1] 
> https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/OutputAndTimeBoundedSplittableProcessElementInvoker.java
> [2] 
> https://github.com/apache/beam/blob/3bb232fb098700de408f574585dfe74bbaff7230/runners/direct-java/src/main/java/org/apache/beam/runners/direct/SplittableProcessElementsEvaluatorFactory.java#L178-L181
>
> On Wed, Dec 16, 2020 at 9:02 AM Ismaël Mejía  wrote:
>>
>> I can guess that the same issues mentioned here probably will affect
>> the usability for people trying Beam's interactive SQL on Unbounded IO
>> too.
>>
>> We should really take into account that the performance of the SDF
>> based path should be as good or better than the previous version
>> before considering its removal (--experiments=use_deprecated_read) and
>> probably have consensus when this happens.
>>
>>
>> On Fri, Dec 11, 2020 at 11:33 PM Boyuan Zhang  wrote:
>> >
>> > > From what I've seen, the direct runner initiates a checkpoint after 
>> > > every element output.
>> > That seems like the 1 second limit kicks in before the output reaches 100 
>> > elements.
>> >
>> > I think the original purpose for DirectRunner to use a small limit on 
>> > issuing checkpoint requests is for exercising SDF better in a small data 
>> > set. But it brings overhead on a larger set owing to too many checkpoints. 
>> > It would be ideal to make this limit configurable from pipeline but the 
>> > easiest approach is that we figure out a number for most common cases. Do 
>> > you think we raise the limit to 1000 elements or every 5 seconds will help?
>> >
>> > On Fri, Dec 11, 2020 at 2:22 PM Steve Niemitz  wrote:
>> >>
>> >> From what I've seen, the direct runner initiates a checkpoint after every 
>> >> element output.
>> >>
>> >> On Fri, Dec 11, 2020 at 5:19 PM Boyuan Zhang  wrote:
>> >>>
>> >>> Hi Antonio,
>> >>>
>> >>> Thanks for the details! Which version of Beam SDK are you using? And are 
>> >>> you using --experiments=beam_fn_api with DirectRunner to launch your 
>> >>> pipeline?
>> >>>
>> >>> For ReadFromKafkaDoFn.processElement(), it will take a Kafka 
>> >>> topic+partition as input element and a KafkaConsumer will be assigned to 
>> >>> this topic+partition then poll records continuously. The Kafka consumer 
>> >>> will resume reading and return from the process fn when
>> >>>
>> >>> There are no available records currently(this is a feature of SDF which 
>> >>> calls SDF self-initiated checkpoint)
>> >>> The OutputAndTimeBoundedSplittableProcessElementInvoker issues 
>> >>> checkpoint request to ReadFromKafkaDoFn for getting partial results. The 
>> >>> checkpoint frequency for DirectRunner is every 100 output records or 
>> >>> every 1 seconds.
>> >>>
>> >>> It seems like either the self-initiated checkpoint or DirectRunner 
>> >>> issued checkpoint gives you the performance regression since there is 
>> >>> overhead when rescheduling residuals. In your case, it's more like that 
>> >>> the checkpoint behavior of 
>> >>> OutputAndTimeBoundedSplittableProcessElementInvoker gives you 200 
>> >>> elements a batch. I want to understand what kind of performance 
>> >>> regression you are noticing? Is it slower to output the same amount of 
>> >>> records?
>> >>>
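One way around the sequential poll-then-process loop Antonio describes (where each batch waits ~0.8 ms for the next poll) is to prefetch on a background thread. The sketch below is plain Python with a fake consumer; none of these names are KafkaIO internals, and it only illustrates the decoupling idea.

```python
import queue
import threading

def fake_poll(batches, cursor):
    """Stand-in for KafkaConsumer.poll(): next batch, or None when drained."""
    i = cursor["i"]
    cursor["i"] += 1
    return batches[i] if i < len(batches) else None

def prefetching_read(batches):
    """Poll on a background thread feeding a bounded queue, so processing
    of one batch overlaps with polling for the next one."""
    q = queue.Queue(maxsize=4)            # bounded, so polling cannot run away
    cursor = {"i": 0}

    def producer():
        while True:
            batch = fake_poll(batches, cursor)
            q.put(batch)                  # None doubles as end-of-stream marker
            if batch is None:
                return

    threading.Thread(target=producer, daemon=True).start()
    processed = []
    while True:
        batch = q.get()
        if batch is None:
            break
        processed.extend(batch)
    return processed

records = prefetching_read([[1, 2], [3], [4, 5, 6]])
assert records == [1, 2, 3, 4, 5, 6]      # order is preserved by the FIFO queue
```

With this shape the per-poll latency is hidden behind processing time instead of being added to it, which is the regression reported in the thread.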

Re: Farewell mail

2020-12-16 Thread Ismaël Mejía
Thanks Piotr,

You made an impact on Beam! Best wishes in the future projects and
feel welcome whenever you want to contribute again.

Ismaël



On Wed, Dec 16, 2020 at 9:02 PM Brian Hulette  wrote:
>
> Thank you for all your contributions! Good luck in your future endeavors :)
>
> Brian
>
> On Wed, Dec 16, 2020 at 9:35 AM Griselda Cuevas  wrote:
>>
>> Thank you Piotr for your contributions.
>>
>> On Wed, 16 Dec 2020 at 09:16, Ahmet Altay  wrote:
>>>
>>> Thank you Piotr and best wishes!
>>>
>>> On Wed, Dec 16, 2020 at 8:46 AM Alexey Romanenko  
>>> wrote:

 Piotr,

 Thanks a lot for your contributions, it was very useful and made Beam more 
 stable and finally even better! And it was always interesting to work with 
 you =)

 I wish you all the best in your next adventure but feel free to get back 
 to Beam and contribute in any way as you can. It is always welcome!

 Alexey

 > On 16 Dec 2020, at 17:16, Piotr Szuberski  
 > wrote:
 >
 > Hi all,
 >
 > This week is the last one I'm working on Beam. It was a pleasure to 
 > contribute to this project. I've learned a lot and had really good time 
 > with you guys!
 >
 > The IT world is quite small so there is no goodbye. See you in the 
 > future in the Web and another great projects!
 >
 > You can find me at https://github.com/piotr-szuberski
 >
 > Piotr



Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-16 Thread Ismaël Mejía
I can guess that the same issues mentioned here will probably affect
the usability for people trying Beam's interactive SQL on Unbounded IO
too.

We should really take into account that the performance of the SDF
based path should be as good or better than the previous version
before considering its removal (--experiments=use_deprecated_read) and
probably have consensus when this happens.


On Fri, Dec 11, 2020 at 11:33 PM Boyuan Zhang  wrote:
>
> > From what I've seen, the direct runner initiates a checkpoint after every 
> > element output.
> That seems like the 1 second limit kicks in before the output reaches 100 
> elements.
>
> I think the original purpose for DirectRunner to use a small limit on issuing 
> checkpoint requests is for exercising SDF better in a small data set. But it 
> brings overhead on a larger set owing to too many checkpoints. It would be 
> ideal to make this limit configurable from pipeline but the easiest approach 
> is that we figure out a number for most common cases. Do you think we raise 
> the limit to 1000 elements or every 5 seconds will help?
>
> On Fri, Dec 11, 2020 at 2:22 PM Steve Niemitz  wrote:
>>
>> From what I've seen, the direct runner initiates a checkpoint after every 
>> element output.
>>
>> On Fri, Dec 11, 2020 at 5:19 PM Boyuan Zhang  wrote:
>>>
>>> Hi Antonio,
>>>
>>> Thanks for the details! Which version of Beam SDK are you using? And are 
>>> you using --experiments=beam_fn_api with DirectRunner to launch your 
>>> pipeline?
>>>
>>> For ReadFromKafkaDoFn.processElement(), it will take a Kafka 
>>> topic+partition as input element and a KafkaConsumer will be assigned to 
>>> this topic+partition then poll records continuously. The Kafka consumer 
>>> will resume reading and return from the process fn when
>>>
>>> There are no available records currently(this is a feature of SDF which 
>>> calls SDF self-initiated checkpoint)
>>> The OutputAndTimeBoundedSplittableProcessElementInvoker issues checkpoint 
>>> request to ReadFromKafkaDoFn for getting partial results. The checkpoint 
>>> frequency for DirectRunner is every 100 output records or every 1 seconds.
>>>
>>> It seems like either the self-initiated checkpoint or DirectRunner issued 
>>> checkpoint gives you the performance regression since there is overhead 
>>> when rescheduling residuals. In your case, it's more like that the 
>>> checkpoint behavior of OutputAndTimeBoundedSplittableProcessElementInvoker 
>>> gives you 200 elements a batch. I want to understand what kind of 
>>> performance regression you are noticing? Is it slower to output the same 
>>> amount of records?
>>>
>>> On Fri, Dec 11, 2020 at 1:31 PM Antonio Si  wrote:

 Hi Boyuan,

 This is Antonio. I reported the KafkaIO.read() performance issue on the 
 slack channel a few days ago.

 I am not sure if this is helpful, but I have been doing some debugging on 
 the SDK KafkaIO performance issue for our pipeline and I would like to 
 provide some observations.

 It looks like in my case the ReadFromKafkaDoFn.processElement()  was 
 invoked within the same thread and every time kafaconsumer.poll() is 
 called, it returns some records, from 1 up to 200 records. So, it will 
 proceed to run the pipeline steps. Each kafkaconsumer.poll() takes about 
 0.8ms. So, in this case, the polling and running of the pipeline are 
 executed sequentially within a single thread. So, after processing a batch 
 of records, it will need to wait for 0.8ms before it can process the next 
 batch of records again.

 Any suggestions would be appreciated.

 Hope that helps.

 Thanks and regards,

 Antonio.

 On 2020/12/04 19:17:46, Boyuan Zhang  wrote:
 > Opened https://issues.apache.org/jira/browse/BEAM-11403 for tracking.
 >
 > On Fri, Dec 4, 2020 at 10:52 AM Boyuan Zhang  wrote:
 >
 > > Thanks for the pointer, Steve! I'll check it out. The execution paths 
 > > for
 > > UnboundedSource and SDF wrapper are different. It's highly possible 
 > > that
 > > the regression either comes from the invocation path for SDF wrapper, 
 > > or
 > > the implementation of SDF wrapper itself.
 > >
 > > On Fri, Dec 4, 2020 at 6:33 AM Steve Niemitz  
 > > wrote:
 > >
 > >> Coincidentally, someone else in the ASF slack mentioned [1] yesterday
 > >> that they were seeing significantly reduced performance using 
 > >> KafkaIO.Read
 > >> w/ the SDF wrapper vs the unbounded source.  They mentioned they were 
 > >> using
 > >> flink 1.9.
 > >>
 > >> https://the-asf.slack.com/archives/C9H0YNP3P/p1607057900393900
 > >>
 > >> On Thu, Dec 3, 2020 at 1:56 PM Boyuan Zhang  
 > >> wrote:
 > >>
 > >>> Hi Steve,
 > >>>
 > >>> I think the major performance regression comes from
 > >>> OutputAndTimeBoundedSplittableProcessElementInvoker[1], which will
 > >>

Re: Tests for compatibility with Avro 1.8 and 1.9

2020-12-04 Thread Ismaël Mejía
After some offline discussion with Piotr we discovered two issues:

1. The gradle avro plugin we use needs a specific version of Avro in
each of its versions, so we would need to use different versions of
the plugin to generate the Avro objects for our tests with each
version, because the generated objects are not compatible if they have
dates on it (because of the removal of joda-time).

2. There is an Avro test class we depend on to generate random data in
our tests that it is in a different package in recent versions of Avro
so we will probably need to detect this or move the class into Beam
(maintenance should not be hard for that class).

Both annoyances also happen with Avro 1.9; at least it seems that once
we have compatibility with 1.9 we will get 1.10 'for free'.

On Thu, Dec 3, 2020 at 6:29 PM Brian Hulette  wrote:
>
>
>
> On Thu, Dec 3, 2020 at 1:02 AM Piotr Szuberski  
> wrote:
>>
>> > A softer approach would be to let it as it is (1.8) and document
>> > explicitly that we check upwards compatibility with 1.9 and suggest
>> > users to explicitly override the version if required.
>>
>> Ok, thanks! I think it's the better option.
>>
>>
>> > I have not followed your work on the compatibility tests but I am
>> > curious what is the issue with Avro 1.10?
>>
>> AFAIR Avro 1.10 completely removes the support for joda time and for now 
>> Beam makes use of both Avro time interfaces (one of which is removed in 1.10)
>
>
> Could you file a jira for this (if we don't have one already)?
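Detecting where a class that moved between packages now lives can be handled with a try-each-candidate lookup. Below is a plain-Python sketch of the pattern (a Java version would try Class.forName on each candidate instead); the stdlib paths used in the demo are stand-ins, not real Avro package names.

```python
import importlib

def load_first(*candidates):
    """Return the first importable attribute from 'module:attr' candidates."""
    errors = []
    for path in candidates:
        module_name, _, attr = path.partition(":")
        try:
            return getattr(importlib.import_module(module_name), attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{path}: {exc}")
    raise ImportError("no candidate could be loaded:\n" + "\n".join(errors))

# Demo with stdlib stand-ins for the "class moved packages" situation:
# the first location fails (no such attribute there), the second succeeds.
OrderedDict = load_first("collections.abc:OrderedDict",
                         "collections:OrderedDict")
assert OrderedDict() is not None
```

The same fallback shape works whether the older or the newer package location is tried first, so a single test helper can support several Avro versions at once.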

