Data Engineering track for Community Over Code NA is calling for presentations

2023-07-07 Thread Jarek Potiuk
Hello Beam community,

Just a reminder that there are only 6 days left to submit your proposal
for the Community Over Code NA (formerly ApacheCon) conference.

This is the flagship event for the ASF, taking place in Halifax, Nova Scotia,
Canada, October 7-10, 2023, and together with Ismaël we want to encourage you
to submit talks for the Data Engineering track.

More info on the conference: https://communityovercode.org/ and on the Data
Engineering track:
https://medium.com/@jarekpotiuk/data-engineering-community-over-code-conference-38e9677bb440

We hope to see you in Halifax in October!

J.


Invitation for CFP for Data Engineering Track at the Community Over Code NA

2023-06-16 Thread Jarek Potiuk
Hello Beam community members!

TL;DR: The Call for Papers for the Community Over Code NA conference in Halifax
in October *ends in 4 weeks (13th of July!)*, so this is about the last moment
to prepare and submit your proposals:
https://communityovercode.org/call-for-presentations/

*Community Over Code North America* (formerly named ApacheCon) is a
flagship conference of the ASF and is back this year in October as a fully
in-person, onsite event, following the great New Orleans event last year. This
time it's going to be in *Halifax, Nova Scotia, Canada, October 7-10, 2023* -
read more here: https://communityovercode.org/

Together with Ismaël Mejía (in cc:), we are organizing the second edition of
the Data Engineering track there.

Last year we had the first edition of the Data Engineering track, and we
recorded and published all the talks. You can see them here:
https://s.apache.org/data-engineering-videos-2022 if you want to see what was
covered. Ismaël and I thought it was a great start, and we want to bring it to
a new level now - that's why we are reaching out. We think your community
might be interested in submitting a proposal.

If you are curious why we think we need a separate Data Engineering track, you
can also read our blog post: https://s.apache.org/data-engineering

The Call for Presentations (CfP) closes in less than a month, and the clock is
ticking.

If you got this far - here is the link to the CfP again, to make it easier to
follow: https://communityovercode.org/call-for-presentations/

We are looking forward to receiving your submissions and hopefully seeing
you in Halifax in October.

If you have any questions, do not hesitate to contact us.

Thanks,

Ismaël and Jarek


Re: [DISCUSS] Dependency management in Apache Beam Python SDK

2022-08-26 Thread Jarek Potiuk
Happy to help and I hope we can work together with Valentyn and others to
get the "google clients" approach improved :)

J.


On Fri, Aug 26, 2022 at 3:40 PM Kerry Donny-Clark via dev <
dev@beam.apache.org> wrote:

> Jarek, I really appreciate you sharing your experience and expertise here.
> I think Beam would benefit from adopting some of these practices.
> Kerry
>
> On Fri, Aug 26, 2022, 7:35 AM Jarek Potiuk  wrote:
>
>>
>>> I'm curious Jarek, does Airflow take any dependencies on popular
>>> libraries like pandas, numpy, pyarrow, scipy, etc... which users are likely
>>> to have their own dependency on? I think these dependencies are challenging
>>> in a different way than the client libraries - ideally we would support a
>>> wide version range so as not to require users to upgrade those libraries in
>>> lockstep with Beam. However in some cases our dependency is pretty tight
>>> (e.g. the DataFrame API's dependency on pandas), so we need to make sure to
>>> explicitly test with multiple different versions. Does Airflow have any
>>> similar issues?
>>>
>>
>> Yes we do (all of those I think :) ). Complete set of all our deps can be
>> found here
>> https://github.com/apache/airflow/blob/constraints-main/constraints-3.9.txt
>> (continuously updated and we have different sets for different python
>> versions).
>>
>> We took a rather interesting and unusual approach (more details in my
>> talk) - mainly because Airflow is both an application to install (for
>> users) and library to use (for DAG authors) and both have contradicting
>> expectations (installation stability versus flexibility in
>> upgrading/downgrading dependencies). Our approach is really smart in making
>> sure water and fire play well with each other.
>>
>> Most of those dependencies are coming from optional extras (list of all
>> extras here:
>> https://airflow.apache.org/docs/apache-airflow/stable/extra-packages-ref.html).
>> More often than not the "problematic" dependencies you mention are
>> transitive dependencies through some client libraries we use (for example
>> Apache Beam SDK is a big contributor to those :).
>>
>> Airflow "core" itself has far less dependencies
>> https://github.com/apache/airflow/blob/constraints-main/constraints-no-providers-3.9.txt
>> (175 currently) and we actively made sure that all "pandas" of this world
>> are only optional extra deps.
>>
>> Now - the interesting thing is that we use "constraints'' (the links you
>> with dependencies that I posted are those constraints) to pin versions of
>> the dependencies that are "golden" - i.e. we test those continuously in our
>> CI and we automatically upgrade the constraints when all the unit and
>> integration tests pass.
>> There is a little bit of complexity and sometimes conflicts to handle (as
>> `pip` has to find the right set of deps that will work for all our optional
>> extras), but eventually we have really one "golden" set of constraints at
>> any moment in time main (or v2-x branch - we have a separate set for each
>> branch) that we are dealing with. And this is the only "set" of dependency
>> versions that Airflow gets tested with. Note - these are *constraints *not
>> *requirements *- that makes a whole world of difference.
>>
>> Then when we release airflow, we "freeze" the constraints with the
>> version tag. We know they work because all our tests pass with them in CI.
>>
>> Then we communicate to our users (and we use it in our Docker image) that
>> the only "supported" way of installing airflow is with using `pip` and
>> constraints
>> https://airflow.apache.org/docs/apache-airflow/stable/installation/installing-from-pypi.html.
>> And we do not support poetry, pipenv - we leave it up to users to handle
>> them (until poetry/pipenv will support constraints - which we are waiting
>> for and there is an issue where I explained  why it is useful). It looks
>> like that `pip install "apache-airflow==2.3.4" --constraint "
>> https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.9.txt"`
>> (different constraints for different airflow version and Python version you
>> have)
>>
>> Constraints have this nice feature that they are only used during the
>> "pip install" phase and thrown out immediately after the install is
>> complete. They do not create "hard" requirements for airflow. Airflow still
>> has a number of "low

Re: [DISCUSS] Dependency management in Apache Beam Python SDK

2022-08-26 Thread Jarek Potiuk
>
> I'm curious Jarek, does Airflow take any dependencies on popular libraries
> like pandas, numpy, pyarrow, scipy, etc... which users are likely to have
> their own dependency on? I think these dependencies are challenging in a
> different way than the client libraries - ideally we would support a wide
> version range so as not to require users to upgrade those libraries in
> lockstep with Beam. However in some cases our dependency is pretty tight
> (e.g. the DataFrame API's dependency on pandas), so we need to make sure to
> explicitly test with multiple different versions. Does Airflow have any
> similar issues?
>

Yes, we do (all of those, I think :) ). The complete set of all our deps can be
found here:
https://github.com/apache/airflow/blob/constraints-main/constraints-3.9.txt
(continuously updated, and we have different sets for different Python
versions).

We took a rather interesting and unusual approach (more details in my talk) -
mainly because Airflow is both an application to install (for users) and a
library to use (for DAG authors), and the two have contradictory expectations
(installation stability versus flexibility in upgrading/downgrading
dependencies). Our approach is really smart in making sure water and fire
play well with each other.

Most of those dependencies come from optional extras (the list of all extras
is here:
https://airflow.apache.org/docs/apache-airflow/stable/extra-packages-ref.html).
More often than not, the "problematic" dependencies you mention are transitive
dependencies pulled in through some client libraries we use (for example, the
Apache Beam SDK is a big contributor to those :)).

Airflow "core" itself has far less dependencies
https://github.com/apache/airflow/blob/constraints-main/constraints-no-providers-3.9.txt
(175 currently) and we actively made sure that all "pandas" of this world
are only optional extra deps.

Now - the interesting thing is that we use "constraints" (the links with
dependencies that I posted are those constraints) to pin the "golden" versions
of the dependencies - i.e. we test those continuously in our CI and we
automatically upgrade the constraints when all the unit and integration tests
pass.
There is a little bit of complexity and sometimes conflicts to handle (as
`pip` has to find the right set of deps that will work for all our optional
extras), but eventually we have just one "golden" set of constraints at any
moment in time for main (or for a v2-x branch - we have a separate set for
each branch) that we are dealing with. And this is the only set of dependency
versions that Airflow gets tested with. Note - these are *constraints*, not
*requirements* - and that makes a whole world of difference.
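
To make the distinction concrete, here is a minimal, illustrative sketch - the
file name and version numbers below are made up, not taken from the real
constraint files:

    # constraints-3.9.txt - an excerpt of pinned, known-good versions
    numpy==1.21.6
    pandas==1.3.5

    # constraints only pin versions of packages that something else already
    # pulls in; they never add packages and are discarded after the install:
    pip install "apache-airflow==2.3.4" --constraint constraints-3.9.txt

    # a requirements file, by contrast, is itself the list of things to install:
    pip install -r requirements.txt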

Then when we release Airflow, we "freeze" the constraints with the version
tag. We know they work because all our tests pass with them in CI.

Then we communicate to our users (and we use it in our Docker image) that the
only "supported" way of installing Airflow is using `pip` with constraints:
https://airflow.apache.org/docs/apache-airflow/stable/installation/installing-from-pypi.html.
We do not support poetry or pipenv - we leave those up to users to handle
(until poetry/pipenv support constraints, which we are waiting for; there is
an issue where I explained why it is useful). It looks like this:
`pip install "apache-airflow==2.3.4" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.9.txt"`
(there are different constraints for each Airflow version and Python version
you have).

Constraints have this nice feature that they are only used during the "pip
install" phase and are thrown away immediately after the install is complete.
They do not create "hard" requirements for Airflow. Airflow still has a number
of lower-bound limits for its dependencies, but we try to avoid putting upper
bounds at all (only in specific, documented cases), and our bounds are rather
relaxed. This way we achieve three things:

1) when someone does not use constraints and has a problem with a broken
dependency, we tell them to use constraints - this is what we as a community
commit to and support
2) by using the constraints mechanism we do not limit our users if they want
to upgrade or downgrade any dependencies. They are free to do it (as long as
it fits the - rather relaxed - lower/upper bounds of Airflow). But "with great
power comes great responsibility" - if they want to do that, THEY have to make
sure that Airflow will work. We make no guarantees there.
3) we are not limited by the 3rd-party libraries that come as extras - if you
do not use those extras, their limits do not apply
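
To illustrate the "relaxed bounds plus optional extras" idea, here is a rough,
made-up sketch of the shape of such a setup.py - the package names, extras and
versions are purely illustrative, not Airflow's actual entries:

    # setup.py (illustrative only)
    install_requires = [
        "attrs>=20.0",    # lower bound only, no upper bound
        "jinja2>=3.0",
    ]
    extras_require = {
        # the heavy "pandas of this world" live behind optional extras
        "pandas": ["pandas>=1.2.5"],
        "apache.beam": ["apache-beam[gcp]>=2.33.0"],
    }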

I think this works really well - but it is rather complex to set up and
maintain. I built a whole set of scripts, and we have the whole `breeze`
("It's a breeze to develop Airflow" is the theme) development/CI environment
based on docker and docker-compose that allows us to automate all of that.

J.


Re: [DISCUSS] Dependency management in Apache Beam Python SDK

2022-08-24 Thread Jarek Potiuk
A comment (from a bit of an outsider)

Fantastic document Valentyn.

Very, very insightful and interesting. We feel a lot of the same pain in
Apache Airflow (actually even more, because we have not 20 but 620+
dependencies), but we are also a bit more advanced in the way we manage our
dependencies - some of the ideas you have there are already tried and tested
in Airflow, some of them are a bit different, but we can definitely share
"principles", and we are a little higher in the "supply chain" (i.e. the
Apache Beam Python SDK is our dependency).

I left some suggestions and some comments describing in detail how the same
problems look in Airflow and how we addressed them (if we did), and I am happy
to participate in further discussions. I am "the dependency guy" in Airflow
and happy to share my experiences and help work out some problems - especially
problems coming from using multiple Google client libraries and diamond
dependencies (we are dealing with a similar issue right now, where we will
likely have to do a massive update of several of our clients - hopefully with
the involvement of the Composer team). And I'd love to be involved in a joint
discussion with the Google client team to work out some common expectations
that we can rely on when we define our future upgrade strategy for Google
clients.

I will watch this thread and will be happy to spend quite some time helping to
hash it out.

BTW, you can also watch the talk I gave last year at PyWaw about "Managing
Python dependencies at Scale":
https://www.youtube.com/watch?v=_SjMdQLP30s&t=2549s where I explain the
approach we took, the reasoning behind it, etc.

J.


On Wed, Aug 24, 2022 at 2:45 AM Valentyn Tymofieiev via dev <
dev@beam.apache.org> wrote:

> Hi everyone,
>
> Recently, several issues [1-3]  have highlighted outage risks and
> developer inconveniences due to  dependency management practices in Beam
> Python.
>
> With dependabot and other tooling  that we have integrated with Beam, one
> of the missing pieces seems to be having a clear guideline of how we should
> be specifying requirements for our dependencies and when and how we should
> be updating them to have a sustainable process.
>
> As a conversation starter, I put together a retrospective
> [4]
> covering a recent incident and would like to get community opinions on the
> open questions.
>
> In particular, if you have experience managing dependencies for other
> Python libraries with rich dependency chains, knowledge of available
> tooling or first hand experience dealing with other dependency issues in
> Beam, your input would be greatly appreciated.
>
> Thanks,
> Valentyn
>
> [1] https://github.com/apache/beam/issues/22218
> [2] https://github.com/apache/beam/pull/22550#issuecomment-1217348455
> [3] https://github.com/apache/beam/issues/22533
> [4]
> https://docs.google.com/document/d/1gxQF8mciRYgACNpCy1wlR7TBa8zN-Tl6PebW-U8QvBk/edit
>


Re: Apache Thrift vs gRPC summary

2022-07-16 Thread Jarek Potiuk
Thanks for all the pointers. I have finally gotten around to implementing a
POC based on gRPC and I am super happy with it so far. It has all the
modern support we need in Airflow and seems performant enough to serve
our use case.

J.

On Thu, Feb 17, 2022 at 6:52 PM Kenneth Knowles  wrote:
>
> Another TL;DR that may not be covered in the history is that we initially set 
> out with a couple of goals that have since been abandoned:
>
> 1. Allow Beam to be used in a particular language/ecosystem without a 
> dependency on the portability framework (NO - we want everything to use the 
> portability framework)
> 2. Allow Beam's portable model to be independent of transport (NO - using 
> protobuf for the messages it really only makes sense to use protobuf + gRPC 
> for transport)
> 2a. Potentially allow Beam's portable model to be represented in multiple 
> serialization formats (NO - there are enough impedance mismatches that it is 
> just not worthwhile, even though proto has lots of problems at least we can 
> develop workarounds only once)
>
> We never did develop with anything other than proto+gRPC in mind.
>
> Kenn
>
> On Thu, Feb 17, 2022 at 4:55 AM Jarek Potiuk  wrote:
>>
>> Thank you ! I will dive deeper - but having just those pointers is a good 
>> start (I likely mixed up gRPC - Thrift bridges with replacing of Thrift  
>> Luke!)
>>
>> On Thu, Feb 17, 2022 at 5:28 AM Kenneth Knowles  wrote:
>>>
>>> I can find you that fun mailing list pointer, if you like. Here's a 
>>> starting point with the subject "[DISCUSS] Beam data plane serialization 
>>> tech"
>>>
>>> https://lists.apache.org/thread/dz24chmm18skzgcmxl2jxookd3yn79r1
>>>
>>> Kenn
>>>
>>> On Wed, Feb 16, 2022 at 10:23 AM Luke Cwik  wrote:
>>>>
>>>> Apache Beam never had an RPC layer for the internal workings of the 
>>>> project until the portability project[1] started so there never was a 
>>>> transition from Apache Thrift to gRPC.
>>>>
>>>> Generally the support for HTTP2 and long lived streaming connections were 
>>>> the key differentiators for gRPC.
>>>>
>>>> 1: https://beam.apache.org/roadmap/portability/
>>>>
>>>> On Wed, Feb 16, 2022 at 2:38 AM Jarek Potiuk  wrote:
>>>>>
>>>>> Hello Beam friends,
>>>>>
>>>>> I have a question, we are preparing (as part of 
>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API)
>>>>>  to split Airflow into more components which will be communicating using 
>>>>> RPC.
>>>>>
>>>>> Basically we need to extract some of the internal methods into a "remote 
>>>>> procedure calls" which then we would like to be able to call either 
>>>>> "really remotely" (over HTTPS) or locally (via local TCP/Unix domain 
>>>>> sockets).
>>>>>
>>>>> I have narrowed down the options we have to Apache Thrift and gRPC. I 
>>>>> know that Apache Beam was (is ?) in a transition period Thrift -> GRPC 
>>>>> and I am sure you have some experiences to share and (following your 
>>>>> mailing lists) I am sure there was a deep analysis done for those two 
>>>>> before you decided to switch.
>>>>>
>>>>> Before I start searching through your mailing list, maybe someone knows a 
>>>>> document or some summary of the two that you could share with us - that 
>>>>> probably could save us a lot of effort deciding which of those two might 
>>>>> be better for our needs.
>>>>>
>>>>> Is there something that you know of easily that can be shared?
>>>>>
>>>>> J,
>>>>>


Re: Trigger phrases in Github Actions

2022-07-15 Thread Jarek Potiuk
Sure :). I think it's really up to you to decide. There are pros/cons
of both - it was just my view from the experience of running "master"
workflows for a while to do stuff like that (we had "labelling" of the
PRs on the "approve" action). But everyone's mileage is different.

One con that I did not mention (and it is quite annoying) is more of a UI
issue than anything else (and the main reason why I decided to remove the
labelling workflow): this kind of "comment" action introduces a lot of noise
in the "Actions" view of GitHub. By default it shows all workflow runs, so
you will mostly see comment-triggered runs on the first page if you add such
an action. Sure, you can go to the link that filters to only the workflow you
are interested in and save that link, but it's not sticky and it is kinda
annoying. But again - up to you :)

J.

On Fri, Jul 15, 2022 at 2:23 PM Danny McCormick
 wrote:
>
> I tend to disagree with this approach. I think I can generally distill what 
> you said to 3 points, which I'll address below:
>
> 1. Every workflow needs a clean worker, and Apache has a limited number of 
> those. As a result, queue times can build up.
>
> This is a problem, but it's one that should be resolved by introducing our 
> own self-hosted runners. (Jarek, I know you're actually involved in that 
> effort). Any significant push towards GHA is going to rely on that. Once we 
> have self-hosted runners, we won't necessarily need a clean machine every 
> time, and we also won't need to worry about capacity issues like we do now.
>
> 2. There are security issues with running workflows that are based on the 
> master branch.
>
> This is sorta true - this is basically only a problem if we pass a non-locked 
> down secret into a step that runs code outside of our GH workflow 
> configuration (and is either running on a PR or on master with some git 
> checkout magic to target a different ref). That's not a trivial risk, so we 
> need to be careful, but it's also not one that's solved by using probot - if 
> our end goal is to kick off CI runs that use user code, this is just a 
> problem that we're going to have to deal with. It's basically just one of 
> those problems that Open Source CI is always going to have (and one we 
> already have with Jenkins) - how much do you restrict resources available to 
> a fork? As a rule, it is a good idea for any code we want to trigger on 
> arbitrary users' comments to not pass in secrets to that code, or to 
> explicitly limit the scope of them.
>
> Kicking off workflows on master is not generally a problem because any code 
> they run will be vetted by the normal PR process.
>
> 3. Probot is a better option.
>
> I'd argue Probot is just a different option, with different tradeoffs (and 
> I'd say is overall worse for our use case). Here are some pros and cons I see 
> of using GHA over ProBot
>
> Pros:
> - All config lives in the repo and can be managed without involving Infra 
> (you already called this out).
> - We're already using GHA for some of our CI, this limits the number of tools 
> we're using.
> - Lots of built in, and supported, functionality exists for Actions - 
> technically there are a lot of Probot apps out there that do similar things 
> to GHA, but most of them are deprecated or not actively maintained. Speaking 
> from experience, GH had a team of >100 working on building out/maintaining 
> Actions, and virtually nobody working on Probot related things. A bunch of 
> Probot apps were also deprecated when GitHub Actions v2 was launched. It's 
> also worth calling out that most of the things you have your Airflow Probot 
> app doing exist already as actions.
> - Relatedly, I expect future improvements to come to GHA, not Probot.
> - Actions logging shows up in the repo where anyone can view it. Probot 
> requires a separate logging service.
>
> Cons:
> - Probot is a little snappier, even with private runners
> - Maintaining state is easier with probot (not impossible with GHA, but 
> requires being kinda hacky).
>
> To me, those cons don't justify partially moving off of GHA to ProBot.
>
> Thanks,
> Danny
>
> On Fri, Jul 15, 2022 at 3:00 AM Jarek Potiuk  wrote:
>>
>> My 3 cents.
>>
>> We've been playing with similar approaches in Apache Airflow and I
>> think Github Actions Workflows are not a good idea for this kind of
>> behaviour. Github Action workflows are really "heavy-weight" in many
>> ways, you should really think of them to be spinned to actually do
>> some processing - building, compiling, processing. They have a lot of
>> overhead and problems for "quick" actions like triggering something
>> based on an event.
>>
>> There 

Re: Trigger phrases in Github Actions

2022-07-15 Thread Jarek Potiuk
My 3 cents.

We've been playing with similar approaches in Apache Airflow, and I
think GitHub Actions workflows are not a good idea for this kind of
behaviour. GitHub Actions workflows are really "heavy-weight" in many
ways; you should really think of them as being spun up to actually do
some processing - building, compiling, testing. They have a lot of
overhead and problems for "quick" actions like triggering something
based on an event.

There are ways to use "master" workflows - we do that in Airflow. It
is possible to have a master workflow that checks out the code from
the branch that the event came from. It has a number of security
implications though, so it is complex and dangerous to use (because
your master workflow can get dangerous write permissions, and if you
are not careful and execute PR-provided code, this might get exploited
by bots opening PRs and making comments on your repo). But even if you
solve the problems of the "master-only" approach, you will hit the
problem that there are 150 workers in total for all Apache projects
and every workflow needs a "clean worker" - which means that just to
trigger an action on a comment you will sometimes have to wait in a
queue for available workers.
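
For illustration, here is a minimal (untested) sketch of the "master" workflow
pattern described above - the workflow definition always runs from the default
branch, and the PR code has to be checked out explicitly, which is exactly
where the risk comes from (the trigger phrase and script name are
hypothetical):

    name: Comment-triggered tests
    on:
      issue_comment:
        types: [created]
    jobs:
      run-on-phrase:
        # only react to comments on PRs that contain the trigger phrase
        if: github.event.issue.pull_request && contains(github.event.comment.body, 'run tests')
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
            with:
              # check out the PR's code - this is the PR-provided code to be careful with
              ref: refs/pull/${{ github.event.issue.number }}/head
          - run: ./scripts/run_tests.sh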

I think for that kind of "triggering" and commenting, we have had a much
better experience with GitHub Apps - which, even if they require
installation by Infra, provide much better "snappiness", and you can do
more with them:
https://docs.github.com/en/developers/apps/managing-github-apps/installing-github-apps
- they react to webhooks coming from GitHub. It's a bit more hassle to
develop and install, but the effect is much better IMHO for the kind of
behaviour you want. In Airflow we have Boring Cyborg, which we developed
years ago and which is rock-solid for this kind of quick action:
https://github.com/kaxil/boring-cyborg (based on Probot).

I think GitHub Apps got kinda "forgotten" when GitHub Actions were
added - the main reason is that you have to develop them outside your
repo, whereas GitHub Actions are mostly "in-repo" and you can "easily"
make a change in a PR - but this is a bit deceptive. The "simple" PR
workflows are fine, but when you feel the need for "write" access to
your repo, things get very complex, very quickly.

I hope this helps.

J.

On Fri, Jul 15, 2022 at 4:41 AM Danny McCormick via dev
 wrote:
>
> I don't think using trigger comments is generally super common, so no. I do 
> still think it's a useful feature to have though, and part of GHA's approach 
> is that everything is intentionally extensible and feature light so that you 
> can tailor it to your org's needs.
>
> Actually, digging in further, it does look like the approach I recommended 
> might not work out - from 
> https://github.community/t/on-issue-comment-events-are-not-triggering-workflows/16784/13
>  it looks like issue_comment always triggers on master unfortunately.
>
> Maybe a better approach here would be to use an action like 
> https://github.com/peter-evans/repository-dispatch to trigger workflows on 
> issue comment. That would allow us to have a single workflow that reads the 
> issue comment and triggers a workflow as needed. The only downside is that it 
> would rely on an external token that would need to be rotated on a yearly 
> basis. It also puts all the issue triggering logic in a single place which is 
> nice.
>
>
> On Thu, Jul 14, 2022 at 5:07 PM Kenneth Knowles  wrote:
>>
>> Is this an idiomatic way to trigger GHA?
>>
>> On Thu, Jul 14, 2022 at 1:36 PM Danny McCormick via dev 
>>  wrote:
>>>
>>> Hey Fer,
>>>
>>> I'm not 100% sure I follow what you're trying to do, but one approach you 
>>> could take is to gate everything off of an if like:
>>>
>>> ` if {{ (github.event.issue.pull_request && github.event.comment && 
>>> contains(github.event.comment.body, "PHRASE")) || !github.event.comment }}
>>>
>>> Basically, that's doing: `if (<is a PR> && <is a comment> && <the comment contains the phrase>) || <is not a comment event>`
>>>
>>> I haven't tested it so it might be a little off syntactically, but I 
>>> believe that should generally do what you're trying to do. FWIW, you might 
>>> find it helpful to dump the context with an action like this in a test repo 
>>> - that will show you exactly what is available in the github context for 
>>> each kind of event so that you can use them accordingly.
>>>
>>> Thanks,
>>> Danny
>>>
>>> On Thu, Jul 14, 2022 at 3:20 PM Fer Morales Martinez 
>>>  wrote:

 Hello everyone!

 As part of the migration of the precommit and postcommit jobs from jenkins 
 over to github actions, we're trying to implement the trigger phrase 
 functionality.
 Our first approach was to use issue_comment
 One problem we noticed after testing is that it looks like issue_comment 
 is mutually exclusive with the other events. For example, given the 
 following flow

 name: Java Tests
 on:
   schedule:
     - cron: '10 2 * * *'
   push:
     branches: ['master', 'release-*']

Data Engineering Track at ApacheCon (October 3-6, New Orleans) - CFP ends 23rd of May !

2022-05-10 Thread Jarek Potiuk
Hello Beam developers!

ApacheCon North America is back in person this year in October.
https://apachecon.com/acna2022/

Together with Ismaël Mejía, we are organizing for the first time a Data
Engineering Track as part of ApacheCon.

You might be wondering why we need a different track if we already have the
Big Data track. Simple: this new track covers the 'other' open-source projects
we use to clean data, orchestrate workloads, do observability, visualization,
governance, data lineage and many other tasks that are part of data
engineering and that are usually not covered by the data processing /
database tracks.

If you are curious you can find more details here:
https://s.apache.org/apacheconna-2022-dataeng-track

So why are you getting this message? Well it could be that (1) you are
already a contributor to a project in the data engineering space and you
might be interested in sending your proposal, or (2) you are interested in
integrations of these tools with your existing data tools.

If you are interested you can submit a proposal using the CfP link below.
Don’t forget to choose the Data Engineering Track.
https://apachecon.com/acna2022/cfp.html

The Call for Presentations (CfP) closes in less than two weeks, on May 23rd,
2022.

We are looking forward to receiving your submissions and hopefully seeing you
in New Orleans in October.

Thanks,

Ismaël and Jarek


Re: Data Engineering track at ApacheCon (October 3-6, New Orleans)

2022-04-13 Thread Jarek Potiuk
Cool! 23 May is the deadline (I forgot to mention it).

On Wed, Apr 13, 2022 at 9:16 PM Pablo Estrada  wrote:

> Thanks Jarek!
> This is a great idea, and I'll try and submit something for this : )
> Best
> -P.
>
> On Wed, Apr 13, 2022 at 1:22 AM Jarek Potiuk  wrote:
>
>> Hello Beam Friends.
>>
>> There is an ApacheCon N coming this year in October (
>> https://apachecon.com/acna2022/)  and it's going to be an "ONSITE"
>> event  - 3-6 October, New Orleans, Louisiana!
>>
>> It's one of the best events ever when it comes to community building
>> at Apache so I heartily invite everyone. This year also is the first
>> year ApacheCon has a dedicated DataEngineering track.
>>
>> Last few years none of the tracks at Apache Con (including any of the
>> "BigData" tracks) matched a lot of the subjects we were touching at
>> Apache Airflow, Apache Beam, Apache Superset, Dolphin Scheduler and
>> many others so I proposed (and I am chairing) a Data Engineering
>> track.
>>
>> The Call for Papers is open http://cfp.apachecon.com/
>>
>> And feel free to spread the news to anyone you might find interested.
>> Also if anyone is interested in co-leading the track, feel free to
>> reach out to me directly. Happy to co-share the leadership (and a
>> little responsibilities too) :).
>>
>> J,
>>
>


Data Engineering track at ApacheCon (October 3-6, New Orleans)

2022-04-13 Thread Jarek Potiuk
Hello Beam Friends.

There is an ApacheCon NA coming this year in October
(https://apachecon.com/acna2022/), and it's going to be an "ONSITE"
event - 3-6 October, New Orleans, Louisiana!

It's one of the best events ever when it comes to community building
at Apache, so I heartily invite everyone. This year is also the first
year ApacheCon has a dedicated Data Engineering track.

In the last few years, none of the tracks at ApacheCon (including any of
the "Big Data" tracks) matched a lot of the subjects we touch on at
Apache Airflow, Apache Beam, Apache Superset, Dolphin Scheduler and
many others, so I proposed (and I am chairing) a Data Engineering
track.

The Call for Papers is open http://cfp.apachecon.com/

Feel free to spread the news to anyone you might find interested.
Also, if anyone is interested in co-leading the track, feel free to
reach out to me directly. Happy to share the leadership (and a few of
the responsibilities too) :).

J,


Re: Apache Thrift vs gRPC summary

2022-02-17 Thread Jarek Potiuk
Thank you! I will dive deeper - but having just those pointers is a good
start (I likely mixed up gRPC-Thrift bridges with the replacement of Thrift,
Luke!)

On Thu, Feb 17, 2022 at 5:28 AM Kenneth Knowles  wrote:

> I can find you that fun mailing list pointer, if you like. Here's a
> starting point with the subject "[DISCUSS] Beam data plane serialization
> tech"
>
> https://lists.apache.org/thread/dz24chmm18skzgcmxl2jxookd3yn79r1
>
> Kenn
>
> On Wed, Feb 16, 2022 at 10:23 AM Luke Cwik  wrote:
>
>> Apache Beam never had an RPC layer for the internal workings of the
>> project until the portability project[1] started so there never was a
>> transition from Apache Thrift to gRPC.
>>
>> Generally the support for HTTP2 and long lived streaming connections were
>> the key differentiators for gRPC.
>>
>> 1: https://beam.apache.org/roadmap/portability/
>>
>> On Wed, Feb 16, 2022 at 2:38 AM Jarek Potiuk  wrote:
>>
>>> Hello Beam friends,
>>>
>>> I have a question, we are preparing (as part of
>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API)
>>> to split Airflow into more components which will be communicating using
>>> RPC.
>>>
>>> Basically we need to extract some of the internal methods into a "remote
>>> procedure calls" which then we would like to be able to call either "really
>>> remotely" (over HTTPS) or locally (via local TCP/Unix domain sockets).
>>>
>>> I have narrowed down the options we have to Apache Thrift and gRPC. I
>>> know that Apache Beam was (is ?) in a transition period Thrift -> GRPC and
>>> I am sure you have some experiences to share and (following your mailing
>>> lists) I am sure there was a deep analysis done for those two before
>>> you decided to switch.
>>>
>>> Before I start searching through your mailing list, maybe someone knows
>>> a document or some summary of the two that you could share with us - that
>>> probably could save us a lot of effort deciding which of those two might be
>>> better for our needs.
>>>
>>> Is there something that you know of easily that can be shared?
>>>
>>> J,
>>>
>>>


Apache Thrift vs gRPC summary

2022-02-16 Thread Jarek Potiuk
Hello Beam friends,

I have a question, we are preparing (as part of
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API)
to split Airflow into more components which will be communicating using
RPC.

Basically, we need to extract some of the internal methods into "remote
procedure calls", which we would then like to be able to call either "really
remotely" (over HTTPS) or locally (via local TCP/Unix domain sockets).

I have narrowed down the options we have to Apache Thrift and gRPC. I know
that Apache Beam was (is?) in a transition period from Thrift to gRPC, and I
am sure you have some experiences to share and (following your mailing lists)
I am sure there was a deep analysis done of those two before you decided
to switch.

Before I start searching through your mailing list, maybe someone knows of a
document or some summary of the two that you could share with us - that could
probably save us a lot of effort in deciding which of the two might be better
for our needs.

Is there something you know of off the top of your head that can be shared?

J,


Re: [RFC][Design] Automate Reviewer Assignment

2022-02-11 Thread Jarek Potiuk
Cool. Looking forward to seeing how it goes for Beam. We will also soon be at
the point where we will likely want to do something more sophisticated!

On Fri, Feb 11, 2022 at 4:08 PM Danny McCormick 
wrote:

> Hey Jared, thanks for chiming in - I've been really appreciative of the
> Airflow perspective (here and in the GitHub issues conversation), and
> definitely hope we can keep learning from each other! We did consider
> CODEOWNERs, but ultimately decided against it because it couldn't hit some
> of our goals - specifically:
>
> 1. Providing multiple passes of assignment (once to a larger set of
> reviewers, and then again to a second set of committers).
>
> 2. Balancing reviews - like you mentioned, there's not a great way to do
> round robining, or even assign to a single person from a set of people.
> Technically you can actually do this if every codeowner is part of a team (
> https://twitter.com/github/status/1194673101117808653?lang=en), but many
> Beam reviewers in our new model won't be a part of the Apache org. (Maybe
> that feature would be of interest to Airflow though? It looks like maybe
> all of your CODEOWNERS are part of the Apache org? I can't 100% tell).
>
> 3. Don't break the existing use case where a contributor wants a review
> from a specific person.
>
> Thanks,
> Danny
>
> On Thu, Feb 10, 2022 at 7:52 AM Jarek Potiuk  wrote:
>
>> Very interesting one - as an outsider I am interested to see how this
>> initiative will work out for the beam community.
>>
>> Just one comment - maybe you do not know but in GitHub there is a
>> "CODEOWNERS" feature (I notice you are not using it). Quote from
>> https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners
>>
>> | Code owners are automatically requested for review when someone opens a
>> pull request that modifies code that they own. Code owners are not
>> automatically requested to review draft pull requests. For more information
>> about draft pull requests, see "About pull requests." When you mark a draft
>> pull request as ready for review, code owners are automatically notified.
>> If you convert a pull request to a draft, people who are already subscribed
>> to notifications are not automatically unsubscribed. For more information,
>> see "Changing the stage of a pull request."
>>
>> This is an extremely poor version of what you try to do in Beam (just
>> assign everyone who is code owner as reviewer, no round-robin, no reviewers
>> role etc.), but maybe you want to try it quickly if you want to test if any
>> kind of "ownership" might help with at least initial vetting of PRs.
>> This feature is enabled by literally committing one - gitignore-like -
>> file to repo, so it can be introduced extremely quickly.
>>
>> Airlfow's CODEOWNERS here as an example:
>> https://github.com/apache/airflow/blob/main/.github/CODEOWNERS
>>
>> J.
>>
>> On Thu, Feb 10, 2022 at 7:31 AM Ahmet Altay  wrote:
>>
>>> Thank you Danny. I think this is a great problem to solve, and the
>>> proposal looks great too :) I added comments as others but overall I like
>>> it.
>>>
>>> On Wed, Feb 9, 2022 at 3:02 PM Brian Hulette 
>>> wrote:
>>>
>>>> Thanks Danny! I left a few suggestions in the doc but I very much like
>>>> this idea overall.
>>>>
>>>> I especially like that "reviewers" is orthogonal to "committers",
>>>> giving new contributors a clear way to volunteer to help out with code
>>>> reviews. If we do this we should document it in the contribution guide [1].
>>>>
>>>> [1] https://beam.apache.org/contribute/
>>>>
>>>> On Wed, Feb 9, 2022 at 2:54 PM Kerry Donny-Clark 
>>>> wrote:
>>>>
>>>>> Danny, this looks like a great mechanism to ensure we review PRs
>>>>> quickly and distribute the review work more evenly.
>>>>> Thanks for outlining a clear plan. I strongly support this.
>>>>> Kerry
>>>>>
>>>>> On Wed, Feb 9, 2022, 5:16 PM Danny McCormick <
>>>>> dannymccorm...@google.com> wrote:
>>>>>
>>>>>> Hey everyone, I put together a design doc for automating the
>>>>>> assignment of reviewers in Beam pull requests. I'd appreciate any 
>>>>>> thoughts
>>>>>> you have!
>>>>>>
>>>>>> Right now, we don't have a well defined automated system for sta

Re: [RFC][Design] Automate Reviewer Assignment

2022-02-10 Thread Jarek Potiuk
Very interesting - as an outsider, I am interested to see how this
initiative works out for the Beam community.

Just one comment - maybe you do not know, but GitHub has a "CODEOWNERS"
feature (I notice you are not using it). Quoting from
https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners

| Code owners are automatically requested for review when someone opens a
pull request that modifies code that they own. Code owners are not
automatically requested to review draft pull requests. For more information
about draft pull requests, see "About pull requests." When you mark a draft
pull request as ready for review, code owners are automatically notified.
If you convert a pull request to a draft, people who are already subscribed
to notifications are not automatically unsubscribed. For more information,
see "Changing the stage of a pull request."

This is an extremely poor version of what you are trying to do in Beam (it
just assigns everyone who is a code owner as a reviewer - no round-robin, no
reviewer role, etc.), but maybe you want to try it quickly to test whether any
kind of "ownership" helps with at least the initial vetting of PRs.
The feature is enabled by literally committing one - gitignore-like - file to
the repo, so it can be introduced extremely quickly.

Airflow's CODEOWNERS, as an example:
https://github.com/apache/airflow/blob/main/.github/CODEOWNERS
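
For a rough idea of the format, a made-up sketch (the paths and handles below
are for illustration only, not taken from the Airflow file):

    # CODEOWNERS uses gitignore-like patterns; the listed users or teams are
    # automatically requested for review when a PR touches matching files
    /airflow/api/   @some-maintainer
    /docs/          @another-maintainer @example-org/docs-team
    *.sql           @db-maintainer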

J.

On Thu, Feb 10, 2022 at 7:31 AM Ahmet Altay  wrote:

> Thank you Danny. I think this is a great problem to solve, and the
> proposal looks great too :) I added comments as others but overall I like
> it.
>
> On Wed, Feb 9, 2022 at 3:02 PM Brian Hulette  wrote:
>
>> Thanks Danny! I left a few suggestions in the doc but I very much like
>> this idea overall.
>>
>> I especially like that "reviewers" is orthogonal to "committers", giving
>> new contributors a clear way to volunteer to help out with code reviews. If
>> we do this we should document it in the contribution guide [1].
>>
>> [1] https://beam.apache.org/contribute/
>>
>> On Wed, Feb 9, 2022 at 2:54 PM Kerry Donny-Clark 
>> wrote:
>>
>>> Danny, this looks like a great mechanism to ensure we review PRs quickly
>>> and distribute the review work more evenly.
>>> Thanks for outlining a clear plan. I strongly support this.
>>> Kerry
>>>
>>> On Wed, Feb 9, 2022, 5:16 PM Danny McCormick 
>>> wrote:
>>>
 Hey everyone, I put together a design doc for automating the assignment
 of reviewers in Beam pull requests. I'd appreciate any thoughts you have!

 Right now, we don't have a well defined automated system for staying on
 top of pull request reviews - we rely on contributors being able to find
 the correct OWNERS file and committers manually triaging/calling attention
 to old pull requests. This doc proposes adding automation driven by GitHub
 Actions to automatically round robin new PR reviews to a set of
 contributors, thus balancing the load. It also proposes adding a new role
 within the beam community of a reviewer who is responsible for an
 initial code review on some PRs before they are routed to a committer for
 final review.

 Please share any feedback or support here -
 https://docs.google.com/document/d/1FhRPRD6VXkYlLAPhNfZB7y2Yese2FCWBzjx67d3TjBo/edit?usp=sharing

 Thanks,
 Danny

>>>


Re: Developing on an M1 Mac

2022-02-08 Thread Jarek Potiuk
Just for your information: thanks to that change, I will soon be adding ARM
support for Apache Airflow - including building and publishing the images and
running our tests (using self-hosted runners).
As soon as I get it working, I will be able to share the code/experiences
with you.

J

On Tue, Feb 8, 2022 at 2:50 PM Ismaël Mejía  wrote:

> For awareness with the just released Beam 2.36.0 Beam works out of the
> box to develop on a Mac M1.
>
> I tried Java and Python pipelines with success running locally on both
> Flink/Spark runner.
> I found one issue using zstd and created [1] that was merged today,
> with this the sdks:core tests and Spark runner tests fully pass.
>
> I would see 2.36.0 is the first good enough release for someone
> working on a Mac M1 or ARM64 processor.
>
> There are still some missing steps to have full ARM64 [apart of testing it
> :)]
>
> 1. In theory we could run docker x86 images on ARM but those would be
> emulated so way slower so it is probably better to support 'native'
> CPUs) via multiarchitecture docker images [2].
> BEAM-11704 Support Beam docker images on ARM64
>
> I could create the runners images from master, for the SDK containers
> there are some issues with hardcoded paths [2] and virtualenv that
> probably will be solved once we move to venv, and we will need to
> upgrade our release process to include multiarch images (for user
> friendliness).
>
> Also golang only supports officially ARM64 starting with version
> 1.18.0 so we need to move up to that version.
>
> Anyway Beam is in a waaay better shape for ARM64 now than 1y ago when
> I created the initial JIRAs.
>
> Ismaël
>
> [1] https://github.com/apache/beam/pull/16755
> [2] https://issues.apache.org/jira/browse/BEAM-11704
> [3]
> https://github.com/apache/beam/blob/d1b8e569fd651975f08823a3db49dbee56d491b5/sdks/python/container/Dockerfile#L79
>
>
>
>> Could not find protoc-3.14.0-osx-aarch_64.exe
> (com.google.protobuf:protoc:3.14.0).
>  Searched in the following locations:
>
> https://jcenter.bintray.com/com/google/protobuf/protoc/3.14.0/protoc-3.14.0-osx-aarch_64.exe
>
>
>
>
>
> On Wed, Jan 12, 2022 at 9:53 PM Luke Cwik  wrote:
> >
> > The docker container running in an x86 based cloud machine should work
> pretty well. This is what Apache Beam's Jenkins setup effectively does.
> >
> > No experience with developing on an ARM based CPU.
> >
> > On Wed, Jan 12, 2022 at 9:28 AM Jarek Potiuk  wrote:
> >>
> >> Comment from the side - If you use Docker - experience from Airflow -
> >> until we will get ARM images, docker experience is next to unusable
> >> (docker filesystem slowness + emulation).
> >>
> >> J.
> >>
> >> On Wed, Jan 12, 2022 at 6:21 PM Daniel Collins 
> wrote:
> >> >
> >> > I regularly develop on a non-m1 mac using intellij, which mostly
> works out of the box. Are you running into any particular issues building
> or just looking for advice?
> >> >
> >> > -Daniel
> >> >
> >> > On Wed, Jan 12, 2022 at 12:16 PM Matt Rudary <
> matt.rud...@twosigma.com> wrote:
> >> >>
> >> >> Does anyone do Beam development on an M1 Mac? Any tips to getting
> things up and running?
> >> >>
> >> >>
> >> >>
> >> >> Alternatively, does anyone have a good “workstation in the cloud”
> setup?
> >> >>
> >> >>
> >> >>
> >> >> Thanks
> >> >>
> >> >> Matt
>


Re: [ANNOUNCE] Apache Beam 2.36.0 Release

2022-02-08 Thread Jarek Potiuk
Thanks a lot for that Emily!

This is a release we have been waiting for at Apache Airflow.
I believe it will unblock a number of "modernizations" in our pipeline -
Python 3.10 and ARM support were depending on it quite a bit (mostly through
the numpy transitive dependency limitation). Great to see this one out!

J.

On Tue, Feb 8, 2022 at 3:39 AM Emily Ye  wrote:

> The Apache Beam team is pleased to announce the release of version 2.36.0.
>
> Apache Beam is an open source unified programming model to define and
> execute data processing pipelines, including ETL, batch and stream
> (continuous) processing. See https://beam.apache.org
>
> You can download the release here:
> https://beam.apache.org/get-started/downloads/
>
> This release includes bug fixes, features, and improvements detailed
> on the Beam blog: https://beam.apache.org/blog/beam-2.36.0/
>
> Thank you to everyone who contributed to this release, and we hope you
> enjoy using Beam 2.36.0
>
> - Emily, on behalf of the Apache Beam community.
>


Re: [DISCUSS] Migrate Jira to GitHub Issues?

2022-01-31 Thread Jarek Potiuk
> I know that we currently can link multiple PRs to a single Jira, but
GitHub assumes a PR linked to an issue fixes the issue. You also need write
access to the repository to link the PR outside of using a "closing
keyword". (For reference: Linking a pull request to an issue
<https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue>
)

Not entirely correct.

You can link a PR to an issue just by mentioning #issue-number in the commit
message. If you do not prefix it with "Closes:", "Fixes:" or similar, it will
just be linked - and yeah, you can have multiple PRs linked to the same issue
without closing it.
Anyone (even with only read access) can also link issues and PRs by referring
to each other in comments in the GitHub UI. This has a very nicely working
auto-complete feature: you just start typing # and some words, and if the
words are related to the issue, it will more often than not let you choose
the issue via autocomplete.
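
A hypothetical example of the difference in a commit message or PR description
(the issue number is made up):

    Related to #1234    <- only links the PR to issue 1234
    Fixes #1234         <- links it AND closes the issue once the change
                           lands on the default branch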

This is actually one of the reasons why in Airflow we not only have all PRs
and issues interlinked, but we are even able to create this kind of automated
issue: https://github.com/apache/airflow/issues/20615 - which not only refers
to the PRs and issues solved since the last release, but also automatically
mentions the people who were involved in solving them. This works because
those users are shared between the PRs and the issues.

J.


On Mon, Jan 31, 2022 at 6:21 PM Zachary Houfek  wrote:

> I added a suggestion that I don't think was discussed here:
>
> I know that we currently can link multiple PRs to a single Jira, but
> GitHub assumes a PR linked to an issue fixes the issue. You also need write
> access to the repository to link the PR outside of using a "closing
> keyword". (For reference: Linking a pull request to an issue
> <https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue>
> )
>
> I'm not sure how much this could sway the decisions but thought it was
> worth bringing up.
>
> Regards,
> Zach
>
> On Mon, Jan 31, 2022 at 12:06 PM Jarek Potiuk  wrote:
>
>> Just a comment here to clarify the labels from someone who has been using
>> both - ASF (and not only) JIRA and GitHub.
>>
>> The experience from  JIRA labels might be awfully misleading. The JIRA
>> labels are a mess in the ASF because they are shared between projects and
>> everyone can create a new label. "Mess" is actually quite an understatement
>> IMHO.
>>
>> The labels in GitHub Issues are "per-project" and they can only be added
>> and modified by maintainers (and only maintainers and "issue triagers" can
>> actually assign them other than the initial assignment when you create an
>> issue.
>>
>> Thanks to that, it is much easier to agree on the common "conventions" to
>> use and avoid creating new ones accidentally.
>>
>> We have quite a success with using the labels in Airflow as we use some
>> of the stuff below:
>>
>> Re - some fancy enforcement/management, yeah. There are good techniques
>> to control how/when the labels are attached:
>>
>> 1) You can create separate templates for Bugs/Features that can have
>> different labels pre-assigned. See here:
>> https://github.com/apache/airflow/tree/main/.github/ISSUE_TEMPLATE -
>> this way you can delegate to the users to make basic "label choice" when
>> they enter issues (though limited - 4-5 types of issues to choose from is
>> really maximum what is reasonable).
>> 2) The same "Issue Templates" already have the option to choose
>> "selectable fields" at entry - you can define free-form entries, drop-down,
>> checkboxes and a few others. This is as close as it can get to "fields".
>> Then (this is something you'd have to code) you could easily write or use
>> an existing GithubAction or bot that will assign the labels based on the
>> initial selection done by the user at entry. We have not done it yet but we
>> might.
>> 3) In PRs you can (and we do that in Airflow) write your bot or use
>> existing GitHub Actions to automatically select the labels based on the
>> "files" that have been changed in the PR: We are doing precisely that in
>> airflow and it works pretty well:
>> https://github.com/apache/airflow/blob/main/.github/boring-cyborg.yml
>>
>> You are in full control, and you can choose the convention and approach
>> for the project.
>> There are literally hundreds of GitHub Actions out there and you can
>> easily write a new one to manage it and you do not need anything but PR
>> merged to the repository to e

Re: [DISCUSS] Migrate Jira to GitHub Issues?

2022-01-31 Thread Jarek Potiuk
>>>>>>> >>>> >   • We use nested issues and issue relations in jira, but
>>>>>>> as far as I know robots don’t use them and we don’t query them much, so
>>>>>>> we’re not losing anything by moving from an API to plain English
>>>>>>> descriptions: “This issue is blocked by issue #n.” Mentions show up
>>>>>>> automatically on other issues.
>>>>>>> >>>> >   • For component, type, priority, etc., we can use
>>>>>>> Github labels.
>>>>>>> >>>> >   • Version(s) affected is used inconsistently, and as
>>>>>>> far as I know only by humans, so a simple English description is fine. 
>>>>>>> We
>>>>>>> can follow the example of other projects and make the version affected a
>>>>>>> part of the issue template.
>>>>>>> >>>> >   • For fix version, which we use to track which issues
>>>>>>> we want to fix in upcoming releases, as well as automatically generate
>>>>>>> release notes: Github has “milestones,” which can be marked on PRs or
>>>>>>> issues, or both.
>>>>>>> >>>> >   • IMO the automatically generated JIRA release
>>>>>>> notes are not especially useful anyway. They are too detailed for a 
>>>>>>> quick
>>>>>>> summary, and not precise enough to show everything. For a readable 
>>>>>>> summary,
>>>>>>> we use CHANGES.md to highlight changes we especially want users to know
>>>>>>> about. For a complete list of changes, there’s the git commit log, 
>>>>>>> which is
>>>>>>> the ultimate source of truth.
>>>>>>> >>>> >   • We’d only want to preserve reporter and assignee if
>>>>>>> we’re planning on migrating everything automatically, and even then I 
>>>>>>> think
>>>>>>> it’d be fine to compile a map of active contributors and drop the rest.
>>>>>>> >>>> >
>>>>>>> >>>> > As for the advantages of switching (just the ones off the top
>>>>>>> of my head):
>>>>>>> >>>> >   • As others have mentioned, it’s less burden for new
>>>>>>> contributors to create new issues and comment on existing ones.
>>>>>>> >>>> >   • Effortless linking between issues and PRs.
>>>>>>> >>>> >   • Github -> jira links were working for a short
>>>>>>> while, but they seem to be broken at the moment.
>>>>>>> >>>> >   • Jira -> github links only show: “links to
>>>>>>> GitHub Pull Request #x”. They don’t say the status of the PR, so you
>>>>>>> have to follow the link to find out. Especially inconvenient when one 
>>>>>>> jira
>>>>>>> maps to several PRs, and you have to open all the links to get a 
>>>>>>> summary of
>>>>>>> what work was done.
>>>>>>> >>>> >   • When you mention a GH issue in a pull
>>>>>>> request, a link to the PR will automatically appear on the issue, 
>>>>>>> including
>>>>>>> not just the ID but also the PR’s description and status
>>>>>>> (open/closed/draft/merged/etc.), and if you hover it will show a 
>>>>>>> preview as
>>>>>>> well.
>>>>>>> >>>> >   • We frequently merge a PR and then forget to
>>>>>>> mark the jira as closed. Whereas if a PR is linked to a GH issue using 
>>>>>>> the
>>>>>>> “closes” keyword, the GH issue will automatically be closed [3].
>>>>>>> >>>> >   • I don’t have to look up or guess whether a github
>>>>>>> account and jira account belong to the same person.
>>>>>>> >>>> >   • There’s a single unified search bar to find issues,
>>>>>>> PRs, and code.
>>>>>>> >>>> >   • Github enables markdown formatting everywhere, which
>>>>>>> is more or less the industry standard, whereas Jira has its own be

Re: Python SDK release good for Python 3.10/M1

2022-01-24 Thread Jarek Potiuk
Thanks Kamil! That will do! We shall wait :D.

On Mon, Jan 24, 2022 at 12:21 PM Kamil Bregula 
wrote:

> The Beam 2.36.0 release is scheduled to be cut on 2021-12-29 (Wednesday)
> and released by 2022-02-02 according to the release calendar [1]. This
> release requires numpy>=1.14.3,<1.22.0' in Python SDK [2]. We should have
> the first RC this week.
>
> [1]
> https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com=America/Los_Angeles
> [2]
> https://github.com/apache/beam/blob/release-2.36.0/sdks/python/setup.py#L145
>
>
> On Sun, Jan 23, 2022 at 11:34 PM Jarek Potiuk  wrote:
>
>> Hello Apache Beam Friends,
>>
>> I have attempted today (that was yet another attempt) to prepare an
>> Apache Airflow CI image for testing with Python 3.10.
>>
>> Unlike previous attempts (where there were quite a few deps that lagged
>> behind)  - this one was **almost** successful.
>>
>> I think the last (or at least one of the last serious) blockers is the
>> Numpy < 1.21 limit: no numpy release below 1.21 ships binary wheels for
>> Python 3.10 (nor for macOS M1). This constraint comes from the latest
>> apache-beam SDK 2.35.0.
>>
>> I noticed, however, that the Beam Python SDK NumPy limit has already been
>> bumped to Numpy < 1.22  (for M1) in the main branch of your repo. This one
>> could solve the Python 3.10 problem (and we are also looking at M1 images
>> next):
>>
>>
>> https://github.com/apache/beam/commit/d845a0074d39a1604fde1879157f55e048a5d01b#diff-1275c48808de339ef6f282d844c83ec441b5cfa0debc373fdcb7dba497da4fc8
>>
>>
>> Any chance for a new release of Apache Beam Python SDK soon with numpy <
>> 1.22 instead of <1.21?
>>
>> In the past we did some exclusion for Beam when we wanted to release
>> Python 3.9, but maybe - if we know that release is coming, we could simply
>> wait :).
>>
>> Any way we can help and speed this up?
>>
>> J.
>>
>>


Python SDK release good for Python 3.10/M1

2022-01-23 Thread Jarek Potiuk
Hello Apache Beam Friends,

I have attempted today (that was yet another attempt) to prepare an Apache
Airflow CI image for testing with Python 3.10.

Unlike previous attempts (where there were quite a few deps that lagged
behind)  - this one was **almost** successful.

I think the last (or at least one of the last serious) blockers is the
Numpy < 1.21 limit: no numpy release below 1.21 ships binary wheels for
Python 3.10 (nor for macOS M1). This constraint comes from the latest
apache-beam SDK 2.35.0.
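
For anyone who wants to quickly confirm which numpy bound their locally
installed apache-beam declares, a minimal sketch (standard library only,
Python 3.8+; just an illustration, not part of any SDK):

    # Print the numpy requirement declared by the installed apache-beam,
    # e.g. an upper bound of <1.21 for the 2.35.0 SDK.
    from importlib.metadata import requires

    for req in requires("apache-beam") or []:
        if req.startswith("numpy"):
            print(req)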

I noticed, however, that the Beam Python SDK NumPy limit has already been
bumped to Numpy < 1.22  (for M1) in the main branch of your repo. This one
could solve the Python 3.10 problem (and we are also looking at M1 images
next):

https://github.com/apache/beam/commit/d845a0074d39a1604fde1879157f55e048a5d01b#diff-1275c48808de339ef6f282d844c83ec441b5cfa0debc373fdcb7dba497da4fc8


Any chance for a new release of Apache Beam Python SDK soon with numpy <
1.22 instead of <1.21?

In the past we did some exclusion for Beam when we wanted to release Python
3.9, but maybe - if we know that release is coming, we could simply wait :).

Any way we can help and speed this up?

J.


Re: Developing on an M1 Mac

2022-01-12 Thread Jarek Potiuk
Comment from the side - if you use Docker, our experience from Airflow is
that until we get ARM images, the Docker experience is next to unusable
(docker filesystem slowness + emulation).

J.

On Wed, Jan 12, 2022 at 6:21 PM Daniel Collins  wrote:
>
> I regularly develop on a non-m1 mac using intellij, which mostly works out of 
> the box. Are you running into any particular issues building or just looking 
> for advice?
>
> -Daniel
>
> On Wed, Jan 12, 2022 at 12:16 PM Matt Rudary  wrote:
>>
>> Does anyone do Beam development on an M1 Mac? Any tips to getting things up 
>> and running?
>>
>>
>>
>> Alternatively, does anyone have a good “workstation in the cloud” setup?
>>
>>
>>
>> Thanks
>>
>> Matt


Re: [DISCUSS] Migrate Jira to GitHub Issues?

2021-12-07 Thread Jarek Potiuk
> Do I understand correctly that this transition (if it will happen) includes 
> the transfer of all Beam Jira archive to GitHub issues with a proper 
> statuses/comments/refs/etc? If not, what are the options?

Suggestion from the experience of Airflow again - you can look it up
in our notes.

Initially we tried to copy the issues manually or in bulk, but eventually
we decided to tap into the wisdom and cooperation of our community.

We migrated only some (not many) important things ourselves and asked our
users to move the issues they still considered relevant/important to them.
We closed JIRA for new entries and left the existing issues in JIRA in a
read-only state so that we could always refer to them if needed.

So rather than proactively copying the issues ourselves, we asked the
users to decide which issues were important to them and to move those
proactively, and we left the option of moving issues reactively if
someone came back to one later.

That turned out to be a smart decision considering the effort a full
migration would have required vs. the results achieved. It also helped
us to clean out some "stale/useless/not important" issues.

We had 1719 open JIRA issues when we migrated. Over the course of
~1.5 years (since about April 2020) we've had ~140 issues that refer
to any of the JIRA issues
https://github.com/apache/airflow/issues?q=is%3Aissue+is%3Aclosed+%22https%3A%2F%2Fissues.apache.org%2Fjira%22+.
Currently we have > 4500 GH issues (3700 closed, 800 opened).

This means that, roughly speaking, only < 10% of the original open JIRA
issues (~140 of 1719, i.e. about 8%) were actually somewhat valuable, and
they make up < 5% of today's numbers (~140 of 4500+). Of course some of
the new GH issues duplicated those JIRA ones, but not many I think,
especially since those JIRA issues referred mostly to older Airflow
versions.

One more comment for the migration - I STRONGLY recommend using
well-designed templates for GH issues from day one. That significantly
improves the quality of issues. I also recommend using Discussions as the
place where you move unclear/non-reproducible issues (for example, guiding
users to open a discussion if they have no clearly reproducible case).
This significantly reduces the "bad issue" overload (see also more
detailed comments in
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=191332632).

I personally think a well-designed entry process for new issues is more
important than migrating old issues in bulk - especially if you ask users
to help. They will either have to make a structured entry (with
potentially more detailed information and reproducibility details), or
they will decide themselves that opening a GitHub discussion is better
than opening an issue if they do not have a reproducible case. Or they
will give up if too much information is needed (but that means their
issue was essentially not that important IMHO).

But this is just friendly advice from the experience of those who did
it quite some time ago :)

J.

On Wed, Dec 8, 2021 at 1:08 AM Brian Hulette  wrote:
>
> At this point I just wanted to see if the community is interested in such a 
> change or if there are any hard blockers. If we do go down this path I think 
> we should port jiras over to GH Issues. You're right this isn't trivial, 
> there's no ready-made solution we can use, we'd need to decide on a mapping 
> for everything and write a tool to do the migration. It sounds like there may 
> be other work in this area we can build on (e.g. Airflow may have made a tool 
> we can work from?).
>
> I honestly don't have much experience with GH Issues so I can't provide 
> concrete examples of better usability (maybe Jarek can?). From my perspective:
> - I hear a lot of grumbling about jira, and a lot of praise for GitHub Issues.
> - Most new users/contributors already have a GitHub account, and very few 
> already have an ASF account. It sounds silly, but I'm sure this is a barrier 
> for engaging with the community. Filing an issue, or commenting on one to 
> provide additional context, or asking a clarifying question about a starter 
> task should be very quick and easy - I bet a lot of these interactions are 
> blocked at the jira registration page.
>
> Brian
>
> On Tue, Dec 7, 2021 at 9:04 AM Alexey Romanenko  
> wrote:
>>
>> Do I understand correctly that this transition (if it will happen) includes 
>> the transfer of all Beam Jira archive to GitHub issues with a proper 
>> statuses/comments/refs/etc? If not, what are the options?
>>
>> Since this transfer looks quite complicated at the first glance, what are 
>> the real key advantages (some concrete examples are very appreciated) to 
>> initiate this process and what are the show-stoppers for us with a current 
>> Jira workflow?
>>
>> —
>> Alexey
>>
>> On 6 Dec 2021, at 19:48, Ud

Re: [DISCUSS] Migrate Jira to GitHub Issues?

2021-12-04 Thread Jarek Potiuk
Just to add a comment on those requirements, Kenneth, looking into the
near future.

Soon GitHub Issues will make generally available a whole new way of
interacting with issues (without removing the current way), which I think
will greatly improve all aspects of what you mentioned. The issues (and
associated projects) will gain new capabilities:

* structured metadata that you will be able to define (much better
than unstructured labels)
* table-like visualisations which will allow for fast, bulk,
keyboard-driven management
* better automation of workflows
* complete APIs to manage the issues (good for GitHub Actions
integration for example)

Re: assigning by non-committers - this is one of the things that won't
work currently. Only committers can assign issues, and only to a user who
has commented on the issue. But it works nicely in practice - when a user
comments "I want to work on that issue", a committer assigns the user.
And it could easily be automated as well, as sketched below.
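
A rough sketch of what such automation could look like, e.g. a small
script triggered by a GitHub Actions issue_comment event (the repository
name, token handling and trigger phrase are placeholders, not anything
Beam or Airflow has today):

    # Assign the commenter to the issue when they ask to work on it.
    import os
    import requests

    def maybe_assign(repo: str, issue_number: int, commenter: str, comment_body: str):
        if "i want to work on" not in comment_body.lower():
            return
        # GitHub REST endpoint: POST /repos/{owner}/{repo}/issues/{number}/assignees
        resp = requests.post(
            f"https://api.github.com/repos/{repo}/issues/{issue_number}/assignees",
            headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
            json={"assignees": [commenter]},
        )
        resp.raise_for_status()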

You can see what it is about here: https://github.com/features/issues

It is currently in "Public Beta" and heading towards General
Availability, but it is not available to "open" (public) projects yet.
However, I have a promise from the GitHub product manager (a friend of
mine heads the team implementing it) that the ASF will be first on the
list when public projects are enabled, because it looks like it will make
our triaging and organisation much better.

J.

On Sat, Dec 4, 2021 at 1:46 AM Kenneth Knowles  wrote:
>
> This sounds really good to me. Much more familiar to newcomers. I think we 
> end up doing a lot more ad hoc stuff with labels, yes? Probably worth having 
> a specific plan. Things I care about:
>
> - priorities with documented meaning
> - targeting issues to future releases
> - basic visualizations (mainly total vs open issues over time)
> - tags / components
> - editing/assigning by non-committers
> - workflow supporting "needs triage" (default) -> open -> resolved
>
> I think a lot of the above is done via ad hoc labels but I'm not sure if 
> there are other fancy ways to do it.
>
> Anyhow we should switch even if there is a feature gap for the sake of 
> community.
>
> Kenn
>
> On Fri, Dec 3, 2021 at 3:06 PM David Huntsperger  
> wrote:
>>
>> Yes, please. I can help clean up the website issues as part of a migration.
>>
>> On Fri, Dec 3, 2021 at 1:46 PM Robert Burke  wrote:
>>>
>>> Similar thing happened for Go migrating to use GH issues for everything 
>>> from Language Feature proposals to bugs. Much easier than the very gerrit 
>>> driven process it was before, and User Discussions are far more 
>>> discoverable by users: they usually already have a GH account, and don't 
>>> need to create a new separate one.
>>>
>>> GitHub does seem to permit user directed templates for issues so we can 
>>> simplify issue triage by users: Eg for Go there are a number of requests 
>>> one can make: https://github.com/golang/go/issues/new/choose
>>>
>>> On Fri, Dec 3, 2021, 12:17 PM Andy Ye  wrote:
>>>>
>>>> Chiming in from the perspective of a new Beam contributor. +1 on Github 
>>>> issues. I feel like it would be easier to learn about and contribute to 
>>>> existing issues/bugs if it were tracked in the same place as that of the 
>>>> source code, rather than bouncing back and forth between the two different 
>>>> sites.
>>>>
>>>> On Fri, Dec 3, 2021 at 1:18 PM Jarek Potiuk  wrote:
>>>>>
>>>>> Comment from a friendly outsider.
>>>>>
>>>>> TL; DR; Yes. Do migrate. Highly recommended.
>>>>>
>>>>> There were already similar discussions happening recently (community
>>>>> and infra mailing lists) and as a result I captured Airflow's
>>>>> experiences and recommendations in the BUILD wiki. You might find some
>>>>> hints and suggestions to follow as well as our experiences at Airflow:
>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=191332632
>>>>>
>>>>> J,
>>>>>
>>>>>
>>>>> On Fri, Dec 3, 2021 at 7:46 PM Brian Hulette  wrote:
>>>>> >
>>>>> > Hi all,
>>>>> > I wanted to start a discussion to gauge interest on moving our issue 
>>>>> > tracking from the ASF Jira to GitHub Issues.
>>>>> >
>>>>> > Pros:
>>>>> > + GH Issues is more discoverable and approachable for new users and 
>>>>> > contributors.
>>>>> > + For contributors at

Re: [DISCUSS] Migrate Jira to GitHub Issues?

2021-12-03 Thread Jarek Potiuk
Comment from a friendly outsider.

TL; DR; Yes. Do migrate. Highly recommended.

There were already similar discussions happening recently (community
and infra mailing lists) and as a result I captured Airflow's
experiences and recommendations in the BUILD wiki. You might find some
hints and suggestions to follow as well as our experiences at Airflow:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=191332632

J,


On Fri, Dec 3, 2021 at 7:46 PM Brian Hulette  wrote:
>
> Hi all,
> I wanted to start a discussion to gauge interest on moving our issue tracking 
> from the ASF Jira to GitHub Issues.
>
> Pros:
> + GH Issues is more discoverable and approachable for new users and 
> contributors.
> + For contributors at Google: we have tooling to integrate GH Issues with 
> internal issue tracking, which would help us be more accountable (Full 
> disclosure: this is the reason I started thinking about this).
>
> Cons:
> - GH Issues can't be linked to jiras for other ASF projects (I don't think we 
> do this often in jira anyway).
> - We would likely need to do a one-time migration of jiras to GH Issues, and 
> update any processes or automation built on jira (e.g. release notes).
> - Anything else?
>
> I've always thought that using ASF Jira was a hard requirement for Apache 
> projects, but that is not the case. Other Apache projects are using GitHub 
> Issues today, for example the Arrow DataFusion sub-project uses GitHub issues 
> now [1,2] and Airflow migrated from jira [3] to GitHub issues [4].
>
> [1] https://lists.apache.org/thread/w3dr1vlt9115r3x9m7bprmo4zpnog483
> [2] https://github.com/apache/arrow-datafusion/issues
> [3] https://issues.apache.org/jira/projects/AIRFLOW/issues
> [4] https://github.com/apache/airflow/issues


Re: Debugging GitHub Actions workflows

2021-11-07 Thread Jarek Potiuk
You can try https://github.com/nektos/act

J.

On Wed, Nov 3, 2021 at 9:09 PM Valentyn Tymofieiev  wrote:
>
> Does anybody know how one can replicate an environment used by GitHub Actions 
> so that one can SSH into it (or some equivalent), modify the environment in 
> realtime, and try out commands that the GH action runs?
>
> I am trying to reduce the feedback loop of having to iterate on  a PR for a 
> change that does not pass tests due to issues with environment on the actions 
> worker.
>
> Thanks!


Re: Following up on the migration of the GA runners over to Google Cloud.

2021-08-11 Thread Jarek Potiuk
Just one more caveat and a few comments so that you realise the full scope
of the danger (at least as far as I know it) and can make informed decisions.

I think you should simply weigh the risks vs. the costs. As usual with
security, it is never a 0-1 case; it's always about how much investment you
can make to mitigate known risks versus the cost of a potential "breach".

Costs:

* Our solution with a patched runner relies on periodic updates and
re-releasing of the patched runner. GitHub Actions has a policy that when
a new runner version is released, the old one stops working "a few days
later". This happened to us today. We do not have the process fully
automated, and some parts of it (building the AMI and deploying it) are
done semi-automatically. We had a few days of delay and the person doing
it was on vacation... It all ended up with ~20 hours of downtime for our
CI tests today :(. So if you are going the "patching" route, be prepared
for some maintenance overhead and disruption. We can probably automate
more, but the nature of it - and the fact that you can test it at most
once every few weeks, when they release a new version - makes it
"brittle". Still, we have run it for many months and, other than
occasional disruptions, it looks like a workable solution.

Risks:

* I believe you could deploy the runners in GKE containers instead of VMs
(to be confirmed! we never tried it), as long as you get Docker-in-Docker
(DinD) working for those containers. In fact, those are also our plans. We
already secured funds from the Cloud Composer team and we are planning to
run those runners in GCP. Running them as GKE containers (and killing each
container after the job is done) was my thinking as well. With that
approach it might really be possible to run an unpatched runner, because
it addresses most of the concerns about "cleaning the environment". One of
the big security concerns with the VM setup is that user A can potentially
influence a follow-up build of user B in ways that might have some adverse
effect. With proper containerization and one-time container use, at least
that part is addressed.
I might actually be happy to help (and I am quite sure other members of
the Airflow CI team would be too), and we could come up with a "reusable"
solution for hosting that we could share between projects. Also, if you
don't use GA to push any publicly available, user-facing artifacts (you
should not, really), the danger is really minimal here IMHO.

* However, this in-container approach does not (fully) address several
other concerns - but we are still in a much better situation than, say, in
February. One of the problems is that such a setup might easily be used
for crypto mining. And yes, this is a REAL concern. It's the reason why
GitHub introduced "approve to run" buttons to protect their GitHub-hosted
runners from being exploited:
https://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/
. Maybe this is actually quite an acceptable risk, taking into account
that we could also do some monitoring and flag such cases, and the same
"Approve to run" mechanism works for self-hosted runners as well. You can
also limit CPU usage, disk usage, the total number of jobs, etc. If we can
accept the risk and have some mechanism to react in case our monitoring
detects an anomaly (for example, pausing the workflows or switching them
to public runners temporarily), just monitoring for potential abuse could
be enough. And simply keeping the capacity "limited" makes it a very poor
target for bad actors - paired with "Approve to run", we should be safe.

* Yet another problem (and this one is, I think, still not fully
addressed) is that when you run your builds in "pull_request_target"
workflows, the jobs there might have access to "write" tokens that can
give a potential adversary uncontrolled access to your repo, packages,
etc. This is actually the most dangerous part. It has been somewhat
mitigated by the recent introduction of permission control (BTW, I raised
a bounty on that one in December and they told me I will not get it
because they "knew it" in December):
https://github.blog/changelog/2021-04-20-github-actions-control-permissions-for-github_token/.
However, if you are using any kind of "pull_request_target" or similar
workflow where you have to enable "write" access to your repo, you have to
be extra careful (and there is a risk of your mistakes being exploited in
ways that might go unnoticed). Here, I think proper and careful code
review of the workflow-related parts is crucial. For example, in Airflow
we run everything "substantial" during the build in docker containers - on
the GitHub Runner host we execute merely scripts that prepare variables
and build docker images, and then we run everything "substantial" in those
docker containers, isolating them from the host. This might be a good
strategy to limit the risk here.

My current thinking is that you could get a rather secure solution without

Re: Help needed with migration of GitHub Action Runners from GitHub to GKE.

2021-08-05 Thread Jarek Potiuk
I'd love to help, but I am on vacation next week. Just one word of warning:
if you want to run GitHub Runners on your own infrastructure, that might
introduce several security risks.

Basically anyone who makes a PR to your repo can compromise your runners.
The dangers of compromising runners are explained here:
https://docs.github.com/en/actions/learn-github-actions/security-hardening-for-github-actions#potential-impact-of-a-compromised-runner
And GitHub still very strongly discourages using self-hosted runners for
public repositories:

https://docs.github.com/en/actions/learn-github-actions/security-hardening-for-github-actions#hardening-for-self-hosted-runners

Self-hosted runners on GitHub do not have guarantees around running in
ephemeral clean virtual machines, and can be persistently compromised by
untrusted code in a workflow.
As a result, self-hosted runners should almost never be used for public
repositories on GitHub, because any user can open pull requests against the
repository and compromise the environment.

We've worked around it in Airflow: we run our self-hosted runners on
Amazon, but with a patched version of the runner:
https://github.com/ashb/runner. We needed to patch it because the original
runner cannot reject builds started by "unknown" users.
The patch is automatically rebased and re-applied whenever GitHub releases
a new runner (the old versions stop working a few days after a new version
is released). We keep a list of allowed people ("committers") who can
utilise such runners.
We also maintain a list of maintainers in our workflow whose builds are
"routed" to self-hosted runners:
https://github.com/apache/airflow/blob/1bd3a5c68c88cf3840073d6276460a108f864187/.github/workflows/ci.yml#L86

These are a bit of a hack (until GitHub implements proper support for
this), but if you want to avoid people using your runners to mine bitcoin,
steal secrets, or potentially modify your repository without you knowing
it, similar steps should be taken.

J.





On Thu, Aug 5, 2021 at 7:17 PM Pablo Estrada  wrote:

> That works for me! Maybe create a calendar invite and share it in this
> thread.
> Best
> -P.
>
> On Thu, Aug 5, 2021, 8:36 AM Fernando Morales Martinez <
> fernando.mora...@wizeline.com> wrote:
>
>> Thanks for the help, Pablo!
>> I'm also available most of Monday, Tuesday and Wednesday; how about we
>> set the meeting for Tuesday August 10th 2pm EST? In case someone interested
>> can't make it, we can adjust accordingly.
>> Thanks again!
>> -Fer
>>
>> On Wed, Aug 4, 2021 at 5:55 PM Pablo Estrada  wrote:
>>
>>> Hello Fernando!
>>> The people that built the GKE cluster infrastructure have not been
>>> involved with Beam for a while.
>>> I think you can set a time that is convenient for you, and invite others
>>> to participate - and we'll all figure it out together.
>>> I'm available most of Monday, Tuesday, Wednesday of next week. I'll be
>>> happy to jump on a call and be confused with you (and maybe others will
>>> too).
>>> Best
>>> -P.
>>>
>>> On Tue, Aug 3, 2021 at 11:20 AM Fernando Morales Martinez <
>>> fernando.mora...@wizeline.com> wrote:
>>>
 Hi everyone,
 As part of the work done to migrate the GitHub Actions runners over to
 GKE, the, not exhaustive, changes below were performed:

 - added a new secret to the apache-beam-testing project (this is the
 PAT needed by the docker image to connect to GitHub)
 - added a new docker image (which will execute the test flows still
 being ran by GitHub)

 and other changes which you can take a look at here:
 https://github.com/apache/beam/pull/15039/files
 Among those, the ones of notice are the following:
 - .github/workflows/build_wheels.yml
 - .github/workflows/cancel.yml
 - .github/workflows/java_tests.yml
 - .github/workflows/python_tests.yml
 - .test-infra/metrics/build.gradle
 - .test-infra/metrics/build_and_publish_containers.sh
 - .test-infra/metrics/docker-compose.yml
 - .test-infra/metrics/kubernetes/beamgrafana-deploy.yaml
 - .test-infra/metrics/sync/githubactions/Dockerfile
 - .test-infra/metrics/sync/githubactions/entrypoint.sh

 The docker image (.test-infra/metrics/sync/githubactions/Dockerfile)
 has been built and pushed to the GCP container registry already.

 Now, I need to deploy the new docker image to GKE. I've done that
 before, but in a testing cluster/namespace. I'd definitely feel more
 comfortable if someone knowledgeable with the Apache Beam Testing project
 GKE architecture watches over my shoulder.
 I think I have everything covered, but I could be missing some
 important piece.

 Thanks for the help!









Re: LGPL-2.1 in beam-vendor-grpc

2021-05-10 Thread Jarek Potiuk
Also, we have a very similar discussion about it in
https://issues.apache.org/jira/browse/LEGAL-572
Just to be clear about the context: it's not a legal requirement of the
Apache License, it's Apache Software Foundation policy that we should not
limit our users in how they use our software. If the LGPL dependency is
"optional", it's fine to add it as an optional dependency. If it is
"required" to run the software, then it is not allowed, as it limits the
users of ASF software in further redistributing it the way they want (this
is at least my understanding of it).

On Mon, May 10, 2021 at 12:58 PM JB Onofré  wrote:

> Hi
>
> You can take a look on
>
> https://www.apache.org/legal/resolved.html
>
> Regards
> JB
>
> Le 10 mai 2021 à 12:56, Elliotte Rusty Harold  a
> écrit :
>
> Anyone have a link to the official Apache policy about this? Thanks.
>
> On Mon, May 10, 2021 at 10:07 AM Jan Lukavský  wrote:
>
>
> Hi,
>
>
> we are bundling dependencies with LGPL-2.1, according to license header
>
> in META-INF/maven/org.jboss.modules/jboss-modules/pom.xml. I think is
>
> might be an issue, already reported here: [1]. I created [2] to track it
>
> on our side.
>
>
>  Jan
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-22555
>
>
> [2] https://issues.apache.org/jira/browse/BEAM-12316
>
>
>
>
> --
> Elliotte Rusty Harold
> elh...@ibiblio.org
>
>

-- 
+48 660 796 129


Re: Consider Cloudpickle instead of dill for Python pickling

2021-05-01 Thread Jarek Potiuk
Just my 2 cents from the user's perspective.

In Airflow, the narrow limits of `dill` caused some problems with
dependencies. We had to add some exceptions in our process for that:
https://github.com/apache/airflow/blob/master/Dockerfile#L246
https://github.com/apache/airflow/blob/master/Dockerfile.ci#L271 - so the
problem is largely solved for now, but if dill were used by any other
library it could become a problem again. I imagine cloudpickle is more
frequently used than dill, so it might become a problem if its bounds are
defined narrowly.

Currently, cloudpickle is already pulled into Airflow by Dask's
"distributed" library (but they only have lower-bound ">=" limits there):

distributed==2.19.0
  - click [required: >=6.6, installed: 7.1.2]
  - cloudpickle [required: >=1.3.0, installed: 1.4.1]
  - dask [required: >=2.9.0, installed: 2021.4.1]
- cloudpickle [required: >=1.1.1, installed: 1.4.1]
- fsspec [required: >=0.6.0, installed: 2021.4.0]

However, I have a better idea - why don't you simply vendor in either
`dill` or `cloudpickle` (I am not sure which one is best)?

Since you are not planning to upgrade it often (that's the whole point of
narrow versioning), you can have the best of both worlds - a stable
version used on both the client and server sides AND you would not be
limiting others. A rough illustration of the idea is sketched below.
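
To illustrate, a minimal sketch of vendoring (the _vendor module path and
the wrapper names are hypothetical, not an existing Beam API; unpickling
uses the standard pickle module, which can read cloudpickle output):

    import pickle

    try:
        # hypothetical: a copy of cloudpickle checked into the SDK tree,
        # so the pinned version never conflicts with the user's install
        from apache_beam._vendor import cloudpickle as _pickler
    except ImportError:
        import cloudpickle as _pickler  # fall back to the user's installation

    def dumps(obj) -> bytes:
        # the rest of the SDK only talks to this thin wrapper, so the choice
        # of dill vs. cloudpickle stays an internal implementation detail
        return _pickler.dumps(obj)

    def loads(data: bytes):
        return pickle.loads(data)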

J.


On Fri, Apr 30, 2021 at 9:42 PM Stephan Hoyer  wrote:

> Glad to hear this is something you've open to and in fact have already
> considered :)
>
> I may give implementing this a try, though I'm not familiar with how
> configuration options are managed in Beam, so that may be easier for a core
> developer to deal with.
>
> On Fri, Apr 30, 2021 at 10:58 AM Robert Bradshaw 
> wrote:
>
>> As I've mentioned before, I would be in favor of moving to cloudpicke,
>> first as an option, and if that works out well as the default. In
>> particular, pickling functions from the main session in a hermetic (if
>> sometimes slightly bulkier way) way as opposed to the main session
>> pickling gymnastics is far preferable (especially for interactive).
>>
>> Versioning is an issue in general, and a tradeoff between the
>> overheads of re-building the worker every time (either custom
>> containers or at runtime) vs. risking different versions, and we could
>> possibly do better more generally on both fronts (as well as making
>> this tradeoff clear). Fair point that Cloudpickle is less likely to
>> just work with pinning. On the other hand, Cloudpickle looks fairly
>> mature/stable at this point, so hopefully it wouldn't be too hard to
>> keep our containers closet to head. If there is an error, we could
>> consider catching it and raising a more explicit message about the
>> version things were pickled vs. unpickled with.
>>
>> I would welcome as a first step a PR that conditionally allows the use
>> of CloudPickle in the place of Dill (with the exception of DillCoder,
>> there should of course probably be a separate CloudPickleCoder).
>>
>> On Fri, Apr 30, 2021 at 10:17 AM Valentyn Tymofieiev
>>  wrote:
>> >
>> >
>> >
>> > On Fri, Apr 30, 2021 at 9:53 AM Brian Hulette 
>> wrote:
>> >>
>> >> > I think with cloudpickle we will not be able have a tight range.
>> >>
>> >> If cloudpickle is backwards compatible, we should be able to just keep
>> an upper bound in setup.py [1] synced up with a pinned version in
>> base_image_requirements.txt [2], right?
>> >
>> >
>> > With an upper bound only, dependency resolver could still downgrade
>> pickler on the runner' side, ideally we should be detecting that.
>> >
>> > Also if we ever depend on a newer functionality, we would add a lower
>> bound as well, which (for that particular Beam release), makes it a tight
>> bound, so potentially a friction point.
>> >
>> >>
>> >>
>> >> > We could solve this problem by passing the version of pickler used
>> at job submission
>> >>
>> >> A bit of a digression, but it may be worth considering something more
>> general here, for a couple of reasons:
>> >> - I've had a similar concern for the Beam DataFrame API. Our goal is
>> for it to match the behavior of the pandas version used at construction
>> time, but we could get into some surprising edge cases if the version of
>> pandas used to compute partial results in the SDK harness is different.
>> >> - Occasionally we have Dataflow customers report
>> NameErrors/AttributeErrors that can be attributed to a dependency mismatch.
>> It would be nice to proactively warn about this.
>> >>
>> >>
>> >> That being said I imagine it would be hard to do something truly
>> general since every dependency will have different compatibility guarantees.
>> >>
>> > I think it should be considered a best practice to have matching
>> dependencies on job submission and execution side. We can:
>> > 1)  consider sending a manifest of all locally installed dependencies
>> to the runner and verify on the runner's side that critical dependencies
>> are compatible.
>> > 2) help make it easier to ensure the dependencies match:
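
For illustration, the dependency-manifest idea from point 1 above could
look roughly like this (a sketch only; the helper names are assumptions,
not an existing Beam API):

    from importlib.metadata import distributions

    def local_dependency_manifest() -> dict:
        # {package: version} for everything installed on the submission side,
        # so the runner could compare it against its own environment
        return {dist.metadata["Name"]: dist.version for dist in distributions()}

    def warn_on_mismatch(submitted: dict, runner: dict,
                         critical=("cloudpickle", "numpy", "pandas")):
        for pkg in critical:
            if submitted.get(pkg) != runner.get(pkg):
                print(f"WARNING: {pkg} version mismatch: job submitted with "
                      f"{submitted.get(pkg)}, runner has {runner.get(pkg)}")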

Re: [VOTE] Release 2.29.0, release candidate #1

2021-04-25 Thread Jarek Potiuk
+1 (non-binding) 

Thanks for tirelessly working on improving the python client :).

This is a friendly visit from Apache Airflow here. I've just tested 2.29.0rc1 
against our "apache.beam" provider's tests and they are all green. Just to 
give a bit of context: we are eagerly waiting for the 2.29.0 release, as it 
will unblock a few things for us - most notably, the relaxed PyArrow 
dependency will help us add Python 3.9 support to Apache Airflow (it's been 
long overdue, and pyarrow < 3.0.0 coming from Apache Beam was one of the last 
blockers).

Also, FYI: I am happy to be a bit more involved with some (possible) future 
dependency improvements for Beam. We struggled a bit with pip 21, which has a 
hard time with some of the dependency conflicts. We've managed to work around 
it for the moment (https://github.com/apache/airflow/pull/15513), but I'm 
looking forward to improving this and making it better (especially moving all 
Google Python clients to > 2).

On 2021/04/23 01:46:51, Ahmet Altay  wrote: 
> +1 (binding)
> 
> I ran some python quick start examples. Most validations in the sheet were
> already done :) Thank you all!
> 
> On Thu, Apr 22, 2021 at 9:15 AM Kyle Weaver  wrote:
> 
> > +1 (non-binding)
> >
> > Ran Python wordcount on Flink and Spark.
> >
> > On Wed, Apr 21, 2021 at 5:20 PM Brian Hulette  wrote:
> >
> >> +1 (non-binding)
> >>
> >> I ran a python pipeline exercising the DataFrame API, and another
> >> exercising SQLTransform in Python, both on Dataflow.
> >>
> >> On Wed, Apr 21, 2021 at 12:55 PM Kenneth Knowles  wrote:
> >>
> >>> Since the artifacts were changed about 26 hours ago, I intend to leave
> >>> this vote open until 46 hours from now. Specifically, around noon my time
> >>> (US Pacific) on Friday I will close the vote and finalize the release, if
> >>> no problems are discovered.
> >>>
> >>> Kenn
> >>>
> >>> On Wed, Apr 21, 2021 at 12:52 PM Kenneth Knowles 
> >>> wrote:
> >>>
>  +1 (binding)
> 
>  I ran the script at
>  https://beam.apache.org/contribute/release-guide/#run-validations-using-run_rc_validationsh
>  except for the part that requires a GitHub PR, since Cham already did 
>  that
>  part.
> 
>  Kenn
> 
>  On Wed, Apr 21, 2021 at 12:11 PM Valentyn Tymofieiev <
>  valen...@google.com> wrote:
> 
> > +1, verified that my previous findings are fixed.
> >
> > On Wed, Apr 21, 2021 at 8:17 AM Chamikara Jayalath <
> > chamik...@google.com> wrote:
> >
> >> +1 (binding)
> >>
> >> Ran some Python scenarios and updated the spreadsheet.
> >>
> >> Thanks,
> >> Cham
> >>
> >> On Tue, Apr 20, 2021 at 3:39 PM Kenneth Knowles 
> >> wrote:
> >>
> >>>
> >>>
> >>> On Tue, Apr 20, 2021 at 3:24 PM Robert Bradshaw 
> >>> wrote:
> >>>
>  The artifacts and signatures look good to me. +1 (binding)
> 
>  (The release branch still has the .dev name, maybe you didn't push?
>  https://github.com/apache/beam/blob/release-2.29.0/sdks/python/apache_beam/version.py
>  )
> 
> >>>
> >>> Good point. I'll highlight that I finally implemented the branching
> >>> changes from
> >>> https://lists.apache.org/thread.html/205472bdaf3c2c5876533750d417c19b0d1078131a3dc04916082ce8%40%3Cdev.beam.apache.org%3E
> >>>
> >>> The new guide with diagram is here:
> >>> https://beam.apache.org/contribute/release-guide/#tag-a-chosen-commit-for-the-rc
> >>>
> >>> TL;DR:
> >>>  - the release branch continues to be dev/SNAPSHOT for 2.29.0 while
> >>> the main branch is now dev/SNAPSHOT for 2.30.0
> >>>  - the RC tag v2.29.0-RC1 no longer lies on the release branch. It
> >>> is a single tagged commit that removes the dev/SNAPSHOT suffix
> >>>
> >>> Kenn
> >>>
> >>>
>  On Tue, Apr 20, 2021 at 10:36 AM Kenneth Knowles 
>  wrote:
> 
> > Please take another look.
> >
> >  - I re-ran the RC creation script so the source release and
> > wheels are new and built from the RC tag. I confirmed the source 
> > zip and
> > wheels have version 2.29.0 (not .dev or -SNAPSHOT).
> >  - I fixed and rebuilt Dataflow worker container images from
> > exactly the RC commit, added dataclasses, with internal changes to 
> > get the
> > version to match.
> >  - I confirmed that the staged jars already have version 2.29.0
> > (not -SNAPSHOT).
> >  - I confirmed with `diff -r -q` that the source tarball matches
> > the RC tag (minus the .git* files and directories and gradlew)
> >
> > Kenn
> >
> > On Mon, Apr 19, 2021 at 9:19 PM Kenneth Knowles 
> > wrote:
> >
> >> At this point, the release train has just about come around to
> >> 2.30.0 which will pick up that change. I don't think it makes 
> >> sense to
>