Re: Call for fixes for Airflow 1.10.1

2018-09-13 Thread Gerardo Curiel
Hello,

I wonder if there is still time to wait for a fix for "BigQuery hook does
not allow specifying both the partition field name and table name at the
same time": https://issues.apache.org/jira/browse/AIRFLOW-2772.

The BigQueryHook is taking some liberties and implementing some client-side
logic that shouldn't be there.


On Wed, Sep 12, 2018 at 6:59 PM Driesprong, Fokko 
wrote:

> Hi Ash,
>
> I've cherry-picked two commits on top of 1.10-test branch:
>

Cheers,

-- 
Gerardo Curiel // https://gerar.do


Re: Use Docker for running Airflow tests

2018-09-05 Thread Gerardo Curiel
Hi Sid,

Definitely, there are a few things about this flow that could be better
documented. I'm planning to address a few things like this as part of
https://issues.apache.org/jira/browse/AIRFLOW-87. Do you have any specifics
about what you would like to be documented about this flow?

The build times have gone down but not by much, given we still need to
download the docker images, although there might be a way to cache those
between builds. I'll have a proper look at some point. The builds seem more
reliable though, given there are fewer failure points now (fewer package
downloads).

Cheers,

On Thu, Sep 6, 2018 at 11:46 AM Sid Anand  wrote:

> Any chance you can document a flow that ties the CI build artifacts to this
> repo and how we would use them? Would our build times go down?
>

-- 
Gerardo Curiel // https://gerar.do


Re: Jira cleanup and triage

2018-08-23 Thread Gerardo Curiel
On Wed, Aug 22, 2018 at 6:51 PM Driesprong, Fokko 
wrote:

> Recently we've moved the Apache Airflow repo to the Gitbox repo (
> https://gitbox.apache.org/). Before with the Apache repo, the Github repo
> was just a mirror of the Apache one. Now we do everything on Github itself.
> We still need to hook up the hooks to automagically close a Jira issue when
> the PR is being closed. This would work very well combined with the stale
> robot.
>

Sounds good!

> Ps. If you're looking for a new ticket, the CI still needs to cache the
> Docker images instead of pulling them every time ;-)
>

Heh, I'll have a look :-)

Cheers,

-- 
Gerardo Curiel // https://gerar.do


Jira cleanup and triage

2018-08-22 Thread Gerardo Curiel
Hi folks,

Is there a recommended way for contributors to help close/triage Jira
issues?

I've been looking at issues to work on next, and I've found a few
categories of issues:

- Issues in need of triage: these might need to be checked against the
latest version and then closed if they can't be reproduced
- Duplicated issues
- Issues that are still open even though their PRs have been merged (one
example: https://issues.apache.org/jira/browse/AIRFLOW-2856)

How can we help point these out to committers? Cleaning up Jira
should help newcomers to easily visualise the work being done and pick what
to work on.

Also, has something like https://github.com/probot/stale (or whatever the
equivalent in Jira is) been considered for closing issues and PRs
automatically?

Cheers,

-- 
Gerardo Curiel // https://gerar.do


Re: Master is broken

2018-06-12 Thread Gerardo Curiel
Hi Kaxil,

On Tue, Jun 12, 2018 at 5:12 PM, Naik Kaxil  wrote:

>
> I have merged the PR that reverts the commit that broke the CI. Waiting
> for the CI result.
>
>
I see master is green again! Thanks :)

Cheers,

-- 
Gerardo Curiel // https://gerar.do


Re: Master is broken

2018-06-12 Thread Gerardo Curiel
Hi Fokko,

I've been keeping track of the airflow builds just recently. So far I've
seen TravisCI failing because of HTTP errors coming from Ubuntu repos, and
maybe a few timeouts from tar downloads. But the recent ones were
definitely related to bad merges. It seems to me that they can be
identified easily:

* If it happens in less than 10 minutes, flaky infrastructure
* If it happens after 10 minutes (unless it's an integration test), then
it's an actual bad merge
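
To make the rule of thumb above concrete, here is a minimal sketch in Python. The function name, the return labels and the handling of a failure at exactly 10 minutes are my own assumptions for illustration; nothing like this exists in the Airflow codebase or in TravisCI.

```python
# Cut-off taken from the rule of thumb above; treating exactly 10 minutes
# as "after 10 minutes" is an assumption.
FLAKY_THRESHOLD_MINUTES = 10


def classify_failure(duration_minutes, is_integration_test=False):
    """Guess the likely cause of a failed CI job from how long it ran."""
    if duration_minutes < FLAKY_THRESHOLD_MINUTES:
        # Early failures tend to happen while packages are being downloaded.
        return "flaky infrastructure"
    if is_integration_test:
        # Integration tests are the stated exception, so stay undecided.
        return "needs a closer look"
    return "bad merge"


if __name__ == "__main__":
    print(classify_failure(4))
    print(classify_failure(25))
    print(classify_failure(25, is_integration_test=True))
```

This is only a first guess to speed up triage; the build log still has the final word.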

Checking the build shouldn't take more than 2 minutes.

In https://github.com/apache/incubator-airflow/pull/3393 I'm trying to
address the flaky infrastructure by (among other things) reducing the
number of downloads the build makes. Several small downloads have a higher
probability of failure than a few big downloads. We would still need to fix
these flaky tests though.
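
A rough way to see why the number of downloads matters, as a sketch in Python. It assumes every download fails independently with the same probability regardless of its size, which is a simplification, and the numbers in the example are made up rather than measured on our builds.

```python
def prob_any_failure(n_downloads, p_fail):
    """Probability that at least one of n independent downloads fails."""
    # All downloads succeed with probability (1 - p_fail) ** n_downloads,
    # so the build breaks with the complement of that.
    return 1 - (1 - p_fail) ** n_downloads


if __name__ == "__main__":
    # Hypothetical figures: a 1% failure chance per download.
    many_small = prob_any_failure(40, 0.01)
    few_big = prob_any_failure(4, 0.01)
    print("40 downloads: {:.3f}".format(many_small))
    print(" 4 downloads: {:.3f}".format(few_big))
```

With these made-up figures, the build with 40 downloads fails roughly a third of the time, while the one with 4 downloads fails about 4% of the time.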

Cheers,

On Tue, Jun 12, 2018 at 4:54 PM, Driesprong, Fokko 
wrote:

> Hi Gerardo,
>
> I totally agree that when master turns red, we should stop merging and fix
> the build or revert the commit that broke the build.
>
> I think one of the underlying problems is having flaky tests, I tried to
> fix a few of those, but they are quite persistent. Sometimes it is hard to
> identify if it is just a flaky test or if you really broke something.
>
> Cheers, Fokko
>
> On Tue, Jun 12, 2018 at 07:37, Daniel Imberman <daniel.imber...@gmail.com>
> wrote:
>
> > +1 for merge blocking hooks. It would be great to have safety knowing
> > that any commit I revert to will still pass tests (for rebase testing,
> > etc)
> >
> > On Mon, Jun 11, 2018 at 10:23 PM Alex Tronchin-James wrote:
> >
> > > Could we adopt some sort of merge-blocking hook that prohibits merge of
> > > PRs failing unit tests? My team has such an approach at work and it
> > > reduces the volume of breakage quite a bit. The only time we experience
> > > problems now is where our unit test coverage is poor, but we improve the
> > > coverage every time a breaking PR shows up. If our goal is to harden
> > > airflow for ongoing functionality with reduced breakage, this would be
> > > one good way to get there.
> > >
> > > On Mon, Jun 11, 2018 at 7:55 PM Gerardo Curiel wrote:
> > >
> > > > Hi folks,
> > > >
> > > > The master branch has been broken for a couple of days already. But
> > > > that hasn't stopped the project from merging pull requests. As time
> > > > passes by, it gets hard to identify what change caused the breakage.
> > > > And of course, fixing it might cause conflicts with the changes
> > > > introduced by the merged PRs.
> > > >
> > > > It seems to me that there should be some sort of process or
> > > > guidelines in place to avoid this sort of situation. "Don't merge if
> > > > master is red" seems like a reasonable option.
> > > >
> > > > If this guideline sounds obvious enough that it shouldn't be spelled
> > > > out in the committers' documentation, then that's fine, but it hasn't
> > > > been followed recently.
> > > >
> > > > Cheers,
> > > >
> > > > --
> > > > Gerardo Curiel // https://gerar.do
> > > >
> > >
> >
>


-- 
Gerardo Curiel // https://gerar.do


Master is broken

2018-06-11 Thread Gerardo Curiel
Hi folks,

The master branch has been broken for a couple of days already. But that
hasn't stopped the project from merging pull requests. As time passes by,
it gets hard to identify what change caused the breakage. And of course,
fixing it might cause conflicts with the changes introduced by the merged
PRs.

It seems to me that there should be some sort of process or guidelines in
place to avoid this sort of situation. "Don't merge if master is red"
seems like a reasonable option.

If this guideline sounds obvious enough that it shouldn't be spelled out in
the committers' documentation, then that's fine, but it hasn't been followed
recently.

Cheers,

-- 
Gerardo Curiel // https://gerar.do


Dockerised CI and testing environment

2018-05-21 Thread Gerardo Curiel
Hello folks,

I just submitted a PR for using Docker as part of Airflow's build pipeline:
https://github.com/apache/incubator-airflow/pull/3393

Currently, running unit tests is a difficult process. Airflow tests depend
on many external services and other custom setup, which makes it hard for
contributors to work on this codebase. CI builds have also been unreliable,
and it is hard to reproduce the causes. Having contributors trying to
emulate the build environment every time makes it easier to get to an "it
works on my machine" sort of situation. The proposed docker setup aims to
simplify this.

You can check the PR description, which goes into more detail. Now, I
bring this to the list because I have a few requests:

- Could you try the branch out on your local machines? The instructions are
provided in the PR (pending: actual docs). I would love to get feedback
about it.
- Is anyone familiar with the impersonation tests? They have proven to be
hard to fix with my limited knowledge of the codebase.

It's WIP, but I wanted to submit what I've got so far and check if you guys
think I'm going in the right direction.

Cheers,

-- 
Gerardo Curiel // https://gerar.do


Re: Lineage

2018-05-07 Thread Gerardo Curiel
On Sun, May 6, 2018 at 7:05 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:

>
> Apache Atlas is agnostic and can receive lineage info by rest API (used in
> my implementation) and Kafka topic. It does also come with a lot of
> connectors out of the box that tie into the hadoop ecosystem and make your
> life easier there. The Airflow Atlas connector supplies Atlas with
> information that it doesn't know about yet, closing the loop further.
>
>
Thanks for the explanation. Good to hear it has an API. I thought the
"bridges" were the main point of integration.


Cheers,

-- 
Gerardo Curiel // https://gerar.do


Re: Lineage

2018-05-06 Thread Gerardo Curiel
Hi Bolke,

Data lineage support sounds very interesting.

I'm not very familiar with Atlas, but at first sight it seems like a tool
specific to the Hadoop ecosystem. What would this look like if the files
(inlets or outlets) were stored on S3?

An example of a service that manages a similar use case is AWS Glue[1],
which creates a hive metastore based on the schema and other metadata it
can get from different sources (amongst them, s3 files).


On Sun, May 6, 2018 at 7:49 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Hi All,
>
> I have made a first implementation that allows tracking of lineage in
> Airflow and integration with Apache Atlas. It was inspired by Jeremiah’s
> work in the past on Data Flow pipelines, but I think I kept it a little bit
> simpler.
>
> Operators now have two new parameters called “inlets” and “outlets”. These
> can be filled with objects derived from “DataSet”, like “File” and
> “HadoopFile”. Parameters are jinja2 templated, which
> means they receive the context of the task when it is running and get
> rendered. So you can get definitions like this:
>
> f_final = File(name="/tmp/final")
> run_this_last = DummyOperator(task_id='run_this_last', dag=dag,
>     inlets={"auto": True},
>     outlets={"datasets": [f_final,]})
>
> f_in = File(name="/tmp/whole_directory/")
> outlets = []
> for file in FILE_CATEGORIES:
>     # The doubled braces survive .format() as the jinja2 template
>     # {{ execution_date }}, which is rendered when the task runs.
>     f_out = File(name="/tmp/{}/{{{{ execution_date }}}}".format(file))
>     outlets.append(f_out)
> run_this = BashOperator(
>     task_id='run_after_loop', bash_command='echo 1', dag=dag,
>     inlets={"auto": False, "task_ids": [], "datasets": [f_in,]},
>     outlets={"datasets": outlets}
> )
> run_this.set_downstream(run_this_last)
>
> So I am trying to keep the boilerplate work down for developers. Operators
> can also extend inlets and outlets automatically. This will probably be a
> bit harder for the BashOperator without some special magic, but an update
> to the DruidOperator can be relatively straightforward.
>
> In the future Operators can take advantage of the inlet/outlet definitions
> as they are also made available as part of the context for templating (as
> “inlets” and “outlets”).
>
> I’m looking forward to your comments!
>
> https://github.com/apache/incubator-airflow/pull/3321
>
> Bolke.



[1] https://aws.amazon.com/glue/

Cheers,

-- 
Gerardo Curiel // https://gerar.do


Bug fixes release

2018-04-23 Thread Gerardo Curiel
Hello folks,

I'll start by saying thank you all for writing and maintaining Airflow. It
has helped us streamline most of our data engineering tasks at work and it
has been a joy to use.

So, I was wondering why the following issues are marked for version 2.0.0:

- psycopg2 version 2.7.4 was released and the wheel package was renamed to
psycopg2-binary (https://issues.apache.org/jira/browse/AIRFLOW-2125)
- Add ability to remove DAG and all dependencies (
https://issues.apache.org/jira/browse/AIRFLOW-1002)

The first is small enough that it could be released as part of the planned
1.10 release. The second one seems self-contained enough to be part of the
release as well, but I understand there might be more important features to
release.

I want to get a sense of how things are prioritised. I couldn't find
anything specific in the Contributors' Guide or the Committers' Guide.

Cheers,

Gerardo.