Re: Is `airflow backfill` disfunctional?

2018-06-06 Thread Jeremiah Lowin
Similarly, it's been a while since I touched the backfill code -- my last
commit was more than 2 years ago apparently!! -- so it's certainly
progressed considerably from my early changes. However, to echo Bolke, the
biggest issue with Backfill was that it behaved differently than the
scheduler, which was especially problematic because it was actually used by
the scheduler to run SubDagOperators. And I agree, the behavior you're
seeing probably was a consequence of introducing DagRuns as a unifying
concept.

So if an "automatically clear failed operators" option is desirable (as in
the PR), then I would suggest making it an option that is False by default.
That way SubDagOperators will continue to operate properly in the
scheduler, but users who want to take advantage of that functionality can
simply enable it for manual backfills.
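
To make that concrete, here is a minimal, hypothetical sketch of what an
opt-in, default-False option could look like (the `rerun_failed_tasks` flag,
function name, and clearing call are illustrative, not the actual backfill
implementation):

```python
# Hypothetical sketch only -- not the real backfill signature. The point is
# that the behavior is opt-in (False by default), so scheduler-driven
# SubDagOperator backfills keep the current semantics while manual backfills
# can clear failed tasks before re-running them.
def run_backfill(dag, start_date, end_date, rerun_failed_tasks=False):
    if rerun_failed_tasks:
        # Clear failed task instances in the window so the backfill picks
        # them up again instead of raising an exception.
        dag.clear(start_date=start_date, end_date=end_date, only_failed=True)
    dag.run(start_date=start_date, end_date=end_date)
```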




On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin 
wrote:

> Thanks for the input, this is helpful.
>
> To add to the list, there's some complexity around concurrency management
> and multiple executors:
> I just hit this thing where backfill doesn't check DAG-level concurrency,
> fires up 32 tasks, and `airflow run` double-checks the DAG-level concurrency
> limit and exits. Right after, backfill reschedules the tasks again, and so on,
> burning a bunch of CPU doing nothing. In this specific case it seems like
> `airflow run` should skip that specific check when in the context of a
> backfill.
>
> Max
>
> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin  wrote:
>
> > Thinking out loud here, because it is a while back that I did work on
> > backfills. There were some real issues with backfills:
> >
> > 1. Tasks were running in non-deterministic order, ending up in regular
> > deadlocks.
> > 2. Didn't create dag runs, making behavior inconsistent. Max dag runs
> > could not be enforced, the UI couldn't really display it, and there were
> > lots of minor other issues because of it.
> > 3. Behavior was different from the scheduler, while SubDagOperators
> > particularly make use of backfills at the moment.
> >
> > I think with 3 the behavior you are observing crept in. And given 3 I
> > would argue a consistent behavior between the scheduler and the backfill
> > mechanism is still paramount. Thus we should explicitly clear tasks from
> > failed if we want to rerun them. This at least until we move the
> > subdagoperator out of backfill and into the scheduler (which is actually
> > not too hard). Also we need those command line options anyway.
> >
> > Bolke
> >
> > Sent from my iPad
> >
> > > On 6 Jun 2018 at 01:27, Scott Halgrim  .INVALID>
> > wrote the following:
> > >
> > > The request was for opposition, but I'd like to weigh in on the side of
> > "it's a better behavior [to have failed tasks re-run when cleared in a
> > backfill]".
> > >> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> > maximebeauche...@gmail.com>, wrote:
> > >> @Jeremiah Lowin  & @Bolke de Bruin <
> bdbr...@gmail.com>
> > I
> > >> think you may have some context on why this may have changed at some
> > point.
> > >> I'm assuming that when DagRun handling was added to the backfill
> logic,
> > the
> > >> behavior just happened to change to what it is now.
> > >>
> > >> Any opposition in moving back towards re-running failed tasks when
> > starting
> > >> a backfill? I think it's a better behavior, though it's a change in
> > >> behavior that we should mention in UPDATE.md.
> > >>
> > >> One of our goals is to make sure that a failed or killed backfill can
> be
> > >> restarted and just seamlessly pick up where it left off.
> > >>
> > >> Max
> > >>
> > >>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng  wrote:
> > >>>
> > >>> After discussing with Max, we think it would be great if `airflow
> > backfill`
> > >>> could be able to auto pick up and rerun those failed tasks.
> Currently,
> > it
> > >>> will throw exceptions(
> > >>>
> > >>>
> >
> https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489
> > >>> )
> > >>> without rerunning the failed tasks.
> > >>>
> > >>> But since it broke some of the previous assumptions for backfill, we
> > would
> > >>> like to get some feedback and see if anyone has any concerns(pr could
> > be
> > >>> found at https://github.com/apache/incubator-a

Re: Moving on to Lyft

2017-10-26 Thread Jeremiah Lowin
Congrats on your new role, Maxime! Wishing you the best!

On Thu, Oct 26, 2017 at 4:34 AM Diogo Franco 
wrote:

> Best of luck Maxime!
>
> On 25 October 2017 at 22:37, Sergei Iakhnin  wrote:
>
> > Congrats!
> >
> > On Wed, 25 Oct 2017, 23:24 Maxime Beauchemin, <
> maximebeauche...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > Quick note to say that I recently left Airbnb and will be joining Lyft
> on
> > > the 30th of this month.
> > >
> > > First I want to thank the amazing folks at Airbnb for enabling and
> > > sponsoring the open source projects I’ve been working on over the past
> 3
> > > years there. Airbnb has invested a lot in both Airflow and Superset and
> > > will keep doing so moving forward.
> > >
> > > I want to assert that this change should not affect my involvement in
> > both
> > > Airflow and Superset. I'm planning on staying active and contributing
> at a
> > > similar pace as before in both communities.
> > >
> > > We've come such a long way together and I'm looking forward to what is
> > > ahead!
> > >
> > > Max
> > >
> > --
> >
> > Sergei
> >
>


Re: Airflow + Kubernetes discussion

2017-07-20 Thread Jeremiah Lowin
I'm interested as well.

On Thu, Jul 20, 2017 at 1:51 PM Marc Bollinger  wrote:

> +1 We're in the middle of moving some services to k8s, and have had our
> eye on Airflow.
>
> > On Jul 20, 2017, at 10:37 AM, Sumit Maheshwari 
> wrote:
> >
> > I would join as well for sure.
> >
> > Thanks,
> > Sumit Maheshwari
> > cell. 9632202950
> >
> >
> > On Thu, Jul 20, 2017 at 11:00 PM, Chris Riccomini  >
> > wrote:
> >
> >> I would definitely be up to joining. We're interested in the K8s work
> >> that's going on. That time works for me.
> >>
> >> On Thu, Jul 20, 2017 at 9:54 AM, Daniel Imberman <
> >> daniel.imber...@gmail.com>
> >> wrote:
> >>
> >>> Hello everyone,
> >>>
> >>> Recently there's been a fair amount of discussion regarding the
> >> integration
> >>> of airflow with kubernetes. If there is interest I would love to host
> an
> >>> e-meeting to discuss this integration. I can go over the architecture
> as
> >> it
> >>> stands right now and would love feedback on
> >> improvements/features/design. I
> >>> could also attempt to get one or two members of google's kubernetes
> team
> >> to
> >>> join to discuss best practices.
> >>>
> >>> I'm currently thinking that next Thursday at 11AM PST over zoom.us,
> >> though
> >>> if there's strong opinions otherwise I'd be glad to propose other
> times.
> >>>
> >>> Cheers!
> >>>
> >>> Daniel
> >>>
> >>
>


Re: Airflow kubernetes executor

2017-07-13 Thread Jeremiah Lowin
p.s. it looks like git-sync has received an "official" release since the
last time I looked at it: https://github.com/kubernetes/git-sync

On Thu, Jul 13, 2017 at 8:18 AM Jeremiah Lowin  wrote:

> Hi Gerard (and anyone else for whom this might be helpful),
>
> We've run Airflow on GCP for a few years. The structure has changed over
> time but at the moment we use the following basic outline:
>
> 1. Build a container that includes all Airflow and DAG dependencies and
> push it to Google container registry. If you need to add/update
> dependencies or update airflow.cfg, simply push a new image
> 2. All DAGs are pushed to a git repo
> 3. Host the AirflowDB in Google Cloud SQL
> 4. Create a Kubernetes deployment that runs the following containers:
> -- Airflow scheduler (using the dependencies image)
> -- Airflow webserver (using the dependencies image)
> -- Airflow maintenance (using the dependencies image) - this container
> does nothing (sleep infinity) but since it shares the same setup as the
> scheduler/webserver, it's an easy place to `exec` into the cluster to
> investigate any issues that might be crashing the main containers. We limit
> its CPU to minimize impact on cluster resources. Hacky but effective.
> -- cloud sql proxy (https://cloud.google.com/sql/docs/postgres/sql-proxy)
> - to connect to the Airflow DB
> -- git-sync (https://github.com/jlowin/git-sync)
>
> The last container (git-sync) is a small library I wrote to solve the
> issue of syncing DAGs. It's not perfect and ***I am NOT offering any
> support for it*** but it gets the job done. It's meant to be a sidecar
> container and does one thing: constantly fetch a git repo to a local
> folder. In your deployment, create an EmptyDir volume and mount it in all
> containers (except cloud sql). Git-sync should use that volume as its
> target, and scheduler/webserver should use the volume as the DAGs folder.
> That way, every 30 seconds, git-sync will fetch the git repo into that
> volume, and the Airflow containers will immediately see the latest files
> appear.
>
> 5. Create a Kubernetes service to expose the webserver UI
>
> Our actual implementation is considerably more complicated than this since
> we have extensive custom modules that are loaded via git-sync rather than
> being baked into the image, as well as a few other GCP service
> integrations, but this overview should point in the right direction.
> Getting it running the first time requires a little elbow grease but once
> built, it's easy to automate the process.
>
> Best,
> Jeremiah
>
>
>
> On Thu, Jul 13, 2017 at 3:50 AM Gerard Toonstra 
> wrote:
>
>> It would be really good if you'd share experiences on how to run this on
>> kubernetes and ECS.
>> I'm not aware of a good guide on how to run this on either for example,
>> but
>> it's a very useful and
>> quick setup to start with, especially combining that with deployment
>> manager and cloudformation (probably).
>>
>> I'm talking to someone else who's looking at running on kubernetes and
>> potentially opensourcing a generic
>> template for kubernetes deployments.
>>
>>
>> Would it be possible to share your experiences?  What tech are you using
>> for specific issues?
>>
>> - how do you deploy and sync dags?  Are you using EFS?
>> - how do you build the container with airflow + executables?
>> - where do you send log files or log lines to?
>> - High Availability and how?
>>
>> Really looking forward to how that's done, so we can put this on the wiki.
>>
>> Especially since GCP is now also starting to embrace airflow, it'd be good
>> to have a better understanding
>> how easy and quick it can be to deploy airflow on gcp:
>>
>>
>> https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow
>>
>> Rgds,
>>
>> Gerard
>>
>>
>> On Wed, Jul 12, 2017 at 8:55 PM, Arthur Purvis 
>> wrote:
>>
>> > for what it's worth we've been running airflow on ECS for a few years
>> > already.
>> >
>> > On Wed, Jul 12, 2017 at 12:21 PM, Grant Nicholas <
>> > grantnicholas2...@u.northwestern.edu> wrote:
>> >
>> > > Is having a static set of workers necessary? Launching a job on
>> > Kubernetes
>> > > from a cached docker image takes a few seconds max. I think this is an
>> > > acceptable delay for a batch processing system like airflow.
>> > >
>> > > Additionally, if you dynamically launch workers you can start
>> dynamicall

Re: Airflow kubernetes executor

2017-07-13 Thread Jeremiah Lowin
Hi Gerard (and anyone else for whom this might be helpful),

We've run Airflow on GCP for a few years. The structure has changed over
time but at the moment we use the following basic outline:

1. Build a container that includes all Airflow and DAG dependencies and
push it to Google container registry. If you need to add/update
dependencies or update airflow.cfg, simply push a new image
2. All DAGs are pushed to a git repo
3. Host the AirflowDB in Google Cloud SQL
4. Create a Kubernetes deployment that runs the following containers:
-- Airflow scheduler (using the dependencies image)
-- Airflow webserver (using the dependencies image)
-- Airflow maintenance (using the dependencies image) - this container
does nothing (sleep infinity) but since it shares the same setup as the
scheduler/webserver, it's an easy place to `exec` into the cluster to
investigate any issues that might be crashing the main containers. We limit
its CPU to minimize impact on cluster resources. Hacky but effective.
-- cloud sql proxy (https://cloud.google.com/sql/docs/postgres/sql-proxy) -
to connect to the Airflow DB
-- git-sync (https://github.com/jlowin/git-sync)

The last container (git-sync) is a small library I wrote to solve the issue
of syncing DAGs. It's not perfect and ***I am NOT offering any support for
it*** but it gets the job done. It's meant to be a sidecar container and
does one thing: constantly fetch a git repo to a local folder. In your
deployment, create an EmptyDir volume and mount it in all containers
(except cloud sql). Git-sync should use that volume as its target, and
scheduler/webserver should use the volume as the DAGs folder. That way,
every 30 seconds, git-sync will fetch the git repo into that volume, and
the Airflow containers will immediately see the latest files appear.
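
For reference, a rough Python sketch of what that sidecar loop amounts to
(an illustration of the idea, not the actual jlowin/git-sync code; the repo
URL and mount path below are placeholders):

```python
# Illustrative sketch of the git-sync sidecar idea: clone once, then keep
# fetching and hard-resetting the shared volume so the scheduler/webserver
# containers mounting the same EmptyDir always see fresh DAG files.
import subprocess
import time

REPO = "https://example.com/your-dags.git"  # placeholder DAG repo
TARGET = "/dags"                            # the shared EmptyDir mount
INTERVAL = 30                               # seconds between fetches

def sync_forever():
    subprocess.call(["git", "clone", REPO, TARGET])
    while True:
        subprocess.check_call(["git", "-C", TARGET, "fetch", "origin"])
        subprocess.check_call(
            ["git", "-C", TARGET, "reset", "--hard", "origin/HEAD"])
        time.sleep(INTERVAL)

if __name__ == "__main__":
    sync_forever()
```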

5. Create a Kubernetes service to expose the webserver UI

Our actual implementation is considerably more complicated than this since
we have extensive custom modules that are loaded via git-sync rather than
being baked into the image, as well as a few other GCP service
integrations, but this overview should point in the right direction.
Getting it running the first time requires a little elbow grease but once
built, it's easy to automate the process.

Best,
Jeremiah



On Thu, Jul 13, 2017 at 3:50 AM Gerard Toonstra  wrote:

> It would be really good if you'd share experiences on how to run this on
> kubernetes and ECS.
> I'm not aware of a good guide on how to run this on either for example, but
> it's a very useful and
> quick setup to start with, especially combining that with deployment
> manager and cloudformation (probably).
>
> I'm talking to someone else who's looking at running on kubernetes and
> potentially opensourcing a generic
> template for kubernetes deployments.
>
>
> Would it be possible to share your experiences?  What tech are you using
> for specific issues?
>
> - how do you deploy and sync dags?  Are you using EFS?
> - how do you build the container with airflow + executables?
> - where do you send log files or log lines to?
> - High Availability and how?
>
> Really looking forward to how that's done, so we can put this on the wiki.
>
> Especially since GCP is now also starting to embrace airflow, it'd be good
> to have a better understanding
> how easy and quick it can be to deploy airflow on gcp:
>
>
> https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow
>
> Rgds,
>
> Gerard
>
>
> On Wed, Jul 12, 2017 at 8:55 PM, Arthur Purvis 
> wrote:
>
> > for what it's worth we've been running airflow on ECS for a few years
> > already.
> >
> > On Wed, Jul 12, 2017 at 12:21 PM, Grant Nicholas <
> > grantnicholas2...@u.northwestern.edu> wrote:
> >
> > > Is having a static set of workers necessary? Launching a job on
> > Kubernetes
> > > from a cached docker image takes a few seconds max. I think this is an
> > > acceptable delay for a batch processing system like airflow.
> > >
> > > Additionally, if you dynamically launch workers you can start
> dynamically
> > > launching *any type* of worker and you don't have to statically
> allocate
> > > pools of worker types. IE) A single DAG could use a scala docker image
> to
> > > do spark calculations, a C++ docker image to use some low level
> numerical
> > > library,  and a python docker image by default to do any generic
> airflow
> > > stuff. Additionally, you can size workers according to their usage.
> Maybe
> > > the spark driver program only needs a few GBs of RAM but the C++
> > numerical
> > > library needs many hundreds.
> > >
> > > I agree there is a bit of extra book-keeping that needs to be done, but
> > > the tradeoff is an important one to explicitly make.
> > >
> >
>


Re: [VOTE] Release Airflow 1.8.1 RC2

2017-05-03 Thread Jeremiah Lowin
+1 (binding)

On Mon, May 1, 2017 at 1:58 PM Chris Riccomini 
wrote:

> Dear All,
>
> _WARN: The package version for this RC is 1.8.1 (does not include RC2 in
> version number). As such, any future 1.8.1 installations will have to be
> force installed. PIP will not be able to distinguish between RCs and final
> versions. Again, you'll have to force install the package. This can be done
> by adding `--force-reinstall` to your `pip install` commands._
>
> I've made Airflow 1.8.1 RC2 available at:
> https://dist.apache.org/repos/dist/dev/incubator/airflow, public keys are
> available at https://dist.apache.org/repos/dist/release/incubator/airflow.
>
> New issues fixed in 1.8.1 RC2:
>
> [AIRFLOW-1142] SubDAG Tasks Not Executed Even Though All Dependen
> [AIRFLOW-1004] `airflow webserver -D` runs in foreground
> [AIRFLOW-492] Insert into dag_stats table results into failed ta
>
> Issues fixed in 1.8.1 RC0/RC1, and included in RC2:
>
> [AIRFLOW-1138] Add licenses to files in scripts directory
> [AIRFLOW-1127] Move license notices to LICENSE instead of NOTICE
> [AIRFLOW-1124] Do not set all task instances to scheduled on back
> [AIRFLOW-1120] Update version view to include Apache prefix
> [AIRFLOW-1062] DagRun#find returns wrong result if external_trigg
> [AIRFLOW-1054] Fix broken import on test_dag
> [AIRFLOW-1050] Retries ignored - regression
> [AIRFLOW-1033] TypeError: can't compare datetime.datetime to None
> [AIRFLOW-1017] get_task_instance should return None instead of th
> [AIRFLOW-1011] Fix bug in BackfillJob._execute() for SubDAGs
> [AIRFLOW-1001] Landing Time shows "unsupported operand type(s) fo
> [AIRFLOW-1000] Rebrand to Apache Airflow instead of Airflow
> [AIRFLOW-989] Clear Task Regression
> [AIRFLOW-974] airflow.util.file mkdir has a race condition
> [AIRFLOW-906] Update Code icon from lightning bolt to file
> [AIRFLOW-858] Configurable database name for DB operators
> [AIRFLOW-853] ssh_execute_operator.py stdout decode default to A
> [AIRFLOW-832] Fix debug server
> [AIRFLOW-817] Trigger dag fails when using CLI + API
> [AIRFLOW-816] Make sure to pull nvd3 from local resources
> [AIRFLOW-815] Add previous/next execution dates to available def
> [AIRFLOW-813] Fix unterminated unit tests in tests.job (tests/jo
> [AIRFLOW-812] Scheduler job terminates when there is no dag file
> [AIRFLOW-806] UI should properly ignore DAG doc when it is None
> [AIRFLOW-794] Consistent access to DAGS_FOLDER and SQL_ALCHEMY_C
> [AIRFLOW-785] ImportError if cgroupspy is not installed
> [AIRFLOW-784] Cannot install with funcsigs > 1.0.0
> [AIRFLOW-780] The UI no longer shows broken DAGs
> [AIRFLOW-777] dag_is_running is initlialized to True instead of
> [AIRFLOW-719] Skipped operations make DAG finish prematurely
> [AIRFLOW-694] Empty env vars do not overwrite non-empty config v
> [AIRFLOW-139] Executing VACUUM with PostgresOperator
> [AIRFLOW-111] DAG concurrency is not honored
> [AIRFLOW-88] Improve clarity Travis CI reports
>
> I would like to raise a VOTE for releasing 1.8.1 based on release candidate
> 2.
>
> Please respond to this email by:
>
> +1,0,-1 with *binding* if you are a PMC member or *non-binding* if you are
> not.
>
> Vote will run for 72 hours (ends this Thursday).
>
> Thanks!
> Chris
>
> My VOTE: +1 (binding)
>


Re: Force DAGs run up to the last task

2017-04-28 Thread Jeremiah Lowin
My apologies, I didn't read closely enough. Concurrency does indeed refer
to tasks, not dagruns. As Bolke points out, I believe the only mechanism to
restrict dagruns is depends_on_past, which will limit both concurrency and
ordering.
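
A small, generic sketch of the two knobs discussed in this thread (an
illustrative example DAG, not from the original messages): `concurrency` caps
task instances within the DAG, while `depends_on_past` on the tasks is what
keeps a new run from racing ahead of the previous one.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="serialized_runs",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
    concurrency=1,  # at most one task instance of this DAG at a time
    default_args={"depends_on_past": True},  # each run waits on the prior run
)

first = BashOperator(task_id="first", bash_command="echo start", dag=dag)
last = BashOperator(task_id="last", bash_command="echo done", dag=dag)
first.set_downstream(last)
```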

On Fri, Apr 28, 2017 at 7:11 AM Wim Berchmans 
wrote:

> Hi Jeremiah,
>
> We've been struggling with this problem as well.
> If you set the concurrency parameter on a Dag to 1, the docstring states
> this will only relate to task instances, and not to dag_run instances,
> which I think David is referring to.
>
> :param concurrency: the number of task instances allowed to run
> concurrently
> :type concurrency: int
>
>
>
>
> -Original Message-
> From: Jeremiah Lowin [mailto:jlo...@apache.org]
> Sent: vrijdag 28 april 2017 12:51
> To: dev@airflow.incubator.apache.org
> Subject: Re: Force DAGs run up to the last task
>
> Hi David -- you'll want to set the concurrency parameter of your DAG to 1.
>
> J
>
> On Fri, Apr 28, 2017 at 4:12 AM David Batista  wrote:
>
> > Hello everyone,
> >
> > is there a simple way to tell Airflow to only start running another
> > DAG when all the tasks for the current running DAG are completed?
> > i.e., when a DAG is triggered Airflow first runs all the tasks for
> > that DAG, and only then picks up another DAG to run.
> >
> >
> > --
> > *David Batista* *Data Engineer**, HelloFresh Global* Saarbrücker Str.
> > 37a | 10405 Berlin d...@hellofresh.com 
> >
> >
>


Re: Force DAGs run up to the last task

2017-04-28 Thread Jeremiah Lowin
Hi David -- you'll want to set the concurrency parameter of your DAG to 1.

J

On Fri, Apr 28, 2017 at 4:12 AM David Batista  wrote:

> Hello everyone,
>
> is there a simple way to tell Airflow to only start running another DAG
> when all the tasks for the current running DAG are completed? i.e., when a
> DAG is triggered Airflow first runs all the tasks for that DAG, and only
> then picks up another DAG to run.
>
>
> --
> *David Batista* *Data Engineer**, HelloFresh Global*
> Saarbrücker Str. 37a | 10405 Berlin
> d...@hellofresh.com 
>
>


Re: variable scope with dynamic dags

2017-03-22 Thread Jeremiah Lowin
In vanilla Python, your DAGs will all reference the same object, so when
your DAG file is parsed and 200 DAGs are created, there will still only be
1 60MB dict object created (I say vanilla because there are obviously ways
to create copies of the object).

HOWEVER, you should assume that each Airflow TASK is being run in a
different process, and each process is going to load your DAG file when it
runs. If resource use is a concern, I suggest you look at out-of-core or
persistent storage for the object so you don't need to load the whole thing
every time.
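
A hedged sketch of what that can look like in the DAG file (illustrative
only; the file path and DAG settings are placeholders): the dict is loaded at
most once per parsing process, and every generated DAG references the same
object.

```python
import json
from datetime import datetime

from airflow import DAG

_BIG_DICT = None  # module-level cache: one object per parsing process

def get_big_dict():
    # Load the large structure lazily, once per process, instead of
    # rebuilding it for each generated DAG.
    global _BIG_DICT
    if _BIG_DICT is None:
        with open("/path/to/big_mapping.json") as f:  # placeholder path
            _BIG_DICT = json.load(f)
    return _BIG_DICT

for i in range(200):
    dag = DAG(
        dag_id="foo_{}".format(i),
        start_date=datetime(2017, 1, 1),
        schedule_interval="@daily",
    )
    # Task callables can call get_big_dict(); within one process they all
    # share the same object.
    globals()[dag.dag_id] = dag
```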

On Wed, Mar 22, 2017 at 11:20 AM Boris Tyukin  wrote:

> hi Jeremiah, thanks for the explanation!
>
> i am very new to Python so was surprised that it works and my external
> dictionary object was still accessible to all dags generated. I think it
> makes sense but I would like to confirm one thing and I do not know how to
> test it myself.
>
> do you think that large dictionary object will still be loaded to memory
> only once even if I generate 200 dags that will be accessing it? so
> basically they will just use a reference to it or they would create a copy
> of the same 60Mb structure.
>
> I hope my question makes sense :)
>
> On Wed, Mar 22, 2017 at 10:54 AM, Jeremiah Lowin 
> wrote:
>
> > At the risk of oversimplifying things, your DAG definition file is loaded
> > *every* time a DAG (or any task in that DAG) is run. Think of it as a
> > literal Python import of your dag-defining module: any variables are
> loaded
> > along with the DAGs, which are then executed. That's why your dict is
> > always available. This will work with Celery since it follows the same
> > approach, parsing your DAG file to run each task.
> >
> > (By the way, this is why it's critical that all parts of your Airflow
> > infrastructure have access to the same DAGS_FOLDER)
> >
> > Now it is true that the DagBag loads DAG objects but think of it as more
> of
> > an "index" so that the scheduler/webserver know what DAGs are available.
> > When it's time to actually run one of those DAGs, the executor loads it
> > from the underlying source file.
> >
> > Jeremiah
> >
> > On Wed, Mar 22, 2017 at 8:45 AM Boris Tyukin 
> > wrote:
> >
> > > Hi,
> > >
> > > I have a weird question but it bugs my mind. I have something like below to
> > > generate dags dynamically, using Max's example code from FAQ.
> > >
> > > It works fine but I have one large dict (let's call it my_outer_dict)
> > that
> > > takes over 60Mb in memory and I need to access it from all generated
> > dags.
> > > Needless to say, i do not want to recreate that dict for every dag as I
> > > want to load it to memory only once.
> > >
> > > To my surprise, if I define that dict outside of my dag definition
> code, I
> > > can still access it.
> > >
> > > Can someone explain why and where is it stored? I thought only dag
> > > definitions are loaded to dagbag and not the variables outside it.
> > >
> > > Is it even a good practice and will it work still if I switch to celery
> > > executor?
> > >
> > >
> > > def get_dag(i):
> > > dag_id = 'foo_{}'.format(i)
> > > dag = DAG(dag_id)
> > > 
> > > print my_outer_dict
> > >
> > > my_outer_dict = {}
> > > for i in range(10):
> > > dag = get_dag(i)
> > > globals()[dag.dag_id] = dag
> > >
> >
>


Re: variable scope with dynamic dags

2017-03-22 Thread Jeremiah Lowin
At the risk of oversimplifying things, your DAG definition file is loaded
*every* time a DAG (or any task in that DAG) is run. Think of it as a
literal Python import of your dag-defining module: any variables are loaded
along with the DAGs, which are then executed. That's why your dict is
always available. This will work with Celery since it follows the same
approach, parsing your DAG file to run each task.

(By the way, this is why it's critical that all parts of your Airflow
infrastructure have access to the same DAGS_FOLDER)

Now it is true that the DagBag loads DAG objects but think of it as more of
an "index" so that the scheduler/webserver know what DAGs are available.
When it's time to actually run one of those DAGs, the executor loads it
from the underlying source file.

Jeremiah

On Wed, Mar 22, 2017 at 8:45 AM Boris Tyukin  wrote:

> Hi,
>
> I have a weird question but it bugs my mind. I have something like below to
> generate dags dynamically, using Max's example code from FAQ.
>
> It works fine but I have one large dict (let's call it my_outer_dict) that
> takes over 60Mb in memory and I need to access it from all generated dags.
> Needless to say, i do not want to recreate that dict for every dag as I
> want to load it to memory only once.
>
> To my surprise, if I define that dict outside of my dag definition code, I
> can still access it.
>
> Can someone explain why and where is it stored? I thought only dag
> definitions are loaded to dagbag and not the variables outside it.
>
> Is it even a good practice and will it work still if I switch to celery
> executor?
>
>
> def get_dag(i):
> dag_id = 'foo_{}'.format(i)
> dag = DAG(dag_id)
> 
> print my_outer_dict
>
> my_outer_dict = {}
> for i in range(10):
> dag = get_dag(i)
> globals()[dag.dag_id] = dag
>


Re: Airflow Newbie Question

2017-03-21 Thread Jeremiah Lowin
Indeed, Airflow has an implicit assumption that your DAGs are "slowly
changing" at worst. There's absolutely nothing wrong with using a factory
to generate multiple DAGs at once, but as Jorge says you will run into
trouble if you are generating different DAGs each time the server,
scheduler, executor, or worker parses the file (which is often).

In other words, if your factory function generates 10 DAGs for 10 lobs,
that's great -- just try to have it always be the same 10, or at least make
changes infrequently.
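
A short sketch of that factory pattern (a generic example with a made-up,
static lob list):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

LOBS = ["retail", "wholesale", "online"]  # assumption: a fixed, static list

def dag_factory(lob):
    dag = DAG(
        dag_id="process_{}".format(lob),
        start_date=datetime(2017, 1, 1),
        schedule_interval="@daily",
    )
    BashOperator(
        task_id="process",
        bash_command="echo processing {}".format(lob),
        dag=dag,
    )
    return dag

# One DAG per lob; keep the list stable so the same DAGs appear every parse.
for lob in LOBS:
    dag = dag_factory(lob)
    globals()[dag.dag_id] = dag
```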

On Tue, Mar 21, 2017 at 2:50 PM Jorge Alpedrinha Ramos <
jalpedrinhara...@gmail.com> wrote:

> Hi Andrew,
>
> I've run into the same problem and I found out that the best way to do this
> is to use a dag_factory that takes your parameter and builds a dag for each
> lob in the "list".
>
> This way you'll have one dag per lob, and they will run independently as I
> believe it's your intention.
>
> This solution will define which dags are active or inactive not at DAG
> runtime, but at scheduler boot time. The problem you may or may not have is
> that if the lob "list" is somehow dynamic, your dags will be
> active or inactive depending on the lob being in the "list" at the time the
> scheduler instantiates your dag, and not necessarily at the time your dag
> runs.
>
> I hope I've helped. I believe this is a point where best practices are
> still in some kind of grey area, and it's definitely something Airflow should
> have one clear way of doing.
>
>
>
>
> On Tue, Mar 21, 2017 at 4:06 PM Andrew Maguire 
> wrote:
>
> > Sorry if this is not the right place for this.
> >
> > I'm wondering if anyone might be able to help me with this question?
> >
> >
> >
> http://stackoverflow.com/questions/42930221/best-way-to-loop-through-parameters-in-airflow
> >
> > Feel free to ignore if this is not the right place to try find help.
> >
> > Loving airflow btw!
> >
> > Cheers,
> > Andy
> >
>


Re: Airflow Committers: Landscape checks doing more harm than good?

2017-03-16 Thread Jeremiah Lowin
FWIW I recently started using yapf (https://github.com/google/yapf) with a
slightly custom config to format all of my projects. Rather than alert to
discrete linting errors and concrete style rules (like PEP8) -- things I'm
sure we all do anyway -- it reformats all code in compliance with your
chosen style rules. It even reformats code that is already PEP8 compliant
to make it more "pythonic" (and still PEP8 compliant). Basically: if you
like (or create) a yapf style, it takes care of all the hard reformatting
work and produces pleasing, consistent results. /plug
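
For anyone curious, a tiny illustration of the idea (hedged: check the yapf
docs for the exact API and return value in your version):

```python
# Reformat a source string according to a chosen style; yapf rewrites the
# code rather than just reporting violations.
from yapf.yapflib.yapf_api import FormatCode

messy = "def f( a,b ):\n  return a+b\n"
result = FormatCode(messy, style_config="pep8")
# Recent yapf versions return a (formatted_source, changed) tuple.
print(result)
```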

On Thu, Mar 16, 2017 at 8:42 PM Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> Let's wire up a custom linter command that can be called locally and respect
> an agreed upon set of parameters (pylint + config file, based off of our
> current .landscape.yml ).
>
> flake8 is far from being as good as pylint and can't be customized much
> AFAICT, but variations on the command bellow can help you lint [only] your
> PR:
>
> `git diff HEAD^ | flake8 --diff`
>
> It's a good thing to integrate in your workflow until we get an equivalent
> pylint command/config
>
> On Thu, Mar 16, 2017 at 5:03 PM, Alex Guziel  .invalid
> > wrote:
>
> > +1 also
> >
> > We have code review already and the amount of false positives makes this
> > useless.
> >
> > On Thu, Mar 16, 2017 at 5:02 PM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> > > +1 as well
> > >
> > > I'm disappointed because the service is inches away from getting
> > everything
> > > right. As Bolke said, behind the cover it's little more than pylint,
> git
> > > hooks, and a somewhat-fancy ui.
> > >
> > > Operationally it's been getting in the way.
> > >
> > > There's a way to pipe the output of git diff into pylint and check
> > whether
> > > the touched lines need linting, in which case we should break the
> build.
> > > This could run in its own slot in the Travis build matrix.
> > >
> > > Max
> > >
> > > On Thu, Mar 16, 2017 at 4:51 PM, Bolke de Bruin 
> > wrote:
> > >
> > > > We can do it in Travis’ afaik. We should replace it.
> > > >
> > > > So +1.
> > > >
> > > > B.
> > > >
> > > > > On 16 Mar 2017, at 16:48, Jeremiah Lowin 
> wrote:
> > > > >
> > > > > This may be an unpopular opinion, but most Airflow PRs have a
> little
> > > red
> > > > > "x" next to them not because they have failing unit tests, but
> > because
> > > > the
> > > > > Landscape check has decided they introduce bad code.
> > > > >
> > > > > Unfortunately Landscape is often wrong -- here it is telling me my
> > > latest
> > > > > PR introduced no less than 30 errors... in files I didn't touch!
> > > > > https://github.com/apache/incubator-airflow/pull/2157 (however, it
> > > > gives me
> > > > > credit for fixing 23 errors in those same files, so I've got that
> > going
> > > > for
> > > > > me... which is nice.)
> > > > >
> > > > > The upshot is that Github's "health" indicator can be swayed by
> minor
> > > or
> > > > > erroneous issues, and therefore it serves little purpose other than
> > > > making
> > > > > it look like every PR is bad. This creates committer fatigue, since
> > > every
> > > > > PR needs to be parsed to see if it actually is OK or not.
> > > > >
> > > > > Don't get me wrong, I'm all for proper style and on occasion
> > Landscape
> > > > has
> > > > > pointed out problems that I've gone and fixed. But on the whole, I
> > > > believe
> > > > > that having it as part of our red / green PR evaluation -- equal to
> > and
> > > > > often superseding unit tests -- is harmful. I'd much rather be able
> > to
> > > > scan
> > > > > the PR list and know unequivocally that "green" indicates ready to
> > > merge.
> > > > >
> > > > > J
> > > >
> > > >
> > >
> >
>


Re: Airflow Committers: Landscape checks doing more harm than good?

2017-03-16 Thread Jeremiah Lowin
Sounds like a probable consensus. I think this is a good time for me to
admit I have no idea how to turn it off or move it to Travis.

Could anyone please volunteer to take that on?

On Thu, Mar 16, 2017 at 8:11 PM Alex Guziel 
wrote:

> +1 also
>
> We have code review already and the amount of false positives makes this
> useless.
>
> On Thu, Mar 16, 2017 at 5:02 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
> > +1 as well
> >
> > I'm disappointed because the service is inches away from getting
> everything
> > right. As Bolke said, behind the cover it's little more than pylint, git
> > hooks, and a somewhat-fancy ui.
> >
> > Operationally it's been getting in the way.
> >
> > There's a way to pipe the output of git diff into pylint and check
> whether
> > the touched lines need linting, in which case we should break the build.
> > This could run in its own slot in the Travis build matrix.
> >
> > Max
> >
> > On Thu, Mar 16, 2017 at 4:51 PM, Bolke de Bruin 
> wrote:
> >
> > > We can do it in Travis’ afaik. We should replace it.
> > >
> > > So +1.
> > >
> > > B.
> > >
> > > > On 16 Mar 2017, at 16:48, Jeremiah Lowin  wrote:
> > > >
> > > > This may be an unpopular opinion, but most Airflow PRs have a little
> > red
> > > > "x" next to them not because they have failing unit tests, but
> because
> > > the
> > > > Landscape check has decided they introduce bad code.
> > > >
> > > > Unfortunately Landscape is often wrong -- here it is telling me my
> > latest
> > > > PR introduced no less than 30 errors... in files I didn't touch!
> > > > https://github.com/apache/incubator-airflow/pull/2157 (however, it
> > > gives me
> > > > credit for fixing 23 errors in those same files, so I've got that
> going
> > > for
> > > > me... which is nice.)
> > > >
> > > > The upshot is that Github's "health" indicator can be swayed by minor
> > or
> > > > erroneous issues, and therefore it serves little purpose other than
> > > making
> > > > it look like every PR is bad. This creates committer fatigue, since
> > every
> > > > PR needs to be parsed to see if it actually is OK or not.
> > > >
> > > > Don't get me wrong, I'm all for proper style and on occasion
> Landscape
> > > has
> > > > pointed out problems that I've gone and fixed. But on the whole, I
> > > believe
> > > > that having it as part of our red / green PR evaluation -- equal to
> and
> > > > often superseding unit tests -- is harmful. I'd much rather be able
> to
> > > scan
> > > > the PR list and know unequivocally that "green" indicates ready to
> > merge.
> > > >
> > > > J
> > >
> > >
> >
>


Airflow Committers: Landscape checks doing more harm than good?

2017-03-16 Thread Jeremiah Lowin
This may be an unpopular opinion, but most Airflow PRs have a little red
"x" next to them not because they have failing unit tests, but because the
Landscape check has decided they introduce bad code.

Unfortunately Landscape is often wrong -- here it is telling me my latest
PR introduced no less than 30 errors... in files I didn't touch!
https://github.com/apache/incubator-airflow/pull/2157 (however, it gives me
credit for fixing 23 errors in those same files, so I've got that going for
me... which is nice.)

The upshot is that Github's "health" indicator can be swayed by minor or
erroneous issues, and therefore it serves little purpose other than making
it look like every PR is bad. This creates committer fatigue, since every
PR needs to be parsed to see if it actually is OK or not.

Don't get me wrong, I'm all for proper style and on occasion Landscape has
pointed out problems that I've gone and fixed. But on the whole, I believe
that having it as part of our red / green PR evaluation -- equal to and
often superseding unit tests -- is harmful. I'd much rather be able to scan
the PR list and know unequivocally that "green" indicates ready to merge.

J


Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-13 Thread Jeremiah Lowin
+1 (binding). Extremely impressed by the work and diligence all contributors
have put into getting these blockers fixed, Bolke in particular.

On Mon, Mar 13, 2017 at 1:07 AM Arthur Wiedmer  wrote:

> +1 (binding)
>
> Thanks again for steering us through Bolke.
>
> Best,
> Arthur
>
> On Sun, Mar 12, 2017 at 9:59 PM, Bolke de Bruin  wrote:
>
> > Dear All,
> >
> > Finally, I have been able to make the FIFTH RELEASE CANDIDATE of Airflow
> > 1.8.0 available at: https://dist.apache.org/repos/
> > dist/dev/incubator/airflow/  > repos/dist/dev/incubator/airflow/> , public keys are available at
> > https://dist.apache.org/repos/dist/release/incubator/airflow/ <
> > https://dist.apache.org/repos/dist/release/incubator/airflow/> . It is
> > tagged with a local version “apache.incubating” so it allows upgrading
> from
> > earlier releases.
> >
> > Issues fixed since rc4:
> >
> > [AIRFLOW-900] Double trigger should not kill original task instance
> > [AIRFLOW-900] Fixes bugs in LocalTaskJob for double run protection
> > [AIRFLOW-932] Do not mark tasks removed when backfilling
> > [AIRFLOW-961] run onkill when SIGTERMed
> > [AIRFLOW-910] Use parallel task execution for backfills
> > [AIRFLOW-967] Wrap strings in native for py2 ldap compatibility
> > [AIRFLOW-941] Use defined parameters for psycopg2
> > [AIRFLOW-719] Prevent DAGs from ending prematurely
> > [AIRFLOW-938] Use test for True in task_stats queries
> > [AIRFLOW-937] Improve performance of task_stats
> > [AIRFLOW-933] use ast.literal_eval rather eval because ast.literal_eval
> > does not execute input.
> > [AIRFLOW-919] Running tasks with no start date shouldn't break a DAGs UI
> > [AIRFLOW-897] Prevent dagruns from failing with unfinished tasks
> > [AIRFLOW-861] make pickle_info endpoint be login_required
> > [AIRFLOW-853] use utf8 encoding for stdout line decode
> > [AIRFLOW-856] Make sure execution date is set for local client
> > [AIRFLOW-830][AIRFLOW-829][AIRFLOW-88] Reduce Travis log verbosity
> > [AIRFLOW-794] Access DAGS_FOLDER and SQL_ALCHEMY_CONN exclusively from
> > settings
> > [AIRFLOW-694] Fix config behaviour for empty envvar
> > [AIRFLOW-365] Set dag.fileloc explicitly and use for Code view
> > [AIRFLOW-931] Do not set QUEUED in TaskInstances
> > [AIRFLOW-899] Tasks in SCHEDULED state should be white in the UI instead
> > of black
> > [AIRFLOW-895] Address Apache release incompliancies
> > [AIRFLOW-893][AIRFLOW-510] Fix crashing webservers when a dagrun has no
> > start date
> > [AIRFLOW-793] Enable compressed loading in S3ToHiveTransfer
> > [AIRFLOW-863] Example DAGs should have recent start dates
> > [AIRFLOW-869] Refactor mark success functionality
> > [AIRFLOW-856] Make sure execution date is set for local client
> > [AIRFLOW-814] Fix Presto*CheckOperator.__init__
> > [AIRFLOW-844] Fix cgroups directory creation
> >
> > No known issues anymore.
> >
> > I would also like to raise a VOTE for releasing 1.8.0 based on release
> > candidate 5, i.e. just renaming release candidate 5 to 1.8.0 release.
> >
> > Please respond to this email by:
> >
> > +1,0,-1 with *binding* if you are a PMC member or *non-binding* if you
> are
> > not.
> >
> > Thanks!
> > Bolke
> >
> > My VOTE: +1 (binding)
>


Re: Cutting down on testing time

2017-02-27 Thread Jeremiah Lowin
(I am far from an expert in nose but) I tried running nose in parallel
simply by passing the --processes flag (
http://nose.readthedocs.io/en/latest/doc_tests/test_multiprocess/multiprocess.html
).

The SQLite envs ran about 2-3 minutes quicker than normal. All other envs
deadlocked and timed out. I suspect it's because Travis only provides open
source projects with two cores but I'm not sure.

On Mon, Feb 27, 2017 at 1:55 PM Dan Davydov 
wrote:

> This looks like a great effort to me at least in the short term (in the
> long term I think most of the integration tests should be run together if
> the infra allows this). Another thing we could start looking into is
> parallelizing tests (though this may require beefier machines from Travis).
>
> On Sat, Feb 25, 2017 at 8:58 AM, Bolke de Bruin  wrote:
>
> > Hi All,
> >
> > Jeremiah and I have been looking into optimising the time that is spend
> on
> > tests. The reason for this was that Travis’ runs are taking more and more
> > time and we are being throttled by travis. As part of that we enabled
> color
> > coding of test outcomes and timing of tests. The results were kind of
> > …surprising.
> >
> > This is the top 20 tests where we spend the most time. MySQL (remember
> > concurrent access enabled) - https://s3.amazonaws.com/
> > archive.travis-ci.org/jobs/205277617/log.txt:
> >
> > tests.BackfillJobTest.test_backfill_examples:  287.9209s
> > tests.BackfillJobTest.test_backfill_multi_dates:  53.5198s
> > tests.SchedulerJobTest.test_scheduler_start_date:  36.4935s
> > tests.CoreTest.test_scheduler_job:  35.5852s
> > tests.CliTests.test_backfill:  29.7484s
> > tests.SchedulerJobTest.test_scheduler_multiprocessing:  26.1573s
> > tests.DaskExecutorTest.test_backfill_integration:  24.5456s
> > tests.CoreTest.test_schedule_dag_no_end_date_up_to_today_only:  17.3278s
> > tests.SubDagOperatorTests.test_subdag_deadlock:  16.1957s
> > tests.SensorTimeoutTest.test_timeout:  15.1000s
> > tests.SchedulerJobTest.test_dagrun_deadlock_ignore_depends_on_past:
> > 13.8812s
> > tests.BackfillJobTest.test_cli_backfill_depends_on_past:  12.9539s
> > tests.SchedulerJobTest.test_dagrun_deadlock_ignore_
> > depends_on_past_advance_ex_date:  12.8779s
> > tests.SchedulerJobTest.test_dagrun_success:  12.8177s
> > tests.SchedulerJobTest.test_dagrun_root_fail:  10.3953s
> > tests.SchedulerJobTest.test_dag_with_system_exit:  10.1132s
> > tests.TransferTests.test_mysql_to_hive:  8.5939s
> > tests.SchedulerJobTest.test_retry_still_in_executor:  8.1739s
> > tests.SchedulerJobTest.test_dagrun_fail:  7.9855s
> > tests.ImpersonationTest.test_default_impersonation:  7.4993s
> >
> > Yes we spend a whopping 5 minutes on executing all examples. Another
> > interesting one is “tests.CoreTest.test_scheduler_job”. This test just
> > checks whether certain directories are created as part of logging.
> This
> > could have been covered by a real unit test just covering the
> functionality
> > of the function that creates the files - now it takes 35s.
> >
> > We discussed several strategies for reducing time apart from rewriting
> > some of the tests (that would be a herculean job!). What seems most
> > optimal is:
> >
> > 1. Run the scheduler tests apart from all other tests.
> > 2. Run “operator” integration tests in their own unit.
> > 3. Run UI tests separate
> > 4. Run API tests separate
> >
> > This creates the following build matrix (warning ASCII art):
> >
> > |           | Scheduler | Operators | UI | API |
> > |-----------|-----------|-----------|----|-----|
> > | Python 2  |     x     |     x     | x  |  x  |
> > | Python 3  |     x     |     x     | x  |  x  |
> > | Kerberos  |           |           | x  |  x  |
> > | Ldap      |           |           | x  |     |
> > | Hive      |           |     x     | x  |  x  |
> > | SSH       |           |     x     |    |     |
> > | Postgres  |     x     |     x     | x  |  x  |
> > | MySQL     |     x     |     x     | x  |  x  |
> > | SQLite    |     x     |     x     | x  |  x  |

Re: Cutting down on testing time - updated

2017-02-26 Thread Jeremiah Lowin
Thanks for writing that up, Bolke.

One more thought -- about 4 minutes (so anywhere from 20-33% of test time)
is spent building the environment and installing dependencies. I believe we
could publish a Docker image with the test environment ready to go and get
a majority of that time back. The image would have to be rebuilt with every
push to master, though.


On Sat, Feb 25, 2017 at 1:19 PM Bolke de Bruin  wrote:

> Hi All,
>
> (Welcome to new MacBook Pro that has a send “button” on the touch bar)
>
> Jeremiah and I have been looking into optimising the time that is spend on
> tests. The reason for this was that Travis’ runs are taking more and more
> time and we are being throttled by travis. As part of that we enabled color
> coding of test outcomes and timing of tests. The results were kind of
> …surprising.
>
> This is the top 20 tests where we spend the most time. MySQL (remember
> concurrent access enabled) -
> https://s3.amazonaws.com/archive.travis-ci.org/jobs/205277617/log.txt: <
> https://s3.amazonaws.com/archive.travis-ci.org/jobs/205277617/log.txt:>
>
> tests.BackfillJobTest.test_backfill_examples:  287.9209s
> tests.BackfillJobTest.test_backfill_multi_dates:  53.5198s
> tests.SchedulerJobTest.test_scheduler_start_date:  36.4935s
> tests.CoreTest.test_scheduler_job:  35.5852s
> tests.CliTests.test_backfill:  29.7484s
> tests.SchedulerJobTest.test_scheduler_multiprocessing:  26.1573s
> tests.DaskExecutorTest.test_backfill_integration:  24.5456s
> tests.CoreTest.test_schedule_dag_no_end_date_up_to_today_only:  17.3278s
> tests.SubDagOperatorTests.test_subdag_deadlock:  16.1957s
> tests.SensorTimeoutTest.test_timeout:  15.1000s
> tests.SchedulerJobTest.test_dagrun_deadlock_ignore_depends_on_past:
> 13.8812s
> tests.BackfillJobTest.test_cli_backfill_depends_on_past:  12.9539s
> tests.SchedulerJobTest.test_dagrun_deadlock_ignore_depends_on_past_advance_ex_date:
> 12.8779s
> tests.SchedulerJobTest.test_dagrun_success:  12.8177s
> tests.SchedulerJobTest.test_dagrun_root_fail:  10.3953s
> tests.SchedulerJobTest.test_dag_with_system_exit:  10.1132s
> tests.TransferTests.test_mysql_to_hive:  8.5939s
> tests.SchedulerJobTest.test_retry_still_in_executor:  8.1739s
> tests.SchedulerJobTest.test_dagrun_fail:  7.9855s
> tests.ImpersonationTest.test_default_impersonation:  7.4993s
>
> Yes we spend a whopping 5 minutes on executing all examples. Another
> interesting one is “tests.CoreTest.test_scheduler_job”. This test just
> checks whether certain directories are created as part of logging. This
> could have been covered by a real unit test just covering the functionality
> of the function that creates the files - now it takes 35s.
>
> We discussed several strategies for reducing time apart from rewriting
> some of the tests (that would be a herculean job!). What seems most
> optimal is:
>
> 1. Run the scheduler tests apart from all other tests.
> 2. Run “operator” integration tests in their own unit.
> 3. Run UI tests separate
> 4. Run API tests separate
>
> This creates the following build matrix (warning ASCII art):
>
> |           | Scheduler | Operators | UI | API |
> |-----------|-----------|-----------|----|-----|
> | Python 2  |     x     |     x     | x  |  x  |
> | Python 3  |     x     |     x     | x  |  x  |
> | Kerberos  |           |           | x  |  x  |
> | Ldap      |           |           | x  |     |
> | Hive      |           |     x     | x  |  x  |
> | SSH       |           |     x     |    |     |
> | Postgres  |     x     |     x     | x  |  x  |
> | MySQL     |     x     |     x     | x  |  x  |
> | SQLite    |     x     |     x     | x  |  x  |
>
>
> So from this build matrix one can deduce that Postgres and MySQL are generic
> services that will be present in every build. In addition all builds will
> use Python 2 and Python 3. And I propose using Python 3.4 and Python 3.5.
> The matrix can be expressed by environment variables. See .travis.yml for
> the current build matrix.
>
> Furthermore, I would like us to label our tests correctly,

Re: [RESULT] [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc4

2017-02-25 Thread Jeremiah Lowin
> >>> maximebeauche...@gmail.com> wrote:
> >>>
> >>>> Our database may have edge cases that could be associated with
running
> >>> any
> >>>> previous version that may or may not have been part of an official
> >>> release.
> >>>>
> >>>> Let's see if anyone else reports the issue. If no one does, one
> option is
> >>>> to release 1.8.0 as is with a comment in the release notes, and have
a
> >>>> future official minor apache release 1.8.1 that would fix these minor
> >>>> issues that are not deal breaker.
> >>>>
> >>>> @bolke, I'm curious, how long does it take you to go through one
> release
> >>>> cycle? Oh, and do you have a documented step by step process for
> >>> releasing?
> >>>> I'd like to add the Pypi part to this doc and add committers that are
> >>>> interested to have rights on the project on Pypi.
> >>>>
> >>>> Max
> >>>>
> >>>> On Wed, Feb 22, 2017 at 2:00 PM, Bolke de Bruin 
> >>> wrote:
> >>>>
> >>>>> So it is a database integrity issue? Afaik a start_date should
always
> >>> be
> >>>>> set for a DagRun (create_dagrun) does so  I didn't check the code
> >>> though.
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>>> On 22 Feb 2017, at 22:19, Dan Davydov  >>> INVALID>
> >>>>> wrote:
> >>>>>>
> >>>>>> Should clarify this occurs when a dagrun does not have a start
date,
> >>>> not
> >>>>> a
> >>>>>> dag (which makes it even less likely to happen). I don't think this
> >>> is
> >>>> a
> >>>>>> blocker for releasing.
> >>>>>>
> >>>>>>> On Wed, Feb 22, 2017 at 1:15 PM, Dan Davydov <
> >>> dan.davy...@airbnb.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>> I rolled this out in our prod and the webservers failed to load
due
> >>> to
> >>>>>>> this commit:
> >>>>>>>
> >>>>>>> [AIRFLOW-510] Filter Paused Dags, show Last Run & Trigger Dag
> >>>>>>> 7c94d81c390881643f94d5e3d7d6fb351a445b72
> >>>>>>>
> >>>>>>> This fixed it:
> >>>>>>> -  >>>>>>> class="glyphicon glyphicon-info-sign" aria-hidden="true"
> >>> title="Start
> >>>>> Date:
> >>>>>>> {{last_run.start_date.strftime('%Y-%m-%d %H:%M')}}">
> >>>>>>> +  >>>>>>> class="glyphicon glyphicon-info-sign" aria-hidden="true">
> >>>>>>>
> >>>>>>> This is caused by assuming that all DAGs have start dates set, so
a
> >>>>> broken
> >>>>>>> DAG will take down the whole UI. Not sure if we want to make this
a
> >>>>> blocker
> >>>>>>> for the release or not, I'm guessing for most deployments this
> would
> >>>>> occur
> >>>>>>> pretty rarely. I'll submit a PR to fix it soon.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Feb 21, 2017 at 9:49 AM, Chris Riccomini <
> >>>> criccom...@apache.org
> >>>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Ack that the vote has already passed, but belated +1 (binding)
> >>>>>>>>
> >>>>>>>> On Tue, Feb 21, 2017 at 7:42 AM, Bolke de Bruin <
> bdbr...@gmail.com
> >>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> IPMC Voting can be found here:
> >>>>>>>>>
> >>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-general/
> >>>>>>>> 201702.mbox/%
> >>>>>>>>> 3c676bdc9f-1b55-4469-92a7-9ff309ad0...@gmail.com%3e <
> >>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-general/
> >>>>>>>> 201702.mbox/%
> >>>>>>>>> 3c676bdc9f-1b55-4469-92a7-9ff309ad0...@gmail.com%3E>
> >>>>>>>>>
> >>>>>>>>> Kind regards,
> >>>>>>>>> Bolke
> >>>>>>>>>
> >>>>>>>>>> On 21 Feb 2017, at 08:20, Bolke de Bruin 
> >>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hello,
> >>>>>>>>>>
> >>>>>>>>>> Apache Airflow (incubating) 1.8.0 (based on RC4) has been
> >>> accepted.
> >>>>>>>>>>
> >>>>>>>>>> 9 “+1” votes received:
> >>>>>>>>>>
> >>>>>>>>>> - Maxime Beauchemin (binding)
> >>>>>>>>>> - Arthur Wiedmer (binding)
> >>>>>>>>>> - Dan Davydov (binding)
> >>>>>>>>>> - Jeremiah Lowin (binding)
> >>>>>>>>>> - Siddharth Anand (binding)
> >>>>>>>>>> - Alex van Boxel (binding)
> >>>>>>>>>> - Bolke de Bruin (binding)
> >>>>>>>>>>
> >>>>>>>>>> - Jayesh Senjaliya (non-binding)
> >>>>>>>>>> - Yi (non-binding)
> >>>>>>>>>>
> >>>>>>>>>> Vote thread (start):
> >>>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-
> >>>>>>>>> airflow-dev/201702.mbox/%3cD360D9BE-C358-42A1-9188-
> >>>>>>>>> 6c92c31a2...@gmail.com%3e <http://mail-archives.apache.
> >>>>>>>>> org/mod_mbox/incubator-airflow-dev/201702.mbox/%3C7EB7B6D6-
> >>>>>>>> 092E-48D2-AA0F-
> >>>>>>>>> 15f44376a...@gmail.com%3E>
> >>>>>>>>>>
> >>>>>>>>>> Next steps:
> >>>>>>>>>> 1) will start the voting process at the IPMC mailinglist. I do
> >>>> expect
> >>>>>>>>> some changes to be required mostly in documentation maybe a
> >>> license
> >>>>> here
> >>>>>>>>> and there. So, we might end up with changes to stable. As long
as
> >>>>> these
> >>>>>>>> are
> >>>>>>>>> not (significant) code changes I will not re-raise the vote.
> >>>>>>>>>> 2) Only after the positive voting on the IPMC and finalisation
I
> >>>> will
> >>>>>>>>> rebrand the RC to Release.
> >>>>>>>>>> 3) I will upload it to the incubator release page, then the tar
> >>>> ball
> >>>>>>>>> needs to propagate to the mirrors.
> >>>>>>>>>> 4) Update the website (can someone volunteer please?)
> >>>>>>>>>> 5) Finally, I will ask Maxime to upload it to pypi. It seems we
> >>> can
> >>>>>>>> keep
> >>>>>>>>> the apache branding as lib cloud is doing this as well (
> >>>>>>>>> https://libcloud.apache.org/downloads.html#pypi-package <
> >>>>>>>>> https://libcloud.apache.org/downloads.html#pypi-package>).
> >>>>>>>>>>
> >>>>>>>>>> Jippie!
> >>>>>>>>>>
> >>>>>>>>>> Bolke
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >
>
> --
  _/
_/ Alex Van Boxel


Re: Non-deterministic error in example_subdag_operator DAG

2017-02-21 Thread Jeremiah Lowin
Good thought -- but I believe it has existed for a while, popping up
infrequently. I never connected that those random errors had the same root
until I happened to be looking closely at a few cases this weekend. But
that's just a hunch, can't prove it. I'm sort of at a loss as to the cause.

On Tue, Feb 21, 2017 at 2:31 PM Chris Riccomini 
wrote:

> OK, just a thought. :)
>
> On Tue, Feb 21, 2017 at 11:29 AM, Bolke de Bruin 
> wrote:
>
> > Multiple tests use the example bash operator; I refactored the mark-success
> > unit tests to use a different dag.
> >
> > Nevertheless I had some issues that required me to refactor to use a
> > different test. However I think what Jeremiah is observing existed before
> > my update.
> >
> > Bolke
> >
> > Sent from my iPhone
> >
> > > On 21 Feb 2017, at 20:09, Chris Riccomini 
> wrote:
> > >
> > > Wonder if this has to do with the recent changes to subdag mark success
> > for
> > > RC4?
> > >
> > >> On Sun, Feb 19, 2017 at 10:05 AM, Jeremiah Lowin 
> > wrote:
> > >>
> > >> The unit test that runs the example_subdag_operator DAG is failing
> > >> non-deterministically. I see it sporadically in the Travis results,
> for
> > >> different environments, but I can't figure out what's causing the
> > failure
> > >> (and why it doesn't happen every time!) Anyone have a thought?
> > >>
> > >> Here's a log from a failed run:
> > >> https://travis-ci.org/jlowin/airflow/jobs/203149863#L8494. This is
> the
> > >> same
> > >> master branch that's currently passing in the apache repo.
> > >>
> > >> J
> > >>
> >
>


Non-deterministic error in example_subdag_operator DAG

2017-02-19 Thread Jeremiah Lowin
The unit test that runs the example_subdag_operator DAG is failing
non-deterministically. I see it sporadically in the Travis results, for
different environments, but I can't figure out what's causing the failure
(and why it doesn't happen every time!) Anyone have a thought?

Here's a log from a failed run:
https://travis-ci.org/jlowin/airflow/jobs/203149863#L8494. This is the same
master branch that's currently passing in the apache repo.

J


Re: Xcom related security issue

2017-02-19 Thread Jeremiah Lowin
Rui,

Thanks for pointing this out, it's a valid concern.

I personally have no issue with swapping Pickle -> JSON, but there may be
many Airflow users relying on the current behavior and I don't want to
invalidate their DAGs with a PR.

On the other hand, I'm not sure of a way to "gently" deprecate the
PickleType. Perhaps step 1 is to check if an XCom can be JSON serialized
and if it can't, print a warning? Then step 2 is to enforce JSON
serialization at a future date.
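
For the sake of discussion, step 1 could be as small as a check like the
following (the helper name and the place it would be called from are
hypothetical, not existing XCom code):

import json
import warnings

def warn_if_not_json_serializable(value):
    # Hypothetical helper: a real patch would call this wherever XCom
    # values are written to the metadata database, before pickling.
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        warnings.warn(
            "XCom value is not JSON-serializable; pickled XComs are "
            "deprecated and JSON serialization may be enforced in a "
            "future release.",
            DeprecationWarning)
        return False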

Any suggestions of how to implement this?

J

On Sat, Feb 18, 2017 at 10:16 AM Rui Wang 
wrote:

> Hi,
>
> I created an JIRA issue: https://issues.apache.org/jira/browse/AIRFLOW-855
> .
>
>
> The JIRA task above gives pretty rich context. Briefly speaking, PickleType
> makes it possible to run code or commands on remote machines. This type can
> serialize arbitrary objects, which is a wide scope. I am wondering what kind
> of use cases you have for using XCom and its PickleType. If those use cases
> show that PickleType could be replaced with a JSON type, then this security
> issue can probably be solved by using JSON instead.
>
>
> Thanks,
> Rui Wang
>


Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc4

2017-02-17 Thread Jeremiah Lowin
+1 (binding) many thanks for all your work on this Bolke!

On Fri, Feb 17, 2017 at 7:10 PM Jayesh Senjaliya 
wrote:

+1 ( non-binding )works fine for me.



On Fri, Feb 17, 2017 at 3:37 PM, Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> +1 (binding)
>
> On Fri, Feb 17, 2017 at 11:33 AM, Dan Davydov <
> dan.davy...@airbnb.com.invalid> wrote:
>
> > +1 (binding). Mark success works great now, thanks to Bolke for fixing.
> >
> > On Fri, Feb 17, 2017 at 12:22 AM, Bolke de Bruin 
> > wrote:
> >
> > > Dear All,
> > >
> > > I have made the FOURTH RELEASE CANDIDATE of Airflow 1.8.0 available
at:
> > > https://dist.apache.org/repos/dist/dev/incubator/airflow/ <
> > > https://dist.apache.org/repos/dist/dev/incubator/airflow/> , public
> keys
> > > are available at https://dist.apache.org/repos/
> > > dist/release/incubator/airflow/ . It is tagged with a local version
> > > “apache.incubating” so it allows upgrading from earlier releases.
> > >
> > > One issues have been fixed since release candidate 3:
> > >
> > > * mark success was not working properly
> > >
> > > No known issues anymore.
> > >
> > > I would also like to raise a VOTE for releasing 1.8.0 based on release
> > > candidate 4, i.e. just renaming release candidate 4 to 1.8.0 release.
> > >
> > > Please respond to this email by:
> > >
> > > +1,0,-1 with *binding* if you are a PMC member or *non-binding* if you
> > are
> > > not.
> > >
> > > Thanks!
> > > Bolke
> > >
> > > My VOTE: +1 (binding)
> >
>


Re: Celery or Dask?

2017-02-13 Thread Jeremiah Lowin
As far as I know I'm the only person using Dask with Airflow at the moment.
I've been using Dask for a variety of other (non-Airflow) tasks and have
found it to be a great tool. However, it's important to note that Celery is
a much more mature project with finer control over how tasks are executed.
In fact Dask's objectives are totally different (I think of it as
"pure-Python Spark") but it happens to expose similar functionality to
Celery through its Distributed subproject.

I added a DaskExecutor to Airflow in my last commit and am working on
improving the unit tests now. I've been running the DaskExecutor in a test
environment with good results, but between the fact that you have to run
Airflow's bleeding-edge master branch to get it and that I'm the only
person kicking its tires (at the moment), I would only recommend using it
if you like to live very dangerously indeed.

In the near future, I can see Dask being a recommended way to scale Airflow
beyond a single machine due to the ease of setting it up -- but not yet.

On Mon, Feb 13, 2017 at 11:04 AM Bolke de Bruin  wrote:

Dask just landed in master. So no, Celery is the most-used option to
scale out.

Always interested in what you are running into, but please be prepared to
provide a lot of info on your setup.

- Bolke

> On 13 Feb 2017, at 17:01, EKC (Erik Cederstrand) 
wrote:
>
> Hello all,
>
>
> I'm investigating why some of our DAGs are not being scheduled properly (
ran into https://issues.apache.org/jira/browse/AIRFLOW-342, among other
things). Coupled with comments on this list, I'm getting the impression
that Celery is a second-class citizen and core developers are mainly using
Dask. Is this correct?
>
>
> If Dask support is simply more mature and more likely to have issues
responded to, I'll consider migrating our installation.
>
>
> Thanks,
>
> Erik


Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc3

2017-02-12 Thread Jeremiah Lowin
Interesting -- I also run on Kubernetes with a git-sync sidecar, but the
containers wait for the synced repo to appear before starting since it
contains some dependencies -- I assume that's why I didn't experience the
same issue.

On Sun, Feb 12, 2017 at 6:29 AM Bolke de Bruin  wrote:

> Although the race condition doesn't explain why “num_runs = None” resolved
> the issue for you earlier, but it does give a clue now: the PR that
> introduced “num_runs = -1” was there to be able to work with empty dag
> dirs, maybe it wasn’t fully covered yet.
>
> Bolke
>
> > On 12 Feb 2017, at 12:26, Bolke de Bruin  wrote:
> >
> > Ok great! Thanks! That sounds like a race condition: module not
> available yet at time of reading. I would expect that it resolves itself
> after a while.
> >
> > After talking to some people at the Warsaw BigData conf I have some
> ideas around syncing dags, Spoiler: no dependency on git.
> >
> > - Bolke
> >
> >> On 12 Feb 2017, at 11:17, Alex Van Boxel  wrote:
> >>
> >> Running ok, in staging... @bolke I'm running patch-less. I've switched
> my
> >> Kubernetes from:
> >>
> >> - each container (webserver/scheduler/worker) had a git-sync'er (getting
> >> the dags from git)
> >>> this meant that the scheduler had 0 dags at startup, and should have
> >> picked them up later
> >>
> >> to
> >>
> >> - single NFS share that shares airflow_home over each container
> >>> the git sync'er is now a separate container running before the other
> >> containers
> >>
> >> This resolved my mystery DAG crashes.
> >>
> >> I'll be updating production to a patchless RC3 today, you get my vote
> after
> >> that.
> >>
> >>
> >>
> >>
> >> On Sun, Feb 12, 2017 at 4:59 AM Boris Tyukin 
> wrote:
> >>
> >>> awesome! thanks Jeremiah
> >>>
> >>> On Sat, Feb 11, 2017 at 12:53 PM, Jeremiah Lowin 
> >>> wrote:
> >>>
> >>>> Boris, I submitted a PR to address your second point --
> >>>> https://github.com/apache/incubator-airflow/pull/2068. Thanks!
> >>>>
> >>>> On Sat, Feb 11, 2017 at 10:42 AM Boris Tyukin 
> >>>> wrote:
> >>>>
> >>>>> I am running LocalExecutor and not doing crazy things but use DAG
> >>>>> generation heavily - everything runs fine as before. As I mentioned
> in
> >>>>> other threads only had a few issues:
> >>>>>
> >>>>> 1) had to upgrade MySQL which was a PAIN. Cloudera CDH is running old
> >>>>> version of MySQL which was compatible with 1.7.1 but not compatible
> now
> >>>>> with 1.8 because of fractional seconds support PR.
> >>>>>
> >>>>> 2) when you install airflow, there are two new example DAGs
> >>>>> (last_task_only) which are going back very far in the past and
> >>> scheduled
> >>>> to
> >>>>> run every hour - a bunch of dags triggered on the first start of
> >>>> scheduler
> >>>>> and hosed my CPU
> >>>>>
> >>>>> Everything else was fine and I LOVE lots of small UI changes, which
> >>>> reduced
> >>>>> a lot my use of cli.
> >>>>>
> >>>>> Thanks again for the amazing work and an awesome project!
> >>>>>
> >>>>>
> >>>>> On Sat, Feb 11, 2017 at 9:17 AM, Jeremiah Lowin 
> >>>> wrote:
> >>>>>
> >>>>>> I was able to deploy successfully. +1 (binding)
> >>>>>>
> >>>>>> On Fri, Feb 10, 2017 at 7:37 PM Maxime Beauchemin <
> >>>>>> maximebeauche...@gmail.com> wrote:
> >>>>>>
> >>>>>>> +1 (binding)
> >>>>>>>
> >>>>>>> On Fri, Feb 10, 2017 at 3:44 PM, Arthur Wiedmer <
> >>>>>> arthur.wied...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> +1 (binding)
> >>>>>>>>
> >>>>>>>> On Feb 10, 2017 3:13 PM, "Dan Davydov"  >>>>>> invalid>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Our staging looks good, all the DAG

Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc3

2017-02-11 Thread Jeremiah Lowin
Boris, I submitted a PR to address your second point --
https://github.com/apache/incubator-airflow/pull/2068. Thanks!

On Sat, Feb 11, 2017 at 10:42 AM Boris Tyukin  wrote:

> I am running LocalExecutor and not doing crazy things but use DAG
> generation heavily - everything runs fine as before. As I mentioned in
> other threads only had a few issues:
>
> 1) had to upgrade MySQL which was a PAIN. Cloudera CDH is running old
> version of MySQL which was compatible with 1.7.1 but not compatible now
> with 1.8 because of fractional seconds support PR.
>
> 2) when you install airflow, there are two new example DAGs
> (last_task_only) which are going back very far in the past and scheduled to
> run every hour - a bunch of dags triggered on the first start of scheduler
> and hosed my CPU
>
> Everything else was fine and I LOVE lots of small UI changes, which reduced
> a lot my use of cli.
>
> Thanks again for the amazing work and an awesome project!
>
>
> On Sat, Feb 11, 2017 at 9:17 AM, Jeremiah Lowin  wrote:
>
> > I was able to deploy successfully. +1 (binding)
> >
> > On Fri, Feb 10, 2017 at 7:37 PM Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> > > +1 (binding)
> > >
> > > On Fri, Feb 10, 2017 at 3:44 PM, Arthur Wiedmer <
> > arthur.wied...@gmail.com>
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > On Feb 10, 2017 3:13 PM, "Dan Davydov"  > invalid>
> > > > wrote:
> > > >
> > > > > Our staging looks good, all the DAGs there pass.
> > > > > +1 (binding)
> > > > >
> > > > > On Fri, Feb 10, 2017 at 10:21 AM, Chris Riccomini <
> > > criccom...@apache.org
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Running in all environments. Will vote after the weekend to make
> > sure
> > > > > > things are working properly, but so far so good.
> > > > > >
> > > > > > On Fri, Feb 10, 2017 at 6:05 AM, Bolke de Bruin <
> bdbr...@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Dear All,
> > > > > > >
> > > > > > > Let’s try again!
> > > > > > >
> > > > > > > I have made the THIRD RELEASE CANDIDATE of Airflow 1.8.0
> > available
> > > > at:
> > > > > > > https://dist.apache.org/repos/dist/dev/incubator/airflow/ <
> > > > > > > https://dist.apache.org/repos/dist/dev/incubator/airflow/> ,
> > > public
> > > > > keys
> > > > > > > are available at https://dist.apache.org/repos/
> > > > dist/release/incubator/
> > > > > > > airflow/ <
> https://dist.apache.org/repos/dist/release/incubator/
> > > > > airflow/>
> > > > > > > . It is tagged with a local version “apache.incubating” so it
> > > allows
> > > > > > > upgrading from earlier releases.
> > > > > > >
> > > > > > > Two issues have been fixed since release candidate 2:
> > > > > > >
> > > > > > > * trigger_dag could create dags with fractional seconds, not
> > > > supported
> > > > > by
> > > > > > > logging and UI at the moment
> > > > > > > * local api client trigger_dag had a hardcoded execution date of None
> > > > > > >
> > > > > > > Known issue:
> > > > > > > * Airflow on kubernetes and num_runs -1 (default) can expose
> > import
> > > > > > issues.
> > > > > > >
> > > > > > > I have extensively discussed this with Alex (reporter) and we
> > > > consider
> > > > > > > this a known issue with a workaround available as we are unable
> > to
> > > > > > > replicate this in a different environment. UPDATING.md has been
> > > > updated
> > > > > > > with the work around.
> > > > > > >
> > > > > > > As these issues are confined to a very specific area and full
> > unit
> > > > > tests
> > > > > > > were added I would also like to raise a VOTE for releasing
> 1.8.0
> > > > based
> > > > > on
> > > > > > > release candidate 3, i.e. just renaming release candidate 3 to
> > > 1.8.0
> > > > > > > release.
> > > > > > >
> > > > > > > Please respond to this email by:
> > > > > > >
> > > > > > > +1,0,-1 with *binding* if you are a PMC member or *non-binding*
> > if
> > > > you
> > > > > > are
> > > > > > > not.
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Bolke
> > > > > > >
> > > > > > > My VOTE: +1 (binding)
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc3

2017-02-11 Thread Jeremiah Lowin
I was able to deploy successfully. +1 (binding)

On Fri, Feb 10, 2017 at 7:37 PM Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> +1 (binding)
>
> On Fri, Feb 10, 2017 at 3:44 PM, Arthur Wiedmer 
> wrote:
>
> > +1 (binding)
> >
> > On Feb 10, 2017 3:13 PM, "Dan Davydov" 
> > wrote:
> >
> > > Our staging looks good, all the DAGs there pass.
> > > +1 (binding)
> > >
> > > On Fri, Feb 10, 2017 at 10:21 AM, Chris Riccomini <
> criccom...@apache.org
> > >
> > > wrote:
> > >
> > > > Running in all environments. Will vote after the weekend to make sure
> > > > things are working properly, but so far so good.
> > > >
> > > > On Fri, Feb 10, 2017 at 6:05 AM, Bolke de Bruin 
> > > wrote:
> > > >
> > > > > Dear All,
> > > > >
> > > > > Let’s try again!
> > > > >
> > > > > I have made the THIRD RELEASE CANDIDATE of Airflow 1.8.0 available
> > at:
> > > > > https://dist.apache.org/repos/dist/dev/incubator/airflow/ <
> > > > > https://dist.apache.org/repos/dist/dev/incubator/airflow/> ,
> public
> > > keys
> > > > > are available at https://dist.apache.org/repos/
> > dist/release/incubator/
> > > > > airflow/
> > > > > . It is tagged with a local version “apache.incubating” so it
> allows
> > > > > upgrading from earlier releases.
> > > > >
> > > > > Two issues have been fixed since release candidate 2:
> > > > >
> > > > > * trigger_dag could create dags with fractional seconds, not
> > supported
> > > by
> > > > > logging and UI at the moment
> > > > > * local api client trigger_dag had a hardcoded execution date of None
> > > > >
> > > > > Known issue:
> > > > > * Airflow on kubernetes and num_runs -1 (default) can expose import
> > > > issues.
> > > > >
> > > > > I have extensively discussed this with Alex (reporter) and we
> > consider
> > > > > this a known issue with a workaround available as we are unable to
> > > > > replicate this in a different environment. UPDATING.md has been
> > updated
> > > > > with the work around.
> > > > >
> > > > > As these issues are confined to a very specific area and full unit
> > > tests
> > > > > were added I would also like to raise a VOTE for releasing 1.8.0
> > based
> > > on
> > > > > release candidate 3, i.e. just renaming release candidate 3 to
> 1.8.0
> > > > > release.
> > > > >
> > > > > Please respond to this email by:
> > > > >
> > > > > +1,0,-1 with *binding* if you are a PMC member or *non-binding* if
> > you
> > > > are
> > > > > not.
> > > > >
> > > > > Thanks!
> > > > > Bolke
> > > > >
> > > > > My VOTE: +1 (binding)
> > > >
> > >
> >
>


Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc2

2017-02-09 Thread Jeremiah Lowin
Bolke, I think you might be looking for
https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L376

I am traveling and will not be able to deploy rc2 for testing before next
week so I don't feel entirely comfortable offering a vote now, but (for
what it's worth) I would have voted +1 based on my experience with rc1.

On Thu, Feb 9, 2017 at 3:42 PM Bolke de Bruin  wrote:

> You can trigger them (pun intended :) ) by issuing
>
> airflow trigger_dag -e 20160101T01:01:00 test_dag
>
> I have no clue why the first issue was never caught, but writing tests now
> for it.
>
> If someone knows where the log directory for tasks is determined (the
> dynamic part), I would be happy, cause I have trouble finding it…
>
>
> > On 9 Feb 2017, at 22:27, Chris Riccomini  wrote:
> >
> > :( Thanks for the update. I'm running RC2 right now, and not hitting the
> > above issues (didn't check fractional execution dates).
> >
> > On Thu, Feb 9, 2017 at 1:25 PM, Bolke de Bruin  > wrote:
> >
> >> Replying to myself, changing to -1 (binding).
> >>
> >> Found three issues:
> >> * in cli trigger_dag and the local client. Execution date will never be
> >> set.
> >> * Issue with log directories (when will we fix logging?): execution_date
> >> can have fractions
> >> * maybe related to the above, but a triggered_dag will not get its tasks
> >> scheduled.
> >>
> >> Sorry it seems we will need an RC3, I consider the above to be blockers.
> >>
> >> - Bolke
> >>
> >>> On 9 Feb 2017, at 16:26, Bolke de Bruin  wrote:
> >>>
> >>> Dear All,
> >>>
> >>> I have made the SECOND RELEASE CANDIDATE of Airflow 1.8.0 available at:
> >> https://dist.apache.org/repos/dist/dev/incubator/airflow/ <
> >> https://dist.apache.org/repos/dist/dev/incubator/airflow/ <
> https://dist.apache.org/repos/dist/dev/incubator/airflow/>> , public keys
> >> are available at https://dist.apache.org/repos/dist/release/incubator/
> 
> >> airflow/
> >> . It is tagged with a local version “apache.incubating” so it allows
> >> upgrading from earlier releases.
> >>>
> >>> Two issues have been fixed:
> >>>
> >>> * Cgroups issue
> >>> * kwargs issue in PrestoOperator
> >>>
> >>> As these issues are confined to a very specific area I would also like
> >> to raise a VOTE for releasing 1.8.0 based on release candidate 2, i.e.
> just
> >> renaming release candidate 2 to 1.8.0 release. This is taking into
> account
> >> that Alex has an open issue, but it cannot be confirmed by others.
> >>>
> >>> Please respond to this email by:
> >>>
> >>> +1,0,-1 with *binding* if you are a PMC member or *non-binding* if you
> >> are not.
> >>>
> >>> Thanks!
> >>> Bolke
> >>>
> >>> My VOTE: +1 (binding)
>
>


Re: Proposal for new task state "WAITING_ON_CALLBACK"

2017-02-08 Thread Jeremiah Lowin
I meant the API -- will check the wiki now. Thanks!

On Wed, Feb 8, 2017 at 8:33 AM Bolke de Bruin  wrote:

> On this proposal? No, not yet. Just popped in my mind yesterday. API there
> is a bit on the wiki.
>
> > On 8 Feb 2017, at 14:31, Jeremiah Lowin  wrote:
> >
> > Makes a lot of sense. At the NY meetup there was considerable interest in
> > using the API (and quite a few hacks around exposing the CLI!) -- is
> there
> > more complete documentation anywhere?
> >
> > Thanks Bolke
> >
> > On Wed, Feb 8, 2017 at 1:36 AM Bolke de Bruin  wrote:
> >
> >> Hi All,
> >>
> >> Now that we have an API in place. I would like to propose a new state
> for
> >> tasks named “WAITING_ON_CALLBACK”. Currently, we have tasks that have a
> >> kind of polling mechanism (ie. Sensors) that wait for an action to
> happen
> >> and check if that action happened by regularly polling a particular
> >> backend. This will always use a slot from one of the workers and could
> >> starve an airflow cluster for resources. What if a callback to Airflow
> >> could happen that task to change its status by calling a callback
> mechanism
> >> without taking up a worker slot. A timeout could (should) be associated
> >> with the required callback so that the task can fail if required. So a
> bit
> >> more visual:
> >>
> >>
> >> Task X from DAG Z  does some work and sets “WAITING_ON_CALLBACK” -> API
> >> post to /dags/Z/dag_runs/20170101T00:00:00/tasks/X with payload “set
> status
> >> to SUCCESS”
> >>
> >> DAG Z happily continues.
> >>
> >> Or
> >>
> >> Task X from DAG Z sets “WAITING_ON_CALLBACK” with timeout of 300s ->
> time
> >> passes -> scheduler sets task to FAILED.
> >>
> >>
> >> Any thoughts?
> >>
> >> - Bolke
>
>


Re: Unit Test Failures

2017-02-08 Thread Jeremiah Lowin
Now that's service!

I've merged the PR. Travis logs should be considerably smaller now, though
task logs will be printed in full even when the test is successful (we can
revisit whether to restrict those in Travis at a future time if we need to).

Authors of recent PRs: please rebase and push -f your PRs to force Travis
to rerun.

Thanks

On Wed, Feb 8, 2017 at 8:29 AM Bolke de Bruin  wrote:

> Done.
>
> > On 8 Feb 2017, at 14:27, Jeremiah Lowin  wrote:
> >
> > I agree something is affecting many of the Py3 PR builds but we can't see
> > the errors because the logs are too verbose. I have a PR to limit log
> > verbosity on Travis (except for failed tests) but it needs a +1, once
> > that's in we should regain transparency in to the errors:
> > https://github.com/apache/incubator-airflow/pull/2049
> >
> > On Wed, Feb 8, 2017 at 7:37 AM Bolke de Bruin  wrote:
> >
> > I have noticed this as well and I do actually think it is something that
> > has “ran” away. The ones that fail go on for much longer and are tied to
> > python 3.
> >
> > Can someone have a look at its please? I am a bit preoccupied with
> getting
> > the release out.
> >
> > Cheers
> > Bolke
> >
> >> On 8 Feb 2017, at 13:18, Miller, Robin <
> > robin.mil...@affiliate.oliverwyman.com> wrote:
> >>
> >> Hi All,
> >>
> >>
> >> On a couple of my open PRs, I've been having trouble getting the unit
> > tests to run successfully, not because there's a test failure, but
> because
> > Travis is reporting that the unit test log has become too large:
> >>
> >>
> >> "
> >>
> >> The log length has exceeded the limit of 4 MB (this usually means that
> > the test suite is raising the same exception over and over).
> >>
> >> The job has been terminated
> >>
> >> "
> >>
> >> I've taken a look through the log and there's nothing obviously stuck in
> > a loop; there are tests starting and passing pretty much to the bottom of
> > the file.
> >>
> >>
> >> Has something changed that means that the tests generate much more
> > output? Or have we just reached a tipping point where we have too many
> > tests for TravisCI to handle? For whatever reason this seems to be mainly
> > happening for Python3 with MySQL or Postgres as the backend (see here for
> > examples:
> https://travis-ci.org/apache/incubator-airflow/builds/199160045)
> >>
> >>
> >> I don't believe this is caused by the tests I've added, as I don't
> > believe it's even reached them in this run:
> > https://api.travis-ci.org/jobs/199160051/log.txt?deansi=true
> >>
> >>
> >> I appreciate any input on this,
> >>
> >> Robin Miller
> >> OLIVER WYMAN
> >> robin.mil...@affiliate.oliverwyman.com > robin.mil...@affiliate.oliverwyman.com>
> >> www.oliverwyman.com<http://www.oliverwyman.com/>
> >>
> >>
> >> 
> >> This e-mail and any attachments may be confidential or legally
> > privileged. If you received this message in error or are not the intended
> > recipient, you should destroy the e-mail message and any attachments or
> > copies, and you are prohibited from retaining, distributing, disclosing
> or
> > using any information contained herein. Please inform us of the erroneous
> > delivery by return e-mail. Thank you for your cooperation.
>
>


Re: Proposal for new task state "WAITING_ON_CALLBACK"

2017-02-08 Thread Jeremiah Lowin
Makes a lot of sense. At the NY meetup there was considerable interest in
using the API (and quite a few hacks around exposing the CLI!) -- is there
more complete documentation anywhere?

Thanks Bolke

On Wed, Feb 8, 2017 at 1:36 AM Bolke de Bruin  wrote:

> Hi All,
>
> Now that we have an API in place. I would like to propose a new state for
> tasks named “WAITING_ON_CALLBACK”. Currently, we have tasks that have a
> kind of polling mechanism (ie. Sensors) that wait for an action to happen
> and check if that action happened by regularly polling a particular
> backend. This will always use a slot from one of the workers and could
> starve an airflow cluster for resources. What if a callback to Airflow
> could allow that task to change its status, via a callback mechanism,
> without taking up a worker slot? A timeout could (should) be associated
> with the required callback so that the task can fail if required. So a bit
> more visual:
>
>
> Task X from DAG Z  does some work and sets “WAITING_ON_CALLBACK” -> API
> post to /dags/Z/dag_runs/20170101T00:00:00/tasks/X with payload “set status
> to SUCCESS”
>
> DAG Z happily continues.
>
> Or
>
> Task X from DAG Z sets “WAITING_ON_CALLBACK” with timeout of 300s -> time
> passes -> scheduler sets task to FAILED.
>
>
> Any thoughts?
>
> - Bolke
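
To make the flow above concrete, the callback from an external system might
look roughly like the sketch below. The base URL, endpoint shape and payload
are illustrative only, following the pattern in the proposal rather than any
existing Airflow API:

import requests

# Hypothetical endpoint, following the
# /dags/{dag_id}/dag_runs/{execution_date}/tasks/{task_id} pattern above.
url = ("http://airflow-webserver:8080/api/experimental"
       "/dags/Z/dag_runs/20170101T00:00:00/tasks/X")

# Tell Airflow the external work finished, so the task waiting in
# WAITING_ON_CALLBACK can move to SUCCESS without holding a worker slot.
response = requests.post(url, json={"state": "success"}, timeout=10)
response.raise_for_status()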


Re: Unit Test Failures

2017-02-08 Thread Jeremiah Lowin
I agree something is affecting many of the Py3 PR builds but we can't see
the errors because the logs are too verbose. I have a PR to limit log
verbosity on Travis (except for failed tests) but it needs a +1, once
that's in we should regain transparency in to the errors:
https://github.com/apache/incubator-airflow/pull/2049

On Wed, Feb 8, 2017 at 7:37 AM Bolke de Bruin  wrote:

I have noticed this as well and I do actually think it is something that
has “ran” away. The ones that fail go on for much longer and are tied to
python 3.

Can someone have a look at its please? I am a bit preoccupied with getting
the release out.

Cheers
Bolke

> On 8 Feb 2017, at 13:18, Miller, Robin <
robin.mil...@affiliate.oliverwyman.com> wrote:
>
> Hi All,
>
>
> On a couple of my open PRs, I've been having trouble getting the unit
tests to run successfully, not because there's a test failure, but because
Travis is reporting that the unit test log has become too large:
>
>
> "
>
> The log length has exceeded the limit of 4 MB (this usually means that
the test suite is raising the same exception over and over).
>
> The job has been terminated
>
> "
>
> I've taken a look through the log and there's nothing obviously stuck in
a loop; there are tests starting and passing pretty much to the bottom of
the file.
>
>
> Has something changed that means that the tests generate much more
output? Or have we just reached a tipping point where we have too many
tests for TravisCI to handle? For whatever reason this seems to be mainly
happening for Python3 with MySQL or Postgres as the backend (see here for
examples: https://travis-ci.org/apache/incubator-airflow/builds/199160045)
>
>
> I don't believe this is caused by the tests I've added, as I don't
believe it's even reached them in this run:
https://api.travis-ci.org/jobs/199160051/log.txt?deansi=true
>
>
> I appreciate any input on this,
>
> Robin Miller
> OLIVER WYMAN
> robin.mil...@affiliate.oliverwyman.com
> www.oliverwyman.com
>
>
> 
> This e-mail and any attachments may be confidential or legally
privileged. If you received this message in error or are not the intended
recipient, you should destroy the e-mail message and any attachments or
copies, and you are prohibited from retaining, distributing, disclosing or
using any information contained herein. Please inform us of the erroneous
delivery by return e-mail. Thank you for your cooperation.


Re: Airflow 1.8.0 Release Candidate 1

2017-02-07 Thread Jeremiah Lowin
Sid,

The behavior in your first point is odd -- it should print a deprecation
warning about 2.0 but still work. I just tested it on master and it works
(see output below), so unless something unusual happened in the 1.8 build
it should work there as well:

In [1]: from airflow.operators import PigOperator
[2017-02-07 15:24:17,551] {__init__.py:57} INFO - Using executor
SequentialExecutor
/Users/jlowin/git/airflow/airflow/utils/helpers.py:320: DeprecationWarning:
Importing PigOperator directly from 'airflow.operators' has been
deprecated. Please import from 'airflow.operators.[operator_module]'
instead. Support for direct imports will be dropped entirely in Airflow 2.0.
  DeprecationWarning)

In [2]: PigOperator
Out[2]: pig_operator.PigOperator



On Mon, Feb 6, 2017 at 9:50 PM siddharth anand  wrote:

> I did get 1.8.0 installed and running at Agari.
>
> I did run into 2 problems.
> 1. Most of our DAGs broke due to the way Operators are now imported.
>
> https://github.com/apache/incubator-airflow/blob/master/UPDATING.md#deprecated-features
>
> According to the documentation, these deprecations would only cause an
> issue in 2.0. However, I needed to fix them now.
>
> So, I needed to change "from airflow.operators import PythonOperator" to
> "from airflow.operators.python_operator import PythonOperator". Am I
> missing something?
>
> 2. I ran into a migration problem that seems to have cleared itself up. I
> did notice that some dags do not have data computed in their "DAG Runs"
> column on the overview page. I am looking into that issue presently.
>
> https://www.dropbox.com/s/cn058mtu3vcv8sq/Screenshot%202017-02-06%2018.45.07.png?dl=0
>
> -s
>
> On Mon, Feb 6, 2017 at 4:30 PM, Dan Davydov  .invalid>
> wrote:
>
> > Bolke, attached is the patch for the cgroups fix. Let me know which
> > branches you would like me to merge it to. If anyone has complaints about
> > the patch let me know (but it does not touch the core of airflow, only
> the
> > new cgroups task runner).
> >
> > On Mon, Feb 6, 2017 at 4:24 PM, siddharth anand 
> wrote:
> >
> >> Actually, I see the error is further down..
> >>
> >>   File
> >> "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/default.py",
> >> line
> >> 469, in do_execute
> >>
> >> cursor.execute(statement, parameters)
> >>
> >> sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) null value in
> >> column "dag_id" violates not-null constraint
> >>
> >> DETAIL:  Failing row contains (null, running, 1, f).
> >>
> >>  [SQL: 'INSERT INTO dag_stats (state, count, dirty) VALUES (%(state)s,
> >> %(count)s, %(dirty)s)'] [parameters: {'count': 1L, 'state': u'running',
> >> 'dirty': False}]
> >>
> >> It looks like an autoincrement is missing for this table.
> >>
> >>
> >> I'm running `SQLAlchemy==1.1.4` - I see our setup.py specifies any
> version
> >> greater than 0.9.8
> >>
> >> -s
> >>
> >>
> >>
> >> On Mon, Feb 6, 2017 at 4:11 PM, siddharth anand 
> >> wrote:
> >>
> >> > I tried upgrading to 1.8.0rc1 from 1.7.1.3 via pip install
> >> > https://dist.apache.org/repos/dist/dev/incubator/airflow/
> >> > airflow-1.8.0rc1+apache.incubating.tar.gz and then running airflow
> >> > upgradedb didn't quite work. First, I thought it completed
> successfully,
> >> > then saw errors that some tables were indeed missing. I ran it again and
> >> > encountered the following exception :
> >> >
> >> > DB: postgresql://app_coust...@db-cousteau.ep.stage.agari.com:543
> >> 2/airflow
> >> >
> >> > [2017-02-07 00:03:20,309] {db.py:284} INFO - Creating tables
> >> >
> >> > INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
> >> >
> >> > INFO  [alembic.runtime.migration] Will assume transactional DDL.
> >> >
> >> > INFO  [alembic.runtime.migration] Running upgrade 2e82aab8ef20 ->
> >> > 211e584da130, add TI state index
> >> >
> >> > INFO  [alembic.runtime.migration] Running upgrade 211e584da130 ->
> >> > 64de9cddf6c9, add task fails journal table
> >> >
> >> > INFO  [alembic.runtime.migration] Running upgrade 64de9cddf6c9 ->
> >> > f2ca10b85618, add dag_stats table
> >> >
> >> > INFO  [alembic.runtime.migration] Running upgrade f2ca10b85618 ->
> >> > 4addfa1236f1, Add fractional seconds to mysql tables
> >> >
> >> > INFO  [alembic.runtime.migration] Running upgrade 4addfa1236f1 ->
> >> > 8504051e801b, xcom dag task indices
> >> >
> >> > INFO  [alembic.runtime.migration] Running upgrade 8504051e801b ->
> >> > 5e7d17757c7a, add pid field to TaskInstance
> >> >
> >> > INFO  [alembic.runtime.migration] Running upgrade 5e7d17757c7a ->
> >> > 127d2bf2dfa7, Add dag_id/state index on dag_run table
> >> >
> >> > /usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/crud.py:692:
> >> > SAWarning: Column 'dag_stats.dag_id' is marked as a member of the
> >> primary
> >> > key for table 'dag_stats', but has no Python-side or server-side
> default
> >> > generator indicated, nor does it indicate 'autoincrement=True' or
> >> > 'nullable=True', and no explicit value is passed.  Primary key columns
> >> > typically may not store NULL.

Contrib & Dataflow

2017-02-04 Thread Jeremiah Lowin
Max made some great points on my dataflow PR and I wanted to continue the
conversation here to make sure it was visible to all.

While I think my dataflow implementation contains the basic requirements
for any more complicated extension (but that conversation can wait!), I had
to implement it by adding some very specific "dataflow-only" code to core
Operator logic. In retrospect, that makes me pause (as, I believe, it did
for Max).

After thinking for a few days, what I really want to do is propose a very
small change to core Airflow: change BaseOperator.post_execute(context) to
BaseOperator.post_execute(result, context). I think the pre_execute and
post_execute hooks have generally been an afterthought, but with that
change (which, I think, is reasonable in and of itself) I could implement
dataflow entirely through those hooks.
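
Concretely, the change would be roughly this (a sketch of the proposed
signature only, not the current BaseOperator code):

class BaseOperator(object):

    # Today the hook only receives the context, roughly:
    #     def post_execute(self, context): ...
    # The proposal is to also hand it the value returned by execute().
    def post_execute(self, result, context):
        # A dataflow extension could push `result` to XCom or external
        # storage from here, keeping dataflow-specific logic out of the
        # core Operator code.
        pass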

So that brings me to my next point: if the hook is changed, I could happily
drop a reworked dataflow implementation into contrib, rather than core.
That would alleviate some of the pressure for Airflow to officially decide
whether it's the right implementation or not (it is! :) ). I feel like that
would be the optimal situation at the moment.

And that brings me to my next point: the future of "contrib" and the
Airflow community.
Having contrib in the core Airflow repo has some advantages:
  - standardized access
  - centralized repository for PRs
  - at least a style review (if not unit tests) from the committers
But some big disadvantages as well:
  - Very complicated dependency management [presumably, most contrib
operators need to add an extras_require entry for their specific
dependencies]
  - No sense of ownership or even an easy way to raise issues (due to
friction of opening JIRA tickets vs github issues)

One thought is to move the contrib directory to its own repo which would
keep the advantages but remove the disadvantages from core Airflow. Another
is to encourage individual airflow repos (Airflow-Docker, Airflow-Dataflow,
Airflow-YourExtensionHere) which could be installed a la carte. That would
leave maintenance up to the original author, but could lead to some
fracturing in the community as discovery becomes difficult.


Re: NYC Airflow Meetup

2017-02-03 Thread Jeremiah Lowin
Thanks for hosting, Joe & Blue Apron! Great to meet so many folks and
looking forward to joining you again.

Jeremiah

On Fri, Feb 3, 2017 at 2:02 PM Joseph Napolitano
 wrote:

> Hi all,
>
> I want to thank everyone for attending NYC's first Airflow meetup at Blue
> Apron.  It was a huge success and we're glad to have met everyone.
>
> As suggested, we decided to create an official NYC Meetup page, sponsored
> by Blue Apron.  We'll add Sid and Max as Organizers.  Let us know if you
> want to help organize.
>
> https://www.meetup.com/NYC-Apache-Airflow-incubating-Meetup/
>
> I planned on taking video of the presentations, but it completely slipped
> my mind!  I'll upload my slides to Slideshare and provide a small writeup
> to complement them.
>
> We're committed to Airflow at Blue Apron and we love the project.  Now that
> our infrastructure is taking shape, we'll have time to contribute back to
> the project.  We have top-down support at Blue Apron to dedicate company
> time for it.
>
> Feel free to connect anytime!
> https://www.linkedin.com/in/joenap
>
> Thanks again,
> *Joe Napolitano *| Sr. Data Engineer
> www.blueapron.com | 5 Crosby Street, New York, NY 10013
>


Re: Airflow 1.8.0 Release Candidate 1

2017-02-03 Thread Jeremiah Lowin
For what it's worth -- everything running smoothly after 24+ hours in a
production(ish) environment.

On Thu, Feb 2, 2017 at 11:25 PM Jayesh Senjaliya 
wrote:

> Thank You Bolke for all the efforts you are putting in !!
>
> I have deployed this RC now.
>
> On Thu, Feb 2, 2017 at 3:02 PM, Jeremiah Lowin  wrote:
>
> > Fantastic work on this Bolke, thank you!
> >
> > We've deployed the RC and will report if there are any issues...
> >
> > On Thu, Feb 2, 2017 at 4:32 PM Bolke de Bruin  wrote:
> >
> > > Now I am blushing :-)
> > >
> > > Sent from my iPhone
> > >
> > > > On 2 Feb 2017, at 22:05, Boris Tyukin  wrote:
> > > >
> > > > LOL awesome!
> > > >
> > > > On Thu, Feb 2, 2017 at 4:00 PM, Maxime Beauchemin <
> > > > maximebeauche...@gmail.com> wrote:
> > > >
> > > >> The Apache mailing doesn't support images so here's a link:
> > > >>
> > > >> http://i.imgur.com/DUkpjZu.png
> > > >> ​
> > > >>
> > > >> On Thu, Feb 2, 2017 at 12:52 PM, Boris Tyukin <
> bo...@boristyukin.com>
> > > >> wrote:
> > > >>
> > > >>> Bolke, you are our hero! I am sure you put a lot of your time to
> make
> > > it
> > > >>> happen
> > > >>>
> > > >>> On Thu, Feb 2, 2017 at 2:50 PM, Bolke de Bruin 
> > > >> wrote:
> > > >>>
> > > >>>> Hi All,
> > > >>>>
> > > >>>> I have made the (first) RELEASE CANDIDATE of Airflow 1.8.0
> available
> > > >> at:
> > > >>>> https://dist.apache.org/repos/dist/dev/incubator/airflow/ ,
> public
> > > >> keys
> > > >>>> are available at
> > > https://dist.apache.org/repos/dist/release/incubator/
> > > >>>> airflow/ . It is tagged with a local version “apache.incubating”
> so
> > it
> > > >>>> allows upgrading from earlier releases. This should be considered
> of
> > > >>>> release quality, but not yet officially vetted as a release yet.
> > > >>>>
> > > >>>> Issues fixed:
> > > >>>> * Use static nvd3 and d3
> > > >>>> * Python 3 incompatibilities
> > > >>>> * CLI API trigger dag issue
> > > >>>>
> > > >>>> As the difference between beta 5 and the release candidate is
> > > >> relatively
> > > >>>> small I hope to start the VOTE for releasing 1.8.0 quite soon (2
> > > >> days?),
> > > >>> if
> > > >>>> the vote passes also a vote needs to happen at the IPMC
> mailinglist.
> > > As
> > > >>>> this is our first Apache release I expect some comments and
> required
> > > >>>> changes and probably a RC 2.
> > > >>>>
> > > >>>> Furthermore, we now have a “v1-8-stable” branch. This has version
> > > >>>> “1.8.0rc1” and will graduate to “1.8.0” when we release. The
> > > >> “v1-8-test”
> > > >>>> branch now has version “1.8.1alpha0” as version and “master” has
> > > >> version
> > > >>>> “1.9.0dev0”. Note that “v1-8-stable” is now closed. This means
> that,
> > > >> per
> > > >>>> release guidelines, patches accompanied with an ASSIGNED Jira and
> a
> > > >>>> sign-off from a committer. Only then the release manager applies
> the
> > > >>> patch
> > > >>>> to stable (In this case that would be me). The release manager
> then
> > > >>> closes
> > > >>>> the bug when the patches have landed in the appropriate branches.
> > For
> > > >>> more
> > > >>>> information please see: https://cwiki.apache.org/
> > > >>>> confluence/display/AIRFLOW/Airflow+Release+Planning+and+
> > > >>>> Supported+Release+Lifetime <https://cwiki.apache.org/
> > > >>>> confluence/display/AIRFLOW/Airflow+Release+Planning+and+
> > > >>>> Supported+Release+Lifetime> .
> > > >>>>
> > > >>>> Any questions or suggestions don’t hesitate to ask!
> > > >>>>
> > > >>>> Cheers
> > > >>>> Bolke
> > > >>>
> > > >>
> > >
> >
>


Re: Airflow 1.8.0 Release Candidate 1

2017-02-02 Thread Jeremiah Lowin
Fantastic work on this Bolke, thank you!

We've deployed the RC and will report if there are any issues...

On Thu, Feb 2, 2017 at 4:32 PM Bolke de Bruin  wrote:

> Now I am blushing :-)
>
> Sent from my iPhone
>
> > On 2 Feb 2017, at 22:05, Boris Tyukin  wrote:
> >
> > LOL awesome!
> >
> > On Thu, Feb 2, 2017 at 4:00 PM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> >> The Apache mailing doesn't support images so here's a link:
> >>
> >> http://i.imgur.com/DUkpjZu.png
> >> ​
> >>
> >> On Thu, Feb 2, 2017 at 12:52 PM, Boris Tyukin 
> >> wrote:
> >>
> >>> Bolke, you are our hero! I am sure you put a lot of your time to make
> it
> >>> happen
> >>>
> >>> On Thu, Feb 2, 2017 at 2:50 PM, Bolke de Bruin 
> >> wrote:
> >>>
>  Hi All,
> 
>  I have made the (first) RELEASE CANDIDATE of Airflow 1.8.0 available
> >> at:
>  https://dist.apache.org/repos/dist/dev/incubator/airflow/ , public
> >> keys
>  are available at
> https://dist.apache.org/repos/dist/release/incubator/
>  airflow/ . It is tagged with a local version “apache.incubating” so it
>  allows upgrading from earlier releases. This should be considered of
>  release quality, but not yet officially vetted as a release yet.
> 
>  Issues fixed:
>  * Use static nvd3 and d3
>  * Python 3 incompatibilities
>  * CLI API trigger dag issue
> 
>  As the difference between beta 5 and the release candidate is
> >> relatively
>  small I hope to start the VOTE for releasing 1.8.0 quite soon (2
> >> days?),
> >>> if
>  the vote passes also a vote needs to happen at the IPMC mailinglist.
> As
>  this is our first Apache release I expect some comments and required
>  changes and probably a RC 2.
> 
>  Furthermore, we now have a “v1-8-stable” branch. This has version
>  “1.8.0rc1” and will graduate to “1.8.0” when we release. The
> >> “v1-8-test”
>  branch now has version “1.8.1alpha0” as version and “master” has
> >> version
>  “1.9.0dev0”. Note that “v1-8-stable” is now closed. This means that,
> >> per
>  release guidelines, patches accompanied with an ASSIGNED Jira and a
>  sign-off from a committer. Only then the release manager applies the
> >>> patch
>  to stable (In this case that would be me). The release manager then
> >>> closes
>  the bug when the patches have landed in the appropriate branches. For
> >>> more
>  information please see: https://cwiki.apache.org/
>  confluence/display/AIRFLOW/Airflow+Release+Planning+and+
>  Supported+Release+Lifetime   confluence/display/AIRFLOW/Airflow+Release+Planning+and+
>  Supported+Release+Lifetime> .
> 
>  Any questions or suggestions don’t hesitate to ask!
> 
>  Cheers
>  Bolke
> >>>
> >>
>


Re: Flow-based Airflow?

2017-02-02 Thread Jeremiah Lowin
Very good point -- however I'm hesitant to overcomplicate the base class.
At the moment users only have to override "serialize()" and "deserialize()"
to build any form of remote-backed dataflow, and I like the simplicity of
that.

However, if you look at my implementation of the GCSDataflow, the
constructor gets passed serializer and deserializer functions that are
applied to the data before storage and after recovery. I think that sort of
runtime-configurable serialization is in the spirit of what you're
describing and it should be straightforward to adapt it for more specific
requirements.
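
As a rough illustration only (class and method names are assumed from the
description above, and the S3 flavor follows Laura's suggestion below; this
is not the PR's exact API), a remote-backed Dataflow with pluggable
serialization might look like:

import json
import boto3

class BaseDataflow(object):
    # Stand-in for the base class: subclasses override serialize() and
    # deserialize() to control where and how upstream results are stored.
    def serialize(self, data):
        raise NotImplementedError

    def deserialize(self, ref):
        raise NotImplementedError

class S3Dataflow(BaseDataflow):
    def __init__(self, bucket, key,
                 serializer=json.dumps, deserializer=json.loads):
        self.bucket = bucket
        self.key = key
        self.serializer = serializer      # applied before storage
        self.deserializer = deserializer  # applied after recovery
        self.s3 = boto3.client("s3")

    def serialize(self, data):
        body = self.serializer(data).encode("utf-8")
        self.s3.put_object(Bucket=self.bucket, Key=self.key, Body=body)
        return {"bucket": self.bucket, "key": self.key}

    def deserialize(self, ref):
        obj = self.s3.get_object(Bucket=ref["bucket"], Key=ref["key"])
        return self.deserializer(obj["Body"].read().decode("utf-8"))

Swapping in a different serializer (say, a CSV or YAML dumper) would then
change only the parsing, not the IO or dependency handling.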

On Thu, Feb 2, 2017 at 12:37 PM Laura Lorenz 
wrote:

> This is great!
>
> We work with a lot of external data in wildly non-standard formats so
> another enhancement here we'd use and support is passing customizable
> serializers to Dataflow subclasses. This would let the dataflows keyword
> arg for a task handle dependency management, the Dataflow class or
> subclasses handle IO, and the Serializer subclasses handle parsing.
>
> Happy to contribute here, perhaps to create an S3Dataflow subclass in the
> style of your Google Cloud storage one for this PR.
>
> Laura
>
> On Wed, Feb 1, 2017 at 6:14 PM, Jeremiah Lowin  wrote:
>
> > Great point. I think the best solution is to solve this for all XComs by
> > checking object size before adding it to the DB. I don't see a built in
> way
> > of handling it (though apparently MySQL is internally limited to 64kb).
> > I'll look into a PR that would enforce a similar limit for all databases.
> >
> > On Wed, Feb 1, 2017 at 4:52 PM Maxime Beauchemin <
> > maximebeauche...@gmail.com>
> > wrote:
> >
> > I'm not sure about XCom being the default, it seems pretty dangerous. It
> > just takes one person that is not fully aware of the size of the data, or
> > one day with an outlier and that could put the Airflow db in jeopardy.
> >
> > I guess it's always been an aspect of XCom, and it could be good to have
> > some explicit gatekeeping there regardless of this PR/feature. Perhaps
> the
> > DB itself has protection against large blobs?
> >
> > Max
> >
> > On Wed, Feb 1, 2017 at 12:42 PM, Jeremiah Lowin 
> wrote:
> >
> > > Yesterday I began converting a complex script to a DAG. It turned out
> to
> > be
> > > a perfect test case for the dataflow model: a big chunk of data moving
> > > through a series of modification steps.
> > >
> > > So I have built an extensible dataflow extension for Airflow on top of
> > XCom
> > > and the existing dependency engine:
> > > https://issues.apache.org/jira/browse/AIRFLOW-825
> > > https://github.com/apache/incubator-airflow/pull/2046 (still waiting
> for
> > > tests... it will be quite embarrassing if they don't pass)
> > >
> > > The philosophy is simple:
> > > Dataflow objects represent the output of upstream tasks. Downstream
> tasks
> > > add Dataflows with a specific key. When the downstream task runs, the
> > > (optionally indexed) upstream result is available in the downstream
> > context
> > > under context['dataflows'][key]. In addition, PythonOperators receive
> the
> > > data as a keyword argument.
> > >
> > > The basic Dataflow serializes the data through XComs, but is trivially
> > > extended to alternative storage via subclasses. I have provided (in
> > > contrib) implementations of a local filesystem-based Dataflow as well
> as
> > a
> > > Google Cloud Storage dataflow.
> > >
> > > Laura, I hope you can have a look and see if this will bring some of
> your
> > > requirements in to Airflow as first-class citizens.
> > >
> > > Jeremiah
> > >
> >
>


Re: Flow-based Airflow?

2017-02-01 Thread Jeremiah Lowin
Great point. I think the best solution is to solve this for all XComs by
checking object size before adding it to the DB. I don't see a built in way
of handling it (though apparently MySQL is internally limited to 64kb).
I'll look into a PR that would enforce a similar limit for all databases.
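
A minimal sketch of that kind of gatekeeping (the limit and the helper are
assumptions for discussion, not the eventual PR):

import pickle

# Assumed cap, in line with the MySQL blob limit mentioned above.
MAX_XCOM_SIZE_BYTES = 64 * 1024

def check_xcom_size(value):
    # Would run just before an XCom value is written to the metadata DB.
    size = len(pickle.dumps(value))
    if size > MAX_XCOM_SIZE_BYTES:
        raise ValueError(
            "XCom value is %d bytes, over the %d byte limit; store the "
            "data externally and pass a reference instead."
            % (size, MAX_XCOM_SIZE_BYTES))
    return value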

On Wed, Feb 1, 2017 at 4:52 PM Maxime Beauchemin 
wrote:

I'm not sure about XCom being the default, it seems pretty dangerous. It
just takes one person that is not fully aware of the size of the data, or
one day with an outlier and that could put the Airflow db in jeopardy.

I guess it's always been an aspect of XCom, and it could be good to have
some explicit gatekeeping there regardless of this PR/feature. Perhaps the
DB itself has protection against large blobs?

Max

On Wed, Feb 1, 2017 at 12:42 PM, Jeremiah Lowin  wrote:

> Yesterday I began converting a complex script to a DAG. It turned out to
be
> a perfect test case for the dataflow model: a big chunk of data moving
> through a series of modification steps.
>
> So I have built an extensible dataflow extension for Airflow on top of
XCom
> and the existing dependency engine:
> https://issues.apache.org/jira/browse/AIRFLOW-825
> https://github.com/apache/incubator-airflow/pull/2046 (still waiting for
> tests... it will be quite embarrassing if they don't pass)
>
> The philosophy is simple:
> Dataflow objects represent the output of upstream tasks. Downstream tasks
> add Dataflows with a specific key. When the downstream task runs, the
> (optionally indexed) upstream result is available in the downstream
context
> under context['dataflows'][key]. In addition, PythonOperators receive the
> data as a keyword argument.
>
> The basic Dataflow serializes the data through XComs, but is trivially
> extended to alternative storage via subclasses. I have provided (in
> contrib) implementations of a local filesystem-based Dataflow as well as a
> Google Cloud Storage dataflow.
>
> Laura, I hope you can have a look and see if this will bring some of your
> requirements in to Airflow as first-class citizens.
>
> Jeremiah
>


Re: Flow-based Airflow?

2017-02-01 Thread Jeremiah Lowin
Yesterday I began converting a complex script to a DAG. It turned out to be
a perfect test case for the dataflow model: a big chunk of data moving
through a series of modification steps.

So I have built an extensible dataflow extension for Airflow on top of XCom
and the existing dependency engine:
https://issues.apache.org/jira/browse/AIRFLOW-825
https://github.com/apache/incubator-airflow/pull/2046 (still waiting for
tests... it will be quite embarrassing if they don't pass)

The philosophy is simple:
Dataflow objects represent the output of upstream tasks. Downstream tasks
add Dataflows with a specific key. When the downstream task runs, the
(optionally indexed) upstream result is available in the downstream context
under context['dataflows'][key]. In addition, PythonOperators receive the
data as a keyword argument.

The basic Dataflow serializes the data through XComs, but is trivially
extended to alternative storage via subclasses. I have provided (in
contrib) implementations of a local filesystem-based Dataflow as well as a
Google Cloud Storage dataflow.
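
A rough usage sketch of that philosophy is below; the `dataflows` argument
and the way the upstream task is referenced are assumptions based on the
description above, and may not match the PR's exact API:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract(**context):
    # The return value becomes this task's dataflow output.
    return {"rows": [1, 2, 3]}

def transform(raw=None, **context):
    # The upstream result arrives as a keyword argument, and is also
    # available under context['dataflows']['raw'].
    return [r * 2 for r in raw["rows"]]

dag = DAG("dataflow_sketch", start_date=datetime(2017, 1, 1))

extract_task = PythonOperator(
    task_id="extract", python_callable=extract,
    provide_context=True, dag=dag)

transform_task = PythonOperator(
    task_id="transform", python_callable=transform,
    provide_context=True, dag=dag,
    # Hypothetical wiring: attaching the upstream output under the key
    # 'raw' also creates the task dependency implicitly.
    dataflows={"raw": extract_task})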

Laura, I hope you can have a look and see if this will bring some of your
requirements in to Airflow as first-class citizens.

Jeremiah


Re: Airflow + Celery + SQS

2017-01-30 Thread Jeremiah Lowin
Jason,

I don't believe Airflow cares about Celery's backend as long as the task
API remains the same. You should be OK (though I haven't tested to
confirm).

J

On Sat, Jan 28, 2017 at 5:09 PM Jason Chen 
wrote:

> Hi Airflow team,
>
> Celery 4 supports AWS SQS
>  http://docs.celeryproject.org/en/latest/getting-started/brokers/sqs.html
>
> We are using Airflow 1.7.1.3
> Is there any problem, if we change config to use SQS for CeleryExecutor ?
>
> Thanks.
>
> Jason
>


Re: Flow-based Airflow?

2017-01-26 Thread Jeremiah Lowin
rently bad, but I feel that the dependency is redundant. Of
course, the additional checks could be somehow encoded in the dependency,
but it does not feel as clean to me, especially if the data quality check
is resource intensive.

Here my dependency is not so much on the data being available as it is on
the data being of the quality I need.

Best,
Arthur

On Jan 25, 2017 7:09 AM, "Jeremiah Lowin"  wrote:

> At the simplest level, a data-dependency should just create an automatic
> task-dependency (since a task shouldn't run before required data is
> available). Therefore it should be possible to reason about dataflow using
> the existing dependency framework.
>
> Is there any reason that wouldn't hold for all dataflow scenarios?
>
> Then the only differentiation becomes whether a task-dependency was
defined
> explicitly by a user or implicitly by a data-dependency.
>
> On Tue, Jan 24, 2017 at 11:23 AM Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
> > I'm happy working on a design doc. I don't think Sankeys are the way to
> go
> > as they are typically used to show some metric (say number of users
> flowing
> > through pages on a website), and even if we'd have something like row
> count
> > throughout I don't think we'd want to make it that centric to the
> > visualization.
> >
> > I think good old graphs are where it's at. Either overloading the
current
> > graph view with extra options (current view untouched, current view +
> > lineage (a graph where nodes are tasks or data objects,  data objects
> have
> > a different shape), lineage only view).
> >
> > On Mon, Jan 23, 2017 at 11:16 PM, Gerard Toonstra 
> > wrote:
> >
> > > data lineage is one of the things you mentioned in an early
> presentation
> > > and I was wondering about it.
> > >
> > > I wouldn't mind setting up an initial contribution towards achieving
> > that,
> > > but would like to understand
> > > the subject a bit better. The easiest MVP is to use the annotations
> > method
> > > to simply show how
> > > data flows, but you mention other things that need to be done in the
> > third
> > > paragraph. If a wiki could
> > > be written on the subject, explaining why those things are done, we
can
> > set
> > > up a discussion and
> > > create an epic with jira issues to realize that.
> > >
> > > The way I think this can be visualized is perhaps through a sankey
> > diagram,
> > > which helps to make
> > > complex systems more understandable, eg:
> > > - how is transaction margin calculated?  What is all the source data?
> > > - where does customer data go to and are those systems compliant?
> > > - what is the overall data dependency between systems and can these be
> > > reduced?
> > > - which data gets used everywhere?
> > > - which end systems consume from the most diverse sources of data?
> > >
> > > and other questions appropriate for data lineage.
> > >
> > > Rgds,
> > >
> > > Gerard
> > >
> > >
> > > On Tue, Jan 24, 2017 at 2:04 AM, Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > > > A few other thoughts related to this. Early on in the project, I had
> > > > designed but never launched a feature called "data lineage
> annotations"
> > > > allowing people to define a list of sources, and a list of targets
> > > related
> > > > to a each task for documentation purposes. My idea was to use a
> simple
> > > > annotation string that would uniquely map to a data object. Perhaps
a
> > URI
> > > > as  in `{connection_type}://{conn_id}/{something_unique}` or
> something
> > > to
> > > > that effect.
> > > >
> > > > Note that operators could also "infer" lineage based on their input
> > > > (HiveOperator could introspect the HQL statement to figure out input
> > and
> > > > outputs for instance), and users could override the inferred lineage
> if
> > > so
> > > > desired, either to abstract complexity like temp tables and such, to
> > > > correct bad inference (SQL parsing is messy), or in cases where
> > operators
> > > > wouldn't implement the introspection functions.
> > > >
> > > > Throw a `data_object_exist(data_object_uri)` and a
> > > > `clear_data_object(data_object_uri)` method in existing hooks, an

Re: Flow-based Airflow?

2017-01-25 Thread Jeremiah Lowin
At the simplest level, a data-dependency should just create an automatic
task-dependency (since a task shouldn't run before required data is
available). Therefore it should be possible to reason about dataflow using
the existing dependency framework.

Is there any reason that wouldn't hold for all dataflow scenarios?

Then the only differentiation becomes whether a task-dependency was defined
explicitly by a user or implicitly by a data-dependency.

On Tue, Jan 24, 2017 at 11:23 AM Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> I'm happy working on a design doc. I don't think Sankeys are the way to go
> as they are typically used to show some metric (say number of users flowing
> through pages on a website), and even if we'd have something like row count
> throughout I don't think we'd want to make it that centric to the
> visualization.
>
> I think good old graphs are where it's at. Either overloading the current
> graph view with extra options (current view untouched, current view +
> lineage (a graph where nodes are tasks or data objects,  data objects have
> a different shape), lineage only view).
>
> On Mon, Jan 23, 2017 at 11:16 PM, Gerard Toonstra 
> wrote:
>
> > data lineage is one of the things you mentioned in an early presentation
> > and I was wondering about it.
> >
> > I wouldn't mind setting up an initial contribution towards achieving
> that,
> > but would like to understand
> > the subject a bit better. The easiest MVP is to use the annotations
> method
> > to simply show how
> > data flows, but you mention other things that need to be done in the
> third
> > paragraph. If a wiki could
> > be written on the subject, explaining why those things are done, we can
> set
> > up a discussion and
> > create an epic with jira issues to realize that.
> >
> > The way I think this can be visualized is perhaps through a sankey
> diagram,
> > which helps to make
> > complex systems more understandable, eg:
> > - how is transaction margin calculated?  What is all the source data?
> > - where does customer data go to and are those systems compliant?
> > - what is the overall data dependency between systems and can these be
> > reduced?
> > - which data gets used everywhere?
> > - which end systems consume from the most diverse sources of data?
> >
> > and other questions appropriate for data lineage.
> >
> > Rgds,
> >
> > Gerard
> >
> >
> > On Tue, Jan 24, 2017 at 2:04 AM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> > > A few other thoughts related to this. Early on in the project, I had
> > > designed but never launched a feature called "data lineage annotations"
> > > allowing people to define a list of sources, and a list of targets
> > related
> > > to each task for documentation purposes. My idea was to use a simple
> > > annotation string that would uniquely map to a data object. Perhaps a
> URI
> > > as  in `{connection_type}://{conn_id}/{something_unique}` or something
> > to
> > > that effect.
> > >
> > > Note that operators could also "infer" lineage based on their input
> > > (HiveOperator could introspect the HQL statement to figure out input
> and
> > > outputs for instance), and users could override the inferred lineage if
> > so
> > > desired, either to abstract complexity like temp tables and such, to
> > > correct bad inference (SQL parsing is messy), or in cases where
> operators
> > > wouldn't implement the introspection functions.
> > >
> > > Throw a `data_object_exist(data_object_uri)` and a
> > > `clear_data_object(data_object_uri)` method in existing hooks, and a
> > > `BaseOperator.use_target_presence_as_state=False` boolean and some
> > > handling
> > > of it in the dependency engine and while "clearing" and we're not too far
> > from
> > > a solution.
> > >
> > > As a more generic alternative, potentially task states could be handled
> > by
> > > a callback when so-desired. For this, all we'd need to do is to add a
> > > `status_callback(dag, task, task_instance)` callback to BaseOperator,
> and
> > > evaluate it for state in place of the database state where users
> specify.
> > >
> > > Max
> > >
> > > On Mon, Jan 23, 2017 at 12:23 PM, Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > > > Just commented on the blog post:
> > > >
> > > > 
> > > > I agree that workflow engines should expose a way to document data
> > > objects
> > > > it reads from and writes to, so that it can be aware of the full
> graph
> > of
> > > > tasks and data objects and how it all relates. This metadata allows
> for
> > > > clarity around data lineage and potentially deeper integration with
> > > > external systems.
> > > > Now there's the question of whether the state of a workflow should be
> > > > inferred based on the presence or absences of related targets. For
> this
> > > > specific question I'd argue that the workflow engine needs to manage
> > its
> > > > own state internally. Here are a few reasons why: * many main

Re: Flow-based Airflow?

2017-01-23 Thread Jeremiah Lowin
This is a very interesting discussion. Laura, very cool extension in
fileflow! I haven't delved too much into the actual code yet but the docs
give a great overview.

Some thoughts:

For "small" data, Airflow already supports a dataflow setup through the
XCom mechanism. It is considerably more cumbersome to set up than "vanilla"
Airflow, but with a little syntactical sugar we could easily allow users to
solve what I believe are the two largest blocks to coding in this style:
  1. Easily specify data dependencies from upstream tasks (in other words,
automating xcom_pull so data objects are immediately available as function
parameters or in the Operator context/template. I believe xcom_push is
already automated in a simple fashion if anything is returned from an
Operator).
  2. Tie task failure to the contents of that task's XComs. (The task might
appear to succeed (in a traditional Airflow sense) but would nonetheless
fail if (for example) it pushed an empty XCom).
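
For concreteness, here is a rough sketch (task names and payload invented) of
what that XCom wiring looks like today -- exactly the boilerplate that points
1 and 2 would absorb:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG('xcom_dataflow_sketch', start_date=datetime(2017, 1, 1),
          schedule_interval='@daily')


def produce(**kwargs):
    # Returning a value is an implicit xcom_push for this task.
    return {'rows': 42}


def consume(**kwargs):
    # Today the downstream task has to pull the data out of XCom by hand.
    payload = kwargs['ti'].xcom_pull(task_ids='produce')
    if not payload:
        # cf. point 2: tie task failure to the pushed contents
        raise ValueError('upstream pushed no data')


producer = PythonOperator(task_id='produce', python_callable=produce,
                          provide_context=True, dag=dag)
consumer = PythonOperator(task_id='consume', python_callable=consume,
                          provide_context=True, dag=dag)
producer.set_downstream(consumer)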

For "large" data, the XCom system breaks down and something like Laura's
fileflow is needed, simply because the Airflow DB wasn't designed as a blob
store.

If we would like to make this a first-class mode for Airflow, here are some
starting points:
  1. The two modifications I described above, essentially lowering the
barrier to using XComs to convey data between Operators
  2. Borrowing from fileflow, providing a configurable backend for XComs --
GCS or S3 (or another database entirely). For example, a GCS-backed XCom
would use the builtin XCom mechanism to pass a gcs:// URI between tasks,
and automatically serialize or unpack the blob at that location on demand.
  3. One difference to the fileflow setup: we want to avoid using the
filesystem since there is no guarantee tasks are being run on the same
machine

--J


On Mon, Jan 23, 2017 at 2:05 PM Van Klaveren, Brian N. <
b...@slac.stanford.edu> wrote:

> I can give some insight from the physics world as far as this goes.
>
> First off, I think the dataflow puck is moving to platforms like Apache
> Beam. The main reason people (in science) don't just use Beam would be
> because they don't control the clusters they execute on. This is almost
> always true for science project using grid resources.
>
> This model is often desirable for scientific grid-based processing systems
> which are inherently decentralized and involve the staging in and out of
> data, since the execution environment is inherently sandboxed. These are
> often integrated with other Grid frameworks (e.g. DIRAC
> http://diracgrid.org/, PegasusWMS) which have their own data catalog
> integrated into them to aid with data movement and staging, or sometimes
> they'll use another system for that management (iRODS, https://irods.org/).
> In many of those cases, you deal with logical file handles as the
> inputs/outputs, but the file management systems also own the data. I've
> done some work in this space as well, in that I've written a file replica
> management system (github.com/slaclab/datacat<
> http://github.com/slaclab/datacat>), but in this model, the system is
> just a global metadata database about file replicas, and doesn't "own" the
> file/data. Often a requirement of these systems is also the need to support
> data versions and processing provenance.
>
> The nice thing about the dataflow is obviously the declarative data
> products. The messy thing is dealing with the data movement, especially if
> it's not mandatory (e.g. single datacenter and/or shared network disk). I'm
> partial towards the procedural nature of data movement using downstream
> processes and a DAG, but part of that is because I've had to deal with
> finicky File Transfer Nodes at different computing facilities to take full
> advantage of bandwidth available for file transfers. In most of the physics
> world (e.g. CERN), they also use dCache (https://www.dcache.org/) or
> xrootd (http://xrootd.org/) to aid in data movement, though some of the
> frameworks also support this natively (like DIRAC, as mentioned).
>
> I will say that a frustrating thing about many of these tools and
> frameworks I've mentioned is that they are often hard to use à la
> carte.
>
> Brian
>
>
> On Jan 23, 2017, at 8:05 AM, Bolke de Bruin  bdbr...@gmail.com>> wrote:
>
> Hi All,
>
> I came by a write up of some of the downsides in current workflow
> management systems like Airflow and Luigi (
> http://bionics.it/posts/workflows-dataflow-not-task-deps) where they
> argue dependencies should be between inputs and outputs of tasks rather
> than between tasks (inlets/outlets).
>
> They extended Luigi (https://github.com/pharmbio/sciluigi) to do this and
> even published a scientific paper on it:
> http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6 .
>
> I kind of like the idea, has anyone played with it, any thoughts? I might
> want to try it in Airflow.
>
> Bolke
>
>


Re: How to learn more about deprecation warnings?

2017-01-20 Thread Jeremiah Lowin
Hi Laura,

The error is raised if an unused argument is passed to BaseOperator --
basically if there is anything in either args or kwargs. The original issue
was that in a number of cases arguments were misspelled or misused by
Operator subclasses and instead of raising an error, they were just passed
up the inheritance chain and finally (silently) absorbed by BaseOperator,
so there was no warning.

I think a workaround should be straightforward -- when you call
super().__init__ for the BaseOperator, just pass arguments explicitly
rather than with args/kwargs, or (alternatively), pop arguments out of
kwargs when you use them ahead of calling that __init__.
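
For example, a minimal sketch of that second workaround, assuming
DivePythonOperator subclasses PythonOperator (adjust to however fileflow
actually defines it):

from airflow.operators.python_operator import PythonOperator


class DivePythonOperator(PythonOperator):
    def __init__(self, *args, **kwargs):
        # Pop the custom argument so nothing "unused" reaches BaseOperator
        # and the PendingDeprecationWarning is never triggered.
        self.data_dependencies = kwargs.pop('data_dependencies', {})
        super(DivePythonOperator, self).__init__(*args, **kwargs)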

On Thu, Jan 19, 2017 at 10:23 AM Laura Lorenz 
wrote:

> Hi! Is there a way to determine the rationale behind deprecation warnings?
> In particular I'm interested in the following:
>
>
> /Users/llorenz/Envs/fileflow/lib/python2.7/site-packages/airflow/models.py:1719:
> > PendingDeprecationWarning: Invalid arguments were passed to
> > DivePythonOperator. Support for passing such arguments will be dropped in
> > Airflow 2.0. Invalid arguments were:
> >
> > *args: ()
> >
> > **kwargs: {'data_dependencies': {'something': 'write_a_file'}}
> >
> >   category=PendingDeprecationWarning
> >
>
> Our home grown plugin fileflow depends on this capability so I'd like to
> get more information about how it will be changing to see if I can
> anticipate a workaround to support airflow 2.0.
>
> Thanks!
>
> Laura
>


Re: Refactoring Connection

2017-01-09 Thread Jeremiah Lowin
I think this sounds like a great idea, though having not had to set up
connections in a while it's a little hard for me to picture (though I do
remember the pain of figuring out the very first Google Cloud connectors).
Could you provide a little example?

For example, would I just define a connection as a string, like
"mongo://user@password/h.o.s.t:port/database" and trust a plugin to
register as a "mongohandler" and turn that URI into a connection? To Alex's
point, perhaps more complex needs could be handled as query parameters
(today they are just handwritten JSON in "extras", so that's not much of a
change): "gcp://project?keyfile=path/to/keyfile" Just shooting out some
ideas...

Thanks!

On Mon, Jan 9, 2017 at 3:44 PM George Leslie-Waksman
 wrote:

Could registering new types be handled through the plugin infrastructure?

On Mon, Jan 9, 2017 at 5:14 AM Alex Van Boxel  wrote:

> I was actually going to propose something different with entry-points, but
> your requirement beat me to it (but that's ok :-). Actually I think with
> this mechanism people would be able to extend Airflow's connection mechanism
> (and later other stuff) by doing *pip install airflow-sexy-new-connection*
> (for example).
>
> On Mon, Jan 9, 2017 at 1:39 PM Gael Magnan  wrote:
>
> > Thank you for the read, I'm gonna look at it, it's probably gonna be
> better
> > that what I have.
> >
> > Point taken about the URI, I'll see if i can find something generic
> enough
> > to handle all those cases.
> >
> > Le lun. 9 janv. 2017 à 13:36, Alex Van Boxel  a écrit
> :
> >
> > > Thanks a lot, yes it clarifies a lot and I do agree you really need to
> > hack
> > > inside Airflow to add a Connection type. While you're working at this
> > could
> > > you have a look at the standard python *entry-point mechanism* for
> > > registering Connection types/components.
> > >
> > > A good read on this:
> > >
> > >
> >
>
http://docs.pylonsproject.org/projects/pylons-webframework/en/latest/advanced_pylons/entry_points_and_plugins.html
> > >
> > > My first though would be that just by adding an entry to the factory
> > method
> > > would be enough to register your Connection + ConnectionType and UI.
> > >
> > > Also note that not everything works with a URI. The Google Cloud
> > Connection
> > > doesn't have one, it uses a secret key file stored on disk, so don't
> > force
> > > every connection type to work with URI's.
> > >
> > >
> > >
> > > On Mon, Jan 9, 2017 at 1:15 PM Gael Magnan 
> wrote:
> > >
> > > > Yes sure,
> > > >
> > > > The question was the following:
> > > > "I was looking at the code of the connections, and I realized you
> can't
> > > > easily add a connection type without modifying the airflow code
> > source. I
> > > > wanted to create a mongodb connection type, but I think the best
> > approach
> > > > would be to refactor connections first. Thoughts anyone?"
> > > >
> > > > The answer of Bolke de Bruin was: "making it more generic would be
> > > > appreciated"
> > > >
> > > > So basically, the way the code is set up, every type of
> > > connection
> > > > that exists is defined through a list in the Connection class. It
> > implements
> > > > exactly the same code for parsing a URI to get connection info and
> > doesn't
> > > > allow for a simple way to get back the URI from the connection
info.
> > > >
> > > > I need to add a mongodb connection and a way to get it back as a
uri,
> > so
> > > I
> > > > could use another type of connection and play around with that or
> > just
> > > > add one more hard-coded connection type, but I thought this might be
> > > > something that comes back regularly and having a simple way to plug
> in
> > > new
> > > > types of connection would make it easier for anyone to contribute a
> new
> > > > connection type.
> > > >
> > > > Hope this clarifies my proposal.
> > > >
> > > > Le lun. 9 janv. 2017 à 12:46, Alex Van Boxel  a
> > écrit
> > > :
> > > >
> > > > > Hey Gael,
> > > > >
> > > > > could you please recap the question here and provide some context.
> > Not
> > > > > everyone on the mailinglist is actively following Gitter,
including
> > me.
> > > > > With some context it would be easier to give feedback. Thanks.
> > > > >
> > > > > On Mon, Jan 9, 2017 at 11:15 AM Gael Magnan 
> > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > following my question on gitter the other day and the response
> from
> > > > Bolke
> > > > > > de Bruin, I've started working on refactoring the connections in
> > > > airflow.
> > > > > >
> > > > > > Before submitting a PR I wanted to share my proposal with you
and
> > get
> > > > > > feedbacks.
> > > > > >
> > > > > > The idea is quite simple, I've divided the Connection class in
> two,
> > > > > > Connection and ConnectionType, connection has the same interface
> it
> > > had
> > > > > > before plus a few methods, but the class keeps a reference to a
> > > > > dictionary
> > > > > > of registered ConnectionType. It delegates the work of parsing
> from
> > 

Re: Airflow 1.8.0 alpha 3

2017-01-09 Thread Jeremiah Lowin
Good stuff Bolke -- I was in the process of spinning up mysql because I
hadn't observed it in my normal test env (postgres) and that was the only
variable left [I think]. Let me know if you see it re-emerge.

On Mon, Jan 9, 2017 at 4:20 PM Bolke de Bruin  wrote:

> Hey Jeremiah,
>
> I probably have fixed the issue: see
> https://github.com/apache/incubator-airflow/pull/1978 . It seems to be an
> issue in how MySQL does locking. See also:
> http://stackoverflow.com/questions/23615641/why-am-i-getting-deadlock-in-mysql
>
> Please review,
>
> Bolke
>
> On 7 Jan 2017, at 20:14, Bolke de Bruin  wrote:
>
> Dear All,
>
> I have made Airflow 1.8.0 alpha 3 available. Again no Apache release yet -
> this is for testing purposes.
>
> https://people.apache.org/~bolke/  . Tarball is signed.
>
>
> Blockers:
>
> * XCom throws a duplicate / locking error (new - confirmed) -> Jeremiah
> is taking a look. https://issues.apache.org/jira/browse/AIRFLOW-738 .
>
>
> Fixed issues
> * Regression in email
> * LDAP case sensitivity
> * one_failed task not being run: now seems to pass suddenly (so fixed?) ->
> need to investigate why
> * Email attachments
> * Pinned jinja2 to < 2.9.0 (2.9.1 has a confirmed regression)
> * Improve time units for task performance charts
>
> Pending features:
> * DAG.catchup : minor changes needed, documentation still required,
> integration tests seem to pass flawlessly
> * Cgroups + impersonation: clean up of patches on going, more tests and
> more elaborate documentation required. Integration tests not executed yet
> * Schedule all pending DAG runs in a single scheduler loop: no progress
> (**)
> * Add execution_date to trigger_dag
>
>
> If the pending features are merged within a reasonable time frame (except
> for **, as no progress currently) then I am planning to mark the tarball as
> Beta and only allow bug fixes and (very) minor features. Hopefully end of
> next week.
>
> Have a good weekend!
> Bolke
>
>


Re: Airflow 1.8.0 alpha 2

2017-01-06 Thread Jeremiah Lowin
Bolke -- I will see if I can solve it

On Fri, Jan 6, 2017 at 3:46 PM, Bolke de Bruin  wrote:

> For the XCOM issue: https://issues.apache.org/jira/browse/AIRFLOW-738 <
> https://issues.apache.org/jira/browse/AIRFLOW-738>
>
> With example dag that exposes the issue. Would be nice if someone with
> XCOM exposure could take a look.
>
> - Bolke
>
> > On 6 Jan 2017, at 15:31, Bolke de Bruin  wrote:
> >
> > State of to be Alpha 3:
> >
> > Blockers:
> >
> > * XCom throws a duplicate / locking error (new - confirmed)
> > * one_failed task not being run: now seems to pass suddenly (so fixed?)
> -> need to investigate why
> >
> > Fixed issues
> > * Regression in email
> > * LDAP case sensitivity
> >
> > Pending features:
> > * DAG.catchup : minor changes needed, documentation still required,
> integration tests seem to pass flawlessly
> > * Cgroups + impersonation: clean up of patches on going, more tests and
> more elaborate documentation required. Integration tests not executed yet
> > * Schedule all pending DAG runs in a single scheduler loop: no progress
> (**)
> > * Email attachments
> > * Add execution_date to trigger_dag
> >
> > If the pending features are merged within a reasonable time frame
> (except for **, as no progress currently) then I am planning to mark the
> tarball as Beta and only allow bug fixes and (very) minor features.
> Hopefully end of next week.
> >
> > Bolke.
> >
> >
> >
> >> On 5 Jan 2017, at 20:07, Chris Riccomini  wrote:
> >>
> >> I have merged Robin's LDAP patch.
> >>
> >> On Thu, Jan 5, 2017 at 10:39 AM, Chris Riccomini  >
> >> wrote:
> >>
> >>> I have found the email problem:
> >>>
> >>> https://issues.apache.org/jira/browse/AIRFLOW-734
> >>>
> >>> Working on a fix.
> >>>
> >>> On Thu, Jan 5, 2017 at 9:10 AM, Chris Riccomini  >
> >>> wrote:
> >>>
>  Also, I'm seeing a second issue:
> 
>  SMTP doesn't seem to work for us anymore:
> 
>  [2017-01-05 15:10:13,666] {models.py:1378} ERROR - SMTP AUTH
> extension not supported by server.
>  Traceback (most recent call last):
>  File "/usr/lib/python2.7/site-packages/airflow/models.py", line
> 1374, in handle_failure
>    self.email_alert(error, is_retry=False)
>  File "/usr/lib/python2.7/site-packages/airflow/models.py", line
> 1521, in email_alert
>    send_email(task.email, title, body)
>  File "/usr/lib/python2.7/site-packages/airflow/utils/email.py", line
> 43, in send_email
>    return backend(to, subject, html_content, files=files,
> dryrun=dryrun, cc=cc, bcc=bcc, mime_subtype=mime_subtype)
>  File "/usr/lib/python2.7/site-packages/airflow/utils/email.py", line
> 84, in send_email_smtp
>    send_MIME_email(SMTP_MAIL_FROM, recipients, msg, dryrun)
>  File "/usr/lib/python2.7/site-packages/airflow/utils/email.py", line
> 100, in send_MIME_email
>    s.login(SMTP_USER, SMTP_PASSWORD)
>  File "/usr/lib64/python2.7/smtplib.py", line 584, in login
>    raise SMTPException("SMTP AUTH extension not supported by server.")
>  SMTPException: SMTP AUTH extension not supported by server.
> 
>  This was working on 1.7.1.2. Having a look now.
> 
>  On Thu, Jan 5, 2017 at 9:09 AM, Chris Riccomini <
> criccom...@apache.org>
>  wrote:
> 
> > Hey Robin,
> >
> > Awesome, thanks! I love open source. :) Let me try your patch out and
> > merge it.
> >
> > Cheers,
> > Chris
> >
> > On Thu, Jan 5, 2017 at 1:34 AM, Miller, Robin <
> > robin.mil...@affiliate.oliverwyman.com> wrote:
> >
> >> Hi Chris,
> >>
> >>
> >> I think I ran into this issue when setting up LDAP Auth in our
> >> environment (we're using very close to master as we needed some of
> the
> >> newer features/bugfixes). The problem turned out to be that the
> search was
> >> finding no results, so the line:
> >>
> >>
> >> groups_list = [regex.search(i).group(1) for i in user_groups]
> >>
> >>
> >> would fail because it had no matching groups to return. This turned
> out
> >> to be because Windows Active Directory (the LDAP server we're using)
> >> returned capitals "CN=" where the code expected lowercase: regex =
> >> re.compile("cn=([^,]*).*")
> >>
> >>
> >> I haven't looked it up, but Windows Active Directory is case
> >> insensitive when it comes to usernames and groups, so I wouldn't be
> >> surprised if the protocol itself is case insensitive and both "cn="
> and
> >> "CN=" should be considered valid. As such I've a PR open for a
> simple fix
> >> to make this regex case insensitive: https://github.com/apache/incu
> >> bator-airflow/pull/1945
> >>
> >>
> >> Hopefully this helps,
> >>
> >> Robin Miller
> >> OLIVER WYMAN
> >> robin.mil...@affiliate.oliverwyman.com >> ffiliate.oliverwyman.com>
> >> www.oliverwyman.com
> >>
> >> 
>

Re: NYC Meetup?

2016-12-24 Thread Jeremiah Lowin
Feb 1 should work for me -- thanks for getting the ball rolling, Joe!

On Sat, Dec 24, 2016 at 12:23 PM Joseph Napolitano
 wrote:

> Hi all,
>
> I got the green light to host an Airflow meetup here at Blue Apron.  We're
> located on 23rd St. between 5th and 6th.  A precise date and time are TBD,
> but I would like to suggest Wednesday, February 1st @ 6:30pm.
>
> I'll get the ball rolling if there's enough interest for February 1st.  I
> was asked to provide a headcount as well, so I would like to create an
> official Meetup event.  I am curious what others think about creating the
> event on the existing Bay Area Meetup Group, despite being labeled Bay
> Area?
>
> Also, please reach out if you'd like to give a presentation.  We'll have a
> large screen setup w/ a projector and PA system.
>
> All the best,
> Joe Nap
>
>
> On Thu, Dec 22, 2016 at 2:34 PM, Chris Riccomini 
> wrote:
>
> > @Rob, gotta get that up on the "who uses airflow" README.md section! :)
> >
> > On Thu, Dec 22, 2016 at 9:52 AM, Rob Goretsky  >
> > wrote:
> >
> > > We at MLB Advanced Media (MLBAM / MLB.com) are just about to get our
> > first
> > > few Airflow processes into production, so we'd love to join an
> NYC-based
> > > meetup!
> > >
> > > -rob
> > >
> > >
> > > On Wed, Dec 21, 2016 at 9:49 AM, Jeremiah Lowin 
> > wrote:
> > >
> > > > It would be wonderful to have an east coast meetup! I would love to
> > join
> > > if
> > > > I can be in NY that day.
> > > >
> > > > Best,
> > > > Jeremiah
> > > >
> > > > On Tue, Dec 20, 2016 at 4:24 PM Patrick D'Souza <
> > > patrick.dso...@gmail.com>
> > > > wrote:
> > > >
> > > > > Having hosted a bunch of meetups in the past, we at
> SecurityScorecard
> > > are
> > > > > very interested in hosting an Airflow meetup as well. We can easily
> > > host
> > > > > around 50 people or so in January.
> > > > >
> > > > > On Fri, Dec 16, 2016 at 4:26 PM, Chris Riccomini <
> > > criccom...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > lol
> > > > > >
> > > > > > On Fri, Dec 16, 2016 at 11:04 AM, Joseph Napolitano <
> > > > > > joseph.napolit...@blueapron.com.invalid> wrote:
> > > > > >
> > > > > > > Auto-correct got me. Metopes = Meetups
> > > > > > >
> > > > > > > Metope - a square space between triglyphs in a Doric frieze.
> > > > > > >
> > > > > > > On Fri, Dec 16, 2016 at 2:03 PM, Joseph Napolitano <
> > > > > > > joseph.napolit...@blueapron.com> wrote:
> > > > > > >
> > > > > > > > We hosted several metopes here at Blue Apron.  I will bring
> it
> > up
> > > > to
> > > > > > our
> > > > > > > > administrative team and give an update.  Mid-january is
> > probably
> > > a
> > > > > good
> > > > > > > > target.
> > > > > > > >
> > > > > > > > - Joe
> > > > > > > >
> > > > > > > > On Thu, Dec 15, 2016 at 5:18 PM, Luke Ptz <
> > lukeptzc...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > >> Cool to see the interest is there! I unfortunately can't
> > offer a
> > > > > space
> > > > > > > for
> > > > > > > >> a meetup, can anyone else? If not could always be
> > informal/meet
> > > > in a
> > > > > > > >> public
> > > > > > > >> setting
> > > > > > > >>
> > > > > > > >> On Wed, Dec 14, 2016 at 7:08 PM, Andrew Phillips <
> > > > > andr...@apache.org>
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > We at Blue Apron would be very interested.
> > > > > > > >> >>
> > > > > > > >> >
> > > > > > > >> > Same here.
> > > > > > > >> >
> > > > > > > >> > ap
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > *Joe Napolitano *| Sr. Data Engineer
> > > > > > > > www.blueapron.com | 5 Crosby Street, New York, NY 10013
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > *Joe Napolitano *| Sr. Data Engineer
> > > > > > > www.blueapron.com | 5 Crosby Street, New York, NY 10013
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
>
> --
> *Joe Napolitano *| Sr. Data Engineer
> www.blueapron.com | 5 Crosby Street, New York, NY 10013
>


Re: Regarding getting the result of another task in your current task

2016-12-22 Thread Jeremiah Lowin
Could you provide a little more detail on the type of Operator you want to
use? It's certainly easiest with PythonOperator. However, the `ti` object
is also available in jinja templates, so for example you can reference it
in a BashOperator as (I believe) {{ ti.xcom_pull("t1") }}.
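
Something along these lines should work (untested sketch; it assumes t1
returned a value and that `dag` is your DAG object):

from airflow.operators.bash_operator import BashOperator

t2 = BashOperator(
    task_id='t2',
    bash_command="echo upstream said: {{ ti.xcom_pull(task_ids='t1') }}",
    dag=dag)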

On Thu, Dec 22, 2016 at 11:32 AM twinkle  wrote:

> Hi,
>
> I want to get the result of one task t1, inside task t2.
>
> I have set the result of t1, by returning a value, which is set for the key
> result_value.
>
> While using the PythonOperator as t2, i am able to get the value of the
> result of task t1.
> Code for doing this in PythonOperator is:
>
> pull = PythonOperator(
>     task_id='puller', dag=main_dag, provide_context=True,
>     python_callable=puller)
>
>
> def puller(**kwargs):
>     ti = kwargs['ti']
>     v1 = ti.xcom_pull(key=None, task_ids=[t1])
>     logging.info("v1: %s" % v1)
>
> How can i do that in any other operator?
> I tried, accessing 'ti' from context as well as kwargs, for both of them i
> get the following error:
>
> task_instance = context['ti']
>
> task_instance = context['ti']
> TypeError: string indices must be integers
>
>
> Any pointers?
>
> Regards,
> Twinkle
>


Re: NYC Meetup?

2016-12-21 Thread Jeremiah Lowin
It would be wonderful to have an east coast meetup! I would love to join if
I can be in NY that day.

Best,
Jeremiah

On Tue, Dec 20, 2016 at 4:24 PM Patrick D'Souza 
wrote:

> Having hosted a bunch of meetups in the past, we at SecurityScorecard are
> very interested in hosting an Airflow meetup as well. We can easily host
> around 50 people or so in January.
>
> On Fri, Dec 16, 2016 at 4:26 PM, Chris Riccomini 
> wrote:
>
> > lol
> >
> > On Fri, Dec 16, 2016 at 11:04 AM, Joseph Napolitano <
> > joseph.napolit...@blueapron.com.invalid> wrote:
> >
> > > Auto-correct got me. Metopes = Meetups
> > >
> > > Metope - a square space between triglyphs in a Doric frieze.
> > >
> > > On Fri, Dec 16, 2016 at 2:03 PM, Joseph Napolitano <
> > > joseph.napolit...@blueapron.com> wrote:
> > >
> > > > We hosted several metopes here at Blue Apron.  I will bring it up to
> > our
> > > > administrative team and give an update.  Mid-january is probably a
> good
> > > > target.
> > > >
> > > > - Joe
> > > >
> > > > On Thu, Dec 15, 2016 at 5:18 PM, Luke Ptz 
> > wrote:
> > > >
> > > >> Cool to see the interest is there! I unfortunately can't offer a
> space
> > > for
> > > >> a meetup, can anyone else? If not could always be informal/meet in a
> > > >> public
> > > >> setting
> > > >>
> > > >> On Wed, Dec 14, 2016 at 7:08 PM, Andrew Phillips <
> andr...@apache.org>
> > > >> wrote:
> > > >>
> > > >> > We at Blue Apron would be very interested.
> > > >> >>
> > > >> >
> > > >> > Same here.
> > > >> >
> > > >> > ap
> > > >> >
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > *Joe Napolitano *| Sr. Data Engineer
> > > > www.blueapron.com | 5 Crosby Street, New York, NY 10013
> > > >
> > >
> > >
> > >
> > > --
> > > *Joe Napolitano *| Sr. Data Engineer
> > > www.blueapron.com | 5 Crosby Street, New York, NY 10013
> > >
> >
>


Re: String formatting

2016-10-17 Thread Jeremiah Lowin
Indeed -- though I think the larger question from Luke is whether or not we
want to enforce a certain style of logging message (variable arguments vs
formatting the string itself). Since there's nothing to stop users from
formatting the string on one line and passing it to logging on the next,
not to mention that I don't really see the performance benefit, I think
this warning is silly and should be removed and people can log however they
feel most comfortable. Just my $0.02.
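
For reference, the two styles under discussion (the command string is just an
illustration):

import logging

cmd = 'echo hello'  # illustrative value
logging.info("Running command: %s", cmd)          # pylint-preferred: lazy interpolation
logging.info("Running command: {}".format(cmd))   # str.format: interpolated eagerly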

On Mon, Oct 17, 2016 at 10:20 AM Andrew Phillips 
wrote:

> Perhaps I stand corrected! -- though I don't see where it actually says
> this approach is preferred. In any case, the Python 3 docs explicitly
> state
> that the behavior is only maintained for backwards compatibility:
> https://docs.python.org/3/howto/logging.html#logging-variable-data

Ah, interesting! From [1], it appears that it's now possible to change
the placeholder syntax so that the messages look somewhat more "modern".
As far as I can tell, variable components of the log messages are still
intended to be passed as arguments, though?

So with the style change, the log message could look like this:

logging.info("A message with a param: {}", param)

Regards

ap

[1]
https://docs.python.org/3/howto/logging-cookbook.html#formatting-styles


Re: String formatting

2016-10-17 Thread Jeremiah Lowin
Perhaps I stand corrected! -- though I don't see where it actually says
this approach is preferred. In any case, the Python 3 docs explicitly state
that the behavior is only maintained for backwards compatibility:
https://docs.python.org/3/howto/logging.html#logging-variable-data


On Mon, Oct 17, 2016 at 10:10 AM Andrew Phillips 
wrote:

> > "Use % formatting in logging functions and pass the % parameters as
> > arguments"
> >
> > [...]
> >
> > Can anybody give a judgement on this? If the .format is preferred,
> > then we'll look into changing the landscape.io settings.
>
> As far as I can tell from the Python 2 logging docs [1] at least, the
> following is indeed the preferred approach:
>
> logging.info("Some message with a param: %s", param)
>
> One of the reasons is that the string representation of param does not
> need to be evaluated if the framework can determine that info logging is
> not enabled - this differs from using format, where the string has to be
> calculated irrespective of whether it is ever emitted.
>
> Regards
>
> ap
>
> [1] https://docs.python.org/2/library/logging.html
>


Re: String formatting

2016-10-17 Thread Jeremiah Lowin
Good question. Personally I hate this warning, I like new-style formatting
far more than "%"-style formatting (and in fact, what pylint wants is
neither -- it prefers something like `logging.info('my msg is %',
my_msg)`). The stated reason (
https://bitbucket.org/logilab/pylint/pull-requests/169/add-new-warning-logging-format/diff)
is that this defers the string interpolation until it is actually required
(if/when the logger is called) and therefore provides a performance
enhancement -- I find it incredibly hard to believe that this yields any
measurable improvement. In addition, I haven't found any PEP or other
official documentation that says this behavior is preferred (as opposed to
merely possible). I would support disabling the warning entirely.

J

On Mon, Oct 17, 2016 at 10:04 AM Maycock, Luke <
luke.mayc...@affiliate.oliverwyman.com> wrote:

> Hi Dev List,
>
> We're currently working on removing all of the new Landscape.io warnings
> from some of our code and we're noticing the following quite a lot:
>
> "Use % formatting in logging functions and pass the % parameters as
> arguments"
>
> This is being flagged up on lines such as:
>
> "logging.info("Running command:\n {}".format(hql))"
>
> We're just wondering whether it is indeed preferred to use the %
> parameters as suggested or whether this is an issue with the Landscape.io
> setup.
>
> Can anybody give a judgement on this? If the .format is preferred, then
> we'll look into changing the landscape.io settings.
>
> Thanks,
>
> Luke Maycock
> OLIVER WYMAN
> luke.mayc...@affiliate.oliverwyman.com luke.mayc...@affiliate.oliverwyman.com>
> www.oliverwyman.com
>
>
> 
> This e-mail and any attachments may be confidential or legally privileged.
> If you received this message in error or are not the intended recipient,
> you should destroy the e-mail message and any attachments or copies, and
> you are prohibited from retaining, distributing, disclosing or using any
> information contained herein. Please inform us of the erroneous delivery by
> return e-mail. Thank you for your cooperation.
>


Re: Cloud Provider grouping into Plugins

2016-10-14 Thread Jeremiah Lowin
One reason I do like the idea is that especially in contrib, Operators are
essentially self-documenting and the first clue is just the file name
('my_gcp_operators.py'). Since we no longer greedily import anything, you
have to know exactly what file to import to get the functionality you want.
Grouping them provides a gentler way to figure out what file does what
('GCP/storage_operators.py' vs 'GCP/bigquery_operators.py' vs
'docker_operators.py'). Sure, you could do this by enforcing a common name
standard ('GCP_storage_operators.py') but submodules mean you can
additionally take advantage of the common infrastructure that Alex
referenced. I think if we knew how many contrib modules we would have
today, we would have done this at the outset (even though it would have
looked like major overkill). Also, the previous import mechanism made
importing from submodules really hard; we don't have that issue anymore.

So I vote for this, but it will have to be done gently to avoid breaking
the existing GCP ones.
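
One gentle option -- purely a sketch, and the new module path here is
hypothetical -- is to leave a deprecation shim behind at the old contrib
location:

# old module: airflow/contrib/operators/bigquery_operator.py (hypothetical shim)
import warnings

from airflow.contrib.operators.gcp.bigquery_operator import BigQueryOperator  # noqa

warnings.warn(
    "airflow.contrib.operators.bigquery_operator has moved to "
    "airflow.contrib.operators.gcp.bigquery_operator",
    DeprecationWarning)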

On Fri, Oct 14, 2016 at 11:29 AM Alex Van Boxel  wrote:

> Talking about AWS, it would only make sense if other people would step up
> to do it for AWS, and even Azure (or don't we have Azure operators?).
>
> On Fri, Oct 14, 2016 at 5:25 PM Chris Riccomini 
> wrote:
>
> > What do others think? I know Sid is a big AWS user.
> >
> > On Fri, Oct 14, 2016 at 8:24 AM, Chris Riccomini 
> > wrote:
> > > Ya, if we go the deprecation route, and let them float around for a
> > > release or two, I'm OK with that (or until we bump major to 2.0).
> > >
> > > Other than that, it sounds like a good opportunity to clean things up.
> > > :) I do notice a lot of AWS/GCP code (e.g. the S3 Redshift operator).
> > >
> > > On Fri, Oct 14, 2016 at 8:16 AM, Alex Van Boxel 
> > wrote:
> > >> Well, I wouldn't touch the on that exist (maybe we could mark them
> > >> deprecated, but that's all). But I would move (copy) them together and
> > make
> > >> them consistent (example, let them all use the same default
> > connection_id,
> > >> ...). For a new user it's quite confusing I think due to different
> > reasons
> > >> (style, etc...) you know we have an old ticket: making gcp consistent
> (I
> > >> just don't want to start on this on, on fear of breaking something).
> > >>
> > >> On Fri, Oct 14, 2016 at 4:59 PM Chris Riccomini <
> criccom...@apache.org>
> > >> wrote:
> > >>
> > >> Hmm. What advantages would this provide? I'm a little nervous about
> > >> breaking compatibility. We have a bunch of DAGs which import all kinds
> > >> of GCP hooks and operators. Wouldn't want those to move.
> > >>
> > >> On Fri, Oct 14, 2016 at 7:54 AM, Alex Van Boxel 
> > wrote:
> > >>> Hi all,
> > >>>
> > >>> I'm starting to write some very exotic Operators that are a bit
> strange
> > >>> adding to contrib. Examples of this are:
> > >>>
> > >>> + See if a Compute snapshot of a disc is created
> > >>> + See if a string appears on the serial port of Compute instance
> > >>>
> > >>> but they would be a nice addition if we had a Google Compute plugin
> (or
> > >> any
> > >>> other cloud provider, AWS, Azure, ...). I'm not talking about getting
> > >> cloud
> > >>> support out of the main source tree. No, I'm talking about grouping
> > them
> > >>> together in a consistent part. We can even start adding macro's etc.
> > This
> > >>> would be a good opportunity to move all the GCP operators together,
> > making
> > >>> them consistent without breaking the existing operators that exist in
> > >>> *contrib*.
> > >>>
> > >>> Here are a few requirements that I think of:
> > >>>
> > >>>- separate folder ( example  /integration/googlecloud ,
> > >>> /integration/aws
> > >>>,  /integration/azure )
> > >>>- enable in config (don't want to load integrations I don't use)
> > >>>- based on Plugin (same interface)
> > >>>
> > >>> Thoughts?
> >
>


Re: Dynamic DAG Generation

2016-08-28 Thread Jeremiah Lowin
Airflow supports what we call "slowly-changing" dynamic DAGs.

Because the DAG files are frequently parsed, there is no issue in theory
with generating different DAGs depending on your needs at a given moment.
The issue is that you want to make sure that the same DAG is loaded by the
scheduler and the executor. For example, if the scheduler loads up the DAG
and sees a task called "random_task_012345", it will create an instruction
for the executor to run that task. Then if the executor loads up the DAG
and finds only "random_task_98765", it will cause a problem. So
"slowly-changing" means the DAG can change, but do it on a timescale that
won't cause conflicts when you're trying to run it.

One last note -- the web interface implicitly assumes that the DAG remains
constant over time. You may see odd behavior with mutating DAGs.
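
For example, something like this stays well within "slowly-changing", as long
as the configuration it reads doesn't churn between scheduler and worker
parses (all names here are invented):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def load_task_configs():
    # Stand-in for reading a config table or file.
    return [{'name': 'customer_a'}, {'name': 'customer_b'}]


def process(conf, **kwargs):
    return conf  # placeholder for the real work


dag = DAG('dynamic_sketch', start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

for cfg in load_task_configs():
    PythonOperator(
        task_id='process_{}'.format(cfg['name']),
        python_callable=process,
        op_kwargs={'conf': cfg},
        dag=dag)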

On Sat, Aug 27, 2016 at 8:04 PM Lance Norskog 
wrote:

> Yes! We use tables to generate some of our DAGs and it works well.  What's
> most important is to avoid block-copying code among DAGs.
>
> On Sat, Aug 27, 2016 at 12:05 PM, Vincent Poulain <
> vincent.poul...@tinyclues.com> wrote:
>
> > Hello,
> > First, thank you for your great job. Airflow made my job easier :)
> >
> > Is dynamic generation of DAGs a good practice? As WePay explained
> there:
> > https://wecode.wepay.com/posts/airflow-wepay.
> >
> > In part of a DAG, I have *n* ECS tasks that need to be launched daily, each
> one
> > with specific *params*. *n & params* depend on a configuration which is
> > updated daily.
> > For now, I have just one task fetching this configuration and internally
> > broadcasting ECS tasks.
> > I am wondering if Airflow would be suitable to generate DAG dynamically
> > (depending on configuration) in order to split this task into those
> > multiple ECS task (using ECSOperator, for instance).
> >
> > Thank you for your tips,
> >
>
>
>
> --
> Lance Norskog
> lance.nors...@gmail.com
> Redwood City, CA
>


Re: airflow does not like mysql - ProgrammingError: (mysql.connector.errors.ProgrammingError) Failed processing pyformat-parameters;

2016-08-22 Thread Jeremiah Lowin
We've seen this before... unfortunately as far as I can tell the discussion
(and resolution) was swallowed by the move from GitHub issues to JIRA!

IIRC, some mysql libraries have an explicit type check for strings and our
py2/py3 compatible string type fails. It looks like you're using
mysql.connector -- can you try with mysqlclient as described here:
https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls? I run
Airflow on Google Cloud SQL using the mysqldb dialect with no issue.
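
Concretely, that usually just means installing `mysqlclient` and switching the
dialect in airflow.cfg, e.g. (same credentials as in your config below,
different driver):

sql_alchemy_conn = mysql+mysqldb://root:xxx@localhost:3306/airflow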

Thanks,
J



On Sun, Aug 21, 2016 at 6:09 PM David Montgomery 
wrote:

> Hi,
>
>
>
> Below is my config.
>
> sql_alchemy_conn = mysql+mysqlconnector://root:xxx@localhost:3306/airflow
>
> when I run initdb tables are present
>
>
> mysql> use airflow;
> Reading table information for completion of table and column names
> You can turn off this feature to get a quicker startup with -A
>
> Database changed
> mysql> show tables;
> +---+
> | Tables_in_airflow |
> +---+
> | alembic_version   |
> | chart |
> | connection|
> | dag   |
> | dag_pickle|
> | dag_run   |
> | import_error  |
> | job   |
> | known_event   |
> | known_event_type  |
> | log   |
> | sla_miss  |
> | slot_pool |
> | task_instance |
> | users |
> | variable  |
> | xcom  |
> +---+
> 17 rows in set (0.00 sec)
>
>
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line
> 1817, in wsgi_app
> response = self.full_dispatch_request()
>   File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line
> 1477, in full_dispatch_request
> rv = self.handle_user_exception(e)
>   File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line
> 1381, in handle_user_exception
> reraise(exc_type, exc_value, tb)
>   File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line
> 1475, in full_dispatch_request
> rv = self.dispatch_request()
>   File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line
> 1461, in dispatch_request
> return self.view_functions[rule.endpoint](**req.view_args)
>   File "/usr/local/lib/python2.7/dist-packages/flask_admin/base.py",
> line 68, in inner
> return self._run_view(f, *args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/flask_admin/base.py",
> line 367, in _run_view
> return fn(self, *args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/flask_login.py", line
> 755, in decorated_view
> return func(*args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/www/utils.py",
> line 213, in view_func
> return f(*args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/www/utils.py",
> line 116, in wrapper
> session.commit()
>   File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py",
> line 801, in commit
> self.transaction.commit()
>   File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py",
> line 392, in commit
> self._prepare_impl()
>   File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py",
> line 372, in _prepare_impl
> self.session.flush()
>   File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py",
> line 2019, in flush
> self._flush(objects)
>   File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py",
> line 2137, in _flush
> transaction.rollback(_capture_exception=True)
>   File
> "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/langhelpers.py",
> line 60, in __exit__
> compat.reraise(exc_type, exc_value, exc_tb)
>   File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py",
> line 2101, in _flush
> flush_context.execute()
>   File
> "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/unitofwork.py",
> line 373, in execute
> rec.execute(self)
>   File
> "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/unitofwork.py",
> line 532, in execute
> uow
>   File
> "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/persistence.py",
> line 174, in save_obj
> mapper, table, insert)
>   File
> "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/persistence.py",
> line 800, in _emit_insert_statements
> execute(statement, params)
>   File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py",
> line 914, in execute
> return meth(self, multiparams, params)
>   File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/elements.py",
> line 323, in _execute_on_connection
> return connection._execute_clauseelement(self, multiparams, params)
>   File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py",
> line 1010, in _execute_clauseelement
> compiled_sql, distilled_params
>   File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py",
> line 1146, in _execute_context
> context)
>   File "/usr/local/lib/p

Re: Run TaskInstances sequentially

2016-08-12 Thread Jeremiah Lowin
-- Hello Hila,

A pool size of 1 is the simplest way to do this -- I recommend making a
pool for just this specific task.
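
For example (a sketch -- create a pool named, say, "single_slot" with 1 slot
under Admin -> Pools first):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG('exclusive_sketch', start_date=datetime(2016, 1, 1))


def do_work():
    pass  # placeholder for the real task logic


t = PythonOperator(
    task_id='exclusive_task',
    python_callable=do_work,
    pool='single_slot',  # only one task instance from this pool runs at a time
    dag=dag)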

depends_on_past is an even more strict condition: it will force your tasks
to run in chronological order. However as you point out, it also requires
the previous task to succeed.

Jeremiah

On Thu, Aug 11, 2016 at 5:29 PM Lance Norskog 
wrote:

> In Airflow 1.6.2, all of the concurrency controls are sometimes ignored and
> many tasks are scheduled simultaneously. I don't know if this has been
> completely fixed.
> You can rely on them to separate your task runs *most* of the time, but not
> *all* of the time- so don't write code that depends on exclusive operation.
>
> Lance
>
> On Thu, Aug 11, 2016 at 1:15 PM, Kurt Muehlner 
> wrote:
>
> > I’m not aware of a concurrency limit at task granularity, however, one
> > available option is the ‘max_active_runs’ parameter in the DAG class.
> >
> >   max_active_runs (int) – maximum number of active DAG runs, beyond this
> > number of DAG runs in a running state, the scheduler won’t create new
> > active DAG runs
> >
> > I’ve used the ‘pool size of 1’ option you mention as a very simple way to
> > ensure two DAGs run in serial.
> >
> > Kurt
> >
> > On 8/11/16, 6:57 AM, "הילה ויזן"  wrote:
> >
> > should I use pool of size 1?
> >
> > On Thu, Aug 11, 2016 at 4:46 PM, הילה ויזן 
> wrote:
> >
> > > Hi,
> > > I searched in the documentation for a way to limit a specific task
> > > concurrency to 1,
> > > but didn't find a way.
> > > I thought that 'depends_on_past' should achieve this goal, but I
> > want the
> > > task to run even if the previous task failed - just to be sure the
> > they
> > > don't run in parallel.
> > >
> > > The task doesn't have a downstream task, so I can't use
> > > 'wait_for_downstream'.
> > >
> > > Am I Missing something?
> > >
> > > Thanks,
> > > Hila
> > >
> > >
> >
> >
> >
>
>
> --
> Lance Norskog
> lance.nors...@gmail.com
> Redwood City, CA
>


Re: Shout out!

2016-08-12 Thread Jeremiah Lowin
Thanks for the kind words! While I appreciate it, just to be clear, this
script began life as the PR tool for Spark (and I've credited it
appropriately)... though at this point it's so heavily modified that it
definitely stands alone and, if I may say, is pretty damn cool!

Added some very cool new functionality just a few minutes ago in fact:
- automatically wrap commit messages at 50 characters (preserves newlines
and indentation)
- if a user mentions a JIRA issue anywhere in their PR or commits, it is
automatically prepended to the squash commit subject.

Basically: no more asking authors to edit their commits or PRs.
Nonetheless, authors: this doesn't let you off the hook... it's just a
safety net. :)

Call it the "making Sid and Bolke happy" PR. The goodness is here:
https://github.com/apache/incubator-airflow/pull/1728


On Fri, Aug 12, 2016 at 3:42 PM Chris Riccomini 
wrote:

> Perhaps here? https://yetus.apache.org/
>
> On Fri, Aug 12, 2016 at 12:08 PM, siddharth anand 
> wrote:
> > Mentors,
> > how do we contribute this goodness to other apache projects?
> >
> > -s
> >
> > On Fri, Aug 12, 2016 at 12:04 PM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> >> OMG yes!
> >>
> >> We should broadcast it to all other Apache projects that uses git.
> >>
> >> Max
> >>
> >> On Fri, Aug 12, 2016 at 9:46 AM, Dan Davydov  >> invalid
> >> > wrote:
> >>
> >> > +1
> >> >
> >> > On Aug 12, 2016 8:02 AM, "Chris Riccomini" 
> >> wrote:
> >> >
> >> > > Same. It's awesome.
> >> > >
> >> > > On Thu, Aug 11, 2016 at 7:28 PM, siddharth anand  >
> >> > > wrote:
> >> > > > FYI!
> >> > > > Just wanted to give a special shout-out for jlowin for writing a
> >> great
> >> > > > merge tool for committers. Thx to this tool, merging your PR is
> super
> >> > > easy.
> >> > > >
> >> > > > -s
> >> > >
> >> >
> >>
>


Re: airflow scheduler slowness as tasks increase

2016-08-08 Thread Jeremiah Lowin
Sure, just modify this code:

import airflow
from airflow.models import Pool

# Open a session against the Airflow metadata database.
session = airflow.settings.Session()

# Only create the pool if it doesn't already exist.
pool = (
    session.query(Pool)
    .filter(Pool.pool == 'my_pool')
    .first())

if not pool:
    session.add(
        Pool(
            pool='my_pool',
            slots=8,
            description='this is my pool'))
    session.commit()



On Sun, Aug 7, 2016 at 4:37 PM Nadeem Ahmed Nazeer 
wrote:

> Could we create a pool programmatically instead of manually creating from
> UI? I want to create this pool from the chef script when airflow starts up.
>
> Thanks,
> Nadeem
>
> On Wed, Jul 13, 2016 at 5:21 PM, Lance Norskog 
> wrote:
>
> > Nazeer- "If I don't use num_runs, scheduler would just stop after running
> > some number of tasks and I can't figure out why."
> > This is a known bug.
> >
> > One way to help this scheduling is to create a Pool. A Pool is a queue
> > gatekeeper that allows at most N tasks to run concurrently. If you set
> the
> > Pool size to, say, 5-10 and make all tasks join that pool, then only that
> > many tasks will run. The point of Pools is to regulate access to
> contested
> > resources. In this case, all of your external services (S3, Hadoop) are
> > contested resources. In this case, you may have 30 S3 jobs running at
> once
> > and 50 M/R jobs trying to run. You will find this all runs more smoothly
> > when you control the number of active tasks using a resource.
> >
> > Another technique is that either a DAG or a task (I can't remember which)
> > can wait until previous days finish. This is another way to regulate the
> > flow of tasks.
> >
> > After all, you would not do this in the shell:
> >
> > for x in 500 hive scripts
> > do
> >hive -f $x &
> > done
> >
> > This is exactly what Airflow is doing with out-of-control tasks.
> >
> > Lance
> >
> > On Wed, Jul 13, 2016 at 11:18 AM, Nadeem Ahmed Nazeer <
> naz...@neon-lab.com
> > >
> > wrote:
> >
> > > Thanks for the response Bolke. Looking forward to have this slowness
> with
> > > the scheduler fixed in the future airflow releases. I am currently on
> > > version 1.7.0, will upgrade to 1.7.1.3 and also try your suggestions.
> > >
> > > I am using CeleryExecutor. If I don't use num_runs, scheduler would
> just
> > > stop after running some number of tasks and I can't figure out why. The
> > > scheduler would only start running after I restart the service
> manually.
> > > The fix to that was to add this parameter. I found the num_tasks
> > parameter
> > > used in the upstart script for the scheduler by default and also read
> in
> > > the manual to use this (
> > > https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls).
> > >
> > > Thanks,
> > > Nadeem
> > >
> > > On Wed, Jul 13, 2016 at 8:51 AM, Bolke de Bruin 
> > wrote:
> > >
> > > > Nadeem,
> > > >
> > > > Unfortunately this slowness is currently a deficit in the scheduler.
> It
> > > > will be addressed
> > > > in the future, but obviously we are not there yet. To make it more
> > > > manageable you could
> > > > use end_date for the dag and create multiple dags for it, keeping the
> > > > logic the same but
> > > > the dag_id and the start-date / end_date different. If you are on
> > 1.7.1.3
> > > > you will then benefit
> > > > from multiprocessing (max_threads for the scheduler). In addition you
> > add
> > > > load by hand then.
> > > > Not ideal but it will work.
> > > >
> > > > Also depending the speed of your tasks finishing you could limit the
> > > > heartbeat so the scheduler
> > > > does not run redundantly while not being able to fire off new tasks.
> > > >
> > > > In addition why are you using num_runs? I definitely do not recommend
> > > > using it with a
> > > > LocalExecutor and if you are on 1.7.1.3 I would not use it with
> Celery
> > > > either.
> > > >
> > > > I hope this helps!
> > > >
> > > > Bolke
> > > >
> > > > > Op 13 jul. 2016, om 10:43 heeft Nadeem Ahmed Nazeer <
> > > naz...@neon-lab.com>
> > > > het volgende geschreven:
> > > > >
> > > > > Hi,
> > > > >
> > > > > We are using airflow to establish a data pipeline that runs tasks
> on
> > > > > ephemeral amazon emr cluster. The oldest data we have is from
> > > 2014-05-26
> > > > > which we have set as the start date with a scheduler interval of 1
> > day
> > > > for
> > > > > airflow.
> > > > >
> > > > > We have an s3 copy task, a map reduce task and a bunch of hive and
> > > impala
> > > > > load tasks in our DAG all run via PythonOperator. Our expectation
> is
> > > for
> > > > > airflow to run each of these tasks for each day from the start date
> > > till
> > > > > current date.
> > > > >
> > > > > Just for numbers, the number of dags that got created was
> > > approximately
> > > > > 800 from start date till current date (2016-07-13). All is well at
> > the
> > > > > start of the execution but as it executes more and more tasks, the
> > > > > scheduling of tasks starts slowing down. Looks like the scheduler
> is
> > > > > spending lot of time in

Broken unit tests -- request for help

2016-08-04 Thread Jeremiah Lowin
We have a few non-deterministic unit test failures that are affecting many
-- but not all -- PRs. I believe they are being ignored as "unrelated" but
they have the potential to mask real issues and should be addressed.
Unfortunately they're out of my expertise so I'm going to list the ones
I've identified and hope someone smarter than me can see if they can help!

In particular, we have a number of simple PR's that should obviously have
no problems (typos, readme edits, etc.) that are nonetheless failing tests,
causing frustration for all. Here is one from just this morning:
https://github.com/apache/incubator-airflow/pull/1705/files

Thanks in advance!

1. Python 3 Mysql (this one is pretty common), due to not being able to
find "beeline" which I believe is related to Hive. This is the error:

==

ERROR: test_mysql_to_hive_partition (tests.TransferTests)

--

Traceback (most recent call last):

  File 
"/home/travis/build/apache/incubator-airflow/tests/operators/operators.py",
line 208, in test_mysql_to_hive_partition

t.run(start_date=DEFAULT_DATE, end_date=DEFAULT_DATE, force=True)

  File "/home/travis/build/apache/incubator-airflow/airflow/models.py",
line 2350, in run

force=force,)

  File "/home/travis/build/apache/incubator-airflow/airflow/utils/db.py",
line 54, in wrapper

result = func(*args, **kwargs)

  File "/home/travis/build/apache/incubator-airflow/airflow/models.py",
line 1388, in run

result = task_copy.execute(context=context)

  File 
"/home/travis/build/apache/incubator-airflow/airflow/operators/mysql_to_hive.py",
line 131, in execute

recreate=self.recreate)

  File 
"/home/travis/build/apache/incubator-airflow/airflow/hooks/hive_hooks.py",
line 322, in load_file

self.run_cli(hql)

  File 
"/home/travis/build/apache/incubator-airflow/airflow/hooks/hive_hooks.py",
line 212, in run_cli

cwd=tmp_dir)

  File "/opt/python/3.4.2/lib/python3.4/subprocess.py", line 858, in __init__

restore_signals, start_new_session)

  File "/opt/python/3.4.2/lib/python3.4/subprocess.py", line 1456, in
_execute_child

raise child_exception_type(errno_num, err_msg)

nose.proxy.FileNotFoundError: [Errno 2] No such file or directory: 'beeline'


2. Python 3 Postgres (this one is really infrequent):

==

FAIL: Test that ignore_first_depends_on_past doesn't affect results

--

Traceback (most recent call last):

  File "/home/travis/build/apache/incubator-airflow/tests/jobs.py",
line 349, in test_dagrun_deadlock_ignore_depends_on_past

run_kwargs=dict(ignore_first_depends_on_past=True))

  File "/home/travis/build/apache/incubator-airflow/airflow/utils/db.py",
line 54, in wrapper

result = func(*args, **kwargs)

  File "/home/travis/build/apache/incubator-airflow/tests/jobs.py",
line 221, in evaluate_dagrun

self.assertEqual(ti.state, expected_state)

nose.proxy.AssertionError: None != 'success'

3. Mysql (py2 and py3, infrequent). This appears to happen when the
SLA code is called with mysql. Bizarrely, this doesn't appear to
actually raise an error in the test -- it just prints a logging error.
It must be trapped somewhere.

ERROR [airflow.jobs.SchedulerJob] Boolean value of this clause is not defined

Traceback (most recent call last):

  File "/home/travis/build/apache/incubator-airflow/airflow/jobs.py",
line 667, in _do_dags

self.manage_slas(dag)

  File "/home/travis/build/apache/incubator-airflow/airflow/utils/db.py",
line 53, in wrapper

result = func(*args, **kwargs)

  File "/home/travis/build/apache/incubator-airflow/airflow/jobs.py",
line 301, in manage_slas

.all()

  File 
"/home/travis/build/apache/incubator-airflow/.tox/py34-cdh-airflow_backend_mysql/lib/python3.4/site-packages/sqlalchemy/sql/elements.py",
line 2760, in __bool__

raise TypeError("Boolean value of this clause is not defined")

TypeError: Boolean value of this clause is not defined


A question for the Airflow community

2016-08-03 Thread Jeremiah Lowin
Dear all --

Thanks for being part of the Airflow community! We are all deeply
appreciative of your involvement in this project, your willingness to ask
questions, point out problems, and (especially) build solutions.

I'd like to ask you all a question in turn: what do you know now that you
wish you knew when you first deployed Airflow? In other words, what made
you do this: https://www.youtube.com/watch?v=gqQ99s4Ywnw

This is intentionally very open-ended and I hope many of you will respond,
even if it's just a few words on a pain point you encountered that wasn't
quite big enough to deserve a JIRA issue (but please note if you don't open
an issue, it's not getting fixed ;) Here's the link if you need it:
https://issues.apache.org/jira/browse/AIRFLOW/).

We will use your feedback to figure out not only what could be better
communicated, but also the best format for doing so. We want to make it
easier to mold Airflow to every user's unique environment. That might
involve improving the documentation; making better use of the wiki to store
knowledge; or creating a blog to present more narrative-driven examples,
use cases, and news.

Thanks so much!
-- Jeremiah (on behalf of the Airflow team)


Re: Sync Git code among Airflow workers

2016-07-13 Thread Jeremiah Lowin
I have a little module for this that was designed to facilitate syncing a
git repo in Kubernetes: https://github.com/jlowin/git-sync. The idea is to
sync a volume that is then shared to all containers (webserver, scheduler,
workers, etc). It also works locally.

However I want to stress that this is absolutely 100% unsupported (by me)!
It's an experiment that works well enough for my use case. Maybe it's a
useful jumping off point?

Best,
J
On Wed, Jul 13, 2016 at 2:35 PM Chris Riccomini 
wrote:

> Most folks follow a push-based approach (puppet, chef, etc).
>
> Our approach is CRON-based pull, described here:
>
> https://wecode.wepay.com/posts/airflow-wepay
>
> On Wed, Jul 13, 2016 at 11:10 AM, Fernando San Martin 
> wrote:
> > At Turo we have our data pipeline organized as a set of Python & SQL jobs
> > orchestrated by Jenkins. We are evaluating Airflow as an alternative and
> we
> > have managed to get quite far but we have some questions that we were
> > hoping to get help with from the community.
> >
> > We have a set-up with a master node and two workers, our code was
> deployed
> > in all three boxes by retrieving from a git repo. The code in the repo
> > changes on a regular basis and we need to keep the boxes with the latest
> > version of the code.
> >
> > We first thought of adding to the top of our DAGs a BashOperator task
> that
> > simply runs `git pull origin master`, but since this code gets only
> > executed in the workers, the master node will eventually differ from the
> > code that is on the workers.
> >
> > Another option is to run a cron job that executes `git pull origin
> master`
> > in each box every 5-mins or so.
> >
> > Are there recommendations or best practices on how to handle this
> situation?
> >
> > Thank you!
>


Re: Passing XCOM variables from parent DAG to child DAG

2016-07-13 Thread Jeremiah Lowin
Hi Greg,

This should be possible. When you pull the XCom, you can specify key,
task_id, and dag_id. By default, it only pulls from the same DAG, so try
setting the dag_id to match the parent.
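
For illustration, here is a minimal, untested sketch (all dag/task ids are
hypothetical) of pulling a parent DAG's XCom from inside a SubDAG task via a
PythonOperator:

```python
# Hypothetical example: a task inside the SubDAG reads a value that a task
# in the parent DAG pushed (or returned) as an XCom.
def use_parent_value(**context):
    value = context['ti'].xcom_pull(
        dag_id='parent_dag',      # the parent DAG's dag_id, not the subdag's
        task_ids='push_task',     # the parent task that produced the value
        key='return_value')       # default key used for returned values
    print(value)

# Wire this up with PythonOperator(..., provide_context=True) inside the subdag.
```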

J

On Wed, Jul 13, 2016 at 2:34 PM Greg Lira  wrote:

> Hi,
>
> We want to push an XCom in the parent dag and then use (pull) it in our
> SubDAGs. Can't get this working in Airflow.
> Is it supported? How can we use that?
>
> Thanks,
> Greg
>


Re: Running a task from the Airflow UI

2016-07-11 Thread Jeremiah Lowin
Paul,

The trigger_dag cli command accepts a json "conf" parameter. To be honest
I'm not familiar with the feature (perhaps Sid can provide more detail) but
I think it might accomplish your goals.
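
For what it's worth, a hedged sketch of how that parameter is commonly used
(the dag/task names and the exact CLI syntax below are assumptions, so please
double-check them against your version):

```python
# CLI (assumed): airflow trigger_dag my_s3_dag --conf '{"s3_key": "path/to/file.csv"}'
# Inside the DAG, the payload shows up on the DagRun object:
def process_file(**context):
    conf = context['dag_run'].conf or {}   # the JSON passed to trigger_dag
    s3_key = conf.get('s3_key')
    print('Processing {}'.format(s3_key))

# process = PythonOperator(task_id='process_file', python_callable=process_file,
#                          provide_context=True, dag=dag)
```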

J

On Fri, Jul 8, 2016 at 1:11 PM Paul Minton  wrote:

> +1 to this feature as well.
>
> I wonder if it would be possible to pass along some context when triggering
> a job. This may be outside the scope of this thread, but it would allow for
> more dynamic runs of a job. As a simple example, I may want to kick off a
> job and pass along the key to a file on s3. Right now we would depend on a
> initial s3 sensor, but that would require that the filename be static
> across runs.
>
> On Thu, Jul 7, 2016 at 9:55 AM, Chris Riccomini 
> wrote:
>
> > +1
> >
> > On Thu, Jul 7, 2016 at 5:18 AM, Bolke de Bruin 
> wrote:
> >
> > > Ideally the CLI and WebUI should both access an API that handles
> > > authentication and authorization. This would resolve both issues.
> > However,
> > > the UI already allows for authentication and to a lesser extent
> > > authorization. Thus allowing this from the UI (which we already do for
> > > Celery) is not a big change.
> > >
> > > - Bolke
> > >
> > >
> > > > Op 7 jul. 2016, om 11:01 heeft Alexander Alten-Lorenz <
> > > wget.n...@gmail.com> het volgende geschreven:
> > > >
> > > > Sounds good, but on the other hand I'm with Maxime. Given that the
> task
> > > can be triggered per CLI, the functionality is available but needs a
> > local
> > > login. If the "run" button were available to everyone who
> has
> > > access to the UI, I can imagine that would cause some serious load
> issues
> > > in a production environment, especially with SLA based workflow setups.
> > > > On the other hand, when the "run" button with a local executor would
> > > queue the task in a control queue (like "external triggered") an admin
> > could
> > > finally mark them as "approved".
> > > >
> > > > --alex
> > > >
> > > >> On Jul 7, 2016, at 12:12 AM, Jeremiah Lowin 
> > wrote:
> > > >>
> > > >> Perhaps it's a good chance to revisit the functionality. Right now
> the
> > > UI
> > > >> "run" button actually runs the task via CeleryExecutor. Perhaps
> > instead
> > > (or
> > > >> just when using a non-Celery executor) it should queue the task and
> > let
> > > the
> > > >> Scheduler pick it up. I guess in that case it would just be sugar
> for
> > > >> marking a TI as QUEUED. Just a thought.
> > > >>
> > > >> On Wed, Jul 6, 2016 at 2:54 AM Maxime Beauchemin <
> > > maximebeauche...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Hi,
> > > >>>
> > > >>> The problem is that a web server isn't the right place to run an
> > > airflow
> > > >>> task. From the context of the web request scope we have to somehow
> > > pass a
> > > >>> message to an external executor to run the task. For LocalExecutor
> to
> > > work
> > > >>> the web server would have to start a LocalExecutor as a sub process
> > and
> > > >>> that doesn't sound like a great idea...
> > > >>>
> > > >>> Max
> > > >>>
> > > >>> On Tue, Jul 5, 2016 at 11:22 AM, Jason Chen <
> > chingchien.c...@gmail.com
> > > >
> > > >>> wrote:
> > > >>>
> > > >>>> Hi Airflow team,
> > > >>>> I am using the "LocalExecutor" and it works very well to run the
> > > >>> workflow
> > > >>>> I setup.
> > > >>>>
> > > >>>> I noticed that, from the UI, it can trigger a task to run.
> > > >>>> However, I got the error "Only works with the CeleryExecutor,
> sorry
> > ".
> > > >>>> I can ssh into airflow node and run the command line from there.
> > > >>>> However, it would be nice to just run it from airflow UI.
> > > >>>> Is it possible to do that (with "LocalExecutor") or it's a future
> > > feature
> > > >>>> to consider ?
> > > >>>>
> > > >>>> Thanks.
> > > >>>> Jason
> > > >>>>
> > > >>>
> > > >
> > >
> > >
> >
>


Re: Running a task from the Airflow UI

2016-07-06 Thread Jeremiah Lowin
It's funny but I never thought about it either until your email. We can't
guarantee there's a scheduler running... but then again we can't guarantee
a Celery worker is running either, so I think it might be a good
enhancement (along with an informative status message).
On Wed, Jul 6, 2016 at 8:32 PM Maxime Beauchemin 
wrote:

> Oh right, somehow I didn't even think of that. It's a fair assumption that
> there should be a scheduler up.
>
> Max
>
> On Wed, Jul 6, 2016 at 3:12 PM, Jeremiah Lowin  wrote:
>
> > Perhaps it's a good chance to revisit the functionality. Right now the UI
> > "run" button actually runs the task via CeleryExecutor. Perhaps instead
> (or
> > just when using a non-Celery executor) it should queue the task and let
> the
> > Scheduler pick it up. I guess in that case it would just be sugar for
> > marking a TI as QUEUED. Just a thought.
> >
> > On Wed, Jul 6, 2016 at 2:54 AM Maxime Beauchemin <
> > maximebeauche...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > The problem is that a web server isn't the right place to run an
> airflow
> > > task. From the context of the web request scope we have to somehow
> pass a
> > > message to an external executor to run the task. For LocalExecutor to
> > work
> > > the web server would have to start a LocalExecutor as a sub process and
> > > that doesn't sound like a great idea...
> > >
> > > Max
> > >
> > > On Tue, Jul 5, 2016 at 11:22 AM, Jason Chen  >
> > > wrote:
> > >
> > > > Hi Airflow team,
> > > >  I am using the "LocalExecutor" and it works very well to run the
> > > workflow
> > > > I setup.
> > > >
> > > > I noticed that, from the UI, it can trigger a task to run.
> > > > However, I got the error "Only works with the CeleryExecutor, sorry
> ".
> > > > I can ssh into airflow node and run the command line from there.
> > > > However, it would be nice to just run it from airflow UI.
> > > > Is it possible to do that (with "LocalExecutor") or it's a future
> > feature
> > > > to consider ?
> > > >
> > > > Thanks.
> > > > Jason
> > > >
> > >
> >
>


Re: Running a task from the Airflow UI

2016-07-06 Thread Jeremiah Lowin
Perhaps it's a good chance to revisit the functionality. Right now the UI
"run" button actually runs the task via CeleryExecutor. Perhaps instead (or
just when using a non-Celery executor) it should queue the task and let the
Scheduler pick it up. I guess in that case it would just be sugar for
marking a TI as QUEUED. Just a thought.

On Wed, Jul 6, 2016 at 2:54 AM Maxime Beauchemin 
wrote:

> Hi,
>
> The problem is that a web server isn't the right place to run an airflow
> task. From the context of the web request scope we have to somehow pass a
> message to an external executor to run the task. For LocalExecutor to work
> the web server would have to start a LocalExecutor as a sub process and
> that doesn't sound like a great idea...
>
> Max
>
> On Tue, Jul 5, 2016 at 11:22 AM, Jason Chen 
> wrote:
>
> > Hi Airflow team,
> >  I am using the "LocalExecutor" and it works very well to run the
> workflow
> > I setup.
> >
> > I noticed that, from the UI, it can trigger a task to run.
> > However, I got the error "Only works with the CeleryExecutor, sorry ".
> > I can ssh into airflow node and run the command line from there.
> > However, it would be nice to just run it from airflow UI.
> > Is it possible to do that (with "LocalExecutor") or it's a future feature
> > to consider ?
> >
> > Thanks.
> > Jason
> >
>


Re: ExternalTaskSensor offsets with monthly jobs

2016-06-30 Thread Jeremiah Lowin
Thanks Laura -- it sounds like you're describing the "push" version of
Adrian's "pull" workflow. I hadn't considered that one.

Adrian, if that doesn't work for you please keep an eye on:
https://github.com/apache/incubator-airflow/pull/1641

On Thu, Jun 30, 2016 at 12:22 PM Laura Lorenz 
wrote:

> I think I mentioned this on another similar thread and I think our use case
> might be somewhat similar. We have a daily ETL that loads data in to a
> database in one DAG, and then need to do a weekly rollup report every
> Tuesday that is in another DAG. The first DAG has a final
> TriggerDagRunOperator that decides if today is Tuesday or not, and if yes,
> triggers the weekly rollup DAG that operates on the data in the database at
> that moment - which, since it's Tuesday, is all the data we want. If that
> sounds like what you're trying to do, your first DAG might have a
> TriggerDagRunOperator that decides if today is the first of the month, and
> then triggers some other DAG.
>
> Laura
>
> On Thu, Jun 30, 2016 at 12:09 PM, Jeremiah Lowin 
> wrote:
>
> > Interesting -- this could be an extension of open enhancement AIRFLOW-100
> > https://issues.apache.org/jira/browse/AIRFLOW-100. Let me see if I can
> > restate this correctly:
> >
> > - You have a daily ETL job
> > - You have a monthly reporting job, for arguments sake lets say it runs
> on
> > the last day of each month with an execution date equal to the last day
> of
> > the prior month (for example on 7/31/2016 the task with execution date
> > 6/30/2016 will run).
> > You want the monthly job with execution date 6/30/2016 to wait for (and
> > include) the daily ETLs through 7/31/2016. In some months, that requires
> a
> > 31 day delta, in others 30 (in others 28... and forget about leap years).
> >
> > It sounds like the simplest solution (and the one proposed in A-100) is
> to
> > allow ExternalTaskSensor to accept not just a static delta, but
> potentially
> > a callable that accepts the current execution date and returns the
> desired
> > execution date for the sensed task. In this case, it would take in
> > 6/30/2016 and return 7/31/2016 as the last day of the following month. I
> > don't think any headway has been made on actually implementing the
> solution
> > but it should be straightforward -- I will try to get to it if I have
> some
> > time in the next few days.
> >
> >
> > On Wed, Jun 29, 2016 at 11:25 AM Adrian Bridgett 
> > wrote:
> >
> > > I'm hitting a bit of an annoying problem and wondering about the best
> > > course of action.
> > >
> > > We have several dags:
> > > - a daily ETL job
> > > - several reporting jobs (daily, weekly or monthly) which use the data
> > > from previous ETL jobs
> > >
> > > I wish to have a dependency such that the reporting jobs depend upon
> the
> > > last ETL job that the report uses.   We're happy to set depends_on_past
> > > in the ETL job.
> > >
> > > Daily jobs are easy - ExternalTaskSensor, job done.
> > > Weekly jobs are a little trickier - we need to work out the
> > > execution_delta - normally +6 for us (we deliberately run a day late to
> > > prioritise other jobs).
> > > Monthly jobs this is where I'm struggling - how to work out the
> > > execution_delta.   I guess the ideal would be an upgrade from timedelta
> > > to dateutil.relativedelta?   tomorrow_ds and ds_add don't help either.
> > >
> > > I must admit, ds being the time that's just gone has caused me no end
> of
> > > brain befudledness, especially when trying to get the initial job right
> > > (so much so that I wrote this up in our DAG README, posting here for
> > > others):
> > >
> > > When adding a new job, it's critical to ensure that you've set the
> > > schedule correctly:
> > > - frequency (monthly, weekly, daily)
> > > - schedule_interval ("0 0 2 * *", "0 0 * * 0", "0 0 * * *")
> > > - start_date (choose a day that matches schedule_interval at least one
> > > interval ago)
> > > -- e.g if today is Thursday 2016-06-09, go back in time to when the
> > > schedule will trigger,
> > > then work out what "ds" (execution date) would be (remembering
> > > that's the lapsed date)
> > > --- for a monthly job, last trigger=2016-06-02, ds=2016-05-02
> > > --- for a weekly job, last trigger=2016-06-05, ds=2016-05-29
> > > --- for a daily job, last trigger=2016-06-09, ds=2016-06-08
> > >
> >
>


Re: ExternalTaskSensor offsets with monthly jobs

2016-06-30 Thread Jeremiah Lowin
Interesting -- this could be an extension of open enhancement AIRFLOW-100
https://issues.apache.org/jira/browse/AIRFLOW-100. Let me see if I can
restate this correctly:

- You have a daily ETL job
- You have a monthly reporting job; for argument's sake let's say it runs on
the last day of each month with an execution date equal to the last day of
the prior month (for example on 7/31/2016 the task with execution date
6/30/2016 will run).
You want the monthly job with execution date 6/30/2016 to wait for (and
include) the daily ETLs through 7/31/2016. In some months, that requires a
31 day delta, in others 30 (in others 28... and forget about leap years).

It sounds like the simplest solution (and the one proposed in A-100) is to
allow ExternalTaskSensor to accept not just a static delta, but potentially
a callable that accepts the current execution date and returns the desired
execution date for the sensed task. In this case, it would take in
6/30/2016 and return 7/31/2016 as the last day of the following month. I
don't think any headway has been made on actually implementing the solution
but it should be straightforward -- I will try to get to it if I have some
time in the next few days.
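
To make the idea concrete, here is a rough sketch of what such a callable
might look like; the `execution_date_fn` parameter name is hypothetical, since
the enhancement has not been implemented yet:

```python
from datetime import timedelta
from dateutil.relativedelta import relativedelta

def target_execution_date(execution_date):
    # Map e.g. 2016-06-30 to 2016-07-31 (the last day of the following month).
    first_of_month_after_next = execution_date + relativedelta(day=1, months=2)
    return first_of_month_after_next - timedelta(days=1)

# Hypothetical usage once AIRFLOW-100 lands:
# ExternalTaskSensor(task_id='wait_for_etl', external_dag_id='daily_etl',
#                    external_task_id='load',
#                    execution_date_fn=target_execution_date, dag=dag)
```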


On Wed, Jun 29, 2016 at 11:25 AM Adrian Bridgett 
wrote:

> I'm hitting a bit of an annoying problem and wondering about the best
> course of action.
>
> We have several dags:
> - a daily ETL job
> - several reporting jobs (daily, weekly or monthly) which use the data
> from previous ETL jobs
>
> I wish to have a dependency such that the reporting jobs depend upon the
> last ETL job that the report uses.   We're happy to set depends_on_past
> in the ETL job.
>
> Daily jobs are easy - ExternalTaskSensor, job done.
> Weekly jobs are a little trickier - we need to work out the
> execution_delta - normally +6 for us (we deliberately run a day late to
> prioritise other jobs).
> Monthly jobs this is where I'm struggling - how to work out the
> execution_delta.   I guess the ideal would be an upgrade from timedelta
> to dateutil.relativedelta?   tomorrow_ds and ds_add don't help either.
>
> I must admit, ds being the time that's just gone has caused me no end of
> brain befuddlement, especially when trying to get the initial job right
> (so much so that I wrote this up in our DAG README, posting here for
> others):
>
> When adding a new job, it's critical to ensure that you've set the
> schedule correctly:
> - frequency (monthly, weekly, daily)
> - schedule_interval ("0 0 2 * *", "0 0 * * 0", "0 0 * * *")
> - start_date (choose a day that matches schedule_interval at least one
> interval ago)
> -- e.g if today is Thursday 2016-06-09, go back in time to when the
> schedule will trigger,
> then work out what "ds" (execution date) would be (remembering
> that's the lapsed date)
> --- for a monthly job, last trigger=2016-06-02, ds=2016-05-02
> --- for a weekly job, last trigger=2016-06-05, ds=2016-05-29
> --- for a daily job, last trigger=2016-06-09, ds=2016-06-08
>


Re: Projects using GitHub issues

2016-06-28 Thread Jeremiah Lowin
Thanks for the heads up Chris!

On Tue, Jun 28, 2016 at 10:21 AM Chris Riccomini 
wrote:

> FYI- it sounds like there's some discussion about allowing GH issues
> instead of JIRA.
>
> -- Forwarded message --
> From: John D. Ament 
> Date: Tue, Jun 28, 2016 at 7:17 AM
> Subject: Projects using GitHub issues
> To: "gene...@incubator.apache.org" 
>
>
> All,
>
> I started a discussion on legal discuss, as I was told VP Legal approved
> using GitHub issues.  I have no specific concerns, other than ensuring that
> permissions are propagated.
>
>
> https://lists.apache.org/thread.html/d2cb0eed30d72976a2f54893f132c1fe300a86d2fdbce1422763c9f0@%3Clegal-discuss.apache.org%3E
>
> Its clear to me that legal-discuss isn't where this should start, so I'm
> not sure why VP Legal approved.  Anyways, I wanted to get opinions from the
> incubator on who should be discussing this issue.
>
> John
>


Standard imports merged: please note impact on unit tests!

2016-06-17 Thread Jeremiah Lowin
This morning I finally merged the "standard imports" PR that has been
outstanding for almost three months (
https://github.com/apache/incubator-airflow/pull/1272,
https://issues.apache.org/jira/browse/AIRFLOW-31).

This is the one that makes this behavior deprecated (with a warning):
from airflow.operators import PigOperator
in favor of:
from airflow.operators.pig_operator import PigOperator

*IMPORTANT* the user-facing import mechanism will remain fully
backwards-compatible until Airflow 2.0, but the backwards compatibility is
turned off for unit tests. I have gone through and done the painful work of
updating all existing unit tests to comply with this rule.

Why could this be an issue? We don't have full control over Apache travis,
so it's difficult to make sure that changes to unit tests are enforced on
currently-outstanding PRs. Normally we'd just rerun the PR's tests -- but
Apache travis can't do that without a new commit from the PR's author.

Therefore, when merging any PR that travis green-lit in the recent past,
please take an extra moment to eyeball (or run) any modified unit tests.
Alternatively, you can pull the PR to your local branch and run travis
there.

I wish there were a more graceful way to do this -- I considered turning
backwards compatibility on for tests but that just pushes the pain into the
future, when things will be even more complex.

Thanks,
J


Re: Iterating task xcom keys inside template

2016-06-16 Thread Jeremiah Lowin
Or, I'm afraid my jinja-fu is sorely lacking and I'm not going to be much
help to you here... perhaps Maxime or one of the other devs is more
experienced with templates and can offer some advice?
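
That said, one hedged possibility worth trying (an untested sketch; the macro
and task names are made up): register json.loads as a user-defined macro on
the DAG so the template can decode the pulled string before iterating over it.

```python
import json
from datetime import datetime
from airflow import DAG

dag = DAG('data_validation',
          start_date=datetime(2016, 6, 1),
          user_defined_macros={'from_json': json.loads})

VALIDATION_EMAIL_CONTENT = """
Date: {{ ds }}
{% for key, value in from_json(task_instance.xcom_pull(
    task_ids='data_validation_tests', key='tests_results')).items() %}
{{ key }}: {{ value }}
{% endfor %}
"""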

Apologies!
Jeremiah

On Thu, Jun 16, 2016 at 2:40 AM Or Sher  wrote:

> Thanks Jeremiah!
>
> Looking at the code it seems like get_many can return only many xcom values
> from different task_ids and not different keys from the same task_id.
> But it doesn't really matter as I would really like to do the iteration
> somehow inside the template.
> I feel like building the template inside the previous python operator is
> kind of missing the whole point of templates as I can already create the
> whole "already rendered" email message at the same time.
>
> I tried pushing one big json xcom to reflect all of the tests, but now
> I'm struggling
> with iterating string json inside the template.
> The following template fails on the "for" line:
> VALIDATION_EMAIL_CONTENT = """
> Date: {{ ds }}
> 
> 
> {% for key, value in
> task_instance.xcom_pull(task_ids='data_validation_tests',
> key='tests_results') %}
> {{ key|string() }}
> {{ value|string() }}
> {% endfor %}
> ..
> ..
>
> File "", line 5, in top-level template code
> ValueError: too many values to unpack
>
> I'm guessing it's because the xcom_pull returns a string but I didn't
> managed to cast it to a json yet..
>
> On Wed, Jun 15, 2016 at 5:33 PM Jeremiah Lowin  wrote:
>
> > Hi Or,
> >
> > There is support for pulling multiple XComs at once (see XCom.get_many())
> > but it won't be directly available inside a jinja template, where you
> > are typically calling ti.xcom_pull() and returning a single XCom at a
> time,
> > using a key you know in advance.
> >
> > I would suggest that you do this processing either inside your existing
> > PythonOperator or in a subsequent PythonOperator that calls
> XCom.get_many()
> > and produces the final template for your email operator. It is
> > theoretically possible for you to do this in your email template using
> > template macros, but I only have a basic knowledge of that (and it's a
> > convoluted process) so I don't want to give you bad advice. If you can do
> > your preprocessing in your PythonOperator I think you will be happier,
> > especially because the result will be visible in the Airflow UI and logs.
> >
> > Best,
> > Jeremiah
> >
> > On Wed, Jun 15, 2016 at 9:49 AM Or Sher  wrote:
> >
> > > Hi all,
> > >
> > > For the purpose of data validation tests I have two adjacent operators:
> > > Python - Parsing predefined json of tests and saves the result as an
> xcom
> > > where each key is a test id.
> > >
> > > Email - I'd like to be able to iterate over all keys from the previous
> > task
> > > inside the predefined email template so I won't need to change it every
> > > time I change the tests json.
> > >
> > > It seems like it's not supported. Is that right?
> > > What are my choices?
> > > If I won't find anything else, I assume I can always use one key for
> all
> > of
> > > the tests combined into one json.
> > >
> >
>


Re: S3 connection

2016-06-16 Thread Jeremiah Lowin
Hi Tyrone,

The motivation behind the change was to force *all* Airflow connections
(including those used for logging) to go through the UI where they can be
managed/controlled by an admin, and also to allow more fine-grained
permissioning.

Fortunately, connections can be created programmatically with just a couple
extra steps. I use a script similar to this one (below) to set up all of
the connections in our production environment after restarts. I've made
some changes to show how the keys could be taken from env vars. You could
run this script either as part of your own library or plugin.

I hope this helps and I'm sorry for the inconvenience!



import json
import os

import airflow
from airflow.models import Connection

S3_CONN_ID = 's3_connection'

if __name__ == '__main__':
    session = airflow.settings.Session()

    # check if the connection exists (first() returns None if it does not)
    s3_connection = (
        session.query(Connection)
        .filter(Connection.conn_id == S3_CONN_ID)
        .first())

    if not s3_connection:
        print('Creating connection: {}'.format(S3_CONN_ID))
        session.add(
            Connection(
                conn_id=S3_CONN_ID,
                conn_type='s3',
                extra=json.dumps(dict(
                    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
                    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY']))))
        session.commit()
    print('Done creating connections.')


On Thu, Jun 16, 2016 at 11:01 AM Tyrone Hinderson 
wrote:

> Hey Jacob,
>
> Thanks for your quick response. I doubt I can take your approach, because
>
>1. It's imperative that the s3 connection be contained within an
>environment variable
>2. My scheduler is deployed on an AWS box which uses an IAM role to
>connect to s3, not a credentials file.
>
> However, can you tell me where you got the idea to use that particular
> JSON? Might help with my quest for a solution.
>
> On Wed, Jun 15, 2016 at 8:00 PM Jakob Homan  wrote:
>
> > Hey Tyrone-
> >I just set this up on 1.7.1.2 and found the documentation confusing
> > too.  Been meaning to improve the documentation.  To get S3 logging
> > configured I:
> >
> > (a) Set up an S3Connection (let's call it foo) with only the extra
> > param set to the following json:
> >
> > { "s3_config_file": "/usr/local/airflow/.aws/credentials",
> > "s3_config_format": "aws" }
> >
> > (b) Added a remote_log_conn_id key to the core section of airflow.cfg,
> > with a value of "foo" (my S3Connection name)
> >
> > (c) Added a remote_base_log_folder key to the core section of
> > airflow.cfg, with a value of "s3://where/i/put/my/logs"
> >
> > Everything worked after that.
> >
> > -Jakob
> >
> > On 15 June 2016 at 15:35, Tyrone Hinderson 
> wrote:
> > > @Jeremiah,
> > >
> > > http://pythonhosted.org/airflow/configuration.html#logs
> > >
> > > I used to log to s3 in 1.7.0, and my background .aws/credentials would
> > take
> > > care of authenticating in the background. Now it appears that I need to
> > set
> > > that "remote_log_conn_id" config field in order to continue logging to
> s3
> > > in 1.7.1.2. Rather than create the connection in the web UI (afaik,
> > > impractical to do programmatically), I'd like to use an
> > > "AIRFLOW_CONN_"-style env variable. I've tried an url like
> > > s3://[access_key_id]:[secret_key]@[bucket].s3-[region].amazonaws.com,
> > but
> > > that hasn't worked:
> > >
> > > =
> > > [2016-06-15 21:40:26,583] {base_hook.py:53} INFO - Using connection to:
> > > [bucket].s3-us-east-1.amazonaws.com <
> http://s3-us-east-1.amazonaws.com/>
> > >
> > > [2016-06-15 21:40:26,583] {logging.py:57} ERROR - Could not create an
> > > S3Hook with connection id "S3_LOGS". Please make sure that airflow[s3]
> is
> > > installed and the S3 connection exists.
> > >
> > > =
> > >
> > > It's clear that my connection exists because of the "Using connection
> > to:"
> > > line. However, I fear that my connection URI string is malformed. Can
> you
> > > provide some guidance as to how I might properly form an s3 connection
> > URI,
> > > since I mainly followed a mixture of wikipedia's URI format
> > > <https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Examples>
> > > and amazon's
>

Re: Iterating task xcom keys inside template

2016-06-15 Thread Jeremiah Lowin
Hi Or,

There is support for pulling multiple XComs at once (see XCom.get_many())
but it won't be directly available inside a jinja template, where you
are typically calling ti.xcom_pull() and returning a single XCom at a time,
using a key you know in advance.

I would suggest that you do this processing either inside your existing
PythonOperator or in a subsequent PythonOperator that calls XCom.get_many()
and produces the final template for your email operator. It is
theoretically possible for you to do this in your email template using
template macros, but I only have a basic knowledge of that (and it's a
convoluted process) so I don't want to give you bad advice. If you can do
your preprocessing in your PythonOperator I think you will be happier,
especially because the result will be visible in the Airflow UI and logs.
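
A minimal sketch of that second approach (task ids, keys, and the HTML layout
are all hypothetical):

```python
import json

def build_report(**context):
    ti = context['ti']
    raw = ti.xcom_pull(task_ids='data_validation_tests', key='tests_results')
    results = json.loads(raw) if isinstance(raw, str) else raw
    # Render one row per test; the returned string becomes this task's XCom,
    # which the downstream email operator can pull into its template.
    return ''.join('<tr><td>{}</td><td>{}</td></tr>'.format(k, v)
                   for k, v in results.items())
```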

Best,
Jeremiah

On Wed, Jun 15, 2016 at 9:49 AM Or Sher  wrote:

> Hi all,
>
> For the purpose of data validation tests I have two adjacent operators:
> Python - Parsing predefined json of tests and saves the result as an xcom
> where each key is a test id.
>
> Email - I'd like to be able to iterate over all keys from the previous task
> inside the predefined email template so I won't need to change it every
> time I change the tests json.
>
> It seems like it's not supported. Is that right?
> What are my choices?
> If I won't find anything else, I assume I can always use one key for all of
> the tests combined into one json.
>


Re: Getting to an Apache release

2016-06-14 Thread Jeremiah Lowin
Related to #2, I am dusting off AIRFLOW-31 (
https://github.com/apache/incubator-airflow/pull/1272), which also includes
about 50 license headers. I rebased it today and am waiting for travis to
finish...


On Tue, Jun 14, 2016 at 2:15 PM Bolke de Bruin  wrote:

> Nice! Why don’t we also merge 1 into master then and fix issues from
> there? It will get wider exposure (We run master in our dev environments
> for example), it won’t put everything into the lab or Airbnb. Master hardly
> should be considered production so imho it is allowed some time to settle
> before it is stable again.
>
> Does someone want to pick up 2 say by the end of week or next week to fix
> all headers?
>
> - Bolke
>
>
> > Op 14 jun. 2016, om 20:01 heeft Maxime Beauchemin <
> maximebeauche...@gmail.com> het volgende geschreven:
> >
> > I think (1) is good to go. I can cherry pick it into our production to
> make
> > sure it's ready for release.
> >
> > (2) should be fine, git should be able to merge easily, otherwise it's
> > super easy to resolve any conflicts
> >
> > Max
> >
> > On Tue, Jun 14, 2016 at 9:04 AM, Chris Riccomini 
> > wrote:
> >
> >> Hey Bolke,
> >>
> >> I think your list is good. (1) is what I'm most concerned about, as it
> >> requires actually touching the code, and is blocking on graduation. I
> >> *think* Max had a partial PR on that, but don't know the current state.
> >>
> >> Re: (2), agree. Should just do a bulk PR for it.
> >>
> >> Cheers,
> >> Chris
> >>
> >> On Tue, Jun 14, 2016 at 8:41 AM, Bolke de Bruin 
> wrote:
> >>
> >>> Hi,
> >>>
> >>> I am wondering what needs to be done to get to an Apache release? I
> think
> >>> now 1.7.1.3 is out we should be focused on getting one out as we are
> kind
> >>> of half way the incubation process. What comes to my mind is:
> >>>
> >>> 1. Replace highcharts by D3 (WIP:
> >>> https://github.com/apache/incubator-airflow/pull/1469)
> >>> 2. Add license headers everywhere (TM) (Sucks, as it will break many
> PRs
> >> -
> >>> but lets do it quickly)
> >>> 3. Have a review by Apache
> >>>
> >>> Anything I am missing?
> >>>
> >>> - Bolke
> >>
>
>


Re: s3 logging issues

2016-06-10 Thread Jeremiah Lowin
Jason, I just set up a fresh Airflow installation and was able to configure
remote logging via S3 connection. Could you please try the following:

1. Create the S3 connection as you described
2. Open python and run:
import airflow
s3 = airflow.hooks.S3Hook('s3_conn')
s3.load_string('test', airflow.conf.get('core',
'remote_base_log_folder'))
3. Confirm whether the test file is created?

Thanks,
Jeremiah




On Fri, Jun 10, 2016 at 5:39 AM Sumit Maheshwari 
wrote:

> Does your bucket name have dots (.)?
>
>
>
> On Fri, Jun 10, 2016 at 7:14 AM, Jeremiah Lowin  wrote:
>
> > Jason,
> >
> > I will try to figure this out for you tomorrow.
> >
> > Jeremiah
> >
> > On Thu, Jun 9, 2016 at 12:20 PM Jason Kromm 
> > wrote:
> >
> > > Hi all,
> > >
> > > I've read the docs, I've looked at the source for s3Hook, and I've
> brute
> > > forced different combinations in my config file but I just can't get
> this
> > > working so I'm hoping someone can give me some insight.
> > >
> > > I need my logs to go into s3 storage, but no matter how I configure it
> my
> > > s3 bucket remains empty, and I never see any errors in my airflow
> > scheduler
> > > or workers regarding issues connection, or that it's even attempted to
> > > connect.
> > >
> > > Pip install airflow[s3]
> > >
> > > Set up a connection in web UI called s3_conn, with type S3, and extra
> set
> > > to {"aws_access_key_id": "mykey", "aws_secret_access_key": "mykey"}
> > >
> > > In airflow config I set the following
> > > remote_base_log_folder = s3://bucket/  (I've tried with and without the
> > > trailing slash)
> > > remote_log_conn_id = s3_conn
> > > encrypt_s3_logs = False
> > >
> > > Is there some other step that I'm missing?
> > >
> > > Thanks,
> > >
> > > Jason Kromm
> > > This email and any attachments may contain confidential and proprietary
> > > information of Blackboard that is for the sole use of the intended
> > > recipient. If you are not the intended recipient, disclosure, copying,
> > > re-distribution or other use of any of this information is strictly
> > > prohibited. Please immediately notify the sender and delete this
> > > transmission if you received this email in error.
> > >
> >
>


Re: s3 logging issues

2016-06-09 Thread Jeremiah Lowin
Jason,

I will try to figure this out for you tomorrow.

Jeremiah

On Thu, Jun 9, 2016 at 12:20 PM Jason Kromm 
wrote:

> Hi all,
>
> I've read the docs, I've looked at the source for s3Hook, and I've brute
> forced different combinations in my config file but I just can't get this
> working so I'm hoping someone can give me some insight.
>
> I need my logs to go into s3 storage, but no matter how I configure it my
> s3 bucket remains empty, and I never see any errors in my airflow scheduler
> or workers regarding issues connection, or that it's even attempted to
> connect.
>
> Pip install airflow[s3]
>
> Set up a connection in web UI called s3_conn, with type S3, and extra set
> to {"aws_access_key_id": "mykey", "aws_secret_access_key": "mykey"}
>
> In airflow config I set the following
> remote_base_log_folder = s3://bucket/  (I've tried with and without the
> trailing slash)
> remote_log_conn_id = s3_conn
> encrypt_s3_logs = False
>
> Is there some other step that I'm missing?
>
> Thanks,
>
> Jason Kromm
> This email and any attachments may contain confidential and proprietary
> information of Blackboard that is for the sole use of the intended
> recipient. If you are not the intended recipient, disclosure, copying,
> re-distribution or other use of any of this information is strictly
> prohibited. Please immediately notify the sender and delete this
> transmission if you received this email in error.
>


Re: Two different scheduling intervals

2016-06-05 Thread Jeremiah Lowin
It would be useful to have a DAG.copy() method; then you could duplicate it
and change the schedule interval of the copy. I can't remember if there is
a copy() method though -- but I think there's a subdag method which
effectively does the same thing.
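
In the meantime, a common workaround is a small factory function that builds
the same task graph under two dag_ids and schedules. A hedged sketch (names
and import paths may need adjusting for your version):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

def make_dag(dag_id, schedule_interval):
    dag = DAG(dag_id, start_date=datetime(2016, 1, 1),
              schedule_interval=schedule_interval)
    extract = DummyOperator(task_id='extract', dag=dag)
    report = DummyOperator(task_id='report', dag=dag)
    extract.set_downstream(report)
    return dag

# Both objects must live at module level so the scheduler picks them up.
weekly_dag = make_dag('my_report_weekly', '@weekly')
monthly_dag = make_dag('my_report_monthly', '@monthly')
```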

On Sun, Jun 5, 2016 at 2:17 PM  wrote:

> Hi,
> Is it possible to define a DAG with two different scheduling intervals?
> I have a DAG with 3 tasks defined in it, the schedule interval is
> "@weekly" but I also want to run it every month, without duplicate the
> entire DAG/code.
> Thanks


Re: PEP8, Linting and Code Smells

2016-06-02 Thread Jeremiah Lowin
+2 I hate % so much!

On Thu, Jun 2, 2016 at 9:23 PM Chris Riccomini 
wrote:

> @Maxime +1.. % based really annoys me.
>
> On Thu, Jun 2, 2016 at 6:13 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
> > Side note I think I want to remove  "Smell Use % formatting in logging
> > functions but pass the % parameters as
> > arguments". `%` formatting is outdated and folks are moving away from it
> in
> > favor of "".format() .
> >
> > On Thu, Jun 2, 2016 at 5:21 PM, Chris Riccomini 
> > wrote:
> >
> > > The top 6 all seem totally do-able, and shouldn't be disruptive at all,
> > > provided small PRs and fast merges.
> > >
> > > 204 Style Line too long
> > > 159 Style Import statements should be the first statements in a module
> > > 91 Smell Unused argument
> > > 76 Smell Use % formatting in logging functions but pass the % parameters
> > > as arguments
> > > 66 Style Variable in function should be lowercase
> > > 66 Smell Unused variable
> > >
> > >
> > > On Thu, Jun 2, 2016 at 5:03 PM, Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > > > This is definitely useful! I'd love to get in the 95%+ range. We may
> > also
> > > > ease some of the rules eventually (it's easy to control in
> > > .landscape.yml)
> > > > but raw PEP8 is fine by me, though I allowed for up to 90 char per
> > line,
> > > so
> > > > that we can use judgment there.
> > > >
> > > > Maybe the best approach is to start with "errors" where the solutions
> > are
> > > > clear, and then proceed by a rule centric approach
> > > > ,
> > it
> > > > makes the PRs easier to review. PRs with maybe 50 to 200 line edits
> > seem
> > > > reasonable.
> > > >
> > > > All you need is a great soundtrack. I find doing that kind of
> mindless
> > > > repetitive work pretty enjoyable and relaxing.
> > > >
> > > > Max
> > > >
> > > > On Thu, Jun 2, 2016 at 4:04 PM, Chris Riccomini <
> criccom...@apache.org
> > >
> > > > wrote:
> > > >
> > > > > @Paul, my opinion is that this is worth while as long as it isn't
> > > > > disruptive. I think the way to keep it from being disruptive is:
> > > > >
> > > > > 1. Small PRs
> > > > > 2. Only fix the issues that don't require refactoring (e.g. nit
> > picking
> > > > on
> > > > > line length, spacing, etc)
> > > > >
> > > > > On Thu, Jun 2, 2016 at 4:01 PM, Paul Rhodes 
> > > > wrote:
> > > > >
> > > > > > http://landscape.io is already doing this for the project - it's
> > > > already
> > > > > > integrated with travis and run for every commit/PR
> > > > > >
> > > > > >
> > > > > > I was more concerned about improving the code quality - we
> already
> > > have
> > > > > > the metrics.
> > > > > >
> > > > > >
> > > > > > 
> > > > > > From: Bence Nagy 
> > > > > > Sent: 02 June 2016 23:52
> > > > > > To: Airflow Dev
> > > > > > Subject: Re: PEP8, Linting and Code Smells
> > > > > >
> > > > > > Hey,
> > > > > >
> > > > > > Please check out http://coala-analyzer.org for static analysis,
> > it's
> > > > > > awesome! http://gitmate.com is able to run these checks
> > > automatically
> > > > > for
> > > > > > PRs. See https://github.com/coala-analyzer/coala/pull/2220 for a
> > > live
> > > > > > example (hover over the green checkmark next to the commit hash).
> > > > > >
> > > > > > I'd love to see as many things copied from coala's CI system as
> > > > possible
> > > > > -
> > > > > > they even have automatic pypi prerelease version deployments for
> > each
> > > > > > merge!
> > > > > >
> > > > > > On Thu, Jun 2, 2016 at 2:46 PM Paul Rhodes  >
> > > > wrote:
> > > > > >
> > > > > > > Hi
> > > > > > >
> > > > > > >
> > > > > > > I've asked a few of the committers about this but thought I'd
> > ping
> > > > this
> > > > > > to
> > > > > > > the list.
> > > > > > >
> > > > > > >
> > > > > > > We've integrated landscape.io to do automated code checks and
> as
> > > of
> > > > > > right
> > > > > > > now it's flagging 8 Errors, 416 Smells and 784 Style alerts.
> I've
> > > > had a
> > > > > > > look at some of these and thought I might pick off some of the
> > > lower
> > > > > > > hanging fruit in this area, but I'd just like to understand if
> > > > > improving
> > > > > > > the static code scores is seen to be of value right now...
> > > > > > >
> > > > > > >
> > > > > > > I'd like to see the scores improve, but like the testing and
> the
> > > > > changes
> > > > > > > to model.py, it's a big job which touches a signifi

Re: Splitting models.py

2016-06-02 Thread Jeremiah Lowin
That all makes sense to me. I knew I wouldn't win any popularity contests
by bringing this up but I think we should keep it in mind in case it
becomes more feasible for any reason.

On Thu, Jun 2, 2016 at 3:21 PM Chris Riccomini 
wrote:

> I agree with what Arthur/Maxime said.
>
> On Thu, Jun 2, 2016 at 11:32 AM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
> > I think it's possible to do it without breaking the API by frontloading
> > things in `models/__init__.py`, but there are way so many things in
> flight
> > at the moment that to me this effort qualifies as low-ish value and high
> > complexity.
> >
> > Cycles spent refactoring the tests would be a much better investment from
> > my perspective.
> >
> > Max
> >
> > On Thu, Jun 2, 2016 at 9:54 AM, Arthur Wiedmer 
> wrote:
> >
> > > A final observation, since this may include breaking changes, should we
> > > target these large refactors for 2.0 rather than 1.8?
> > >
> > > I agree that these are important changes, but maybe getting our feet
> wet
> > > with an apache release or two to settle on a release process (+ testing
> > > infrastructure less dependent on Airbnb) before breaking too much stuff
> > > might not be a bad idea.
> > >
> > >  Best,
> > > Arthur
> > >
> > > On Thu, Jun 2, 2016 at 9:50 AM, Arthur Wiedmer 
> > wrote:
> > >
> > > > I'd love to do it. Actually this + refactoring the core.py tests
> would
> > be
> > > > amazing.
> > > >
> > > > But the amount of havok to fix stuff afterwards, including temporary
> > > > compatibility adjustments would require maybe a temporary lock of
> quiet
> > > > time on the models. It is hard to catch all of the added changes in
> the
> > > > rebases.
> > > >
> > > > Should we merge the remaining few Scheduler PRs first and then do the
> > > > refactor?
> > > >
> > > > Best,
> > > > Arthur
> > > >
> > > > On Thu, Jun 2, 2016 at 6:03 AM, Jeremiah Lowin 
> > > wrote:
> > > >
> > > >> Models.py is becoming monolithic. We've discussed refactoring it
> many
> > > >> times. No way around this: refactoring it will suck. It will break
> PRs
> > > and
> > > >> require rebases. It will make it impossible to see diffs.
> > > >>
> > > >> On the other hand it will make future changes much more manageable.
> It
> > > >> will
> > > >> implicitly address concerns about PRs that touch "core" areas
> because
> > > >> we'll
> > > >> be able to see if "dags.py" is altered, as opposed to "xcom.py". It
> > will
> > > >> make the codebase more digestible and clear.
> > > >>
> > > >> I'm not exactly lining up to champion this but it will only get
> harder
> > > and
> > > >> harder to do so I want to raise the issue to the list...
> > > >>
> > > >> J
> > > >>
> > > >
> > > >
> > >
> >
>


Splitting models.py

2016-06-02 Thread Jeremiah Lowin
Models.py is becoming monolithic. We've discussed refactoring it many
times. No way around this: refactoring it will suck. It will break PRs and
require rebases. It will make it impossible to see diffs.

On the other hand it will make future changes much more manageable. It will
implicitly address concerns about PRs that touch "core" areas because we'll
be able to see if "dags.py" is altered, as opposed to "xcom.py". It will
make the codebase more digestible and clear.

I'm not exactly lining up to champion this but it will only get harder and
harder to do so I want to raise the issue to the list...

J


Re: Airflow Contributors Meeting (June 01, 2016) : Minutes

2016-06-02 Thread Jeremiah Lowin
Thank you Sid!

I would like to float an idea. This is not even half baked... just to
prompt discussion!

One of the big frictions is that Airbnb carries a disproportionate share of
the burden of testing releases, and I believe that's largely for both
historical and inertial reasons. We want to bring more companies into the
release testing loop. However that's not without its own set of issues. The
primary one is that if a bug is discovered, either the company that
discovered it must fix it privately on their own infrastructure OR they
must create a simple, replicable example so the problem can be fixed in the
open. Neither option is appealing, as Airbnb is experiencing today.

So I'd like to float the idea of building a DAG sanitization tool (or DAG
mock tool). This tool would read in a DAG and spit out a "dummy" version of
the same DAG. Dependencies, schedules, triggers would all be maintained but
names and operators would be anonymized.

What I'm trying to do is separate "Airflow" from "Things Built With
Airflow". If my DAG fails but my sanitized DAG runs, then the fault is
probably my own (maybe my Python code is broken). However, if the sanitized
DAG fails, then the fault is certainly Airflow's. Sanitized DAGs could be
shared with the community since they would have no identifying marks and
wouldn't actually do anything.

Complications (there are many):
- What should Operators be replaced with. DummyOperators? Maybe the "base"
Airflow Operators also implement sanitized versions of themselves.
- XComs (and any other objects keyed by strings) -- how they should be
anonymized?

Food for thought...
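
To make the idea slightly less abstract, here is a rough, untested sketch of
the kind of transformation I have in mind (import paths and naming are
guesses, not a worked-out design):

```python
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

def sanitize_dag(dag):
    # Same schedule and dependency structure, but every task becomes an
    # anonymously named DummyOperator so nothing proprietary leaks out.
    clean = DAG('sanitized_dag',
                start_date=dag.start_date,
                schedule_interval=dag.schedule_interval)
    anonymous = {}
    for i, task in enumerate(dag.tasks):
        anonymous[task.task_id] = DummyOperator(task_id='task_{}'.format(i),
                                                dag=clean)
    for task in dag.tasks:
        for downstream_id in task.downstream_task_ids:
            anonymous[task.task_id].set_downstream(anonymous[downstream_id])
    return clean
```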

J


On Wed, Jun 1, 2016 at 6:33 PM siddharth anand  wrote:

>  Hi Folks!
> We held our first contributor meeting this morning. I was about 20 minutes
> late, but did ask others in attendance for their input before compiling
> these minutes.
>
> *Agenda* :
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/Announcements#Announcements-May27,2016
>
> *Outcomes*:
>
>- We need better and more test coverage
>   - Committers should ask PR authors to include tests when possible.
>   There may be some exceptions to this : e.g. google cloud storage,
> etc...
>   where it is difficult to stub out or mock storage
>   - End-to-end dag testing with a corpus of test dags
>  - Max, you have a PR (to approve) in this regard
>   - A reiteration of already ratified rules:
>   - Committers should follow the instructions outlined on Committers'
>   Guide
>- A few of the non-Airbnb committers will drive the next release,
>including baking release candidates in our own production and
>pre-production environments
>   - Currently, Sid, Bolke, and Chris voiced interest in driving this,
>   but all from the community are welcome to help with release candidate
>   certification
>- Working collaboratively as a community
>   - Airbnb's roadmap for Airflow does not appear to be public
>  - https://cwiki.apache.org/confluence/display/AIRFLOW/Roadmap
>  - Large PRs do not
>   - For large PRs, first put up and socialze a design document
>   - Authors of PRs should seek out the right committers for PR reviews
>   - Leverage the dev list for conversations
>
> -s
>
>


Re: Merging #1514 AIRFLOW-128

2016-06-01 Thread Jeremiah Lowin
Just to be clear this is a highly unlikely event. I used to have a unit
test for it but got rid of it when we closed bugs that made it impossible
to cause such a crash deterministically. So this situation is possible but
almost certainly won't manifest.

On Wed, Jun 1, 2016 at 4:00 AM Bolke de Bruin  wrote:

> Hey,
>
> This is to give a heads up that I am planning to merge #1514, the refactor
> of process_dag, today. This is the second step in executing on the
> scheduler roadmap. It has been running in our production for a week now
> with no functional differences. Scheduler loop times start a bit higher,
> but have a lower max. The number of connections to the database is around 1/3 of
> the previous scheduler (test dag went from 150 connections to 50). Database
> load slightly lower.
>
> While fixing many issues (race conditions), a corner case mentioned by
> Jeremiah is now present. A TI is sent in SCHEDULED state to the executor.
> The executor fails in loading the TI then the TI might be orphaned forever.
> As fixing the corner case will require further fundamental changes we
> discussed it should be addressed in a follow up patch.
>
> My planned next steps are 1) reduce scheduler loop time to around 1s by
> making task reporting “event driven”. 2) auto-align start date 3) add
> notion of “previous” to dagrun 4) fix corner case mentioned above.
>
> - Bolke
>
>
>
>


Re: Using xcom with web APIs that return JSON

2016-05-31 Thread Jeremiah Lowin
Mike, I think you'll need to overwrite the operator classes. Here's a
sketch to get you started, you may find more efficient ways of doing this:

```python
class MyHttpOperator(HttpOperator):

    def execute(self, context):
        # Code from parent class
        http = HttpHook(self.method, http_conn_id=self.http_conn_id)
        logging.info("Calling HTTP method")
        response = http.run(self.endpoint,
                            self.data,
                            self.headers,
                            self.extra_options)
        if self.response_check:
            if not self.response_check(response):
                raise AirflowException("Response check returned False.")

        # NEW CODE BELOW
        # to push the response explicitly (xcom_push needs the context)
        self.xcom_push(context, key='json_data', value=response)
        # OR just return the response (stored under the default XCom key)
        return response


class MyHttpSensor(HttpSensor):

    def poke(self, context):
        # if you called xcom_push, then provide the key;
        # if you returned a value, then provide the task_id
        json_data = self.xcom_pull(
            context,
            key='json_data',
            task_ids='my_http_operator_id')  # placeholder for the upstream task id

        # Then copy/paste poke() code from parent class here,
        # modified to use json_data in the query
```

On Thu, May 26, 2016 at 12:06 PM Michael England 
wrote:

> Hi,
> I am having some trouble working with the xcom functionality in the
> HttpOperators.
> I am trying to use Airflow to call a web API to schedule a task
> asynchronously, returning some JSON data, and then based on the value of
> one of the JSON fields I would like to use an HttpSensor to query an end
> point until the task is complete.
>
> Does anyone have any examples of how this can be done?  The documentation
> is focused on BashOperators and PythonOperators.
>
> Thanks in advance!Mike


Re: question

2016-05-31 Thread Jeremiah Lowin
Generally, sensors block ("running") until their success criteria are met.
They fail if they time out.
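
For context, a bare-bones illustration of the sensor contract (the class and
import path below are a sketch, not the actual WebHdfsSensor): poke() is
called repeatedly; returning True marks the task successful, and returning
False keeps it running until the timeout fails it.

```python
import os
from airflow.operators.sensors import BaseSensorOperator

class FileCountSensor(BaseSensorOperator):
    """Succeeds as soon as the local directory contains at least one file."""

    def __init__(self, path, *args, **kwargs):
        super(FileCountSensor, self).__init__(*args, **kwargs)
        self.path = path

    def poke(self, context):
        files = os.listdir(self.path) if os.path.isdir(self.path) else []
        return len(files) > 0
```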

On Fri, May 27, 2016 at 4:51 PM Griffin Thornton 
wrote:

> hello,
> I was wondering if sensors, specifically the web hdfs sensor succeed if
> there are any files in the file path or if they are supposed to fail until
> new files are submitted?
>
> Thanks,
> Griffin Thornton
>


Re: CI failing due to Hadoop?

2016-05-26 Thread Jeremiah Lowin
Thanks Bolke!

On Thu, May 26, 2016 at 4:50 PM Bolke de Bruin  wrote:

> It is fixed now by downloading Hadoop again without checking the cache of
> the unpacking fails the first time.
>
> -b
>
> Sent from my iPhone
>
> > On 26 mei 2016, at 00:19, Chris Riccomini  wrote:
> >
> > According to:
> >
> > https://docs.travis-ci.com/user/caching/
> >
> > We can disable the cache permanently:
> >
> > You can explicitly disable all caching by setting the cache option to
> false in
> > your *.travis.yml*:
> >
> > cache: false
> >
> > It is also possible to disable a single caching mode:
> >
> > cache:
> >  bundler: false
> >  pip: true
> >
> >
> >> On Wed, May 25, 2016 at 12:35 PM, Bolke de Bruin 
> wrote:
> >>
> >> I will have a look (I created it), but maybe the cache is invalid and as
> >> we don't have admin privileges we cannot invalidate the cache...
> >>
> >> My personal repos build fine.
> >>
> >> Sent from my iPhone
> >>
> >>> On 25 mei 2016, at 16:20, Jeremiah Lowin  wrote:
> >>>
> >>> A large number of Travis runs are failing due to what looks like errors
> >>> setting up a Hadoop environment. For example:
> >>> https://travis-ci.org/apache/incubator-airflow/jobs/131617413 (scroll
> to
> >>> bottom). It looks like the first error is:
> >>>
> >>> mkdir -p /user/hive/warehouse
> >>>
> >>> mkdir: cannot create directory `/user': Permission denied
> >>>
> >>> Bizarrely, it doesn't happen every time. Here's a run from the same PR
> >>> which completed setup (and failed for a different reason):
> >>> https://travis-ci.org/apache/incubator-airflow/jobs/131617415
> >>>
> >>> Is anyone familiar with this Hadoop setup? I assume it's related to
> >> testing
> >>> HiveOperators?
> >>
>


Re: Commit guidelines

2016-05-26 Thread Jeremiah Lowin
@Chris to be clear, what workflow would you like to see? I think trying to
do this without a squash commit (in other words editing individual commit
messages) could get messy since it would require managing a rebase through
the PR tool...

On Thu, May 26, 2016 at 11:36 AM Jeremiah Lowin  wrote:

> Yes it should be able to. Currently, if you tell the PR tool to squash
> commits, it reattributes the author properly. So it should just be a matter
> of adding a prompt for a new commit message. I will work on it.
>
>
> On Wed, May 25, 2016 at 7:30 PM Chris Riccomini 
> wrote:
>
>> Also, @Jeremiah, is it possible to make the airflow-pr tool allow us to
>> change commit messages? I couldn't figure out a way to do this without
>> affecting the git author, which removes attribution. It's kind of a pain
>> to
>> keep having to nag contributors to follow the guidelines.
>>
>> On Wed, May 25, 2016 at 3:30 PM, Chris Riccomini 
>> wrote:
>>
>> > @Bolke, thanks for bringing this up.
>> >
>> > I wonder if it's possible to get a commit hook on our Apache repo to
>> > prevent merges that don't follow at least some of the guidelines (e.g.
>> > starts with [AIRFLOW-XXX], has a multi-line description).
>> >
>> > On Tue, May 24, 2016 at 8:55 AM, Maxime Beauchemin <
>> > maximebeauche...@gmail.com> wrote:
>> >
>> >> There's also been some unapproved PRs that have been rush-merged. If
>> you
>> >> feel a sense of urgency towards a PR making it in master or in a
>> release,
>> >> that's a sign that you need to run your build off of a fork, where
>> you're
>> >> free to cherry pick any change you fancy.
>> >>
>> >> It's actually a positive thing to have your changes running in your
>> >> production prior to being merged, as it distributes the risk (as
>> opposed to
>> >> having all new code getting productionized at Airbnb).
>> >>
>> >> Maxime
>> >>
>> >> On Tue, May 24, 2016 at 12:52 AM, Bolke de Bruin 
>> >> wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > I noticed that we started to slack a little in the commit messages.
>> >> These
>> >> > are the last commits excluding merges:
>> >> >
>> >> > aedb667 Make enhancements to VersionView
>> >> > 0b3d101 [AIRFLOW-52] 1.7.1 version bump and changelog
>> >> > 16740dd Add Kiwi.com as a user to README
>> >> > 4b78e1a [AIRFLOW-143] setup_env.sh doesn't leverage cache for
>> >> downloading
>> >> > minicluster
>> >> > 8ae8681 Increasing License Coverage
>> >> > 7d32c17 Add a version view to display airflow version info
>> >> > 4b25a7d [AIRFLOW-125] Add file to GCS operator
>> >> > af43db5 [AIRFLOW-86] Wrap dict.items() in list for Py3 compatibility
>> >> > f01854a Adding Nerdwallet to the list of Currently officially using
>> >> > Airflow:
>> >> > 843a22f [AIRFLOW-127] Makes filter_by_owner aware of multi-owner DAG
>> >> >
>> >> > Only one of those commits contains a description (4b25a7d). Only 4
>> out
>> >> of
>> >> > 10 start with an imperative and also only 4 out of 10 have a Jira
>> >> attached
>> >> > to them. I have no clue what “make enhancements to versionview” will
>> do
>> >> or
>> >> > "setup_env.sh doesn't leverage cache for downloading minicluster”.
>> >> >
>> >> > If we are to collaborate in a consensus model and trust each other to
>> >> have
>> >> > good commits I think being able to use "git log” and actually
>> understand
>> >> > why (a what will be supplied by the diff) a change has been made is
>> >> key. "A
>> >> > project's long-term success rests (among other things) on its
>> >> > maintainability and a maintainer has few tools more powerful than his
>> >> > project's log.”. If you are not aware what composes good commits
>> please
>> >> > read http://chris.beams.io/posts/git-commit/ , it is a really good
>> article.
>> >> >
>> >> > Thanks!
>> >> > Bolke
>> >> >
>> >> >
>> >> >
>> >>
>> >
>> >
>>
>


Re: Commit guidelines

2016-05-26 Thread Jeremiah Lowin
I don't think the squash will prompt you for a new message -- what you
could do as a workaround is use the PR tool to squash, then edit the commit
message when the PR tool pauses and asks you if you want to merge to
Apache. Once you edit the message, tell the PR tool to resume.
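
For reference (a sketch, assuming a standard git workflow on the tool's
working branch), while the tool is paused you can rewrite the message of the
freshly squashed commit with `git commit --amend`, then tell the tool to
resume.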

I'll have a look at updating it when I have a moment

On Thu, May 26, 2016 at 11:47 AM Chris Riccomini 
wrote:

> Hey Jeremiah,
>
> I don't really have a specific flow in mind, so I'm OK with doing a squash
> commit. I haven't tried the squash commit feature yet, since all of the PRs
> I've merged have already been squashed. If I do a squash commit, can I just
> set the message and that's the end of it?
>
> Cheers,
> Chris
>
> On Thu, May 26, 2016 at 8:44 AM, Jeremiah Lowin  wrote:
>
>> @Chris to be clear, what workflow would you like to see? I think trying
>> to do this without a squash commit (in other words editing individual
>> commit messages) could get messy since it would require managing a rebase
>> through the PR tool...
>>
>> On Thu, May 26, 2016 at 11:36 AM Jeremiah Lowin 
>> wrote:
>>
>>> Yes it should be able to. Currently, if you tell the PR tool to squash
>>> commits, it reattributes the author properly. So it should just be a matter
>>> of adding a prompt for a new commit message. I will work on it.
>>>
>>>
>>> On Wed, May 25, 2016 at 7:30 PM Chris Riccomini 
>>> wrote:
>>>
>>>> Also, @Jeremiah, is it possible to make the airflow-pr tool allow us to
>>>> change commit messages? I couldn't figure out a way to do this without
>>>> affecting the git author, which removes attribution. It's kind of a
>>>> pain to
>>>> keep having to nag contributors to follow the guidelines.
>>>>
>>>> On Wed, May 25, 2016 at 3:30 PM, Chris Riccomini >>> >
>>>> wrote:
>>>>
>>>> > @Bolke, thanks for bringing this up.
>>>> >
>>>> > I wonder if it's possible to get a commit hook on our Apache repo to
>>>> > prevent merges that don't follow at least some of the guidelines (e.g.
>>>> > starts with [AIRFLOW-XXX], has a multi-line description).
>>>> >
>>>> > On Tue, May 24, 2016 at 8:55 AM, Maxime Beauchemin <
>>>> > maximebeauche...@gmail.com> wrote:
>>>> >
>>>> >> There's also been some unapproved PRs that have been rush-merged. If
>>>> you
>>>> >> feel a sense of urgency towards a PR making it in master or in a
>>>> release,
>>>> >> that's a sign that you need to run your build off of a fork, where
>>>> you're
>>>> >> free to cherry pick any change you fancy.
>>>> >>
>>>> >> It's actually a positive thing to have your changes running in your
>>>> >> production prior to being merged, as it distributes the risk (as
>>>> opposed to
>>>> >> having all new code getting productionized at Airbnb).
>>>> >>
>>>> >> Maxime
>>>> >>
>>>> >> On Tue, May 24, 2016 at 12:52 AM, Bolke de Bruin 
>>>> >> wrote:
>>>> >>
>>>> >> > Hi,
>>>> >> >
>>>> >> > I noticed that we started to slack a little in the commit messages.
>>>> >> These
>>>> >> > are the last commits excluding merges:
>>>> >> >
>>>> >> > aedb667 Make enhancements to VersionView
>>>> >> > 0b3d101 [AIRFLOW-52] 1.7.1 version bump and changelog
>>>> >> > 16740dd Add Kiwi.com as a user to README
>>>> >> > 4b78e1a [AIRFLOW-143] setup_env.sh doesn't leverage cache for
>>>> >> downloading
>>>> >> > minicluster
>>>> >> > 8ae8681 Increasing License Coverage
>>>> >> > 7d32c17 Add a version view to display airflow version info
>>>> >> > 4b25a7d [AIRFLOW-125] Add file to GCS operator
>>>> >> > af43db5 [AIRFLOW-86] Wrap dict.items() in list for Py3
>>>> compatibility
>>>> >> > f01854a Adding Nerdwallet to the list of Currently officially using
>>>> >> > Airflow:
>>>> >> > 843a22f [AIRFLOW-127] Makes filter_by_owner aware of multi-owner
>>>> DAG
>>>> >> >
>>>> >> > Only one of those commits contains a description (4b25a7d). Only 4
>>>> out
>>>> >> of
>>>> >> > 10 start with an imperative and also only 4 out of 10 have a Jira
>>>> >> attached
>>>> >> > to them. I have no clue what “make enhancements to versionview”
>>>> will do
>>>> >> or
>>>> >> > "setup_env.sh doesn't leverage cache for downloading minicluster”.
>>>> >> >
>>>> >> > If we are to collaborate in a consensus model and trust each other
>>>> to
>>>> >> have
>>>> >> > good commits I think being able to use "git log” and actually
>>>> understand
>>>> >> > why (a what will be supplied by the diff) a change has been made is
>>>> >> key. "A
>>>> >> > project's long-term success rests (among other things) on its
>>>> >> > maintainability and a maintainer has few tools more powerful than
>>>> his
>>>> >> > project's log.”. If you are not aware what composes good commits
>>>> please
>>>> >> > read http://chris.beams.io/posts/git-commit/ , it is a really good
>>>> article.
>>>> >> >
>>>> >> > Thanks!
>>>> >> > Bolke
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >>
>>>> >
>>>> >
>>>>
>>>
>


Re: Commit guidelines

2016-05-26 Thread Jeremiah Lowin
Yes it should be able to. Currently, if you tell the PR tool to squash
commits, it reattributes the author properly. So it should just be a matter
of adding a prompt for a new commit message. I will work on it.


On Wed, May 25, 2016 at 7:30 PM Chris Riccomini 
wrote:

> Also, @Jeremiah, is it possible to make the airflow-pr tool allow us to
> change commit messages? I couldn't figure out a way to do this without
> affecting the git author, which removes attribution. It's kind of a pain to
> keep having to nag contributors to follow the guidelines.
>
> On Wed, May 25, 2016 at 3:30 PM, Chris Riccomini 
> wrote:
>
> > @Bolke, thanks for bringing this up.
> >
> > I wonder if it's possible to get a commit hook on our Apache repo to
> > prevent merges that don't follow at least some of the guidelines (e.g.
> > starts with [AIRFLOW-XXX], has a multi-line description).
> >
> > On Tue, May 24, 2016 at 8:55 AM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> >> There's also been some unapproved PRs that have been rush-merged. If you
> >> feel a sense of urgency towards a PR making it in master or in a
> release,
> >> that's a sign that you need to run your build off of a fork, where
> you're
> >> free to cherry pick any change you fancy.
> >>
> >> It's actually a positive thing to have your changes running in your
> >> production prior to being merged, as it distributes the risk (as opposed
> to
> >> having all new code getting productionized at Airbnb).
> >>
> >> Maxime
> >>
> >> On Tue, May 24, 2016 at 12:52 AM, Bolke de Bruin 
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > I noticed that we started to slack a little in the commit messages.
> >> These
> >> > are the last commits excluding merges:
> >> >
> >> > aedb667 Make enhancements to VersionView
> >> > 0b3d101 [AIRFLOW-52] 1.7.1 version bump and changelog
> >> > 16740dd Add Kiwi.com as a user to README
> >> > 4b78e1a [AIRFLOW-143] setup_env.sh doesn't leverage cache for
> >> downloading
> >> > minicluster
> >> > 8ae8681 Increasing License Coverage
> >> > 7d32c17 Add a version view to display airflow version info
> >> > 4b25a7d [AIRFLOW-125] Add file to GCS operator
> >> > af43db5 [AIRFLOW-86] Wrap dict.items() in list for Py3 compatibility
> >> > f01854a Adding Nerdwallet to the list of Currently officially using
> >> > Airflow:
> >> > 843a22f [AIRFLOW-127] Makes filter_by_owner aware of multi-owner DAG
> >> >
> >> > Only one of those commits contains a description (4b25a7d). Only 4 out
> >> of
> >> > 10 start with an imperative and also only 4 out of 10 have a Jira
> >> attached
> >> > to them. I have no clue what “make enhancements to versionview” will
> do
> >> or
> >> > "setup_env.sh doesn't leverage cache for downloading minicluster”.
> >> >
> >> > If we are to collaborate in a consensus model and trust each other to
> >> have
> >> > good commits I think being able to use "git log” and actually
> understand
> >> > why (a what will be supplied by the diff) a change has been made is
> >> key. "A
> >> > project's long-term success rests (among other things) on its
> >> > maintainability and a maintainer has few tools more powerful than his
> >> > project's log.”. If you are not aware what composes good commits
> please
> >> > read http://chris.beams.io/posts/git-commit/ , it is a really good
> article.
> >> >
> >> > Thanks!
> >> > Bolke
> >> >
> >> >
> >> >
> >>
> >
> >
>


Re: Get the date into a dag?

2016-05-26 Thread Jeremiah Lowin
Lance,

If you set the "template_fields" attribute on your operator, you can make it
run any field through Jinja. See how BashOperator handles bash_command for an
example.
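
As a sketch (the operator and field names below are made up; only the
`template_fields` class attribute and the BashOperator precedent come from
Airflow itself, and import paths assume a 1.x-era layout):

```python
# Sketch of a home-made operator that opts one of its fields into Jinja
# templating, mirroring how BashOperator templates bash_command.
import logging

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class MyCommandOperator(BaseOperator):
    # Any attribute listed here is rendered with Jinja before execute() runs,
    # so macros like {{ ds }} or {{ execution_date }} become concrete values.
    template_fields = ('my_command',)

    @apply_defaults
    def __init__(self, my_command, *args, **kwargs):
        super(MyCommandOperator, self).__init__(*args, **kwargs)
        self.my_command = my_command

    def execute(self, context):
        # self.my_command has already been rendered at this point.
        logging.info("Running: %s", self.my_command)


# Usage sketch; `dag` is assumed to be defined elsewhere.
# t = MyCommandOperator(task_id='t1', my_command='process {{ ds }}', dag=dag)
```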

Best,
Jeremiah

On Wed, May 25, 2016 at 6:40 PM Lance Norskog 
wrote:

> Ah! There's the problem. I'm using a home-made Operator that does not know
> how to use the templating engine, so this example does not count.
>
> It seems like I need to access the execution_date object.
>
>
> On Tue, May 24, 2016 at 11:15 PM, הילה ויזן  wrote:
>
> > a working example: {{ts}} is the execution date
> >
> > general_templated_cmd = """
> >  {{ ts }} {{ task.task_id }} {{ dag.dag_id }}
> > """
> >
> > t1 = BashOperator(
> > task_id='morning_task',
> > bash_command=general_templated_cmd,
> > dag=dag)
> >
> > On Wed, May 25, 2016 at 8:39 AM, Lance Norskog 
> > wrote:
> >
> > > My mad google skillz are failing me. Where is an example of getting the
> > > execution date into the python code in a DAG?
> > >
> > > I can't quite tell how to trigger the Jinja template engine. Is there
> an
> > > example somewhere of this?
> > >
> > > --
> > > Lance Norskog
> > > lance.nors...@gmail.com
> > > Redwood City, CA
> > >
> >
>
>
>
> --
> Lance Norskog
> lance.nors...@gmail.com
> Redwood City, CA
>


CI failing due to Hadoop?

2016-05-25 Thread Jeremiah Lowin
A large number of Travis runs are failing due to what looks like errors
setting up a Hadoop environment. For example:
https://travis-ci.org/apache/incubator-airflow/jobs/131617413 (scroll to
bottom). It looks like the first error is:

mkdir -p /user/hive/warehouse

mkdir: cannot create directory `/user': Permission denied

Bizarrely, it doesn't happen every time. Here's a run from the same PR
which completed setup (and failed for a different reason):
https://travis-ci.org/apache/incubator-airflow/jobs/131617415

Is anyone familiar with this Hadoop setup? I assume it's related to testing
HiveOperators?


Re: S3 connection

2016-05-24 Thread Jeremiah Lowin
Where are you seeing that an S3 connection is required? It will only be
accessed if you tell Airflow to send logs to S3. The config option can also
be null (the default) or a Google Storage location.

The S3 connection is a standard Airflow connection. If you would like it to
use environment variables or a boto config, it will -- but the connection
object itself must be created in Airflow. See the S3 hook for details.
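
As a hedged sketch of the environment-variable route mentioned above (the
URI fields are placeholders; the exact format depends on your Airflow version
and on how the S3 hook resolves credentials):

```python
# Sketch only: Airflow picks up connections defined as AIRFLOW_CONN_<CONN_ID>
# environment variables. The URI below uses placeholder credentials; as noted
# in this thread, a dummy value also works if you want boto to fall back to
# ~/.aws/credentials instead.
import os

os.environ['AIRFLOW_CONN_S3'] = 's3://my_access_key_id:my_secret_access_key@S3'
# or, to defer entirely to the boto credential chain:
# os.environ['AIRFLOW_CONN_S3'] = 'n/a'
```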


On Tue, May 24, 2016 at 3:57 PM George Leslie-Waksman <
geo...@cloverhealth.com> wrote:

> We ran into this issue as well. If you set the environment variable to
> anything random, it'll get ignored and control will pass through to
> .aws/credentials
>
> We used "n/a"
>
> It's kind of annoying that the s3 connection is a) required, and b) poorly
> supported as an env var.
>
> On Tue, May 24, 2016 at 8:37 AM Tyrone Hinderson 
> wrote:
>
> > I was logging to S3 in 1.7.0, but now I need to create an S3 "Connection"
> > in airflow (for remote_log_conn_id) to keep doing that in 1.7.1.2. Rather
> > than set this "S3" connection in the UI, I'd like to set an AIRFLOW_CONN_S3
> env
> > variable. What does an airflow-friendly S3 "connection string" look like?
> >
>


Re: boto vs boto3

2016-05-18 Thread Jeremiah Lowin
Just for clarity, the old boto library is fully compatible with Python 3.
"boto3" is a completely different API that AWS is migrating to.

On Wed, May 18, 2016 at 6:14 PM Chris Riccomini 
wrote:

> Hey David,
>
> This is tracked here:
>
> https://issues.apache.org/jira/browse/AIRFLOW-115
>
> You'll have to ask Arthur for the status.
>
> Cheers,
> Chris
>
> On Wed, May 18, 2016 at 3:09 PM, David Klosowski 
> wrote:
>
> > Hi Airflowers:
> >
> > I was testing using the S3KeySensor and it became clear that it's using
> > boto instead of boto3. Is there any plan to migrate to boto3 (in 2.0 for
> > instance, especially if there is full support for Python 3)?
> >
> >
> https://aws.amazon.com/blogs/aws/now-available-aws-sdk-for-python-3-boto3/
> >
> > Thanks.
> >
> > Cheers,
> > David
> >
>


PR merge script

2016-05-18 Thread Jeremiah Lowin
Dear team,

To make up for being offline for the last week, I have a present for you:
https://github.com/apache/incubator-airflow/pull/1515

I started with the script used by the Spark team (and other Apache
projects) and medium-modified it to work with Airflow. It makes it
super-simple to merge PRs and close associated JIRA issues; no more
worrying about erasing Airflow with an errant keystroke!

J


Re: Adhoc operators

2016-05-18 Thread Jeremiah Lowin
I think it is a useful feature that nonetheless adds disproportionate
complexity -- for example, is there logic for when there is a task
downstream from an ad-hoc task, and the ad-hoc task isn't being run?

Perhaps there is a way to reengineer it around current Airflow idioms.
Maybe we can start by figuring out what exactly it's being used for? Here
are a few use-cases that come to mind (under the heading of "actions that
relate to my DAG but I don't want them to run every time... just at the
very beginning or occasionally on demand"):
- initializing a database table (and making sure it exists before running
downstream tasks)
- periodic maintenance, for example pruning, truncating tables, etc.
- initial logins, connection testing, etc.
- issuing some sort of debug command to a third party system before running
the rest of the DAG
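
For reference, a sketch of what shipping such an on-demand task alongside a
DAG looks like today. Task names and import paths are illustrative; the
`adhoc=True` flag and the `--include_adhoc` backfill behavior are as
described later in this thread:

```python
# Sketch of an ad-hoc initialization task shipped inside a regular DAG.
# Per this thread: the scheduler skips tasks with adhoc=True; they only run
# via `airflow test`/`airflow run` or `airflow backfill --include_adhoc`.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('example_adhoc', start_date=datetime(2016, 1, 1))

init_table = BashOperator(
    task_id='init_table',
    bash_command='echo "CREATE TABLE IF NOT EXISTS ..."',  # placeholder
    adhoc=True,   # excluded from normal scheduling
    dag=dag)

daily_load = BashOperator(
    task_id='daily_load',
    bash_command='echo "load partition {{ ds }}"',
    dag=dag)

# This is exactly the ambiguous case raised above: a scheduled task sitting
# downstream of an ad-hoc task that the scheduler will never run.
daily_load.set_upstream(init_table)
```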

On Wed, May 18, 2016 at 11:23 AM Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> The idea there is to be able to ship on-demand tasks along with your DAG.
> Is it not used because it's not documented?
>
> Deprecating may be harder than maintaining it. We'd have to start warning
> about deprecation in 2.0 soon and add the PR that removes this in the
> [eventual] 2.0 branch.
>
> I don't feel strongly about the feature, I added it because we had use
> cases for it, and we didn't have externally triggered DAGs at the time. I
> get how it can become confusing, both from a usability and code maintenance
> perspective.
>
> Max
>
> On Tue, May 17, 2016 at 12:49 PM, Chris Riccomini 
> wrote:
>
> > Yea, it feels like a pretty edge-use case. It's not even documented. In
> the
> > interest of simplifying and and reducing bugs it seems like we might just
> > want to nuke this, or completely rethink the use cases.
> >
> > On Tue, May 17, 2016 at 12:22 PM, Jeremiah Lowin 
> > wrote:
> >
> > > Perhaps ad-hoc tasks could be refactored as ad-hoc DAGs? It sounds
> like
> > > they are for infrequent initialization or maintenance tasks.
> > >
> > > On Tue, May 17, 2016 at 11:21 AM Arthur Wiedmer 
> > wrote:
> > >
> > > > We still have tasks in production that use this feature.
> > > >
> > > > Sometimes, it has been used for one off tasks that create simple
> static
> > > > mapping tables (Tables loaded from a static file that also lives in
> > > source
> > > > control, creating a programmatically generated time dimension
> etc...).
> > > >
> > > > Of course, maybe just having the task in question as a script that
> uses
> > > the
> > > > airflow utilities would be sufficient.
> > > >
> > > > Best,
> > > > Arthur
> > > >
> > > > On Tue, May 17, 2016 at 10:40 AM, Chris Riccomini <
> > criccom...@apache.org
> > > >
> > > > wrote:
> > > >
> > > > > @Bolke/@Jeremiah
> > > > >
> > > > > When you make your changes to unify the backfiller and scheduler,
> it
> > > > sounds
> > > > > like this can go away, right?
> > > > >
> > > > > On Tue, May 17, 2016 at 10:38 AM, Maxime Beauchemin <
> > > > > maximebeauche...@gmail.com> wrote:
> > > > >
> > > > > > The scheduler won't trigger where `adhoc=True`. The CLI's
> > > > > backfill/test/run
> > > > > > is the only way to trigger where `adhoc=True`. For backfill
> > > > specifically,
> > > > > > there's a `-a`, `--include_adhoc` flag to make these tasks
> in-scope
> > > to
> > > > > the
> > > > > > backfill.
> > > > > >
> > > > > > Max
> > > > > >
> > > > > > On Tue, May 17, 2016 at 10:14 AM, Chris Riccomini <
> > > > criccom...@apache.org
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hey all,
> > > > > > >
> > > > > > > Curious about what the 'adhoc' property is in BaseOperator. It
> > > > appears
> > > > > to
> > > > > > > be completely undocumented. What is this?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Chris
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Adhoc operators

2016-05-17 Thread Jeremiah Lowin
Perhaps ad-hoc tasks could be refactored as ad-hoc DAGs? It sounds like
they are for infrequent initialization or maintenance tasks.

On Tue, May 17, 2016 at 11:21 AM Arthur Wiedmer  wrote:

> We still have tasks in production that use this feature.
>
> Sometimes, it has been used for one off tasks that create simple static
> mapping tables (Tables loaded from a static file that also lives in source
> control, creating a programmatically generated time dimension etc...).
>
> Of course, maybe just having the task in question as a script that uses the
> airflow utilities would be sufficient.
>
> Best,
> Arthur
>
> On Tue, May 17, 2016 at 10:40 AM, Chris Riccomini 
> wrote:
>
> > @Bolke/@Jeremiah
> >
> > When you make your changes to unify the backfiller and scheduler, it
> sounds
> > like this can go away, right?
> >
> > On Tue, May 17, 2016 at 10:38 AM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> > > The scheduler won't trigger where `adhoc=True`. The CLI's
> > backfill/test/run
> > > is the only way to trigger where `adhoc=True`. For backfill
> specifically,
> > > there's a `-a`, `--include_adhoc` flag to make these tasks in-scope to
> > the
> > > backfill.
> > >
> > > Max
> > >
> > > On Tue, May 17, 2016 at 10:14 AM, Chris Riccomini <
> criccom...@apache.org
> > >
> > > wrote:
> > >
> > > > Hey all,
> > > >
> > > > Curious about what the 'adhoc' property is in BaseOperator. It
> appears
> > to
> > > > be completely undocumented. What is this?
> > > >
> > > > Cheers,
> > > > Chris
> > > >
> > >
> >
>


Re: Voting Changes for Scheduler-related PRs/Commits

2016-05-14 Thread Jeremiah Lowin
Hi all,

I've been out of email/phone range for a few days (and will be for a few
more, just have brief access now) so I'm just catching up on all the
activity.

I think this proposal has its heart in the right place. There's a lot of
effort that goes in to the Apache transition and for the Airbnb team in
particular (though truly for everyone) it introduces a regime change in
terms of process and workflow. As a result, I think it feels like we are
stretched a little thin and there's a legitimate concern that major changes
could slip by with unintended consequences (I've been on both sides of that
already :) )

However, I believe that the good work we are doing to formalize procedures
and documentation and make development discussion more open will counteract
that in the near future and eliminate it shortly thereafter. Therefore, I
do trust the "Apache way" and I think after a few growing pains it will
really work smoothly. I agree that we should just all recognize when we are
touching "sensitive areas" of Airflow and allow those changes to bake a
little longer to ensure that many eyes get on them.

As a concrete example, I know I'm on Bolke's list because of my perceived
knowledge of the scheduler's inner workings. And while I won't necessarily
dispel those rumors, the truth is that I acquired that knowledge simply by
being tasked with certain deep-rooted bug reports and PRs. There could be a
commit tomorrow that renders a lot of that obsolete and produces a new
"scheduler guru". My point is that knowledge is fluid, especially at this
stage, and I believe the best (and most robust) thing we can do is share
and democratize it as much as possible.

So that's just a long way of saying: we know what a "big" change looks
like, and let's just make sure we have lots of talented committers
evaluating them :)

J

On Fri, May 13, 2016 at 10:08 AM Bolke de Bruin  wrote:

>
>
> Sent from my iPhone
>
> > On 13 mei 2016, at 19:02, Jakob Homan  wrote:
> >
> >> On 13 May 2016 at 00:40, Bolke de Bruin  wrote:
> >> The question is how to keep the trust of that first group - they are
> vital to the work - while growing the community.
> >
> > Another perspective is for the first group to trust the Apache Way.
> > The procedures, norms, votes, requirements, whole Incubator process,
> > etc. is designed to build a functioning community that addresses all
> > these issues.  It doesn't always work (not all incubations succeed),
> > but it does the vast majority of the time.
> >
> > This is just the first couple of weeks of Airflow's incubation and the
> > culture shock is definitely setting in, but we'll get past this pretty
> > quickly.  I do thnk Airflow will be very successful at Apache.
> >
>
> Hear hear! :-)
>
> > Best,
> > Jakob
>


Re: 1.7.1 release status

2016-05-09 Thread Jeremiah Lowin
Max, appreciate the clarity on this. Dan and I added a PR today that warns
when users try to double-add tasks to a DAG, and that moves us in the right
direction while not breaking anything, so I think everybody's happy!

There is one last issue that has come up -- it appears that at least one
Airbnb dagfile sets a relationship between two tasks when neither task has
a DAG. That creates an illegal situation and raises an error starting in
1.7.1. If you think about it, it really makes no sense to have a dependency
without a DAG... DAGs are literally expressions of workflows, and tasks are
just the units of work!

Dan and I looked to see if it could easily be deprecated and it turns out
it can't be. The issue is that when we set relationships between tasks, we
check to make sure we didn't introduce downstream cycles into the DAG. That
check uses the DAG to look up tasks (it actually turns out that you
authored the change here:
https://github.com/apache/incubator-airflow/commit/6c1207b636de4d74c0faf804dc003918ae4a8b16#diff-a32a363fa616685db3bfefba947535b2R1772
https://github.com/apache/incubator-airflow/commit/6c1207b636de4d74c0faf804dc003918ae4a8b16#diff-a32a363fa616685db3bfefba947535b2R1772).

So we are debating how to handle this. On the last issue, we were able to
deprecate it in a fairly unobtrusive way. But I really think this one
should stay an error -- it's an illegal situation at a very deep level. If
we introduce a special case, we're compromising the integrity of a core
check (detect_downstream_cycles) and possibly others. So, I submit that on
balance it's better in this case for Airbnb to change the offending DAG
than to build a backdoor into the cycles check. But again we're in that
grey area between "bug-fix" and "api-change" and I want to be sure you are
ok with it. Fortunately I think it's only the one DAG that would have to
change.
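
For concreteness, a minimal sketch of the offending pattern (task ids are
made up, and the import path assumes a 1.x-era Airflow layout):

```python
# Sketch of the illegal situation described above: a dependency set between
# two tasks when neither belongs to a DAG. In 1.7.0 this silently produced a
# broken workflow; per this thread, 1.7.1 raises because the downstream-cycle
# check has no DAG to resolve tasks against.
from airflow.operators.dummy_operator import DummyOperator

extract = DummyOperator(task_id='extract')   # note: no dag= given
load = DummyOperator(task_id='load')         # note: no dag= given

load.set_upstream(extract)                   # no DAG anywhere -> error in 1.7.1
```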

Thanks,
Jeremiah

On Fri, May 6, 2016 at 3:23 PM Maxime Beauchemin 
wrote:

> My thought was 2.0 should require tasks to be associated with a DAG on
> creation (maybe needs to be debated), 1.7.1 would **not** require this but
> start warning in anticipation of 2.0.
>
> Same with dag inference, we do start doing it in 1.7.1, but warn when we
> do, though it doesn't matter because of the previous statement, there won't
> be inference if all tasks are associated.
>
> I have to share this analogy about inference I used yesterday. If you get
> in line, in a line of people that are in line to go see star wars, we can
> assume you want to go see star wars. If you tell the next person in line
> "I'm here because I want to see star wars", the person should just answer
> "well yeah, that's pretty obvious", not "of course you dummy now get off
> this line" (which would be the equivalent of raising)
>
> Max
>
> On Fri, May 6, 2016 at 9:54 AM, Jeremiah Lowin  wrote:
>
> > The second point about moving add_task to a private method makes sense to
> > me, and I love `set_as_last` and `set_as_first` for convenience. Here are
> > my thoughts on the other two points:
> >
> >
> > * The absence of dag=dag (or equivalent context manager) WARNS about
> > deprecation in 2.0
> >
> > - dag has never been a required argument for tasks (pre 1.7.1 or post) so
> > I'm not sure what behavior we would be deprecating. The only difference
> is
> > that 1.7.1 takes additional care when working with dag-less tasks to make
> > sure they are automatically added to any DAGs they interact with (via
> > set_upstream, for example). Otherwise you get in situations like the
> above
> > where a task in a DAG is allowed to depend on a task outside the DAG,
> > resulting in a broken DAG.
> >
> >
> > * set_upstream/downstream infers where needed, but WARNS about
> deprecation
> > of inference in 2.0
> >
> > - inference is new in 1.7.1; previously, set_upstream could be called on
> > any two tasks without validating that they were in compatible DAGs (or
> any
> > DAG for that matter). I don't think we should put off doing that
> > inference/validation until 2.0, since the current behavior clearly allows
> > broken DAGs and should be fixed. To me, this doesn't represent an API
> > change that needs deprecation (especially since the API is identical),
> it's
> > a bug fix that prevents broken DAGs from being created. Maybe we have a
> > warning message that tells you if validation was done since that piece
> is a
> > new feature, but I don't see a reason to wait for 2.0 to do this
> > validation. The DAG is broken either way; in 1.7.0 you have to run it to
> > find that out and in 1.7.1 we tell you about it right away.

Re: 'shutdown' state?

2016-05-06 Thread Jeremiah Lowin
Tasks (and jobs) are put in that state if you clear them
(clear_task_instances() or “airflow clear”) while they’re running.
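For example (a sketch; the DAG name, task name, and dates are placeholders),
running `airflow clear example_dag -t my_task -s 2016-05-01 -e 2016-05-06`
while those task instances are executing will move them to the shutdown
state.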

Jeremiah

On Fri, May 6, 2016 at 5:59 PM Lance Norskog 
wrote:

> What is the 'shutdown' (blue rim) state? How does a task get to that state?
>
> --
> Lance Norskog
> lance.nors...@gmail.com
> Redwood City, CA
>


Re: Status of Apache Migration

2016-05-06 Thread Jeremiah Lowin
It looks to me like travis migrated perfectly, as Max predicted :) Unless
there is any further action required, I'm going to close the travis
migration issue.

J

On Wed, May 4, 2016 at 3:46 PM Maxime Beauchemin 
wrote:

> Done and done.
> https://github.com/apache/incubator-airflow
>
> Max
>
> On Wed, May 4, 2016 at 12:22 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
>> I changed the title for the duration of the move
>>
>>
>>
>> On Wed, May 4, 2016 at 12:20 PM, Maxime Beauchemin <
>> maximebeauche...@gmail.com> wrote:
>>
>>> About the migration, I just released the Airflow repo to `asfgit` as
>>> requested.
>>>
>>> For some reason I had to assign it to myself first, so the repo will
>>> appear here <https://github.com/mistercrunch/airflow> until Apache
>>> Infra accepts the repo and moves it to it's final destination.
>>>
>>> This should not affect any normal flows, perhaps just a bit of confusion
>>> if you pay attention to URLs.
>>>
>>> Follow the progress here:
>>> https://issues.apache.org/jira/browse/INFRA-11776
>>>
>>> If anyone knows of a way to accelerate this please do!
>>>
>>> On Wed, May 4, 2016 at 12:04 PM, Maxime Beauchemin <
>>> maximebeauche...@gmail.com> wrote:
>>>
>>>> Github has a branch protection feature that prevents force pushes and
>>>> branch deletion. I just turned it on for `master`.
>>>>
>>>>
>>>>
>>>> On Wed, May 4, 2016 at 11:18 AM, Siddharth Anand <
>>>> siddharthan...@yahoo.com.invalid> wrote:
>>>>
>>>>> Yes.. also, I'm hoping no one does a "force push" in this scenario.
>>>>> -s
>>>>>
>>>>> On Wednesday, May 4, 2016 5:05 PM, Jeremiah Lowin <
>>>>> jlo...@apache.org> wrote:
>>>>>
>>>>>
>>>>>  Let's make sure that process is well documented!
>>>>>
>>>>> (asking for a friend... who doesn't want to accidentally erase Airflow)
>>>>>
>>>>> On Wed, May 4, 2016 at 12:59 PM Chris Riccomini >>>> >
>>>>> wrote:
>>>>>
>>>>> > > Chris, I assume the merge button will be greyed on PRs, what's the
>>>>> new
>>>>> > PR acceptance
>>>>> > flow like?
>>>>> >
>>>>> > I believe we'll have to pull the PR locally, merge into master, and
>>>>> push
>>>>> > master to the Apache Git repo. Annoying, but should work.
>>>>> >
>>>>> > On Tue, May 3, 2016 at 5:58 PM, Maxime Beauchemin <
>>>>> > maximebeauche...@gmail.com> wrote:
>>>>> >
>>>>> > > Hi,
>>>>> > >
>>>>> > > I'll be proceeding with the repo migration tomorrow as planned,
>>>>> getting
>>>>> > > things started in the morning. From previous experiences migrating
>>>>> repos
>>>>> > on
>>>>> > > Github, the transition should be smooth where Github handles all
>>>>> of the
>>>>> > > redirection. Even Travis and other Github ecosystem tools should
>>>>> just
>>>>> > work.
>>>>> > >
>>>>> > > Chris, I assume the merge button will be greyed on PRs, what's the
>>>>> new PR
>>>>> > > acceptance flow like?
>>>>> > >
>>>>> > > Thanks,
>>>>> > >
>>>>> > > Max
>>>>> > >
>>>>> > > On Tue, May 3, 2016 at 4:28 PM, Siddharth Anand >>>> >
>>>>> > wrote:
>>>>> > >
>>>>> > > > Committers/Maintainers :
>>>>> > > > If you have an open item on
>>>>> > > >
>>>>> >
>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Migrating+to+Apache
>>>>> ,
>>>>> > > > please send status/blockers on your items :
>>>>> > > > Specifically:
>>>>> > > >- Max (repo migration)
>>>>> > > >- Jeremiah (travis migration)
>>>>> > > >
>>>>> > > >- I know you are waiting on Max
>>>>> > > >
>>>>> > > >- Bolke
>>>>> > > >
>>>>> > > >- GH issue-to-Jira migration - it looks like we decided to
>>>>> handle
>>>>> > this
>>>>> > > > in a lazy fashion, so maybe we can close this?
>>>>> > > > -s
>>>>> > >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>


Re: 1.7.1 release status

2016-05-06 Thread Jeremiah Lowin
The second point about moving add_task to a private method makes sense to
me, and I love `set_as_last` and `set_as_first` for convenience. Here are
my thoughts on the other two points:


* The absence of dag=dag (or equivalent context manager) WARNS about
deprecation in 2.0

- dag has never been a required argument for tasks (pre 1.7.1 or post) so
I'm not sure what behavior we would be deprecating. The only difference is
that 1.7.1 takes additional care when working with dag-less tasks to make
sure they are automatically added to any DAGs they interact with (via
set_upstream, for example). Otherwise you get in situations like the above
where a task in a DAG is allowed to depend on a task outside the DAG,
resulting in a broken DAG.


* set_upstream/downstream infers where needed, but WARNS about deprecation
of inference in 2.0

- inference is new in 1.7.1; previously, set_upstream could be called on
any two tasks without validating that they were in compatible DAGs (or any
DAG for that matter). I don't think we should put off doing that
inference/validation until 2.0, since the current behavior clearly allows
broken DAGs and should be fixed. To me, this doesn't represent an API
change that needs deprecation (especially since the API is identical), it's
a bug fix that prevents broken DAGs from being created. Maybe we have a
warning message that tells you if validation was done since that piece is a
new feature, but I don't see a reason to wait for 2.0 to do this
validation. The DAG is broken either way; in 1.7.0 you have to run it to
find that out and in 1.7.1 we tell you about it right away.
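
To illustrate the behaviors being discussed, here is a sketch of the explicit
attachment patterns versus the inference case. Operator and DAG names are
placeholders, and import paths assume a 1.x-era Airflow layout:

```python
# Sketch of the DAG-attachment patterns under discussion. Explicit attachment
# via dag= or the context manager is unambiguous; the inference case is the
# new 1.7.1 behavior where a dag-less task is pulled into a DAG by
# set_upstream/set_downstream.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# 1) Explicit: pass dag= to the constructor.
dag = DAG('example', start_date=datetime(2016, 1, 1))
a = DummyOperator(task_id='a', dag=dag)

# 2) Explicit: the context manager form.
with DAG('example_ctx', start_date=datetime(2016, 1, 1)) as dag_ctx:
    b = DummyOperator(task_id='b')   # picked up by the context manager

# 3) Inferred (new in 1.7.1): a dag-less task is added to `dag` when a
#    relationship is set against a task that already belongs to it.
c = DummyOperator(task_id='c')       # no dag= given
c.set_upstream(a)                    # c is now inferred into `dag`
```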


On Fri, May 6, 2016 at 10:53 AM Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> I think that there are currently 4 ways to attach a task to a DAG:
> * Operator(task_id='foo', dag=dag)
> * dag.add_task(t)
> * Context manager: `with DAG('foo') as dag:`
> * inferred from set_upstream() and set_downstream in 1.7.1 (from my
> understanding)
>
> The pattern of passing the DAG object to BaseOperator constructor should
> probably be mandatory as this is how the `default_args` and `params` magic
> happens and it may be unclear to users that it is the case. I'm assuming
> that the behavior of the context manager is equivalent to the explictit
> dag=dag
>
> Knowing this, set_upstream shouldn't have to infer dag attribution, but
> simply check that the dags on both sides is the same object.
>
> So the changes I'd suggest would be:
> * The absence of dag=dag (or equivalent context manager) WARNS about
> deprecation in 2.0
> * DAG.add_task becomes private DAG._add_task and now WARNS about
> deprecation in 2.0 on usage, and also warns when reattributing a DAG twice
> * set_upstream/downstream infers where needed, but WARNS about deprecation
> of inference in 2.0
> * Add new methods to replace the upstream(dag.root) pattern, the terms root
> is unclear (does it mean firsts or lasts?). SE I'd suggest convenience
> methods `set_as_last` and `set_as_first` for clarity
>
> Max
>
> On Fri, May 6, 2016 at 6:32 AM, Bolke de Bruin  wrote:
>
> > I agree with disabling building broken DAGs and also disabling / not
> > supporting orphaned Operators/TaskInstances. It creates so many issues
> down
> > the line and the fixes are relatively easy. It will make airflow and the
> > used DAGs more maintainable.
> >
> > So +1, just make sure it is well documented in UPDATING.md
> >
> > my 2 cents.
> >
> > Bolke
> >
> > > Op 6 mei 2016, om 07:22 heeft Jeremiah Lowin  het
> > volgende geschreven:
> > >
> > > Tonight I was working with Dan on fixing the speed regression with
> large
> > > DAGs: https://github.com/apache/incubator-airflow/pull/1470. That
> clears
> > > the first blocker for 1.7.1 as described in AIRFLOW-52.
> > >
> > > We wanted to ask for the group's thoughts on the second blocker.
> > Basically,
> > > the issue centers on this pattern:
> > >
> > > ```python
> > > # email is an operator WITHOUT a DAG.
> > > email = Operator(...)
> > >
> > > # create dependencies for email
> > > email.set_upstream(dag.roots)
> > >
> > > # add email to the DAG
> > > dag.add_task(email)
> > > ```
> > >
> > > Why is this a problem? Under Airflow 1.7.0, this DAG is actually
> > completely
> > > broken after the set_upstream command, because it has a dependency to a
> > > task that's not in the DAG. It can't be run and will even raise an
> > > exception if you do something simple like access dag.roots. HOWEVER,
> this
> > > building this broken DAG is
