Re: Graduation resolution passed - Airflow is a TLP

2018-12-20 Thread James Meickle
Cheers to y'all, keep up the great work!

On Thu, Dec 20, 2018, 16:13 Jakob Homan  wrote:

> Hey all-
>The Board minutes haven't been published yet (probably due to
> Holiday-related slowness), but I can see through the admin tool that
> our Graduation resolution was approved yesterday at the meeting.
> Airflow is the 199th current active Top Level Project in Apache.
>
> Congrats all.
>
> -Jakob
>


Re: Call for fixes for Airflow 1.10.2

2018-12-06 Thread James Meickle
I suggest at least adding a commit to remove the broken S3 logging section
I just reported here: https://issues.apache.org/jira/browse/AIRFLOW-3449

On Wed, Nov 28, 2018 at 5:41 PM Kaxil Naik  wrote:

> Hi everyone,
>
> I'm starting the process of gathering fixes for 1.10.2. So far, the list
> of issues I have that we should pull in is:
> https://issues.apache.org/jira/browse/AIRFLOW-3384?jql=project%20%3D%20AIRFLOW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%201.10.2
>
> I will start pushing these as cherry-picked commits to the v1-10-test
> branch today.
>
> *Kaxil Naik*
> *Big Data Consultant *@ *Data Reply UK*
> *Certified *Google Cloud Data Engineer | *Certified* Apache Spark & Neo4j
> Developer
> *Phone: *+44 (0) 74820 88992
> *LinkedIn*: https://www.linkedin.com/in/kaxil
>


Re: programmatically creating and airflow quirks

2018-11-28 Thread James Meickle
I would be very interested in helping draft a rearchitecting AIP. Of
course, that's a vague statement. I am interested in several specific areas
of Airflow functionality that would be hard to modify without some
refactoring taking place first:

1) Improving Airflow's data model so it's easier to have functional data
pipelines (such as addressing information propagation and artifacts via a
non-xcom mechanism)

2) Having point-in-timeness for DAGs: a concept of which revision of a DAG
was in use at which date, represented in-Airflow.

3) Better idioms and loading capabilities for DAG factories (either
config-driven, or non-Python creation of DAGs, like with boundary-layer).

4) Flexible execution dates: in finance we operate day over day, and have
valid use cases for "t-1", "t+0", and "t+1" dates. The current execution
date status is incredibly confusing for literally every developer we've
brought onto Airflow (they understand it eventually but do make mistakes at
first).

5) Scheduler-integrated sensors

6) Making Airflow more operator-friendly with better alerting, health
checks, notifications, deploy-time configuration, etc.

7) Improving testability of various components (both within the Airflow
repo, as well as making it easier to test DAGs and plugins)

8) Deprecating "newbie trap" or excess complexity features (like skips), by
fixing their internal implementation or by providing alternatives that
address their use cases in more sound ways.

To my mind, I would need Airflow to be more modular to accomplish several
of those. Even if these aims don't happen in Airflow contrib (as some are
quite contentious and have been discussed on this list before), it would
currently be nearly impossible to maintain an in-house branch that
attempted to implement them.

That being said, saying that it requires microservices is IMO incorrect.
Airflow already scales quite well, so while it needs more modularization,
we probably would see no benefit from immediately breaking those modules
into independent services.

On Wed, Nov 28, 2018 at 11:38 AM Ash Berlin-Taylor  wrote:

> I have similar feelings around the "core" of Airflow and would _love_ to
> somehow find time to spend a month really getting to grips with the
> scheduler and the dagbag and see what comes to light with fresh eyes and
> the benefits of hindsight.
>
> Finding that time is going to be A Challenge though.
>
> > (Oh, except no to microservices. Airflow is hard enough to operate right
> > now without splitting things into even more daemons.)
>
> -ash
> > On 26 Nov 2018, at 03:06, soma dhavala  wrote:
> >
> >
> >
> >> On Nov 26, 2018, at 7:50 AM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
> >>
> >> The historical reason is that people would check in scripts in the repo
> >> that had actual compute or other forms of undesired effects in module
> scope
> >> (scripts with no "if __name__ == '__main__':") and Airflow would just
> run
> >> this script while seeking for DAGs. So we added this mitigation patch
> that
> >> would confirm that there's something Airflow-related in the .py file.
> Not
> >> elegant, and confusing at times, but it also probably prevented some
> issues
> >> over the years.
> >>
> >> The solution here is to have a more explicit way of adding DAGs to the
> >> DagBag (instead of the folder-crawling approach). The DagFetcher
> proposal
> >> offers solutions around that, having a central "manifest" file that
> >> provides explicit pointers to all DAGs in the environment.
> >
> > Some rebasing needs to happen. When I looked at the 1.8 code base almost a
> > year ago, it felt more complex than necessary. What airflow is trying
> > to promise from an architectural standpoint — that was not clear to me. It
> > is trying to do too many things, scattered in too many places, is the
> > feeling I got. As a result, I stopped peeping, and just trust that it works
> > — which it does, btw. I tend to think that airflow outgrew its original
> > intent. A sort of micro-services architecture has to be brought in. I may
> > sound critical, but no offense. I truly appreciate the contributions.
> >
> >>
> >> Max
> >>
> >> On Sat, Nov 24, 2018 at 5:04 PM Beau Barker 
> >> wrote:
> >>
> >>> In my opinion this searching for dags is not ideal.
> >>>
> >>> We should be explicitly specifying the dags to load somewhere.
> >>>
> >>>
>  On 25 Nov 2018, at 10:41 am, Kevin Yang  wrote:
> 
>  I believe that is mostly because we want to skip parsing/loading .py
> >>> files
>  that don't contain DAG defs to save time, as the scheduler is going to
>  parse/load the .py files over and over again and some files can take
> >>> quite
>  long to load.
> 
>  Cheers,
>  Kevin Y
> 
>  On Fri, Nov 23, 2018 at 12:44 AM soma dhavala  >
>  wrote:
> 
> > happy to report that the “fix” worked. thanks Alex.
> >
> > btw, wondering why was it there in the first place? how does it help
> —
> > saves time, early terminatio

Re: Moving Airflow Config to Database.

2018-11-15 Thread James Meickle
Just a guess, but do you need to reload supervisord itself before
restarting the service? If you add an env var to the supervisor config and
then restart the supervisor-managed service, it will still be running
with the old supervisor config. The supervisor daemon itself must be
reloaded (or told to re-read its config) for the change to take effect.



On Thu, Nov 15, 2018 at 7:59 AM Sai Phanindhra  wrote:

> Some of the env variables are not getting reflected in the supervisord
> environment (sometimes new variables are not available, sometimes changes to
> existing variables are not reflected).
>
> On Thu 15 Nov, 2018, 17:37 Ash Berlin-Taylor 
> > > problem with
> > > this approach is these env variables won't behave correctly when we
> > > use subshells
> >
> >
> > Can you explain what you mean by this?
> >
> > -ash
> >
> >
> > > On 15 Nov 2018, at 12:03, Sai Phanindhra  wrote:
> > >
> > > Hi Deng,
> > > I am currently using env variables for a few airflow config variables
> > > which may differ across machines (airflow folder, log folder, etc.). The
> > > problem with this approach is that these env variables won't behave
> > > correctly when we use subshells. (I faced issues when I added airflow
> > > jobs in supervisord.)
> > > Moving config to the db not only addresses this issue, it also makes it
> > > possible to change config from the UI (most of the time, we can't give
> > > box access to all users). Config in the db makes it easy to create/update
> > > config without touching code.
> > >
> > > On Thu 15 Nov, 2018, 16:04 Deng Xiaodong  > >
> > >> A few solutions that may address your problem:
> > >>
> > >> - Specify your configurations in environment variables, so it becomes
> > much
> > >> easier to manage across machines
> > >> - use network attached storage to save your configuration file and
> > mount it
> > >> to all your machines (this can address DAG file sync as well)
> > >> - ...
> > >>
> > >> Personally I don’t see the point of moving configuration to the DB.
> > >>
> > >>
> > >> XD
> > >>
> > >> On Thu, Nov 15, 2018 at 18:29 Sai Phanindhra 
> > wrote:
> > >>
> > >>> Hello Airflow users,
> > >>>  I recently encountered an issue with airflow. I am maintaining an
> > >>> airflow cluster; whenever I make a change to the airflow configuration
> > >>> on one of the machines, I have to consciously copy these changes to the
> > >>> other machines in the airflow cluster. The problem with this is that
> > >>> it's a manual process and something users tend to forget, so changes
> > >>> don't get synced. I want to move the airflow config to a database. I
> > >>> will be happy if you can share your valuable inputs/thoughts on this.
> > >>>
> > >>> --
> > >>> Sai Phanindhra,
> > >>> Ph: +91 9043258999
> > >>>
> > >>
> >
> >
>


Re: Customised alerts/notifications and enhancements to alerting/notifications on Airflow

2018-11-14 Thread James Meickle
FYI I am on the Airflow Slack but only check it on weekends mostly.

Here is a gist with my implementation of a Slack callback, which attaches
icons/buttons/emoji/etc.:
https://gist.github.com/Eronarn/99408c0e5b0dd964487a5eea64b34f6d
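
For readers who just want the shape of such a callback, here is a minimal sketch of the idea (not the gist above): it assumes a Slack incoming-webhook URL stored in an Airflow Variable named "slack_webhook", and the Variable name, message format, and emoji are illustrative assumptions.

import json
import urllib.request

from airflow.models import Variable


def slack_failure_callback(context):
    """Minimal on_failure_callback that posts a failure notice to Slack."""
    ti = context["task_instance"]
    message = {
        "text": ":red_circle: Task failed: {dag}.{task} ({when})\n{url}".format(
            dag=ti.dag_id,
            task=ti.task_id,
            when=context["execution_date"],
            url=ti.log_url,
        )
    }
    request = urllib.request.Request(
        Variable.get("slack_webhook"),  # hypothetical Variable holding the webhook URL
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)


# Attach it per task or via default_args, e.g.:
# default_args = {"on_failure_callback": slack_failure_callback, ...}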

On Wed, Nov 14, 2018 at 1:54 PM Sai Phanindhra  wrote:

> Thanks James for the input.
> For the problems I specified above, I built hacky solutions like adding a
> `slack_start_notification_operator` at the beginning, a
> `slack_end_notification_operator` at the end, and a
> `slack_failed_notification_operator` when upstream fails. This addresses the
> first 3 issues/feature requirements I spoke about. I am maintaining lists
> of emails, and at the dag level I'm joining all required emails to address
> point 5. Still, I feel like this is manual work that needs to be done every
> time a new dag onboards in airflow. I feel like these are common problems
> many airflow users/developers face.
> @James  let's catch up sometime on slack/hangout to
> discuss how these enhancements can be done.
>
>
> On Thu, 15 Nov 2018 at 00:10, James Meickle  .invalid>
> wrote:
>
> > As the author of the first linked PR, I think your points are good. Here
> is
> > my attempt to address them:
> >
> > 1: It is possible to do this today if you write a Slack callback. I would
> > be happy to share my code for this if you're having trouble integrating
> > Slack. That being said, it would be great if Airflow provided several
> > "default" callbacks for common platforms like Slack and Pagerduty.
> >
> > 2/3: Yes, Airflow should add callbacks for the DAG lifecycle, too. DAG
> > "SLAs" on the other hand, I am not sure would provide any additional
> value,
> > and have a high chance of being misused.
> >
> > 4: That's a great idea. My PR would make adding this very easy, because
> it
> > redefines the "SLAMiss" object as having a "type" of SLA miss. This would
> > involve adding a new type to the enum, and some logic to check when to
> > create an SLA miss of this type.
> >
> > 5: My interpretation is that you mean an email address that always gets
> > notified, regardless of any more specific users that a task says it
> should
> > email. (So not a default value to "emails", but instead an additional
> value
> > that is always added.) I think this makes a lot of sense and would be
> easy
> > to add to email. It would not be even remotely possible for a Slack
> > integration right now, since there's no unified code for that.
> >
> > My preferred way of addressing this would be to get my PR merged as a
> > starting point, which isolates a lot of this functionality from the
> > scheduler code. Then have a broader AIP created, or possibly a pair of
> > them: switching to a more general evented system for Airflow model
> > lifecycles, and implementing pluggable notifiers (right now a lot of the
> > email functionality is hardcoded) the same way that there is already
> > pluggable logging.
> >
> > From an SRE perspective, two other pain points we run into: the statsd
> > integration is subpar (at least when we ingest it in Datadog it's hard to
> > actually alert on), and there's no /health or /healthz endpoints for the
> > scheduler and worker so it's hard to know if they are healthy in a
> > programmatic way.
> >
> > On Wed, Nov 14, 2018 at 1:06 PM Niels Zeilemaker 
> > wrote:
> >
> > > I had a go once to introduce something similar, but never got it
> merged.
> > > Maybe you can use it as an inspiration.
> > >
> > > https://github.com/apache/incubator-airflow/pull/2412
> > >
> > > Niels
> > >
> > > Op wo 14 nov. 2018 16:43 schreef Sai Phanindhra  > >
> > > > The above-mentioned PR addresses issues/bugs in current functionality. I
> > > > want to add more alerting mediums, which includes SLAs.
> > > >
> > > > On Wed, 14 Nov 2018 at 20:51, airflowuser
> > > >  wrote:
> > > >
> > > > > There is a pending PR to refactor the SLA:
> > > > > https://github.com/apache/incubator-airflow/pull/3584
> > > > >
> > > > > But it requires more reviews from committers.
> > > > >
> > > > >
> > > > > Sent with ProtonMail Secure Email.
> > > > >
> > > > > ‐‐‐ Original Message ‐‐‐
> > > > > On Wednesday, November 14, 2018 5:11 PM, Sai Phanindhra <
> > > > > phani8...@gmail.com> wrote:
> > &

Re: Customised alerts/notifications and enhancements to alerting/notifications on Airflow

2018-11-14 Thread James Meickle
As the author of the first linked PR, I think your points are good. Here is
my attempt to address them:

1: It is possible to do this today if you write a Slack callback. I would
be happy to share my code for this if you're having trouble integrating
Slack. That being said, it would be great if Airflow provided several
"default" callbacks for common platforms like Slack and Pagerduty.

2/3: Yes, Airflow should add callbacks for the DAG lifecycle, too. DAG
"SLAs" on the other hand, I am not sure would provide any additional value,
and have a high chance of being misused.

4: That's a great idea. My PR would make adding this very easy, because it
redefines the "SLAMiss" object as having a "type" of SLA miss. This would
involve adding a new type to the enum, and some logic to check when to
create an SLA miss of this type.

5: My interpretation is that you mean an email address that always gets
notified, regardless of any more specific users that a task says it should
email. (So not a default value to "emails", but instead an additional value
that is always added.) I think this makes a lot of sense and would be easy
to add to email. It would not be even remotely possible for a Slack
integration right now, since there's no unified code for that.

My preferred way of addressing this would be to get my PR merged as a
starting point, which isolates a lot of this functionality from the
scheduler code. Then have a broader AIP created, or possibly a pair of
them: switching to a more general evented system for Airflow model
lifecycles, and implementing pluggable notifiers (right now a lot of the
email functionality is hardcoded) the same way that there is already
pluggable logging.

From an SRE perspective, two other pain points we run into: the statsd
integration is subpar (at least when we ingest it in Datadog it's hard to
actually alert on), and there's no /health or /healthz endpoints for the
scheduler and worker so it's hard to know if they are healthy in a
programmatic way.

On Wed, Nov 14, 2018 at 1:06 PM Niels Zeilemaker 
wrote:

> I had a go once to introduce something similar, but never got it merged.
> Maybe you can use it as an inspiration.
>
> https://github.com/apache/incubator-airflow/pull/2412
>
> Niels
>
> Op wo 14 nov. 2018 16:43 schreef Sai Phanindhra 
> > The above-mentioned PR addresses issues/bugs in current functionality. I want
> > to add more alerting mediums, which includes SLAs.
> >
> > On Wed, 14 Nov 2018 at 20:51, airflowuser
> >  wrote:
> >
> > > There is a pending PR to refactor the SLA:
> > > https://github.com/apache/incubator-airflow/pull/3584
> > >
> > > But it requires more reviews from committers.
> > >
> > >
> > > Sent with ProtonMail Secure Email.
> > >
> > > ‐‐‐ Original Message ‐‐‐
> > > On Wednesday, November 14, 2018 5:11 PM, Sai Phanindhra <
> > > phani8...@gmail.com> wrote:
> > >
> > > > Hello airflow committers and maintainers,
> > > > I came across SLAs in airflow. It's a very good feature to begin
> > > > with. I feel like a few enhancements can be done. These enhancements are
> > > > not limited to just SLAs; they are basically gaps I felt when using
> > > > airflow. I'm listing a few of them here.
> > > >
> > > > 1.  SLA alerts to slack channel(s) along with emails
> > > > 2.  Alerts at DAG level (starting, success and failure).
> > > > 3.  Custom callbacks just like `on_failure_callback`,
> > > > `on_retry_callback` and `on_success_callback`, at the DAG level.
> > > > 4.  Alerts if a task completes before a minimum run time. (This is
> > > > really a rare case, but there will be a few long-running jobs that we
> > > > know for sure run for at least a few hours, and if they exit before
> > > > that it means something is wrong. We need warning alerts for such cases.)
> > > >
> > > > 5.  Default/Global alert config (default emails to send all alerts
> > > > and/or a slack channel to send alerts to)
> > > >
> > > > Some of these might have already been solved, or someone might be
> > > > working to solve them. Please share your thoughts and add anything else
> > > > I missed in this list.
> > > >
> > >
> > >
> > >
> >
> > --
> > Sai Phanindhra,
> > Ph: +91 9043258999
> >
>


Re: Deployment / Execution Model

2018-11-01 Thread James Meickle
We're running into a lot of pain with this. We have a CI system that
enables very rapid iteration on DAG code. Whenever you need to modify
plugin code, it requires a re-ship of all of the infrastructure, which
takes at least 10x longer than a DAG deployment Jenkins build.

I think that Airflow should have multiple DAG parsing backends, the same
way that it has multiple executors. It's fine for subprocess to be the
default, but it would be immensely helpful if DAG parsing could take place
in a virtualenv, Docker container, or Kubernetes pod.

On Thu, Nov 1, 2018 at 2:20 AM Gabriel Silk 
wrote:

> I can see how my first email was confusing, where I said:
>
> "Our first attempt at productionizing Airflow used the vanilla DAGs folder,
> including all the deps of all the DAGs with the airflow binary itself"
>
> What I meant is that we have separate DAGs deployment, but we are being
> forced to package the *dependencies of the DAGs* with the Airflow binary,
> because that's the only way to make the DAG definitions work.
>
> On Wed, Oct 31, 2018 at 11:18 PM, Gabriel Silk  wrote:
>
> > Our DAG deployment is already a separate deployment from Airflow itself.
> >
> > The issue is that the Airflow binary (whether acting as webserver,
> > scheduler, worker), is the one that *reads* the DAG files. So if you
> > have, for example, a DAG that has this import statement in it:
> >
> > import mylib.foobar
> >
> > Then the only way to successfully interpret this DAG definition in the
> > Airflow process, is if you package the Airflow binary with the
> mylib.foobar
> > dependency.
> >
> > This implies that every time you add a new dependency in one of your DAG
> > definitions, you have to re-deploy Airflow itself, not just the DAG
> > definitions.
> >
> >
> > On Wed, Oct 31, 2018 at 2:45 PM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> >> Deploying the DAGs should be decoupled from deploying Airflow itself.
> You
> >> can just use a resource that syncs the DAGs repo to the boxes on the
> >> Airflow cluster periodically (say every minute). Resource orchestrators
> >> like Chef, Ansible, Puppet, should have some easy way to do that. Either
> >> that or some sort of mount or mount-equivalent (k8s has constructs for
> >> that, EFS on Amazon).
> >>
> >> Also note that the DagFetcher abstraction that's been discussed before
> on
> >> the mailing list would solve this and more.
> >>
> >> Max
> >>
> >> On Wed, Oct 31, 2018 at 2:37 PM Gabriel Silk  >
> >> wrote:
> >>
> >> > Hello Airflow community,
> >> >
> >> >
> >> > I'm currently putting Airflow into production at my company of 2000+
> >> > people. The most significant sticking point so far is the deployment /
> >> > execution model. I wanted to write up my experience so far in this
> >> matter
> >> > and see how other people are dealing with this issue.
> >> >
> >> > First of all, our goal is to allow engineers to author DAGs and easily
> >> > deploy them. That means they should be able to make changes to their
> >> DAGs,
> >> > add/remove dependencies, and not have to  redeploy any of the core
> >> > component (scheduler, webserver, workers).
> >> >
> >> > Our first attempt at productionizing Airflow used the vanilla DAGs
> >> folder,
> >> > and including all the deps of all the DAGs with the airflow binary
> >> itself.
> >> > Unfortunately, that meant we had to redeploy the scheduler, webserver
> >> > and/or workers every time a dependency changed, which will definitely
> >> not
> >> > work for us long term.
> >> >
> >> > The next option we considered was to use the "packaged DAGs" approach,
> >> > whereby you place dependencies in a zip file. This would not work for
> >> us,
> >> > due to the lack of support for dynamic libraries (see
> >> > https://airflow.apache.org/concepts.html#packaged-dags)
> >> >
> >> > We have finally arrived at an option that seems reasonable, which is
> to
> >> use
> >> > a single Operator that shells out to various binary targets that we
> >> build
> >> > independently of Airflow, and which include their own dependencies.
> >> > Configuration is serialized via protobuf and passed over stdin to the
> >> > subprocess. The parent process (which is in Airflow's memory space)
> >> streams
> >> > the logs from stdout and stderr.
> >> >
> >> > This approach has the advantage of working seamlessly with our build
> >> > system, and allowing us to redeploy DAGs even when dependencies in the
> >> > operator implementations change.
> >> >
> >> > Any thoughts / comments / feedback? Have people faced similar issues
> out
> >> > there?
> >> >
> >> > Many thanks,
> >> >
> >> >
> >> > -G Silk
> >> >
> >>
> >
> >
>
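
A hedged sketch of the pattern Gabriel describes — shell out to a prebuilt binary, pass serialized config on stdin, and stream its output into the task log — might look roughly like the following; the operator name is hypothetical, and JSON stands in for the protobuf serialization mentioned in the thread.

import json
import subprocess

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class BinaryTargetOperator(BaseOperator):
    """Run a self-contained binary target, passing config on stdin."""

    @apply_defaults
    def __init__(self, binary_path, config, *args, **kwargs):
        super(BinaryTargetOperator, self).__init__(*args, **kwargs)
        self.binary_path = binary_path
        self.config = config

    def execute(self, context):
        proc = subprocess.Popen(
            [self.binary_path],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
        # Hand the serialized configuration to the child over stdin.
        proc.stdin.write(json.dumps(self.config).encode("utf-8"))
        proc.stdin.close()
        # Stream the child's combined stdout/stderr into the Airflow task log.
        for line in iter(proc.stdout.readline, b""):
            self.log.info(line.rstrip().decode("utf-8", errors="replace"))
        if proc.wait() != 0:
            raise RuntimeError("Binary exited with code %d" % proc.returncode)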


Re: Is airflow a fit?

2018-10-18 Thread James Meickle
We use Airflow for a mixture of traditional data pipeline work, and
orchestration tasks that need to express complex date and dependency logic.
The former is a bit better supported, but it's still a great tool for the
latter.

But it sounds like your format is "this task should happen whenever this
other event happens". Airflow is oriented around "this task should happen
once per day, but not until this other condition happens". Therefore, I
would suggest that you probably want a live/streaming architecture where
tasks can subscribe to events.

On Thu, Oct 18, 2018 at 3:14 PM  wrote:

> Hi,
>
> I am Ahlem Triki, software engineer. Recently we started working with
> Airflow and I would like to ask whether you see Airflow as a fit for what
> we are working on, or whether it is overkill. We are really in need of an
> Airflow expert's point of view.
>
> Here is a brief description of what  we working on:
>
> *   Running health programs (10's to low 100's program) e.g., diabetic,
> cardio, new born..  For different 100's clinics (each one customize their
> own) including low 1000's of doctors
> *   1000's of patients in each clinic, with a patient enrolled in one
> or
> more of these programs
> *   Each program has
>
> *   a time table (Time triggered) actions of sending SMS,
> notifications,
> updating application (e.g., reminders of appointments or lab, setting
> calls,
> sending and receiving surveys/ results AND
> *   Event Triggered action, like receiving confirmation a mobile, or
> email, SMS (lab results high or low, med inventory status.)
> *   Each program runs for month or years or for ever (or just one
> single
> day)
>
> My confusion: it seems Airflow is meant more for data pipelines, i.e., in my
> mind the computational level, vs. what I'm doing, which is more like an
> application level. Hence my question: is Airflow a fit? If so, any guidance;
> if not, any recommendations?
>
>
>
> Best regards,
>
> Ahem
>
>
>
> ---
> This email has been checked for viruses by Avast antivirus software.
> https://www.avast.com/antivirus
>


Re: [External] RE: [IE] airflow ui not showing logs

2018-10-15 Thread James Meickle
Did you switch to the RBAC UI in 1.10? Its default is to not allow most user
accounts to read logs.

On Mon, Oct 15, 2018 at 3:14 PM Frank Maritato
 wrote:

> Hi Sunil,
>
> I don't see this process running. I have never had to run this command
> previously. Should it have started as part of the 'airflow webserver'
> command?
> I ran it manually from the command line as 'airflow serve_logs &' (does not
> have a daemon option I guess) but I am still not seeing logs in the ui. I
> did verify with netstat that it is running on port 8793.
>
>
> On Mon, Oct 15, 2018 at 12:02 PM Sunil Varma Chiluvuri <
> sunilvarma.chiluv...@equifax.com> wrote:
>
> > Check if the serve_logs process is running alongside your worker
> > process(es). This is the process that takes the log files written to disk
> > and serves them to the web UI. It should be running on port 8793 by
> default.
> >
> > Sunil
> >
> > -Original Message-
> > From: Frank Maritato [mailto:fmarit...@opentable.com.INVALID]
> > Sent: Monday, October 15, 2018 1:42 PM
> > To: dev@airflow.incubator.apache.org
> > Subject: [IE] airflow ui not showing logs
> >
> > Hi All,
> >
> > I'm running airflow 1.10.0 and the ui isn't showing the task logs
> anymore.
> > This has worked in the past so I'm not sure what changed. I was able to
> > verify that the logs are definitely being written to the same local
> > directory as what is specified in the airflow.cfg 'base_log_folder'. I
> > don't see any errors or debug in the airflow-webserver.{out|err} logs. I
> > tried restarting the process and it still doesn't work.
> >
> > Anyone know how I can track this down?
> > --
> > Frank Maritato
> > This message contains proprietary information from Equifax which may be
> > confidential. If you are not an intended recipient, please refrain from
> any
> > disclosure, copying, distribution or use of this information and note
> that
> > such actions are prohibited. If you have received this transmission in
> > error, please notify by e-mail postmas...@equifax.com. Equifax® is a
> > registered trademark of Equifax Inc. All rights reserved.
> >
>
>
> --
>
> Frank Maritato
>


Re: Ingest daily data, but delivery is always delayed by two days

2018-10-12 Thread James Meickle
For something to add to Airflow itself: I would love a more flexible
mapping between data time and processing time. The default is "n-1" (day
over day, you're aiming to process yesterday's data) but people post other
use cases on this mailing list quite frequently.
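
One purely illustrative workaround today (this is not an Airflow feature, and the DAG id, macro name, and two-day offset below are assumptions) is to declare the offset once per DAG as a user-defined macro, so templates reference a named data date instead of scattering date arithmetic:

from datetime import datetime, timedelta

from airflow import DAG

dag = DAG(
    dag_id="delayed_delivery_example",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    user_defined_macros={
        # One explicit place that says "this DAG processes data from two days
        # before its execution_date" instead of ds_add() calls in every template.
        "data_date": lambda execution_date: execution_date - timedelta(days=2),
    },
)

# In any templated field: "{{ data_date(execution_date).strftime('%Y/%m/%d') }}"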

On Fri, Oct 12, 2018 at 7:46 AM Faouz El Fassi  wrote:

> What about an exponential back off on the poke interval?
>
> On Fri, 12 Oct 2018, 13:01 Ash Berlin-Taylor,  wrote:
>
> > That would work for some of our other uses cases (and has been an idea in
> > our backlog for months) but not this case as we're reading from someone
> > else's bucket so can't set up notifications etc. :(
> >
> > -ash
> >
> > > On 12 Oct 2018, at 11:57, Bolke de Bruin  wrote:
> > >
> > > S3 Bucket notification that triggers a dag?
> > >
> > > Verstuurd vanaf mijn iPad
> > >
> > >> Op 12 okt. 2018 om 12:42 heeft Ash Berlin-Taylor  het
> > volgende geschreven:
> > >>
> > >> A lot of our dags are ingesting data (usually daily or weekly) from
> > suppliers, and they are universally late.
> > >>
> > >> In the case I'm setting up now the delivery lag is about 30 hours -
> data
> > for 2018-10-10 turned up at 2018-10-12 05:43.
> > >>
> > >> I was going to just set this up with an S3KeySensor and a daily
> > schedule, but I'm wondering if anyone has any other bright ideas for a
> > better way of handling this sort of case:
> > >>
> > >>   dag = DAG(
> > >>   DAG_ID,
> > >>   default_args=args,
> > >>   start_date=args['start_date'],
> > >>   concurrency=1,
> > >>   schedule_interval='@daily',
> > >>   params={'country': cc}
> > >>   )
> > >>
> > >>   with dag:
> > >>   task = S3KeySensor(
> > >>   task_id="await_files",
> > >>   bucket_key="s3://bucket/raw/table1-{{ params.country }}/{{
> > execution_date.strftime('%Y/%m/%d') }}/SUCCESS",
> > >>   poke_interval=60 * 60 * 2,
> > >>   timeout=60 * 60 * 72,
> > >>   )
> > >>
> > >> That S3 key sensor is _going_ to fail the first 18 times or so it runs
> > which just seems silly.
> > >>
> > >> One option could be to use `ds_add` or similar on the execution date,
> > but I don't like breaking the (obvious) link between execution date and
> > which files it picks up, so I've ruled out this option
> > >>
> > >> I could use a Time(Delta)Sensor to just delay the start of the
> > checking. I guess with the new change in master to make sensors yield
> their
> > execution slots that's not a terrible plan.
> > >>
> > >> Does anyone else have any other idea, including possible things we
> > could add to Airflow itself.
> > >>
> > >> -ash
> > >>
> >
> >
>


Re: Can a DAG be conditionally hidden from the UI?

2018-10-09 Thread James Meickle
It's fine (and required if you do task assignment) to create the DAG
object; just return None from the function (return early) so that there is
no variable containing a DAG object visible within the module.
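
A minimal sketch of that pattern (the environment variable name, allow-list, and placeholder task are illustrative, not from this thread):

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator


def make_dag(allowed_envs):
    # Early return: no DAG object ever reaches module scope in other environments.
    if os.getenv("environment") not in allowed_envs:
        return None

    dag = DAG(
        dag_id="env_restricted_example",
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )
    # Task assignment happens against the real DAG object, as noted above.
    DummyOperator(task_id="noop", dag=dag)
    return dag


dag = make_dag(["production"])  # None everywhere except production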

On Tue, Oct 9, 2018 at 12:21 PM Shah Altaf  wrote:

> @jmeickle can you share your code or any sample?  How are you returning
> nothing?
>
> I may have spoken too soon.  I can't seem to set dag to None, it fails with
>
> "Tried to create relationships between tasks that don't have DAGs yet"
>
>
> On Tue, Oct 9, 2018 at 12:57 PM Shah Altaf  wrote:
>
> >
> >
> > Thanks all, following your advice I've decided to do this right now with
> a
> > decorator.  It looks crude so any suggestions are welcome.
> >
> > The decorator function checks an environment variable that we set on our
> > servers and if it matches any of the passed in environment names, it'll
> > allow the DAG to be created.
> >
> >
> > def environments(env_names):
> >     def restrict_environment(func):
> >         def func_wrapper(*args, **kwargs):
> >             if os.getenv("environment") in env_names:
> >                 return DAG(*args, **kwargs)
> >         return func_wrapper
> >     return restrict_environment
> >
> > @environments(["env1","env2","production"])
> > def get_DAG(*args, **kwargs):
> >     return None
> >
> > dag = get_DAG('my_dag', default_args=default_args, schedule_interval="30 5 * * *")
> >
> >
> > Regards
> > Shah
> >
> >
> > On Mon, Oct 8, 2018 at 5:19 PM William Pursell
> 
> > wrote:
> >
> >> This would be a very desirable feature.  It's not just an issue of
> >> differing environments, but also changing requirements.  A dag may be
> >> used for a time, and then removed from the environment.  But if it's
> >> still in the airflow db, it will get a row on the UI with a black "i"
> >> indicating that the dag is missing.  As far as I know, the only way
> >> to remove it is to manually edit the database.
> >> On Mon, Oct 8, 2018 at 9:43 AM Chris Palmer  wrote:
> >> >
> >> > I like James solution better, but the initial thought I had was to
> >> deploy
> >> > airflowignore files to the environments to filter out files that
> should
> >> not
> >> > be processed when filling the DagBag.
> >> >
> >> > Chris
> >> >
> >> > On Mon, Oct 8, 2018 at 10:22 AM James Meickle
> >> >  wrote:
> >> >
> >> > > As long as the Airflow process can't find the DAG as a top-level
> >> object in
> >> > > the module, it won't be registered. For example, we have a function
> >> that
> >> > > returns DAGs; the function returns nothing if it's not in the right
> >> > > environment.
> >> > >
> >> > > On Sun, Oct 7, 2018 at 2:31 PM Shah Altaf 
> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > tl;dr - Is it possible to conditionally hide a DAG from the UI
> >> based on
> >> > > an
> >> > > > environment variable?
> >> > > >
> >> > > >
> >> > > >
> >> > > > Our team has a single repo with several DAGs in it and we deploy
> it
> >> > > across
> >> > > > multiple 'environments' (think dev, test, and other integration
> >> > > > environments).  While most DAGs are meant to run everywhere, we do
> >> have
> >> > > > several which are meant to run in one and only one environment.
> Of
> >> > > course
> >> > > > they are all paused, but it would be nice to declutter a bit for
> >> > > ourselves.
> >> > > >
> >> > > > My question then - is it possible to conditionally hide a DAG from
> >> the UI
> >> > > > based on an environment variable or some flag somewhere.
> >> > > >
> >> > > > This is just wishful thinking - the dev could do something like
> >> > > >
> >> > > > dag = get_dag(...),and get_dag() would have a decorator like
> >> > > > @only_run_in("integration4,dev,local")
> >> > > >
> >> > > > And that decorator returns some kind of None object or special DAG
> >> which
> >> > > > just doesn't appear in the list.
> >> > > >
> >> > > > Or perhaps some other way to accomplish this same effect - any
> ideas
> >> > > would
> >> > > > be appreciated.
> >> > > >
> >> > > > Thanks
> >> > > >
> >> > >
> >>
> >>
>


Re: error handling flow in DAG

2018-10-08 Thread James Meickle
Anthony:

Could you just have the "success" path be declared with "all_success" (the
default), and the "failure" side branches be declared with "all_failed"
depending on the previous task? This will have the same branching structure
you want, but with fewer intermediary operators.
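
A hedged sketch of that layout, reusing Anthony's task names with DummyOperators standing in for the real work (the DAG id and dates are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

dag = DAG(
    dag_id="error_flow_example",
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,
)

# The "success" path keeps the default all_success trigger rule.
task_1 = DummyOperator(task_id="task_1", dag=dag)
task_2 = DummyOperator(task_id="task_2", dag=dag)
task_3 = DummyOperator(task_id="task_3", dag=dag)
task_1 >> task_2 >> task_3

# Failure branches hang off the task they are specific to and only fire when
# that task fails (all_failed over a single upstream task).
task_1_failure_a = DummyOperator(
    task_id="task_1_failure_a", trigger_rule=TriggerRule.ALL_FAILED, dag=dag)
task_1_failure_b = DummyOperator(task_id="task_1_failure_b", dag=dag)
task_1 >> task_1_failure_a >> task_1_failure_b

task_3_failure_a = DummyOperator(
    task_id="task_3_failure_a", trigger_rule=TriggerRule.ALL_FAILED, dag=dag)
task_3_failure_b = DummyOperator(task_id="task_3_failure_b", dag=dag)
task_3 >> task_3_failure_a >> task_3_failure_b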

-James M.

On Mon, Oct 1, 2018 at 1:12 PM Anthony Brown 
wrote:

> Hi
>I am coding various data flows and one of the requirements we have is to
>    I am coding various data flows and one of the requirements we have is to
> have some error tasks happen when some of the tasks fail. These error
> tasks are specific to the task that failed and are not generic to the
> whole DAG.
>So for instance if I have a DAG that runs the following tasks
>
> task_1 > task_2 > task_3
>
>If task_1 fails, then I want to run
>
> task_1_failure_a ---> task_1_failure_b
>
>If task_2 fails, I do not need to do anything specific, but if task_3
> fails, I need to run
>
> task_3_failure_a ---> task_3_failure_b
>
>    I already have a generic on_failure_callback defined on all tasks
> that handles alerting, but am stuck on the best way of handling a failure
> flow for tasks.
>
>    The ways I have come up with for handling this are:
>
> Have a branch operator between each task with trigger_rule set to all_done.
> The branch operator would then decide whether to go to next (success) task,
> or to go down the failure branch
>
> Put the failure tasks in a separate DAG with no schedule. Have a different
> on_failure_callback for each task that would trigger the failure DAG for
> that task and then do my generic error handling
>
>Does anybody have any thoughts on which of the above two approaches
> would be best, or suggest an alternative way of doing this
>
> Thanks
>
> --
> --
>
> Anthony Brown
> Data Engineer BI Team - John Lewis
> Tel : 0787 215 7305
> **
> This email is confidential and may contain copyright material of the John
> Lewis Partnership.
> If you are not the intended recipient, please notify us immediately and
> delete all copies of this message.
> (Please note that it is your responsibility to scan this message for
> viruses). Email to and from the
> John Lewis Partnership is automatically monitored for operational and
> lawful business reasons.
> **
>
> John Lewis plc
> Registered in England 233462
> Registered office 171 Victoria Street London SW1E 5NN
>
> Websites: https://www.johnlewis.com
> http://www.waitrose.com
> https://www.johnlewisfinance.com
> http://www.johnlewispartnership.co.uk
>
> **
>


Re: Airflow presentation at Velocity New York

2018-10-08 Thread James Meickle
Hi all,

The slides are now up:
https://docs.google.com/presentation/d/1gbfR79u4Ws8ctASZiGuQaslX3bHgnW5L9GFuyfQWRFM/edit#slide=id.p

Recordings are on O'Reilly's Safari platform, and you can of course reach
out to me if you have questions.

-James M.

On Thu, May 17, 2018 at 10:49 AM Vasyl Harasymiv <
vharasy...@activecampaign.com> wrote:

> Please do share slides/video after, James!
>
> On Thu, May 17, 2018 at 6:53 AM, James Meickle 
> wrote:
>
> > Hi folks,
> >
> > At Velocity New York in October, I will be presenting about how
> Quantopian
> > uses Airflow for financial data:
> > https://conferences.oreilly.com/velocity/vl-ny/public/schedule/detail/70048
> >
> > We couldn't have adopted Airflow so quickly without the hard work of
> > contributors who have made it such an amazing piece of software. So,
> thank
> > you! And feel free to reach out if you'll be at the conference.
> >
> > -James M.
> >
>
>
>
> --
>
>
> Vasyl Harasymiv
> ActiveCampaign / Director of Data Science
> (608) 622-5327
> vharasy...@activecampaign.com
> 1 N Dearborn St, Chicago, IL 60602
> <https://www.facebook.com/activecampaign>
> <http://www.twitter.com/activecampaign>
> <https://www.linkedin.com/company/activecampaign-inc->
>


Re: Can a DAG be conditionally hidden from the UI?

2018-10-08 Thread James Meickle
As long as the Airflow process can't find the DAG as a top-level object in
the module, it won't be registered. For example, we have a function that
returns DAGs; the function returns nothing if it's not in the right
environment.

On Sun, Oct 7, 2018 at 2:31 PM Shah Altaf  wrote:

> Hi all,
>
> tl;dr - Is it possible to conditionally hide a DAG from the UI based on an
> environment variable?
>
>
>
> Our team has a single repo with several DAGs in it and we deploy it across
> multiple 'environments' (think dev, test, and other integration
> environments).  While most DAGs are meant to run everywhere, we do have
> several which are meant to run in one and only one environment.  Of course
> they are all paused, but it would be nice to declutter a bit for ourselves.
>
> My question then - is it possible to conditionally hide a DAG from the UI
> based on an environment variable or some flag somewhere.
>
> This is just wishful thinking - the dev could do something like
>
> dag = get_dag(...),and get_dag() would have a decorator like
> @only_run_in("integration4,dev,local")
>
> And that decorator returns some kind of None object or special DAG which
> just doesn't appear in the list.
>
> Or perhaps some other way to accomplish this same effect - any ideas would
> be appreciated.
>
> Thanks
>


Re: PR for refactoring Airflow SLAs

2018-10-08 Thread James Meickle
Hi,

I plan to work on this again but got busy between work and personal life.
I'll see if I can revisit it this month.

-James M.

On Fri, Oct 5, 2018 at 7:30 AM Colin Nattrass 
wrote:

> Hello all,
>
> Is there any update on the status of this PR?
>
> I discovered this following a request for help on StackOverflow (on
> creating SLAs on task duration
> https://stackoverflow.com/questions/52645422/sla-on-task-duration-airflow).
> If this is unlikely to be implemented in the short term, is there a known
> workaround?
>
> Colin N
>
>
> On 2018/07/17 18:56:35, Maxime Beauchemin  wrote:
> > I did a quick scan and it looks like great work, thanks for your
> > contribution! I'm guessing the committers are all either busy or
> > vacationing at this moment. Let's make sure this gets properly reviewed
> > and merged.
> >
> > Related to this is the thought of having a formal flow for improvement
> > proposals so that we can do some design review upfront, and couple
> > contributors with committers early to make sure the process goes through
> > smoothly. It hurts to have quality contributions ignored. Clearly
> > we need to onboard more committers to ensure quality work gets merged
> > while also providing steady, high quality releases.
> >
> > In the meantime I'd advise you to ping regularly to make sure this PR
> > gets the attention it deserves and prevent it from getting buried in the
> > pile.
> >
> > Max
> >
> > On Tue, Jul 17, 2018 at 5:27 AM James Meickle
> >  wrote:
> >
> > > Hi all,
> > >
> > > I'd still love to get some eyes on this one if anyone has time.
> > > Definitely needs some direction as to what is required before merging,
> > > since this is a higher-level API change...
> > >
> > > -James M.
> > >
> > > On Mon, Jul 9, 2018 at 11:58 AM, James Meickle 
> > > wrote:
> > >
> > > > Hi folks,
> > > >
> > > > Based on my earlier email to the list, I have submitted a PR that
> > > > splits `sla=` into three independent SLA parameters, as well as heavily
> > > > restructuring other parts of the SLA feature:
> > > >
> > > > https://github.com/apache/incubator-airflow/pull/3584
> > > >
> > > > This is my first Airflow PR and I'm still learning the codebase, so
> > > > there's likely to be flaws with it. But I'm most interested in the
> > > > general compatibility of this feature with the rest of Airflow. We want
> > > > this for our purposes at Quantopian, but we'd really prefer to get it
> > > > into Airflow core rather than running a fork forever!
> > > >
> > > > Let me know your thoughts,
> > > >
> > > > -James M.
> > > >
> > >
> >


Re: Pinning dependencies for Apache Airflow

2018-10-04 Thread James Meickle
I suggest not adopting pipenv. It has a nice "first five minutes" demo but
it's simply not baked enough to depend on as a drop-in pip replacement. We
are in the process of removing it after finding several serious bugs in our
POC of it.

On Thu, Oct 4, 2018, 20:30 Alex Guziel 
wrote:

> FWIW, there's some value in using virtualenv with Docker to isolate
> yourself from your system's Python.
>
> It's worth noting that requirements files can link other requirements
> files, so that would make groups easier, but not that pip in one run has no
> guarantee of transitive dependencies not conflicting or overriding. You
> need pip check for that or use --no-deps.
>
> On Thu, Oct 4, 2018 at 5:19 PM Driesprong, Fokko 
> wrote:
>
> > Hi Jarek,
> >
> > Thanks for bringing this up. I missed the discussion on Slack since I'm
> on
> > holiday, but I saw the thread and it was way too interesting, and
> therefore
> > this email :)
> >
> > This is actually something that we need to address asap. Like you
> mention,
> > we saw earlier that specific transitive dependencies are not compatible
> > and then we end up with a breaking CI, or even worse, a broken release.
> > Earlier we had in the setup.py the fixed versions (==) and in a separate
> > requirements.txt the requirements for the CI. This was also far from
> > optimal since we had two versions of the requirements.
> >
> > I like the idea that you are proposing. Maybe we can do an experiment
> with
> > it, because of the nature of Airflow (orchestrating different systems),
> we
> > have a huge list of dependencies. To not install everything, we've
> created
> > groups. For example specific libraries when you're using the Google
> Cloud,
> > Elastic, Druid, etc. So I'm curious how it will work with the `
> > extras_require` of Airflow
> >
> > Regarding the pipenv. I don't use any pipenv/virtualenv anymore. For me
> > Docker is much easier to work with. I'm also working on a PR to get rid
> of
> > tox for the testing, and move to a more Docker idiomatic test pipeline.
> > Curious what your thoughts are on that.
> >
> > Cheers, Fokko
> >
> > Op do 4 okt. 2018 om 15:39 schreef Arthur Wiedmer <
> > arthur.wied...@gmail.com
> > >:
> >
> > > Thanks Jakob!
> > >
> > > I think that this is a huge risk of Slack.
> > > I am not against Slack as a support channel, but it is a slippery slope
> > to
> > > have more and more decisions/conversations happening there, contrary to
> > > what we hope to achieve with the ASF.
> > >
> > > When we are starting to discuss issues of development, extensions and
> > > improvements, it is important for the discussion to happen in the
> mailing
> > > list.
> > >
> > > Jarek, I wouldn't worry too much, we are still in the process of
> learning
> > > as a community. Welcome and thank you for your contribution!
> > >
> > > Best,
> > > Arthur.
> > >
> > > On Thu, Oct 4, 2018 at 1:42 PM Jarek Potiuk 
> > > wrote:
> > >
> > > > Thanks for pointing it out Jakob.
> > > >
> > > > I am still very fresh in the ASF community and learning the ropes and
> > > > etiquette and code of conduct. Apologies for my ignorance.
> > > > I re-read the conduct and FAQ now again - with more understanding and
> > > will
> > > > pay more attention to wording in the future. As you mentioned it's
> more
> > > the
> > > > wording than intentions, but since it was in TL;DR; it has stronger
> > > > consequences.
> > > >
> > > > BTW. Thanks for actually following the code of conduct and pointing
> it
> > > out
> > > > in respectful manner. I really appreciate it.
> > > >
> > > > J.
> > > >
> > > > Principal Software Engineer
> > > > Phone: +48660796129
> > > >
> > > > On Thu, 4 Oct 2018, 20:41 Jakob Homan,  wrote:
> > > >
> > > > > > TL;DR; A change is coming in the way how
> dependencies/requirements
> > > are
> > > > > > specified for Apache Airflow - they will be fixed rather than
> > > flexible
> > > > > (==
> > > > > > rather than >=).
> > > > >
> > > > > > This is follow up after Slack discussion we had with Ash and
> Kaxil
> > -
> > > > > > summarising what we propose we'll do.
> > > > >
> > > > > Hey all.  It's great that we're moving this discussion back from
> > Slack
> > > > > to the mailing list.  But I've gotta point out that the wording
> needs
> > > > > a small but critical fix up:
> > > > >
> > > > > "A change *is* coming... they *will* be fixed"
> > > > >
> > > > > needs to be
> > > > >
> > > > > "We'd like to propose a change... We would like to make them
> fixed."
> > > > >
> > > > > The first says that this decision has been made and the result of
> the
> > > > > decision, which was made on Slack, is being reported back to the
> > > > > mailing list.  The second is more accurate to the rest of the
> > > > > discussion ('what we propose...').  And again, since it's axiomatic
> > in
> > > > > ASF that if it didn't happen on a list, it didn't happen[1], we
> gotta
> > > > > make sure there's no confusion about where the community is on the
> > > > > decision-making

Re: Fundamental change - Separate DAG name and id.

2018-09-20 Thread James Meickle
I'm personally against having some kind of auto-increment numeric ID for
DAGs. While this makes a lot of sense for systems where creation is a
database activity (like a POST request), in Airflow, DAG creation is
actually a code ship activity. There are all kinds of complex scenarios
around that:

- I revert a commit and a DAG disappears or is renamed
- I run the same file, twice, with multiple parameters to create two DAGs
- I create the DAG in both staging and prod, but they wind up with
different IDs

It's just too hard to automatically track these scenarios.

If we really wanted to put something like this in place, it would first
make more sense to decouple DAG creation from code shipping, and instead
prefer creation of a DAG outside of code (but with a definition that
references which git repo/committish/file/arguments/etc. to use). Then if
you do something like rename a file, the DAG breaks, but at least still
exists in the db with that ID and history still makes sense once you update
the DAG definition with the new code location.

On Thu, Sep 20, 2018 at 4:52 AM airflowuser
 wrote:

> Hi,
> though this could have been explained on Jira I think this should be
> discussed first.
>
> The problem:
> Airflow mixes the DAG name with its id. It uses the same field for both purposes.
>
> I assume that most of you use the dag_id to describe what the DAG actually
> does.
> For example:
>
> dag = DAG(
>     dag_id='cost_report_daily',
>     ...
> )
>
> This dag_id is reflected in the dag id column in the UI.
> Now, let's say that you want to add another task to this specific dag - you
> have to be extremely careful when you change the dag_id to represent the new
> functionality, for example: dag_id='cost_expenses_reports_daily'. This
> will break the history of the DAG.
>
> Or even a simpler use case: the user just wants to change the name they
> see in the UI.
>
> I suggest having a discussion on whether the dag_id should be split into an
> id (an actual id) and a name to reflect what it does. When the "connection"
> is done by ids, names can change as much as you want without breaking
> anything; essentially it becomes a field used for display purposes only.
>
> * I also didn't mention the issue of the DAG file name, which can likewise
> cause trouble if someone wants to change it.
>
> Sent with [ProtonMail](https://protonmail.com) Secure Email.


Re: Guidelines on Contrib vs Non-contrib

2018-09-18 Thread James Meickle
Very much in favor of just using Python modules for operators. I initially wrote
mine as Airflow plugin compatible, and eventually had to un-write them that
way, so it's really a new-user trap.

I've had at least a half dozen times installing/testing/operating Airflow
where we had some issue based on an integration for a service we've never
even used (like Hive). I would love to see all of that go away. However, we
should make sure that it's not too onerous to get a fairly fully featured
Airflow install, such as having a way for external repos/packages to even
be discoverable.
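
To make the plain-module point concrete, here is a hedged sketch (the package path, operator, and DAG file below are hypothetical): the operator lives in an ordinary importable Python package, and DAG files import it directly, with no plugin registration involved.

# my_company/operators.py -- an ordinary Python module, not an Airflow plugin
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class GreetingOperator(BaseOperator):
    """Trivial example operator defined in a plain, pip-installable package."""

    @apply_defaults
    def __init__(self, name, *args, **kwargs):
        super(GreetingOperator, self).__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        self.log.info("Hello, %s", self.name)


# dags/greeting_dag.py -- a normal import; Airflow never needs to know the
# operator exists until the DAG file references it:
#
#     from my_company.operators import GreetingOperator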

On Tue, Sep 18, 2018 at 1:28 PM Driesprong, Fokko 
wrote:

> I fully agree with using plain Python modules :)
>
> I don't think a lot of hooks/operators graduate to core since it will break
> the import. A few of them, for example Databricks and the Google hooks are
> mature enough. For me the main point is having test coverage and a stable
> API.
>
> Cheers, Fokko
>
> Op di 18 sep. 2018 om 18:30 schreef Victor Noagbodji <
> vnoagbo...@amplify-analytics.com>:
>
> > yes, please!
> >
> > > On Sep 18, 2018, at 12:23 PM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> > >
> > > +1 for deprecating operators/hooks as plugins, let's use Python's good
> > old
> > > python packages and maybe python "entry points" if we want to inject
> them
> > > in "airflow.operators"/"airflow.hooks" (which is probably not
> necessary)
> > >
> > > On Tue, Sep 18, 2018 at 2:12 AM Ash Berlin-Taylor 
> > wrote:
> > >
> > >> Operators and hooks don't need any special plugin system - simply
> having
> > >> them as as separate Python modules which are imported using normal
> > python
> > >> semantics is enough.
> > >>
> > >> In fact now that I think about it: I want to deprecate the plugins
> > >> registering hooks/operators etc and limit it to only bits which a
> simple
> > >> python import can't manage - which I think is only anything that needs
> > to
> > >> be registered with another system, such as custom routes in the web
> UI.
> > >>
> > >> I'll draft an AIP for this soon.
> > >>
> > >> -ash
> > >>
> > >>
> > >>> On 18 Sep 2018, at 00:50, George Leslie-Waksman 
> > >> wrote:
> > >>>
> > >>> Given we have a plugin system, could we alternatively move away from
> > >>> keeping non-core supported code outside of the core project/repo?
> > >>>
> > >>> It would hugely decrease the surface area of the main repository and
> > >>> testing infrastructure to get most of the contrib code out to its own
> > >> place.
> > >>>
> > >>> Further, it would decrease the committer burden of having to
> > >> approve/merge
> > >>> code that is not supposed to be their responsibility.
> > >>>
> > >>> On Mon, Sep 17, 2018 at 4:37 PM Tim Swast 
> > >> wrote:
> > >>>
> > > Individual operators and hooks living in separate repositories on
> > >> github
> >  (or possibly other Apache projects), which are then distributed by
> pip
> > >> and
> >  installed as libraries seems like it would scale better.
> > 
> >  Pandas did this about a year ago, and it's seemed to have worked
> well.
> > >> For
> >  example, pandas.read_gbq is a very thin wrapper around
> > >> pandas_gbq.read_gbq
> >  (distributed as a separate package). It has made it easier for me to
> > >> track
> >  issues corresponding to my area of expertise.
> > 
> >  On Sun, Sep 16, 2018 at 1:25 PM Jakob Homan 
> > wrote:
> > 
> > >> My understanding as a contributor is that if a hook/operator is in
> >  core,
> > > it
> > >> means that a committer is willing to take personal responsibility
> to
> > >> maintain it (or at least help maintain it), and everything else
> goes
> > >> in
> > >> contrib.
> > >
> > > That's not correct.  All of the code is owned by the entire
> > > community[1]; no one person is responsible for any of it.  There's
> no
> > > silos, fiefdoms, walled gardens, etc.  If the community cannot
> > support
> > > a piece of code it should be deprecated and subsequently removed.
> > >
> > > Contrib sections are almost always problematic for this reason.
> > > Hadoop ended up abandoning its.  Because Airflow acts as a
> gathering
> > > point for so many disparate technologies (databases, storage
> systems,
> > > compute engines, etc.), trying to keep all of them corralled and up
> > to
> > > date will be very difficult.  Individual operators and hooks living
> > in
> > > separate repositories on github (or possibly other Apache
> projects),
> > > which are then distributed by pip and installed as libraries seems
> > > like it would scale better.
> > >
> > > -Jakob
> > >
> > > [1]
> > >> https://blogs.apache.org/foundation/entry/success-at-apache-a-newbie
> > >
> > > On 15 September 2018 at 13:29, Jeff Payne 
> > wrote:
> > >> How many operators are added to contrib per month? Is it too many
> to
> > > make the decision case by case? If so, then the above

Re: It's very hard to become a committer on the project

2018-09-18 Thread James Meickle
reen shot as well.
> >>>
> >>> [image: Screenshot 2018-09-16 11.28.57.png]
> >>>
> >>> I think we all agree that there is better integration between GH PRs
> and
> >>> GH Issues than between GH PRs and Jira issues.
> >>>
> >>> There are some practical matters to consider:
> >>>
> >>>   - For the 1100-1200 unclosed/unresolved issues, how will we transfer
> >>>   them to GH or will we drop those on the floor? How would we map
> >> submitters
> >>>   between the 2 systems, and how would we transfer the
> >> content/comments,etc...
> >>>   - For the existing closed PRs (>3k), whose PRs reference JIRA, we'd
> >>>   need to keep JIRA around in read-only mode so we could reference the
> >>>   bug/feature details, but somehow disallow new JIRA creations, lest
> >> some
> >>>   people continue to use it to create new issues
> >>>   - I'm assuming the GH issue naming would not conflict with that of
> >>>   JIRA naming in commit message subjects and PRs. In other words,
> >>>   incubator-airflow-1 vs AIRFLOW-1 or airflow-1 vs AIRFLOW-1 or possibly
> >>>   conflict at AIRFLOW-1? Once we graduate, I'm pretty sure the
> >> incubator name
> >>>   will be dropped, so there may be a naming conflict.
> >>>
> >>> In the end, these are 2 different tools. The issues you raise are
> mainly
> >>> around governance.
> >>>
> >>> If you folks would like to propose a new means to manage the JIRAs, can
> >>> you outline a solution on Wiki and drop a link into an email on this
> >> list?
> >>> We can then raise a vote.
> >>>
> >>> IMHO, our community would scale the best if more people picked up
> >>> responsibilities such as these. Grooming/Organizing JIRAs doesn't need
> to
> >>> be a responsibility owned by the maintainers. Anyone can take the lead
> on
> >>> discussions, etc...
> >>>
> >>> -s
> >>>
> >>> On Mon, Sep 17, 2018 at 2:09 AM Sumit Maheshwari <
> sumeet.ma...@gmail.com
> >>>
> >>> wrote:
> >>>
> >>>> Strong +1 for moving to GitHub from Jira.
> >>>>
> >>>> On Mon, Sep 17, 2018 at 12:35 PM George Leslie-Waksman <
> >> waks...@gmail.com
> >>>>>
> >>>> wrote:
> >>>>
> >>>>> Are there Apache rules preventing us from switching to GitHub Issues?
> >>>>>
> >>>>> That seems like it might better fit much of Airflow's user base.
> >>>>>
> >>>>>
> >>>>> On Sun, Sep 16, 2018, 9:21 AM Jeff Payne  wrote:
> >>>>>
> >>>>>> I agree that Jira could be better utilized. I read the original
> >>>>>> conversation on the mailing list about how Jira should be used (or
> >> if
> >>>> it
> >>>>>> should be used at all) and I'm still unclear about why it was picked
> >>>> over
> >>>>>> just using github issues. It refers to a dashboard, which I've yet
> >> to
> >>>>>> investigate, but Jira is much more than just dashboards.
> >>>>>>
> >>>>>> If this project is going to use Jira, then:
> >>>>>>
> >>>>>> 1) It would be great to see moderation and labeling of the Jira
> >>>> issues by
> >>>>>> the main contributors to make it easier for people to break into
> >>>>>> contributing.
> >>>>>> 2) It would also be nice if the initial conversation of whether or
> >>>> not an
> >>>>>> issue warrants development at all happened on the Jira issue, or at
> >>>> least
> >>>>>> some acknowledgement by the main contributors.
> >>>>>> 3) Larger enhancements and efforts or vague suggestions still get
> >>>>>> discussed on the dev mailing list before a Jira is even opened, but
> >>>> after
> >>>>>> that, the discussion moves to the Jira, with a link back to the
> >>>> mailing
> >>>>>> list email for reference.
> >>>>>> 4) The discussion on the PR is only concerned with HOW the
> >> change/fix
> >>>> is
> >>>>>> implemented.

Re: It's very hard to become a committer on the project

2018-09-16 Thread James Meickle
Definitely agree with this. I'm not always opposed to JIRA for projects,
but the way it's being used for this project makes it very hard to break
into contributing. The split between GH and JIRA is also painful since
there's no automatic integration of them.

On Sun, Sep 16, 2018 at 9:29 AM airflowuser
 wrote:

> Hello all,
>
> I'm struggling finding tickets to address and while discussing it on chat
> others reported they had the same problem when they began working on the
> project.
>
> The problem is due to:
> 1. It's very hard to locate tickets on Jira. The categories are a mess,
> versions are not enforced. each use can tag,label and set priority at his
> will. No one monitor or overwrite it
> 2. It's impossible for a new committer to  find issues which can be
> easy-fix and a "good first issue".
>
> My suggestions:
> 1. Looking at the ticket system there are usually less than 10 new tickets
> a day. It won't take too much time for someone with knowledge on the
> project to properly tag  the ticket.
>
> 2. I think that most of you don't even check the Jira. Most of you submit
> PRs and 5 seconds before opening a ticket (just because you must). There is
> no doubt that the Jira is a "side" system which doesn't really perform it's
> job.
>
> Take a look at this:
> https://issues.apache.org/jira/projects/AIRFLOW/issues/AIRFLOW-2999
> a member of the community asks for committers for input but no one
> replies. I doubt this is because no one has input.. I am sure that if a PR
> was submitted you had comments. It's simply because you don't see it. This
> is why I think the current Jira does't function properly. I think that
> Github can perform a better role. All of you as committers are already
> there and it's always better to work with one system rather with two. The
> colors and labels of the GitHub as very easy to notice.
>
> Either way, what ever you decide something needs to be change. Either Jira
> will be more informative or move to GitHub.
>
> Thank you all for your good work :)


Re: [VOTE] Replace with Gitter with Slack?

2018-09-06 Thread James Meickle
-1 unless we get a Slack with full retention.

On Thu, Sep 6, 2018 at 8:16 AM Daniel Cohen 
wrote:

> +1
>
> On Thu, Sep 6, 2018 at 3:04 PM Adam Boscarino
>  wrote:
>
> > +1
> >
> > On Wed, Sep 5, 2018 at 10:30 PM Sid Anand  wrote:
> >
> > > Hi Folks!
> > > In the Apache tradition, I'd like to ask the community to vote on
> > replacing
> > > Gitter with Slack.
> > >
> > > For more information about what was recently discussed, refer to
> > >
> > >
> >
> https://lists.apache.org/thread.html/8eeb6c46ec431b9158f87022ceaa5eed8dbaaf082c887dae55f86f96@%3Cdev.airflow.apache.org%3E
> > >
> > > If you would like to replace Gitter with Slack, vote +1. If you want to
> > > keep things they way they are, vote -1. You can also vote 0 if you
> don't
> > > care either way because you wouldn't use either much, preferring to use
> > the
> > > mailing list instead, which is highly encouraged as it is Apache's
> > official
> > > record.
> > >
> > > The vote will be open for 72 hours and will expire at 8p PT this
> > Saturday.
> > > -s
> > >
> > > P.S. If the community votes for Slack, we could create our own
> workspace
> > > (e.g. airflow.slack.com).
> > > P.P.S. In general, anyone in the community can launch a vote like this
> > from
> > > time to time. There is no binding/non-binding distinction since we are
> > not
> > > running an official Apache vote.
> > >
> >
> >
> > --
> > Adam Boscarino
> > Senior Data Engineer
> >
> > aboscar...@digitalocean.com
> > --
> > We're Hiring!  |
> > @digitalocean 
> >
>
>
> --
> daniel cohen
> +972-(0)54-4799-147
>


Re: Retiring Airflow Gitter?

2018-08-31 Thread James Meickle
I am in the gitter chat most work days and there's always activity.

I would be fine with switching to permanent retention slack for
searchability but don't see the point of switching without that feature.

On Fri, Aug 31, 2018, 12:59 Sid Anand  wrote:

> For a while now, we have had an Airflow Gitter account. Though this seemed
> like a good idea initially, I'd like to hear from the community if anyone
> gets value of out it. I don't believe any of the committers spend any time
> on Gitter.
>
> Early on, the initial committers tried to be available on it, but soon
> found it impossible to be available on all the timezones in which we had
> users. Furthermore, Gitter notoriously sucks at making previously answered
> questions discoverable. Also, the single-threaded nature of Gitter
> essentially makes it confusing to debug/discuss more than on topic at a
> time.
>
> The community seems to be humming along by relying on the Apache mailing
> lists, which don't suffer the downside listed above. Hence, as newbies join
> Apache Airflow, they likely hop onto Gitter. Are they getting value from
> it? If not, perhaps we are doing them a disservice and should consider just
> deleting it.
>
> Thoughts welcome.
> -s
>


Re: apache-airflow v1.10.0 on PyPi?

2018-08-15 Thread James Meickle
Can we make it a policy going forward to push GH tags for all RCs as part
of the release announcement? I deploy via the incubator-airflow repo, but
currently it only has tags for up to RC2, which means I have to look up and
then specify an ugly-looking commit to deploy an RC :(

On Wed, Aug 15, 2018 at 10:54 AM Taylor Edmiston 
wrote:

> Krish - You can also use the RCs before they're released on PyPI if you'd
> like to help test.  Instead of:
>
> pip install apache-airflow
>
> You can install the 1.10 stable latest with:
>
> pip install git+git://github.com/apache/incubator-airflow.git@v1-10-stable
>
> Or the 1.10 RC tags with eg:
>
> pip install git+git://github.com/apache/incubator-airflow.git@1.10.0rc2
>
> Best,
> Taylor
>
> *Taylor Edmiston*
> Blog  | CV
>  | LinkedIn
>  | AngelList
>  | Stack Overflow
> 
>
>
> On Thu, Aug 9, 2018 at 5:43 PM, Krish Sigler  wrote:
>
> > Got it, will use the mailing list in the future.  Thanks for the info
> >
> > On Thu, Aug 9, 2018 at 2:42 PM, Bolke de Bruin 
> wrote:
> >
> > > Hi Kris,
> > >
> > > Please use the mailing list for these kind of questions.
> > >
> > > Airflow 1.10.0 hasn’t been released yet. We are going through the
> > motions,
> > > but it will take a couple of days before it’s official (if all goes
> > well).
> > >
> > > B.
> > >
> > > Sent from my iPad
> > >
> > > On 9 Aug 2018, at 23:33, Krish Sigler wrote:
> > >
> > > Hi,
> > >
> > > First, I apologize if this is weird.  I saw on the Airflow github page
> > > that you most recently updated the v1.10.0 changelog, and I found your
> > > email using the instructions here (https://www.sourcecon.com/
> > > how-to-find-almost-any-github-users-email-address/).  If that's too
> > weird
> > > feel free to tell me and/or ignore this.
> > >
> > > I'm emailing because I'm working with the apache-airflow project,
> > > specifically for setting up pipelines involving GCP packages.  My
> > > environment uses Python3, and I've been running into the issue outlined
> > in
> > > this PR: https://github.com/apache/incubator-airflow/pull/3273.  I
> > > noticed that the fix is part of the v1.10.0 changelog.
> > > However, the latest version available on PyPi is 1.9.0.  On the Airflow
> > > wiki page I read that the project is intended to be updated every ~6
> > > months, and v1.9.0 was released in January.
> > >
> > > So my question, if you're at liberty to tell me, is can I expect
> v1.10.0
> > > to be available on PyPi in the near future?  If so then great!  That
> > would
> > > solve my package dependency problem.  If not, then I'll look into some
> > > workaround for my issue.
> > >
> > > Thanks,
> > > Krish
> > >
> > >
> >
>


Re: Basic modeling question

2018-08-08 Thread James Meickle
It sounds like you want something like this?

root_operator = DummyOperator(task_id="root")

def offset_operator(i):
    # "SQLOperator" is a placeholder for whatever SQL operator/sensor you use
    # to check or process the data for execution_date minus `i` days.
    my_sql_query = "SELECT * FROM ds_add(execution_date, {offset});".format(offset=i)
    sql_operator = SQLOperator(task_id="offset_by_{}".format(i), query=my_sql_query)
    return sql_operator

offset_operators = [offset_operator(i) for i in range(7)]
root_operator >> offset_operators

# Daily just waits on today, no offset
do_daily_work = DummyOperator(task_id="do_daily_work")
offset_operators[0] >> do_daily_work

# Weekly waits on today AND the six prior offsets
do_weekly_work = DummyOperator(task_id="do_weekly_work")
offset_operators >> do_weekly_work

IOW, every day you wait for that day's data to be available, and then run
the daily job; you also wait for the previous six days data to be
available, and when it is, run the weekly job.

n.b. - if you do it this way you will have up to 7 tasks polling the "same"
data point, which is slightly wasteful. But it's also not much code or
mental effort to write it this way.

On Wed, Aug 8, 2018 at 2:44 PM Gabriel Silk 
wrote:

> My main concern is how to express the fact that the weekly rollup depends
> on the previous 7 days worth of data, and ensure that it does not run until
> the tasks that generate those 7 days of data have run, assuming that tasks
> can run non-sequentially.
>
> It's easy enough when you have the following situation:
>
> (daily log ingestion) <-- (daily rollup)
>
> In any given DAG run, you are guaranteed to have the data needed for (daily
> rollup), because the dependency that generated its data just ran.
>
> But I'm not sure how best to model it when you have all of the following:
>
> (daily log ingestion) <-- (daily rollup)
> (daily log ingestion) <-- (weekly rollup)
> (daily log ingestion) <-- (monthly rollup)
>
>
>
> On Wed, Aug 8, 2018 at 11:29 AM, Taylor Edmiston 
> wrote:
>
> > Gabriel -
> >
> > Ah, I missed your earlier comment about weekly/monthly rollups also being
> > on a daily cadence.  So is your concern e.g., more about reducing the
> > redundant process of the weekly rollup tasks for the days of that range
> > that already processed in the previous DAG run(s)?  Or mainly about the
> > dependency of not executing the first weekly at all until the first 7
> daily
> > rollups worth of data have built up?
> >
> > *Taylor Edmiston*
> > Blog <https://blog.tedmiston.com/> | CV
> > <https://stackoverflow.com/cv/taylor> | LinkedIn
> > <https://www.linkedin.com/in/tedmiston/> | AngelList
> > <https://angel.co/taylor> | Stack Overflow
> > <https://stackoverflow.com/users/149428/taylor-edmiston>
> >
> >
> > On Wed, Aug 8, 2018 at 2:14 PM, James Meickle  > invalid> wrote:
> >
> > > If you want to run (daily, rolling weekly, rolling monthly) backups on
> a
> > > daily basis, and they're mostly the same but have some additional
> > > dependencies, you can write a DAG factory method, which you call three
> > > times. Certain nodes only get added to the longer-than-daily backups.
> > >
> > > On Wed, Aug 8, 2018 at 2:03 PM Gabriel Silk  >
> > > wrote:
> > >
> > > > Thanks Andy and Taylor for the suggestions --
> > > >
> > > > I see how that would work for the case where you want a weekly rollup
> > > that
> > > > runs on a weekly cadence.
> > > >
> > > > But what about a rolling weekly or monthly rollup that runs each day?
> > > >
> > > > On Wed, Aug 8, 2018 at 11:00 AM, Andy Cooper <
> > andy.coo...@astronomer.io>
> > > > wrote:
> > > >
> > > > > To expand on Taylor's idea
> > > > >
> > > > > I recently wrote a ScheduleBlackoutSensor that would allow you to
> > > > prevent a
> > > > > task from running if it meets the criteria provided. It accepts an
> > > array
> > > > of
> > > > > args for any number of the criteria so you could leverage this
> sensor
> > > to
> > > > > provide "blackout" runs for a range of days of the week.
> > > > >
> > > > > https://github.com/apache/incubator-airflow/pull/3702/files
> > > > >
> > > > > For example,
> > > > >
> > > > > task = ScheduleBlackoutSensor(day_of_week=[0,1,2,3,4,5], dag=dag)
> > > > >
> > > > > Would prevent a task from running Monday - Saturday, allowing it to
> > run
> > > > on
> > > > > Sunday.

Re: Basic modeling question

2018-08-08 Thread James Meickle
If you want to run (daily, rolling weekly, rolling monthly) backups on a
daily basis, and they're mostly the same but have some additional
dependencies, you can write a DAG factory method, which you call three
times. Certain nodes only get added to the longer-than-daily backups.
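
A rough sketch of what I mean (DAG ids, task names, and operators below are
invented for illustration, not taken from anyone's actual pipeline):

```
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator


def make_rollup_dag(name, lookback_days):
    dag = DAG(
        dag_id="rollup_{}".format(name),
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )
    wait_for_ingestion = DummyOperator(task_id="wait_for_ingestion", dag=dag)
    compute_rollup = DummyOperator(task_id="compute_rollup", dag=dag)
    wait_for_ingestion >> compute_rollup

    if lookback_days > 1:
        # Only the longer-than-daily variants get this extra node.
        wait_for_history = DummyOperator(
            task_id="wait_for_prior_{}_days".format(lookback_days), dag=dag
        )
        wait_for_history >> compute_rollup
    return dag


# Same factory, called three times; only the node set differs.
daily_dag = make_rollup_dag("daily", 1)
weekly_dag = make_rollup_dag("weekly", 7)
monthly_dag = make_rollup_dag("monthly", 28)
```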

On Wed, Aug 8, 2018 at 2:03 PM Gabriel Silk 
wrote:

> Thanks Andy and Taylor for the suggestions --
>
> I see how that would work for the case where you want a weekly rollup that
> runs on a weekly cadence.
>
> But what about a rolling weekly or monthly rollup that runs each day?
>
> On Wed, Aug 8, 2018 at 11:00 AM, Andy Cooper 
> wrote:
>
> > To expand on Taylor's idea
> >
> > I recently wrote a ScheduleBlackoutSensor that would allow you to
> prevent a
> > task from running if it meets the criteria provided. It accepts an array
> of
> > args for any number of the criteria so you could leverage this sensor to
> > provide "blackout" runs for a range of days of the week.
> >
> > https://github.com/apache/incubator-airflow/pull/3702/files
> >
> > For example,
> >
> > task = ScheduleBlackoutSensor(day_of_week=[0,1,2,3,4,5], dag=dag)
> >
> > Would prevent a task from running Monday - Saturday, allowing it to run
> on
> > Sunday.
> >
> > You could leverage this Sensor as you would any other sensor or you could
> > invert the logic so that you would only need to specify
> >
> > task = ScheduleBlackoutSensor(day_of_week=6, dag=dag)
> >
> > To "whitelist" a task to run on Sundays.
> >
> >
> > Let me know if you have any questions
> >
> > On Wed, Aug 8, 2018 at 1:47 PM Taylor Edmiston 
> > wrote:
> >
> > > Gabriel -
> > >
> > > One approach I've seen for a similar use case is to have multiple
> related
> > > rollups in one DAG that runs daily, then have the non-daily tasks skip
> > most
> > > of the time (e.g., weekly only actually executes on Sundays and is
> > > parameterized to look at the last 7 days).
> > >
> > > You could implement that not running part a few ways, but one idea is a
> > > sensor in front of the weekly rollup task.  Imagine a SundaySensor like
> > > return
> > > execution_date.weekday() == 6.  One thing to keep in mind here is
> > > dependence on the DAG's cron schedule being more granular than the
> tasks.
> > >
> > > I think this could generalize into a DayOfWeekSensor / DayOfMonthSensor
> > > that would be nice to have.
> > >
> > > Of course this does mean some scheduler inefficiency on the skip days,
> > but
> > > as long as those skips are fast and the overall number of tasks is
> > small, I
> > > can accept that.
> > >
> > > *Taylor Edmiston*
> > > Blog  | CV
> > >  | LinkedIn
> > >  | AngelList
> > >  | Stack Overflow
> > > 
> > >
> > >
> > > On Wed, Aug 8, 2018 at 1:11 PM, Gabriel Silk  >
> > > wrote:
> > >
> > > > Hello Airflow community,
> > > >
> > > > I have a basic question about how best to model a common data
> pipeline
> > > > pattern here at Dropbox.
> > > >
> > > > At Dropbox, all of our logs are ingested and written into Hive in
> > hourly
> > > > and/or daily rollups. On top of this data we build many weekly and
> > > monthly
> > > > rollups, which typically run on a daily cadence and compute results
> > over
> > > a
> > > > rolling window.
> > > >
> > > > If we have a metric X, it seems natural to put the daily, weekly, and
> > > > monthly rollups for metric X all in the same DAG.
> > > >
> > > > However, the different rollups have different dependency structures.
> > The
> > > > daily job only depends on a single day partition, whereas the weekly
> > job
> > > > depends on 7, the monthly on 28.
> > > >
> > > > In Airflow, it seems the two paradigms for modeling dependencies are:
> > > > 1) Depend on a *single run of a task* within the same DAG
> > > > 2) Depend on *multiple runs of task* by using an ExternalTaskSensor
> > > >
> > > > I'm not sure how I could possibly model this scenario using approach
> > #1,
> > > > and I'm not sure approach #2 is the most elegant or performant way to
> > > model
> > > > this scenario.
> > > >
> > > > Any thoughts or suggestions?
> > > >
> > >
> >
>


Re: [VOTE] Airflow 1.10.0rc3

2018-08-06 Thread James Meickle
Not a vote, but a comment: it might be worth noting that the new
environment variable is also required if you have any Airflow plugin test
suites that install Airflow as part of their dependencies. In my case, I
had to set the new env var outside of tox and add this:

```
[testenv]
passenv = SLUGIFY_USES_TEXT_UNIDECODE
```

(`setenv` did not work as that provides env vars at runtime but not
install time, as far as I can tell.)


On Sun, Aug 5, 2018 at 5:20 PM Bolke de Bruin  wrote:

> +1 :-)
>
> Sent from my iPhone
>
> > On 5 Aug 2018, at 23:08, Ash Berlin-Taylor <
> ash_airflowl...@firemirror.com> wrote:
> >
> > Yup, just worked out the same thing.
> >
> > I think as "punishment" for me finding bugs so late in two RCs (this,
> and 1.9) I should run the release for the next release.
> >
> > -ash
> >
> >> On 5 Aug 2018, at 22:05, Bolke de Bruin  wrote:
> >>
> >> Yeah I figured it out. Originally i was using a different
> implementation of UTCDateTime, but that was unmaintained. I switched, but
> this version changed or has a different contract. While it transforms on
> storing to UTC it does not so when it receives timezone aware fields from
> the db. Hence the issue.
> >>
> >> I will prepare a PR that removes the dependency and implements our own
> extension of DateTime. Probably tomorrow.
> >>
> >> Good catch! Just in time :-(.
> >>
> >> B.
> >>
> >>> On 5 Aug 2018, at 22:43, Ash Berlin-Taylor <
> ash_airflowl...@firemirror.com> wrote:
> >>>
> >>> Entirely possible, though I wasn't even dealing with the scheduler -
> the issue I was addressing was entirely in the webserver for a pre-existing
> Task Instance.
> >>>
> >>> Ah, I hadn't noticed/twigged we are using sqlalchemy-utc. It appears
> that isn't working right/ as expected. This line:
> https://github.com/spoqa/sqlalchemy-utc/blob/master/sqlalchemy_utc/sqltypes.py#L34
> doens't look right for us - as you mentioned the TZ is set to something
> (rather than having no TZ value).
> >>>
> >>> Some background on how Pq handles TZs. It always returns DTs in the TZ
> of the connection. I'm not sure if this is unique to postgres or if other
> DBs behave the same.
> >>>
> >>> postgres=# select '2018-08-03 00:00:00+00:00'::timestamp with time
> zone;
> >>>timestamptz
> >>> 
> >>> 2018-08-03 01:00:00+01
> >>>
> >>> postgres=# select '2018-08-03 02:00:00+02'::timestamp with time zone;
> >>>timestamptz
> >>> 
> >>> 2018-08-03 01:00:00+01
> >>>
> >>> The server will always return TZs in the connection timezone.
> >>>
> >>> postgres=# set timezone=utc;
> >>> SET
> >>> postgres=# select '2018-08-03 02:00:00+02'::timestamp with time zone;
> >>>timestamptz
> >>> 
> >>> 2018-08-03 00:00:00+00
> >>> (1 row)
> >>>
> >>> postgres=# select '2018-08-03 01:00:00+01'::timestamp with time zone;
> >>>timestamptz
> >>> 
> >>> 2018-08-03 00:00:00+00
> >>> (1 row)
> >>>
> >>>
> >>>
> >>>
> >>> -ash
> >>>
>  On 5 Aug 2018, at 21:28, Bolke de Bruin  wrote:
> 
>  This is the issue:
> 
>  [2018-08-05 22:08:21,952] {jobs.py:906} INFO - NEXT RUN DATE:
> 2018-08-03 00:00:00+00:00 tzinfo: 
>  [2018-08-05 22:08:22,007] {jobs.py:1425} INFO - Created  example_http_operator @ 2018-08-03 02:00:00+02:00:
> scheduled__2018-08-03T00:00:00+00:00, externally triggered: False>
> 
>  [2018-08-05 22:08:24,651] {jobs.py:906} INFO - NEXT RUN DATE:
> 2018-08-04 02:00:00+02:00 tzinfo: psycopg2.tz.FixedOffsetTimezone(offset=120,
> name=None)
>  [2018-08-05 22:08:24,696] {jobs.py:1425} INFO - Created  example_http_operator @ 2018-08-04 02:00:00+02:00:
> scheduled__2018-08-04T02:00:00+02:00, externally triggered: False>
> 
>  Notice at line 1+2: that the next run date is correctly in UTC but
> from the DB it gets a +2. At the next bit (3+4) we get a 
> psycopg2.tz.FixedOffsetTimezone
> which should be set to UTC according to the specs of
> https://github.com/spoqa/sqlalchemy-utc <
> https://github.com/spoqa/sqlalchemy-utc> , but it isn’t.
> 
>  So changing your setting of the DB to UTC fixes the symptom but not
> the cause.
> 
>  B.
> 
> > On 5 Aug 2018, at 22:03, Ash Berlin-Taylor <
> ash_airflowl...@firemirror.com> wrote:
> >
> > Sorry for being terse before.
> >
> > So the issue is that the ts loaded from the DB is not in UTC, it's
> in GB/+01 (the default of the DB server)
> >
> > For me, on a currently running 1.9 (no TZ) db:
> >
> > airflow=# select * from task_instance;
> > get_op| example_http_operator | 2018-07-23 00:00:00
> >
> > This date time appears in the log url, and the path it looks at on
> S3 is
> >
> > .../example_http_operator/2018-07-23T00:00:00/1.log
> >
> > If my postgres server has a default timezone of GB (which the one
> running on my laptop does), and I then apply the migration then it is
> converted to that local time.
> >
> > 

Re: Catchup By default = False vs LatestOnlyOperator

2018-07-23 Thread James Meickle
We use LatestOnlyOperator in production. Generally our data is available on
a regular schedule, and we update production services with it as soon as it
is available; we might occasionally want to re-run historical days, in
which case we want to run the same DAG but without interacting with live
production services at all.
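
As a sketch of that pattern (task and DAG names invented; import paths
assume Airflow 1.9/1.10):

```
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.latest_only_operator import LatestOnlyOperator

dag = DAG(
    dag_id="daily_pipeline",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

build_data = DummyOperator(task_id="build_data", dag=dag)
latest_only = LatestOnlyOperator(task_id="latest_only", dag=dag)
update_prod = DummyOperator(task_id="update_production_services", dag=dag)

# Backfills and historical re-runs still execute build_data, but everything
# downstream of latest_only is skipped unless this is the most recent run.
build_data >> latest_only >> update_prod
```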

On Mon, Jul 23, 2018 at 2:18 PM, George Leslie-Waksman 
wrote:

> As the author of LatestOnlyOperator, the goal was as a stopgap until
> catchup=False landed.
>
> There are some (very) fringe use cases where you might still want
> LatestOnlyOperator but in almost all cases what you want is probably
> catchup=False.
>
> The situations where LatestOnlyOperator is still useful are where you want
> to run most of your DAG for every schedule interval but you want some of
> the tasks to run only on the latest run (not catching up, not backfilling).
>
> It may be best to deprecate LatestOnlyOperator at this point to avoid
> confusion.
>
> --George
>
> On Sat, Jul 21, 2018 at 7:34 PM Ben Tallman  wrote:
>
> > As the author of catch-up, the idea is that in many cases your data
> > doesn't "window" nicely and you want instead to just run as if it were a
> > brilliant Cron...
> >
> > Ben
> >
> > Sent from my iPhone
> >
> > > On Jul 20, 2018, at 11:39 PM, Shah Altaf  wrote:
> > >
> > > Hi my understanding is: if you use the LatestOnlyOperator then when you
> > run
> > > the DAG for the first time you'll see a whole bunch of DAG runs queued
> > up,
> > > and in each run the LatestOnlyOperator will cause the rest of the DAG
> run
> > > to be skipped.  Only the latest DAG will run in 'full'.
> > >
> > > With catchup = False, you should just get just the latest DAG run.
> > >
> > >
> > > On Fri, Jul 20, 2018 at 10:58 PM Shubham Gupta <
> > shubham180695...@gmail.com>
> > > wrote:
> > >
> > >> -- Forwarded message -
> > >> From: Shubham Gupta 
> > >> Date: Fri, Jul 20, 2018 at 2:38 PM
> > >> Subject: Catchup By default = False vs LatestOnlyOperator
> > >> To: 
> > >>
> > >>
> > >> Hi!
> > >>
> > >> Can someone please explain the difference b/w catchup by default =
> False
> > >> and LatestOnlyOperator?
> > >>
> > >> Regarding
> > >> Shubham Gupta
> > >>
> >
>


Re: [Live Webinar Tomorrow] AIRFlow at Scale - Register Now

2018-07-19 Thread James Meickle
Thank you! Shared that with the team here.

On Thu, Jul 19, 2018 at 7:57 AM, Sumit Maheshwari 
wrote:

> Yeah, sorry missed that the webinar is in IST.
>
> Anyway, got the recording link
> https://www.brighttalk.com/webcast/15789/330901
>
>
>
> On Tue, Jul 10, 2018 at 6:26 PM, James Meickle <
> jmeic...@quantopian.com.invalid> wrote:
>
> > Almost signed up for this and then realized it's IST... would be very
> > interested in a recording or some other option that doesn't require
> waking
> > up at 5:30 AM my time, as the content sounds great! :)
> >
> > On Tue, Jul 10, 2018 at 7:58 AM, Sumit Maheshwari 
> > wrote:
> >
> > > FYI:
> > >
> > >
> > > -- Forwarded message --
> > > From: Qubole Team 
> > > Date: Tue, Jul 10, 2018 at 4:12 PM
> > > Subject: [Live Webinar Tomorrow] AIRFlow at Scale - Register Now
> > > To: sum...@qubole.com
> > >
> > >
> > > Big Data Your Way Any Cloud. Any Engine. Any Scale Visit Qubole at
> Spark
> > > Summit
> > >
> > >
> > >
> > >
> > >
> > > <http://visit.qubole.com/P000K08QAP0dY8030roZja2>
> > >
> > >
> > >
> > > Hi Sumit,
> > >
> > > The Data Team at Qubole collects usage and telemetry data from a
> million
> > > machines a month. We run many complex ETL workflows to process this
> data
> > > and provide reports, insights and recommendations to customers,
> analysts
> > > and data scientists. We use open source distribution of Apache Airflow
> to
> > > orchestrate our ETL and process more than 1 terabyte of data daily.
> > >
> > > Attend *tomorrow's webinar (3-4 pm IST)*, powered by *Digital Vidya*,
> and
> > > learn about *how we have extended Apache Airflow to manage the
> > operational
> > > inefficiencies that arise when you manage data pipelines in a
> > multi-tenant
> > > environment*. We will also be talking about how we have made the data
> > > pipelines robust by adding data quality checks using CheckOperators.
> > >
> > > *Key Takeaways:*
> > >
> > >- How we manage deploys and upgrades of data pipelines in a
> > multi-tenant
> > >environment
> > >- How we manage configuration for data pipelines in a multi-tenant
> > >environment
> > >- Data quality issues we faced with data ingestion/transformation
> > >- Enhancements we had to make to Check operators
> > >- Challenges faced in developing the alerting framework
> > >- Lesson learned and best practices in using Apache Airflow for data
> > >quality checks
> > >
> > >
> > > *Register Now! <http://visit.qubole.com/P000K08QAP0dY8030roZja2>*
> > >
> > > Best Regards,
> > > Qubole Team
> > >
> > >
> > >
> > > Register Now for the Webinar→
> > > <http://visit.qubole.com/P000K08QAP0dY8030roZja2>
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > [image: logo-qubole-lp.png]
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Who we are
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Data Science Optimization
> > >
> > > Qubole cluster auto-scaling and automated lifecycle management
> eliminate
> > > barriers to data science.
> > >
> > >
> > >
> > >
> > >
> > > Spark Notebooks in One Click
> > >
> > > Combine code execution, rich text, mathematics, plots, and rich media
> for
> > > self-service data exploration.
> > >
> > >
> > >
> > >
> > >
> > > Become Data Driven
> > >
> > > Give your team access to Machine Learning, ETL with Spark SQL, and
> > > real-time processing of streaming data.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > www.qubole.com
> > >
> > >
> > > [image: facebook] <http://visit.qubole.com/X8A0dpP0K000Z38ra02jQZ0>
> > >  [image:
> > > LinkedIn] <http://visit.qubole.com/GjP0Z420rr80Q08AdKa>   [image:
> > > Twitter] <http://visit.qubole.com/A40s08QP20K8jda00A010rZ>
> > >
> > >
> > >
> > > Qubole, 469 El Camino Real, Suite 205, Santa Clara, CA 95050
> > > <https://maps.google.com/?q=469+El+Camino+Real,+Suite+205,
> > > +Santa+Clara,+CA+95050&entry=gmail&source=g>
> > >
> > >
> > >
> > >   <http://visit.qubole.com/Y08840tKd00ar002APQ02Zj>
> > >
> > > This email was sent to sum...@qubole.com. If you no longer wish to
> > receive
> > > these emails you may unsubscribe
> > > <http://visit.qubole.com/u/yj4QradK080AZP508w00020> at any time.
> > >
> >
>


Re: PR for refactoring Airflow SLAs

2018-07-17 Thread James Meickle
Hi all,

I'd still love to get some eyes on this one if anyone has time. Definitely
needs some direction as to what is required before merging, since this is a
higher-level API change...

-James M.

On Mon, Jul 9, 2018 at 11:58 AM, James Meickle 
wrote:

> Hi folks,
>
> Based on my earlier email to the list, I have submitted a PR that splits
> `sla=` into three independent SLA parameters, as well as heavily
> restructuring other parts of the SLA feature:
>
> https://github.com/apache/incubator-airflow/pull/3584
>
> This is my first Airflow PR and I'm still learning the codebase, so
> there's likely to be flaws with it. But I'm most interested in the general
> compatibility of this feature with the rest of Airflow. We want this for
> our purposes at Quantopian, but we'd really prefer to get it into Airflow
> core rather than running a fork forever!
>
> Let me know your thoughts,
>
> -James M.
>


Re: DAG Level permissions (was Re: RBAC Update)

2018-07-17 Thread James Meickle
Really excited for this one - we have a lot of internal access controls and
this will help us implement them properly. It's going to be great being
able to give everyone access to see the overall state of DAG progress
without seeing its parameters or logs!

On Tue, Jul 17, 2018 at 12:48 AM, Ruiqin Yang  wrote:

> Congratulations! Extraordinary work! Thank you very much! This has been a
> highly desired feature for us for quite a while.
>
> Cheers,
> Kevin Yang
>
> On Mon, Jul 16, 2018 at 9:30 PM, Tao Feng wrote:
>
> > Hi,
> >
> > Just want to give an update that Airflow DAG level access just checked in
> > today(https://github.com/apache/incubator-airflow/pull/3197). Thanks a
> lot
> > for Max and Joy's review which helps me improving the pr.  I  create the
> > following three tickets as a follow up:
> >
> > https://issues.apache.org/jira/browse/AIRFLOW-2694 # Allow parsing
> access
> > control in DAG file
> > https://issues.apache.org/jira/browse/AIRFLOW-2693 # Implement sync_perm
> > model view endpoint
> > https://issues.apache.org/jira/browse/AIRFLOW-2695 # Support assign
> groups
> > of dag permission to a role
> >
> > I will start working on them in Q3.
> >
> > Thanks a lot,
> > -Tao
> >
> > On Sun, Apr 8, 2018 at 11:26 AM, Tao Feng  wrote:
> >
> > > Hi Max, Joy and Everyone,
> > >
> > > Based on the discussion, I put up a work in progress pr (
> > > https://github.com/apache/incubator-airflow/pull/3197/) with a doc(
> > > https://docs.google.com/document/d/1qs26lE9kAuCY0Qa0ga-8
> > > 0EQ7d7m4s-590lhjtMBjmxw/edit#) for DAG level access. I would like to
> get
> > > some early feedbacks and see if I miss anything or am in the wrong
> > > direction as it touches the core part.
> > >
> > > In the meantime, I will still continue improving the pr for couples of
> > > todos.
> > >
> > > Looking forward to your feedback.
> > >
> > > Thanks,
> > > -Tao
> > >
> > > On Mon, Apr 2, 2018 at 2:44 PM, Tao Feng  wrote:
> > >
> > >> Hi everyone,
> > >>
> > >> Thanks a lot for all the great discussions. To summarize in brief,
> here
> > >> are the few approaches we discussed so far:
> > >>
> > >> 1. One permission per DAG. The user has homogenous rights on the dag.
> > >>   The concerns:
> > >>
> > >>- not flexible to certain use cases(e.g the user has view only
> access
> > >>on certain dags and write access on certain other dags );
> > >>- difficult to implement as it breaks the existing FAB security
> > model.
> > >>
> > >> 2. Extend current model(ab_permission_view_role) with an additional
> > >> optional column (dag_id).
> > >>  The concerns:
> > >>
> > >>- introduce a 3rd column to existing permission_view table.
> > >>- It requires us to handle db migration carefully from view only
> > >>access UI to dag level access UI.
> > >>
> > >> 3. Define a set of pre-defined dag-level permissions. Reuse the
> current
> > >> existing permission_view model and consider each individual dag as a
> new
> > >> view.
> > >>
> > >> It seems that the 3rd approach is a preferable approach which makes
> the
> > >> security model easy to extend without introducing additional
> > complexity. If
> > >> no other concern, I will work with Max to create an initial proposal /
> > PR
> > >> based on 3rd approach for the work(https://issues.apache.org
> > >> /jira/browse/AIRFLOW-2267).
> > >>
> > >> Thanks,
> > >> -Tao
> > >>
> > >> On Sat, Mar 31, 2018 at 12:09 AM, Joy Gao  wrote:
> > >>
> > >>> +1!
> > >>>
> > >>> I was originally tempted to re-use existing perms and views for
> > dag-level
> > >>> access control since dag-level perm/view is a subset of view-level
> > >>> perm/view, but your proposal of defining new dag-level perms/views
> > >>> independent from view-level perms/views is interesting. This actually
> > >>> makes
> > >>> a lot of sense, since even in the existing models, views can also be
> > menu
> > >>> options, so we are simply extending the concept of views to include
> > dags
> > >>> as
> > >>> well.
> > >>>
> > >>> On Fri, Mar 30, 2018 at 5:24 PM Maxime Beauchemin <
> > >>> maximebeauche...@gmail.com> wrote:
> > >>>
> > >>> > I'd suggest something else that avoids having to add a 3rd column.
> I
> > >>> think
> > >>> > we can fit our use case into the existing structure nicely.
> > >>> >
> > >>> > My idea is to mimic what FAB does with its own Models.
> > >>> >
> > >>> > When you create a Model and ModelView in FAB (say DagRun for
> > example),
> > >>> it
> > >>> > creates a new view_menu (DagRun) and matches it with existing
> > >>> permission
> > >>> > (can_read, can_delete, can_edit, can_show, ...) to create a new set
> > of
> > >>> > "permission_views" which are combinations of permission and
> view_menu
> > >>> as in
> > >>> > "DagRun - can_read", "DagRun - can_edit", ... It's not a cartesian
> > >>> product
> > >>> > of all perms and view_menus, it's a predetermined list of
> > >>> model-specific
> > >>> > perms that get combined with DagRun here.
> > >>> >
> > >>> > Similarly, when Airflow would

Re: [VOTE] Airflow 1.10.0rc1

2018-07-12 Thread James Meickle
Looking at that diff, it seems like the function as a whole needs some
love, even if that commit were reverted. The use of os.walk means it's
going to crawl the entire tree every time, and use accumulated patterns to
check against each file in each folder. The behavior should be to use
patterns to also exclude entire folders from consideration, which would
likely dramatically speed up the function for most use cases.
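
Roughly the idea, as a standalone sketch rather than the actual Airflow
implementation:

```
import os
import re


def walk_dag_files(root, ignore_patterns):
    """Yield .py paths under root, pruning ignored directories early."""
    compiled = [re.compile(p) for p in ignore_patterns]
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into
        # excluded directories at all, instead of testing every file below.
        dirnames[:] = [
            d for d in dirnames
            if not any(p.search(os.path.join(dirpath, d)) for p in compiled)
        ]
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            if name.endswith(".py") and not any(
                p.search(full_path) for p in compiled
            ):
                yield full_path
```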

On Thu, Jul 12, 2018 at 9:14 AM, Ash Berlin-Taylor <
ash_airflowl...@firemirror.com> wrote:

> I wasn't paying enough attention. I can reproduce it.
>
> I'm not sure it was intentional but the bug was introduced in
> https://github.com/apache/incubator-airflow/pull/3171 <
> https://github.com/apache/incubator-airflow/pull/3171>
>
> I'd rather not release with a regression since 1.9, so updating my vote to
> -0.5 (binding), but not blocking/vetoing release if others are okay with it.
>
> -ash
>
> > On 12 Jul 2018, at 13:46, Ash Berlin-Taylor  com> wrote:
> >
> > That said I can't reproduce it myself (I will take discussion to the
> Jira ticket)
> >
> > -ash
> >
> >> On 12 Jul 2018, at 13:36, Ash Berlin-Taylor <
> ash_airflowl...@firemirror.com> wrote:
> >>
> >> Possible blocking regression here - .airflowignore doesn't seem to be
> respected anymore
> >>
> >> https://issues.apache.org/jira/browse/AIRFLOW-2729 <
> https://issues.apache.org/jira/browse/AIRFLOW-2729>
> >>
> >> -ash
> >>> On 12 Jul 2018, at 08:45, Bolke de Bruin  wrote:
> >>>
> >>> Hi Jakob,
> >>>
> >>> Thanks. We do include an INSTALL document that explains how to well
> install airflow and is quite a standard location for install instructions.
> Or did the install instruction needed to be included in the Vote?
> >>>
> >>> B.
> >>>
> >>>
> >>>
> >>> Sent from my iPad
> >>>
>  On 12 Jul 2018, at 07:54, Jakob Homan wrote:
> 
>  +1 (binding)
> 
>  * Sigs look good
>  * Artifact has incubating in name
>  * LICENSE/NOTICE/DISCLAIMER look good
>  - nit: DISCLAIMER is not word wrapped to 80 chars as I've seen in
>  other projects.
>  * Nit: Last release Sebb called out "Copyright 2016 and onwards" in
>  NOTICE as being imprecise.  This language remains.
>  * Spot check on license headers looks ok.
>  * Nit: Last release there was a request for instructions on how to
>  build the code.  This isn't included.
> 
>  -Jakob
> 
> > On 11 July 2018 at 19:50, Sid Anand  wrote:
> > FYI!
> > I just installed the release candidate. The first thing I noticed is
> a
> > missing tool tip for the Null State in the Recent Tasks column on the
> > landing page. Since the null globe is new to this UI, users will
> likely
> > hover over it to inquire what it means... and will be left wanting.
> Of
> > course, they could click on the globe, which will take them to
> > http://localhost:8080/admin/taskinstance/?flt1_dag_id_
> equals=example_bash_operator&flt2_state_equals=null,
> > which will always show an empty list, leaving them a bit more
> confused.
> >
> > -s
> >
> > On Wed, Jul 11, 2018 at 3:13 PM Carl Johan Gustavsson
> >  wrote:
> >
> >> Hi Bolke,
> >>
> >> (Switching email to avoid moderation on my emails.)
> >>
> >> The normal Airflow test suite does not fail as it uses a LC_ALL set
> to
> >> utf-8.
> >>
> >> I think it is a proper test though, it is a minimal reproducible
> version of
> >> the code that fails. And the only difference in behaviour is at 3.7
> which
> >> we don’t support anyway so I’m fairly sure it is broken for all
> supported
> >> Python 3 versions.
> >>
> >> I now tried running the tests in docker using 3.5 with the
> LC_ALL/LANG
> >> unset and I see the same failure.
> >>
> >> I don’t think this is a big thing though and we could release it
> without
> >> the fix I made. I think most people run it with a sane LC_ALL, but
> >> apparently we didn’t.
> >> Here’s the log for the test:
> >>
> >>> docker run -t -i -v `pwd`:/airflow/ python:3.5 bash
> >> root@b99b297df111:/# locale
> >> LANG=C.UTF-8
> >> LANGUAGE=
> >> LC_CTYPE="C.UTF-8"
> >> LC_NUMERIC="C.UTF-8"
> >> LC_TIME="C.UTF-8"
> >> LC_COLLATE="C.UTF-8"
> >> LC_MONETARY="C.UTF-8"
> >> LC_MESSAGES="C.UTF-8"
> >> LC_PAPER="C.UTF-8"
> >> LC_NAME="C.UTF-8"
> >> LC_ADDRESS="C.UTF-8"
> >> LC_TELEPHONE="C.UTF-8"
> >> LC_MEASUREMENT="C.UTF-8"
> >> LC_IDENTIFICATION="C.UTF-8"
> >> LC_ALL=
> >>> unset LANG
> >> root@b99b297df111:/# locale
> >> LANG=
> >> LANGUAGE=
> >> LC_CTYPE="POSIX"
> >> LC_NUMERIC="POSIX"
> >> LC_TIME="POSIX"
> >> LC_COLLATE="POSIX"
> >> LC_MONETARY="POSIX"
> >> LC_MESSAGES="POSIX"
> >> LC_PAPER="POSIX"
> >> LC_NAME="POSIX"
> >> LC_ADDRESS="POSIX"
> >> LC_TELEPHONE="POSIX"
> >> LC_MEASUREMENT="POSIX"
> 

Re: [Live Webinar Tomorrow] AIRFlow at Scale - Register Now

2018-07-10 Thread James Meickle
Almost signed up for this and then realized it's IST... would be very
interested in a recording or some other option that doesn't require waking
up at 5:30 AM my time, as the content sounds great! :)

On Tue, Jul 10, 2018 at 7:58 AM, Sumit Maheshwari  wrote:

> FYI:
>
>
> -- Forwarded message --
> From: Qubole Team 
> Date: Tue, Jul 10, 2018 at 4:12 PM
> Subject: [Live Webinar Tomorrow] AIRFlow at Scale - Register Now
> To: sum...@qubole.com
>
>
> Big Data Your Way Any Cloud. Any Engine. Any Scale Visit Qubole at Spark
> Summit
>
>
>
>
>
> 
>
>
>
> Hi Sumit,
>
> The Data Team at Qubole collects usage and telemetry data from a million
> machines a month. We run many complex ETL workflows to process this data
> and provide reports, insights and recommendations to customers, analysts
> and data scientists. We use open source distribution of Apache Airflow to
> orchestrate our ETL and process more than 1 terabyte of data daily.
>
> Attend *tomorrow's webinar (3-4 pm IST)*, powered by *Digital Vidya*, and
> learn about *how we have extended Apache Airflow to manage the operational
> inefficiencies that arise when you manage data pipelines in a multi-tenant
> environment*. We will also be talking about how we have made the data
> pipelines robust by adding data quality checks using CheckOperators.
>
> *Key Takeaways:*
>
>- How we manage deploys and upgrades of data pipelines in a multi-tenant
>environment
>- How we manage configuration for data pipelines in a multi-tenant
>environment
>- Data quality issues we faced with data ingestion/transformation
>- Enhancements we had to make to Check operators
>- Challenges faced in developing the alerting framework
>- Lesson learned and best practices in using Apache Airflow for data
>quality checks
>
>
> *Register Now! *
>
> Best Regards,
> Qubole Team
>
>
>
> Register Now for the Webinar→
> 
>
>
>
>
>
>
>
> [image: logo-qubole-lp.png]
>
>
>
>
>
>
>
> Who we are
>
>
>
>
>
>
>
>
>
> Data Science Optimization
>
> Qubole cluster auto-scaling and automated lifecycle management eliminate
> barriers to data science.
>
>
>
>
>
> Spark Notebooks in One Click
>
> Combine code execution, rich text, mathematics, plots, and rich media for
> self-service data exploration.
>
>
>
>
>
> Become Data Driven
>
> Give your team access to Machine Learning, ETL with Spark SQL, and
> real-time processing of streaming data.
>
>
>
>
>
>
>
>
>
> www.qubole.com
>
>
> [image: facebook] 
>  [image:
> LinkedIn]    [image:
> Twitter] 
>
>
>
> Qubole, 469 El Camino Real, Suite 205, Santa Clara, CA 95050
>  +Santa+Clara,+CA+95050&entry=gmail&source=g>
>
>
>
>   
>
> This email was sent to sum...@qubole.com. If you no longer wish to receive
> these emails you may unsubscribe
>  at any time.
>


PR for refactoring Airflow SLAs

2018-07-09 Thread James Meickle
Hi folks,

Based on my earlier email to the list, I have submitted a PR that splits
`sla=` into three independent SLA parameters, as well as heavily
restructuring other parts of the SLA feature:

https://github.com/apache/incubator-airflow/pull/3584
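
For context, this is roughly the shape of the change; the single `sla=` knob
is today's API, while the split parameters are shown with placeholder names
only (see the PR for the actual names and semantics):

```
from datetime import timedelta

from airflow.operators.dummy_operator import DummyOperator

# Today: one catch-all parameter, relative to the schedule interval.
# (Assumes an existing `dag` object.)
report = DummyOperator(task_id="report", sla=timedelta(hours=2), dag=dag)

# Proposed: independent expectations for start, duration, and finish.
# These names are illustrative placeholders, not necessarily the final API:
# report = DummyOperator(
#     task_id="report",
#     expected_start=timedelta(hours=1),
#     expected_duration=timedelta(minutes=30),
#     expected_finish=timedelta(hours=2),
#     dag=dag,
# )
```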

This is my first Airflow PR and I'm still learning the codebase, so there are
likely to be flaws in it. But I'm most interested in the general
compatibility of this feature with the rest of Airflow. We want this for
our purposes at Quantopian, but we'd really prefer to get it into Airflow
core rather than running a fork forever!

Let me know your thoughts,

-James M.


Re: K8S deployment operator proposal

2018-07-05 Thread James Meickle
I think it's going to be an antipattern to write Python configuration in
Airflow to configure a Kubernetes deployment since even a "simple"
deployment would likely be more classes/objects than the DAG itself. I do
like the idea of having a more featured operator than the PodOperator, but
if I were to choose a second operator to implement, it would definitely be
a Helm operator instead of a Kube API operator.
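
To be clear, no Helm operator exists today as far as I know; the crude
stand-in is shelling out to helm from a BashOperator, something like this
(chart and release names are invented, and it assumes an existing `dag`):

```
from airflow.operators.bash_operator import BashOperator

deploy_service = BashOperator(
    task_id="helm_deploy_my_service",
    bash_command=(
        "helm upgrade --install my-service ./charts/my-service "
        "--namespace data --wait"
    ),
    dag=dag,
)
```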

On Tue, Jul 3, 2018 at 12:47 PM, Daniel Imberman 
wrote:

> An example of creating a deployment using the k8s model architecture can be
> found here:
> https://github.com/kubernetes-client/python/blob/master/
> examples/deployment_examples.py
>
> def create_deployment_object():
> # Configureate Pod template container
> container = client.V1Container(
> name="nginx",
> image="nginx:1.7.9",
> ports=[client.V1ContainerPort(container_port=80)])
> # Create and configurate a spec section
> template = client.V1PodTemplateSpec(
> metadata=client.V1ObjectMeta(labels={"app": "nginx"}),
> spec=client.V1PodSpec(containers=[container]))
> # Create the specification of deployment
> spec = client.ExtensionsV1beta1DeploymentSpec(
> replicas=3,
> template=template)
> # Instantiate the deployment object
> deployment = client.ExtensionsV1beta1Deployment(
> api_version="extensions/v1beta1",
> kind="Deployment",
> metadata=client.V1ObjectMeta(name=DEPLOYMENT_NAME),
> spec=spec)
>
> return deployment
>
> This would involve a more k8s knowledge from the user, but would have the
> massive benefit that we would not have to maintain new features as the k8s
> API updates (Would simply update version). A user would have to supply is a
> deployment object and possibly a "success criteria" (i.e. an endpoint to
> test).
>
> Conversely, we could make the API a bit easier by only requiring a spec and
> an optional metadata, after which we would handle a lot of the boilerplate.
>
> On Tue, Jul 3, 2018 at 9:20 AM Daniel Imberman 
> wrote:
>
> >  Hi all,
> >
> > Enclosed is a proposal for a kubernetes deployment management operator. I
> > think this would be a good addition to current k8s offerings s.t. users
> can
> > actually launch persistent applications from airflow DAGs.
> >
> > * What?*
> >  Add an operator that monitors a k8s deployment, declaring the task
> > complete on proper deployment/accessibility of endpoint
> >
> > * Why?*
> >  Not all tasks are single pods, sometimes you would want to run one task
> > that launches a service, and then a second task that smoke tests/stress
> > tests/
> >  gives state to an application deployment. This would give airflow extra
> > functionality as a CI/CD tool in the k8s ecosystem.
> >
> > * Fix:*
> >  Create a modification (or extension) of the k8sPodOperator that can
> > handle entire deployments (possibly using the k8s model API to ensure
> > full flexibility of users).
> >
> >  Thank you.
> >
>


Re: What information is passed around different components of Airflow?

2018-07-05 Thread James Meickle
Airflow logs are stored on the worker filesystem. When a worker starts, it
runs a subprocess that serves logs via Flask:
https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L985

If you use the remote logging feature, the logs are (instead? also?) stored
in S3.
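
For reference, remote S3 logging is switched on through airflow.cfg along
these lines (bucket path and connection id are placeholders):

```
[core]
remote_logging = True
remote_base_log_folder = s3://my-airflow-logs/logs
remote_log_conn_id = aws_default
```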

Postgres stores most everything you see in the UI. Task and DAG state, user
accounts, privileges (in RBAC), variables, connections, etc.

I believe the rabbitmq information is just task states and names, and that
the workers fetch most of what they need from the database. But if you
intercepted it you could manipulate which tasks are being run, so I'd still
treat it as sensitive.

On Wed, Jul 4, 2018 at 5:37 PM, Kevin Lam  wrote:

> Hi,
>
> We run Apache Airflow as a set of k8s deployments inside of a GKE cluster,
> similar to the way specified in Mumoshu's github repo:
> https://github.com/mumoshu/kube-airflow.
>
> We are investigating securing our use of Airflow and are wondering about
> some of Airflow's implementation details. Specifically, we run some tasks
> where the workers have access to sensitive data. Some of the data can make
> its way into the task logs. However, we want to make sure isn't passed
> around eg. to the scheduler/database/message queue, and if it is, it should
> be encrypted in any network traffic (eg. via mutual tls).
>
> - Does airflow pass around logs to the postgres db, or rabbitmq?
> - Is the information in postgres mainly operational in nature?
> - Is the information in rabbitmq mainly operational in nature?
> - What about the scheduler?
> - Anything else we're missing?
>
> Any ideas are appreciated!
>
> Thanks in advance!
>


Re: Apache Airflow 1.10.0b2

2018-06-22 Thread James Meickle
I installed v1-10-test at work. I'm having an issue with it:

File
"XXX/airflow/dags/airflow-sandbox/submodules/quantflow/tests/operators/test_zipline_operators.py",
line 6, in 
  from freezegun import freeze_time
  ImportError: No module named 'freezegun'

From the same dir:

$ cat airflow/dags/airflow-sandbox/.airflowignore
submodules/*

In this case, the web UI appears to be ignoring .airflowignore and
recursing into a submodule's test directory. This behavior doesn't exist in
1.9 for the same submodule.

Let me know if I can help debug anything.

On Thu, Jun 21, 2018 at 5:53 PM, Naik Kaxil  wrote:

> I had found another bug regarding NotFound Error while refreshing and I
> report at https://issues.apache.org/jira/browse/AIRFLOW-2654 and resolved
> by https://github.com/apache/incubator-airflow/pull/3527 . This has been
> merged to master but will this be merged to v1-10-test or will directly be
> merged when v1-10-stable is created?
>
> Kaxil
>
> On 21/06/2018, 21:08, "Bolke de Bruin"  wrote:
>
> There is a Pr out by Kevin but it needs some adjustments and tests:
>
> https://github.com/apache/incubator-airflow/pull/3508
>
> B.
>
> Sent from my iPad
>
> > On 21 Jun 2018, at 22:02, Bolke de Bruin wrote:
> >
> > Marking as a blocker. When was this introduced?
> >
> > B.
> >
> > Sent from my iPad
> >
> >> On 21 Jun 2018, at 01:58, Naik Kaxil wrote:
> >>
> >> It is working on the FAB UI after removing the old airflow.cfg file
> and running `airflow initdb`.
> >>
> >> But with that the old webserver seems to be broken (Issue has
> already raised by Kevin here: https://issues.apache.org/
> jira/browse/AIRFLOW-2624).
> >>
> >> Regards,
> >> Kaxil
> >>
> >> On 20/06/2018, 22:35, "Naik Kaxil"  wrote:
> >>
> >>   Thanks Bolke.
> >>
> >>   I have been testing this today and I see that the logs are not
> showing up for me in the Web UI (old UI using Flask Admin).
> >>
> >>   The path of my log is: /Users/kaxil/airflow/logs/
> example_bash_operator/runme_0/2018-06-20T21:31:36.996580+00:00/1.log
> >>
> >>   However, The log tab in the UI is blank and my guess is it is
> looking at a wrong path.
> >>
> >>   I will test this for the new FAB UI as well and give you all an
> update.
> >>
> >>   Regards,
> >>   Kaxil
> >>
> >>   On 19/06/2018, 15:05, "Bolke de Bruin"  wrote:
> >>
> >>   Hi All,
> >>
> >>   I have rebranched v1-10-test today. I have created a sdist
> package that is available at:
> >>
> >>   http://people.apache.org/~bolke/apache-airflow-1.10.0b2+
> incubating.tar.gz  bolke/apache-airflow-1.10.0b2+incubating.tar.gz>
> >>
> >>   In order to distinguish it from an actual (apache) release it
> is:
> >>
> >>   1. Marked as beta (python package managers do not install
> beta versions by default - PEP 440)
> >>   2. It is not signed
> >>   3. It is not at an official apache distribution location
> >>
> >>   You can also put something like this in a requirements.txt
> file:
> >>
> >>   git+
> >>   https://github.com/apache/incubator-airflow@v1-10-test#
> egg=apache-airflow[celery,crypto,emr,hive,hdfs,ldap,
> mysql,postgres,redis,slack,s3  incubator-airflow@v1-10-test#egg=apache-airflow[celery,
> crypto,emr,hive,hdfs,ldap,mysql,postgres,redis,slack,s3>]
> >>    apache-airflow[celery,crypto,emr,hive,hdfs,ldap,mysql,
> postgres,redis,slack,s3]  incubator-airflow@master#egg=apache-airflow[celery,crypto,
> emr,hive,hdfs,ldap,mysql,postgres,redis,slack,s3]>>
> >>
> >>   and then "pip install -r requirements.txt”.
> >>
> >>   Tomorrow I will do a beta 3, Friday I will kick-off an actual
> vote for release.
> >>
> >>   Cheers
> >>   Bolke
> >>
> >>
> >>
> >>
> >>
> >>   Kaxil Naik
> >>
> >>   Data Reply
> >>   2nd Floor, Nova South
> >>   160 Victoria Street, Westminster
> >>   London SW1E 5LB - UK
> >>   phone: +44 (0)20 7730 6000
> >>   k.n...@reply.com
> >>   www.reply.com
> >>
> >>
> >>
> >>
> >>
> >>
> >> Kaxil Naik
> >>
> >> Data Reply
> >> 2nd Floor, Nova South
> >> 160 Victoria Street, Westminster
> >> London SW1E 5LB - UK
> >> phone: +44 (0)20 7730 6000
> >> k.n...@reply.com
> >> www.reply.com
>
>
>
>
>
>
> Kaxil Naik
>
> Data Reply
> 2nd Floor, Nova South
> 160 Victoria Street, Westminster
> London SW1E 5LB - UK
> phone: +44 (0)20 7730 6000
> k.n...@reply.com
> www.reply.com
>


Re: Airflow 1.10.0

2018-06-15 Thread James Meickle
Hi,

I have a sandbox cluster at work (3 EC2 VMs + Celery on Elasticache) that I
have been running 1.10 on. This is because we want to test in advance the
Kubernetes operator and RBAC. Happy to lend some assistance with that.

-James M.

On Fri, Jun 15, 2018 at 6:28 AM, Driesprong, Fokko 
wrote:

> Hi all,
>
> As you might know, I'm a consultant at GoDataDriven and I just changed to a
> different project. At this project they are not using Airflow (yet), so for
> me it is hard to test the RC's etc. I'm willing to help in the process, but
> currently I'm struggling to find time for the project on the side
> (including Airflow).
>
> Cheers, Fokko
>
> 2018-06-13 23:28 GMT+02:00 Bolke de Bruin :
>
> > Same here and will be for a while still :-(. So unfortunately 1.10 is a
> > bit stalled.
> >
> > B,
> >
> > > On 13 Jun 2018, at 17:16, Chris Riccomini 
> wrote:
> > >
> > > Hey Fokko,
> > >
> > > Sorry that I've been MIA for a while. Just wanted to check in on 1.10,
> > and
> > > its current status.
> > >
> > > Cheers,
> > > Chris
> > >
> > > On Tue, May 1, 2018 at 11:18 PM Driesprong, Fokko  >
> > > wrote:
> > >
> > >> Hi Bryon,
> > >>
> > >> We'll be releasing the RC's soon. If there aren't much issues with the
> > >> RC's, it will be released quickly. But we need the community to test
> > these.
> > >>
> > >> Cheers, Fokko
> > >>
> > >> 2018-05-01 20:57 GMT+02:00 Wicklund, Bryon  >:
> > >>
> > >>> Hey I was wondering if you had a date in mind or an estimate for when
> > >>> Airflow 1.10.0 will be released?
> > >>>
> > >>> Thanks!
> > >>> -Bryon
> > >>>
> > >>> This e-mail, including attachments, may include confidential and/or
> > >>> proprietary information, and may be used only by the person or entity
> > >>> to which it is addressed. If the reader of this e-mail is not the
> > >> intended
> > >>> recipient or his or her authorized agent, the reader is hereby
> notified
> > >>> that any dissemination, distribution or copying of this e-mail is
> > >>> prohibited. If you have received this e-mail in error, please notify
> > the
> > >>> sender by replying to this message and delete this e-mail
> immediately.
> > >>>
> > >>
> >
> >
>


Re: Capturing data changes that happen after the initial data pull

2018-06-07 Thread James Meickle
One way to do this would be to have your DAG file return two nearly
identical DAGs (like put it in a factory function and call it twice). The
difference would be that the "final" run would have a conditional to add an
extra time sensor at the DAG root, to wait N days for the data to finalize.
The effect would be that two DAG runs would start each day, but the latter
would stay uncompleted for a while.

Doing it that way has the advantage of still aligning with daily updates
rather than having a daily DAG and a weekly DAG; keeping execution dates
aligned with the data being processed; and not having the primary DAG
runs stay open for days (better for status tracking).

It has the disadvantage that any extremely long running tasks (including
sensors) will be more likely to fail, and that you would have several more
tasks and DAG runs open and taking resources/concurrency.
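
A sketch of the two-DAG factory idea (DAG ids, task names, and the
three-day wait are all illustrative):

```
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import TimeDeltaSensor


def make_pull_dag(suffix, wait_days=0):
    dag = DAG(
        dag_id="vendor_pull_{}".format(suffix),
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )
    pull = DummyOperator(task_id="pull_data", dag=dag)
    load = DummyOperator(task_id="load_data", dag=dag)
    if wait_days:
        # Holds the "final" variant open until wait_days past the end of
        # the schedule interval, when the data should have settled.
        wait = TimeDeltaSensor(
            task_id="wait_for_final_data",
            delta=timedelta(days=wait_days),
            dag=dag,
        )
        wait >> pull
    pull >> load
    return dag


preliminary_dag = make_pull_dag("preliminary")
final_dag = make_pull_dag("final", wait_days=3)
```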

On Jun 6, 2018 3:34 PM, "Pedro Machado"  wrote:

This is a similar case. My idea was to rerun the whole data pull. The
current DAG is idempotent so there is no issue with inserting duplicates.
Now, I'm trying to figure out the best way to code it in Airflow. Thanks


On Wed, Jun 6, 2018 at 2:29 PM Ben Gregory  wrote:

> I've seen a similar use case with DoubleClick/Google Analytics (
> https://support.google.com/ds/answer/2791195?hl=en), where the reporting
> metrics have a "lookback window" of up to 30 days to mark conversion
> attribution (so if a user converts on the 14th day of clicking on an ad it
> will still be counted.
>
> What we ended up doing in that case is setting the start/stop params of
the
> query to the the full window (pulled daily) and then upserted in Redshift
> based on a primary key (in our case actually a composite key with multiple
> attributes). So you end up pulling a lot of redundant data but since
> there's no way to pull only updated records (which sounds like the case
> you're in), it's the best way to ensure your reporting is up-to-date.
>
>
>
> On Wed, Jun 6, 2018 at 1:10 PM Pedro Machado  wrote:
>
> > Yes. It's literally the same API calls with the same dates, only done a
> few
> > days later. It's just redoing the same data pull but instead of pulling
> one
> > date each dag run, it would pull all dates for the previous week on
> > Tuesdays.
> >
> > Thanks!
> >
>
>
> --
>
> [image: Astronomer Logo] 
>
> *Ben Gregory*
> Data Engineer
>
> Mobile: +1-615-483-3653 • Online: astronomer.io <
> https://www.astronomer.io/>
>
> Download our new ebook.  From
> Volume
> to Value - A Guide to Data Engineering.
>


Re: Single Airflow Instance Vs Multiple Airflow Instance

2018-06-06 Thread James Meickle
An important consideration here is that there are several settings that are
cluster-wide. In particular, cluster-wide concurrency settings could result
in Team B's DAG refusing to schedule based on an error in Team A's DAG.
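
For example, these airflow.cfg values apply to the whole instance no matter
which team owns a DAG (the numbers shown are only illustrative):

```
[core]
parallelism = 32              # max running task instances across the instance
dag_concurrency = 16          # default per-DAG running task limit
max_active_runs_per_dag = 16  # default per-DAG active run limit
```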

Do your teams follow similar practices in how eagerly they ship code, or
have similar SLAs for resolving issues? If so, you are probably fine using
co-tenancy. If not, you should probably talk about it first to make sure
the teams are okay with co-tenancy.

On Wed, Jun 6, 2018 at 11:24 AM, gauthiermarti...@gmail.com <
gauthiermarti...@gmail.com> wrote:

> Hi Everyone,
>
> We have been experimenting with airflow for about 6 months now.
> We are planning to have multiple departments to use it. Since we don't
> have any internal experience with Airflow we are wondering if single
> instance per department is more suited than single instance with
> multi-tenancy? We have been aware about the upcoming release of airflow
> 1.10 and changes that will be made to the RBAC which will be more suited
> for multi-tenancy.
>
> Any advice on this ? Any tips could be helpful to us.
>


Re: Dealing with data latency

2018-06-06 Thread James Meickle
Yes, exactly. Sensors are ultimately just a few methods on top of a
standard operator:
https://airflow.apache.org/_modules/airflow/operators/sensors.html

The BaseSensorOperator doesn't modify how retries work. You definitely want
a retry in the case of the worker running the sensor dying. But even if you
have a temporary DNS outage, or drop an SSH connection - that might merit
needing a retry too, depending on how the operator was implemented (whether
it performs any retrying itself before causing a task failure).
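
As a rough sketch, a custom sensor only needs a poke() method, and retries/retry_delay are ordinary operator arguments that apply to the sensor task like any other (the file-existence check below is just a made-up example):

import os
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.sensors import BaseSensorOperator
from airflow.utils.decorators import apply_defaults

class FileLandedSensor(BaseSensorOperator):
    """Hypothetical sensor: poke() is essentially the only method a sensor adds."""
    @apply_defaults
    def __init__(self, path, *args, **kwargs):
        super(FileLandedSensor, self).__init__(*args, **kwargs)
        self.path = path

    def poke(self, context):
        # Return True when the condition is met; the sensor re-pokes every
        # poke_interval seconds until then, or until timeout is reached.
        return os.path.exists(self.path)

dag = DAG("sensor_example", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

wait = FileLandedSensor(
    task_id="wait_for_file",
    path="/data/incoming.csv",
    poke_interval=60,
    timeout=60 * 60,
    retries=3,                          # lets the task recover if the worker dies
    retry_delay=timedelta(minutes=5),
    dag=dag,
)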

On Tue, Jun 5, 2018 at 8:12 PM, Pedro Machado  wrote:

> Hi James,
> I've noticed that some dags fail if the services are restarted while a
> sensor is waiting. Originally I didn't think retries would be relevant for
> a time sensor but it sounds like if the worker crashes, the only way for
> the sensor to rerun is if the retry count hasn't been met. Is this one of
> the points you are making?
> Thanks.
>
> On Tue, Jun 5, 2018 at 9:41 AM James Meickle 
> wrote:
>
> > We have to use a lot of time sensors like this, for reports that
> shouldn't
> > be filed to a third party before a certain time of day. Since these
> sensors
> > are themselves tasks, they can fail to be scheduled or can fail, like if
> > the underlying worker instance dies. I would recommend double checking
> your
> > concurrency settings (esp. since you will have multiple days worth of
> DAGs
> > concurrently running) and your retry settings.
> >
> > On Tue, Jun 5, 2018 at 10:34 AM, Pedro Machado 
> > wrote:
> >
> > > Thanks, Max!
> > >
> > > On Mon, Jun 4, 2018 at 12:47 PM Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > > > The common standard is to have the execution_date aligned with the
> > > > partition date in the database (say 2018-08-08) and contain data from
> > > > 2018-08-08T00:00:000
> > > > to 2018-08-09T23:59:999.
> > > >
> > > > The partition date and execution_date match and correspond to the
> left
> > > > bound of the time interval processed.
> > > >
> > > > Then you'd use some sensors to make sure this cannot run until the
> > > desired
> > > > time or conditions are met.
> > > >
> > > > Max
> > > >
> > > > On Mon, Jun 4, 2018 at 5:46 AM Pedro Machado 
> > > wrote:
> > > >
> > > > > Hi. What is the recommended way to deal with data latency? For
> > > example, I
> > > > > have a feed that is not considered final until 72 hours have passed
> > > after
> > > > > the end of the daily period.
> > > > >
> > > > > For example, Monday's data would be ready by Thursday at 23:59.
> > > > >
> > > > > Should I pull data based on the execution date minus a 72 hour
> offset
> > > or
> > > > > use the execution date and somehow delay the data pull for 72
> hours?
> > > > >
> > > > > The latter would be more intuitive (data pull date = execution
> date)
> > > but
> > > > I
> > > > > am not sure if it's a good pattern.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Pedro
> > > > >
> > > >
> > >
> >
>


Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread James Meickle
I think that a banner notification would be a fair penalty if you access
Airflow without authentication, or have API authentication turned off, or
are accessing via http:// with a non-localhost `Host:`. (Are there any
other circumstances to think of?)

I would also suggest serving a default robots.txt to mitigate accidental
indexing of public instances (as most public instances will be accidentally
public, statistically speaking). If you truly want your Airflow instance
public and indexed, you should have to go out of your way to permit that.
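
As a sketch of what operators could do today, an Airflow plugin can register a Flask blueprint that serves a deny-all robots.txt. The plugin and blueprint names below are made up, and this assumes plugin blueprints are mounted at the web root:

from flask import Blueprint, Response
from airflow.plugins_manager import AirflowPlugin

robots_bp = Blueprint("robots_txt", __name__)

@robots_bp.route("/robots.txt")
def robots():
    # Deny-all: ask well-behaved crawlers not to index anything on this host
    return Response("User-agent: *\nDisallow: /\n", mimetype="text/plain")

class RobotsTxtPlugin(AirflowPlugin):
    name = "robots_txt"
    flask_blueprints = [robots_bp]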

On Tue, Jun 5, 2018 at 1:51 PM, Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> What about a clear alert on the UI showing when auth is off? Perhaps a
> large red triangle-exclamation icon on the navbar with a tooltip
> "Authentication is off, this Airflow instance in not secure." and clicking
> take you to the doc's security page.
>
> Well and then of course people should make sure their infra isn't open to
> the Internet. We really shouldn't have to tell people to keep their
> infrastructure behind a firewall. In most environments you have to do quite
> a bit of work to open any resource up to the Internet (SSL certs, special
> security groups for load balancers/proxies, ...). Now I'm curious to
> understand how UMG managed to do this by mistake...
>
> Also a quick reminder to use the Connection abstraction to store secrets,
> ideally using the environment variable feature.
>
> Max
>
> On Tue, Jun 5, 2018 at 10:02 AM Taylor Edmiston 
> wrote:
>
> > One of our engineers wrote a blog post about the UMG mistakes as well.
> >
> > https://www.astronomer.io/blog/universal-music-group-airflow-leak/
> >
> > I know that best practices are well known here, but I second James'
> > suggestion that we add some docs, code, or config so that the framework
> > optimizes for being (nearly) production-ready by default and not just
> easy
> > to start with for local dev.  Admittedly this takes some work to not add
> > friction to the local onboarding experience.
> >
> > Do most people keep separate airflow.cfg files per environment like
> what's
> > considered the best practice in the Django world?  e.g.
> > https://stackoverflow.com/q/10664244/149428
> >
> > Taylor
> >
> > *Taylor Edmiston*
> > Blog <https://blog.tedmiston.com/> | CV
> > <https://stackoverflow.com/cv/taylor> | LinkedIn
> > <https://www.linkedin.com/in/tedmiston/> | AngelList
> > <https://angel.co/taylor> | Stack Overflow
> > <https://stackoverflow.com/users/149428/taylor-edmiston>
> >
> >
> > On Tue, Jun 5, 2018 at 9:57 AM, James Meickle 
> > wrote:
> >
> > > Bumping this one because now Airflow is in the news over it...
> > >
> > > https://www.bleepingcomputer.com/news/security/contractor-
> > > exposes-credentials-for-universal-music-groups-it-
> > > infrastructure/?utm_campaign=Security%2BNewsletter&utm_
> > > medium=email&utm_source=Security_Newsletter_co_79
> > >
> > > On Fri, Mar 23, 2018 at 9:33 AM, James Meickle <
> jmeic...@quantopian.com>
> > > wrote:
> > >
> > > > While Googling something Airflow-related a few weeks ago, I noticed
> > that
> > > > someone's Airflow dashboard had been indexed by Google and was
> > accessible
> > > > to the outside world without authentication. A little more Googling
> > > > revealed a handful of other indexed instances in various states of
> > > > security. I did my best to contact the operators, and waited for
> > > responses
> > > > before posting this.
> > > >
> > > > Airflow is not a secure project by default (
> https://issues.apache.org/
> > > > jira/browse/AIRFLOW-2047), and you can do all sorts of mean things to
> > an
> > > > instance that hasn't been intentionally locked down. (And even then,
> > you
> > > > shouldn't rely exclusively on your app's authentication for providing
> > > > security.)
> > > >
> > > > Having "internal" dashboards/data sources/executors exposed to the
> web
> > is
> > > > dangerous, since old versions can stick around for a very long time,
> > help
> > > > compromise unrelated deployments, and generally just create very bad
> > > press
> > > > for the overall project if there's ever a mass compromise (see: Redis
> > and
> > > > MongoDB).
> > > >
> > > > Shipping secure defaults is hard, but perhaps we could add best
> > practices
> > > > like instructions for deploying a robots.txt with Airflow? Or an
> impact
> > > > statement about what someone could do if they access your Airflow
> > > instance?
> > > > I think that many people deploying Airflow for the first time might
> not
> > > > realize that it can get indexed, or how much damage someone can cause
> > via
> > > > accessing it.
> > > >
> > >
> >
>


Re: Dealing with data latency

2018-06-05 Thread James Meickle
We have to use a lot of time sensors like this, for reports that shouldn't
be filed to a third party before a certain time of day. Since these sensors
are themselves tasks, they can fail to be scheduled or can fail, like if
the underlying worker instance dies. I would recommend double checking your
concurrency settings (esp. since you will have multiple days worth of DAGs
concurrently running) and your retry settings.

On Tue, Jun 5, 2018 at 10:34 AM, Pedro Machado  wrote:

> Thanks, Max!
>
> On Mon, Jun 4, 2018 at 12:47 PM Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
> > The common standard is to have the execution_date aligned with the
> > partition date in the database (say 2018-08-08) and contain data from
> > 2018-08-08T00:00:000
> > to 2018-08-09T23:59:999.
> >
> > The partition date and execution_date match and correspond to the left
> > bound of the time interval processed.
> >
> > Then you'd use some sensors to make sure this cannot run until the
> desired
> > time or conditions are met.
> >
> > Max
> >
> > On Mon, Jun 4, 2018 at 5:46 AM Pedro Machado 
> wrote:
> >
> > > Hi. What is the recommended way to deal with data latency? For
> example, I
> > > have a feed that is not considered final until 72 hours have passed
> after
> > > the end of the daily period.
> > >
> > > For example, Monday's data would be ready by Thursday at 23:59.
> > >
> > > Should I pull data based on the execution date minus a 72 hour offset
> or
> > > use the execution date and somehow delay the data pull for 72 hours?
> > >
> > > The latter would be more intuitive (data pull date = execution date)
> but
> > I
> > > am not sure if it's a good pattern.
> > >
> > > Thanks,
> > >
> > > Pedro
> > >
> >
>


Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread James Meickle
Bumping this one because now Airflow is in the news over it...

https://www.bleepingcomputer.com/news/security/contractor-exposes-credentials-for-universal-music-groups-it-infrastructure/?utm_campaign=Security%2BNewsletter&utm_medium=email&utm_source=Security_Newsletter_co_79

On Fri, Mar 23, 2018 at 9:33 AM, James Meickle 
wrote:

> While Googling something Airflow-related a few weeks ago, I noticed that
> someone's Airflow dashboard had been indexed by Google and was accessible
> to the outside world without authentication. A little more Googling
> revealed a handful of other indexed instances in various states of
> security. I did my best to contact the operators, and waited for responses
> before posting this.
>
> Airflow is not a secure project by default (https://issues.apache.org/
> jira/browse/AIRFLOW-2047), and you can do all sorts of mean things to an
> instance that hasn't been intentionally locked down. (And even then, you
> shouldn't rely exclusively on your app's authentication for providing
> security.)
>
> Having "internal" dashboards/data sources/executors exposed to the web is
> dangerous, since old versions can stick around for a very long time, help
> compromise unrelated deployments, and generally just create very bad press
> for the overall project if there's ever a mass compromise (see: Redis and
> MongoDB).
>
> Shipping secure defaults is hard, but perhaps we could add best practices
> like instructions for deploying a robots.txt with Airflow? Or an impact
> statement about what someone could do if they access your Airflow instance?
> I think that many people deploying Airflow for the first time might not
> realize that it can get indexed, or how much damage someone can cause via
> accessing it.
>


Re: Improving Airflow SLAs

2018-05-24 Thread James Meickle
Hm, not sure I understand your question. To my mind this use case already
isn't possible because while the callback function is currently a DAG
attribute, the SLA value has always been set on specific tasks within the
DAG. Perhaps this gets obscured by the current SLA miss email reporting
downstream tasks, but it's an individual task's SLA (or more than one task
if they got batched) that was triggering that email, not the DAG's.

I'm not sure it's worth adding DAG-level SLAs too since that would likely
result in confusion with task-level SLAs. I feel like for most use cases,
when the DAGRun starts/finishes is more of an implementation detail, and
the actual concern is for specific tasks (even if it's only the first or
the last tasks in your pipeline). Open to discussion on that though!
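
For what it's worth, a rough way to approximate a pipeline-level SLA today is to put the task-level sla on the last task in the DAG; a minimal sketch:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("pipeline", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

extract = DummyOperator(task_id="extract", dag=dag)
# An SLA on the last task effectively bounds when the whole pipeline must have
# finished, since it cannot succeed before everything upstream of it has.
finalize = DummyOperator(task_id="finalize", sla=timedelta(hours=6), dag=dag)
extract >> finalize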

On Thu, May 24, 2018 at 3:16 PM, Ace Haidrey  wrote:

> Hi James,
> I haven’t read everything or looked at the entire PR yet but one thing I
> wanted to ask was you state you move the SLA miss callback to the task
> level. In our org and I can imagine in others, we would like to have the
> callback stay at the DAG level so we can see if the entire pipeline is
> taking longer than X hours, not just if each task is taking more than X
> hours. Is this still possible, to do that feasibly?
>
> > On May 24, 2018, at 12:12 PM, James Meickle 
> wrote:
> >
> > Just giving this a bump; it's a pretty major rework so I'd love to know
> > whether this effort is likely to be accepted if I bring it to a PR-able
> > state, before I invest more time.
> >
> > On Wed, May 23, 2018 at 1:59 PM, James Meickle 
> > wrote:
> >
> >> Hi folks,
> >>
> >> I've created a branch off of v1-10-test; the diff can be found here:
> >> https://github.com/apache/incubator-airflow/compare/v1-
> >> 10-test...Eronarn:sla_improvements
> >>
> >> As a recap, this work is expected to do the following:
> >>
> >> - split the "sla" parameter into three independent SLAs: expected
> >> duration, expected start, and expected finish
> >> - move the SLA miss callback to be a task-level attribute rather than
> >> DAG-level (removing a lot of the "batching" functionality)
> >> - convert the SLA miss email to the default SLA miss callback
> >> - add a "type" to SLA misses, which will be part of the primary key, and
> >> can be checked against in the callback to respond appropriately to the
> type
> >> of SLA that was missed.
> >> - don't send SLA misses for skipped tasks, or for backfill jobs
> >>
> >> Before I polish up the remaining TODO functions and write a migration
> and
> >> tests, I'd appreciate feedback from the maintainers as to whether this
> >> seems to be on the right track, design-wise. (Note that it's definitely
> not
> >> going to pass tests right now; I am having significant problems getting
> >> Airflow's test suite running locally so I'm not even attempting at the
> >> moment.)
> >>
> >> Thanks,
> >>
> >> -James M.
> >>
> >> On Wed, May 9, 2018 at 12:43 PM, James Meickle  >
> >> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Since the response so far has been positive or neutral, I intend to
> >>> submit one or more PRs targeting 2.0 (I think that some parts will be
> >>> separable from a larger SLA refactor). I intend to address at least the
> >>> following JIRA issues:
> >>>
> >>> https://issues.apache.org/jira/browse/AIRFLOW-2236
> >>> https://issues.apache.org/jira/browse/AIRFLOW-1472
> >>> https://issues.apache.org/jira/browse/AIRFLOW-1360
> >>> https://issues.apache.org/jira/browse/AIRFLOW-557
> >>> https://issues.apache.org/jira/browse/AIRFLOW-133
> >>>
> >>> Regards,
> >>>
> >>> -James M.
> >>>
> >>>
> >>>
> >>> On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin <
> >>> maximebeauche...@gmail.com> wrote:
> >>>
> >>>> About de-coupling the SLA management process, I've had conversations
> in
> >>>> the
> >>>> direction of renaming the scheduler to "supervisor" to reflect the
> fact
> >>>> that it's not just scheduling processes, it does a lot more tasks than
> >>>> just
> >>>> that, SLA management being one of them.
> >>>>
> >>>> I still think the default should be t

Re: Improving Airflow SLAs

2018-05-24 Thread James Meickle
Just giving this a bump; it's a pretty major rework so I'd love to know
whether this effort is likely to be accepted if I bring it to a PR-able
state, before I invest more time.

On Wed, May 23, 2018 at 1:59 PM, James Meickle 
wrote:

> Hi folks,
>
> I've created a branch off of v1-10-test; the diff can be found here:
> https://github.com/apache/incubator-airflow/compare/v1-
> 10-test...Eronarn:sla_improvements
>
> As a recap, this work is expected to do the following:
>
> - split the "sla" parameter into three independent SLAs: expected
> duration, expected start, and expected finish
> - move the SLA miss callback to be a task-level attribute rather than
> DAG-level (removing a lot of the "batching" functionality)
> - convert the SLA miss email to the default SLA miss callback
> - add a "type" to SLA misses, which will be part of the primary key, and
> can be checked against in the callback to respond appropriately to the type
> of SLA that was missed.
> - don't send SLA misses for skipped tasks, or for backfill jobs
>
> Before I polish up the remaining TODO functions and write a migration and
> tests, I'd appreciate feedback from the maintainers as to whether this
> seems to be on the right track, design-wise. (Note that it's definitely not
> going to pass tests right now; I am having significant problems getting
> Airflow's test suite running locally so I'm not even attempting at the
> moment.)
>
> Thanks,
>
> -James M.
>
> On Wed, May 9, 2018 at 12:43 PM, James Meickle 
> wrote:
>
>> Hi all,
>>
>> Since the response so far has been positive or neutral, I intend to
>> submit one or more PRs targeting 2.0 (I think that some parts will be
>> separable from a larger SLA refactor). I intend to address at least the
>> following JIRA issues:
>>
>> https://issues.apache.org/jira/browse/AIRFLOW-2236
>> https://issues.apache.org/jira/browse/AIRFLOW-1472
>> https://issues.apache.org/jira/browse/AIRFLOW-1360
>> https://issues.apache.org/jira/browse/AIRFLOW-557
>> https://issues.apache.org/jira/browse/AIRFLOW-133
>>
>> Regards,
>>
>> -James M.
>>
>>
>>
>> On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin <
>> maximebeauche...@gmail.com> wrote:
>>
>>> About de-coupling the SLA management process, I've had conversations in
>>> the
>>> direction of renaming the scheduler to "supervisor" to reflect the fact
>>> that it's not just scheduling processes, it does a lot more tasks than
>>> just
>>> that, SLA management being one of them.
>>>
>>> I still think the default should be to require a single supervisor that
>>> would do all the "supervision" work though. I'm generally against
>>> requiring
>>> more types of nodes on the cluster. But perhaps the supervisor could have
>>> switches to be started in modes where it would only do a subset of its
>>> tasks, so that people can run multiple specialized supervisor nodes if
>>> they
>>> want to.
>>>
>>> For the record, I was thinking that renaming the scheduler to supervisor
>>> would likely happen as we re-write it to enable multiple concurrent
>>> supervisor processes. It turns out that parallelizing the scheduler
>>> hasn't
>>> been as critical as I thought it would be originally, especially with the
>>> current multi-process scheduler. Sounds like the community is getting a
>>> lot
>>> of mileage out of this current multi-process scheduler.
>>>
>>> Max
>>>
>>> On Thu, May 3, 2018 at 7:31 AM, Jiening Wen 
>>> wrote:
>>>
>>> > I would love to see this proposal gets implemented in airflow.
>>> > In our case duration based SLA makes much more sense and I ended up
>>> adding
>>> > a decorator to the execute method in our custom operators.
>>> >
>>> > Best regards,
>>> > Jiening
>>> >
>>> > -Original Message-
>>> > From: James Meickle [mailto:jmeic...@quantopian.com]
>>> > Sent: Wednesday 02 May 2018 9:00 PM
>>> > To: dev@airflow.incubator.apache.org
>>> > Subject: Improving Airflow SLAs [External]
>>> >
>>> > At Quantopian we use Airflow to produce artifacts based on the previous
>>> > day's stock market data. These artifacts are required for us to trade
>>> on
>>> > today's stock market. Therefore, I've been investing time in improving

Re: Improving Airflow SLAs

2018-05-23 Thread James Meickle
Hi folks,

I've created a branch off of v1-10-test; the diff can be found here:
https://github.com/apache/incubator-airflow/compare/v1-10-test...Eronarn:sla_improvements

As a recap, this work is expected to do the following:

- split the "sla" parameter into three independent SLAs: expected duration,
expected start, and expected finish
- move the SLA miss callback to be a task-level attribute rather than
DAG-level (removing a lot of the "batching" functionality)
- convert the SLA miss email to the default SLA miss callback
- add a "type" to SLA misses, which will be part of the primary key, and
can be checked against in the callback to respond appropriately to the type
of SLA that was missed.
- don't send SLA misses for skipped tasks, or for backfill jobs

Before I polish up the remaining TODO functions and write a migration and
tests, I'd appreciate feedback from the maintainers as to whether this
seems to be on the right track, design-wise. (Note that it's definitely not
going to pass tests right now; I am having significant problems getting
Airflow's test suite running locally so I'm not even attempting at the
moment.)

Thanks,

-James M.

On Wed, May 9, 2018 at 12:43 PM, James Meickle 
wrote:

> Hi all,
>
> Since the response so far has been positive or neutral, I intend to submit
> one or more PRs targeting 2.0 (I think that some parts will be separable
> from a larger SLA refactor). I intend to address at least the following
> JIRA issues:
>
> https://issues.apache.org/jira/browse/AIRFLOW-2236
> https://issues.apache.org/jira/browse/AIRFLOW-1472
> https://issues.apache.org/jira/browse/AIRFLOW-1360
> https://issues.apache.org/jira/browse/AIRFLOW-557
> https://issues.apache.org/jira/browse/AIRFLOW-133
>
> Regards,
>
> -James M.
>
>
>
> On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
>> About de-coupling the SLA management process, I've had conversations in
>> the
>> direction of renaming the scheduler to "supervisor" to reflect the fact
>> that it's not just scheduling processes, it does a lot more tasks than
>> just
>> that, SLA management being one of them.
>>
>> I still think the default should be to require a single supervisor that
>> would do all the "supervision" work though. I'm generally against
>> requiring
>> more types of nodes on the cluster. But perhaps the supervisor could have
>> switches to be started in modes where it would only do a subset of its
>> tasks, so that people can run multiple specialized supervisor nodes if
>> they
>> want to.
>>
>> For the record, I was thinking that renaming the scheduler to supervisor
>> would likely happen as we re-write it to enable multiple concurrent
>> supervisor processes. It turns out that parallelizing the scheduler hasn't
>> been as critical as I thought it would be originally, especially with the
>> current multi-process scheduler. Sounds like the community is getting a
>> lot
>> of mileage out of this current multi-process scheduler.
>>
>> Max
>>
>> On Thu, May 3, 2018 at 7:31 AM, Jiening Wen 
>> wrote:
>>
>> > I would love to see this proposal gets implemented in airflow.
>> > In our case duration based SLA makes much more sense and I ended up
>> adding
>> > a decorator to the execute method in our custom operators.
>> >
>> > Best regards,
>> > Jiening
>> >
>> > -Original Message-
>> > From: James Meickle [mailto:jmeic...@quantopian.com]
>> > Sent: Wednesday 02 May 2018 9:00 PM
>> > To: dev@airflow.incubator.apache.org
>> > Subject: Improving Airflow SLAs [External]
>> >
>> > At Quantopian we use Airflow to produce artifacts based on the previous
>> > day's stock market data. These artifacts are required for us to trade on
>> > today's stock market. Therefore, I've been investing time in improving
>> > Airflow notifications (such as writing PagerDuty and Slack
>> integrations).
>> > My attention has turned to Airflow's SLA system, which has some
>> drawbacks
>> > for our use case:
>> >
>> > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
>> > skipped for this execution date will still trigger emails/callbacks.
>> This
>> > is a huge problem for us because we run almost no tasks on weekends
>> (since
>> > the stock market isn't open).
>> >
>> > 2) Defining SLAs can be awkward because they are relative to the
>> execution
>> > date

Airflow presentation at Velocity New York

2018-05-17 Thread James Meickle
Hi folks,

At Velocity New York in October, I will be presenting about how Quantopian
uses Airflow for financial data:
https://conferences.oreilly.com/velocity/vl-ny/public/schedule/detail/70048

We couldn't have adopted Airflow so quickly without the hard work of
contributors who have made it such an amazing piece of software. So, thank
you! And feel free to reach out if you'll be at the conference.

-James M.


Airflow dev mailing list DMARC settings

2018-05-16 Thread James Meickle
Hi folks,

I got an email from our email administrator that:

"However, it looks like the AirFlow mailing list isn't rewriting email
headers in messages sent to the list, such that all messages sent to the
list from domains that use DMARC are non-compliant.

At some point we're going to have to flip the bit on DMARC rejection for
our domain, even if the maintainers of this list don't fix it to be
compliant, at which point at least some recipients of the list will stop
receiving your emails to the list because their mail servers will reject
them as non-compliant."

The only ASF info on this I could find was here:
https://blogs.apache.org/infra/entry/dmarc_filtering_on_lists_that

I don't know if that blog post is still up to date, but it implies that a
project member would need to file a JIRA issue requesting a change.

-James M.


Re: How to know the DAG is starting to run

2018-05-11 Thread James Meickle
Song:

You can put an operator as the very first node in the DAG, and have
everything else in the DAG depend on it. For example, this is the approach
we use to only execute DAG tasks on stock market trading days.

-James M.

On Fri, May 11, 2018 at 3:57 AM, Song Liu  wrote:

> Hi,
>
> I have something that I want to be done only once, when the DAG is
> constructed, but it seems that the DAG will be instantiated every time each
> operator runs.
>
> So is there a function in the DAG that tells us it is starting to run now?
>
> Thanks,
> Song
>


Re: Improving Airflow SLAs

2018-05-09 Thread James Meickle
Hi all,

Since the response so far has been positive or neutral, I intend to submit
one or more PRs targeting 2.0 (I think that some parts will be separable
from a larger SLA refactor). I intend to address at least the following
JIRA issues:

https://issues.apache.org/jira/browse/AIRFLOW-2236
https://issues.apache.org/jira/browse/AIRFLOW-1472
https://issues.apache.org/jira/browse/AIRFLOW-1360
https://issues.apache.org/jira/browse/AIRFLOW-557
https://issues.apache.org/jira/browse/AIRFLOW-133

Regards,

-James M.



On Thu, May 3, 2018 at 12:13 PM, Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> About de-coupling the SLA management process, I've had conversations in the
> direction of renaming the scheduler to "supervisor" to reflect the fact
> that it's not just scheduling processes, it does a lot more tasks than just
> that, SLA management being one of them.
>
> I still think the default should be to require a single supervisor that
> would do all the "supervision" work though. I'm generally against requiring
> more types of nodes on the cluster. But perhaps the supervisor could have
> switches to be started in modes where it would only do a subset of its
> tasks, so that people can run multiple specialized supervisor nodes if they
> want to.
>
> For the record, I was thinking that renaming the scheduler to supervisor
> would likely happen as we re-write it to enable multiple concurrent
> supervisor processes. It turns out that parallelizing the scheduler hasn't
> been as critical as I thought it would be originally, especially with the
> current multi-process scheduler. Sounds like the community is getting a lot
> of mileage out of this current multi-process scheduler.
>
> Max
>
> On Thu, May 3, 2018 at 7:31 AM, Jiening Wen 
> wrote:
>
> > I would love to see this proposal gets implemented in airflow.
> > In our case duration based SLA makes much more sense and I ended up
> adding
> > a decorator to the execute method in our custom operators.
> >
> > Best regards,
> > Jiening
> >
> > -Original Message-
> > From: James Meickle [mailto:jmeic...@quantopian.com]
> > Sent: Wednesday 02 May 2018 9:00 PM
> > To: dev@airflow.incubator.apache.org
> > Subject: Improving Airflow SLAs [External]
> >
> > At Quantopian we use Airflow to produce artifacts based on the previous
> > day's stock market data. These artifacts are required for us to trade on
> > today's stock market. Therefore, I've been investing time in improving
> > Airflow notifications (such as writing PagerDuty and Slack integrations).
> > My attention has turned to Airflow's SLA system, which has some drawbacks
> > for our use case:
> >
> > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
> > skipped for this execution date will still trigger emails/callbacks. This
> > is a huge problem for us because we run almost no tasks on weekends
> (since
> > the stock market isn't open).
> >
> > 2) Defining SLAs can be awkward because they are relative to the
> execution
> > date instead of the task start time. There's no way to alert if a task
> runs
> > for "more than an hour", for any non-trivial DAG. Instead you can only
> > express "more than an hour from execution date".  The financial data we
> use
> > varies in when it arrives, and how long it takes to process (data volume
> > changes frequently); we also have tight timelines that make retries
> > difficult, so we want to alert an operator while leaving the task
> running,
> > rather than failing and then alerting.
> >
> > 3) SLA miss emails don't have a subject line containing the instance URL
> > (important for us because we run the same DAGs in both
> staging/production)
> > or the execution date they apply to. When opened, they can get hard to
> read
> > for even a moderately sized DAG because they include a flat list of task
> > instances that are unsorted (neither alpha nor topo). They are also
> lacking
> > any links back to the Airflow instance.
> >
> > 4) SLA emails are not callbacks, and can't be turned off (other than
> either
> > removing the SLA or removing the email attribute on the task instance).
> The
> > way that SLA miss callbacks are defined is not intuitive, as in contrast
> to
> > all other callbacks, they are DAG-level rather than task-level. Also, the
> > call signature is poorly defined: for instance, two of the arguments are
> > just strings produced from the other two arguments.
> >
> > I have some thoughts about ways to fix these issu

Re: Improving Airflow SLAs

2018-05-03 Thread James Meickle
That's a very interesting thought, and I have definitely been bitten
multiple times by bugs related to how closely tied SLAs are to the
scheduler:
https://issues.apache.org/jira/browse/AIRFLOW-2178?jql=project%20%3D%20AIRFLOW%20AND%20text%20~%20smtp

However, I'm not convinced that adding a new process for a monitoring
service would actually be much better architecturally than just improving
the scheduler codebase. To have high confidence you'd likely want an
external, non-Airflow "is this running" check anyways (for example, we
alert if there are no "heartbeating scheduler" log lines).



On Thu, May 3, 2018 at 2:26 AM, Ananth Durai  wrote:

> Since we are talking about the SLA implementation, The current SLA miss
> implementation is part of the scheduler code. So in the cases like
> scheduler max out the process / not running for some reason, we will miss
> all the SLA alert. It is worth to decouple SLA alert from the scheduler
> path and run as a separate process.
>
>
> Regards,
> Ananth.P,
>
>
>
>
>
>
> On 2 May 2018 at 20:31, David Capwell  wrote:
>
> > We use SLA as well and works great for some DAGs and painful for others
> >
> > We rely on sensors to validate the data is ready before we run and each
> dag
> > waits on sensors for different times (one dag waits for 8 hours since it
> > expects date at the start of day but tends to get it 8 hours later).  We
> > also have some nested dags that have about 10 tasks deep.
> >
> > In these two cases SLA warnings come very late since the semantics we see
> > is DAG completion time; what we really want is what you were talking
> about,
> > expected execution times
> >
> > Also SLA trigger on backfills and manual reruns of tasks
> >
> > I see this as a critical feature for production monitoring so would love
> to
> > see this get improved
> >
> > On Wed, May 2, 2018, 12:00 PM James Meickle 
> > wrote:
> >
> > > At Quantopian we use Airflow to produce artifacts based on the previous
> > > day's stock market data. These artifacts are required for us to trade
> on
> > > today's stock market. Therefore, I've been investing time in improving
> > > Airflow notifications (such as writing PagerDuty and Slack
> integrations).
> > > My attention has turned to Airflow's SLA system, which has some
> drawbacks
> > > for our use case:
> > >
> > > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
> > > skipped for this execution date will still trigger emails/callbacks.
> This
> > > is a huge problem for us because we run almost no tasks on weekends
> > (since
> > > the stock market isn't open).
> > >
> > > 2) Defining SLAs can be awkward because they are relative to the
> > execution
> > > date instead of the task start time. There's no way to alert if a task
> > runs
> > > for "more than an hour", for any non-trivial DAG. Instead you can only
> > > express "more than an hour from execution date".  The financial data we
> > use
> > > varies in when it arrives, and how long it takes to process (data
> volume
> > > changes frequently); we also have tight timelines that make retries
> > > difficult, so we want to alert an operator while leaving the task
> > running,
> > > rather than failing and then alerting.
> > >
> > > 3) SLA miss emails don't have a subject line containing the instance
> URL
> > > (important for us because we run the same DAGs in both
> > staging/production)
> > > or the execution date they apply to. When opened, they can get hard to
> > read
> > > for even a moderately sized DAG because they include a flat list of
> task
> > > instances that are unsorted (neither alpha nor topo). They are also
> > lacking
> > > any links back to the Airflow instance.
> > >
> > > 4) SLA emails are not callbacks, and can't be turned off (other than
> > either
> > > removing the SLA or removing the email attribute on the task instance).
> > The
> > > way that SLA miss callbacks are defined is not intuitive, as in
> contrast
> > to
> > > all other callbacks, they are DAG-level rather than task-level. Also,
> the
> > > call signature is poorly defined: for instance, two of the arguments
> are
> > > just strings produced from the other two arguments.
> > >
> > > I have some thoughts about ways to fix these issues:
> > >
> > > 1) I just consider this one a bug.

Improving Airflow SLAs

2018-05-02 Thread James Meickle
At Quantopian we use Airflow to produce artifacts based on the previous
day's stock market data. These artifacts are required for us to trade on
today's stock market. Therefore, I've been investing time in improving
Airflow notifications (such as writing PagerDuty and Slack integrations).
My attention has turned to Airflow's SLA system, which has some drawbacks
for our use case:

1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
skipped for this execution date will still trigger emails/callbacks. This
is a huge problem for us because we run almost no tasks on weekends (since
the stock market isn't open).

2) Defining SLAs can be awkward because they are relative to the execution
date instead of the task start time. There's no way to alert if a task runs
for "more than an hour", for any non-trivial DAG. Instead you can only
express "more than an hour from execution date".  The financial data we use
varies in when it arrives, and how long it takes to process (data volume
changes frequently); we also have tight timelines that make retries
difficult, so we want to alert an operator while leaving the task running,
rather than failing and then alerting.

3) SLA miss emails don't have a subject line containing the instance URL
(important for us because we run the same DAGs in both staging/production)
or the execution date they apply to. When opened, they can get hard to read
for even a moderately sized DAG because they include a flat list of task
instances that are unsorted (neither alpha nor topo). They are also lacking
any links back to the Airflow instance.

4) SLA emails are not callbacks, and can't be turned off (other than either
removing the SLA or removing the email attribute on the task instance). The
way that SLA miss callbacks are defined is not intuitive, as in contrast to
all other callbacks, they are DAG-level rather than task-level. Also, the
call signature is poorly defined: for instance, two of the arguments are
just strings produced from the other two arguments.

I have some thoughts about ways to fix these issues:

1) I just consider this one a bug. If a task instance is skipped, that was
intentional, and it should not trigger any alerts.

2) I think that the `sla=` parameter should be split into something like
this:

`expected_start`: Timedelta after execution date, representing when this
task must have started by.
`expected_finish`: Timedelta after execution date, representing when this
task must have finished by.
`expected_duration`: Timedelta after task start, representing how long it
is expected to run including all retries.

This would give better operator control over SLAs, particularly for tasks
deeper in larger DAGs where exact ordering may be hard to predict.
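
To make that concrete, a hypothetical task definition under this proposal might look like the sketch below; none of these parameters exist in Airflow today:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG("market_data", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

def load_market_data():
    pass  # placeholder for the real load

# NOTE: expected_start / expected_finish / expected_duration are hypothetical;
# they illustrate the proposed split of the single sla= parameter.
load = PythonOperator(
    task_id="load_market_data",
    python_callable=load_market_data,
    expected_start=timedelta(hours=2),     # must have started by execution date + 2h
    expected_finish=timedelta(hours=4),    # must have finished by execution date + 4h
    expected_duration=timedelta(hours=1),  # must not run longer than 1h incl. retries
    dag=dag,
)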

3) The emails should be improved to be more operator-friendly, and take
into account that someone may get a callback for a DAG they don't know very
well, or be paged by this notification.

4.1) All Airflow callbacks should support a list, rather than requiring a
single function. (I've written a wrapper that does this, but it would be
better for Airflow to just handle this itself.)

4.2) SLA miss callbacks should be task callbacks that receive context, like
all the other callbacks. Having a DAG figure out which tasks have missed
SLAs collectively is fine, but getting SLA failures in a batched callback
doesn't really make much sense. Per-task callbacks can be fired
individually within a batch of failures detected at the same time.

4.3) SLA emails should be the default SLA miss callback function, rather
than being hardcoded.

Also, overall, the SLA miss logic is very complicated. It's stuffed into
one overloaded function that is responsible for checking for SLA misses,
creating database objects for them, filtering tasks, selecting emails,
rendering, and sending. Refactoring it would be a good maintainability win.

I am already implementing some of the above in a private branch, but I'd be
curious to hear community feedback as to which of these suggestions might
be desirable upstream. I could have this ready for Airflow 2.0 if there is
interest beyond my own use case.


Re: 1.10.0beta1 now available for download

2018-05-01 Thread James Meickle
Thanks for the pointer! I went through and set this up today, using Google
OAuth as the RBAC provider. Overall I'm quite enthusiastic about this move,
but I thought that it might be helpful to collect feedback as someone who
hasn't been following the overall process and is therefore coming at it
with fresh eyes.

- The Flask-AppBuilder security documentation is poor quality (e.g.,
there are some broken sentences); if Airflow is to send people there, it
might be worth PRing some of the docs to at least look more professional.

- There's not much documentation out there on how to properly set up an
OAuth app in Google (in my case, using the G+ API). From an adoption POV,
it would be good to screenshot the (current) steps in the process, and
point out which values should be used in which fields on Google. For
example, I had to grep the code base to find the callback URL.

- The initial login UI seems over-complex: you have to click the provider
icon, and then click either login or register. The standard for this
workflow is that you login by clicking the desired provider's icon, and
doing so will register you automatically if you aren't already. In my case
I only have one provider, so this menu was even more confusing.

- It was not clear to me that the "Public" role has absolutely no
permissions. When I set this as the default role and registered, I could no
longer access the site until I cleared cookies. I thought it was an OAuth
error at first, but it turns out the Public role has fewer effective
permissions than an anonymous user; this resulted in a redirect loop
because I could not even view the homepage. I had to correct this in the
database to be able to log in.

- The roles list (at roles/list/ ) is intimidatingly large and hard to
parse. For instance, I couldn't tell at a glance what "user" allows
relative to "viewer". It would be good to have a narrative description of
what each of these roles is intended for, and to present the list of
permissions in a more clustered or diffable way. Permissions lists tend to
only grow, after all.

- A "Viewer" currently lacks enough access to see their own profile.

- "User Statistics" (userstatschartview/chart/) uses the internal name,
rather than firstname/lastname - which in my case is a `google_idnumber`
name. Should probably show both names.

Unrelatedly to RBAC (I think), on this branch on my sandbox instance, tasks
appear to be failing with the only logs present in the UI as:

[{'end_of_log': True}, {'end_of_log': True}, {'end_of_log': True},
{'end_of_log': True}, {'end_of_log': True}, {'end_of_log': True}]


Finally, in case anyone else wanted to test run a similar setup, here is
the webserver_config.py that I ended up using (note that it has Jinja
templating via Ansible):

import os
from airflow import configuration as conf
from flask_appbuilder.security.manager import AUTH_OAUTH
basedir = os.path.abspath(os.path.dirname(__file__))

# The SQLAlchemy connection string.
SQLALCHEMY_DATABASE_URI = conf.get('core', 'SQL_ALCHEMY_CONN')

# Flask-WTF flag for CSRF
CSRF_ENABLED = True

# The name to display, e.g. "Airflow Staging Sandbox"
APP_NAME = "Airflow {{ env }} {{ app_config | capitalize }}"

# Use OAuth
AUTH_TYPE = AUTH_OAUTH

# Will allow user self registration
AUTH_USER_REGISTRATION = True

# The default user self registration role
AUTH_USER_REGISTRATION_ROLE = "{{ airflow_rbac_registration_role | default('Viewer') }}"

# Google OAuth:
OAUTH_PROVIDERS = [{
    # The name of the provider
    'name': 'google',
    # The icon to use
    'icon': 'fa-google',
    # The name of the key that the provider sends
    'token_key': 'access_token',
    # Just in case, whitelist to only @quantopian.com emails
    'whitelist': ['@quantopian.com'],
    # Define the remote app:
    'remote_app': {
        'base_url': 'https://www.googleapis.com/oauth2/v2/',
        'access_token_url': 'https://accounts.google.com/o/oauth2/token',
        'authorize_url': 'https://accounts.google.com/o/oauth2/auth',
        'request_token_url': None,
        'request_token_params': {
            # Uses the Google+ API, requesting the 'email' and 'profile' scope
            'scope': 'email profile'
        },
        'consumer_key': '{{ vault_airflow_google_oauth_key }}',
        'consumer_secret': '{{ vault_airflow_google_oauth_secret }}'
    }
}]



On Mon, Apr 30, 2018 at 12:54 PM, Jørn A Hansen 
wrote:

> On Mon, 30 Apr 2018 at 15.56, James Meickle 
> wrote:
>
> > Installed this off of the branch, and I do get the Kubernetes executor
> > (incl. demo DAG) and some bug fixes - but I don't see any RBAC feature
> > anywhere I'd think to

Re: How to consolidate log files?

2018-05-01 Thread James Meickle
I suspect that what you actually want here is to run an external log
ingestion service (e.g. ELK stack), and watch the log directories on each
worker. They are very hierarchically laid out so it would be easy to grab
what you're looking for and tag them appropriately, then look at them in a
UI that is better suited to displaying logs from many concurrent sources.
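
For example, a small script (or a log shipper configured around the same idea) can recover the DAG id, task id, and execution date straight from the directory layout; a rough sketch, assuming the default <base_log_folder>/<dag_id>/<task_id>/<execution_date>/<try_number>.log layout:

import glob
import os

BASE_LOG_FOLDER = "/usr/local/airflow/logs"  # assumption: adjust to your base_log_folder

# Default 1.x layout: <base>/<dag_id>/<task_id>/<execution_date>/<try_number>.log
for path in glob.glob(os.path.join(BASE_LOG_FOLDER, "*", "*", "*", "*.log")):
    rel = os.path.relpath(path, BASE_LOG_FOLDER)
    dag_id, task_id, execution_date, filename = rel.split(os.sep)
    try_number = os.path.splitext(filename)[0]
    # These fields can be attached as tags/fields when shipping to ELK or similar
    print(dag_id, task_id, execution_date, try_number, path)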

On Mon, Apr 30, 2018 at 6:18 PM, Ruiqin Yang  wrote:

> AFAIK, airflow doesn't provide log in this way. Multiple tasks would run in
> different processes and potentially in parallel, thus writing to the same
> file at run time would produce a log file with mixed log lines from different
> tasks. Also I believe airflow now does not separate stdout and stderr; they
> all go to the same place. Not sure if there's a good point in code to
> consolidate the logs from different tasks. Maybe you can have a separate
> script/service to do the log consolidate job since the log structure and
> format are known.
>
> Cheers,
> Kevin Y
>
> On Mon, Apr 30, 2018 at 12:16 PM, mad2...@columbia.edu <
> mad2...@columbia.edu
> > wrote:
>
> > Hi Mailing List,
> >
> > Is there a way to consolidate airflow task logs by stdout and stderr?
> > Currently, the structure is something like /logs/taskname/task_run_date/
> log1.txt
> > which is the log for a particular taskname at a particular run date.
> What
> > I would like is two large log files for all tasks, something like
> > /logs/errors.txt and /logs/outputs.txt  Which would contain all the
> stderr
> > and stdout messages for all runs of all tasks regardless of run date. I
> > essentially want two very large log files.  For example, if I have task A
> > and task B, instead of having two directories and then subdirectories
> for A
> > and B, I would just like two files one with errors from A and B and one
> for
> > outputs from A and B. Does airflow provide this information?
> >
> > Thanks!
> >
>


Re: 1.10.0beta1 now available for download

2018-04-30 Thread James Meickle
Installed this off of the branch, and I do get the Kubernetes executor
(incl. demo DAG) and some bug fixes - but I don't see any RBAC feature
anywhere I'd think to look. Do I need to set up some config to get that to
show up?

On Mon, Apr 23, 2018 at 2:06 PM, Bolke de Bruin  wrote:

> Hi All,
>
> I am really happy that Fokko and I have created the v1-10-test branch and
> subsequently build the first beta of Apache Airflow 1.10!
>
> It is available for testing here:
>
> https://dist.apache.org/repos/dist/dev/incubator/airflow/1.10.0beta1/
>
> Highlights include:
>
> * New RBAC web interface in beta
> * Timezone support
> * First class kubernetes operator
> * Experimental kubernetes executor
> * Documentation improvements
> * Performance optimizations for large DAGs
> * many GCP and S3 integration improvements
> * many new operators
> * many many many bug fixes
>
> We are aiming for a fully compliant Apache release so we should be able to
> kick off the graduation process after this release. I hope you help us out
> getting there!
>
> Kind regards,
>
> Bolke & Fokko


Re: About the project support in Airflow

2018-04-25 Thread James Meickle
Another reason you would want separated infrastructure is that there are a
lot of ways to exhaust Airflow resources or otherwise cause contention -
like having too many sensors or sub-DAGs using up all available tasks.

Doesn't seem like a great idea to push for having different teams with
co-tenancy until there is also per-team control over resource use...

On Tue, Apr 24, 2018 at 8:27 PM, 刘松(Cycle++开发组) 
wrote:

> It seems that all the current approaches point to multiple instances of
> Airflow, but the project concept is very natural, since one user might have
> to handle different types of tasks.
>
> Another thing about multi-user support: one option is to deploy multiple
> instances, but it seems that Airflow provides multi-user functionality
> built in.
>
> So I am not convinced that multiple instances are the right approach for the
> multi-project use case.
>
> Thanks,
> Song
>
>
>
>
> On Wed, Apr 25, 2018 at 4:25 AM +0800, "Ace Haidrey"  > wrote:
>
>
> Looks neat Taylor!
>
> And regarding the original question, going off of what Maxime and Bolke
> said, at Pandora, it made more sense for us to have an instance per team
> since each team has its own system user for prod and the instance can run
> all processes as that user. Alternatively you could have a super user that
> can sudo as those other system users, and have many teams on a single
> instance but that is a security concern (what if one team sudo's as the
> other team and accidentally overwrites data - there is nothing stopping
> them from doing it). It depends what your org set up is, but let me know if
> there are any questions I can help with.
>
> Ace
>
>
> > On Apr 24, 2018, at 1:16 PM, Taylor Edmiston  wrote:
> >
> > We use a similar approach like Bolke mentioned with running multiple
> > Airflow instances.
> >
> > I haven't read the Pandora article yet, but we have an Astronomer Open
> > Edition (fully open source) that bundles similar tools like Prometheus,
> > Grafana, Celery, etc with Airflow and a Docker Compose file if you're
> > looking to get a setup like that up and running quickly.
> >
> > https://github.com/astronomerio/astronomer/blob/master/examples/airflow-
> enterprise/docker-compose.yml
> > https://github.com/astronomerio/astronomer
> >
> > *Taylor Edmiston*
> > Blog  | Stack Overflow CV
> >  | LinkedIn
> >  | AngelList
> >
> >
> >
> > On Tue, Apr 24, 2018 at 3:30 PM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> >> Related blog post about multi-tenant Airflow deployment out of Pandora:
> >> https://engineering.pandora.com/apache-airflow-at-pandora-1d7a844d68ee
> >>
> >> On Tue, Apr 24, 2018 at 10:20 AM, Bolke de Bruin
> >> wrote:
> >>
> >>> My suggestion would be to deploy airflow per project. You could even
> use
> >>> airflow to manage your ci/cd pipeline.
> >>>
> >>> B.
> >>>
> >>> Sent from my iPhone
> >>>
>  On 24 Apr 2018, at 18:33, Maxime Beauchemin <
> >> maximebeauche...@gmail.com>
> >>> wrote:
> 
>  People have been talking about namespacing DAGs in the past. I'd
> >>> recommend
>  using tags (many to many) instead of categories/projects (one to
> many).
> 
>  It should be fairly easy to add this feature. One question is whether
> >>> tags
>  are defined as code or in the UI/db only.
> 
>  Max
> 
> > On Tue, Apr 24, 2018 at 1:48 AM, Song Liu
> >> wrote:
> >
> > Hi,
> >
> > Basically the DAGs are created for a project purpose, so if I have
> >> many
> > different projects, will the Airflow support the Project concept and
> > organize them separately ?
> >
> > Is this a known requirement or any plan for this already ?
> >
> > Thanks,
> > Song
> >
> >>>
> >>
>
>
>


Re: RBAC Update

2018-04-02 Thread James Meickle
To my mind, I would expect the MVP of per-DAG RBAC to include three
settings: viewing DAG state, executing or modifying DAGs, and viewing tasks
within the DAG (logs/code/details). For instance we would love to expose a
view of the production dataload state to our engineers, without exposing
production logs or allowing them to run/modify anything.

Creating these mappings would be somewhat painful to manage, but that seems
fine for an MVP. I would expect some future version to add grouping/tagging
to DAGs and to users, to say e.g. "users in group X can see DAGs in group Y
and modify DAGs in group Z."

On Fri, Mar 30, 2018 at 1:26 PM, Brian Greene <
br...@heisenbergwoodworking.com> wrote:

> I’d think we’d have privilege ‘can_view’ etc, and then a join table (priv)
> <-> (dagid) <-> (user/group).  Then it’s a simple query to get the perms
> for a given dag (as you list In option 2 below).
>
> It also makes a “secure by default” easy - a lack of entries in that table
> for a dag can mean only “admin” access or some such.
>
> Then any dag can have any combo of permissions for any combo of users.
> Adding the groups option raises complexity around nesting, so maybe skip it
> for r1?
>
> My $.02
>
> Brian
>
> Sent from a device with less than stellar autocorrect
>
> > On Mar 29, 2018, at 10:27 AM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
> >
> > Hijacking the thread further here, any thoughts on how to breakdown per
> DAG
> > access?
> >
> > Tao & I are talking about introducing per-DAG permissions and one big
> > question is whether we'll need to support different operation-types at a
> > per-DAG level, which changes the way we need to model the perms.
> >
> > First [simpler] option is to introduce one perm per DAG. If you have
> access
> > to 5 DAGs, and you have `can_clear` and `can_run`, you'll have homogenous
> > rights on the DAGs you have access to.
> >
> > Second option is to have a breakdown per DAG. Meaning for each DAG we
> > create a set of perms ({dag_id}_can_view, {dag_id}_can_modify, ...). So
> one
> > user could have modify on some DAGs, view on others, and other DAGs would
> > be invisible. This could be broken down further ({dag_id}_can_clear, ...)
> > but it gets hard to manage.
> >
> > Thoughts?
> >
> > Max
> >
> >> On Wed, Mar 28, 2018 at 10:02 PM, Tao Feng  wrote:
> >>
> >> Great work Joy. This is awesome! I am interested in helping out the per
> dag
> >> level access.  Just created a ticket to check(AIRFLOW-2267). Let me
> know if
> >> you have any suggestions. I will share my proposal once I am ready.
> >>
> >>> On Fri, Mar 23, 2018 at 6:45 PM, Joy Gao  wrote:
> >>>
> >>> Hey guys!
> >>>
> >>> The RBAC UI 
> has
> >>> been merged to master. I'm looking forward to early adopters' feedback
> >> and
> >>> bug reports. I also hope to have more folks helping out with the RBAC
> UI,
> >>> especially with introducing DAG-Level access control, which is a
> feature
> >>> that a lot of people have been asking. If you are interested in helping
> >> out
> >>> with this effort, let's talk more!
> >>>
> >>> This commit will be in the 1.10.0 release, and we are going to maintain
> >>> both UIs simultaneously for a short period of time. Once RBAC UI is
> >> stable
> >>> and battle-tested, we will deprecate the old UI and eventually remove
> it
> >>> from the repo (around Airflow 2.0.0 or 2.1.0 release). This is to
> prevent
> >>> two UIs from forking into separate paths, as that would become very
> >>> difficult to maintain.
> >>>
> >>> Going forward while both UIs are up, if you are making a change to any
> >>> files in airflow/www/ (old UI), where applicable, please also make the
> >>> change to the airflow/www_rbac/ (new UI). If you rather not make
> changes
> >> in
> >>> both UIs, it is recommended that you only make the changes to the RBAC
> >> UI,
> >>> since that is the one we are maintaining in the long term.
> >>>
> >>> I'm excited that the RBAC UI will be able to bring additional security
> to
> >>> Airflow, and with FAB framework in place we can look into leveraging it
> >> for
> >>> a unified set of APIs used by both UI and CLI.
> >>>
> >>> Joy
> >>>
> >>>
> >>>
>  On Thu, Feb 8, 2018 at 11:31 AM, Joy Gao  wrote:
> 
>  Hi folks,
> 
>  I have a PR 
> >> out
>  for the new UI. I've included instructions on how to test it out in
> the
> >>> PR
>  description. Looking forward to your feedbacks.
> 
>  Cheers,
>  Joy
> 
> > On Fri, Dec 1, 2017 at 6:18 PM, Joy Gao  wrote:
> >
> > Thanks for the background info. Would be really awesome for you to
> >> have
> > PyPi access :D I'll make the change to have Airflow Webserver's FAB
> > dependency pointing to my fork for the mean time.
> >
> > For folks who are interested in RBAC, I will be giving a talk/demo at
> >>> the Airflow
> > Meet-Up
> > 

Re: What are the advantages of plugins, not sure I see any?

2018-04-02 Thread James Meickle
I have internally been working on a stock market date/time operator plugin
that I am hoping to open source soon. Figuring out the best way to combine
Airflow plugin packaging and Python packaging has been rather unpleasant,
to be honest. I'd love to see this revisited in a future version of Airflow
so that plugins are just a layer on top of standard Python packages.
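
For reference, the plugin layer itself is thin: a plugin module just declares an AirflowPlugin subclass pointing at ordinary Python objects, which is where the packaging awkwardness comes in. A minimal sketch (all names here are made up):

# plugins/trading_calendar_plugin.py
from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin

class TradingDayOperator(BaseOperator):
    """Toy operator standing in for one that would live in a normal Python package."""
    def execute(self, context):
        self.log.info("Running for %s", context["execution_date"])

class TradingCalendarPlugin(AirflowPlugin):
    name = "trading_calendar"
    operators = [TradingDayOperator]

# DAG files can then import it through the plugin namespace, e.g.
#   from airflow.operators.trading_calendar import TradingDayOperator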

On Fri, Mar 30, 2018 at 4:53 PM, Taylor Edmiston 
wrote:

> We might be an edge case running Airflow as a platform at Astronomer, but
> we make hooks and operators that are reused across many Airflow instances
> by customers.  (Starting to open source at
> https://github.com/airflow-plugins.)  We've also run a Mesos executor as a
> plugin for similar reasons so that as we fix bugs or add features we can
> reuse it across Airflow installs.
>
> To add one more point — we've tossed around the idea of building more
> tooling around plugins, so you could do something like: $ airflow plugin
> install -U github-plugin then import GitHubHook and...
>
> A package install via PyPI could work as well but we haven't seen anyone
> else distributing Airflow plugins as packages yet.
>
> T
>
> *Taylor Edmiston*
> TEdmiston.com  | Blog
> 
> Stack Overflow CV  | LinkedIn
>  | AngelList
> 
>
>
> On Fri, Mar 30, 2018 at 4:37 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
> > Yes, it makes most sense to just have a `common/hooks` and
> > `common/operators/` in the repo where your DAGs live and import them at
> > will.
> >
> > Max
> >
> > On Fri, Mar 30, 2018 at 1:30 PM, Kyle Hamlin 
> wrote:
> >
> > > Thanks for the responses! I think my conclusion was similar, they seem
> > good
> > > for redistribution, but if you're only working with operators and hooks
> > and
> > > aren't sharing that code then it might not make too much sense to use
> > them.
> > >
> > > On Fri, Mar 30, 2018 at 4:23 PM Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > > > The original intent was to use plugins as a way to share sets of
> > objects
> > > > and applications build on top of Airflow.
> > > >
> > > > For instance it'd be possible to ship the things listed bellow as
> > Airflow
> > > > plugin:
> > > >
> > > > * "validate-and-schedule my query" UI
> > > > * a set of ML-related hooks and operators that match a specific
> > workflow
> > > > * a Hive metastore-browser UI
> > > > * drag and drop UI
> > > > * ...
> > > >
> > > > As far as I know it's not a super popular feature. Maybe the scope of
> > > > Airflow is already large enough without having all that stuff sitting
> > on
> > > > top of it. :)
> > > >
> > > > As George pointed out it could also allow to accelerate the release
> > > cadence
> > > > of sets of things that are currently inside Airflow. Things like
> Google
> > > > Cloud-related operators and hooks could ship as a plugin on their own
> > > > release schedule.
> > > >
> > > > Max
> > > >
> > > > On Thu, Mar 29, 2018 at 11:07 PM, Alex Tronchin-James 949-412-7220 <
> > > > alex.n.ja...@gmail.com> wrote:
> > > >
> > > > > At Netflix we've put our plugin inside the DAGs folder and pointed
> > the
> > > > > config to it there so we can both import directly in DAGs AND
> update
> > > the
> > > > > plugin as we go. This makes it easy to test changes to operators
> > needed
> > > > for
> > > > > ongoing DAG development in the same PR.
> > > > >
> > > > > The two plugin features I've used which don't translate to the
> direct
> > > > > import approach are custom macros (we provide some internal
> > libraries)
> > > > and
> > > > > UI menu links, which we use for linking local docs describing our
> > > > > deployment and custom operators, server/worker monitoring with
> atlas,
> > > and
> > > > > genie job monitoring.
> > > > >
> > > > > On Thu, Mar 29, 2018 at 4:56 PM George Leslie-Waksman
> > > > >  wrote:
> > > > >
> > > > > > It's presumably useful if you want to package your plugins for
> > other
> > > > > people
> > > > > > to use but it seems like everyone just adds those directly to the
> > > > Airflow
> > > > > > codebase these days.
> > > > > >
> > > > > > On Thu, Mar 29, 2018 at 4:27 PM Kyle Hamlin  >
> > > > wrote:
> > > > > >
> > > > > > > Yeah so far I have only written hooks and operators so maybe
> the
> > > > > benefit
> > > > > > > only  kicks in for other airflow abstractions.
> > > > > > >
> > > > > > > > On Mar 29, 2018, at 7:15 PM, George Leslie-Waksman <
> > > > > > > geo...@cloverhealth.com.INVALID> wrote:
> > > > > > > >
> > > > > > > > We also import our operators and sensors directly.
> > > > > > > >
> > > > > > > > However, executors and some other pieces are a little bit
> > harder
> > > to
> > > > > > deal
> > > > > > > > with as non-plugins
> > > > > > > >
> > > > > > > >> On Thu, Mar 29, 2018 at 3:56 PM Kyle Hamlin <
> > > h

Re: RBAC Update

2018-03-26 Thread James Meickle
This is super exciting for us, as we want one of our non-technical teams to
be able to re-run failed DAGs. Will be giving this a try as soon as I'm back
from SREcon!

On Fri, Mar 23, 2018 at 9:45 PM, Joy Gao  wrote:

> Hey guys!
>
> The RBAC UI  has
> been merged to master. I'm looking forward to early adopters' feedback and
> bug reports. I also hope to have more folks helping out with the RBAC UI,
> especially with introducing DAG-Level access control, which is a feature
> that a lot of people have been asking. If you are interested in helping out
> with this effort, let's talk more!
>
> This commit will be in the 1.10.0 release, and we are going to maintain
> both UIs simultaneously for a short period of time. Once RBAC UI is stable
> and battle-tested, we will deprecate the old UI and eventually remove it
> from the repo (around Airflow 2.0.0 or 2.1.0 release). This is to prevent
> two UIs from forking into separate paths, as that would become very
> difficult to maintain.
>
> Going forward while both UIs are up, if you are making a change to any
> files in airflow/www/ (old UI), where applicable, please also make the
> change to the airflow/www_rbac/ (new UI). If you'd rather not make changes in
> both UIs, it is recommended that you only make the changes to the RBAC UI,
> since that is the one we are maintaining in the long term.
>
> I'm excited that the RBAC UI will be able to bring additional security to
> Airflow, and with FAB framework in place we can look into leveraging it for
> a unified set of APIs used by both UI and CLI.
>
> Joy
>
>
>
> On Thu, Feb 8, 2018 at 11:31 AM, Joy Gao  wrote:
>
> > Hi folks,
> >
> > I have a PR  out
> > for the new UI. I've included instructions on how to test it out in the
> PR
> > description. Looking forward to your feedbacks.
> >
> > Cheers,
> > Joy
> >
> > On Fri, Dec 1, 2017 at 6:18 PM, Joy Gao  wrote:
> >
> >> Thanks for the background info. Would be really awesome for you to have
> >> PyPi access :D I'll make the change to have Airflow Webserver's FAB
> >> dependency pointing to my fork for the mean time.
> >>
> >> For folks who are interested in RBAC, I will be giving a talk/demo at
> the Airflow
> >> Meet-Up
> >>  Incubating-Meetup/events/244525050/>
> >> next Monday. Happy to chat afterwards about it as well :)
> >>
> >> On Thu, Nov 30, 2017 at 8:36 AM, Maxime Beauchemin <
> >> maximebeauche...@gmail.com> wrote:
> >>
> >>> A bit of related history here:
> >>> https://github.com/dpgaspar/Flask-AppBuilder/issues/399
> >>>
> >>> On Thu, Nov 30, 2017 at 8:33 AM, Maxime Beauchemin <
> >>> maximebeauche...@gmail.com> wrote:
> >>>
> >>> > Given I have merge rights on FAB I could probably do another round of
> >>> > review and get your PRs through. I would really like to get the main
> >>> > maintainer's input on things that touch the core (composite-key
> >>> support) as
> >>> > he might have concerns/intuitions that we can't know about.
> >>> >
> >>> > I do not have Pypi access though so I cannot push new releases out. I
> >>> > could ask for that.
> >>> >
> >>> > I've threatened to fork the project before, that's always an option.
> >>> I've
> >>> > noticed his involvement is sporadic and comes in bursts.
> >>> >
> >>> > In the meantime, you can have the dependency in Airflow Webserver
> >>> pointing
> >>> > straight to your fork.
> >>> >
> >>> > Max
> >>> >
> >>> > On Wed, Nov 29, 2017 at 7:02 PM, Joy Gao  wrote:
> >>> >
> >>> >> I just created a new webserver instance if you haven't gotten a
> >>> chance to
> >>> >> fiddle around with the new web UI and the RBAC configurations
> (thanks
> >>> >> Maxime for getting started with this earlier!):
> >>> >>
> >>> >> http://104.209.38.171:8080/
> >>> >>
> >>> >> Admin Account
> >>> >> username: admin
> >>> >> password: admin
> >>> >>
> >>> >> Read-Only Account
> >>> >> username: viewer
> >>> >> password: password
> >>> >>
> >>> >>
> >>> >> On Wed, Nov 29, 2017 at 2:58 PM, Joy Gao  wrote:
> >>> >>
> >>> >> > Hi folks,
> >>> >> >
> >>> >> > Thanks for all the feedback regarding to the new Airflow Webserver
> >>> UI
> >>> >> > ! I've been actively
> >>> >> > addressing all the bugs that were raised on Github. So I want to
> >>> take
> >>> >> this
> >>> >> > opportunity to discuss two issues coming up:
> >>> >> >
> >>> >> > The first issue is unaddressed PRs in FAB. If these PRs continue
> to
> >>> stay
> >>> >> > unaddressed, RBAC is blocked from making further progress. If this
> >>> >> continue
> >>> >> > to be an issue, I'm inclined to fork FAB, even though it's not
> >>> >> idealistic.
> >>> >> >
> >>> >> >
> >>> >> >- PR/631  Flask-AppBuilder/pull/631>
> >>> >> Binary
> >>> >> >column support (merged, unreleased)
> >>> >> >

PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-03-23 Thread James Meickle
While Googling something Airflow-related a few weeks ago, I noticed that
someone's Airflow dashboard had been indexed by Google and was accessible
to the outside world without authentication. A little more Googling
revealed a handful of other indexed instances in various states of
security. I did my best to contact the operators, and waited for responses
before posting this.

Airflow is not a secure project by default (
https://issues.apache.org/jira/browse/AIRFLOW-2047), and you can do all
sorts of mean things to an instance that hasn't been intentionally locked
down. (And even then, you shouldn't rely exclusively on your app's
authentication for providing security.)

Having "internal" dashboards/data sources/executors exposed to the web is
dangerous, since old versions can stick around for a very long time, help
compromise unrelated deployments, and generally just create very bad press
for the overall project if there's ever a mass compromise (see: Redis and
MongoDB).

Shipping secure defaults is hard, but perhaps we could add best practices
like instructions for deploying a robots.txt with Airflow? Or an impact
statement about what someone could do if they access your Airflow instance?
I think that many people deploying Airflow for the first time might not
realize that it can get indexed, or how much damage someone can cause via
accessing it.
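
For instances that must stay reachable from the internet, serving a minimal
robots.txt at the webserver root would at least keep well-behaved crawlers
from indexing them. This is only a crawler hint, not an access control, so it
belongs alongside authentication and network-level restrictions, not instead
of them:

User-agent: *
Disallow: /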


Re: Submitting 1000+ tasks to airflow programatically

2018-03-22 Thread James Meickle
I'm very excited about the possibility of implementing a DAGFetcher (per
prior thread about this) that is aware of dynamic data sources, and can
handle abstracting/caching/deploying them itself, rather than having each
Airflow process run the query for each DAG refresh.
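
In the meantime, the Variable-based cache Taylor sketches below can be turned
into a runnable DAG file so the parse loop only hits the upstream API when the
cache has expired. A minimal version, with fetch_configs() standing in for
whatever call actually produces the client configs:

import json
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator


def fetch_configs():
    # Stand-in for the real API call; returns a list of client configs.
    return [{"_id": "client_a"}, {"_id": "client_b"}]


cache = Variable.get("sources", default_var=None, deserialize_json=True)
now = datetime.utcnow().isoformat()
if not cache or cache.get("expire", "") < now:
    configs = fetch_configs()
    # Cache the configs for an hour so every DAG refresh doesn't re-query.
    Variable.set("sources", json.dumps({
        "configs": configs,
        "expire": (datetime.utcnow() + timedelta(hours=1)).isoformat(),
    }))
else:
    configs = cache["configs"]

for source in configs:
    dag = DAG(dag_id="client_{}".format(source["_id"]),
              start_date=datetime(2018, 1, 1),
              schedule_interval="@daily")
    DummyOperator(task_id="extract", dag=dag)
    # Hoist each DAG into module globals so the DagBag picks it up.
    globals()[dag.dag_id] = dag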

On Thu, Mar 22, 2018 at 2:12 PM, Taylor Edmiston 
wrote:

> I'm interested in hearing further discussion too, and if others have tried
> something similar to our approach.  Several companies on this list have
> mentioned various approaches to dynamic DAGs, and I think everyone needs
> them eventually.  Maybe it's an opportunity for additional docs regarding
> use cases like this and to document best practices from the community.
>
> *Taylor Edmiston*
> TEdmiston.com  | Blog
> 
> Stack Overflow CV  | LinkedIn
>  | AngelList
> 
>
>
> On Thu, Mar 22, 2018 at 12:43 PM, Kyle Hamlin  wrote:
>
> > @Chris @Taylor
> > Thank you guy very much for your explanations! Your strategy makes a lot
> of
> > sense to me. Generating a dag for each client I'm going to have a ton of
> > dags on the front page but at least that is searchable haha. I'm going to
> > give this implementation a shot and I'll try to report back with the
> > outcome.
> >
> > Can anyone comment on future work to support data science workflows like
> > these, or is Airflow fundamentally the wrong tool?
> >
> > On Thu, Mar 22, 2018 at 12:07 PM Taylor Edmiston 
> > wrote:
> >
> > > We're not using SubDagOperator.  Our approach is using 1 DAG file to
> > > generate a separate DAG class instance for each similar config, which
> > gets
> > > hoisted into global namespace.  In simplified pseudo-Python, it looks
> > like:
> > >
> > > # sources --> {'configs': [{...}, {...}], 'expire': ''}
> > > cache = Variable.get('sources', default_var={}, deserialize_json=True)
> > > sources = fetch_configs() if is_empty(cache) or is_expired(cache) else
> > > cache['configs']
> > > for source in sources:
> > >   dag = DAG(...)
> > >   globals()[source._id] = dag
> > >   # ...create tasks and set dependencies for each DAG (some config
> pulled
> > > from source object for each)...
> > >
> > > We added the cache part for the same reason you pointed out, because
> the
> > > DAG processing loop was hitting the API a lot.  Btw, you can also turn
> > down
> > > how much the processing loop runs with scheduler_heartbeat_sec under
> the
> > > scheduler group in config.
> > >
> > > We also considered the route Chris mentioned of updating cache via a
> > > separate DAG but weren't crazy about having a DAG scheduled once per
> > > minute.
> > >
> > > *Taylor Edmiston*
> > > TEdmiston.com  | Blog
> > > 
> > > Stack Overflow CV  | LinkedIn
> > >  | AngelList
> > > 
> > >
> > >
> > > On Thu, Mar 22, 2018 at 9:17 AM, David Capwell 
> > wrote:
> > >
> > > > For us we compile down to Python rather than do the logic in Python,
> > that
> > > > makes it so the load doesn't do real work.
> > > >
> > > > We have our own DSL that is just a simplified compiler; parse,
> analyze,
> > > > optimize, code gen.  In code gen we just generate the Python code.
> Our
> > > > build then packages it up and have airflow fetch it (very hacky fetch
> > > right
> > > > now)
> > > >
> > > > This does make it so loading is simple and fast, but means you can't
> > use
> > > > the Python api directly
> > > >
> > > > On Thu, Mar 22, 2018, 12:43 AM Andrew Maguire  >
> > > > wrote:
> > > >
> > > > > I've had similar issues with large dags being slow to render on ui
> > and
> > > > > crashing chrome.
> > > > >
> > > > > I got around it by changing the default tree view from 25 to just
> 5.
> > > > >
> > > > > Involves a couple changes to source files though, would be great if
> > > some
> > > > of
> > > > > the ui defaults could go into airflow.cfg.
> > > > >
> > > > > https://stackoverflow.com/a/48665734/1919374
> > > > >
> > > > > On Thu, 22 Mar 2018, 01:26 Chris Fei,  wrote:
> > > > >
> > > > > > @Kyle, I do something similar and have run into the problems
> you've
> > > > > > mentioned. In my case, I access data from S3 and then generate
> > > separate
> > > > > > DAGs (of different structures) based on the data that's pulled.
> > I've
> > > > > > also found that the UI for accessing a single large DAG is slow
> so
> > I
> > > > > > prefer to keep many separate DAGs. What I'd try is to define a
> DAG
> > > > > > that's responsible for accessing your API and caching the client
> > IDs
> > > > > > somewhere locally, maybe just to a file on disk or as an Airflow
> > > > > > Variable. You can run this DAG on whatever schedule is
> appropriate
> > > for
> > > > > > you. From there, build a function that creates a DAG and then for
> > > eac

Re: How to have dynamic downstream tasks that depend on data generated upstream

2018-03-15 Thread James Meickle
To my mind, it's not a great idea to clear a resource that you're
dynamically using to determine the contents of a DAG. Is there any way that
you can refactor the table to be immutable? Instead of querying all rows in
the table, you would query records in an "unprocessed" state. Instead of
truncating the table, you would mark everything in the table as
"processed". (Though optional, it would be even better for each row to
store the date it was processed, so that you can re-run this DAG in the
future.)

If storing that much data or refactoring the table isn't possible, could
you run this query once for the day, store the results in S3 (or Redis, or
...), and always fetch those results? That way the DAG always has the "most
recent" view, even if you delete records mid-day.

On Wed, Mar 14, 2018 at 10:20 PM, Aaron Polhamus 
wrote:

> Question for the community. Did some hunting around and didn't see any
> compelling answers. SO link:
> https://stackoverflow.com/questions/49290546/how-to-set-
> up-a-dag-when-downstream-task-definitions-depend-on-upstream-outcomes
>
>
> --
>
>
> *Aaron Polhamus*
> *Chief Technology Officer *
>
> Cel (México): +52 (55) 1951-5612
> Cell (USA): +1 (206) 380-3948
> Tel: +52 (55) 1168 9757 - Ext. 181
>
> --
> ***Please refer to our website for more information about our privacy
> policies.*
>
>


Re: How to add hooks for strong deployment consistency?

2018-03-06 Thread James Meickle
I'm currently looking into building a dynamic DAG that will load
user-provided data daily, into user-generated DB tables, using provided
schema definitions. There will be some ordering/dependencies such that
certain datasets depend on others. Since users can add new datasets at any
time, both the number of nodes and their growth from day to day is
unbounded. The way I am thinking of approaching this is to build an
"execution plan" in advance and distribute that as an artifact, instead of
having each Airflow worker re-evaluate the DAG and re-query the database of
user datasets.
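
Concretely, a deploy-time step would snapshot the dataset metadata into a plan
file, and the DAG file would only parse that plan. A minimal sketch (the path
and the plan's structure below are hypothetical):

import json
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Written by a separate build/deploy step, not by the scheduler.
PLAN_PATH = "/etc/airflow/plans/user_datasets.json"

with open(PLAN_PATH) as f:
    plan = json.load(f)  # {"datasets": [{"name": ..., "depends_on": [...]}, ...]}

dag = DAG("load_user_datasets",
          start_date=datetime(2018, 1, 1),
          schedule_interval="@daily")

tasks = {}
for ds in plan["datasets"]:
    tasks[ds["name"]] = DummyOperator(task_id="load_{}".format(ds["name"]), dag=dag)

for ds in plan["datasets"]:
    for upstream in ds.get("depends_on", []):
        tasks[upstream] >> tasks[ds["name"]]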

That's doable right now by shipping a static artifact, but there are many
warts. I could definitely see this as something that fits well with the
described Airflow "DAGFetcher" abstraction. My wish list for an overall
system based on that concept would look like this:

- The DAGFetcher API is aware of execution dates, so it can query the DAG
generator service with reference to a time period (not just "right now").
- The DAGFetcher API is aware of previous outputs, whether or not the DAG
generator service is. This isn't a cache (performance optimization, records
are ephemeral), but a ledger (immutable point in time record). For some
services, like git, the service is also the ledger. In my case, the DAGRun
is the product of the current git commit (versioned) and the point in time
view of a database (unversioned).
- I can view a "pending" DAGRun in the UI, representing a DAGRun that is
expected but not yet evaluated by the DAGFetcher.
- I can view a "exception" DAGRun in the UI, representing a DAGRun where
the DAGFetcher raised an exception, and retry fetching through the UI.
- I can alert if a DAGRun is in "pending" state for too long, or enters
"exception".
- Multiple DAGRuns can reference the same DAGFetcher result, if there's
been no changes from day to day.
- The UI can represent multiple DAGFetcher results for the same DAG, such
as showing a blank entry for execution dates where a task didn't exist (it
was there historically but was removed, or it's new).

On Mon, Mar 5, 2018 at 4:54 PM, Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> A few notes about the pickling topic to answer William:
> * first reason why we cannot use pickles: jinja template objects are not
> picklable, there's a hack to pickle the content of the template instead of
> the object, but that breaks jinja inheritance and imports
> * pickles are messy, and Airflow allows users to attach objects to DAG
> objects (on_error_callable, on_retry_callable, params, ...) and pickles
> will go down the recursive rabbit hole and import everything large chunks
> of what's in `sys.modules` sometimes. (probably could be mitigated)
> * pickles are a messy serialization format, lots of drawbacks, security
> issues, incompatibility between py2 and py3, ...
> * pickles have a bad reputation, many people advised avoiding them like the
> plague since the feature was first built
> * original approach of pickling to the db is kind of a hack
>
> I also agree that caching is probably required especially around large DAGs
> and for "semi-stateless" web servers to operate properly.
>
> Max
>
>
> On Thu, Mar 1, 2018 at 1:15 PM, David Capwell  wrote:
>
> > We need two versions but most likely would not use either... That being
> > artifactory and git (would really love for this to be pluggable!)
> >
> > We have our own dag fetch logic which right now pulls from git, caches,
> > then redirect airflow to that directory.  For us we have airflow
> automated
> > so you push a button to get a cluster, for this reason there are enough
> > instances that we have DDOS attacked git (opps).
> >
> > We are planning to change this to fetch from artifactory, and have a
> > stateful proxy for each cluster so we stop DDOS attacking core
> > infrastructure.
> >
> > On Mar 1, 2018 11:45 AM, "William Wong"  wrote:
> >
> > Also relatively new to Airflow here. Same as David above, Option 1 is not
> > an option for us either for the same reasons.
> >
> > What I would like to see is that it can be user selectable / modifiable.
> >
> > Use Case:
> > We have a DAG with thousands of task dependencies/tasks. After 24hrs of
> > progressing, we need to take a subset of those tasks and rerun them with
> a
> > different configuration (reasons range from incorrect parameters to
> > infrastructure issues, doesn't really matter here).
> >
> > What I hope can happen:
> > 1. Pause DAG
> > 2. Upload and tag newest dag version
> > 3. Set dag_run to use latest tag,
> > 4. Resolve DAG sync using  clearly
> > defined/documented>
> > 5. Unpause DAG
> >
> > I do like the DagFetcher idea. This logic should shim in nicely in the
> > DagBag code. Maxime, I also vote for the GitDagFetcher. Two thoughts
> about
> > the GitDagFetcher:
> > - I probably won't use fuse across 100's of nodes in my k8s/swarm. Not
> sure
> > how this would work without too much trouble.
> > - It might be confusing if some git sha's have no changes to 

Airflow at SREcon?

2018-02-23 Thread James Meickle
Quantopian is aiming to switch over to Airflow over the course of this
year, replacing our existing systems for ingesting financial data. I'll be
at SREcon Americas this year, potentially with some of my coworkers; we're
located in Boston, which doesn't seem to have many users yet, so we'd love
to get to know some other users while we're on the west coast!

It looks like we'll be missing the next Bay Area meetup, but perhaps some
Airflow users will also be at SREcon?

-James M.


Re: max_active_runs

2018-02-15 Thread James Meickle
I have made this mistake a few times. I think it would be great if Airflow
warned about DAG-level arguments being passed into tasks they don't apply
to, since that would indicate an easily fixable mistake.
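
For anyone who hits this thread later: the fix discussed below is simply to
pass the argument to the DAG constructor rather than burying it in
default_args, which is only forwarded to task instances. A minimal version of
the reporter's DAG with that change:

from datetime import datetime

from airflow import DAG

default_args = {
    "owner": "airflow",
    "start_date": datetime(2018, 2, 10),
    # DAG-level settings such as max_active_runs do not belong here;
    # default_args only feeds task instances.
}

dag = DAG("analytics6",
          default_args=default_args,
          schedule_interval="15 12 * * *",
          max_active_runs=1)  # pass DAG-level arguments to the DAG itself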

On Wed, Feb 14, 2018 at 9:22 AM, Gerard Toonstra 
wrote:

> One of those days ...
>
> max_active_runs is a dag property. default_args only get passed as default
> args to task instances, but it never applies there.
>
> Thanks!
>
> G>
>
> On Wed, Feb 14, 2018 at 2:47 PM, Ash Berlin-Taylor <
> ash_airflowl...@firemirror.com> wrote:
>
> > It seems unlikely, but could it be the location of where max_active_runs
> > is specified? In our DAGs we pass it directly as an argument to the DAG()
> > call, not via default_arguments and it behaves itself for us. I think
> > I should check that!
> >
> > -ash
> >
> >
> > > On 14 Feb 2018, at 13:43, Gerard Toonstra  wrote:
> > >
> > > A user on airflow 1.9.0 reports that 'max_active_runs' isn't
> respected. I
> > > remembered having fixed something related to this ages ago and this is
> > > here:
> > >
> > > https://issues.apache.org/jira/browse/AIRFLOW-137
> > >
> > > That however was related to backfills and clearing the dagruns.
> > >
> > > I watched him in the scenario and he literally creates a new simple dag
> > > with the following config:
> > >
> > > -
> > >
> > > from airflow import DAG
> > > from datetime import datetime, timedelta
> > >
> > > from airflow.contrib.operators.bigquery_operator import
> BigQueryOperator
> > > from airflow.contrib.operators.bigquery_to_gcs import
> > > BigQueryToCloudStorageOperator
> > > from airflow.contrib.operators.gcs_download_operator import
> > > GoogleCloudStorageDownloadOperator
> > > from airflow.contrib.operators.file_to_gcs import
> > > FileToGoogleCloudStorageOperator
> > > from airflow.operators.python_operator import PythonOperator
> > > from airflow.models import Variable
> > > import time
> > >
> > > default_args = {
> > >'owner': 'airflow',
> > >'start_date': datetime(2018, 2, 10),
> > >'max_active_runs': 1,
> > >'email_on_failure': False,
> > >'email_on_retry': False,
> > > }
> > >
> > > dag = DAG('analytics6', default_args=default_args,
> schedule_interval='15
> > 12
> > > * * *')
> > >
> > > -
> > >
> > > When it gets activated, multiple dagruns are created when there are
> still
> > > tasks running on the first date.
> > >
> > > His version is 1.9.0 from pypi.
> > >
> > > Is max_active_runs broken or are there other explanations for this
> > > particular behavior?
> > >
> > > Rgds,
> > >
> > > Gerard
> >
> >
>