Re: Graduation resolution passed - Airflow is a TLP

2018-12-20 Thread Maxime Beauchemin
"They grow up so fast!" :)

This is huge! Congratulations to everyone involved.

On Thu, Dec 20, 2018 at 3:53 PM Feng Lu  wrote:

> Fantastic news!! Congrats everyone!
>
> On Thu, Dec 20, 2018 at 2:18 PM Tao Feng  wrote:
>
> > Thanks Jakob for driving the graduation! Great news!
> >
> > On Thu, Dec 20, 2018 at 1:13 PM Jakob Homan  wrote:
> >
> > > Hey all-
> > >The Board minutes haven't been published yet (probably due to
> > > Holiday-related slowness), but I can see through the admin tool that
> > > our Graduation resolution was approved yesterday at the meeting.
> > > Airflow is the 199th current active Top Level Project in Apache.
> > >
> > > Congrats all.
> > >
> > > -Jakob
> > >
> >
>


Re: timeout not working in SqlSensor?

2018-12-20 Thread Maxime Beauchemin
I think it's `timeout` not `time_out`.
https://airflow.apache.org/code.html#basesensoroperator
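A minimal sketch of the corrected usage, assuming the Airflow 1.8-era import path and an illustrative connection id (both parameters are in seconds and come from BaseSensorOperator):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.sensors import SqlSensor  # 1.8-era import path

    dag = DAG("example", start_date=datetime(2016, 9, 25), schedule_interval="@daily")

    wait_for_rows = SqlSensor(
        task_id="wait_for_rows",
        conn_id="redshift",
        sql="select * from this_table where date = '2018-12-18'",
        poke_interval=20,  # poke every 20 seconds
        timeout=45,        # give up and fail after 45 seconds of poking
        dag=dag,
    )

With `timeout` spelled correctly the sensor should fail with a timeout error instead of poking forever; an unknown keyword like `time_out` is likely just ignored (with at most a warning) on 1.8-era operators.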

On Thu, Dec 20, 2018 at 12:12 PM Scott Halgrim
 wrote:

> Does the timeout param work the way I think it should in Airflow 1.8? My
> query just pokes at the poke interval indefinitely. I want it to time out
> after a few hours and fail instead.
>
> The code is basically
>
> """
> task_name = SqlSensor(
>     conn_id='redshift',
>     sql=(
>         """
>         select * from this_table where date='2018-12-18'
>         """
>     ),
>     task_id='task_name',
>     dag=dag,
>     start_date=datetime(2016, 9, 25),
>     poke_interval=20,
>     time_out=45,
> )
> """
> So after running the query a couple of times I would expect a failure, but
> that’s not what’s happening, it just keeps querying and querying every 20
> seconds.
>
> Thanks,
>
> Scott
>


Re: [IE] [VOTE] Graduate the Apache Airflow as a TLP

2018-11-30 Thread Maxime Beauchemin
Oh my! it's happening! Congrats and thanks to all contributors!

+1 (binding)

On Fri, Nov 30, 2018 at 2:56 PM Sunil Varma Chiluvuri <
sunilvarma.chiluv...@equifax.com> wrote:

> +1 (non-binding)!
>
> -Original Message-
> From: Jakob Homan [mailto:jgho...@gmail.com]
> Sent: Friday, November 30, 2018 3:33 PM
> To: dev@airflow.incubator.apache.org
> Subject: [IE] [VOTE] Graduate the Apache Airflow as a TLP
>
> Hey all!
>
> Following a very successful DISCUSS[1] regarding graduating Airflow to
> Top Level Project (TLP) status, I'm starting the official VOTE.
>
> Since entering the Incubator in 2016, the community has:
>* successfully produced 7 releases
>* added 9 new committers/PPMC members
>* built a diverse group of committers from multiple different employers
>* had more than 3,300 JIRA tickets opened
>* completed the project maturity model with positive responses[2]
>
> Accordingly, I believe we're ready to graduate and am calling a VOTE
> on the following graduation resolution.  This VOTE will remain open
> for at least 72 hours.  If successful, the resolution will be
> forwarded to the IPMC for its consideration.  If that VOTE is
> successful, the resolution will be voted upon by the Board at its next
> monthly meeting.
>
> Everyone is encouraged to vote, even if their vote is not binding.
> We've built a nice community here, let's make sure everyone has their
> voice heard.
>
> Thanks,
> Jakob
>
> [1]
> https://lists.apache.org/thread.html/%3c0a763b0b-7d0d-4353-979a-ac6769eb0...@gmail.com%3E
> [2]
> https://cwiki.apache.org/confluence/display/AIRFLOW/Maturity+Evaluation
>
> 
>
> Establish the Apache Airflow Project
>
> WHEREAS, the Board of Directors deems it to be in the best
> interests of the Foundation and consistent with the
> Foundation's purpose to establish a Project Management
> Committee charged with the creation and maintenance of
> open-source software, for distribution at no charge to
> the public, related to workflow automation and scheduling
> that can be used to author and manage data pipelines.
>
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> Committee (PMC), to be known as the "Apache Airflow Project",
> be and hereby is established pursuant to Bylaws of the
> Foundation; and be it further
>
> RESOLVED, that the Apache Airflow Project be and hereby is
> responsible for the creation and maintenance of software
> related to workflow automation and scheduling that can be
> used to author and manage data pipelines; and be it further
>
> RESOLVED, that the office of "Vice President, Apache Airflow" be
> and hereby is created, the person holding such office to
> serve at the direction of the Board of Directors as the chair
> of the Apache Airflow Project, and to have primary responsibility
> for management of the projects within the scope of
> responsibility of the Apache Airflow Project; and be it further
>
> RESOLVED, that the persons listed immediately below be and
> hereby are appointed to serve as the initial members of the
> Apache Airflow Project:
>
> * Alex Guziel 
> * Alex Van Boxel 
> * Arthur Wiedmer 
> * Ash Berlin-Taylor 
> * Bolke de Bruin 
> * Chris Riccomini 
> * Dan Davydov 
> * Fokko Driesprong 
> * Hitesh Shah 
> * Jakob Homan 
> * Jeremiah Lowin 
> * Joy Gao 
> * Kaxil Naik 
> * Maxime Beauchemin 
> * Siddharth Anand 
> * Sumit Maheshwari 
> * Tao Feng 
>
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Bolke de Bruin
> be appointed to the office of Vice President, Apache Airflow, to
> serve in accordance with and subject to the direction of the
> Board of Directors and the Bylaws of the Foundation until
> death, resignation, retirement, removal or disqualification,
> or until a successor is appointed; and be it further
>
> RESOLVED, that the initial Apache Airflow PMC be and hereby is
> tasked with the creation of a set of bylaws intended to
> encourage open development and increased participation in the
> Apache Airflow Project; and be it further
>
> RESOLVED, that the Apache Airflow Project be and hereby
> is tasked with the migration and rationalization of the Apache
> Incubator Airflow podling; and be it further
>
> RESOLVED, that all responsibilities pertaining to the Apache
> Incubator Airflow podling encumbered upon the Apache Incubator
> Project are hereafter discharged.
>


Re: grant edit permission for airflow wiki

2018-11-26 Thread Maxime Beauchemin
Granted!

Max

On Mon, Nov 26, 2018 at 10:21 PM Tao Feng  wrote:

> Could any Airflow wiki admin grant me the edit permission? My wiki user
> name is tfeng.
>
> Thanks,
> -Tao
>


Re: [DISCUSS] Apache Airflow graduation from the incubator

2018-11-26 Thread Maxime Beauchemin
This is great to see happen, it's been a long time coming! Also count me
in, I'll be happy to help on that last push!

On Mon, Nov 26, 2018 at 5:21 PM Hitesh Shah  wrote:

> +1. The Airflow community has come a long way since its addition to the
> Incubator and I believe it is more than ready to graduate. Based on past
> experience around all the "paperwork", I would suggest that the earliest
> graduation date would be the January board meeting and working backwards to
> plan out filling in the various wikis and triggering the relevant votes.
>
> thanks
> Hitesh
>
>
>
> On Mon, Nov 26, 2018 at 2:09 PM Bolke de Bruin  wrote:
>
> > Hi Jakob
> >
> > Thanks for the vote of confidence! That's appreciated.
> >
> > I linked the maturity model and already did a grading (I think :-) ) in my
> > original mail, so one thing less to worry about. It's probably a good idea
> > if some of the other committers and contributors also take a look. There
> > are 2 items I'm unsure about.
> >
> > Cheers
> > Bolke
> >
> >
> >
> > Sent from my iPhone
> >
> > > On 26 Nov 2018, at 22:30, Jakob Homan  wrote:
> > >
> > > With my Mentor hat on, I'm entirely confident that Airflow is ready to
> > > graduate.  The community broadly gets the Apache Way and operates
> > > within it.  The community is healthy and engaged.  The last couple
> > > releases went well, with no hitches whatsoever for the last one.
> > >
> > > The graduation process is mainly paperwork[1] and running VOTEs [2]
> > > here and on the IPMC.  Last time I suggested this had to be done by
> > > the Champion, which wasn't correct - anyone from the PPMC can do so.
> > > I may have some free cycles over the next few weeks, so I'll take a
> > > look at the check list to see what we can get out of the way, but any
> > > of the PPMCers can also take items.  The IPMC also likes it if
> > > Podlings go through and grade themselves on the Maturity Model[3], if
> > > someone wants to do that.
> > >
> > > -Jakob
> > >
> > > [1] http://incubator.apache.org/projects/airflow.html
> > > [2]
> >
> http://mail-archives.apache.org/mod_mbox/airflow-dev/201809.mbox/%3CCAMdzn8vNbKQr2FF8WcJydFj16q0bgn0B42jfu5h28RS-ZQ=w...@mail.gmail.com%3E
> > > [3]
> >
> https://community.apache.org/apache-way/apache-project-maturity-model.html
> > >> On Mon, Nov 26, 2018 at 1:10 PM Stefan Seelmann <
> > m...@stefan-seelmann.de> wrote:
> > >>
> > >> I agree that Apache Airflow should graduate.
> > >>
> > >> I've only been involved since the beginning of this year, but the project
> > >> did two releases during that time; once it's a TLP, releasing becomes
> > >> easier :)
> > >>
> > >> Regarding QU30 you may consider using the ASF-wide security mailing
> > >> list [3] and process [4].
> > >>
> > >> Kind Regards,
> > >> Stefan
> > >>
> > >> [3] https://www.apache.org/security/
> > >> [4] https://www.apache.org/security/committers.html
> > >>
> > >>
> > >>> On 11/26/18 8:46 PM, Bolke de Bruin wrote:
> > >>> Ping!
> > >>>
> > >>> Sent from my iPhone
> > >>>
> >  On 24 Nov 2018, at 12:57, Bolke de Bruin  wrote:
> > 
> >  Hi All,
> > 
> > With the Apache Airflow community healthy and growing, I think now would
> > be a good time to discuss where we stand regarding graduation from the
> > Incubator, and what requirements remain.
> > 
> > Apache Airflow entered incubation around 2 years ago; since then, the
> > Airflow community has learned a lot about how to do things the Apache way.
> > We are now a very helpful and engaged community, ready to help with all
> > questions from Airflow users. We have delivered multiple releases of
> > steadily increasing quality, and we can now do self-driven releases at a
> > good cadence.
> > 
> > The community is growing, and new committers and PPMC members keep
> > joining. We have addressed almost all the maturity issues stipulated by the
> > Apache Project Maturity Model [1]. Some final requirements remain, but
> > those just need a final nudge; committers and contributors are invited to
> > verify the list and pick up the last bits (QU30, CO50). Finally (yahoo!)
> > all the License and IP issues we could see have been resolved.
> > 
> > Based on those, I believe it's time for us to graduate to TLP [2].
> > Any thoughts? Advice from our Airflow mentors is welcome.
> > 
> >  Thanks,
> > 
> >  [1]
> > https://cwiki.apache.org/confluence/display/AIRFLOW/Maturity+Evaluation
> >  [2]
> >
> https://incubator.apache.org/guides/graduation.html#graduating_to_a_top_level_project
> > Regards,
> > >>
> >
>


Re: programmatically creating and airflow quirks

2018-11-25 Thread Maxime Beauchemin
The historical reason is that people would check scripts into the repo
that had actual compute or other forms of undesired side effects at module
scope (scripts with no "if __name__ == '__main__':" guard), and Airflow would
just run those scripts while scanning for DAGs. So we added this mitigation
patch that confirms there's something Airflow-related in the .py file. Not
elegant, and confusing at times, but it has probably also prevented some
issues over the years.

The solution here is to have a more explicit way of adding DAGs to the
DagBag (instead of the folder-crawling approach). The DagFetcher proposal
offers solutions around that, having a central "manifest" file that
provides explicit pointers to all DAGs in the environment.
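For illustration, a minimal generated-DAGs loader along those lines; it passes the heuristic simply because it imports from `airflow` and references `DAG` (the file name and DAG ids are made up):

    # dag_loader.py -- dropped into the DAGs folder; it contains the strings
    # "airflow" and "DAG", so the substring check described above lets it through.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    def create_dag(dag_id):
        dag = DAG(dag_id, start_date=datetime(2018, 1, 1), schedule_interval="@daily")
        DummyOperator(task_id="noop", dag=dag)
        return dag

    # expose each DAG as a module-level global so the DagBag picks it up
    for i in range(3):
        dag = create_dag("generated_dag_{}".format(i))
        globals()[dag.dag_id] = dag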

Max

On Sat, Nov 24, 2018 at 5:04 PM Beau Barker 
wrote:

> In my opinion this searching for dags is not ideal.
>
> We should be explicitly specifying the dags to load somewhere.
>
>
> > On 25 Nov 2018, at 10:41 am, Kevin Yang  wrote:
> >
> > I believe that is mostly because we want to skip parsing/loading .py files
> > that don't contain DAG defs to save time, as the scheduler is going to
> > parse/load the .py files over and over again and some files can take quite
> > long to load.
> >
> > Cheers,
> > Kevin Y
> >
> > On Fri, Nov 23, 2018 at 12:44 AM soma dhavala 
> > wrote:
> >
> >> happy to report that the “fix” worked. thanks Alex.
> >>
> >> btw, wondering why was it there in the first place? how does it help —
> >> saves time, early termination — what?
> >>
> >>
> >>> On Nov 23, 2018, at 8:18 AM, Alex Guziel 
> wrote:
> >>>
> >>> Yup.
> >>>
> >>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala  >> > wrote:
> >>>
> >>>
>  On Nov 23, 2018, at 3:28 AM, Alex Guziel  >> > wrote:
> 
>  It’s because of this
> 
>  “When searching for DAGs, Airflow will only consider files where the
> >> string “airflow” and “DAG” both appear in the contents of the .py file.”
> 
> >>>
> >>> Have not noticed it.  From airflow/models.py, in process_file — (both
> in
> >> 1.9 and 1.10)
> >>> ..
> >>> if not all([s in content for s in (b'DAG', b'airflow')]):
> >>> ..
> >>> is looking for those strings and if they are not found, it is returning
> >> without loading the DAGs.
> >>>
> >>>
> >>> So having “airflow” and “DAG”  dummy strings placed somewhere will make
> >> it work?
> >>>
> >>>
>  On Thu, Nov 22, 2018 at 2:27 AM soma dhavala  >> > wrote:
> 
> 
> > On Nov 22, 2018, at 3:37 PM, Alex Guziel  >> > wrote:
> >
> > I think this is what is going on. The dags are picked by local
> >> variables. I.E. if you do
> > dag = Dag(...)
> > dag = Dag(…)
> 
>  from my_module import create_dag
> 
>  for file in yaml_files:
>  dag = create_dag(file)
>  globals()[dag.dag_id] = dag
> 
>  You notice that create_dag is in a different module. If it is in the
> >> same scope (file), it will be fine.
> 
> >
> 
> > Only the second dag will be picked up.
> >
> > On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <
> soma.dhav...@gmail.com
> >> > wrote:
> > Hey AirFlow Devs:
> > In our organization, we build a Machine Learning WorkBench with
> >> AirFlow as
> > an orchestrator of the ML Work Flows, and have wrapped AirFlow python
> > operators to customize the behaviour. These work flows are specified
> in
> > YAML.
> >
> > We drop a DAG loader (written python) in the default location airflow
> > expects the DAG files.  This DAG loader reads the specified YAML
> files
> >> and
> > converts them into airflow DAG objects. Essentially, we are
> > programmatically creating the DAG objects. In order to support
> muliple
> > parsers (yaml, json etc), we separated the DAG creation from loading.
> >> But
> > when a DAG is created (in a separate module) and made available to
> the
> >> DAG
> > loaders, airflow does not pick it up. As an example, consider that I
> > created a DAG, pickled it, and will simply unpickle the DAG and give it
> >> to
> > airflow.
> >
> > However, in the current avatar of airflow, the very creation of the DAG has
> > to happen in the loader itself. As far as I am concerned, airflow should
> not
> >> care
> > where and how the DAG object is created, so long as it is a valid DAG
> > object. The workaround for us is to mix parser and loader in the same
> >> file
> > and drop it in the airflow default dags folder. During dag_bag
> >> creation,
> > this file is loaded up with import_modules utility and shows up in
> the
> >> UI.
> > While this is a solution, but it is not clean.
> >
> > What do DEVs think about a solution to this problem? Will saving the
> >> DAG to
> > the db and reading it from the db work? Or some core changes need to
> >> happen
> > in the

Re: A Naive Multi-Scheduler Architecture Experiment of Airflow

2018-11-09 Thread Maxime Beauchemin
I feel pretty strongly about not having a ZK dependency/requirement for the
multi scheduler setup. ZK is a fine piece of tech that provides guarantees
we need, but having to install/maintain it is very prohibitive and another
potential point of failure. I don't know the latest but I heard that the
Apache Druid community is ripping out their ZK dependency in favor of a
lighter-weight Java Raft implementation.

Also voting against sharding on DAG filename/filepath for the final
solution as it isn't HA (I do understand it's an easy and interesting hack
in the meantime though). Having any scheduler schedule any DAG is so much
more robust and predictable. 10k DAGs running every minute is <200 locks
per second. My feeling is that MySQL / Postgres can totally eat up that
load. For leader election there's probably a python Raft implementation out
there we can use.

Max

On Fri, Nov 9, 2018 at 4:48 AM Daniel (Daniel Lamblin) [BDP - Seoul] <
lamb...@coupang.com> wrote:

> The missing points you brought up, yes that was one of the reasons it
> seemed like getting zookeeper or a DB coordinated procedure involved to
> both count and number the schedulers and mark one of them the lead. Locking
> each dag file for processing sounds easier, but we were seeing update
> transactions fail already without adding more pressure on the DB. Locking
> is another thing zk can handle. But adding zk seems like such deployment
> overhead that scheduler type like executor type needs to become a modular
> option in the process of the change.
>
> The ignore pattern method that was in use described earlier was basically
> adding an entry to a top level .airflowignore file via a flag or env
> instead of making the file. Now that was simple, with all the drawbacks
> mentioned already.
>
> Get Outlook for Android<https://aka.ms/ghei36>
>
> 
> From: Maxime Beauchemin 
> Sent: Friday, November 9, 2018 5:03:02 PM
> To: dev@airflow.incubator.apache.org
> Cc: d...@airflow.apache.org; yrql...@gmail.com
> Subject: Re: A Naive Multi-Scheduler Architecture Experiment of Airflow
>
>
> I mean at that point it's just as easy (or easier) to do things properly:
> get the scheduler subprocesses to take a lock on the DAG it's about to
> process, and release it when it's done. Add a lock timestamp and bit of
> logic to expire locks (to self heal if the process ever crashed and failed
> at releasing the lock). Of course make sure that the
> confirm-its-not-locked-and-take-a-lock process is insulated in a database
> transaction, and you're mostly good. That sounds like a very easy thing to
> do.
>
> The only thing that's missing at that point to fully support
> multi-schedulers is to centralize the logic that does the prioritization
> and pushing to workers. That's a bit more complicated, it assumes a leader
> (and leader election), and to change the logic of how individual
> "DAG-evaluator processes" communicate what task instances are runnable to
> that leader (over a message queue? over the database?).
>
> Max
>
> On Thu, Nov 8, 2018 at 10:02 AM Daniel (Daniel Lamblin) [BDP - Seoul] <
> lamb...@coupang.com> wrote:
>
> > Since you're discussing multi-scheduler trials,
> > Based on v1.8 we have also tried something, based on passing in a regex
> to
> > each scheduler; DAG file paths which match it are ignored. This required
> > turning off some logic that deletes dag data for dags that are missing
> from
> > the dagbag.
> > It is pretty manual and not evenly distributed, but it allows some 5000+
> > DAGs or so with 6 scheduler instances. That said there's some pain around
> > maintaining such a setup, so we didn't opt for it (yet) in our v1.10
> setup.
> > The lack of cleaning up an old dag name is also not great (it can be done
> > semi manually). Then there's the work in trying to redefine patterns for
> > better mixes, testing that patterns don't all ignore the same file, nor
> > that more than one scheduler includes the same file. I generally wouldn't
> > suggest this approach.
> >
> > In considering to setup a similar modification to v1.10, we thought it
> > would make sense to instead tell each scheduler which scheduler number it
> > is, and how many total schedulers there are. Then each scheduler can use
> > some hash (cityhash?) on the whole py file path, mod it by the sc

Re: A Naive Multi-Scheduler Architecture Experiment of Airflow

2018-11-09 Thread Maxime Beauchemin
I mean at that point it's just as easy (or easier) to do things properly:
get the scheduler subprocesses to take a lock on the DAG it's about to
process, and release it when it's done. Add a lock timestamp and a bit of
logic to expire locks (to self-heal if the process ever crashed and failed
to release the lock). Of course make sure that the
confirm-it's-not-locked-and-take-a-lock step is insulated in a database
transaction, and you're mostly good. That sounds like a very easy thing to
do.
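A rough sketch of that check-and-lock step, assuming an illustrative lock table (not Airflow's actual schema) and SQLAlchemy row locks:

    # Hypothetical DAG-level lock with a TTL so crashed processes self-heal.
    from datetime import datetime, timedelta
    from sqlalchemy import Column, DateTime, String
    from sqlalchemy.exc import IntegrityError
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class DagLock(Base):
        __tablename__ = "dag_lock"          # illustrative table, not in Airflow
        dag_file = Column(String(250), primary_key=True)
        locked_at = Column(DateTime, nullable=True)

    LOCK_TTL = timedelta(minutes=5)         # expire stale locks

    def try_lock(session, dag_file):
        # SELECT ... FOR UPDATE keeps the confirm-and-take step in one transaction.
        row = (session.query(DagLock)
                      .filter_by(dag_file=dag_file)
                      .with_for_update()
                      .one_or_none())
        now = datetime.utcnow()
        if row is None:
            session.add(DagLock(dag_file=dag_file, locked_at=now))
        elif row.locked_at is None or now - row.locked_at > LOCK_TTL:
            row.locked_at = now
        else:
            session.rollback()              # someone else holds a fresh lock
            return False
        try:
            session.commit()                # the primary key arbitrates ties
        except IntegrityError:
            session.rollback()
            return False
        return True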

The only thing that's missing at that point to fully support
multi-schedulers is to centralize the logic that does the prioritization
and pushing to workers. That's a bit more complicated: it assumes a leader
(and leader election), and a change to the logic of how individual
"DAG-evaluator processes" communicate which task instances are runnable to
that leader (over a message queue? over the database?).

Max

On Thu, Nov 8, 2018 at 10:02 AM Daniel (Daniel Lamblin) [BDP - Seoul] <
lamb...@coupang.com> wrote:

> Since you're discussing multi-scheduler trials,
> Based on v1.8 we have also tried something, based on passing in a regex to
> each scheduler; DAG file paths which match it are ignored. This required
> turning off some logic that deletes dag data for dags that are missing from
> the dagbag.
> It is pretty manual and not evenly distributed, but it allows some 5000+
> DAGs or so with 6 scheduler instances. That said there's some pain around
> maintaining such a setup, so we didn't opt for it (yet) in our v1.10 setup.
> The lack of cleaning up an old dag name is also not great (it can be done
> semi manually). Then there's the work in trying to redefine patterns for
> better mixes, testing that patterns don't all ignore the same file, nor
> that more than one scheduler includes the same file. I generally wouldn't
> suggest this approach.
>
> In considering to setup a similar modification to v1.10, we thought it
> would make sense to instead tell each scheduler which scheduler number it
> is, and how many total schedulers there are. Then each scheduler can use
> some hash (cityhash?) on the whole py file path, mod it by the scheduler
> count, and only parse it if it matches its scheduler number.
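A tiny sketch of that assignment rule; md5 stands in for cityhash so the example is self-contained (Python's built-in hash() is salted per process, so it wouldn't agree across schedulers):

    import hashlib

    def owns_file(path, scheduler_index, total_schedulers):
        digest = int(hashlib.md5(path.encode("utf-8")).hexdigest(), 16)
        return digest % total_schedulers == scheduler_index

    # e.g. scheduler #2 of 6 only parses the files assigned to it
    all_dag_files = ["/dags/team_a/etl.py", "/dags/team_b/ml.py"]  # stand-in list
    my_files = [f for f in all_dag_files
                if owns_file(f, scheduler_index=2, total_schedulers=6)]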
>
> This seemed like a good way to keep a fixed number of schedulers balancing
> new dag files, but we didn't do it (yet) because we started to think about
> getting fancier: what if a scheduler needs to be added? Can it be done
> without stopping the others and update the total count; or vice-versa for
> removing a scheduler. If one scheduler drops out can the others renumber
> themselves? If that could be solved, then the schedulers could be made into
> an autoscaling group… For this we thought about wrapping the whole
> scheduler instance's process up in some watchdog that might coordinate with
> something like zookeeper (or by using the existing airflow DB) but it got
> to be full of potential loopholes for the schedulers, like needing to be in
> sync about refilling the dagbag in concert with each other when there's a
> change in the total count, and problems when one drops off but is actually
> not really down for the count and pops back in having missed that the
> others decided to change their numbering, etc.
>
> I bring this up because the basic form of the idea doesn't hinge on which
> folder a dag is in, which seems more likely to work nicely with team-based
> hierarchies which also import reusable modules across DAG files.
> -Daniel
> P.S. yeah we did find there were times when schedulers exited because
> there was a db lock on task instances they were trying to update. So the DB
> needs to be managed by someone who knows how to scale it for that… or
> possibly the model needs to be made more conducive to minimally locking
> updates.
>
> On 10/31/18, 11:38 PM, "Deng Xiaodong"  wrote:
>
> Hi Folks,
>
> Previously I initiated a discussion about the best practice of Airflow
> setting-up, and it was agreed by a few folks that the scheduler may become
> one of the bottleneck components (we can only run one scheduler instance, can
> only scale vertically rather than horizontally, etc.). Especially when we
> have thousands of DAGs, the scheduling latency may be high.
>
> In our team, we have experimented with a naive multiple-scheduler
> architecture. Would like to share here, and also seek inputs from you.
>
> *1. Background*
> - Inside DAG_Folder, we can have sub-folders.
> - When we initiate scheduler instance, we can specify “--subdir” for
> it, which will specify the specific directory that the scheduler is going
> to “scan” (https://airflow.apache.org/cli.html#scheduler).
>
> *2. Our Naive Idea*
> Say we have 2,000 DAGs. If we run one single scheduler instance, one
> scheduling loop will traverse all 2K DAGs.
>
> Our idea is:
> Step-1: Create multiple sub-directories, say five, under DAG_Folder
> (subdir1, subdir2, …, subdir5)
> Step-2: Distribute the DAGs evenly into these sub-dire

Re: Duplicate key unique constraint error

2018-11-02 Thread Maxime Beauchemin
Wait, the title of this thread is "Duplicate key unique constraint error",
to me that screams that something is not ok. If the check+insert was atomic
(insulated) this error wouldn't happen. Also I'm pretty sure when I looked
the stack trace looked like a scheduler-specific stack trace. It may be a
rare race condition, but doesn't the stack trace prove the existence of a
race condition?

Max

On Fri, Nov 2, 2018 at 10:19 AM Abhishek Sinha 
wrote:

> Max,
>
> If check+insert works correctly, then even multiple instances of scheduler
> running in parallel should not throw this error. I am not sure then when
> can this error happen.
>
>
>
> On 2 November 2018 at 8:37:20 AM, Maxime Beauchemin (
> maximebeauche...@gmail.com) wrote:
>
> The scheduler should never fail hard. The schedule logic that tries to
> insert the new task instance should only insert a new one if it doesn't
> exist already and isolate that check+insert inside a database transaction.
>
> Max
>
> On Fri, Nov 2, 2018 at 5:38 AM Abhishek Sinha 
> wrote:
>
> > Brian,
> >
> > We use the trigger dag CLI command to trigger it manually.
> >
> > Even when you have custom operators, the duplicate key error should not
> > happen right? Shouldn't the combination of task id, dag id and execution
> > date be unique?
> >
> >
> > On 30 October 2018 at 10:23:27 PM, Abhishek Sinha (abhis...@infoworks.io)
>
> > wrote:
> >
> > Max,
> >
> > The schedule interval is 1 day.
> >
> >
> >
> > Sent from my iPhone
> >
> > > On 30-Oct-2018, at 9:29 PM, Maxime Beauchemin <
> > maximebeauche...@gmail.com>
> > wrote:
> > >
> > > Also what's your schedule interval? I'm just trying to confirm that
> this
> > > isn't a "run every minute, or anytime someone blinks" kind of DAG.
> > >
> > > Max
> > >
> > > On Tue, Oct 30, 2018 at 5:48 AM Brian Greene <
> > > br...@heisenbergwoodworking.com> wrote:
> > >
> > >> How do you trigger it externally?
> > >>
> > >> We have several custom operators that trigger other jobs and we had
> to
> > be
> > >> really careful or we’d get duplicate keys for the dag run and it
> would
> > fail
> > >> to kick off.
> > >>
> > >> One scheduler, but we saw it repeatedly and have it noted as a thing
> to
> > >> watch out for.
> > >>
> > >> Brian
> > >>
> > >> Sent from a device with less than stellar autocorrect
> > >>
> > >>> On Oct 29, 2018, at 2:03 PM, Abhishek Sinha 
> > >> wrote:
> > >>>
> > >>> Attaching the scheduler crash logs as well.
> > >>>
> > >>> https://pastebin.com/B2WEJKRB
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Regards,
> > >>>
> > >>> Abhishek Sinha | m: +919035191078 | e: abhis...@infoworks.io
> > >>>
> > >>>
> > >>> On Tue, Oct 30, 2018 at 12:19 AM Abhishek Sinha <
> abhis...@infoworks.io
> > >
> > >>> wrote:
> > >>>
> > >>>> Max,
> > >>>>
> > >>>> We always trigger the DAG externally. I am not sure if there is
> still
> > >> any
> > >>>> backfill involved.
> > >>>>
> > >>>> Is there a way where I can find out in logs, if more than one
> instance
> > >> of
> > >>>> scheduler is running?
> > >>>>
> > >>>>
> > >>>> On 29 October 2018 at 10:43:19 PM, Maxime Beauchemin (
> > >>>> maximebeauche...@gmail.com) wrote:
> > >>>>
> > >>>> The stacktrace seems to be pointing in that direction. Id check
> that
> > >>>> first. It seems like it **could** be a race condition with a
> backfill
> > as
> > >>>> well, unclear.
> > >>>>
> > >>>> It's still a bug though, and the scheduler should make sure to
> handle
> > >> this
> > >>>> and not raise/crash.
> > >>>>
> > >>>> On Mon, Oct 29, 2018, 10:05 AM Abhishek Sinha <
> abhis...@infoworks.io>
> > >>>> wrote:
> > >>>>
> > >>>>> Max,
> > >>>>>
> > >>>>> I do not think there wa

Re: Duplicate key unique constraint error

2018-11-02 Thread Maxime Beauchemin
The scheduler should never fail hard. The scheduling logic that tries to
insert the new task instance should only insert a new one if it doesn't
exist already, and should isolate that check+insert inside a database
transaction.
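As a sketch of that idea (illustrative SQL only, not the actual scheduler code; it assumes the (task_id, dag_id, execution_date) unique key the error points at and a Postgres backend):

    from sqlalchemy import text

    def insert_task_instance_if_absent(session, dag_id, task_id, execution_date):
        # Let the unique key arbitrate: concurrent writers can't race between a
        # separate SELECT and the INSERT, and no duplicate-key error escapes.
        session.execute(
            text("""
                INSERT INTO task_instance (task_id, dag_id, execution_date, state)
                VALUES (:task_id, :dag_id, :execution_date, 'scheduled')
                ON CONFLICT (task_id, dag_id, execution_date) DO NOTHING
            """),
            {"task_id": task_id, "dag_id": dag_id, "execution_date": execution_date},
        )
        session.commit()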

Max

On Fri, Nov 2, 2018 at 5:38 AM Abhishek Sinha  wrote:

> Brian,
>
> We use the trigger dag CLI command to trigger it manually.
>
> Even when you have custom operators, the duplicate key error should not
> happen right? Shouldn't the combination of task id, dag id and execution
> date be unique?
>
>
> On 30 October 2018 at 10:23:27 PM, Abhishek Sinha (abhis...@infoworks.io)
> wrote:
>
> Max,
>
> The schedule interval is 1 day.
>
>
>
> Sent from my iPhone
>
> > On 30-Oct-2018, at 9:29 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com>
> wrote:
> >
> > Also what's your schedule interval? I'm just trying to confirm that this
> > isn't a "run every minute, or anytime someone blinks" kind of DAG.
> >
> > Max
> >
> > On Tue, Oct 30, 2018 at 5:48 AM Brian Greene <
> > br...@heisenbergwoodworking.com> wrote:
> >
> >> How do you trigger it externally?
> >>
> >> We have several custom operators that trigger other jobs and we had to
> be
> >> really careful or we’d get duplicate keys for the dag run and it would
> fail
> >> to kick off.
> >>
> >> One scheduler, but we saw it repeatedly and have it noted as a thing to
> >> watch out for.
> >>
> >> Brian
> >>
> >> Sent from a device with less than stellar autocorrect
> >>
> >>> On Oct 29, 2018, at 2:03 PM, Abhishek Sinha 
> >> wrote:
> >>>
> >>> Attaching the scheduler crash logs as well.
> >>>
> >>> https://pastebin.com/B2WEJKRB
> >>>
> >>>
> >>>
> >>>
> >>> Regards,
> >>>
> >>> Abhishek Sinha | m: +919035191078 | e: abhis...@infoworks.io
> >>>
> >>>
> >>> On Tue, Oct 30, 2018 at 12:19 AM Abhishek Sinha  >
> >>> wrote:
> >>>
> >>>> Max,
> >>>>
> >>>> We always trigger the DAG externally. I am not sure if there is still
> >> any
> >>>> backfill involved.
> >>>>
> >>>> Is there a way where I can find out in logs, if more than one instance
> >> of
> >>>> scheduler is running?
> >>>>
> >>>>
> >>>> On 29 October 2018 at 10:43:19 PM, Maxime Beauchemin (
> >>>> maximebeauche...@gmail.com) wrote:
> >>>>
> >>>> The stacktrace seems to be pointing in that direction. Id check that
> >>>> first. It seems like it **could** be a race condition with a backfill
> as
> >>>> well, unclear.
> >>>>
> >>>> It's still a bug though, and the scheduler should make sure to handle
> >> this
> >>>> and not raise/crash.
> >>>>
> >>>> On Mon, Oct 29, 2018, 10:05 AM Abhishek Sinha 
> >>>> wrote:
> >>>>
> >>>>> Max,
> >>>>>
> >>>>> I do not think there was more than one instance of scheduler running.
> >>>>> Since the scheduler crashed and it has been restarted, I cannot
> >> confirm it
> >>>>> now. Is there any log that can provide this information?
> >>>>>
> >>>>> Could there be a different cause apart from multiple scheduler
> >> instances
> >>>>> running?
> >>>>>
> >>>>>
> >>>>> On 29 October 2018 at 9:30:56 PM, Maxime Beauchemin (
> >>>>> maximebeauche...@gmail.com) wrote:
> >>>>>
> >>>>> Abhishek, are you running more than one scheduler instance at once?
> >>>>>
> >>>>> Max
> >>>>>
> >>>>> On Mon, Oct 29, 2018 at 8:17 AM Abhishek Sinha <
> abhis...@infoworks.io>
>
> >>>>> wrote:
> >>>>>
> >>>>>> The issue is happening more frequently now. Can someone please look
> >> into
> >>>>>> this?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 24 September 2018 at 12:42:49 PM, Abhishek Sinha (
> >>>>> abhis...@infoworks.io
> >>>>>> )
> >>>>>> wrote:
> >>>>>>
> >>>>>> Can someone please help in looking into this issue? It is critical
> >> since
> >>>>>> this has come up in one of our production environment. Also, this
> >> issue
> >>>>> has
> >>>>>> appeared only once till now.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Abhishek
> >>>>>>
> >>>>>> On 20-Sep-2018, at 10:18 PM, Abhishek Sinha 
> >>>>> wrote:
> >>>>>>
> >>>>>> Any update on this?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Abhishek
> >>>>>>
> >>>>>> On 18-Sep-2018, at 12:48 AM, Abhishek Sinha 
> >>>>> wrote:
> >>>>>>
> >>>>>> Pastebin: https://pastebin.com/K6BMTb5K
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Abhishek
> >>>>>>
> >>>>>> On 18-Sep-2018, at 12:31 AM, Stefan Seelmann <
> m...@stefan-seelmann.de
> >>>
> >>>>>> wrote:
> >>>>>>
> >>>>>> On 9/17/18 8:19 PM, Abhishek Sinha wrote:
> >>>>>>
> >>>>>> Any update on this?
> >>>>>>
> >>>>>> Please find the scheduler error log attached.
> >>>>>>
> >>>>>> Can you share the full python stack trace?
> >>>>>>
> >>>>>>
> >>>>>> Seems the mailing list doesn't allow attachments. Either post the
> >>>>>> stacktrace inline, or post it somewhere at pastebin or so.
> >>>>>>
> >>>>>
> >>>>>
> >>
>


Re: Deployment / Execution Model

2018-10-31 Thread Maxime Beauchemin
Deploying the DAGs should be decoupled from deploying Airflow itself. You
can just use a resource that syncs the DAGs repo to the boxes on the
Airflow cluster periodically (say every minute). Resource orchestrators
like Chef, Ansible, and Puppet should have some easy way to do that. Either
that or some sort of mount or mount-equivalent (k8s has constructs for
that, EFS on Amazon).
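The crudest possible version of that sync, just to illustrate the idea (in practice it would be a cron entry, a configuration-management resource, or a k8s git-sync sidecar; the path is made up):

    # naive DAGs-repo sync loop, one per box (illustrative only)
    import subprocess
    import time

    DAGS_CHECKOUT = "/opt/airflow/dags"   # assumed git checkout on each box

    while True:
        # fast-forward only, so a rewritten branch never leaves boxes diverged
        subprocess.run(["git", "-C", DAGS_CHECKOUT, "pull", "--ff-only"], check=False)
        time.sleep(60)                    # "say every minute"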

Also note that the DagFetcher abstraction that's been discussed before on
the mailing list would solve this and more.

Max

On Wed, Oct 31, 2018 at 2:37 PM Gabriel Silk 
wrote:

> Hello Airflow community,
>
>
> I'm currently putting Airflow into production at my company of 2000+
> people. The most significant sticking point so far is the deployment /
> execution model. I wanted to write up my experience so far in this matter
> and see how other people are dealing with this issue.
>
> First of all, our goal is to allow engineers to author DAGs and easily
> deploy them. That means they should be able to make changes to their DAGs,
> add/remove dependencies, and not have to redeploy any of the core
> components (scheduler, webserver, workers).
>
> Our first attempt at productionizing Airflow used the vanilla DAGs folder,
> and included all the deps of all the DAGs with the airflow binary itself.
> Unfortunately, that meant we had to redeploy the scheduler, webserver
> and/or workers every time a dependency changed, which will definitely not
> work for us long term.
>
> The next option we considered was to use the "packaged DAGs" approach,
> whereby you place dependencies in a zip file. This would not work for us,
> due to the lack of support for dynamic libraries (see
> https://airflow.apache.org/concepts.html#packaged-dags)
>
> We have finally arrived at an option that seems reasonable, which is to use
> a single Operator that shells out to various binary targets that we build
> independently of Airflow, and which include their own dependencies.
> Configuration is serialized via protobuf and passed over stdin to the
> subprocess. The parent process (which is in Airflow's memory space) streams
> the logs from stdout and stderr.
>
> This approach has the advantage of working seamlessly with our build
> system, and allowing us to redeploy DAGs even when dependencies in the
> operator implementations change.
>
> Any thoughts / comments / feedback? Have people faced similar issues out
> there?
>
> Many thanks,
>
>
> -G Silk
>


Re: A Naive Multi-Scheduler Architecture Experiment of Airflow

2018-10-31 Thread Maxime Beauchemin
A few related thoughts:
* there may be hiccups around concurrency (pools, queues), though the
worker should double-checks that the constraints are still met when firing
the task, so in theory this should be ok
* there may be more "misfires" meaning the task gets sent to the worker,
but by the time it starts the conditions aren't met anymore because of a
race condition with one of the other schedulers. Here I'm assuming recent
versions of Airflow will simply eventually re-fire the misfires and heal
* cross DAG prioritization can't really take place anymore as there's not a
shared "ready-to-run" list of task instances that can be sorted by
priority_weight. Whichever scheduler instance fires first is likely to get
the open slots first.

Max


On Wed, Oct 31, 2018 at 1:00 PM Kevin Yang  wrote:

> Finally we start to talk about this seriously? Yeah! :D
>
> For your approach, a few thoughts:
>
>1. Shard by # of files may not yield same load--even very different load
>since we may have some framework DAG file producing 500 DAG and take
>forever to parse.
>    2. I think Alex Guziel  had previously
>    talked about using Apache Helix to shard the scheduler. I haven't looked a
>    lot into it but it may be something you're interested in. I personally like
>    that idea because we don't need to reinvent the wheel for a lot of stuff (
>    less code to maintain also ;) ).
>3. About the DB part, I should be contributing back some changes that
>can dramatically drop the DB CPU usage. Afterwards I think we should
> have
>    plenty of headroom (assuming the traffic is ~4000 DAG files and ~40k
>    concurrently running task instances) so we should probably be fine here.
>
> Also I'm kinda curious about your setup and want to understand why do you
> need to shard the scheduler, since the scheduler can now scale up pretty
> high actually.
>
> Thank you for initiating the discussion, I think it can turn out to be a very
> valuable and critical discussion--many people have been thinking/discussing
> about this and I can't wait to hear the ideas :D
>
> Cheers,
> Kevin Y
>
> On Wed, Oct 31, 2018 at 7:38 AM Deng Xiaodong  wrote:
>
> > Hi Folks,
> >
> > Previously I initiated a discussion about the best practice of Airflow
> > setting-up, and it was agreed by a few folks that the scheduler may become
> > one of the bottleneck components (we can only run one scheduler instance, can
> > only scale vertically rather than horizontally, etc.). Especially when we
> > have thousands of DAGs, the scheduling latency may be high.
> >
> > In our team, we have experimented with a naive multiple-scheduler
> architecture.
> > Would like to share here, and also seek inputs from you.
> >
> > **1. Background**
> > - Inside DAG_Folder, we can have sub-folders.
> > - When we initiate scheduler instance, we can specify “--subdir” for it,
> > which will specify the specific directory that the scheduler is going to
> > “scan” (https://airflow.apache.org/cli.html#scheduler).
> >
> > **2. Our Naive Idea**
> > Say we have 2,000 DAGs. If we run one single scheduler instance, one
> > scheduling loop will traverse all 2K DAGs.
> >
> > Our idea is:
> > Step-1: Create multiple sub-directories, say five, under DAG_Folder
> > (subdir1, subdir2, …, subdir5)
> > Step-2: Distribute the DAGs evenly into these sub-directories (400 DAGs
> in
> > each)
> > Step-3: then we can start scheduler instance on 5 different machines,
> > using command `airflow scheduler --subdir subdir` on machine .
> >
> > Hence eventually, each scheduler only needs to take care of 400 DAGs.
> >
> > **3. Test & Results**
> > - We have done a testing using 2,000 DAGs (3 tasks in each DAG).
> > - DAGs are stored using network attached storage (the same drive mounted
> > to all nodes), so we aren't concerned about DAG_Folder synchronization.
> > - No conflict observed (each DAG file will only be parsed & scheduled by
> > one scheduler instance).
> > - The scheduling speed improves almost linearly. Demonstrated that we can
> > scale scheduler horizontally.
> >
> > **4. Highlight**
> > - This naive idea doesn’t address scheduler availability.
> > - As Kevin Yang shared earlier in another thread, the database may be
> > another bottleneck when the load is high. But this is not considered here
> > yet.
> >
> >
> > Kindly share your thoughts on this naive idea. Thanks.
> >
> >
> >
> > Best regards,
> > XD
> >
> >
> >
> >
> >
>


Re: Duplicate key unique constraint error

2018-10-30 Thread Maxime Beauchemin
Also what's your schedule interval? I'm just trying to confirm that this
isn't a "run every minute, or anytime someone blinks" kind of DAG.

Max

On Tue, Oct 30, 2018 at 5:48 AM Brian Greene <
br...@heisenbergwoodworking.com> wrote:

> How do you trigger it externally?
>
> We have several custom operators that trigger other jobs and we had to be
> really careful or we’d get duplicate keys for the dag run and it would fail
> to kick off.
>
> One scheduler, but we saw it repeatedly and have it noted as a thing to
> watch out for.
>
> Brian
>
> Sent from a device with less than stellar autocorrect
>
> > On Oct 29, 2018, at 2:03 PM, Abhishek Sinha 
> wrote:
> >
> > Attaching the scheduler crash logs as well.
> >
> > https://pastebin.com/B2WEJKRB
> >
> >
> >
> >
> > Regards,
> >
> > Abhishek Sinha | m: +919035191078 | e: abhis...@infoworks.io
> >
> >
> > On Tue, Oct 30, 2018 at 12:19 AM Abhishek Sinha 
> > wrote:
> >
> >> Max,
> >>
> >> We always trigger the DAG externally. I am not sure if there is still
> any
> >> backfill involved.
> >>
> >> Is there a way where I can find out in logs, if more than one instance
> of
> >> scheduler is running?
> >>
> >>
> >> On 29 October 2018 at 10:43:19 PM, Maxime Beauchemin (
> >> maximebeauche...@gmail.com) wrote:
> >>
> >> The stacktrace seems to be pointing in that direction. Id check that
> >> first. It seems like it **could** be a race condition with a backfill as
> >> well, unclear.
> >>
> >> It's still a bug though, and the scheduler should make sure to handle
> this
> >> and not raise/crash.
> >>
> >> On Mon, Oct 29, 2018, 10:05 AM Abhishek Sinha 
> >> wrote:
> >>
> >>> Max,
> >>>
> >>> I do not think there was more than one instance of scheduler running.
> >>> Since the scheduler crashed and it has been restarted, I cannot
> confirm it
> >>> now. Is there any log that can provide this information?
> >>>
> >>> Could there be a different cause apart from multiple scheduler
> instances
> >>> running?
> >>>
> >>>
> >>> On 29 October 2018 at 9:30:56 PM, Maxime Beauchemin (
> >>> maximebeauche...@gmail.com) wrote:
> >>>
> >>> Abhishek, are you running more than one scheduler instance at once?
> >>>
> >>> Max
> >>>
> >>> On Mon, Oct 29, 2018 at 8:17 AM Abhishek Sinha 
> >>> wrote:
> >>>
> >>>> The issue is happening more frequently now. Can someone please look
> into
> >>>> this?
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 24 September 2018 at 12:42:49 PM, Abhishek Sinha (
> >>> abhis...@infoworks.io
> >>>> )
> >>>> wrote:
> >>>>
> >>>> Can someone please help in looking into this issue? It is critical
> since
> >>>> this has come up in one of our production environment. Also, this
> issue
> >>> has
> >>>> appeared only once till now.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Regards,
> >>>>
> >>>> Abhishek
> >>>>
> >>>> On 20-Sep-2018, at 10:18 PM, Abhishek Sinha 
> >>> wrote:
> >>>>
> >>>> Any update on this?
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Regards,
> >>>>
> >>>> Abhishek
> >>>>
> >>>> On 18-Sep-2018, at 12:48 AM, Abhishek Sinha 
> >>> wrote:
> >>>>
> >>>> Pastebin: https://pastebin.com/K6BMTb5K
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Regards,
> >>>>
> >>>> Abhishek
> >>>>
> >>>> On 18-Sep-2018, at 12:31 AM, Stefan Seelmann  >
> >>>> wrote:
> >>>>
> >>>> On 9/17/18 8:19 PM, Abhishek Sinha wrote:
> >>>>
> >>>> Any update on this?
> >>>>
> >>>> Please find the scheduler error log attached.
> >>>>
> >>>> Can you share the full python stack trace?
> >>>>
> >>>>
> >>>> Seems the mailing list doesn't allow attachments. Either post the
> >>>> stacktrace inline, or post it somewhere at pastebin or so.
> >>>>
> >>>
> >>>
>


Re: Duplicate key unique constraint error

2018-10-29 Thread Maxime Beauchemin
The stacktrace seems to be pointing in that direction. Id check that first.
It seems like it **could** be a race condition with a backfill as well,
unclear.

It's still a bug though, and the scheduler should make sure to handle this
and not raise/crash.

On Mon, Oct 29, 2018, 10:05 AM Abhishek Sinha  wrote:

> Max,
>
> I do not think there was more than one instance of scheduler running.
> Since the scheduler crashed and it has been restarted, I cannot confirm it
> now. Is there any log that can provide this information?
>
> Could there be a different cause apart from multiple scheduler instances
> running?
>
>
> On 29 October 2018 at 9:30:56 PM, Maxime Beauchemin (
> maximebeauche...@gmail.com) wrote:
>
> Abhishek, are you running more than one scheduler instance at once?
>
> Max
>
> On Mon, Oct 29, 2018 at 8:17 AM Abhishek Sinha 
> wrote:
>
> > The issue is happening more frequently now. Can someone please look into
> > this?
> >
> >
> >
> >
> > On 24 September 2018 at 12:42:49 PM, Abhishek Sinha (
> abhis...@infoworks.io
> > )
> > wrote:
> >
> > Can someone please help in looking into this issue? It is critical since
> > this has come up in one of our production environment. Also, this issue
> has
> > appeared only once till now.
> >
> >
> >
> >
> > Regards,
> >
> > Abhishek
> >
> > On 20-Sep-2018, at 10:18 PM, Abhishek Sinha 
> wrote:
> >
> > Any update on this?
> >
> >
> >
> >
> > Regards,
> >
> > Abhishek
> >
> > On 18-Sep-2018, at 12:48 AM, Abhishek Sinha 
> wrote:
> >
> > Pastebin: https://pastebin.com/K6BMTb5K
> >
> >
> >
> >
> > Regards,
> >
> > Abhishek
> >
> > On 18-Sep-2018, at 12:31 AM, Stefan Seelmann 
> > wrote:
> >
> > On 9/17/18 8:19 PM, Abhishek Sinha wrote:
> >
> > Any update on this?
> >
> > Please find the scheduler error log attached.
> >
> > Can you share the full python stack trace?
> >
> >
> > Seems the mailing list doesn't allow attachments. Either post the
> > stacktrace inline, or post it somewhere at pastebin or so.
> >
>
>


Re: Duplicate key unique constraint error

2018-10-29 Thread Maxime Beauchemin
Abhishek, are you running more than one scheduler instance at once?

Max

On Mon, Oct 29, 2018 at 8:17 AM Abhishek Sinha 
wrote:

> The issue is happening more frequently now. Can someone please look into
> this?
>
>
>
>
> On 24 September 2018 at 12:42:49 PM, Abhishek Sinha (abhis...@infoworks.io
> )
> wrote:
>
> Can someone please help in looking into this issue? It is critical since
> this has come up in one of our production environment. Also, this issue has
> appeared only once till now.
>
>
>
>
> Regards,
>
> Abhishek
>
> On 20-Sep-2018, at 10:18 PM, Abhishek Sinha  wrote:
>
> Any update on this?
>
>
>
>
> Regards,
>
> Abhishek
>
> On 18-Sep-2018, at 12:48 AM, Abhishek Sinha  wrote:
>
> Pastebin: https://pastebin.com/K6BMTb5K
>
>
>
>
> Regards,
>
> Abhishek
>
> On 18-Sep-2018, at 12:31 AM, Stefan Seelmann 
> wrote:
>
> On 9/17/18 8:19 PM, Abhishek Sinha wrote:
>
> Any update on this?
>
> Please find the scheduler error log attached.
>
> Can you share the full python stack trace?
>
>
> Seems the mailing list doesn't allow attachments. Either post the
> stacktrace inline, or post it somewhere at pastebin or so.
>


Re: Pinning dependencies for Apache Airflow

2018-10-19 Thread Maxime Beauchemin
Oh good to know! Scrap what I wrote then.

On Fri, Oct 19, 2018 at 9:08 AM Ash Berlin-Taylor  wrote:

> echo 'pandas==2.1.3' > constraints.txt
>
> pip install -c constraints.txt apache-airflow[pandas]
>
> That will ignore what ever we specify in setup.py and use 2.1.3.
> https://pip.pypa.io/en/latest/user_guide/#constraints-files
>
> (sorry for the brief message)
>
> > On 19 Oct 2018, at 17:02, Maxime Beauchemin 
> wrote:
> >
> >> releases in pip should have stable (pinned deps)
> > I think that's an issue. When setup.py (the only reqs that setuptools/pip
> > knows about) is restrictive, there's no way to change that in your
> > environment, install will just fail if you deviate (are there any
> > hacks/solutions around that that I don't know about???). For example if
> you
> > want a specific version of pandas in your env, and Airflow's setup.py has
> > another version of pandas pinned, you're out of luck. I think the only
> way
> > is to fork and make your own build at that point as you cannot alter
> > setup.py once it's installed. On the other hand, when a version range is
> > specified in setup.py, you're free to pin using your own reqs.txt within
> > the specified version range.
> >
> > I think pinning in setup.py is just not viable. setup.py should have
> > version ranges based semantic versioning expectations. (lib>=1.1.2,
> > <2.0.0). Personally I think we should always have 2 bounds based on
> either
> > 1-semantic versioning major release, or 2- a lower version than
> prescribed
> > by semver that we know breaks backwards compatibility features we
> require.
> >
> > I think we have consensus around something like pip-tools to generate a
> > "deterministic" `requirements.txt`. A caveat is we may need 2:
> > requirements.txt and requirements3.txt for Python 3 as some package
> > versions can be flagged as only py2 or only py3.
> >
> > Max
> >
> >
> > On Fri, Oct 19, 2018 at 1:47 AM Jarek Potiuk 
> > wrote:
> >
> >> I think i might have a proposal that could be acceptable by everyone in
> the
> >> discussion (hopefully :) ).  Let me summarise what I am leaning towards
> >> now:
> >>
> >> I think we can have a solution where it will be relatively easy to keep
> >> both "open" and "fixed" requirements (open in setup.py, fixed in
> >> requirements.txt). Possibly we can use pip-tools or poetry (including
> using
> >> of the poetry-setup <https://github.com/orsinium/poetry-setup> which
> seem
> >> to be able to generate setup.py/constraints.txt/requirements.txt from
> >> poetry setup). Poetry is still "new" so it might not work, then we can
> try
> >> to get similar approach with pip-tools or our own custom solution. Here
> are
> >> the basic assumptions:
> >>
> >>   - we can leave master with "open" requirements which makes it
> >>   potentially unstable with potential conflicting dependencies. We will
> >> also
> >>   document how to generate stable set of requirements (hopefully
> >>   automatically) and a way how to install from master using those. *This
> >>   addresses needs of people using master for active development with
> >> latest
> >>   libraries.*
> >>   - releases in pip should have stable (pinned deps). Upgrading pinned
> >>   releases to latest "working" stable set should be part of the release
> >>   process (possibly automated with poetry). We can try it out and decide
> >> if
> >>   we want to pin only direct dependencies or also the transitive ones (I
> >>   think including transitive dependencies is a bit more stable). *This
> way
> >>   we keep long-term "install-ability" of releases and make job of
> release
> >>   maintainer easier*.
> >>   - CI builds will use the stable dependencies from requirements.txt.
> >> *This
> >>   way we keep CI from dependency-triggered failures.*
> >>   - we add documentation on how to use pip --constraints mechanism by
> >>   anyone who would like to use airflow from PIP rather than sources, but
> >>   would like also to use other (up- or down- graded) versions of
> specific
> >>   dependencies. *This way we let active developers to work with airflow
> >>   and more recent/or older releases.*
> >>
> >> If we can have general consensus that we should try it, I might try to
> find
> >> some time next week to do so

Re: Pinning dependencies for Apache Airflow

2018-10-19 Thread Maxime Beauchemin
> releases in pip should have stable (pinned deps)
I think that's an issue. When setup.py (the only reqs that setuptools/pip
knows about) is restrictive, there's no way to change that in your
environment, install will just fail if you deviate (are there any
hacks/solutions around that that I don't know about???). For example if you
want a specific version of pandas in your env, and Airflow's setup.py has
another version of pandas pinned, you're out of luck. I think the only way
is to fork and make your own build at that point, as you cannot alter
setup.py once it's installed. On the other hand, when a version range is
specified in setup.py, you're free to pin using your own reqs.txt within
the specified version range.

I think pinning in setup.py is just not viable. setup.py should have
version ranges based on semantic versioning expectations (lib>=1.1.2,
<2.0.0). Personally I think we should always have 2 bounds, based on either
1) the semantic versioning major release, or 2) a lower version than
prescribed by semver that we know breaks backwards-compatibility features we
require.
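A small sketch of the shape this takes (package names and bounds are purely illustrative, not Airflow's actual pins): setup.py keeps semver-style ranges, while a generated requirements.txt pins exact versions.

    # setup.py -- open, semver-bounded ranges (illustrative names/versions)
    from setuptools import setup

    setup(
        name="example-airflow-like-project",
        install_requires=[
            "pandas>=0.17.1, <1.0.0",
            "flask-appbuilder>=1.11.1, <2.0.0",
        ],
    )

    # requirements.txt -- deterministic pins, e.g. generated with pip-compile
    # pandas==0.23.4
    # flask-appbuilder==1.12.0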

I think we have consensus around something like pip-tools to generate a
"deterministic" `requirements.txt`. A caveat is we may need 2:
requirements.txt and requirements3.txt for Python 3 as some package
versions can be flagged as only py2 or only py3.

Max


On Fri, Oct 19, 2018 at 1:47 AM Jarek Potiuk 
wrote:

> I think i might have a proposal that could be acceptable by everyone in the
> discussion (hopefully :) ).  Let me summarise what I am leaning towards
> now:
>
> I think we can have a solution where it will be relatively easy to keep
> both "open" and "fixed" requirements (open in setup.py, fixed in
> requirements.txt). Possibly we can use pip-tools or poetry (including using
> of the poetry-setup  which seem
> to be able to generate setup.py/constraints.txt/requirements.txt from
> poetry setup). Poetry is still "new" so it might not work, then we can try
> to get similar approach with pip-tools or our own custom solution. Here are
> the basic assumptions:
>
>- we can leave master with "open" requirements which makes it
>potentially unstable with potential conflicting dependencies. We will
> also
>document how to generate stable set of requirements (hopefully
>automatically) and a way how to install from master using those. *This
>addresses needs of people using master for active development with
> latest
>libraries.*
>- releases in pip should have stable (pinned deps). Upgrading pinned
>releases to latest "working" stable set should be part of the release
>process (possibly automated with poetry). We can try it out and decide
> if
>we want to pin only direct dependencies or also the transitive ones (I
>think including transitive dependencies is a bit more stable). *This way
>we keep long-term "install-ability" of releases and make job of release
>maintainer easier*.
>- CI builds will use the stable dependencies from requirements.txt.
> *This
>way we keep CI from dependency-triggered failures.*
>- we add documentation on how to use pip --constraints mechanism by
>anyone who would like to use airflow from PIP rather than sources, but
>would like also to use other (up- or down- graded) versions of specific
>dependencies. *This way we let active developers to work with airflow
>and more recent/or older releases.*
>
> If we can have general consensus that we should try it, I might try to find
> some time next week to do some "real work". Rather than implement it and
> make a pull request immediately, I think of a Proof Of Concept branch
> showing how it would work (with some artificial going back to older
> versions of requirements). I thought about pre-flaskappbuilder upgrade in
> one commit and update to post-flaskappbuilder upgrade in second, explaining
> the steps I've done to get to it. That would be much better for the
> community to discuss if that's the right approach.
>
> Does it sound good ?
>
> J.
>
> On Wed, Oct 17, 2018 at 2:21 AM Daniel (Daniel Lamblin) [BDP - Seoul] <
> lamb...@coupang.com> wrote:
>
> > On 10/17/18, 12:24 AM, "William Pursell" 
> > wrote:
> >
> > I'm jumping in a bit late here, and perhaps have missed some of the
> > discussion, but I haven't seen any mention of the fact that pinning
> > versions in setup.py isn't going to solve the problem.  Perhaps it's
> > my lack of experience with pip, but currently pip doesn't provide any
> > guarantee that the version of a dependency specified in setup.py will
> > be the version that winds up being installed.  Is this a known issue
> > that is being intentionally ignored because it's hard (and out of
> > scope) to solve?  I agree that versions should be pinned in setup.py
> > for stable releases, but I think we need to be aware that this won't
> > solve the problem.
> >
> > So the problem is going to be

Re: Pinning dependencies for Apache Airflow

2018-10-07 Thread Maxime Beauchemin
pip-tools can definitely help here to ship a reference [locked]
`requirements.txt` that can be used in [all or part of] the CI. It's
actually kind of important to get CI to fail when a new [backward
incompatible] lib comes out and breaks things, while still allowing version ranges.

I think there may be challenges around pip-tools and projects that run in
both python2.7 and python3.6. You sometimes need to have 2 requirements.txt
lock files.

Max

On Sun, Oct 7, 2018 at 5:06 AM Jarek Potiuk 
wrote:

> It's a nice one :). However I think when/if we go to pinned dependencies
> the way poetry/pip-tools do it, this will be suddenly lot-less useful It
> will be very easy to track dependency changes (they will be always
> committed as a change in the .lock file or requirements.txt) and if someone
> has a problem while upgrading a dependency (always consciously, never
> accidentally) it will simply fail during CI build and the change won't get
> merged/won't break the builds of others in the first place :).
>
> J.
>
> On Sun, Oct 7, 2018 at 6:26 AM Deng Xiaodong  wrote:
>
> > Hi folks,
> >
> > On top of this discussion, I was thinking we should have the ability to
> > quickly monitor dependency release as well. Previously, it happened for a
> > few times that CI kept failing for no reason and eventually turned out it
> > was due to dependency release. But it took us some time, sometimes a few
> > days, to realise the failure was because of dependency release.
> >
> > To partially address this, I tried to develop a mini tool to help us
> check
> > the latest release of Python packages & the release date-time on PyPi.
> So,
> > by comparing it with our CI failure history, we may be able to
> troubleshoot
> > faster.
> >
> > Output Sample (ordered by upload time in desc order):
> >Latest Version  Upload Time
> > Package Name
> > awscli1.16.28
> 2018-10-05T23:12:45
> > botocore1.12.18  2018-10-05T23:12:39
> > promise   2.2.1
> 2018-10-04T22:04:18
> > Keras 2.2.4
>  2018-10-03T20:59:39
> > bleach3.0.0
> 2018-10-03T16:54:27
> > Flask-AppBuilder 1.12.02018-10-03T09:03:48
> > ... ...
> >
> > It's a minimal tool (not perfect yet but working). I have hosted this
> tool
> > at https://github.com/XD-DENG/pypi-release-query.
> >
> >
> > XD
> >
> > On Sat, Oct 6, 2018 at 12:25 AM Jarek Potiuk 
> > wrote:
> >
> > > Hello Erik,
> > >
> > > I understand your concern. It's a hard one to solve in general (i.e.
> > > dependency-hell). It looks like in this case you treat Airflow as
> > > 'library', where for some other people it might be more like 'end
> > product'.
> > > If you look at the "pinning" philosophy - the "pin everything" is good
> > for
> > > end products, but not good for libraries. In the case you have Airflow
> is
> > > treated as a bit of both. And it's perfectly valid case at that (with
> > > custom python DAGs being central concept for Airflow).
> > > However, I think it's not as bad as you think when it comes to exact
> > > pinning.
> > >
> > > I believe - a bit counter-intuitively - that tools like
> pip-tools/poetry
> > > with exact pinning result in having your dependencies upgraded more
> > often,
> > > rather than less - especially in complex systems where dependency-hell
> > > creeps-in. If you look at Airflow's setup.py now - It's a bit scary to
> > make
> > > any change to it. There is a chance it will blow at your face if you
> > change
> > > it. You never know why there is 0.3 < ver < 1.0 - and if you change it,
> > > whether it will cause chain reaction of conflicts that will ruin your
> > work
> > > day.
> > >
> > > On the contrary - if you change it to exact pinning in
> > > .lock/requirements.txt file (poetry/pip-tools) and have much simpler
> (and
> > > commented) exclusion/avoidance rules in your .in/.tml file, the whole
> > setup
> > > might be much easier to maintain and upgrade. Every time you prepare
> for
> > > release (or even once in a while for master) one person might
> consciously
> > > attempt to upgrade all dependencies to latest ones. It should be almost
> > as
> > > easy as letting poetry/pip-tools help with figuring out what are the
> > latest
> > > set of dependencies that will work without conflicts. It should be
> rather
> > > straightforward (I've done it in the past for fairly complex systems).
> > What
> > > those tools enable is - doing single-shot upgrade of all dependencies.
> > > After doing it you can make sure that all tests work fine (and fix any
> > > problems that result from it). And then you test it thoroughly before
> you
> > > make final release. You can do it in separate PR - with automated
> testing
> > > in Travis which means that you are not disturbing work of others
> > > (compilation/building + unit tests are guaranteed to work before you
> > merge
> > > it) while doing it. It's all conscious 

Re: Manual validation operator

2018-10-05 Thread Maxime Beauchemin
It's a bit of a hack, but to save up concurrency slots you could just have an
instantly-failing PythonOperator (just raise an exception in the callable)
that would go into a failed state. Marking it as "success" when the
conditions are met would act as a trigger.
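-
A rough sketch of that hack (a sketch only - the task and function names
are made up and `dag` is assumed to be defined):

from airflow.operators.python_operator import PythonOperator

def _await_manual_approval(**context):
    # Always fail: the task parks itself in the "failed" state until a
    # reviewer marks it "success" in the UI, which unblocks downstream tasks.
    raise ValueError("Waiting for manual approval - mark success to proceed")

manual_gate = PythonOperator(
    task_id='manual_approval_gate',
    python_callable=_await_manual_approval,
    provide_context=True,
    retries=0,  # don't retry; stay failed until a human intervenes
    dag=dag,
)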

On Fri, Oct 5, 2018 at 9:07 AM Brian Greene 
wrote:

> My first thought was this, but my understanding is   That if you had a
> large number of dags “waiting” the sensor would consume all the concurrency.
>
> And what if the user doesn’t approve?
>
> How about the dag you have as it’s last step writes to an api/db the
> status.
>
> Then 2 other dags (or one with a branch) can each have a sensor that’s
> watching for approved/unapproved values.  When it finds one (or a batch
> depending on how you write it), trigger the “next” dag.
>
> This leaves only 1-2 sensors running and would enable your process without
> anyone using the airflow UI (assuming they have some other way to mark
> “approval”).  This avoids the “process by error and recover” logic it seems
> like you’d like to get out of.  (Which makes sense to me)
>
> B
>
> Sent from a device with less than stellar autocorrect
>
> > On Oct 4, 2018, at 10:17 AM, Alek Storm  wrote:
> >
> > Hi Björn,
> >
> > We also sometimes require manual validation, and though we haven't yet
> > implemented this, I imagine you could store the approved/unapproved
> status
> > of the job in a database, expose it via an API, and write an Airflow
> sensor
> > that continuously polls that API until the status becomes "approved", at
> > which point the DAG execution will continue.
> >
> > Best,
> > Alek Storm
> >
> > On Thu, Oct 4, 2018 at 10:05 AM Björn Pollex
> >  wrote:
> >
> >> Hi all,
> >>
> >> In some of our workflows we require a manual validation step, where some
> >> generated data has to be reviewed by an authorised person before the
> >> workflow can continue. We currently model this by using a custom dummy
> >> operator that always fails. After the review, we manually mark it as
> >> success and clear the downstream tasks. This works, but it would be
> nice to
> >> have better representation of this in the UI. The customisation points
> for
> >> plugins don’t seem to offer any way of customising UI for specific
> >> operators.
> >>
> >> Does anyone else have similar use cases? How are you handling this?
> >>
> >> Cheers,
> >>
> >>Björn Pollex
> >>
> >>
>


Re: Airflow Docs - RTD vs Apache Site

2018-10-05 Thread Maxime Beauchemin
A few thoughts:
* we absolutely have to serve a project site off of `airflow.apache.org`,
that's an ASF requirement
* maybe `airflow.apache.org` could be set up as a proxy to
readthedocs-latest (?) [I'm on vacation and have very slow internet, so
didn't research whether that's a documented use-case, we could also ask
Apache-INFRA about it]
* we could (and really should) split the project site and the documentation
into two different sites; that assumes we'd have someone drive creating a
proper, professional-looking project site that would link out to the docs
on "Read the Docs". Creating a project site is not that much work and could
be a rewarding project for someone in the community. Many static site
builder frameworks work off of the "markdown" format, and it's possible to
auto-convert RST (the format we use) to markdown. It'd be nice
to take fresh screenshots of the UI while at it!

Max

On Wed, Oct 3, 2018 at 6:13 AM Kaxil Naik  wrote:

> Hi all,
>
> Continuing discussion from Slack, many users have had the problem with
> looking at a wrong version of the documentation. Currently, our docs on
> apache.airflow.org don't properly state version. Although we have
> specified
> this info on our Github readme and confluence, there has still been lots of
> confusion among the new users who try to google for the docs and are
> pointed to airflow.apache.org site which doesn't have version info.
>
> The problem currently with a.a.o site is it needs to be manually built and
> only has stable version docs. We can do 2 things if we don't want to
> redirect a.a.o with RTD: (1) Maintain History on our static a.a.o site (2)
> Point a.a.o site to RTD docs, so a.a.o would point to RTD docs i.e. add the
> domain to RTD site
>
> Ash has also suggested another approach:
>
> > Apache Infra run a jenkins instance (or other build bot type things) that
> > we might be able to use for autobuilding docs if we want?
>
>
>
> Let's discuss this and decide on a single-approach that is user-friendly.
>
> NB: I will be busy for a month, hence won't be able to actively help with
> this, so please feel free to contribute/commit after an approach is
> finalized.
>
> Regards,
> Kaxil
>


Re: execution_date - can we stop the confusion?

2018-10-01 Thread Maxime Beauchemin
ase is a distinct
> > > > > event
> > > >
> > > > > from when Tasks associated with that DagRun start executing. Plus
> > > > > DagRuns
> > > >
> > > > > themselves don't actually "run" - Tasks are the only thing that
> > > > > really
> > > > > gets
> > > >
> > > > > run by Airflow.
> > > > > What is actually desired here?
> > > > >
> > > > > -   The right bound of the schedule interval?
> > > > > -   The time the DagRun was created?
> > > > > -   The time that any Tasks associated with a DagRun were first
> > > > > considered
> > > > >
> > > >
> > > > > by the scheduler?
> > > > >
> > > > > -   The time that any Tasks associated with a DagRun were first
> > > > > scheduled?
> > > > >
> > > >
> > > > > -   The time that any Tasks associated with a DagRun were actually
> > > > > started
> > > > >
> > > >
> > > > > by a worker?
> > > > > The lack of clarity and completeness around these suggestions,
> > > > > alongside
> > > >
> > > > > inane declarations like "This name won't cause people to get
> > > > > confused" is
> > > >
> > > > > hardly a good way to get people to take suggestions seriously.
> > > > > Chris
> > > > > On Wed, Sep 26, 2018 at 7:37 PM George Leslie-Waksman
> > > > >  > > >
> > > > > wrote:
> > > > >
> > > > > > This comes up a lot. I've seen it on this mailing list multiple
> > > > > > times
> > > > > > and
> > > >
> > > > > > it's something that I have to explicitly call out to every single
> > > > > > person
> > > >
> > > > > > that I've helped train up on Airflow.
> > > > > > If we take a moment to set aside why things are the way they are,
> > > > > > what
> > > > > > the
> > > >
> > > > > > documentation says, and how experienced users feel things should
> > > > > > behave;
> > > >
> > > > > > there still remains the fact that a lot of new users get confused
> > > > > > by how
> > > >
> > > > > > "execution_date" works.
> > > > > > Whether it's a problem, whether we need to do something, and what
> > > > > > we
> > > > > > could
> > > >
> > > > > > do are all separate questions but I think it's important that we
> > > > > > acknowledge and start from:
> > > > > > A lot of new users get confused by how "execution_date" works.
> > > > > > I recognize that some of this is a learning curve issue and some
> > > > > > of
> > > > > > this is
> > > >
> > > > > > a mindset issue but it begs the question: do enough users benefit
> > > > > > from
> > > > > > the
> > > >
> > > > > > current structure to justify the harm to new users?
> > > > > > --George
> > > > > > On Wed, Sep 26, 2018 at 1:40 PM Brian Greene <
> > > > > > br...@heisenbergwoodworking.com> wrote:
> > > > > >
> > > > > > > It took a minute to grok, but in the larger context of how af
> > > > > > > works it
> > > >
> > > > > > > makes perfect sense the way it is. Changing something so
> > > > > > > fundamentally
> > > >
> > > > > > > breaking to every dag in existence should bring a comparable
> > > > > > > benefit.
> > > >
> > > > > > > Beyond the avoiding teaching a concept you disagree with, what
> > > > > > > benefits
> > > >
> > > > > > > does the proposal bring to offset the cost of change?
> > > > > > > I’m gonna make a meme - “do you even airflow bro?”
> > > > > > > Sent from a device with less than stellar autocorrect
> > > > > > >
> > > > > > > > On S

Flask App Builder [FAB] support

2018-09-28 Thread Maxime Beauchemin
The new [experimental] web UI is based on Flask App Builder [FAB] to which
we contributed security fixes recently.

It's been hard to get the main maintainer of the project's attention to
release to Pypi. For context, I have write access to the repository, but no
access to release to Pypi.

Please show your support by adding reactions or nice, supportive comments
on  https://github.com/dpgaspar/Flask-AppBuilder/issues/808

If you think it's a very useful and cool project (which it is!) and have a
bit of extra time you can commit, you can also offer to help maintain the
project, as the two people with write access to the repo clearly
have their hands full.

Thanks!

Max


Re: execution_date - can we stop the confusion?

2018-09-26 Thread Maxime Beauchemin
I think if you have a functional mindset (as in "functional data engineering")
as opposed to a cron mindset, using the left bound of the time interval
makes a lot of sense. Things like your daily table partition keys align
with your Airflow execution_date.
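-
For instance (a sketch only - the table names are made up and `dag` is
assumed to be defined), a daily job templated on `ds` - the left bound of
the schedule interval - writes exactly the partition that matches its own
execution_date:

from airflow.operators.hive_operator import HiveOperator

load_partition = HiveOperator(
    task_id='load_daily_partition',
    hql="""
        INSERT OVERWRITE TABLE fact_events PARTITION (ds='{{ ds }}')
        SELECT *
        FROM raw_events
        WHERE ds = '{{ ds }}'
    """,
    dag=dag,
)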

The main thing is that whatever we do we cannot break backwards
compatibility. Offering both views (left bound/right bound), as it's been
proposed before, either as an environment setting or a user personal
preference is even more confusing to me personally. Users would have to
switch context as they help each other or change environments.

Also note that your intuition may differ from other people's intuition, and
that "unlearning" something is way harder than learning something.

My personal take on this is to make this a rite of passage. This is just
one of the many things you have to learn when learning Airflow.

Max

On Wed, Sep 26, 2018 at 8:18 AM Sam Elamin  wrote:

> Hi Bolke
>
> Speaking as a consultant who is constantly training other teams how to use
> airflow, I do frequently see this confusion.
> Another one is how the batch_date is always batch_date + interval or as the
> docs make it quite clear
>
> "*Let’s Repeat That* The scheduler runs your job one schedule_interval
> AFTER
> the start date, at the END of the period."
>
> Renaming it would make it simpler for newbies, but essentially they will
> need to understand how Airflow behaves, execution_date being the batch
> execution date not the run_date of the DAG
>
> I am actually in the process of writing a blog post
> 
> about this which I could use peoples feedback
>
> If it helps, I find that explaining how backfills work and why they are
> important will drive home what the execution_date is :)
>
>
> Regards
> Sam
>
>
>
> On Wed, Sep 26, 2018 at 4:10 PM Bolke de Bruin  wrote:
>
> > I dont think this makes sense and I dont that think anyone had a real
> > issue with this. Execution date has been clearly documented  and is part
> of
> > the core principles of airflow. Renaming will create more confusion.
> >
> > Please note that I do think that as an anonymous user you cannot speak
> for
> > any "new airflow user". That is a contradiction to me.
> >
> > Thanks
> > Bolke
> >
> > Sent from my iPhone
> >
> > > On 26 Sep 2018, at 07:59, airflowuser  .INVALID>
> > wrote:
> > >
> > > One of the most annoying, hard to understand and against all common
> > sense is the execution_date behavior. I assume that any new Airflow user
> > has been struggling with it.
> > > The amount of questions with answers referring to :
> > https://airflow.apache.org/scheduler.html?scheduling-triggers  is
> > uncountable.
> > >
> > > Most people mistakenly think that execution_date is the datetime which
> > the DAG started to run.
> > >
> > > I suggest the following changes:
> > > 1. Renaming the execution_date to something else like: run_stamped
> >  This name won't cause people to get confused.
> > > 2. Adding a new variable which indicated the actual datetime when the
> > DAG run was generated. call it execution_start_date. People seem to want
> > the information when the DAG actually started to be executed/run.
> > >
> > > This is only naming changes. No need to actual change the behavior -
> > This will only make things simpler as when user encounter  run_stamped
> he
> > won't be confused by the name like execution_date
> >
>


Re: Airflow: Apache Graduation

2018-09-20 Thread Maxime Beauchemin
Yeah let's make it happen! I'm happy to set some time aside to help with
the final push.

Max

On Thu, Sep 20, 2018 at 9:53 AM Sid Anand  wrote:

> Folks! (specifically Bolke, Fokko, Ash)
> What's needed to graduate from Apache?
>
> Can we make 1.10.1 be about meeting our licensing needs to allow us to
> graduate?
>
> -s
>


Re: Connection Management in Multi-tenancy Scenario

2018-09-19 Thread Maxime Beauchemin
Another clear solution is for connection management to go through the
[upcoming] REST API we've been talking about. Then of course we'll need one
permission per connection and an "all_connections" perm that can be added to
roles (much like DAGs but for connections).

Max

On Wed, Sep 19, 2018 at 7:25 AM Ash Berlin-Taylor  wrote:

> You are correct that currently all DAGs can access all connections and
> variables.
>
> The other thing to bear in mind: currently PythonOperators have an active
> connection to the metadata DB where connections are stored, so at best this
> is "co-operative" security, to prevent one team from accessing another
> team's connections, and not a hard barrier against an even mildly
> determined attacker.
>
> As for the implementation of it: it would be worth looking to see if we
> can use the Permissions model built in to FAB (Flask App Builder) that we
> are using in the RBAC-based UI. This would allow for much more granular
> permissions, and provides a pre-existing management UI for it to.
>
> I don't know if this would make the work dependent on the (in progress?)
> DAG-level access controls.
>
> -ash
>
> > On 19 Sep 2018, at 15:00, Deng Xiaodong  wrote:
> >
> > Hi folks,
> >
> > Thinking of a scenario: I may have multiple users in the same Airflow
> > instance. I can use filter_by_owner feature so that each user can only
> see
> > their own DAGs. But what if their DAGs are using different data sources,
> > say owner A is using mysql_conn_a, and owner B is using mysql_conn_b, and
> > we don't want to allow them to access each other's database?
> >
> > Seems like all DAG (no matter who is the owner) can access all defined
> > connections? or have I missed something?
> >
> > If my suspicion is making sense, I think it would be necessary to have
> > values "*if_protect*" and "*owner*" for each connection. When
> "if_protect"
> > == True, only DAGs whose owner == "owner" would be able to use this
> > connection. I would like to take this up to prepare a PR.
> >
> > Thanks.
> >
> > XD
>
>


Re: Database referral integrity

2018-09-18 Thread Maxime Beauchemin
The database migration creating the FK will/would need to have something
that either creates the dummy missing PKs first, or deletes the orphaned
rows, to ensure the operation of creating the FK doesn't error out. Seems like
adding dummy keys is a better approach. Then you'll have to make sure that
everywhere FKs are created there are no edge cases of missing
PKs. Then some delete operations in some cases may have to "cascade"
properly.
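-
For illustration only, a hedged sketch of what such a migration could look
like with Alembic (this is not an actual Airflow revision; the SQL is
Postgres-flavored, it takes the simpler delete-orphans route, and the
ON DELETE CASCADE is only there to show the cascade semantics):

# hypothetical Alembic migration - not a real Airflow revision
from alembic import op

def upgrade():
    # Remove task instances that reference a dag_run that doesn't exist,
    # so that creating the FK below can't error out.
    op.execute("""
        DELETE FROM task_instance ti
        WHERE NOT EXISTS (
            SELECT 1 FROM dag_run dr
            WHERE dr.dag_id = ti.dag_id
              AND dr.execution_date = ti.execution_date
        )
    """)
    op.create_foreign_key(
        'task_instance_dag_run_fkey',
        'task_instance', 'dag_run',
        ['dag_id', 'execution_date'],
        ['dag_id', 'execution_date'],
        ondelete='CASCADE',  # deleting a dag_run cascades to its task instances
    )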

The Django Admin had this nice confirm screen on delete that would show you
clearly the scope of the cascading operation when deleting objects. To my
knowledge Flask-Admin and FAB don't have such a feature. Personally I
wouldn't allow cascade unless such a feature is implemented in some way.
Note that SQLAlchemy has builtin semantics for specifying how/whether
cascading should take place.

Personally I think referential integrity is overrated in some cases,
especially when using meaningful "business keys" (as opposed to
auto-incremented "surrogate" keys) as PKs. It also slows down insert
operations. For data warehousing (this clearly doesn't apply to the Airflow
metadata store), the best practice on most db engines is to **not** enforce
FK constraints as it slows down inserts in fact tables and outright
prevents bulk loading.

Another approach is to avoid deleting in general, especially rows referenced
by FKs, and to set some activity/visibility flag to false instead.

Max

On Tue, Sep 18, 2018 at 10:47 AM Driesprong, Fokko 
wrote:

> I'm in favor of having referential integrity. It will add some load in
> having to enforce the referential integrity, but it will also make sure
> that the database stays clean. Also in Airflow we use transactions which
> will make sure that the integrity checks are not validated on every
> statement, but after the commit. I'm happy to help with this as well.
>
> Cheers, Fokko
>
> Op di 18 sep. 2018 om 11:07 schreef Bolke de Bruin :
>
> > Adding these kind of checks which work for integrity well make database
> > access pretty slow. In addition it isnt there because in the past there
> was
> > no strong connection between for example tasks and dagruns, it was more
> or
> > less just coincidental. There also so some bisecting tools that probably
> > have difficulty functioning in a new regime. In other words it is not an
> > easy change and it will have operational challenges.
> >
> > > On 18 Sep 2018, at 11:03, Ash Berlin-Taylor  wrote:
> > >
> > > Ooh good spot.
> > >
> > > Yes I would be in favour of adding these, but as you say we need to
> > thing about how we might migrate old data.
> > >
> > > Doing this at 2.0.0 and providing a cleanup script (or doing it as part
> > of the migration?) is probably the way to go.
> > >
> > > -ash-
> > >
> > >> On 17 Sep 2018, at 19:56, Stefan Seelmann 
> > wrote:
> > >>
> > >> Hi,
> > >>
> > >> looking into the DB schema there is almost no referral integrity
> > >> enforced at the database level. Many foreign key constraints between
> > >> dag, dag_run, task_instance, xcom, dag_pickle, log, etc would make
> sense
> > >> IMO.
> > >>
> > >> Is there a particular reason why that's not implemented?
> > >>
> > >> Introducing it now will be hard, probably any real-world setup has
> some
> > >> violations. But I'm still in favor of this additional safety net.
> > >>
> > >> Kind Regards,
> > >> Stefan
> > >
> >
> >
>


Re: Guidelines on Contrib vs Non-contrib

2018-09-18 Thread Maxime Beauchemin
+1 for deprecating operators/hooks as plugins, let's use Python's good old
python packages and maybe python "entry points" if we want to inject them
in "airflow.operators"/"airflow.hooks" (which is probably not necessary)

On Tue, Sep 18, 2018 at 2:12 AM Ash Berlin-Taylor  wrote:

> Operators and hooks don't need any special plugin system - simply having
> them as as separate Python modules which are imported using normal python
> semantics is enough.
>
> In fact now that I think about it: I want to deprecate the plugins
> registering hooks/operators etc and limit it to only bits which a simple
> python import can't manage - which I think is only anything that needs to
> be registered with another system, such as custom routes in the web UI.
>
> I'll draft an AIP for this soon.
>
> -ash
>
>
> > On 18 Sep 2018, at 00:50, George Leslie-Waksman 
> wrote:
> >
> > Given we have a plugin system, could we alternatively move away from
> > keeping non-core supported code outside of the core project/repo?
> >
> > It would hugely decrease the surface area of the main repository and
> > testing infrastructure to get most of the contrib code out to its own
> place.
> >
> > Further, it would decrease the committer burden of having to
> approve/merge
> > code that is not supposed to be their responsibility.
> >
> > On Mon, Sep 17, 2018 at 4:37 PM Tim Swast 
> wrote:
> >
> >>> Individual operators and hooks living in separate repositories on
> github
> >> (or possibly other Apache projects), which are then distributed by pip
> and
> >> installed as libraries seems like it would scale better.
> >>
> >> Pandas did this about a year ago, and it's seemed to have worked well.
> For
> >> example, pandas.read_gbq is a very thin wrapper around
> pandas_gbq.read_gbq
> >> (distributed as a separate package). It has made it easier for me to
> track
> >> issues corresponding to my area of expertise.
> >>
> >> On Sun, Sep 16, 2018 at 1:25 PM Jakob Homan  wrote:
> >>
>  My understanding as a contributor is that if a hook/operator is in
> >> core,
> >>> it
>  means that a committer is willing to take personal responsibility to
>  maintain it (or at least help maintain it), and everything else goes
> in
>  contrib.
> >>>
> >>> That's not correct.  All of the code is owned by the entire
> >>> community[1]; no one person is responsible for any of it.  There's no
> >>> silos, fiefdoms, walled gardens, etc.  If the community cannot support
> >>> a piece of code it should be deprecated and subsequently removed.
> >>>
> >>> Contrib sections are almost always problematic for this reason.
> >>> Hadoop ended up abandoning its.  Because Airflow acts as a gathering
> >>> point for so many disparate technologies (databases, storage systems,
> >>> compute engines, etc.), trying to keep all of them corralled and up to
> >>> date will be very difficult.  Individual operators and hooks living in
> >>> separate repositories on github (or possibly other Apache projects),
> >>> which are then distributed by pip and installed as libraries seems
> >>> like it would scale better.
> >>>
> >>> -Jakob
> >>>
> >>> [1]
> https://blogs.apache.org/foundation/entry/success-at-apache-a-newbie
> >>>
> >>> On 15 September 2018 at 13:29, Jeff Payne  wrote:
>  How many operators are added to contrib per month? Is it too many to
> >>> make the decision case by case? If so, then the above mentioned rule
> >> sounds
> >>> fairly reasonable. However, if that's the rule, shouldn't a bunch of
> >>> existing modules be moved from contrib to core?
> 
>  Get Outlook for Android
> 
>  
>  From: Taylor Edmiston 
>  Sent: Saturday, September 15, 2018 1:13:47 PM
>  To: dev@airflow.incubator.apache.org
>  Subject: Re: Guidelines on Contrib vs Non-contrib
> 
>  My understanding as a contributor is that if a hook/operator is in
> >> core,
> >>> it
>  means that a committer is willing to take personal responsibility to
>  maintain it (or at least help maintain it), and everything else goes
> in
>  contrib.
> 
>  *Taylor Edmiston*
>  Blog  | LinkedIn
>   | Stack Overflow
>   | Developer
> >>> Story
>  
> 
> 
> 
>  On Sat, Sep 15, 2018 at 2:02 PM Kaxil Naik 
> >> wrote:
> 
> > Hi, all (mainly contributors),
> >
> > Can we decide on a common guideline on when a hook/operator should go
> >>> under
> > contrib vs core?
> >
> > Regards,
> >
> > *Kaxil Naik*
> > *Big Data Consultant *@ *Data Reply UK*
> > *Certified *Google Cloud Data Engineer | *Certified* Apache Spark &
> >>> Neo4j
> > Developer
> > *Phone: *+44 (0) 74820 88992
> > *LinkedIn*: https://www.linkedin.com/in/kaxil
> >
> >>>
> >> --
> >> *  •  **T

Re: Sep Airflow Bay Area Meetup @ Google

2018-09-17 Thread Maxime Beauchemin
Lyft could probably host if we want to schedule something last minute while
you and your crew are in town @Bolke. Maybe a one day get together + some
hacking. Do you want to start another thread to assess interest?

Max

On Fri, Sep 14, 2018 at 11:42 PM Feng Lu  wrote:

> Not going to happen for this time, we don't receive enough interest from
> the community.
>
>
> On Wed, Sep 12, 2018 at 7:57 AM Bolke de Bruin  wrote:
>
> > btw how are we doing on the “one day” hackathon?
> >
> > > On 12 Sep 2018, at 16:49, Bolke de Bruin  wrote:
> > >
> > > Hi feng,
> > >
> > > I can do “Elegant pipelining with Airflow” recycle of pydata 2018
> > amsterdam (that I did together with Fokko).
> > >
> > > Cheers
> > > Bolke
> > >
> > >> On 4 Sep 2018, at 22:13, Feng Lu  wrote:
> > >>
> > >> We are 3 weeks away from the meetup and still have a few lightening
> > talks
> > >> open, please take the chance and share your cool ideas/work ;)
> > >> Meanwhile, speakers could you please send me and Trishka (
> > tris...@google.com)
> > >> your slides?
> > >>
> > >> Thank you.
> > >>
> > >> Feng
> > >>
> > >> On Sun, Aug 12, 2018 at 9:46 PM Maxime Beauchemin <
> > >> maximebeauche...@gmail.com> wrote:
> > >>
> > >>> Hey Feng,
> > >>>
> > >>> Sign me up for a session on "Challenges ahead - taking airflow to the
> > next
> > >>> level". I'm planning on recycling the content from the talk @Google
> > next
> > >>> Friday.
> > >>>
> > >>> Max
> > >>>
> > >>> On Fri, Aug 10, 2018 at 3:22 PM Feng Lu 
> > wrote:
> > >>>
> > >>>> Hi all,
> > >>>>
> > >>>> We still have 1-2 regular sessions and 4-5 lightening sessions
> > available,
> > >>>> please send in your talks ;)
> > >>>> Here's a quick summary on the talks I have received:
> > >>>>
> > >>>> Regular sessions:
> > >>>> Ben Gregory (Astronomer): Running Cloud Native Airflow.
> > >>>> Feng Lu (Google): Managing Airflow As a Service: Best Practices,
> > >>> Experience
> > >>>> and Roadmap
> > >>>> Fokko Driesprong (GoDataDriven): Apache Airflow in the Google Cloud:
> > >>>> Backfilling streaming data using Dataflow
> > >>>>
> > >>>> Lightening Session:
> > >>>> Barni Seetharaman (Google): Deploy Airflow on Kubernetes using
> Airflow
> > >>>> Operator
> > >>>>
> > >>>> Session type TBD:
> > >>>> Manish Ranjan (Tile): Functional yet cost-effective Data Engineering
> > With
> > >>>> Airflow
> > >>>>
> > >>>> Thanks and looking forward to the meetup(120+ sign-ups to date)!
> > >>>>
> > >>>>
> > >>>> Feng
> > >>>>
> > >>>> On Thu, Jul 19, 2018 at 2:26 PM Feng Lu  wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> Hope you are enjoying your summer!
> > >>>>>
> > >>>>> This is Feng Lu from Google and we'll host the next Airflow meetup
> in
> > >>>> our Sunnyvale
> > >>>>> campus. We plan to add a *lightening session* this time for people
> to
> > >>>>> share their airflow ideas, work in progress, pain points, etc.
> > >>>>> Here's the meetup date and schedule:
> > >>>>>
> > >>>>> -- Sep 24 (Monday)  --
> > >>>>> 6:00PM meetup starts
> > >>>>> 6:00 - 8:00PM light dinner /mix-n-mingle
> > >>>>> 8:00PM - 9:40PM: 5 sessions (20 minutes each)
> > >>>>> 9:40PM to 10:10PM: 6 lightening sessions (5 minutes each)
> > >>>>> 10:10PM to 11:00PM: drinks and social hour
> > >>>>>
> > >>>>> I've seen a lot of interesting discussions in the dev mailing-list
> on
> > >>>>> security, scalability, event interactions, future directions,
> hosting
> > >>>>> platform and others. Please feel free to send your talk proposal to
> > us
> > >>> by
> > >>>>> replying this email.
> > >>>>>
> > >>>>> The Cloud Composer team is also going to share their experience
> > running
> > >>>>> Apache Airflow as a managed solution and service roadmap.
> > >>>>>
> > >>>>> Thank you and looking forward to hearing from y'all soon!
> > >>>>>
> > >>>>> p.s., if folks are interested, we can also add a one-day Airflow
> > >>>> hackathon
> > >>>>> prior to the meet-up on the same day, please let us know.
> > >>>>>
> > >>>>> Feng
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >
> >
> >
>


Re: Duplicate key unique constraint error

2018-09-12 Thread Maxime Beauchemin
Can you share the full python stack trace?

On Wed, Sep 12, 2018 at 5:31 PM Abhishek Sinha 
wrote:

> Got the following error on Airflow 1.8.2 version:
>
> duplicate key value violates unique constraint "task_instance_pkey"
>
> DETAIL: Key (task_id, dag_id, execution_date)=(PB_BPNZ, master_v2,
> 2018-09-12 03:00:37) already exists.n [SQL: 'INSERT INTO task_instance
> (task_id, dag_id, execution_date, start_date, end_date, duration, state,
> try_number, hostname, unixname, job_id, pool, queue, priority_weight,
> operator, queued_dttm, pid) VALUES (%(task_id)s, %(dag_id)s,
> %(execution_date)s, %(start_date)s, %(end_date)s, %(duration)s, %(state)s,
> %(try_number)s, %(hostname)s, %(unixname)s, %(job_id)s, %(pool)s,
> %(queue)s, %(priority_weight)s, %(operator)s, %(queued_dttm)s, %(pid)s)']
> [parameters: ({'hostname': u'', 'end_date': None, 'task_id': 'PB_BPNZ',
> 'queued_dttm': None, 'execution_date': datetime.datetime(2018, 9, 12, 3, 0,
> 37), 'pid': None, 'state': None, 'try_number': 0, 'queue': 'default',
> 'operator': None, 'unixname': 'infoworksuser', 'pool': None, 'duration':
> None, 'priority_weight': 65, 'start_date': None, 'dag_id': 'master_v2',
> 'job_id': None})]
>
>
>
>
>
>
>
> Regards,
>
> Abhishek
>
>


Re: Visual Correlation in Superset

2018-09-12 Thread Maxime Beauchemin
The main thing that "scoped filters" and "visual correlation" have in
common is the fact that they imply some sort of cross-chart communication.
Christine, I tagged you here so you'd see the diagram that Grace shared.

Let's take this to dev@. I think we need an independent SIP for chart
messaging (pub-sub). Filtering events may or may not be pubsub events. I'd
vote for thinking of them independently for now.

Max

On Tue, Sep 11, 2018 at 2:22 PM Christine Chambers 
wrote:

> It is a bit too soon to call the implementation. Still in the process of
> writing a design for scoped filters though I do like where Grace is going
> with the idea. :)
>
> I see visual correlation (VC) as an extreme case of crossfilter
> <http://square.github.io/crossfilter/> in which the "crossed" section is
> a single grain on the time series (based on the aggregation level of the
> time series). We can think of scoped filters as what can be used to back
> the VC (i.e. the actual slicing). My thought is a scoped filter can be
> created in multiple ways (at least two):
> 1) descriptively through the UI (the MVP IMO)
> 2) implicitly through interactions like the one in VC (e.g. hovering over
> a grain of time)
>
> The model I have forming in my head is a pubsub one where scoped filters
> get published by the ways mentioned above and interested parties (e.g.
> charts in dashboards) subscribe to them. I'm hesitant to pin down the role
> `refreshExcept` plays right now. My gut feeling is I'll refactor/rewrite a
> large portion of it. :)
>
> All of these said, Jamshed, I don't think scoped filters will solve the
> particular hover interaction issue you're running into. Assuming scoped
> filters were implemented today, I'd expect the "show tooltips in other
> charts" bit to be callback functions each chart in the dashboard provides
> (these callback functions will take a scoped filter as one of their
> arguments). Btw, nvd3 is very imperative, to make it play nice with
> redux/react, instead of reaching in for the inner nvd3 object, you could
> consider adding a react wrapper around it and provide a callback function
> to show tooltips.
>
> What's your timeline like for implementing VC? Trying to figure out if/how
> I can help you with the slicing bit.
>
> Cheers,
> Christine
>
> On Tue, Sep 11, 2018 at 12:59 PM Grace Guo  wrote:
>
>> BTW, use scoped filter to implement visual correlation in dashboard is
>> just my thought…I didn’t discuss with Christine :)
>> This feature can totally independent of scoped filter too. I am open to
>> all proposals.
>>
>> - Grace
>>
>>
>> On Sep 11, 2018, at 12:44 PM, Jamshed Rahman  wrote:
>>
>> Thanks for sending the Scoped Filter info. I'm not sure if this is inline
>> with what I'm trying to do though. The "Chart" list objects available in
>> Dashboard.jsx is of type ChartPropShape. They don't have references to the
>> inner nvd3 object, which is required for hover/tooltip rendering. I will
>> need to access to /Chart/Chart.jsx objects for the nvd3 reference.
>>
>> Christine, any thoughts? :-)
>>
>> Jamshed
>>
>> On Mon, Sep 10, 2018 at 6:19 PM Maxime Beauchemin <
>> maximebeauche...@gmail.com> wrote:
>>
>>> We should take this to dev@ btw
>>>
>>> Max
>>>
>>> On Mon, Sep 10, 2018 at 6:18 PM Maxime Beauchemin <
>>> maximebeauche...@gmail.com> wrote:
>>>
>>>> + cc Christine Chambers  who's currently working
>>>> on scoped filters
>>>>
>>>> Max
>>>>
>>>> On Mon, Sep 10, 2018 at 4:37 PM Grace Guo  wrote:
>>>>
>>>>>
>>>>>
>>>>> Here is my thought that use *Scoped filter* to resolve this problem.
>>>>>
>>>>> *refreshExcept *hold the logic to find the list of charts that should
>>>>> be affected by filter change.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> - Grace
>>>>>
>>>>> On Sep 10, 2018, at 12:01 PM, Jamshed Rahman 
>>>>> wrote:
>>>>>
>>>>> That was also my idea but I need some help understanding how charts
>>>>> can subscribe to dashboard events using the current react/redux model. Is
>>>>> there an example of a chart action subscribed to dashboard state? If not,
>>>>> where would be the right place to put it? Sorry I'm still learning
>>>>> react/redux and trying to figure this out so any direction would be 

Re: Cold-case PRs

2018-09-10 Thread Maxime Beauchemin
It doesn't deal with Jiras, just PRs and GH issues (which we don't use...)

Max

On Mon, Sep 10, 2018 at 6:58 PM Sid Anand  wrote:

> Max,
> How do these manage the JIRAs?
>
> -s
>
> On Mon, Sep 10, 2018 at 6:14 PM Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
> > I've used https://github.com/bstriner/github-bot-close-inactive-issues
> in
> > the past to auto-close issues / PRs based on a policy around inactivity.
> It
> > worked alright.
> >
> > There's also https://github.com/probot/stale which seems to be one of
> the
> > leading solutions, but it may require an Apache INFRA ticket to make the
> > integration work (if they'll allow it).
> >
> > Max
> >
> > On Mon, Sep 10, 2018 at 5:21 PM Sid Anand  wrote:
> >
> > > Folks!
> > > It's around that time (over 200 PRs with some 2 years old) when we need
> > to
> > > consider closing some abandoned PRs. We have a lot more active
> > maintainers
> > > now today than we had 1 or 2 years ago, we are better able to keep up
> > with
> > > new PR demand, but it's still not perfect.
> > >
> > > I've created a GitHub Label : *Potential_Cold_Case_PR.*
> > >
> > > I've started labeling some old PRs likewise and pinging the submitters.
> > > Anyone in the community can help with labeling and pinging submitters.
> If
> > > we (maintainers) find the PR abandoned (i.e. no updates from submitters
> > for
> > > a few days), we can close the PRs.
> > >
> > > Additionally, if the JIRAs themselves are obsolete and no longer apply,
> > we
> > > can consider closing the JIRAs as obsolete.
> > >
> > > Here's one example :
> > > https://github.com/apache/incubator-airflow/pulls/vijaysbhat
> > >
> > > -s
> > >
> >
>


Re: Cold-case PRs

2018-09-10 Thread Maxime Beauchemin
I've used https://github.com/bstriner/github-bot-close-inactive-issues in
the past to auto-close issues / PRs based on a policy around inactivity. It
worked alright.

There's also https://github.com/probot/stale which seems to be one of the
leading solutions, but it may require an Apache INFRA ticket to make the
integration work (if they'll allow it).

Max

On Mon, Sep 10, 2018 at 5:21 PM Sid Anand  wrote:

> Folks!
> It's around that time (over 200 PRs with some 2 years old) when we need to
> consider closing some abandoned PRs. We have a lot more active maintainers
> now today than we had 1 or 2 years ago, we are better able to keep up with
> new PR demand, but it's still not perfect.
>
> I've created a GitHub Label : *Potential_Cold_Case_PR.*
>
> I've started labeling some old PRs likewise and pinging the submitters.
> Anyone in the community can help with labeling and pinging submitters. If
> we (maintainers) find the PR abandoned (i.e. no updates from submitters for
> a few days), we can close the PRs.
>
> Additionally, if the JIRAs themselves are obsolete and no longer apply, we
> can consider closing the JIRAs as obsolete.
>
> Here's one example :
> https://github.com/apache/incubator-airflow/pulls/vijaysbhat
>
> -s
>


Re: TriggerDagRunOperator sub tasks are scheduled to run after few hours

2018-09-07 Thread Maxime Beauchemin
Is the issue timezone related? Personally I've only used Airflow in
UTC-aligned environments so I can't help much on this topic. Bolke has
contributed timezone awareness to the codebase in the past; I'm not sure
what the common caveats may be.

Max

On Fri, Sep 7, 2018 at 4:29 AM Goutam Kumar Sahoo <
goutamkumar.sa...@infosys.com> wrote:

> HI Experts
>
>
>
> In our project, we are trying to replicate the existing  job scheduling
> implemented in Microsoft Orchestrater to new scheduler called “Apache
> Airflow” .
>
>
>
> During this replication process we have an requirement to create Master
> DAG which in turn will call Other child DAGs based on different condition.
> We referred to the existing example (example_trigger_controller_dag,
> example_trigger_target_dag) available in Airflow and tried running them
> manually.
>
>
>
> When we triggered the controller DAG manually , it is got triggered
> immediately with execution time created in local timezone (PDT) whereas the
> child DAG is scheduled to run few hours later. Even though child DAGRUN is
> showing as running, it is doing nothing and waiting for the time to
> satisfy.  Our requirement is to trigger the Target/Child DAG run as soon it
> is triggered by controller DAG
>
>
>
> If you see here , the trigger time and execution time is having a 7 hours
> GAP.
>
>
>
> We have tried to solve this many different ways but couldn’t succeed. We
> have asked for help in many forums but didn’t get any satisfactory answer
> form any of them
>
>
>
>
>
> Please help us ASAP as this issue is one of the show stopper for us .
>
>
>
>
>
>
>


Re: Missing operators in the docs

2018-08-29 Thread Maxime Beauchemin
Nice, thanks for fixing this!

On Wed, Aug 29, 2018 at 1:53 PM Kaxil Naik  wrote:

> I have fixed the issue on https://airflow.apache.org/ , added a comment on
> confluence as well. Will try to fix ReadTheDocs environment (Have opened a
> Jira for it) as it can't install all the dependencies that depend on C
> Modules (
>
> https://read-the-docs.readthedocs.io/en/latest/faq.html#i-get-import-errors-on-libraries-that-depend-on-c-modules
> )
>
> Regards,
> Kaxil
>
> On Wed, Aug 29, 2018 at 9:15 PM Kaxil Naik  wrote:
>
> > Will have a look and resolve it.
> >
> > On Wed, Aug 29, 2018 at 8:25 PM Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> >> Looks like both.
> >>
> >> On Wed, Aug 29, 2018 at 12:18 PM Kaxil Naik 
> wrote:
> >>
> >> > Hi Max,
> >> >
> >> > Did you see that on readthedocs or airflow.apache one?
> >> >
> >> > On Wed, 29 Aug 2018, 20:15 Maxime Beauchemin, <
> >> maximebeauche...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hey committers,
> >> > >
> >> > > I noticed that some of the operators are missing from the API
> >> reference
> >> > > part of the docs (HiveOperator for instance). I'm guessing a
> committer
> >> > > generated / pushed the docs with some libs missing and that the
> >> operators
> >> > > depending on those missing libs got skipped.
> >> > >
> >> > > We may have to improve the doc-generation wiki page or make a
> >> bulletproof
> >> > > shell script that ensures all libs are installed prior to generating
> >> the
> >> > > docs.
> >> > >
> >> > > Max
> >> > >
> >> >
> >>
> >
> >
> > --
> > *Kaxil Naik*
> > *Big Data Consultant *@ *Data Reply UK*
> > *Certified *Google Cloud Data Engineer | *Certified* Apache Spark & Neo4j
> > Developer
> > *Phone: *+44 (0) 74820 88992
> > *LinkedIn*: https://www.linkedin.com/in/kaxil
> >
>
>
> --
> *Kaxil Naik*
> *Big Data Consultant *@ *Data Reply UK*
> *Certified *Google Cloud Data Engineer | *Certified* Apache Spark & Neo4j
> Developer
> *Phone: *+44 (0) 74820 88992
> *LinkedIn*: https://www.linkedin.com/in/kaxil
>


Re: Missing operators in the docs

2018-08-29 Thread Maxime Beauchemin
Looks like both.

On Wed, Aug 29, 2018 at 12:18 PM Kaxil Naik  wrote:

> Hi Max,
>
> Did you see that on readthedocs or airflow.apache one?
>
> On Wed, 29 Aug 2018, 20:15 Maxime Beauchemin, 
> wrote:
>
> > Hey committers,
> >
> > I noticed that some of the operators are missing from the API reference
> > part of the docs (HiveOperator for instance). I'm guessing a committer
> > generated / pushed the docs with some libs missing and that the operators
> > depending on those missing libs got skipped.
> >
> > We may have to improve the doc-generation wiki page or make a bulletproof
> > shell script that ensures all libs are installed prior to generating the
> > docs.
> >
> > Max
> >
>


Missing operators in the docs

2018-08-29 Thread Maxime Beauchemin
Hey committers,

I noticed that some of the operators are missing from the API reference
part of the docs (HiveOperator for instance). I'm guessing a committer
generated / pushed the docs with some libs missing and that the operators
depending on those missing libs got skipped.

We may have to improve the doc-generation wiki page or make a bulletproof
shell script that ensures all libs are installed prior to generating the
docs.

Max


Re: PR Review Dashboard?

2018-08-28 Thread Maxime Beauchemin
Alright let's keep things this way at least for now so that we can move
forward with the Spark approach.

Max

On Tue, Aug 28, 2018 at 10:39 AM Bolke de Bruin  wrote:

> I’m not in favor of moving to GitHub issues. While JIRA is not perfect it
> actually moves discussion to the mailing list. With GitHub issues the stuff
> just gets lost imho.
>
> I do like our changelogs a lot better now.
>
> B.
>
> Verstuurd vanaf mijn iPad
>
> > Op 27 aug. 2018 om 05:53 heeft Holden Karau  het
> volgende geschreven:
> >
> > Awesome, so I'll give this a shot with JIRA for now and if we end up
> moving to GH we can move it over. I've got a fork of the dashboard I've
> been using in Apache Beam as well as the Spark one so it shouldn't take me
> too long to generalize it again.
> >
> >> On Sun, Aug 26, 2018 at 8:50 PM Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
> >> I love the idea. I took a quick look at /databricks/spark-pr-dashboard
> >> <https://github.com/databricks/spark-pr-dashboard> and that looks like
> a
> >> nice easy start. It would be interesting to try to make a generic
> project
> >> out of it that would work on top of any project/repo. Basically just
> >> refactor all of the Spark-specific code and configurations into a
> >> `config.py` (assuming there's not too much frontend Spark-specific
> >> code...).
> >>
> >> Of course this tool assumes Jira+Github which works for Airflow, but
> >> probably isn't as common of a setup to really generalize beyond Apache.
> It
> >> seems like by embracing Github issues and dropping Jira we could be
> >> building something much more relevant.
> >>
> >> Max
> >>
> >> On Fri, Aug 24, 2018 at 7:27 PM Holden Karau 
> wrote:
> >>
> >> > Update: we can do this with the dashboard code. Since this would
> modify
> >> > the JIRAs I’d love a sign-off on turning that feature on from someone
> on
> >> > the PMC (or at least a week with no PMC folks saying no).
> >> >
> >> > On Thu, Aug 23, 2018 at 8:00 AM Holden Karau 
> wrote:
> >> >
> >> >> I mean a few ASF projects update JIRA tickets based on PRs
> automatically.
> >> >> Switching from JIRA to GH issues (or back) is super painful, so I’d
> >> >> probably do more incremental improvements personally, I just don’t
> have the
> >> >> time to do something like that.
> >> >>
> >> >> I’ll take a look at some the K8s tools (over in Beam they’re looking
> at
> >> >> one of the review tagging tools out of K8s) if the Spark one is too
> >> >> difficult to adapt to our use case.
> >> >>
> >> >> On Thu, Aug 23, 2018 at 2:34 AM Eamon Keane 
> >> >> wrote:
> >> >>
> >> >>> Kubernetes is a good place to look as they've invested a lot in
> github
> >> >>> bots
> >> >>> and label based workflows. E.g. the cherry-picking script and doc is
> >> >>> here:
> >> >>>
> >> >>>
> >> >>>
> https://github.com/kubernetes/kubernetes/blob/master/hack/cherry_pick_pull.sh
> >> >>>
> >> >>>
> https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md
> >> >>>
> >> >>> And more general overview here:
> >> >>>
> >> >>>
> https://github.com/kubernetes/community/tree/master/contributors/devel
> >> >>>
> >> >>> On Thu, Aug 23, 2018 at 5:11 AM Maxime Beauchemin <
> >> >>> maximebeauche...@gmail.com> wrote:
> >> >>>
> >> >>> > I've heard many times in the past about a GH/Jira syncing tool but
> >> >>> never
> >> >>> > seen it in action. Personally my vote is to move issues to GH and
> drop
> >> >>> > Jira. Though in the process this will break the release helper
> script
> >> >>> here:
> >> >>> >
> >> >>>
> https://github.com/apache/incubator-airflow/blob/master/dev/airflow-jira
> >> >>> >
> >> >>> > We'll be working on a Github label-driven release baking magic
> script
> >> >>> for
> >> >>> > Superset, maybe we could use the same tooling on both Airflow and
> >> >>> Superset.
> >> >>> 

Re: Lazily load input for Airflow operators

2018-08-27 Thread Maxime Beauchemin
This is reasonable; it could be nice to have a generic way to replace
operator kwargs with callables. In the meantime you can try this hack
deriving an operator inline with your DAG definition. In this hack, the
callable receives the operator's context object, which is nice: it provides
a handle on a lot of things defined here:
https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L1886
(same
as what's in the jinja template context).

class DerivedFooOperator(FooOperator):

    def _bar(self, context):
        # only gets evaluated at run time
        return datetime.datetime.now()

    def execute(self, context):
        self.bar = self._bar(context)
        super(DerivedFooOperator, self).execute(context)

If `bar` is a required arg, you'll have to pass a dummy static value on
initialization, but it will get overwritten at runtime.

You can imagine having a more generic class mixin to do this. Or maybe
BaseOperator could have a `kwarg_overrides_callables` that would be a dict
of string: callable that would execute somewhere in between `__init__` and
`execute` and do the magic. Or how about a `pre_execute(context): pass`
BaseOperator method as a nice hook to allow for this kind of stuff without
having to call `super`.
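-
To make that generic idea a bit more concrete, here's a rough sketch -
everything below is hypothetical (the mixin, the dict attribute,
`FooOperator` and `expensive_lookup` are all invented), and it assumes a
pre_execute-style hook gets called with the context right before execute():

class LazyKwargsMixin(object):
    """Resolve selected operator attributes lazily, right before execute()."""

    # maps attribute name -> callable(context)
    kwarg_overrides_callables = {}

    def pre_execute(self, context):
        for attr, fn in self.kwarg_overrides_callables.items():
            setattr(self, attr, fn(context))


class LazyFooOperator(LazyKwargsMixin, FooOperator):
    kwarg_overrides_callables = {
        'bar': lambda context: expensive_lookup(context['execution_date']),
    }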

Max

On Mon, Aug 27, 2018 at 2:29 PM Victor Jimenez 
wrote:

> TL;DR Is there any recommended way to lazily load input for Airflow
> operators?
>
>
> I could not found a way to do this. While I faced this limitation while
> using the Databricks operator, it seems other operators might potentially
> lack such a functionality. Please, keep reading for more details.
>
>
> ---
>
>
> When instantiating a DatabricksSubmitRunOperator (
> https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/databricks_operator.py)
> users need to pass the description of the job that will later be executed
> on Databricks.
>
> The job description is only needed at execution time (when the hook is
> called). However, the json parameter must already have the full job
> description when constructing the operator. This may present a problem if
> computing the job description needs to execute expensive operations (e.g.,
> querying a database). The expensive operation will be invoked every single
> time the DAG is reprocessed (which may happen quite frequently).
>
> It would be good to have an equivalent mechanism to the python_callable
> parameter in the PythonOperator. In this way, users could pass a function
> that would generate the job description only when the operator is actually
> executed. I discussed this with Andrew Chen (from Databricks), and he
> agrees it would be an interesting feature to add.
>
>
> Does this sound reasonable? Is this use case supported in some way that I
> am unaware of?
>
>
> You can find the issue I created here:
> https://issues.apache.org/jira/projects/AIRFLOW/issues/AIRFLOW-2964
>


Re: Airflow 1.10.0 is released on PyPI

2018-08-27 Thread Maxime Beauchemin
Thanks to everyone who contributed to this release, and special thanks to
those who worked on packaging and releasing 1.10 !

I took a quick look at the change log and it looks like this release adds *780
commits* on top of `1.9.0` which was released last January. Let's take a
moment to appreciate how much work went into this, good work everyone!

Max

On Mon, Aug 27, 2018 at 9:33 AM Naik Kaxil  wrote:

> Dear Airflow community,
>
>
>
> Airflow 1.10.0 was just released on PyPI (`pip install apache-airflow`).
>
>
>
> https://pypi.python.org/pypi/apache-airflow
>
>
>
> The documentation is available on:
>
>- https://airflow.apache.org
>- https://airflow.readthedocs.io/en/stable/ and
>https://airflow.readthedocs.io/en/1.10.0/
>
>
>
> The source release as well as the binary "sdist" release are available
> here:
>
>
>
>
> https://dist.apache.org/repos/dist/release/incubator/airflow/1.10.0-incubating/
>
>
>
> Find the CHANGELOG here for more details:
>
>
>
> https://github.com/apache/incubator-airflow/blob/master/CHANGELOG.txt
>
>
>
> Cheers,
>
> Kaxil
>
>
>
>
> Kaxil Naik
>
> Data Reply
> 2nd Floor, Nova South
> 160 Victoria Street, Westminster
> London SW1E 5LB - UK
> phone: +44 (0)20 7730 6000
> k.n...@reply.com
> www.reply.com
>
> [image: Data Reply]
>


Re: PR Review Dashboard?

2018-08-26 Thread Maxime Beauchemin
I love the idea. I took a quick look at /databricks/spark-pr-dashboard
<https://github.com/databricks/spark-pr-dashboard> and that looks like a
nice easy start. It would be interesting to try to make a generic project
out of it that would work on top of any project/repo. Basically just
refactor all of the Spark-specific code and configurations into a
`config.py` (assuming there's not too much frontend Spark-specific
code...).
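-
Something along those lines (every key below is invented, just to sketch
the shape):

# config.py - sketch of isolating the project-specific settings
PROJECT_NAME = 'Apache Airflow'
GITHUB_REPO = 'apache/incubator-airflow'
JIRA_PROJECT = 'AIRFLOW'
JIRA_SERVER = 'https://issues.apache.org/jira'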

Of course this tool assumes Jira+Github which works for Airflow, but
probably isn't as common of a setup to really generalize beyond Apache. It
seems like by embracing Github issues and dropping Jira we could be
building something much more relevant.

Max

On Fri, Aug 24, 2018 at 7:27 PM Holden Karau  wrote:

> Update: we can do this with the dashboard code. Since this would modify
> the JIRAs I’d love a sign-off on turning that feature on from someone on
> the PMC (or at least a week with no PMC folks saying no).
>
> On Thu, Aug 23, 2018 at 8:00 AM Holden Karau  wrote:
>
>> I mean a few ASF projects update JIRA tickets based on PRs automatically.
>> Switching from JIRA to GH issues (or back) is super painful, so I’d
>> probably do more incremental improvements personally, I just don’t have the
>> time to do something like that.
>>
>> I’ll take a look at some the K8s tools (over in Beam they’re looking at
>> one of the review tagging tools out of K8s) if the Spark one is too
>> difficult to adapt to our use case.
>>
>> On Thu, Aug 23, 2018 at 2:34 AM Eamon Keane 
>> wrote:
>>
>>> Kubernetes is a good place to look as they've invested a lot in github
>>> bots
>>> and label based workflows. E.g. the cherry-picking script and doc is
>>> here:
>>>
>>>
>>> https://github.com/kubernetes/kubernetes/blob/master/hack/cherry_pick_pull.sh
>>>
>>> https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md
>>>
>>> And more general overview here:
>>>
>>> https://github.com/kubernetes/community/tree/master/contributors/devel
>>>
>>> On Thu, Aug 23, 2018 at 5:11 AM Maxime Beauchemin <
>>> maximebeauche...@gmail.com> wrote:
>>>
>>> > I've heard many times in the past about a GH/Jira syncing tool but
>>> never
>>> > seen it in action. Personally my vote is to move issues to GH and drop
>>> > Jira. Though in the process this will break the release helper script
>>> here:
>>> >
>>> https://github.com/apache/incubator-airflow/blob/master/dev/airflow-jira
>>> >
>>> > We'll be working on a Github label-driven release baking magic script
>>> for
>>> > Superset, maybe we could use the same tooling on both Airflow and
>>> Superset.
>>> > The idea is that the script would use labels like `target:apache-1.11`
>>> on
>>> > PRs to bake releases. The tool would take as input a base SHA and
>>> release
>>> > minor release number, and would craft a release branch, fetch and
>>> > cherry-pick all the right commits in the right order based on labels,
>>> > generate release tags (on minor versions) and output state into
>>> > release-info files (listing the base, all cherries, all tags, ...). The
>>> > tricky part is resolving merge conflicts while auto-picking cherries,
>>> but
>>> > the script would guide the operator through it .
>>> >
>>> > Curious to hear about how other projects do it. I think it generally
>>> > involves a lot of manual work. Let me know if you know of open source
>>> > tooling to deal with release management.
>>> >
>>> > Max
>>> >
>>> > On Wed, Aug 22, 2018 at 6:25 PM Holden Karau 
>>> > wrote:
>>> >
>>> > > Thanks for the reminder, forgot to ask at coffee but I'll ask.
>>> > >
>>> > > On Wed, Aug 22, 2018, 1:52 AM Driesprong, Fokko >> >
>>> > > wrote:
>>> > >
>>> > > > Hi Holden,
>>> > > >
>>> > > > Just curious if you got a hold of someone at the coffee machine :-)
>>> > > >
>>> > > > Cheers, Fokko
>>> > > >
>>> > > > Op di 7 aug. 2018 om 09:17 schreef Holden Karau <
>>> hol...@pigscanfly.ca
>>> > >:
>>> > > >
>>> > > > > The JIRA/Github integration tooling I’m a little more fuzzy on
>>> but
>>> > I’m
>>> > > >

Re: PR Review Dashboard?

2018-08-22 Thread Maxime Beauchemin
I've heard many times in the past about a GH/Jira syncing tool but never
seen it in action. Personally my vote is to move issues to GH and drop
Jira. Though in the process this will break the release helper script here:
https://github.com/apache/incubator-airflow/blob/master/dev/airflow-jira

We'll be working on a Github label-driven release-baking script for
Superset; maybe we could use the same tooling on both Airflow and Superset.
The idea is that the script would use labels like `target:apache-1.11` on
PRs to bake releases. The tool would take as input a base SHA and a minor
release number, and would craft a release branch, fetch and cherry-pick all
the right commits in the right order based on labels, generate release tags
(on minor versions), and output state into release-info files (listing the
base, all cherries, all tags, ...). The tricky part is resolving merge
conflicts while auto-picking cherries, but the script would guide the
operator through it.
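
To make that concrete, here's a rough sketch of the cherry-picking core (all
names are hypothetical, and the `labeled_prs` input — (merge SHA, labels)
pairs in merge order — would be gathered elsewhere, e.g. from the Github API):

import subprocess

TARGET_LABEL = "target:apache-1.11"  # hypothetical label naming scheme

def bake_release(base_sha, release_branch, labeled_prs):
    """Create the release branch off base_sha and replay labeled cherries."""
    subprocess.check_call(["git", "checkout", "-b", release_branch, base_sha])
    for sha, labels in labeled_prs:
        if TARGET_LABEL not in labels:
            continue
        try:
            # -x records the original SHA in the cherry's commit message
            subprocess.check_call(["git", "cherry-pick", "-x", sha])
        except subprocess.CalledProcessError:
            # guide the operator: resolve, `git cherry-pick --continue`, re-run
            print("Conflict while picking {}, resolve it manually".format(sha))
            raise

The interesting work is in collecting `labeled_prs` in the right order and in
the conflict-resolution UX; the git plumbing itself is pretty small.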

Curious to hear about how other projects do it. I think it generally
involves a lot of manual work. Let me know if you know of open source
tooling to deal with release management.

Max

On Wed, Aug 22, 2018 at 6:25 PM Holden Karau  wrote:

> Thanks for the reminder, forgot to ask at coffee but I'll ask.
>
> On Wed, Aug 22, 2018, 1:52 AM Driesprong, Fokko 
> wrote:
>
> > Hi Holden,
> >
> > Just curious if you got a hold of someone at the coffee machine :-)
> >
> > Cheers, Fokko
> >
> > Op di 7 aug. 2018 om 09:17 schreef Holden Karau :
> >
> > > The JIRA/Github integration tooling I’m a little more fuzzy on but I’m
> > > doing coffee with some of the folks who probably know the details this
> > week
> > > and I’ll report back.
> > >
> > > On Tue, Aug 7, 2018 at 12:15 AM Driesprong, Fokko  >
> > > wrote:
> > >
> > > > Hi Holden,
> > > >
> > > > Thanks for reaching out. Recently we've moved to Apache Gitbox (
> > > > https://gitbox.apache.org/), so we use the Github UI directly
> instead
> > of
> > > > having to merge using a CLI (
> > > >
> https://github.com/apache/incubator-airflow/blob/master/dev/airflow-pr
> > ).
> > > >
> > > > Not sure if we're already up to the game of the dashboard, which
> looks
> > > > awesome btw. But as you also mentioned in your live Airflow PR, we're
> > > > missing some automation of communication between Jira and Github. For
> > > > example, as you mentioned, when a PR is opened, the status
> > automagically
> > > > changes to In Progress. Do you have any pointer of how this is set up
> > at,
> > > > for example the Spark or Beam project? So we can replicate this in
> > > Airflow.
> > > >
> > > > Cheers, Fokko
> > > >
> > > > 2018-08-07 7:28 GMT+02:00 Holden Karau :
> > > >
> > > > > Hi Y'all,
> > > > >
> > > > > One of the comments from my livestream was asking if the code for
> the
> > > > Spark
> > > > > PR review dashboard  is OSS (it is
> > > > > ), and I have a
> > fork
> > > > up
> > > > > for Beam, and I was wondering if folks in Airflow would find
> > something
> > > > like
> > > > > this useful? If so I'd be happy to set that up (if not no stress).
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Holden :)
> > > > >
> > > > > --
> > > > > Cell : 425-233-8271
> > > > >
> > > >
> > > --
> > > Twitter: https://twitter.com/holdenkarau
> > >
> >
>


Re: Why not mark inactive DAGs in the main scheduler loop?

2018-08-22 Thread Maxime Beauchemin
I'd rather the scheduler delegate that to one of the minions (subprocess)
if possible. We should keep everything we can off the main thread.

BTW I've been speaking about renaming the scheduler to "supervisor" for a
while now. While renaming may be a bit tricky (updating all references in
the code), we should think of the scheduler as more of a supervisor as it
takes on all sorts of supervision-related tasks.

Tangent: we need to start thinking about allowing for a distributed
scheduler too, and I'm thinking we need to be careful around the tasks that
shouldn't be parallelized (this may or may not be one of them).  We'll need
to do very basic leader election and taking/releasing locks while running
these tasks. I'm thinking we can just set flags in the database to do that.
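
For the lock part, something as simple as this could do (illustration only;
the `scheduler_lock` table is hypothetical and not part of the current
schema):

import datetime

from sqlalchemy import Column, DateTime, MetaData, String, Table
from sqlalchemy.exc import IntegrityError

metadata = MetaData()
scheduler_lock = Table(
    "scheduler_lock", metadata,
    Column("lock_name", String(250), primary_key=True),
    Column("owner", String(250), nullable=False),
    Column("acquired_at", DateTime, nullable=False),
)

def try_acquire(engine, lock_name, owner):
    """Returns True if we got the lock; the primary key serializes access."""
    try:
        with engine.begin() as conn:
            conn.execute(scheduler_lock.insert().values(
                lock_name=lock_name, owner=owner,
                acquired_at=datetime.datetime.utcnow()))
        return True
    except IntegrityError:
        return False

def release(engine, lock_name, owner):
    with engine.begin() as conn:
        conn.execute(scheduler_lock.delete().where(
            (scheduler_lock.c.lock_name == lock_name) &
            (scheduler_lock.c.owner == owner)))

A real version would also need lock expiry / heartbeating so a crashed
scheduler doesn't hold a lock forever.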

Max

On Wed, Aug 22, 2018 at 12:19 PM Taylor Edmiston 
wrote:

> I'm not super familiar with this part of the scheduler.  What exactly are
> the implications of doing this mid-loop vs at scheduler termination?
> Is there a use case where DAGs hit this besides having been deleted?
>
> The deactivate_stale_dags call doesn't appear to be super expensive or
> anything like that.
>
> This seems like a reasonable idea to me.
>
> *Taylor Edmiston*
> Blog  | CV
>  | LinkedIn
>  | AngelList
>  | Stack Overflow
> 
>
>
>
> On Wed, Aug 22, 2018 at 2:32 PM Dan Davydov 
> wrote:
>
> > I see some PRs creating endpoints to delete DAGs and other things related
> > to manually deleting DAGs from the DB, but is there a good reason why we
> > can't just move the deactivating DAG logic into the main scheduler loop?
> >
> > The scheduler already has some code like this, but it only runs when the
> > Scheduler terminates:
> >   if all_files_processed:
> > self.log.info(
> > "Deactivating DAGs that haven't been touched since %s",
> > execute_start_time.isoformat()
> > )
> > models.DAG.deactivate_stale_dags(execute_start_time)
> >
>


Re: [RESULT][VOTE] Release Airflow 1.10.0

2018-08-21 Thread Maxime Beauchemin
I can, what's your PyPI ID?

Max

On Mon, Aug 20, 2018 at 2:11 PM Driesprong, Fokko 
wrote:

> Thanks Bolke!
>
> I've just pushed the artifacts to Apache Dist:
>
> https://dist.apache.org/repos/dist/release/incubator/airflow/1.10.0-incubating/
>
> I don't have any access to pypi, this means that I'm not able to upload the
> artifacts over there. Anyone in the position to grand me access or upload
> it to pypi?
>
> Thanks! Cheers, Fokko
>
> 2018-08-20 20:06 GMT+02:00 Bolke de Bruin :
>
> > Hi Guys and Gals,
> >
> > The vote has passed! Apache Airflow 1.10.0 is official.
> >
> > As I am AFK for a while can one of the committers please rename according
> > to the release docs and push it to the relevant locations (pypi and
> Apache
> > dist)?
> >
> > Oh and maybe start a quick 1.10.1?
> >
> > Cheers
> > Bolke
> >
> > Sent from my iPhone
> >
> > Begin forwarded message:
> >
> > > From: Bolke de Bruin 
> > > Date: 20 August 2018 at 20:00:28 CEST
> > > To: gene...@incubator.apache.org, dev@airflow.incubator.apache.org
> > > Subject: [RESULT][VOTE] Release Airflow 1.10.0
> > >
> > > The vote to release Airflow 1.10.0-incubating, having been open for 8
> > > days is now closed.
> > >
> > > There were three binding +1s and no -1 votes.
> > >
> > > +1 (binding):
> > > Justin Mclean
> > > Jakob Homan
> > > Hitesh Shah
> > >
> > > The release is approved.
> > >
> > > Thanks to all those who voted.
> > >
> > > Bolke
> > >
> > > Sent from my iPhone
> > >
> > > Begin forwarded message:
> > >
> > >> From: Bolke de Bruin 
> > >> Date: 20 August 2018 at 19:56:23 CEST
> > >> To: gene...@incubator.apache.org
> > >> Subject: Re: [VOTE] Release Airflow 1.10.0 (new vote based on rc4)
> > >>
> > >> Appreciated Hitesh. Do you know how to add headers to .MD files? There
> > seems to be no technical standard way[1]. Is there a way to solve this
> > elegantly?
> > >>
> > >> Cheers
> > >> Bolke
> > >>
> > >> [1] https://alvinalexander.com/technology/markdown-comments-
> > syntax-not-in-generated-output
> > >>
> > >>
> > >>
> > >> Sent from my iPhone
> > >>
> > >>> On 20 Aug 2018, at 19:48, Hitesh Shah  wrote:
> > >>>
> > >>> +1 (binding)
> > >>>
> > >>> Ran through the basic checks.
> > >>>
> > >>> Minor nit which can be fixed in the next release: there are a bunch
> of
> > >>> documentation files which could have a license header added (e.g.
> .md,
> > >>> .rst, )
> > >>>
> > >>> thanks
> > >>> Hitesh
> > >>>
> >  On Mon, Aug 20, 2018 at 4:08 AM Bolke de Bruin 
> > wrote:
> > 
> >  Sorry Willem that should be of course. Apologies.
> > 
> >  Sent from my iPhone
> > 
> > > On 20 Aug 2018, at 13:07, Bolke de Bruin 
> wrote:
> > >
> > > Hi William
> > >
> > > You seem to be missing a "4" at the end of the URL? Ah it seems
> that
> > my
> >  original email had a quirk. Would you mind using the below?
> > >
> > > https://github.com/apache/incubator-airflow/releases/tag/1.10.0rc4
> > >
> > > Thanks!
> > > Bolke
> > >
> > > Sent from my iPhone
> > >
> > >> On 20 Aug 2018, at 13:03, Willem Jiang 
> > wrote:
> > >>
> > >> Hi,
> > >>
> > >> The Git tag cannot be accessed.  I can only get the 404  error
> > there.
> > >>
> > >> https://github.com/apache/incubator-airflow/releases/tag/1.10.0rc
> > >>
> > >>
> > >> Willem Jiang
> > >>
> > >> Twitter: willemjiang
> > >> Weibo: 姜宁willem
> > >>
> > >>> On Sun, Aug 12, 2018 at 8:25 PM, Bolke de Bruin <
> bdbr...@gmail.com
> > >
> >  wrote:
> > >>>
> > >>> Hello Incubator PMC’ers,
> > >>>
> > >>> The Apache Airflow community has voted and approved the proposal
> to
> >  release
> > >>> Apache Airflow 1.10.0 (incubating) based on 1.10.0 Release
> > Candidate
> >  4. We
> > >>> now kindly request the Incubator PMC members to review and vote
> on
> > this
> > >>> incubator release.
> > >>>
> > >>> Airflow is a platform to programmatically author, schedule, and
> > monitor
> > >>> workflows. Use Airflow to author workflows as directed acyclic
> > graphs
> > >>> (DAGs) of tasks. The airflow scheduler executes your tasks on an
> > array
> >  of
> > >>> workers while following the specified dependencies. Rich command
> > line
> > >>> utilities make performing complex surgeries on DAGs a snap. The
> > rich
> >  user
> > >>> interface makes it easy to visualize pipelines running in
> > production,
> > >>> monitor progress, and troubleshoot issues when needed. When
> > workflows
> >  are
> > >>> defined as code, they become more maintainable, versionable,
> > testable,
> >  and
> > >>> collaborative.
> > >>>
> > >>> After a successful IPMC vote Artifacts will be available at:
> > >>>
> > >>> https://www.apache.org/dyn/closer.cgi/incubator/airflow <
> > >>> https://www.apache.org/dyn/closer.cgi/incubator/airflow>
> > >>>
> > >>> Public keys are available

Re: Sep Airflow Bay Area Meetup @ Google

2018-08-12 Thread Maxime Beauchemin
Hey Feng,

Sign me up for a session on "Challenges ahead - taking airflow to the next
level". I'm planning on recycling the content from the talk @Google next
Friday.

Max

On Fri, Aug 10, 2018 at 3:22 PM Feng Lu  wrote:

> Hi all,
>
> We still have 1-2 regular sessions and 4-5 lightening sessions available,
> please send in your talks ;)
> Here's a quick summary on the talks I have received:
>
> Regular sessions:
> Ben Gregory (Astronomer): Running Cloud Native Airflow.
> Feng Lu (Google): Managing Airflow As a Service: Best Practices, Experience
> and Roadmap
> Fokko Driesprong (GoDataDriven): Apache Airflow in the Google Cloud:
> Backfilling streaming data using Dataflow
>
> Lightening Session:
> Barni Seetharaman (Google): Deploy Airflow on Kubernetes using Airflow
> Operator
>
> Session type TBD:
> Manish Ranjan (Tile): Functional yet cost-effective Data Engineering With
> Airflow
>
> Thanks and looking forward to the meetup(120+ sign-ups to date)!
>
>
> Feng
>
> On Thu, Jul 19, 2018 at 2:26 PM Feng Lu  wrote:
>
> > Hi all,
> >
> > Hope you are enjoying your summer!
> >
> > This is Feng Lu from Google and we'll host the next Airflow meetup in
> our Sunnyvale
> > campus. We plan to add a *lightening session* this time for people to
> > share their airflow ideas, work in progress, pain points, etc.
> > Here's the meetup date and schedule:
> >
> > -- Sep 24 (Monday)  --
> > 6:00PM meetup starts
> > 6:00 - 8:00PM light dinner /mix-n-mingle
> > 8:00PM - 9:40PM: 5 sessions (20 minutes each)
> > 9:40PM to 10:10PM: 6 lightening sessions (5 minutes each)
> > 10:10PM to 11:00PM: drinks and social hour
> >
> > I've seen a lot of interesting discussions in the dev mailing-list on
> > security, scalability, event interactions, future directions, hosting
> > platform and others. Please feel free to send your talk proposal to us by
> > replying this email.
> >
> > The Cloud Composer team is also going to share their experience running
> > Apache Airflow as a managed solution and service roadmap.
> >
> > Thank you and looking forward to hearing from y'all soon!
> >
> > p.s., if folks are interested, we can also add a one-day Airflow
> hackathon
> > prior to the meet-up on the same day, please let us know.
> >
> > Feng
> >
> >
> >
> >
> >
> >
> >
> >
> >
>


Re: Replacement ShortCircuitOperator

2018-08-10 Thread Maxime Beauchemin
Hey,

I agree, I always thought this would be handled through a BaseOperator flag
`only_run_latest=True`. There are some potentially confusing / incompatible
scenarios like:
* can't have both depends_on_past and only_run_latest (obviously...)
* downstream tasks of an only_run_latest task can't have depends_on_past
* the same two statements also apply to "wait_for_downstream"
* more?
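
A rough sketch of the validation this flag would imply (purely hypothetical:
`only_run_latest` doesn't exist today; `depends_on_past`, `wait_for_downstream`,
`task_id` and `get_flat_relatives` are existing BaseOperator members, if I
remember correctly):

def validate_only_run_latest(task):
    """Reject incompatible combinations for a hypothetical only_run_latest flag."""
    if not getattr(task, "only_run_latest", False):
        return
    if task.depends_on_past or task.wait_for_downstream:
        raise ValueError(
            "{} can't combine only_run_latest with depends_on_past "
            "or wait_for_downstream".format(task.task_id))
    for downstream in task.get_flat_relatives(upstream=False):
        if downstream.depends_on_past or downstream.wait_for_downstream:
            raise ValueError(
                "{} is downstream of only_run_latest task {} and can't use "
                "depends_on_past or wait_for_downstream".format(
                    downstream.task_id, task.task_id))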

Max

On Fri, Aug 10, 2018 at 12:43 AM Kevin Campbell  wrote:

> Hi everyone,
>
> We had some issues with the ShortCircuitOperator, so had to come up with an
> alternative implementation.
>
> I've written up the details and the workaround in this blog posting:
>
>
>
> https://blog.diffractive.io/2018/08/07/replacement-shortcircuitoperator-for-airflow/
>
> Seems that to fix this in airflow properly, it needs some design changes. I
> think the sanest solution would be to have the scheduler handle skip
> operations, rather than implement this inside an operator. That would mean
> either having the scheduler read xcom data, or elevating the status of the
> python member to a first class attribute on a task.
>
> Anyway, hoping this posting might at least save some users some headaches
> if they want the ShortCircuitOperator to behave better.
>
> Comments and feedback appreciated.
>
> Regards,
> Kevin
>


Re: Plan to change type of dag_id from String to Number?

2018-08-09 Thread Maxime Beauchemin
The change on perf for the DAG table would be extremely negligible.

Maybe for task_instances (large table with millions of rows, 3 fields
composite key) it *could* be a decent idea. Though you'd then need to have
two indexes to store and maintain and we may have to change the code to
actually use and reference that new more efficient pk in places where it's
more efficient to use that index (some of it SQLAlchemy would do right out
of the box).

This mostly affects the index size (btree(id) is much smaller than
btree(dag_id, task_id, execution_date)), not so much the key lookup time,
which is log(n). We'd still have to use the composite btree when we want to
do range scans, which we use frequently to get sets of tasks for a DAG or a
specific DAG task. Since lookups are log(n), and given that we need to
maintain that composite btree anyway for range scans, I don't see where that
would really help. It would be a better index (fewer pages, less memory
usage, ...) if we didn't need that other composite one, which we do.
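
For illustration only (simplified columns and made-up table names, not the
real schema), the two shapes being compared look roughly like this in
SQLAlchemy:

from sqlalchemy import (Column, DateTime, Index, Integer, MetaData, String,
                        Table)

metadata = MetaData()

# today: the natural composite primary key *is* the composite btree
task_instance_composite = Table(
    "task_instance", metadata,
    Column("dag_id", String(250), primary_key=True),
    Column("task_id", String(250), primary_key=True),
    Column("execution_date", DateTime, primary_key=True),
)

# proposed: a small surrogate pk, but the composite btree is still needed
# for the range scans described above, so we end up maintaining two indexes
task_instance_surrogate = Table(
    "task_instance_v2", metadata,
    Column("id", Integer, primary_key=True),
    Column("dag_id", String(250), nullable=False),
    Column("task_id", String(250), nullable=False),
    Column("execution_date", DateTime, nullable=False),
    Index("ti_dag_task_date_idx", "dag_id", "task_id", "execution_date",
          unique=True),
)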

Max

On Thu, Aug 9, 2018 at 8:05 AM Vardan Gupta 
wrote:

> Point well taken on backward compatibility, we will have to take this
> change very diligently, if implemented.
>
> On Thu, Aug 9, 2018 at 7:29 PM Юли Волкова  wrote:
>
> > Because in case what you described nothing about backward compatibility.
> > You think what all who use need to change all theirs DAG's? It's very
> > strange, because you propose one of the most critical change and it will
> > side everyone. If you want id - call it dag_metadata_id and add it. But
> not
> > propose change what hasn't backward compatibility. It's to strange.
> >
> > On Thu, Aug 9, 2018 at 7:04 AM vardangupta...@gmail.com <
> > vardangupta...@gmail.com> wrote:
> >
> > >
> > >
> > > On 2018/08/09 11:55:11, Ash Berlin-Taylor  wrote:
> > > > Absolutely - there will still need to be a human-readable DAG id,
> even
> > > we end up with an auto-icrementing integer ID column internally and for
> > > table join performance reasons.
> > > >
> > > > -ash
> > > >
> > > > > On 9 Aug 2018, at 12:35, Юли Волкова  wrote:
> > > > >
> > > > > How will you understand what your DAG 2 doing enter to it? For
> > > each of
> > > > > 100, for example?
> > > > > Especially, if you are not a developer, who create it. You are a
> > > support
> > > > > team and have 120 DAGs.
> > > > >
> > > > > The first time, when want to also send the answer to dev-mail list.
> > > Please,
> > > > > don't do it.
> > > > >
> > > > > I think it's will be really bad to all who use dag_id like a saying
> > > name of
> > > > > dag. If I will be looked at 0329313 this does not say anything
> useful
> > > for
> > > > > me and it will be very very complicated to identify for which
> process
> > > dag
> > > > > using.  It could be another id for the indexes in DB if it's real
> > > problem
> > > > > for somebody. But, please, do not change dag_id.
> > > > >
> > > > > On Mon, Aug 6, 2018 at 1:32 AM vardangupta...@gmail.com <
> > > > > vardangupta...@gmail.com> wrote:
> > > > >
> > > > >> Hi Everyone,
> > > > >>
> > > > >> Do we have any plan to change type of dag_id from String to
> Number,
> > > this
> > > > >> will make queries on metadata more performant, proposal could be
> > > generating
> > > > >> an auto-incremental value in dag table and this id getting used in
> > > rest of
> > > > >> the other tables?
> > > > >>
> > > > >>
> > > > >> Regards,
> > > > >> Vardan Gupta
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > _
> > > > >
> > > > > С уважением, Юлия Волкова
> > > > > Тел. : +7 (911) 116-71-82
> > > >
> > > >
> > >
> > > Thanks Ash for your reply, I am aligned with what you're saying.
> > >
> > > I was not proposing to take away human readable dag_id instead I was
> > > thinking, why can't we create another field like dag_name which will
> hold
> > > this information at all front facing sites while dag_id is changed to
> > > integer, this will help in making joins work faster in metastore.
> Though,
> > > currently dag_id is indexed but still indexing int (4 bytes) vs
> > > varchar(250) are going to take more index blocks and therefore more
> look
> > up
> > > time. Also, if dag_id is not trivial to change to int, let it be
> present
> > > and let's introduce another col which is actually integer in type and
> let
> > > joining happen on this column across all tables.
> > >
> >
> >
> > --
> > _
> >
> > С уважением, Юлия Волкова
> > Тел. : +7 (911) 116-71-82
>


Re: Custom authentication with RBAC

2018-08-08 Thread Maxime Beauchemin
You can define your own AirflowSecurityManager based on FAB's
SecurityManager; see the FAB security docs:
http://flask-appbuilder.readthedocs.io/en/latest/security.html

We should publish docs on how to do this.
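
In the meantime, here's a bare-bones sketch (treat the `SECURITY_MANAGER_CLASS`
hook in `webserver_config.py` as an assumption to verify against your version):

# webserver_config.py (RBAC UI)
from flask_appbuilder.security.manager import AUTH_DB

from airflow.www_rbac.security import AirflowSecurityManager


class MySecurityManager(AirflowSecurityManager):
    """Override whatever FAB SecurityManager hooks you need here
    (custom auth views, user-info lookups, role mapping, ...)."""

    def __init__(self, appbuilder):
        super(MySecurityManager, self).__init__(appbuilder)
        # custom initialization goes here


AUTH_TYPE = AUTH_DB
SECURITY_MANAGER_CLASS = MySecurityManager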

Max

On Wed, Aug 8, 2018 at 2:31 PM Gabriel Silk 
wrote:

> Hello Airflow devs,
>
> It seems that it is not possible to use a custom auth backend with the new
> RBAC web server, like it was with the old.
>
> In the old webserver, you could simple set "webserver.auth_backend" to a
> classname and implement any logic you like.
>
> The absence of this feature is a blocker for adapting RBAC.
>
> Is there any easy fix for this? Is it possible to extend FAB in a similar
> way?
>
> Thanks!
>


Re: Basic modeling question

2018-08-08 Thread Maxime Beauchemin
There's also the hack of using templating to skip executions. Say for a
BashOperator:

{% if execution_date.weekday() == 1 %}
echo "skipping today"
{% else %}
./run_workload.sh
{% endif %}
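
Wired into a DAG, that could look something like this (dag_id, schedule and
script path are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    "skip_on_tuesdays",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

run_workload = BashOperator(
    task_id="run_workload",
    bash_command=(
        "{% if execution_date.weekday() == 1 %}"  # weekday() == 1 -> Tuesday
        "echo 'skipping today'"
        "{% else %}"
        "./run_workload.sh"
        "{% endif %}"
    ),
    dag=dag,
)

Jinja renders only one of the two branches at run time, so on the skipped
days the task still shows up as "success" rather than as "skipped".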

On Wed, Aug 8, 2018 at 4:27 PM Gabriel Silk 
wrote:

> Alexis, do you mean you would have done this using an ExternalTaskSensor?
> Or is there some other way to depend on a range of tasks?
>
> On Wed, Aug 8, 2018 at 3:35 PM, Alexis Rolland  >
> wrote:
>
> > Not sure if it’s optimal compared to what James proposes, but I would
> have
> > simply made the weekly and monthly rollup tasks as downstream tasks of
> the
> > daily log ingestion tasks they depend on. Then I would have used trigger
> > rules ‘all_done’ to ensure those rollup tasks start when their parent
> tasks
> > are completed.
> >
> > https://airflow.incubator.apache.org/concepts.html#trigger-rules
> >
> > (daily log ingestion) > (daily rollup)
> > (daily log ingestion) > (weekly rollup + TriggerRule.all_done)
> > (daily log ingestion) > (monthly rollup + TriggerRule.all_done)
> >
> > Cheers
> >
> > Alexis
> >
> > On 9 Aug 2018, at 02:57, James Meickle  > INVALID> wrote:
> >
> > It sounds like you want something like this?
> >
> > root_operator = DummyOperator()
> >
> > def offset_operator(i):
>  my_sql_query = "SELECT * FROM ds_add(execution_date, {offset});".format(offset=i)
>  sql_operator = SQLOperator(task_id="offset_by_{}".format(i), query=my_sql_query)
> >  return sql_operator
> >
> > offset_operators = list(offset_operator(i) for i in range(7))
> > root_operator >> offset_operators
> >
> > # Daily just waits on today, no offset
> > do_daily_work = DummyOperator()
> > offset_operators[0] >> do_daily_work
> >
> > # Weekly waits on today AND the six prior offsets
> > do_weekly_work = DummyOperator()
> > offset_operators >> do_weekly_work
> >
> > IOW, every day you wait for that day's data to be available, and then run
> > the daily job; you also wait for the previous six days data to be
> > available, and when it is, run the weekly job.
> >
> > n.b. - if you do it this way you will have up to 7 tasks polling the
> "same"
> > data point, which is slightly wasteful. But it's also not much code or
> > mental effort to write it this way.
> >
> > On Wed, Aug 8, 2018 at 2:44 PM Gabriel Silk  > mailto:gs...@dropbox.com.invalid>>
> > wrote:
> >
> > My main concern is how to express the fact that the weekly rollup depends
> > on the previous 7 days worth of data, and ensure that it does not run
> until
> > the tasks that generate those 7 days of data have run, assuming that
> tasks
> > can run non-sequentially.
> >
> > It's easy enough when you have the following situation:
> >
> > (daily log ingestion) <-- (daily rollup)
> >
> > In any given DAG run, you are guaranteed to have the data needed for
> (daily
> > rollup), because the dependency that generated its data just ran.
> >
> > But I'm not sure how best to model it when you have all of the following:
> >
> > (daily log ingestion) <-- (daily rollup)
> > (daily log ingestion) <-- (weekly rollup)
> > (daily log ingestion) <-- (monthly rollup)
> >
> >
> >
> > On Wed, Aug 8, 2018 at 11:29 AM, Taylor Edmiston  > >
> > wrote:
> >
> > Gabriel -
> >
> > Ah, I missed your earlier comment about weekly/monthly rollups also being
> > on a daily cadence.  So is your concern e.g., more about reducing the
> > redundant process of the weekly rollup tasks for the days of that range
> > that already processed in the previous DAG run(s)?  Or mainly about the
> > dependency of not executing the first weekly at all until the first 7
> > daily
> > rollups worth of data have built up?
> >
> > *Taylor Edmiston*
> > Blog  | CV
> >  | LinkedIn
> >  | AngelList
> >  | Stack Overflow
> > 
> >
> >
> > On Wed, Aug 8, 2018 at 2:14 PM, James Meickle  > mailto:jmeic...@quantopian.com>.
> > invalid> wrote:
> >
> > If you want to run (daily, rolling weekly, rolling monthly) backups on
> > a
> > daily basis, and they're mostly the same but have some additional
> > dependencies, you can write a DAG factory method, which you call three
> > times. Certain nodes only get added to the longer-than-daily backups.
> >
> > On Wed, Aug 8, 2018 at 2:03 PM Gabriel Silk  > mailto:gs...@dropbox.com.invalid>
> >
> > wrote:
> >
> > Thanks Andy and Taylor for the suggestions --
> >
> > I see how that would work for the case where you want a weekly rollup
> > that
> > runs on a weekly cadence.
> >
> > But what about a rolling weekly or monthly rollup that runs each day?
> >
> > On Wed, Aug 8, 2018 at 11:00 AM, Andy Cooper <
> > andy.coo...@astronomer.io>
> > wrote:
> >
> > To expand on Taylor's idea
> >
> > I recently wrote a ScheduleBlackoutSensor that woul

Re: The need for LocalTaskJob

2018-08-06 Thread Maxime Beauchemin
Yes clearly this area needs TLC. Thanks for getting the ball rolling.

Max

On Sat, Aug 4, 2018 at 1:58 PM Ash Berlin-Taylor <
ash_airflowl...@firemirror.com> wrote:

>
> > On 4 Aug 2018, at 21:25, Bolke de Bruin  wrote:
> >
> > We can just execute “python” just fine. Because it will run in a
> separate interpreter no issues will come from sys.modules as that is not
> inherited. Will still parse DAGs in a separate process then. Forking (@ash)
> probably does not work as that does share sys.modules.
>
> Some sharing of modules was my idea - if we are careful about what modules
> we load, and we only load the airflow core pre fork, and don't parse any
> DAG pre-fork, then forking sharing currently loaded modules is a good thing
> for speed. Think of it like the preload_app option to a gunicorn worker,
> where the master loads the app and then forks.
>
> > [snip]
> >
> > I’m writing AIP-2
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-2+Simplify+process+launching
> to work this out.
>
> Sounds good. I'm not proposing we try my forking idea yet, and your
> proposal is a definite improvement from where we are now.
>
> >
> > B.
> >
> > Verstuurd vanaf mijn iPad
> >
> >> Op 4 aug. 2018 om 19:40 heeft Ash Berlin-Taylor <
> ash_airflowl...@firemirror.com> het volgende geschreven:
> >>
> >> Comments inline.
> >>
> >>> On 4 Aug 2018, at 18:28, Maxime Beauchemin 
> wrote:
> >>>
> >>> Let me confirm I'm understanding this right, we're talking specifically
> >>> about the CeleryExecutor not starting and `airflow run` (not --raw)
> >>> command, and fire up a LocalTaskJob instead? Then we'd still have the
> >>> worker fire up the `airflow run --raw` command?
> >>>
> >>> Seems reasonable. One thing to keep in mind is the fact that shelling
> out
> >>> guarantees no `sys.module` caching, which is a real issue for slowly
> >>> changing DAG definitions. That's the reason why we'd have to reboot the
> >>> scheduler periodically before it used sub-processes to evaluate DAGs.
> Any
> >>> code that needs to evaluate a DAG should probably be done in a
> subprocess.
> >>
> >>>
> >>> Shelling out also allows for doing things like unix impersonation and
> >>> applying CGROUPS. This currently happens between `airflow run` and
> `airflow
> >>> run --raw`. The parent process also does heartbeat and listen for
> external
> >>> kill signal (kill pills).
> >>>
> >>> I think what we want is smarter executors and only one level of bash
> >>> command: the `airflow run --raw`, and ideally the system that fires
> this up
> >>> is not Airflow itself, and cannot be DAG-aware (or it will need to get
> >>> restarted to flush the cache).
> >>
> >> Rather than shelling out to `airflow run` could we instead fork and run
> the CLI code directly? This involves parsing the config twice, loading all
> of the airflow and SQLAlchemy deps twice etc. This I think would account
> for a not-insignificant speed difference for the unit tests. In the case of
> impersonation we'd probably have no option but to exec `airflow`, but
> most(?) people don't use that?
> >>
> >> Avoiding the extra parsing pentalty and process when we don't need it
> might be worth it for test speed up alone. And we've already got
> impersonation covered in the tests so we'll know that it still works.
> >>
> >>>
> >>> To me that really brings up the whole question of what should be
> handled by
> >>> the Executor, and what belongs in core Airflow. The Executor needs to
> do
> >>> more, and Airflow core less.
> >>
> >> I agree with the sentiment that Core should do less and Executors more
> -- many parts of the core are reimplementing what Celery itself could do.
> >>
> >>
> >>>
> >>> When you think about how this should all work on Kubernetes, it looks
> >>> something like this:
> >>> * the scheduler, through KubeExecutor, calls the k8s API, tells it to
> fire
> >>> up and Airflow task
> >>> * container boots up and starts an `airflow run --raw` command
> >>> * k8s handles heartbeats, monitors tasks, knows how to kill a running
> task
> >>> * the scheduler process (call it supervisor), talks with k8s through
> >>> KubeExecutor
> >>> and handles zombie cleanup and sending kill pills
> >>>
> >>> Now because Celery doesn't offer as many guarantees it gets a bit more
> >>> tricky. Is there even a way to send a kill pill through Celery? Are
> there
> >>> other ways than using a parent process to accomplish this?
> >>
> >> It does
> http://docs.celeryproject.org/en/latest/userguide/workers.html#revoke-revoking-tasks
> (at least it does now)
> >>
> >>>
> >>> At a higher level, it seems like we need to move more logic from core
> >>> Airflow into the executors. For instance, the heartbeat construct
> should
> >>> probably be 100% handled by the executor, and not an assumption in the
> core
> >>> code base.
> >>>
> >>> I think I drifted a bit, hopefully that's still helpful.
> >>>
> >>> Max
>
>


Re: The need for LocalTaskJob

2018-08-04 Thread Maxime Beauchemin
Let me confirm I'm understanding this right: we're talking specifically
about the CeleryExecutor not starting an `airflow run` (not --raw)
command, and firing up a LocalTaskJob instead? Then we'd still have the
worker fire up the `airflow run --raw` command?

Seems reasonable. One thing to keep in mind is the fact that shelling out
guarantees no `sys.modules` caching, which is a real issue for slowly
changing DAG definitions. That's the reason why we'd have to reboot the
scheduler periodically before it used subprocesses to evaluate DAGs. Any
code that needs to evaluate a DAG should probably be done in a subprocess.

Shelling out also allows for doing things like unix impersonation and
applying CGROUPS. This currently happens between `airflow run` and `airflow
run --raw`. The parent process also heartbeats and listens for external
kill signals (kill pills).

I think what we want is smarter executors and only one level of bash
command: the `airflow run --raw`, and ideally the system that fires this up
is not Airflow itself, and cannot be DAG-aware (or it will need to get
restarted to flush the cache).

To me that really brings up the whole question of what should be handled by
the Executor, and what belongs in core Airflow. The Executor needs to do
more, and Airflow core less.

When you think about how this should all work on Kubernetes, it looks
something like this:
* the scheduler, through KubeExecutor, calls the k8s API and tells it to fire
up an Airflow task
* container boots up and starts an `airflow run --raw` command
* k8s handles heartbeats, monitors tasks, knows how to kill a running task
* the scheduler process (call it supervisor), talks with k8s through
KubeExecutor
and handles zombie cleanup and sending kill pills

Now because Celery doesn't offer as many guarantees it gets a bit more
tricky. Is there even a way to send a kill pill through Celery? Are there
other ways than using a parent process to accomplish this?

At a higher level, it seems like we need to move more logic from core
Airflow into the executors. For instance, the heartbeat construct should
probably be 100% handled by the executor, and not an assumption in the core
code base.

I think I drifted a bit, hopefully that's still helpful.

Max

On Sat, Aug 4, 2018 at 6:42 AM Dan Davydov 
wrote:

> Alex (cc'd) brought this up to me about this a while ago too, and I agreed
> with him. It is definitely something we should do, I remember there were
> some things that were a bit tricky about removing the intermediate process
> and would be a bit of work to fix (something about the tasks needing to
> heartbeat the parent process maybe?).
>
> TLDR: No blockers from me, just might be a bit of work to implement.
>
>
> On Sat, Aug 4, 2018 at 9:15 AM Bolke de Bruin  wrote:
>
> > Hi Max, Dan et al,
> >
> > Currently, when a scheduled task runs this happens in three steps:
> >
> > 1. Worker
> > 2. LocalTaskJob
> > 3. Raw task instance
> >
> > It uses (by default) 5 (!) different processes:
> >
> > 1. Worker
> > 2. Bash + Airflow
> > 3. Bash + Airflow
> >
> > I think we can merge worker and LocalTaskJob as the latter seems exist
> > only to track a particular task. This can be done within the worker
> without
> > side effects. Next to thatI think we can limit the amount of (airflow)
> > processes to 2 if we remove the bash dependency. I don’t see any reason
> to
> > depend on bash.
> >
> > Can you guys shed some light on what the thoughts were around those
> > choices? Am I missing anything on why they should exist?
> >
> > Cheers
> > Bolke
> >
> > Verstuurd vanaf mijn iPad
>


Re: Use 'watch' feature of Github instead of this list?

2018-08-03 Thread Maxime Beauchemin
We have an open issue with Apache Infra about this that you can track here:
https://issues.apache.org/jira/browse/INFRA-16854

On Fri, Aug 3, 2018 at 11:29 AM Trent Robbins  wrote:

> Hi All,
>
> Is it possible that people who want to see a notification for every issue
> can subscribe to notifications using the 'watch' feature of GitHub rather
> than mirroring each notification on this list through GitBox?
>
> Thanks!
>
>
>
> Best,
>
> Trent Robbins
> Strategic Consultant for Open Source Software
> Tau Informatics LLC
> desk: 415-404-9452
> tr...@tauinformatics.com
> https://www.linkedin.com/in/trentrobbins
>


Re: Apache Airflow welcome new committer/PMC member : Feng Tao (a.k.a. feng-tao)

2018-08-03 Thread Maxime Beauchemin
Well deserved, welcome aboard!

On Fri, Aug 3, 2018 at 9:07 AM Mark Grover 
wrote:

> Congrats Tao!
>
> On Fri, Aug 3, 2018, 08:52 Jin Chang  wrote:
>
> > Congrats, Tao!!
> >
> > On Fri, Aug 3, 2018 at 8:20 AM Taylor Edmiston 
> > wrote:
> >
> > > Congratulations, Feng!
> > >
> > > *Taylor Edmiston*
> > > Blog  | CV
> > >  | LinkedIn
> > >  | AngelList
> > >  | Stack Overflow
> > > 
> > >
> > >
> > > On Fri, Aug 3, 2018 at 7:31 AM, Driesprong, Fokko  >
> > > wrote:
> > >
> > > > Welcome Feng! Awesome to have you on board!
> > > >
> > > > 2018-08-03 10:41 GMT+02:00 Naik Kaxil :
> > > >
> > > > > Hi Airflow'ers,
> > > > >
> > > > >
> > > > >
> > > > > Please join the Apache Airflow PMC in welcoming its newest member
> and
> > > > >
> > > > > co-committer, Feng Tao (a.k.a. feng-tao<
> https://github.com/feng-tao
> > >).
> > > > >
> > > > >
> > > > >
> > > > > Welcome Feng, great to have you on board!
> > > > >
> > > > >
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Kaxil
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Kaxil Naik
> > > > >
> > > > > Data Reply
> > > > > 2nd Floor, Nova South
> > > > > 160 Victoria Street, Westminster
> > > > >  > > > Westminster+%0D%0ALondon+SW1E+5LB+-+UK&entry=gmail&source=g>
> > > > > London SW1E 5LB - UK
> > > > > phone: +44 (0)20 7730 6000
> > > > > k.n...@reply.com
> > > > > www.reply.com
> > > > >
> > > > > [image: Data Reply]
> > > > >
> > > >
> > >
> >
>


Re: We've migrated to Github to repo!

2018-07-31 Thread Maxime Beauchemin
What I meant by changing history is mutating one or many SHAs in the
branch, an operation that would require force-pushing, which merging
doesn't do. Personally I prefer "Squash & Merge" as it makes for a
merge-commit-free `git log` and a linear branch history in master
that aligns with when things were introduced to the branch.

It's possible to disable some of these options from the repo (only if
you're an Admin, meaning we'd have to involve INFRA to change that). But
it's good to have options for the cases I mentioned above.

So committers, use "Squash and Merge"! It matches our previous process when
using the defaults in the now-defunct `scripts/airflow-pr`.

[I'm really hoping I'm not starting a merge vs rebase workflow debate
here...]

Max

On Tue, Jul 31, 2018 at 12:37 PM Driesprong, Fokko 
wrote:

> Hi Max,
>
> You're right. I just started plowing though my mailbox and merged a commit
> without squash and merge, but it changes history as you mention.
> Nice thing of Github is if you change it, it remembers your preference
> which is Squash and Merge :-)
>
> Love the Gitbox so far, great work!
>
> Cheers, Fokko
>
> 2018-07-31 21:34 GMT+02:00 Maxime Beauchemin :
>
> > "Squash & Merge" (the default) does the right thing (squashes the
> multiple
> > commit and replays the resulting commit on top of master), we should use
> > that most of the times. We'd only want to merge if we wanted to preserve
> > history from within the PR (multiple collaborators or multiple important
> > commits that we want to keep detailed in master for instance).
> >
> > I'm not sure how to verify whether the `master` branch is protected on
> this
> > setup (without pushing to it as a test, which I'd rather not do). We
> should
> > make sure that it is though as changing history on master can cause all
> > sorts of problems.
> >
> > Max
> >
> > On Tue, Jul 31, 2018 at 9:21 AM Sid Anand  wrote:
> >
> > > The other benefit of using Option 3 over Option 1 is that you maintain
> > the
> > > history of who committed and who authored in one line in the Git log--
> > i.e.
> > > "bob33 authored and ashb committed 3 hours ago" instead of just "ashb
> > > committed" for a merge commit followed by the commit(s) from bob33.
> > >
> > > On Tue, Jul 31, 2018 at 9:11 AM Sid Anand  wrote:
> > >
> > > > Ash,
> > > > This is pretty cool. I just merged one PR from GH directly.
> > > >
> > > > Interestingly, I still used the `dev/airflow-pr work_local` to test
> out
> > > > the PR, but merging directly in the GitHub UI afterwards definitely
> > > avoided
> > > > my needing to do another `dev/airflow-pr merge` CLI command.
> > > >
> > > > There are 3 options in the UI: The default is "Create a merge commit"
> > > > (Option 1). I think the ones we want is the "Rebase & Merge" (Option
> > 3),
> > > > which requires that PR submitters squash their commits. Otherwise, we
> > > could
> > > > use "Squash & Merge" (Option 2), though I am not clear if Squash &
> > Merge
> > > is
> > > > more like option 1 or option 3.
> > > >
> > > > -s
> > > >
> > > > On Mon, Jul 30, 2018 at 7:19 PM Andrew Phillips <
> aphill...@qrmedia.com
> > >
> > > > wrote:
> > > >
> > > >> > We should ask Apache infra to send the GH notifs to another
> mailing
> > > >> > list.
> > > >>
> > > >> Over at jclouds, we created a "notifications@" list for this
> purpose
> > > >> (well, actually we renamed "issues@" to "notifications@"), and send
> > > >> messages there:
> > > >>
> > > >> https://issues.apache.org/jira/browse/INFRA-7180
> > > >> https://mail-archives.apache.org/mod_mbox/jclouds-notifications/
> > > >>
> > > >> Regards
> > > >>
> > > >> ap
> > > >>
> > > >
> > >
> >
>


Re: We've migrated to Github to repo!

2018-07-31 Thread Maxime Beauchemin
"Squash & Merge" (the default) does the right thing (squashes the multiple
commit and replays the resulting commit on top of master), we should use
that most of the times. We'd only want to merge if we wanted to preserve
history from within the PR (multiple collaborators or multiple important
commits that we want to keep detailed in master for instance).

I'm not sure how to verify whether the `master` branch is protected on this
setup (without pushing to it as a test, which I'd rather not do). We should
make sure that it is though as changing history on master can cause all
sorts of problems.

Max

On Tue, Jul 31, 2018 at 9:21 AM Sid Anand  wrote:

> The other benefit of using Option 3 over Option 1 is that you maintain the
> history of who committed and who authored in one line in the Git log-- i.e.
> "bob33 authored and ashb committed 3 hours ago" instead of just "ashb
> committed" for a merge commit followed by the commit(s) from bob33.
>
> On Tue, Jul 31, 2018 at 9:11 AM Sid Anand  wrote:
>
> > Ash,
> > This is pretty cool. I just merged one PR from GH directly.
> >
> > Interestingly, I still used the `dev/airflow-pr work_local` to test out
> > the PR, but merging directly in the GitHub UI afterwards definitely
> avoided
> > my needing to do another `dev/airflow-pr merge` CLI command.
> >
> > There are 3 options in the UI: The default is "Create a merge commit"
> > (Option 1). I think the ones we want is the "Rebase & Merge" (Option 3),
> > which requires that PR submitters squash their commits. Otherwise, we
> could
> > use "Squash & Merge" (Option 2), though I am not clear if Squash & Merge
> is
> > more like option 1 or option 3.
> >
> > -s
> >
> > On Mon, Jul 30, 2018 at 7:19 PM Andrew Phillips 
> > wrote:
> >
> >> > We should ask Apache infra to send the GH notifs to another mailing
> >> > list.
> >>
> >> Over at jclouds, we created a "notifications@" list for this purpose
> >> (well, actually we renamed "issues@" to "notifications@"), and send
> >> messages there:
> >>
> >> https://issues.apache.org/jira/browse/INFRA-7180
> >> https://mail-archives.apache.org/mod_mbox/jclouds-notifications/
> >>
> >> Regards
> >>
> >> ap
> >>
> >
>


Re: We've migrated to Github to repo!

2018-07-30 Thread Maxime Beauchemin
We should ask Apache infra to send the GH notifs to another mailing list.

Max

On Mon, Jul 30, 2018 at 11:35 AM Ash Berlin-Taylor <
ash_airflowl...@firemirror.com> wrote:

> It appears we also have comments on Github issues being auto-duplicated to
> the dev mailing list -- this will increase the email volume on that list.
>
> Would we like to keep that feature or disable it?
>
> -ash
>
> > On 30 Jul 2018, at 20:24, Ash Berlin-Taylor <
> ash_airflowl...@firemirror.com> wrote:
> >
> > Hi everyone, but especially committers: we have now moved to Github, and
> should be able to commit directly there (and close issues too hopefully).
> >
> > If you are a committer and haven't yet linked your Github account with
> ASF go to https://gitbox.apache.org/setup/ and follow the instructions.
> >
> > If there is anything not working leave a comment on
> https://issues.apache.org/jira/browse/INFRA-16602 - I am travelling so
> can't test much right now.
> >
> > -ash
>
>


Re: Airflow's JS code (and dependencies) manageable via npm and webpack

2018-07-23 Thread Maxime Beauchemin
Are there any gaps on the new vs old UI at this point?

For many, the upgrade will require some work to set up authentication on
the new UI. This is fairly well documented in FAB.

Max

On Mon, Jul 23, 2018 at 3:08 AM Bolke de Bruin  wrote:

> I think it should be removed now. 1.10.X should be the last release seri s
> that supports the old www. Do we need to vote on this?
>
> Great work Verdan!
>
> Verstuurd vanaf mijn iPad
>
> > Op 23 jul. 2018 om 10:23 heeft Driesprong, Fokko 
> het volgende geschreven:
> >
> > ​Nice work Verdan.
> >
> > The frontend really needed some love, thank you for picking this up.
> Maybe
> > we should also think deprecating the old www. Keeping both of the UI's is
> > something that takes a lot of time. Maybe after the release of 1.10 we
> can
> > think of moving to Airflow 2.0, and removing the old UI.
> >
> >
> > Cheers, Fokko​
> >
> > 2018-07-23 10:02 GMT+02:00 Naik Kaxil :
> >
> >> Awesome. Thanks @Verdan
> >>
> >> On 23/07/2018, 07:58, "Verdan Mahmood" 
> wrote:
> >>
> >>Heads-up!! This frontend change has been merged in master branch
> >> recently.
> >>This will impact the users working on Airflow RBAC UI only. That
> means:
> >>
> >>*If you are a contributor/developer of Apache Airflow:*
> >>You'll need to install and build the frontend packages if you want to
> >> run
> >>the web UI.
> >>Please make sure to read the new section, "Setting up the node / npm
> >>javascript environment"
> >><https://github.com/apache/incubator-airflow/blob/master/
> >> CONTRIBUTING.md#setting-up-the-node--npm-javascript-
> >> environment-only-for-www_rbac>
> >>
> >>in CONTRIBUTING.md
> >>
> >>*If you are using Apache Airflow in your production environment:*
> >>Nothing will impact you, as every new build of Apache Airflow will
> >> come up
> >>with pre-built dependencies.
> >>
> >>Please let me know if you have any questions. Thank you
> >>
> >>Best,
> >>*Verdan Mahmood*
> >>
> >>
> >>On Sun, Jul 15, 2018 at 6:52 PM Maxime Beauchemin <
> >>maximebeauche...@gmail.com> wrote:
> >>
> >>> Glad to see this is happening!
> >>>
> >>> Max
> >>>
> >>> On Mon, Jul 9, 2018 at 6:37 AM Ash Berlin-Taylor <
> >>> ash_airflowl...@firemirror.com> wrote:
> >>>
> >>>> Great! Thanks for doing this. I've left some review comments on
> >> your PR.
> >>>>
> >>>> -ash
> >>>>
> >>>>> On 9 Jul 2018, at 11:45, Verdan Mahmood <
> >> verdan.mahm...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> ​Hey Guys, ​
> >>>>>
> >>>>> In an effort to simplify the JS dependencies of Airflow
> >>>>> ​​
> >>>>> ,
> >>>>> ​I've
> >>>>> introduce
> >>>>> ​d​
> >>>>> npm and webpack for the package management. For now, it only
> >> implements
> >>>>> this in the www_rbac version of the web server.
> >>>>> ​
> >>>>>
> >>>>> Pull Request: https://github.com/apache/
> >> incubator-airflow/pull/3572
> >>>>>
> >>>>> The problem with the
> >>>>> ​existing ​
> >>>>> frontend (
> >>>>> ​JS
> >>>>> ) code of Airflow is that most of the custom JS is written
> >>>>> ​with​
> >>>>> in the html files, using the Flask's (Jinja) variables in that
> >> JS. The
> >>>> next
> >>>>> step of this effort would be to extract that custom
> >>>>> ​JS
> >>>>> code in separate JS files
> >>>>> ​,​
> >>>>> use the dependencies in those files using require or import
> >>>>> ​ and introduce the JS automated test suite eventually. ​
> >>>>> (At the moment, I'm simply using the CopyWebPackPlugin to copy
> >> the
> >>>> required
> >>>>> dependencies for use)
> >>>>> ​.
> >>>>>
> >>>>> There are also some dependencies which are directly modified in
> >> the
> >

Re: PR for refactoring Airflow SLAs

2018-07-17 Thread Maxime Beauchemin
I did a quick scan and it looks like great work, thanks for your
contribution! I'm guessing the committers are all either busy or
vacationing at this moment. Let's make sure this gets properly reviewed and
merged.

Related to this is the thought of having a formal flow for improvement
proposals so that we can do some design review upfront, and couple
contributors with committers early to make sure the process goes through
smoothly. It hurts to have quality contributions ignored. Clearly
we need to onboard more committers to ensure quality work gets merged while
also providing steady, high-quality releases.

In the meantime I'd advise you to ping regularly to make sure this PR gets
the attention it deserves and prevent it from getting buried in the pile.

Max

On Tue, Jul 17, 2018 at 5:27 AM James Meickle
 wrote:

> Hi all,
>
> I'd still love to get some eyes on this one if anyone has time. Definitely
> needs some direction as to what is required before merging, since this is a
> higher-level API change...
>
> -James M.
>
> On Mon, Jul 9, 2018 at 11:58 AM, James Meickle 
> wrote:
>
> > Hi folks,
> >
> > Based on my earlier email to the list, I have submitted a PR that splits
> > `sla=` into three independent SLA parameters, as well as heavily
> > restructuring other parts of the SLA feature:
> >
> > https://github.com/apache/incubator-airflow/pull/3584
> >
> > This is my first Airflow PR and I'm still learning the codebase, so
> > there's likely to be flaws with it. But I'm most interested in the
> general
> > compatibility of this feature with the rest of Airflow. We want this for
> > our purposes at Quantopian, but we'd really prefer to get it into Airflow
> > core rather than running a fork forever!
> >
> > Let me know your thoughts,
> >
> > -James M.
> >
>


Re: Using large numbers of sensors, resource consumption

2018-07-15 Thread Maxime Beauchemin
There have been conversations in the past around the idea of adding an
`evaluation_method` argument in BaseSensor that would allow for different
options:
1. the current approach, which is taking up a slot and poking periodically
(heavy on slot usage)
2. an approach closer to a fail/retry model, likely introducing a new state
representing that the task is waiting for the next sensing event (heavy on
overhead, MQ traffic, ...)
3. one where the scheduler itself runs the "poke" method inline; in many
cases it represents very little overhead for the scheduler to run that
task, and the scheduler is already insulated (the DAG is parsed in a
subprocess) (heavy on the scheduler machine). I think it's reasonable to do
this even without a distributed scheduler, especially for cheap-to-check
sensors.

Another way to mitigate resource usage, which we used at Airbnb, is to have a
dedicated sensor queue with machines that are provisioned more aggressively
(say 16 or 32 slots per CPU core), and route the cheap sensing tasks to
those machines.
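
Concretely that's mostly just the `queue` argument plus a worker pinned to
that queue; the queue name and concurrency below are only illustrative:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.sensors import TimeDeltaSensor

dag = DAG("sensor_queue_example", start_date=datetime(2018, 1, 1),
          schedule_interval="@daily")

wait_six_days = TimeDeltaSensor(
    task_id="wait_six_days",
    delta=timedelta(days=6),
    poke_interval=15 * 60,   # poke every 15 minutes
    queue="sensors",         # routed to the over-provisioned sensor workers
    dag=dag,
)

# on the dedicated sensor boxes, run a worker that only consumes that queue:
#   airflow worker --queues sensors --concurrency 32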

Max

On Thu, Jul 12, 2018 at 11:51 AM Pedro Machado  wrote:

> Thanks, Ash, Alexander, and Stefan for your replies.
>
> I am relatively new to airflow and not familiar with the code base. I like
> the idea of having a more efficient sensor.
>
> The async approach makes sense, but I don't know how well it would fit
> within the existing architecture.
>
> I like that Stefan's "reschedule" approach can fit the current architecture
> and could be implemented sooner. From the user point of view, my only
> feedback is that the UI should not show sensors that are still running as
> failed or up for retry as that would draw attention to things that are
> running as expected. I'll add this comment to the JIRA issue.
>
> Thanks!
>
> Pedro
>
>
> On Tue, Jul 10, 2018 at 9:44 AM Stefan Seelmann 
> wrote:
>
> > I also have that requirement and I'm working on a proposal for
> > rescheduling tasks. My current PoC can be found at [1] which uses
> > up_for_retry state which has some problems. I started to make some
> > changes, I hope can make a first proposal this week.
> >
> > The basic idea is:
> > * A new "reschedule" flag for sensors, if set to True it will raise an
> > AirflowRescheduleException (with the new schedule date) that causes a
> > reschedule
> > * Reschedule requests are recorded in new `task_reschedule` table and
> > visualized in the Gantt view.
> > * A new TI dependency that checks if a task is ready to be re-scheduled
> >
> > Advantages:
> > * This change is backward compatible. Existing sensors behave like
> > before. But it's possible to set the "reschedule" flag.
> > * The timeout and poke_interval are still respected and used to
> > calculate the next schedule time
> > * Custom sensor implementations can even define the next sensible
> > schedule date.
> > * This mechanism can also be used by non-sensor operators
> >
> > Kind Regards,
> > Stefan
> >
> > [1]
> https://github.com/seelmann/incubator-airflow/tree/reschedule-sensor-3
> >
> > On 07/10/2018 04:05 PM, Pedro Machado wrote:
> > > I have a few DAGs that use time sensors to wait until data is ready,
> > which
> > > can be several days.
> > >
> > > I have one daily DAG where, for each execution date, I have to repull
> the
> > > data for the next 7 days to capture changes (late arriving revenue
> data).
> > > This DAG currently starts 7 TimeDeltaSensors for each execution days
> with
> > > delays that range from 0 to 6 days.
> > >
> > > I was wondering what the recommendation is for cases like this where a
> > > large number of sensors is needed.
> > >
> > > Are there ways to reduce the footprint of these sensors so that they
> use
> > > less CPU and memory?
> > >
> > > I noticed that in one of the DAGs that Germain Tanguy had in the
> > > presentation he shared today a sensor was set to time out every 30
> > seconds
> > > but had a large retry count so instead of running constantly, it runs
> > every
> > > 15 minutes for 30 seconds and then dies.
> > >
> > > Are other people using this pattern? Do you have other suggestions?
> > >
> > > Thanks,
> > >
> > > Pedro
> > >
> >
> >
>


Re: [DISCUSS] AIP - Time for Airflow Improvement Proposals?

2018-07-15 Thread Maxime Beauchemin
+1

On Tue, Jul 10, 2018 at 1:09 PM Sid Anand  wrote:

> +1
>
> On Tue, Jul 10, 2018 at 1:02 PM George Leslie-Waksman
>  wrote:
>
> > +1
> >
> > On Tue, Jul 10, 2018 at 11:50 AM Jakob Homan  wrote:
> >
> > > Lots of Apache projects use ?IPs - Whatever Improvement Proposal - to
> > > document and gather consensus on large changes to the code base.  Some
> > > examples:
> > >* Kafka Improvement Proposals (KIP) -
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> > >   * Flink Improvement Proposal (FLIP) -
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > >   * Spark Improvement Proposal (SPIP) -
> > > https://spark.apache.org/improvement-proposals.html
> > >
> > > We've got a few changes that have been discussed, either on the
> > > list/JIRA (good) or in private (bad -
> > > https://incubator.apache.org/guides/committer.html#mailing_lists) that
> > > are of a magnitude that they may benefit from some version of this
> > > process.  Examples:
> > >* The in-progress plan to refactor out connectors and hooks
> > > (AIRFLOW-2732)
> > >* K8S deployment operator proposal
> > >* Initial Design for Supporting fine-grained Connection encryption
> > >
> > >
> > > The benefits of this approach is that the design is hosted somewhere
> > > less ephemeral and more editable than email.  It also provides a
> > > framework for documenting and confirming consensus through the whole
> > > community.
> > >
> > >What do y'all think?
> > >
> > > -Jakob
> > >
> >
>


Re: Airflow's JS code (and dependencies) manageable via npm and webpack

2018-07-15 Thread Maxime Beauchemin
Glad to see this is happening!

Max

On Mon, Jul 9, 2018 at 6:37 AM Ash Berlin-Taylor <
ash_airflowl...@firemirror.com> wrote:

> Great! Thanks for doing this. I've left some review comments on your PR.
>
> -ash
>
> > On 9 Jul 2018, at 11:45, Verdan Mahmood 
> wrote:
> >
> > ​Hey Guys, ​
> >
> > In an effort to simplify the JS dependencies of Airflow
> > ​​
> > ,
> > ​I've
> > introduce
> > ​d​
> > npm and webpack for the package management. For now, it only implements
> > this in the www_rbac version of the web server.
> > ​
> >
> > Pull Request: https://github.com/apache/incubator-airflow/pull/3572
> >
> > The problem with the
> > ​existing ​
> > frontend (
> > ​JS
> > ) code of Airflow is that most of the custom JS is written
> > ​with​
> > in the html files, using the Flask's (Jinja) variables in that JS. The
> next
> > step of this effort would be to extract that custom
> > ​JS
> > code in separate JS files
> > ​,​
> > use the dependencies in those files using require or import
> > ​ and introduce the JS automated test suite eventually. ​
> > (At the moment, I'm simply using the CopyWebPackPlugin to copy the
> required
> > dependencies for use)
> > ​.
> >
> > There are also some dependencies which are directly modified in the
> codebase
> > ​ or are outdated​
> > . I couldn't found the
> > ​ correct​
> > npm versions of those libraries. (dagre-d3.js and gantt-chart-d3v2.js).
> > Apparently dagre-d3.js that we are using is one of the gist or is very
> old
> > version
> > ​ not supported with webpack 4​
> > , while the gantt-chart-d3v2 has been modified according to Airflow's
> > requirements
> > ​ I believe​
> > .
> > ​ Used the existing libraries for now. ​
> >
> > ​I am currently working in a separate branch to upgrade the DagreD3
> > library, and updating the custom JS related to DagreD3 accordingly. ​
> >
> > This PR also introduces the pypi_push.sh
> > <
> https://github.com/apache/incubator-airflow/pull/3572/files#diff-8fae684cdcc8cc8df2232c8df16f64cb
> >
> > script that will generate all the JS statics before creating and
> uploading
> > the package.
> > ​
> > ​Please let me know if you guys have any questions or suggestions and I'd
> > be happy to answer that. ​
> >
> > Best,
> > *Verdan Mahmood*
> > (+31) 655 576 560
>
>


Re: What information is passed around different components of Airflow?

2018-07-06 Thread Maxime Beauchemin
The MQ (rabbit / redis / ...) gets the `airflow run {dag_id} {task_id}
{...}` command to execute, and I think the worker runs it blindly as far as
I remember. It's not ideal as far as security goes since if the MQ is
compromised, there's an open vector to the workers. Eventually it would be
safer to send some sort of payload (say JSON) and build a command on the
worker side based on that payload. Regardless you should limit network
access to the MQ to the cluster itself and ideally you should secure Celery
with SSL signed messages. More information on how to secure Celery here:
http://docs.celeryproject.org/en/latest/userguide/security.html

Max

On Thu, Jul 5, 2018 at 8:44 AM James Meickle
 wrote:

> Airflow logs are stored on the worker filesystem. When a worker starts, it
> runs a subprocess that serves logs via Flask:
>
> https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L985
>
> If you use the remote logging feature, the logs are (instead? also?) stored
> in S3.
>
> Postgres stores most everything you see in the UI. Task and DAG state, user
> accounts, privileges (in RBAC), variables, connections, etc.
>
> I believe the rabbitmq information is just task states and names, and that
> the workers fetch most of what they need from the database. But if you
> intercepted it you could manipulate which tasks are being run, so I'd still
> treat it as sensitive.
>
> On Wed, Jul 4, 2018 at 5:37 PM, Kevin Lam  wrote:
>
> > Hi,
> >
> > We run Apache Airflow as a set of k8s deployments inside of a GKE
> cluster,
> > similar to the way specified in Mumoshu's github repo:
> > https://github.com/mumoshu/kube-airflow.
> >
> > We are investigating securing our use of Airflow and are wondering about
> > some of Airflow's implementation details. Specifically, we run some tasks
> > where the workers have access to sensitive data. Some of the data can
> make
> > its way into the task logs. However, we want to make sure isn't passed
> > around eg. to the scheduler/database/message queue, and if it is, it
> should
> > be encrypted in any network traffic (eg. via mutual tls).
> >
> > - Does airflow pass around logs to the postgres db, or rabbitmq?
> > - Is the information in postgres mainly operational in nature?
> > - Is the information in rabbitmq mainly operational in nature?
> > - What about the scheduler?
> > - Anything else we're missing?
> >
> > Any ideas are appreciated!
> >
> > Thanks in advance!
> >
>


Re: Deprecating Run task from Airflow webUI

2018-07-01 Thread Maxime Beauchemin
Few thoughts:
* in our environment at Lyft, cleared tasks do get picked up by the
scheduler. Is there an issue opened for the bug you are referring to? is
that on 1.9.0?
* "clearing" in both the web ui and CLI also flips the DagRun state back to
running as the intent of clearing is usually to get the scheduler to pick
it up, there may be caveats when the DagRun doesn't exist, or is a DagRun
of the "backfill" type. Maybe `clear` should make sure that there's a
proper DagRun
* "clear" is confusing as a verb to use when the actual intent is to
reprocess...
* another [backward compatible] option is to put the feature behind a
feature flag and set that flag to False in your environment

Whether the web server and the upcoming REST API should be able to talk to
an executor is a bigger question. We may want Airflow to allow running
arbitrary DAG (through the upcoming DagFetcher abstraction) on arbitrary
executors, and the REST API (web server) may need to communicate with the
executor for that purpose. Though that's far enough ahead that it's mostly
unrelated to your current concern.

I'll go a bit further and say that eventually we should have a way to run
local, "in-development" DAGs on a remote executor in an adhoc fashion. In
next-gen Airflow that would go through publishing a DAG to the DagFetcher
abstraction (say git://{org}/{repo}/{gitref_say_a_branch}/{path/to/dag.py}
), and run say `airflow test` or `airflow backfill` through the REST API
and get that to run remote on k8s (through k8sexecutor) for instance. I
think this may require the REST api talking to the executor.

Max

On Fri, Jun 29, 2018 at 8:03 PM Feng Lu  wrote:

> Re-attaching the image..
>
> On Fri, Jun 29, 2018 at 4:54 PM Feng Lu  wrote:
>
>> Hi all,
>>
>> Please take a look at our proposal to deprecate Run task in Airflow
>> webUI.
>>
>> *What?*
>> Deprecate Run task support in the Airflow webUI and make it a no-op for
>> now.
>>
>> [image: xGAcOrcLJE4.png]
>> ​
>> *Why?*
>>
>>1. It only works with CeleryExecutor
>>
>> 
>>and renders an inconsistent experience for other types of Executors.
>>2. It requires Airflow webserver to have direct connection with the
>>message backend of CeleryExecutor, and opens more vulnerability in the
>>system. In many cases, users may want to restrict access to the celery
>>messaging backend as much as possible.
>>
>> *Mitigation:*
>> This Run task feature is mainly for the purpose of re-executing of a
>> previously running task which got stuck in running and deleted manually.
>> It's currently a two step process:
>> 1. Navigate to the task instance view page and delete the running task
>> that's stuck
>> 2. Go back to DAG/task view and click "Run"
>>
>> We proposed to combine the two steps, after a running task is deleted,
>> the Airflow scheduler will automatically re-schedule (which it does today)
>> and re-queue the task (there's a bug that needs to be fixed).
>>
>> *Fix:*
>> The scheduler currently doesn't not automatically re-queue the task
>> despite the task instance has changed from running to scheduled state. The
>> heartbeat check incorrectly returns a success in this case. The root cause
>> is that LocalTaskJob doesn't set the job state to failed (details
>> )
>> when a running task is externally deleted and confuses the heartbeat
>> check
>> 
>> .
>> Once this is fixed, a killed running task instance will be
>> auto-scheduled/enqueued for execution, verified locally.
>>
>> Thank you.
>>
>> Feng
>>
>>
>>


Re: Scheduler crashed due to mysql connectivity errors

2018-06-29 Thread Maxime Beauchemin
I'd open an issue with the full stack trace. There should be exception
handling wrapping the scheduler loop so I'm curious to see which part isn't
handled properly.

In the meantime I would highly recommend using something like `runit` to
restart the process if it exits for some reason. The fact that most people
use these supervisor wrappers is probably the reason it never got detected
or addressed.

Max

On Thu, Jun 28, 2018 at 11:02 PM ramandu...@gmail.com 
wrote:

> Hi All,
> We are using airflow 1.9 and are observing scheduler crashes due to mysql
> connectivity related issues.
> like
> "scheduler is crashed because of OperationalError:
> (_mysql_exceptions.OperationalError) (2013, 'Lost connection to MySQL
> server during query') (Background on this error at:
> http://sqlalche.me/e/e3q8)"
> Is there a way or config settings to make scheduler more resilient to
> Mysql errors.
>
> Thanks,
> Raman Gupta
>


Re: conn_id breaking change; once more with feeling

2018-06-29 Thread Maxime Beauchemin
> Breaking changes could be maintained via deprecation warnings for a
number of releases to avoid deterring users, whilst pushing towards a
cleaner interface.

No one wants to go and alter hundreds of DAGs, thousands of operator calls.
I know for a fact that the task would be monumental at both Lyft and Airbnb
and could lead to upgrades never happening, or cause many incidents (change
causes incidents!). It also means a maintainer committing to changing all
operators to support both, handling deprecation warnings, and cleaning up
eventually; I don't think you'll find someone to sign up for that task.
There are also a lot of operators in the wild (living in pipeline repos)
that may just never conform.

For operators specifically, as mentioned earlier in the thread, the reason
why it is the way it is is for `default_args` to set all the conn_ids at
the top of the script. Personally I feel strongly about not breaking this.
It makes DAG scripts look nice, and changing would be hard and very
disruptive (all DAGs in existence would have to adapt and would end up
looking worse by repeating conn_ids for each task instead of a global
setting).

As for hooks, why not just formalize the fact that the first argument of
the constructor is a `conn_id` and use positional arguments when
instantiating hooks? There are generic operators out there that already
build upon this assumption.
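
To illustrate (a minimal sketch; the hook classes and conn ids below are just
examples, and this relies on DbApiHook-style hooks accepting the conn_id as
their first positional argument):

from airflow.hooks.mysql_hook import MySqlHook
from airflow.hooks.postgres_hook import PostgresHook


def get_hook(hook_class, conn_id):
    # works for any hook following the "first positional arg is the conn_id"
    # convention, without knowing whether the kwarg is called mysql_conn_id,
    # postgres_conn_id, etc.
    return hook_class(conn_id)


src_hook = get_hook(MySqlHook, 'my_source_db')
dest_hook = get_hook(PostgresHook, 'my_dest_db')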

Max



On Fri, Jun 29, 2018 at 9:04 AM julianderui...@gmail.com <
julianderui...@gmail.com> wrote:

> I would be very much in favour of moving to the naming approach you
> propose (conn_id for hooks, src_conn_id and dest_conn_id for operators with
> multiple connections), which I think is much more consistent than the
> current naming conventions. The added advantage of this naming is also that
> it makes it much easier in the future to work towards more generic
> operators/hooks where we (for example) copy from one database or file
> system without caring which file systems are involved. This avoids the
> wildfire of the current Airflow codebase in which we end up with an
> operator for every different combination of file systems.
>
> Breaking changes could be maintained via deprecation warnings for a number
> of releases to avoid deterring users, whilst pushing towards a cleaner
> interface.
>
> On 2018/05/30 01:19:37, "Daniel (Daniel Lamblin) [BDP - Seoul]" <
> lamb...@coupang.com> wrote:
> > The short of this email is: can we please name all the connection id
> named parameters to all hooks and operators as similarly as possible. EG
> just `conn_id`?
> >
> > So, when we started using Airflow I _had_ thought that minor versions
> would be compatible for a user's DAG, assuming no use of anything marked
> beta or deprecated, such that v1.7.1, 1.8.0, 1.8.1, 1.8.2 and 1.9.0 etc
> would all run dags from prior versions in that linage, each with possible
> stability and feature improvements and each with possibly more operators,
> hooks, executors (even) etc.
> >
> > This is (now) obviously not the case, and it's the group's choice about
> what should and what should not break on a release-by-release basis. I
> think a clear policy would be appropriate for full Apache status, but then
> I may have missed where the policy is defined.  Though, if defined as not
> giving stability to the user's dags for most version changes isn't in my
> opinion going to grow confidence for Airflow being something you can grow
> with.
> >
> > Not to be overly dramatic, but currently the tiny change that the
> `s3Hook(…)` takes `aws_conn_id='any_string'` vs
> `s3_conn_id='still_any_string'` means that basically I have to maintain a
> 1.8.2 setup in perpetuity, because no one (here) wants to briefly code
> freeze before during and after an update so that we can update all the uses
> and be ready to roll back the update if something else breaks (also some
> people access the not-actually-private `_a_key and _s_key` and would need
> to switch to still-not-private `_get_credentials()[…]`). Instead we'll just
> run a second airflow setup at 1.9.0 (in each and every staged environment)
> and move the 8k dags piecemeal whe[never] people get the spare time and
> inclination. I mean, we might. It looks like 1.10.0 changes some config
> around s3 logging one more time… so maybe it's better to skip it?
> >
> > I saw the couple of PRs where the project itself had to make the changes
> to usages of the named field. There was momentary and passing concern that
> users' dags would need to do the same. In the PRs, of the options discussed
> (shims, supporting the deprecated name as deprecated, doing a hard rename),
> it wasn't brought up if the rename to aws_conn_id was the best name. [Was
> this discussed elsewhere?]
> >
> > And so I wondered why is there this virtual Hungarian notation on all
> the `conn_id`s?
> > A hook generally takes one `conn_id`, most operators take only one. In
> these cases couldn't the named parameter have been `conn_id`? When an
> operator needs a cou

Re: Securing Connections

2018-06-29 Thread Maxime Beauchemin
It certainly sounds doable and similar to the DAG-level access controls in
many ways (see the soon to be merged PR
). The new `airflow
sync_perm` CLI command could ensure the existence of one perm per "conn_id"
as well as an "all_conn_id" perm.

Now RBAC is a web-only construct at the moment and I think it makes sense
to keep it this way and build upon this assumption. This means that to
check a perm, you need APIs that live only in the new web app: the RBAC
related models are defined by FAB and are available through the
SecurityManager (a FAB construct). This means re-writing the CLI to be
lightweight and operate through REST, authenticate and all that good stuff.
This makes things like a local backfill a bit complicated to think through,
but the solution is probably for the local backfill to operate simply with
a lower-level REST api.

On the path to success we need to have a CLI that can operate without
knowing the decryption key, and the end goal is a CLI that doesn't connect
to the metadata database at all.

Note that we could stub the FAB RBAC models in "Airflow core (models.py)"
but personally I think leaving that on the web only and going through the
(yet-to-be-built) REST API is the way to go.

Also note that the current DAG-level access control only implements the web
restrictions at the moment, none of it is applied at the CLI level, that
has yet to be done.

Another thought: it may make sense to break off `airflow-cli` as its own
package though there are pros/cons here.

Max

On Fri, Jun 29, 2018 at 9:19 AM Naik Kaxil  wrote:

> I would like to get thoughts on how you guys secure connections i.e.
> Role-based control of connection. For example I don’t want Person A to use
> connection X, or in other words I only want Person B to have access of
> connection X.
>
>
>
> With RBAC in the master, it is possible but how do you guys achieve it in
> version 1.9.0?
>
>
>
> Regards,
>
> Kaxil
>
>
> Kaxil Naik
>
> Data Reply
> 2nd Floor, Nova South
> 160 Victoria Street, Westminster
> London SW1E 5LB - UK
> phone: +44 (0)20 7730 6000
> k.n...@reply.com
> www.reply.com
>
> [image: Data Reply]
>


Re: Apache Airflow 1.10.0b3

2018-06-27 Thread Maxime Beauchemin
It would be so nice to have a fast test suite. Having to wait for Travis
for up to an hour makes many workflows (like working on a release) super
painful.

I spoke with folks at Astronomer recently about moving all operators and
hooks to another Python package that airflow would import. This would allow
for independent test suites and to have a more regular release cadence on
hooks and operators. What do you think?

Max

On Wed, Jun 27, 2018 at 11:18 PM Bolke de Bruin  wrote:

> Arghhh. The downside of doing this late at night and wanting to go to
> bed... :-). Will make a new one
>
> Sent from my iPhone
>
> > On 28 Jun 2018, at 00:07, Chris Fei  wrote:
> >
> > Great, thank you! I just took this for a quick spin and it looks like
> > there's DB migration task missing. The task you committed just recently,
> > 9635ae0956e7_index_faskfail.py, has a down_revision of 856955da8476
> > which can't be found when running airflow initdb (seehttps://
> github.com/apache/incubator-airflow/tree/v1-10-test/airflow/migrations/versions
> ).
> > Chris
> >
> >
> >> On Wed, Jun 27, 2018, at 5:09 PM, Bolke de Bruin wrote:
> >> Hi All,
> >>
> >> I have created a sdist package that is available at:
> >>
> >>
> http://people.apache.org/~bolke/apache-airflow-1.10.0b3+incubating.tar.gz
> >> <
> http://people.apache.org/~bolke/apache-airflow-1.10.0b3+incubating.tar.gz>>
>
> >> In order to distinguish it from an actual (apache) release it is:
> >>
> >> 1. Marked as beta (python package managers do not install beta
> >>   versions by default - PEP 440)> 2. It is not signed
> >> 3. It is not at an official apache distribution location
> >>
> >> You can also put something like this in a requirements.txt file:
> >>
> >> git+
> >>
> https://github.com/apache/incubator-airflow@v1-10-test#egg=apache-airflow[celery,crypto,emr,hive,hdfs,ldap,mysql,postgres,redis,slack,s3
> >> <
> https://github.com/apache/incubator-airflow@v1-10-test#egg=apache-airflow[celery,crypto,emr,hive,hdfs,ldap,mysql,postgres,redis,slack,s3
> >
> >> ]>  >> airflow[celery,crypto,emr,hive,hdfs,ldap,mysql,postgres,redis,slack,s-
> >> 3]
> >> <
> https://github.com/rodrigc/incubator-airflow@master#egg=apache-airflow[celery,crypto,emr,hive,hdfs,ldap,mysql,postgres,redis,slack,s3][1]
> >
> 
> >> and then "pip install -r requirements.txt”.
> >>
> >> I hope that after this beta we can go to RC and start voting on 1.10.>
> >> Cheers
> >> Bolke
> >
> >
> > Links:
> >
> >  1.
> https://github.com/rodrigc/incubator-airflow@master#egg=apache-airflow[celery,crypto,emr,hive,hdfs,ldap,mysql,postgres,redis,slack,s3]%20%3Chttps://github.com/rodrigc/incubator-airflow@master#egg=apache-airflow[celery,crypto,emr,hive,hdfs,ldap,mysql,postgres,redis,slack,s3]
> 
>


Re: airflow.exceptions.AirflowException dag_id not found

2018-06-11 Thread Maxime Beauchemin
One more thing: this can also happen if one of your workers has a missing
dependency required for a specific DAG. For example, you read configuration
from Zookeeper in the DAG file, and one worker is missing the Zookeeper
client Python lib while the scheduler has it. You can imagine that the
scheduler will send the job over to the worker, and the worker can't
interpret the DAG file.


On Mon, Jun 11, 2018 at 3:22 PM Stephane Bonneaud 
wrote:

> Max,
>
> Thank you for the quick response, that is very helpful and great material
> for my investigations!
>
> Thanks again,
> Stéphane
>
>
> > On Jun 11, 2018, at 3:11 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
> >
> > DagBag import timeouts happen when people do more than just
> "configuration
> > as code" in their module scope (say doing actual compute in module scope,
> > which is a no-no). They may also happen if you read things from flimsy
> > external systems that may introduce delays. Say you read pipeline
> > configuration from Zookeeper or from a database or network drive and
> > somehow that operation is timing out.
> >
> > Also with Airflow (at the moment) you are responsible to synchronize the
> > pipeline definitions (DAGS_FOLDER) on all machines across the cluster. If
> > they are not in sync you'll have problems with symptoms that may look
> like
> > "dag_id not found". That happens when the scheduler is aware of DAGs that
> > workers may not be aware of.
> >
> > Max
> >
> > On Mon, Jun 11, 2018 at 12:42 PM Stephane Bonneaud <
> steph...@fathomhealth.co>
> > wrote:
> >
> >> Hi there,
> >>
> >> We’re using Airflow in our startup and it’s been great in many ways,
> >> thanks for the work you guys are doing!
> >>
> >> Unfortunately, we’re hitting a bunch of issues with ops timing out, DAGs
> >> failing for unclear reasons, with no logs or the following error:
> >> "airflow.exceptions.AirflowException: dag_id could not be found”. This
> >> seems to happen when enough DAGs are running at the same time, though it
> >> can also happen more rarely here and there. But, the best way to
> reproduce
> >> the error with our setup is to run enough DAGs at once. Most of the
> time,
> >> clearing the DAG run or ops that have failed and letting the DAG re-run
> is
> >> enough to fix the problem.
> >>
> >> I found resources pointing to the dagbag_import_timeout, e.g.,
> >>
> https://stackoverflow.com/questions/43235130/airflow-dag-id-could-not-be-found
> >> <
> >>
> https://stackoverflow.com/questions/43235130/airflow-dag-id-could-not-be-found
> >>> .
> >> I did play with that parameter, and other parameters as well. And it
> does
> >> seem that they help, i.e., I can run more DAGs at once, but
> >>(1) if I run enough DAGs at once, I still see ops and DAGs
> >> failing, so the problem is not fixed ;
> >>(2) more importantly, I don’t fully understand the problem. I
> have
> >> some ideas on what is happening, but maybe I’m totally wrong?
> >>
> >> Any recommendations on how I should investigate that?
> >>
> >> Thank you very much!
> >> Have a nice rest of the day,
> >> Stéphane
> >> http://stephanebonneaud.com <http://stephanebonneaud.com/>
> >>
> >>
>
>


Re: airflow.exceptions.AirflowException dag_id not found

2018-06-11 Thread Maxime Beauchemin
DagBag import timeouts happen when people do more than just "configuration
as code" in their module scope (say doing actual compute in module scope,
which is a no-no). They may also happen if you read things from flimsy
external systems that may introduce delays. Say you read pipeline
configuration from Zookeeper or from a database or network drive and
somehow that operation is timing out.

Also with Airflow (at the moment) you are responsible to synchronize the
pipeline definitions (DAGS_FOLDER) on all machines across the cluster. If
they are not in sync you'll have problems with symptoms that may look like
"dag_id not found". That happens when the scheduler is aware of DAGs that
workers may not be aware of.

Max

On Mon, Jun 11, 2018 at 12:42 PM Stephane Bonneaud 
wrote:

> Hi there,
>
> We’re using Airflow in our startup and it’s been great in many ways,
> thanks for the work you guys are doing!
>
> Unfortunately, we’re hitting a bunch of issues with ops timing out, DAGs
> failing for unclear reasons, with no logs or the following error:
> "airflow.exceptions.AirflowException: dag_id could not be found”. This
> seems to happen when enough DAGs are running at the same time, though it
> can also happen more rarely here and there. But, the best way to reproduce
> the error with our setup is to run enough DAGs at once. Most of the time,
> clearing the DAG run or ops that have failed and letting the DAG re-run is
> enough to fix the problem.
>
> I found resources pointing to the dagbag_import_timeout, e.g.,
> https://stackoverflow.com/questions/43235130/airflow-dag-id-could-not-be-found
> <
> https://stackoverflow.com/questions/43235130/airflow-dag-id-could-not-be-found
> >.
> I did play with that parameter, and other parameters as well. And it does
> seem that they help, i.e., I can run more DAGs at once, but
> (1) if I run enough DAGs at once, I still see ops and DAGs
> failing, so the problem is not fixed ;
> (2) more importantly, I don’t fully understand the problem. I have
> some ideas on what is happening, but maybe I’m totally wrong?
>
> Any recommendations on how I should investigate that?
>
> Thank you very much!
> Have a nice rest of the day,
> Stéphane
> http://stephanebonneaud.com 
>
>


Re: Accessing execution_date inside a function

2018-06-08 Thread Maxime Beauchemin
If you look at the source code for the TimeDeltaSensor, it's about 10 lines
of code. You can easily derive `BaseSensorOperator` and write your own
DynamicTimeDeltaSensor that has its own logic, or one that takes a
callable `are_conditions_met(execution_date)` that receives the execution
date and returns True/False.
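
Something along these lines for the first option (a minimal sketch, not the
shipped TimeDeltaSensor; import paths are for the 1.9-era layout, the naive
datetime.now() mirrors what the built-in sensor did at the time, and delta_fn
is a made-up callable mapping execution_date to a timedelta):

from datetime import datetime

from airflow.operators.sensors import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class DynamicTimeDeltaSensor(BaseSensorOperator):
    """Waits until the schedule following execution_date plus a per-run delta."""

    @apply_defaults
    def __init__(self, delta_fn, *args, **kwargs):
        super(DynamicTimeDeltaSensor, self).__init__(*args, **kwargs)
        # delta_fn is a callable: execution_date -> timedelta
        self.delta_fn = delta_fn

    def poke(self, context):
        execution_date = context['execution_date']
        target_dttm = context['dag'].following_schedule(execution_date)
        target_dttm += self.delta_fn(execution_date)
        return datetime.now() > target_dttm

Your delta_fn could then return, say, timedelta(days=3) when the
execution_date falls on a Friday and timedelta(days=1) otherwise.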

On Fri, Jun 8, 2018 at 2:04 PM Pedro Machado  wrote:

> I am using TimeDeltaSensor and would like to pass a delta value that
> depends on the execution_date. The delta depends on the day of the week of
> the execution_date.
>
> wait_for_data = TimeDeltaSensor(
> task_id="wait_for_data",
> delta=get_delta(),
> poke_interval=60 * 10,
> timeout=60 * 60 * (24 * 7 + 2),
> retries=10,
> dag=dag)
>
> Is it possible to access the execution_date from inside the get_delta
> function? How?
>
> Thanks,
>
> Pedro
>


Re: Is `airflow backfill` disfunctional?

2018-06-08 Thread Maxime Beauchemin
Ash I don't see how this could happen unless maybe the node doing the
backfill is using another metadata database.

In general we recommend for people to run --local backfills and have the
default/sandbox template for `airflow.cfg` use a LocalExecutor with
reasonable parallelism to make that behavior the default.
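
For reference, the shape of the command is something like this (dag id and
dates are placeholders):

airflow backfill my_dag_id --local -s 2018-01-01 -e 2018-01-31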

Given the [not-so-great] state of backfill, I'm guessing many have been
using the scheduler to do backfills. In that regard it would be nice to
have CLI commands to generate DagRuns or alter the state of existing ones.

Max

On Fri, Jun 8, 2018 at 8:56 AM Ash Berlin-Taylor <
ash_airflowl...@firemirror.com> wrote:

> Somewhat related to this, but likely a different issue:
>
> I've just had a case where a long (7hours) running backfill task ended up
> running twice somehow. We're using Celery so this might be related to some
> sort of Celery visibility timeout, but I haven't had a chance to be able to
> dig in to it in detail - it's 5pm on a Friday :D
>
> Has anyone else noticed anything similar?
>
> -ash
>
>
> > On 8 Jun 2018, at 01:22, Tao Feng  wrote:
> >
> > Thanks everyone for the feedback especially on the background for
> backfill.
> > After reading the discussion, I think it would be safest to add a flag
> for
> > auto rerun failed tasks for backfill with default to be false. I have
> > updated the pr accordingly.
> >
> > Thanks a lot,
> > -Tao
> >
> > On Wed, Jun 6, 2018 at 1:47 PM, Mark Whitfield <
> mark.whitfi...@nytimes.com>
> > wrote:
> >
> >> I've been doing some work setting up a large, collaborative Airflow
> >> pipeline with a group that makes heavy use of backfills, and have been
> >> encountering a lot of these issues myself.
> >>
> >> Other gripes:
> >>
> >> Backfills do not obey concurrency pool restrictions. We had been making
> >> heavy use of SubDAGs and using concurrency pools to prevent deadlocks
> (why
> >> does the SubDAG itself even need to occupy a concurrency slot if none of
> >> its constituent tasks are running?), but this quickly became untenable
> when
> >> using backfills and we were forced to mostly abandon SubDAGs.
> >>
> >> Backfills do use DagRuns now, which is a big improvement. However, it's
> a
> >> common use case for us to add new tasks to a DAG and backfill to a date
> >> specific to that task. When we do this, the BackfillJob will pick up
> >> previous backfill DagRuns and re-use them, which is mostly nice because
> it
> >> keeps the Tree view neatly organized in the UI. However, it does not
> reset
> >> the start time of the DagRun when it does this. Combined with a
> DAG-level
> >> timeout, this means that the backfill job will activate a DagRun, but
> then
> >> the run will immediately time out (since it still thinks it's been
> running
> >> since the previous backfill). This will cause tasks to deadlock
> spuriously,
> >> making backfills extremely cumbersome to carry out.
> >>
> >> *Mark Whitfield*
> >> Data Scientist
> >> New York Times
> >>
> >>
> >> On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <
> >> maximebeauche...@gmail.com>
> >> wrote:
> >>
> >>> Thanks for the input, this is helpful.
> >>>
> >>> To add to the list, there's some complexity around concurrency
> management
> >>> and multiple executors:
> >>> I just hit this thing where backfill doesn't check DAG-level
> concurrency,
> >>> fires up 32 tasks, and `airlfow run` double-checks DAG-level
> concurrency
> >>> limit and exits. Right after backfill reschedules right away and so on,
> >>> burning a bunch of CPU doing nothing. In this specific case it seems
> like
> >>> `airflow run` should skip that specific check when in the context of a
> >>> backfill.
> >>>
> >>> Max
> >>>
> >>> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin 
> wrote:
> >>>
> >>>> Thinking out loud here, because it is a while back that I did work on
> >>>> backfills. There were some real issues with backfills:
> >>>>
> >>>> 1. Tasks were running in non deterministic order ending up in regular
> >>>> deadlocks
> >>>> 2. Didn’t create dag runs, making behavior inconsistent. Max dag runs
> >>>> could not be enforced. Ui could really display it, lots of minor other
> >>>> issues because of it.
> 

Re: Is `airflow backfill` disfunctional?

2018-06-06 Thread Maxime Beauchemin
Thanks for the input, this is helpful.

To add to the list, there's some complexity around concurrency management
and multiple executors:
I just hit this thing where backfill doesn't check DAG-level concurrency,
fires up 32 tasks, and `airflow run` double-checks the DAG-level concurrency
limit and exits. Right after, backfill reschedules and so on,
burning a bunch of CPU doing nothing. In this specific case it seems like
`airflow run` should skip that specific check when in the context of a
backfill.

Max

On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin  wrote:

> Thinking out loud here, because it is a while back that I did work on
> backfills. There were some real issues with backfills:
>
> 1. Tasks were running in non deterministic order ending up in regular
> deadlocks
> 2. Didn’t create dag runs, making behavior inconsistent. Max dag runs
> could not be enforced. Ui could really display it, lots of minor other
> issues because of it.
> 3. Behavior was different from the scheduler, while subdagoperators
> particularly make use of backfills at the moment.
>
> I think with 3 the behavior you are observing crept in. And given 3 I
> would argue a consistent behavior between the scheduler and the backfill
> mechanism is still paramount. Thus we should explicitly clear tasks from
> failed if we want to rerun them. This at least until we move the
> subdagoperator out of backfill and into the scheduler (which is actually
> not too hard). Also we need those command line options anyway.
>
> Bolke
>
> Verstuurd vanaf mijn iPad
>
> > Op 6 jun. 2018 om 01:27 heeft Scott Halgrim 
> > 
> het volgende geschreven:
> >
> > The request was for opposition, but I’d like to weigh in on the side of
> “it’s a better behavior [to have failed tasks re-run when cleared in a
> backfill"
> >> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> maximebeauche...@gmail.com>, wrote:
> >> @Jeremiah Lowin  & @Bolke de Bruin 
> I
> >> think you may have some context on why this may have changed at some
> point.
> >> I'm assuming that when DagRun handling was added to the backfill logic,
> the
> >> behavior just happened to change to what it is now.
> >>
> >> Any opposition in moving back towards re-running failed tasks when
> starting
> >> a backfill? I think it's a better behavior, though it's a change in
> >> behavior that we should mention in UPDATE.md.
> >>
> >> One of our goals is to make sure that a failed or killed backfill can be
> >> restarted and just seamlessly pick up where it left off.
> >>
> >> Max
> >>
> >>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng  wrote:
> >>>
> >>> After discussing with Max, we think it would be great if `airflow
> backfill`
> >>> could be able to auto pick up and rerun those failed tasks. Currently,
> it
> >>> will throw exceptions(
> >>>
> >>>
> https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489
> >>> )
> >>> without rerunning the failed tasks.
> >>>
> >>> But since it broke some of the previous assumptions for backfill, we
> would
> >>> like to get some feedback and see if anyone has any concerns(pr could
> be
> >>> found at https://github.com/apache/incubator-airflow/pull/3464/files).
> >>>
> >>> Thanks,
> >>> -Tao
> >>>
> >>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> >>> maximebeauche...@gmail.com> wrote:
> >>>
> >>>> So I'm running a backfill for what feels like the first time in years
> >>> using
> >>>> a simple `airflow backfill --local` commands.
> >>>>
> >>>> First I start getting a ton of `logging.info` of each tasks that
> cannot
> >>> be
> >>>> started just yet at every tick flooding my terminal with the keyword
> >>>> `FAILED` in it, looking like a million of lines like this one:
> >>>>
> >>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies not met
> >>> for
> >>>> ,
> >>>> dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' re
> >>>> quires all upstream tasks to have succeeded, but found 1
> non-success(es).
> >>>> upstream_tasks_state={'successes': 0L, 'failed': 0L,
> 'upstream_failed':
> >>>> 0L,
> >>>> 'skipped': 0L, 'done': 0L}, upstream_

Airflow-related Talk: Functional Data Engineering, a set of Best Practices

2018-06-05 Thread Maxime Beauchemin
I'm taking the freedom to share my talk from DataEngConf 2018 with this
group since it's somewhat related to Airflow.

https://www.youtube.com/watch?v=4Spo2QRTz1k

Related blog post:
https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a

Max


Re: Is `airflow backfill` disfunctional?

2018-06-05 Thread Maxime Beauchemin
@Jeremiah Lowin  & @Bolke de Bruin  I
think you may have some context on why this may have changed at some point.
I'm assuming that when DagRun handling was added to the backfill logic, the
behavior just happened to change to what it is now.

Any opposition in moving back towards re-running failed tasks when starting
a backfill? I think it's a better behavior, though it's a change in
behavior that we should mention in UPDATE.md.

One of our goals is to make sure that a failed or killed backfill can be
restarted and just seamlessly pick up where it left off.

Max

On Tue, Jun 5, 2018 at 3:25 PM Tao Feng  wrote:

> After discussing with Max, we think it would be great if `airflow backfill`
> could be able to auto pick up and rerun those failed tasks. Currently, it
> will throw exceptions(
>
> https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489
> )
> without rerunning the failed tasks.
>
> But since it broke some of the previous assumptions for backfill, we would
> like to get some feedback and see if anyone has any concerns(pr could be
> found at https://github.com/apache/incubator-airflow/pull/3464/files).
>
> Thanks,
> -Tao
>
> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
> > So I'm running a backfill for what feels like the first time in years
> using
> > a simple `airflow backfill --local` commands.
> >
> > First I start getting a ton of `logging.info` of each tasks that cannot
> be
> > started just yet at every tick flooding my terminal with the keyword
> > `FAILED` in it, looking like a million of lines like this one:
> >
> > [2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies not met
> for
> > ,
> > dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' re
> > quires all upstream tasks to have succeeded, but found 1 non-success(es).
> > upstream_tasks_state={'successes': 0L, 'failed': 0L, 'upstream_failed':
> > 0L,
> > 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other_task_id']
> >
> > Good thing I triggered 1 month and not 2 years like I actually need, just
> > the logs here would be "big data". Now I'm unclear whether there's
> anything
> > actually running or if I did something wrong, so I decide to kill the
> > process so I can set a smaller date range and get a better picture of
> > what's up.
> >
> > I check my logging level, am I in DEBUG? Nope. Just INFO. So I take a
> note
> > that I'll need to find that log-flooding line and demote it to DEBUG in a
> > quick PR, no biggy.
> >
> > Now I restart with just a single schedule, and get an error `Dag
> {some_dag}
> > has reached maximum amount of 3 dag runs`. Hmmm, I wish backfill could
> just
> > pickup where it left off. Maybe I need to run an `airflow clear` command
> > and restart? Ok, ran my clear command, same error is showing up. Dead
> end.
> >
> > Maybe there is some new `airflow clear --reset-dagruns` option? Doesn't
> > look like it... Maybe `airflow backfill` has some new switches to pick up
> > where it left off? Can't find it. Am I supposed to clear the DAG Runs
> > manually in the UI?  This is a pre-production, in-development DAG, so
> it's
> > not on the production web server. Am I supposed to fire up my own web
> > server to go and manually handle the backfill-related DAG Runs? Cannot to
> > my staging MySQL and do manually clear some DAG runs?
> >
> > So. Fire up a web server, navigate to my dag_id, delete the DAG runs, it
> > appears I can finally start over.
> >
> > Next thought was: "Alright looks like I need to go Linus on the mailing
> > list".
> >
> > What am I missing? I'm really hoping these issues specific to 1.8.2!
> >
> > Backfilling is core to Airflow and should work very well. I want to
> restate
> > some reqs for Airflow backfill:
> > * when failing / interrupted, it should seamlessly be able to pickup
> where
> > it left off
> > * terminal logging at the INFO level should be a clear, human consumable,
> > indicator of progress
> > * backfill-related operations (including restarts) should be doable
> through
> > CLI interactions, and not require web server interactions as the typical
> > sandbox (dev environment) shouldn't assume the existence of a web server
> >
> > Let's fix this.
> >
> > Max
> >
>


Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Maxime Beauchemin
Agreed, secured by default is ideal. Though I wouldn't want people to get
an unreasonable sense of safety and open their instance to the web.

I like the idea of generating a temporary key/token and exposing it in the
console where the process was started. Other option is to use the
database/password mechanism by default and add a `airflow create-user
--admin`  CLI command to generate a user. With the level of cluelessness
we're observing we should probably force a certain password complexity
level.

We should also state clearly in the docs that Airflow is not regularly
pen-tested and should not be exposed to the Internet.

For the record we had Airflow pen-tested at Airbnb by a third party in 2016
(or was it 2017?) and found/resolved half a dozen or so vulnerabilities.
Note that there's no recurring process in place, or any mechanisms to
prevent regressions beyond code review. Also note that the new [beta in
1.10] UI has not been pen tested (to my knowledge).

Max

On Tue, Jun 5, 2018 at 2:48 PM Bolke de Bruin  wrote:

> Tbh I like to go to a setup where it is secure by default. Airflow is
> getting more and more used so it also increases the attack surface. If you
> run “initdb” or “resetdb” it is easy to provide a generated password.
>
> I don’t see a reason anymore for having a unsecured version.
>
> B.
>
> Verstuurd vanaf mijn iPad
>
> > Op 5 jun. 2018 om 23:11 heeft Christopher Bockman 
> het volgende geschreven:
> >
> > +1 to being able to disable--we have authentication in place, but use a
> > separate solution that (probably?) Airflow won't realize is enabled, so
> > having a continuous giant warning banner would be rather unfortunate.
> >
> >> On Tue, Jun 5, 2018 at 2:05 PM, Alek Storm 
> wrote:
> >>
> >> This is a great idea, but we'd appreciate a setting that disables the
> >> banner even if those conditions aren't met - our instance is deployed
> >> without authentication, but is only accessible via our intranet.
> >>
> >> Alek
> >>
> >>
> >> On Tue, Jun 5, 2018, 3:35 PM James Meickle 
> >> wrote:
> >>
> >>> I think that a banner notification would be a fair penalty if you
> access
> >>> Airflow without authentication, or have API authentication turned off,
> or
> >>> are accessing via http:// with a non-localhost `Host:`. (Are there any
> >>> other circumstances to think of?)
> >>>
> >>> I would also suggest serving a default robots.txt to mitigate
> accidental
> >>> indexing of public instances (as most public instances will be
> >> accidentally
> >>> public, statistically speaking). If you truly want your Airflow
> instance
> >>> public and indexed, you should have to go out of your way to permit
> that.
> >>>
> >>> On Tue, Jun 5, 2018 at 1:51 PM, Maxime Beauchemin <
> >>> maximebeauche...@gmail.com> wrote:
> >>>
> >>>> What about a clear alert on the UI showing when auth is off? Perhaps a
> >>>> large red triangle-exclamation icon on the navbar with a tooltip
> >>>> "Authentication is off, this Airflow instance in not secure." and
> >>> clicking
> >>>> take you to the doc's security page.
> >>>>
> >>>> Well and then of course people should make sure their infra isn't open
> >> to
> >>>> the Internet. We really shouldn't have to tell people to keep their
> >>>> infrastructure behind a firewall. In most environments you have to do
> >>> quite
> >>>> a bit of work to open any resource up to the Internet (SSL certs,
> >> special
> >>>> security groups for load balancers/proxies, ...). Now I'm curious to
> >>>> understand how UMG managed to do this by mistake...
> >>>>
> >>>> Also a quick reminder to use the Connection abstraction to store
> >> secrets,
> >>>> ideally using the environment variable feature.
> >>>>
> >>>> Max
> >>>>
> >>>> On Tue, Jun 5, 2018 at 10:02 AM Taylor Edmiston 
> >>>> wrote:
> >>>>
> >>>>> One of our engineers wrote a blog post about the UMG mistakes as
> >> well.
> >>>>>
> >>>>> https://www.astronomer.io/blog/universal-music-group-airflow-leak/
> >>>>>
> >>>>> I know that best practices are well known here, but I second James'
> >>>>> suggestion that we add some docs, code, or 

Re: PSA: Make sure your Airflow instance isn't public and isn't Google indexed

2018-06-05 Thread Maxime Beauchemin
What about a clear alert on the UI showing when auth is off? Perhaps a
large red triangle-exclamation icon on the navbar with a tooltip
"Authentication is off, this Airflow instance in not secure." and clicking
take you to the doc's security page.

Well and then of course people should make sure their infra isn't open to
the Internet. We really shouldn't have to tell people to keep their
infrastructure behind a firewall. In most environments you have to do quite
a bit of work to open any resource up to the Internet (SSL certs, special
security groups for load balancers/proxies, ...). Now I'm curious to
understand how UMG managed to do this by mistake...

Also a quick reminder to use the Connection abstraction to store secrets,
ideally using the environment variable feature.
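
For example, a connection can be injected through an environment variable
named AIRFLOW_CONN_ followed by the upper-cased conn_id, keeping credentials
out of DAG files (the URI below is obviously illustrative):

export AIRFLOW_CONN_MY_WAREHOUSE='postgres://user:***@warehouse-host:5432/analytics'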

Max

On Tue, Jun 5, 2018 at 10:02 AM Taylor Edmiston  wrote:

> One of our engineers wrote a blog post about the UMG mistakes as well.
>
> https://www.astronomer.io/blog/universal-music-group-airflow-leak/
>
> I know that best practices are well known here, but I second James'
> suggestion that we add some docs, code, or config so that the framework
> optimizes for being (nearly) production-ready by default and not just easy
> to start with for local dev.  Admittedly this takes some work to not add
> friction to the local onboarding experience.
>
> Do most people keep separate airflow.cfg files per environment like what's
> considered the best practice in the Django world?  e.g.
> https://stackoverflow.com/q/10664244/149428
>
> Taylor
>
> *Taylor Edmiston*
> Blog  | CV
>  | LinkedIn
>  | AngelList
>  | Stack Overflow
> 
>
>
> On Tue, Jun 5, 2018 at 9:57 AM, James Meickle 
> wrote:
>
> > Bumping this one because now Airflow is in the news over it...
> >
> > https://www.bleepingcomputer.com/news/security/contractor-
> > exposes-credentials-for-universal-music-groups-it-
> > infrastructure/?utm_campaign=Security%2BNewsletter&utm_
> > medium=email&utm_source=Security_Newsletter_co_79
> >
> > On Fri, Mar 23, 2018 at 9:33 AM, James Meickle 
> > wrote:
> >
> > > While Googling something Airflow-related a few weeks ago, I noticed
> that
> > > someone's Airflow dashboard had been indexed by Google and was
> accessible
> > > to the outside world without authentication. A little more Googling
> > > revealed a handful of other indexed instances in various states of
> > > security. I did my best to contact the operators, and waited for
> > responses
> > > before posting this.
> > >
> > > Airflow is not a secure project by default (https://issues.apache.org/
> > > jira/browse/AIRFLOW-2047), and you can do all sorts of mean things to
> an
> > > instance that hasn't been intentionally locked down. (And even then,
> you
> > > shouldn't rely exclusively on your app's authentication for providing
> > > security.)
> > >
> > > Having "internal" dashboards/data sources/executors exposed to the web
> is
> > > dangerous, since old versions can stick around for a very long time,
> help
> > > compromise unrelated deployments, and generally just create very bad
> > press
> > > for the overall project if there's ever a mass compromise (see: Redis
> and
> > > MongoDB).
> > >
> > > Shipping secure defaults is hard, but perhaps we could add best
> practices
> > > like instructions for deploying a robots.txt with Airflow? Or an impact
> > > statement about what someone could do if they access your Airflow
> > instance?
> > > I think that many people deploying Airflow for the first time might not
> > > realize that it can get indexed, or how much damage someone can cause
> via
> > > accessing it.
> > >
> >
>


Re: Dealing with data latency

2018-06-04 Thread Maxime Beauchemin
The common standard is to have the execution_date aligned with the
partition date in the database (say 2018-08-08), with that partition
containing data from 2018-08-08T00:00:00.000 to 2018-08-08T23:59:59.999.

The partition date and execution_date match and correspond to the left
bound of the time interval processed.

Then you'd use some sensors to make sure this cannot run until the desired
time or conditions are met.
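
A minimal sketch of that pattern (import paths per the 1.9-era layout, and
the load_feed command is purely hypothetical):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.sensors import TimeDeltaSensor

dag = DAG(
    dag_id='daily_feed',
    start_date=datetime(2018, 5, 1),
    schedule_interval='@daily',
)

# blocks until 72 hours past the end of the daily period being processed
wait_for_final_data = TimeDeltaSensor(
    task_id='wait_for_final_data',
    delta=timedelta(hours=72),
    dag=dag,
)

# {{ ds }} is the execution_date (the left bound), which matches the partition
load_partition = BashOperator(
    task_id='load_partition',
    bash_command='load_feed --partition-date {{ ds }}',
    dag=dag,
)

wait_for_final_data >> load_partition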

Max

On Mon, Jun 4, 2018 at 5:46 AM Pedro Machado  wrote:

> Hi. What is the recommended way to deal with data latency? For example, I
> have a feed that is not considered final until 72 hours have passed after
> the end of the daily period.
>
> For example, Monday's data would be ready by Thursday at 23:59.
>
> Should I pull data based on the execution date minus a 72 hour offset or
> use the execution date and somehow delay the data pull for 72 hours?
>
> The latter would be more intuitive (data pull date = execution date) but I
> am not sure if it's a good pattern.
>
> Thanks,
>
> Pedro
>


Re: Enable Travis CI Auto Cancellation?

2018-05-31 Thread Maxime Beauchemin
Good call, I opened https://issues.apache.org/jira/browse/INFRA-16604 since
none of the committers have the required rights to do this.

Max

On Wed, May 30, 2018 at 9:26 PM Craig Rodrigues  wrote:

> Can someone who has administrator access to Travis CI enable
> Auto-Cancellation on branch and Pull Requests builds?
>
> See:
>
> https://blog.travis-ci.com/2017-09-21-default-auto-cancellation
>
> --
> Craig
>


Re: Disable Processing of DAG file

2018-05-29 Thread Maxime Beauchemin
The TLDR of how the processor works is:

while True:
* sets up a multiprocessing queue with N processes (say 32)
* main process looks for the list of all .py files in DAGS_FOLDER
* fills in the queue with all .py files
* each of the 32 subprocesses opens a file and interprets it (it's
insulated from the main process, a sys.exit() wouldn't affect the main
process), looks for DAG objects in the module namespace
* if it finds a DAG object, it looks for active DAG runs, and creates new
DAG runs if a new schedule is ready to start
* for each active DAG run, it looks at all "runnable" tasks and checks
whether their dependencies have been met
* returns the list of all tasks ready to get triggered to the main process
* main process waits for a certain specified amount of time, accumulating
the task instances that are ready to run
* the scheduling train leaves the station: tasks are prioritized based on
priority_weight and scheduled where pool slots are available
* other supervisor-type tasks, like handling zombie tasks and such

A long long time ago we didn't have subprocesses and things like a DAG with
a `sys.exit()` would crash the scheduler, and modules imported in DAG
files would get cached in `sys.modules` unless you'd force
`reload(my_submodule)`. There was (and still is) a flag on the scheduler
CLI command to force it to exit after a certain number of runs so that your
service would restart it in a loop and flush sys.modules. But those days
are long gone, and there's no reason to do this anymore.

Max


On Mon, May 28, 2018 at 11:29 PM Ruiqin Yang  wrote:

> Hi folks,
> This config line
> <
> https://github.com/apache/incubator-airflow/blob/master/airflow/config_templates/default_airflow.cfg#L414
> >
> controls how often the scheduler scan the DAG folder and tries to discover/
> forget DAGs.
>
> For doing dag file processing part, scheduler does parse the DAG file
> everytime before it schedules tasks through DagFileProcessor.
>
> Cheers,
> Kevin Y
>
> On Mon, May 28, 2018 at 10:14 PM, Ananth Durai 
> wrote:
>
> > It is an interesting question. On a slightly related note, Correct me if
> > I'm wrong, AFAIK we require restarting airflow scheduler in order pick
> any
> > new DAG file changes by the scheduler. In that case, should the scheduler
> > do the DAGFileProcessing every time before scheduling the tasks?
> >
> > Regards,
> > Ananth.P,
> >
> >
> >
> >
> >
> >
> > On 28 May 2018 at 21:46, ramandu...@gmail.com 
> > wrote:
> >
> > > Hi All,
> > >
> > > We have a use case where there would be 100(s) of DAG files with
> schedule
> > > set to "@once". Currently it seems that scheduler processes each and
> > every
> > > file and creates a Dag Object.
> > > Is there a way or config to tell scheduler to stop processing certain
> > > files.
> > >
> > > Thanks,
> > > Raman Gupta
> > >
> >
>


Re: What are the rules / policies for graduating classes out of airflow.contrib?

2018-05-29 Thread Maxime Beauchemin
* At least one active committer that runs that code in their environment
and cares enough and has enough context to review / fix things if need be
* Decent code quality
* Decent unit test coverage
* Decent underlying libraries (no dependencies on unmaintained/unpopular
libs)

About the wiki, I agree that it may make sense to bring more over to the
Sphinx side of things. I think we'd welcome PRs that move docs
over that way.

Max

On Tue, May 29, 2018 at 11:17 AM Tim Swast  wrote:

> I'm investigating what is required for graduating operators / sensors / ...
> out of airflow.contrib, but I couldn't find any official docs on either the
> wiki, the docs, or GitHub.
>
> What are the requirements for moving something out of contrib?
>
> P.S. It seems like the wiki is pretty locked down. I don't seem to be able
> to make comments or minor edits.
>
> P.P.S. Maybe we should migrate some of the existing policy docs & how-to
> from the wiki to a  the Sphinx docs since I could figure out how to propose
> edits to those but not the wiki, ironically.
>
> *  •  **Tim Swast*
> *  •  *Software Friendliness Engineer
> *  •  *Google Cloud Developer Relations
> *  •  *Seattle, WA, USA
>


Re: conn_id breaking change; once more with feeling

2018-05-29 Thread Maxime Beauchemin
The main reason for the conn_id prefix is to facilitate the use of
`default_args`. Because of this you can set all your connections at the top
of your script and from that point on you just instantiate tasks without
re-stating connections. It's common for people to define multiple
"operating contexts" (production/staging/dev) where default_args and
connections are defined conditionally based on some env vars, and having
different names for different conn_ids is key in that case.
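
As an illustration (a minimal sketch; the conn ids and operators are
arbitrary, and each operator only picks up the default_args entries its
constructor accepts):

from datetime import datetime

from airflow import DAG
from airflow.operators.hive_operator import HiveOperator
from airflow.operators.mysql_operator import MySqlOperator

# these conn ids could be switched on an env var for prod/staging/dev contexts
default_args = {
    'owner': 'data-eng',
    'start_date': datetime(2018, 1, 1),
    'hive_cli_conn_id': 'hive_prod',
    'mysql_conn_id': 'mysql_prod',
}

dag = DAG('example_dag', default_args=default_args, schedule_interval='@daily')

# no conn_id restated on the tasks, they inherit it from default_args
load = HiveOperator(task_id='load', hql='SELECT 1', dag=dag)
audit = MySqlOperator(task_id='audit', sql='SELECT 1', dag=dag)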

Also as you mentioned many operators (all data transfers) take 2 conn_ids
which would require prefixing anyways.

And yes, minor releases should never invalidate DAGs except for rare
exceptions (say a new feature that is still pre-release, or never worked
properly in previous release for some reason). Breaking changes should come
with an UPDATE.md message aligned with the release. Pretty names don't
justify breaking stuff and forcing people to grep and mass replace things.
If someone wants a prettier name for an argument or anything else that's in
the "obviously-public API" (DAG objects, operators, setting deps, ...) they
need to make the change backward compatible and start
logging.warning() about deprecation in the next major release.
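
Something like this is usually enough (a sketch with made-up names,
MyOperator and my_old_conn_id, not an actual Airflow operator):

import logging

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class MyOperator(BaseOperator):

    @apply_defaults
    def __init__(self, conn_id=None, my_old_conn_id=None, *args, **kwargs):
        super(MyOperator, self).__init__(*args, **kwargs)
        if my_old_conn_id:
            # keep accepting the old kwarg, warn until the next major release
            logging.warning(
                "`my_old_conn_id` is deprecated and will be removed in the "
                "next major release, use `conn_id` instead")
            conn_id = conn_id or my_old_conn_id
        self.conn_id = conn_id

    def execute(self, context):
        pass  # use self.conn_id here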

I also think 2.0 should be as mild as possible on backward incompatible
changes or come with a compatibility layer (from airflow import LegacyDAG
as DAG). No one wants to go and mass update tons of scripts.

The same should hold for the less-obviously public API (hooks, methods
not prefixed with `_` or `__`), though I think breaking it may be reasonable
in some cases (say a method that really should have been defined as private);
it should still be avoided as much as possible.

*Committers*, let's be vigilant about this. Breaking backward compatibility
is a big deal. An important part of code review is identifying backward
incompatible changes.

Max

On Tue, May 29, 2018 at 6:19 PM Daniel (Daniel Lamblin) [BDP - Seoul] <
lamb...@coupang.com> wrote:

> The short of this email is: can we please name all the connection id named
> parameters to all hooks and operators as similarly as possible. EG just
> `conn_id`?
>
> So, when we started using Airflow I _had_ thought that minor versions
> would be compatible for a user's DAG, assuming no use of anything marked
> beta or deprecated, such that v1.7.1, 1.8.0, 1.8.1, 1.8.2 and 1.9.0 etc
> would all run dags from prior versions in that linage, each with possible
> stability and feature improvements and each with possibly more operators,
> hooks, executors (even) etc.
>
> This is (now) obviously not the case, and it's the group's choice about
> what should and what should not break on a release-by-release basis. I
> think a clear policy would be appropriate for full Apache status, but then
> I may have missed where the policy is defined.  Though, if defined as not
> giving stability to the user's dags for most version changes isn't in my
> opinion going to grow confidence for Airflow being something you can grow
> with.
>
> Not to be overly dramatic, but currently the tiny change that the
> `s3Hook(…)` takes `aws_conn_id='any_string'` vs
> `s3_conn_id='still_any_string'` means that basically I have to maintain a
> 1.8.2 setup in perpetuity, because no one (here) wants to briefly code
> freeze before during and after an update so that we can update all the uses
> and be ready to roll back the update if something else breaks (also some
> people access the not-actually-private `_a_key and _s_key` and would need
> to switch to still-not-private `_get_credentials()[…]`). Instead we'll just
> run a second airflow setup at 1.9.0 (in each and every staged environment)
> and move the 8k dags piecemeal whe[never] people get the spare time and
> inclination. I mean, we might. It looks like 1.10.0 changes some config
> around s3 logging one more time… so maybe it's better to skip it?
>
> I saw the couple of PRs where the project itself had to make the changes
> to usages of the named field. There was momentary and passing concern that
> users' dags would need to do the same. In the PRs, of the options discussed
> (shims, supporting the deprecated name as deprecated, doing a hard rename),
> it wasn't brought up if the rename to aws_conn_id was the best name. [Was
> this discussed elsewhere?]
>
> And so I wondered why is there this virtual Hungarian notation on all the
> `conn_id`s?
> A hook generally takes one `conn_id`, most operators take only one. In
> these cases couldn't the named parameter have been `conn_id`? When an
> operator needs a couple conn_ids, could it not have `src_conn_id` and
> `dest_conn_id` instead of locking in either s3 or aws, mysql or maria,
> dataflow or beam etc? Those are hypotheticals, I believe.
>
> Could I propose to plan to break absolutely all uses of
> `{named}_conn_id`s, before or by version 2.0, with an eye towards never
> again having to break a named parameter for the rest of 2.x's life? There's
> probably more named parameters that should be fixed pe

Re: Using Airflow with dataset dependant flows (not date)

2018-05-29 Thread Maxime Beauchemin
Hi,

Assuming the shape of your DAG is the same across runs, the prescribed way
is to define the DAG with schedule_interval=None and to create your DAG
Runs on demand. You can do so programmatically (using the ORM:
airflow.models.DagRun) (cli: airflow trigger_dag) or through REST.

If your DAG shape is different based on parameters, then conceptually it is
not a single DAG, it's many, and you'd have to programmatically make
different DAG objects with a schedule_interval='@once'. Note that all these
external trigger options take a `run_id` argument, which as you guess is
some unique identifier for your run, stored in the DAG Run table. If you're
processing video files, that could be the path to the file you are
processing or whatever unique id makes sense to you.
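For illustration, a minimal sketch of creating such an on-demand run, keyed by
a meaningful run_id (the dag id, run_id and conf values are just examples;
flags per the 1.x CLI):

    # CLI
    airflow trigger_dag process_video \
        --run_id 'video__clip_0042' \
        --conf '{"path": "s3://bucket/incoming/clip_0042.mp4"}'

    # Or programmatically, one option being DAG.create_dagrun via the ORM
    from datetime import datetime
    from airflow.models import DagBag
    from airflow.utils.state import State

    dag = DagBag().get_dag('process_video')
    dag.create_dagrun(
        run_id='video__clip_0042',
        execution_date=datetime.utcnow(),
        state=State.RUNNING,
        conf={'path': 's3://bucket/incoming/clip_0042.mp4'},
        external_trigger=True)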

(from memory) internally that `run_id` isn't really used as the key (though
there's a unique constraint on it in the DB), the arbitrary (dag_id,
execution_date) tuple is, because the system was built in a very
schedule-centric way at the beginning. BUT, the cli commands that take the
`execution_date` argument could be retrofitted to also accept a `run_id`
instead, presumably easily.

Max

On Tue, May 29, 2018 at 5:59 PM Daniel (Daniel Lamblin) [BDP - Seoul] <
lamb...@coupang.com> wrote:

> Hi Javier;
> I'm afraid I'm not familiar enough with the overall architecture of
> Airflow to propose the right set of changes, and to decompose the work into
> PRs that are independently staged. But as dataset based processing is one
> of the items keeping some teams in my company on an internal scheduler tool
> rather than moving to Airflow, I'm generally in support of the idea that
> Airflow might support that (it also supports "schedules" defined on the
> completion of a prior DAGs).
>
> I hope that someone with the current roadmap in mind can chime in on
> whether datasets have been discussed and whether they are off the table for
> a reason, or have a known set of prerequisite changes. And if not, then
> maybe someone could posit the right kind of design for adding datasets
> Is the roadmap:
> https://cwiki.apache.org/confluence/display/AIRFLOW/Roadmap or
> https://cwiki.apache.org/confluence/display/AIRFLOW/2017+Roadmap+Items
> the roadmap label I can't seem to filter by.
>
> From: Javier Domingo Cansino 
> Date: Tuesday, May 29, 2018 at 5:47 PM
> To: Daniel ­ 
> Cc: "dev@airflow.incubator.apache.org" ,
> "d...@airflow.apache.org" 
> Subject: Re: Using Airflow with dataset dependent flows (not date)
>
> Hello Daniel,
>
> Thanks for your answer, I have been able to try your suggested solution,
> and as expected it works fine. However I have found that because the
> parametrization always comes with an execution_date, it can be misleading
> to users to have all runs still depending on that parameter. I could
> generate a cli on top of airflow that would hide the fact that we are
> circumventing the usecase through that wildcard parameter, however the
> maintenance of such tool, in addition that it would be "our" way of doing
> things, could actually cause more harm than actually using other system in
> our case.
>
> After diving into the code for a few days, and reading your suggestions
> today, I agree that new web views would be required for this kind of runs,
> however I still think the parametrization would still need to be more
> controlled (as in part of the system) to feel comfortable with the
> stability of the solution. I also agree with you that we would need to have
> a new kind of schedulers too, such as Kafka messages based or database
> changes tracking ones.
>
> You mention that a lot would have to be changed for that. What steps do
> you think we could do to decompose the problem in smaller and more
> affordable steps?
>
> Cheers, Javier
>
> On Mon, May 28, 2018 at 10:28 AM Daniel (Daniel Lamblin) [BDP - Seoul] <
> lamb...@coupang.com> wrote:
> This seemed like a very clear explanation of the JIRA ticket and the idea
> of making dagruns depend not on a schedule but the arrival of a dataset.
> I think a lot would have to change if the execution date was changed to a
> parameterized value, and that's not the only thing that would have to
> change to support a dataset trigger.
>
> Thinking about the video encoding example, it seems the airflow way to kind
> of do that would be to have dataset dags be dependent on a dag that is
> frequently scheduled to run just a TriggerDagRunOperator which contains a
> python callable polling for the new datasets (or subscribing to a queue of
> updates about them) which then decides which DAG ID to trigger for the
> particular dataset, and what dag_run_obj.payload should be to inform it of
> the right dataset to run on.
> You might want to write a plugin that give a different kind of tree view
> for these types of DAGs that get triggered this way so that you can easily
> see the dataset and payload specifics in the overview of the runs.
>
> There's an example of triggering a dag with a
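To make the controller-DAG pattern described above concrete, here is a minimal
sketch using the 1.x TriggerDagRunOperator; the dag ids and the dataset check
are hypothetical:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dagrun_operator import TriggerDagRunOperator

    def maybe_trigger(context, dag_run_obj):
        # Hypothetical poll for a newly arrived dataset (S3 listing, queue, ...).
        # Returning None skips the trigger for this tick.
        new_file = 's3://bucket/incoming/clip_0042.mp4'
        if new_file:
            dag_run_obj.payload = {'dataset': new_file}
            return dag_run_obj

    controller = DAG('dataset_watcher',
                     start_date=datetime(2018, 1, 1),
                     schedule_interval='*/10 * * * *',
                     catchup=False)

    TriggerDagRunOperator(
        task_id='watch_for_datasets',
        trigger_dag_id='process_dataset',  # the DAG that does the real work
        python_callable=maybe_trigger,
        dag=controller)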

Re: Convert Dag Run from Backfill to Scheduled?

2018-05-29 Thread Maxime Beauchemin
Yes, clearly the DAG runs can be in inconsistent states with related task
instances and backfill processes. Here's a quick patch that helps a little:
https://github.com/apache/incubator-airflow/pull/3433

After writing the quick patch above I'm thinking it requires a bit more
thinking. The clear command is effectively a bit of a way to issue a
"scheduler-driven backfill", maybe we can deprecate clear and have a new
"airflow backfill --scheduler", which would effectively clear task
instances and create/set DAG runs in the right state.
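For reference, the raw-SQL workaround discussed below can also be expressed
through the ORM; a sketch only, and as noted not the prescribed path:

    from airflow import settings
    from airflow.models import DagRun
    from airflow.utils.state import State

    session = settings.Session()
    runs = (session.query(DagRun)
            .filter(DagRun.dag_id == 'daily',
                    DagRun.run_id.like('backfill_%'))
            .all())
    for run in runs:
        # mirror the UPDATE in the thread: rename so the scheduler picks them up
        run.run_id = 'scheduled__' + run.run_id[len('backfill_'):]
        run.state = State.RUNNING
    session.commit()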

Max

On Tue, May 29, 2018 at 5:58 PM Ruiqin Yang  wrote:

> This line
> <
> https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L935
> >
> is
> where the scheduler skips the backfill DAG runs. Despite what state the DAG
> run is in, tasks in DAG runs whose run_id starts with 'backfill_' would not
> be considered when scheduling.
>
> I agree with Dan Davydov's idea that we should at least have something like
> multiple DAG runs for one execution to distinguish different DAG runs like
> scheduled and backfilled. The situation Scott is facing here is not the
> only case that the lack of multiple DAG runs has caused (e.g. manually
> triggering a task in the UI should also create a separate DAG run, otherwise
> the implementation logic is a bit weird).
>
> Cheers,
> Kevin Y
>
> On Tue, May 29, 2018 at 5:52 PM Scott Halgrim
>  wrote:
>
> > Well I’ve gone ahead and run the UPDATE query now, so the scheduler is
> > picking up tasks.
> >
> > When I cleared the tasks, every DAG run that had a cleared task in it was
> > set to running. Because I’d backfilled them all they were all `backfill_`
> > dag runs.  Inspection of various tasks via `task_failed_deps` indicated
> the
> > tasks had all their dependencies filled. After running the update query,
> > they’re all `scheduled__` dag runs.
> >
> > On May 29, 2018, 5:02 PM -0700, Maxime Beauchemin <
> > maximebeauche...@gmail.com>, wrote:
> > > While this may work it's clearly not the prescribed way to do this.
> > > Clearing should just work.
> > >
> > > I'm trying to understand why the scheduler is not picking up the
> cleared
> > > task. Clearing should remove the task instance state and set the state
> of
> > > the related DAG Run to running so that the scheduler picks those up.
> > > Perhaps there's a conflict between the backfill and scheduler-related
> DAG
> > > Runs? Which DAG runs are set to running? The backfill or
> > scheduler-related
> > > ones?
> > >
> > > Originally when I introduced DAG runs, backfill was operating without
> any
> > > consideration related to DAG runs (DAG runs were a scheduler-specific
> > > construct), later on Bolke added backfill-specific DAG runs and I'm not
> > > 100% sure how that works.
> > >
> > > Let's get to the bottom of this.
> > >
> > > Max
> > >
> > > On Fri, May 25, 2018 at 7:48 PM Ruiqin Yang  wrote:
> > >
> > > > If you are sure the update query targets the desired rows, the
> behavior
> > > > should be the same.
> > > >
> > > > Scott Halgrim wrote on Friday, 25 May 2018 at 4:23 PM:
> > > >
> > > > > So far no ill effects from:
> > > > >
> > > > > update dag_run
> > > > > set run_id = concat('scheduled__', substring(run_id, 10, 19))
> > > > > where dag_id = 'daily'
> > > > > and execution_date > '2017-08-31' and execution_date < '2018-01-11'
> > > > > and run_id like 'backfill_%'
> > > > > order by execution_date;
> > > > >
> > > > > On May 25, 2018, 4:03 PM -0700, Scott Halgrim <
> > scott.halg...@zapier.com
> > > > > ,
> > > > > wrote:
> > > > > > Oh wow, that will work? Thanks! Is there any reason for me not to
> > just
> > > > > run a mass UPDATE on those dag runs directly in the metadata
> > database?
> > > > > >
> > > > > > On May 25, 2018, 4:01 PM -0700, Ruiqin Yang ,
> > > > wrote:
> > > > > > > Airflow is not going to schedule backfill DAG runs, by looking
> > at the
> > > > > dag
> > > > > > > run ID (which will start by 'backfill__'). If you want the
> > scheduler
> > > > to
> > > > > > > schedule those tasks, you can click the DAG run and edit its
> name
> > > > back
> > > > > to
> > > > > > > 'scheduled__'
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Kevin Y
> > > > > > >
> > > > > > > On Fri, May 25, 2018 at 3:53 PM, Scott Halgrim <
> > > > > > > scott.halg...@zapier.com.invalid> wrote:
> > > > > > >
> > > > > > > > I’ve got four months of dag runs that were scheduled dag
> runs,
> > > > then I
> > > > > > > > backfilled them. And now when I clear a task from one of
> those
> > the
> > > > > dag run
> > > > > > > > goes to “running,” but none of the tasks get scheduled
> (unless
> > I
> > > > > manually
> > > > > > > > backfill each of them)
> > > > > > > >
> > > > > > > > What I really should have done here was just cleared a
> mid-dag
> > task
> > > > > as
> > > > > > > > well as all downstream tasks for these dag runs, but, well,
> > now I’m
> > > > > here
> > > > > > > > and I’m wondering what the best way to fix this.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > >
> > > > >
> > > >
> >
>


Re: Convert Dag Run from Backfill to Scheduled?

2018-05-29 Thread Maxime Beauchemin
While this may work it's clearly not the prescribed way to do this.
Clearing should just work.

I'm trying to understand why the scheduler is not picking up the cleared
task. Clearing should remove the task instance state and set the state of
the related DAG Run to running so that the scheduler picks those up.
Perhaps there's a conflict between the backfill and scheduler-related DAG
Runs? Which DAG runs are set to running? The backfill or scheduler-related
ones?

Originally when I introduced DAG runs, backfill was operating without any
consideration related to DAG runs (DAG runs were a scheduler-specific
construct), later on Bolke added backfill-specific DAG runs and I'm not
100% sure how that works.

Let's get to the bottom of this.

Max

On Fri, May 25, 2018 at 7:48 PM Ruiqin Yang  wrote:

> If you are sure the update query targets the desired rows, the behavior
> should be the same.
>
> Scott Halgrim wrote on Friday, 25 May 2018 at 4:23 PM:
>
> > So far no ill effects from:
> >
> > update dag_run
> > set run_id = concat('scheduled__', substring(run_id, 10, 19))
> > where dag_id = 'daily'
> > and execution_date > '2017-08-31' and execution_date < '2018-01-11'
> > and run_id like 'backfill_%'
> > order by execution_date;
> >
> > On May 25, 2018, 4:03 PM -0700, Scott Halgrim  >,
> > wrote:
> > > Oh wow, that will work? Thanks! Is there any reason for me not to just
> > run a mass UPDATE on those dag runs directly in the metadata database?
> > >
> > > On May 25, 2018, 4:01 PM -0700, Ruiqin Yang ,
> wrote:
> > > > Airflow is not going to schedule backfill DAG runs, by looking at the
> > dag
> > > > run ID (which will start by 'backfill__'). If you want the scheduler
> to
> > > > schedule those tasks, you can click the DAG run and edit its name
> back
> > to
> > > > 'scheduled__'
> > > >
> > > > Cheers,
> > > > Kevin Y
> > > >
> > > > On Fri, May 25, 2018 at 3:53 PM, Scott Halgrim <
> > > > scott.halg...@zapier.com.invalid> wrote:
> > > >
> > > > > I’ve got four months of dag runs that were scheduled dag runs,
> then I
> > > > > backfilled them. And now when I clear a task from one of those the
> > dag run
> > > > > goes to “running,” but none of the tasks get scheduled (unless I
> > manually
> > > > > backfill each of them)
> > > > >
> > > > > What I really should have done here was just cleared a mid-dag task
> > as
> > > > > well as all downstream tasks for these dag runs, but, well, now I’m
> > here
> > > > > and I’m wondering what the best way to fix this.
> > > > >
> > > > > Thanks!
> > > > >
> > > > >
> >
>


Re: Problem with the scheduler?

2018-05-24 Thread Maxime Beauchemin
Also note that for example when setting up monthly jobs, the job with an
execution_date of  `2018-02-01` will be triggered soon after the wall clock
hits `2018-03-01`, and that your start_date for the tasks in the DAG needs
to be prior to that execution_date, not the time at which you're expecting
the job to trigger.
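A minimal sketch of what that looks like in practice (ids and dates are just
examples):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # The 2018-02-01 run is only created once the wall clock passes 2018-03-01;
    # start_date just has to be on or before that execution_date.
    dag = DAG('monthly_report',
              start_date=datetime(2018, 1, 1),
              schedule_interval='@monthly')

    DummyOperator(task_id='report', dag=dag)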

Max

On Thu, May 24, 2018 at 9:23 AM Victor Noagbodji <
vnoagbo...@amplify-nation.com> wrote:

> Hi Stephane,
>
> First cron string does not look correct. It says the 2nd of each month.
> Check here: https://crontab.guru/#0_7_/2_*_*
>
> However second looks okay to me. It says 2(-31)/2, every second day from 2
> till 31. Maybe the first string needs 1/2?
>
> Also, check this SO (sorry) answer:
> https://unix.stackexchange.com/questions/16093/how-can-i-tell-cron-to-run-a-command-every-other-day-odd-even/16094
>
> On May 23, 2018, at 5:28 PM, Stephane Bonneaud  > wrote:
>
> Hi,
>
> I came across an issue with Airflow that I am wondering how to deal with,
> and thought I should reach out to see if you know about this issue and/or
> have advice on how to deal with it.
>
> I have around 6 or 7 DAGs on Airflow that are scheduled to run at
> different intervals. Two of them should run every other day and, for some
> reason, they stopped running at some point, apparently not at the same
> time. So, they were running properly every other day, but then, they just
> stopped being scheduled, possibly around the same time, though I suspect
> that one was not scheduled anymore before the other one. Also, it does not
> seem to happen with my other DAGs, which are scheduled to run mostly daily,
> or bi-monthly. So maybe it has something to do with my scheduling string?
>
> Here is the scheduling string for each of the DAGs: '0 7 */2 * *’ and '0 7
> 2/2 * *’
> This seems to be the proper CRON string.
>
> Have you seen that problem before?
> How would you go about debugging this or do you have any advice/pointers
> for me?
>
> Thank you very much for your time and help,
> Have a wonderful rest of the day,
>
> Best regards,
> Stéphane
>
> http://stephanebonneaud.com
> (401) 580-0817
>
>


Is `airflow backfill` dysfunctional?

2018-05-24 Thread Maxime Beauchemin
So I'm running a backfill for what feels like the first time in years using
a simple `airflow backfill --local` command.

First I start getting a ton of `logging.info` of each task that cannot be
started just yet at every tick, flooding my terminal with the keyword
`FAILED` in it, looking like a million of lines like this one:

[2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies not met for
,
dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' requires
all upstream tasks to have succeeded, but found 1 non-success(es).
upstream_tasks_state={'successes': 0L, 'failed': 0L, 'upstream_failed': 0L,
'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other_task_id']

Good thing I triggered 1 month and not 2 years like I actually need, just
the logs here would be "big data". Now I'm unclear whether there's anything
actually running or if I did something wrong, so I decide to kill the
process so I can set a smaller date range and get a better picture of
what's up.

I check my logging level, am I in DEBUG? Nope. Just INFO. So I take a note
that I'll need to find that log-flooding line and demote it to DEBUG in a
quick PR, no biggy.

Now I restart with just a single schedule, and get an error `Dag {some_dag}
has reached maximum amount of 3 dag runs`. Hmmm, I wish backfill could just
pick up where it left off. Maybe I need to run an `airflow clear` command
and restart? Ok, ran my clear command, same error is showing up. Dead end.

Maybe there is some new `airflow clear --reset-dagruns` option? Doesn't
look like it... Maybe `airflow backfill` has some new switches to pick up
where it left off? Can't find it. Am I supposed to clear the DAG Runs
manually in the UI?  This is a pre-production, in-development DAG, so it's
not on the production web server. Am I supposed to fire up my own web
server to go and manually handle the backfill-related DAG Runs? Can't I
connect to my staging MySQL and manually clear some DAG runs?

So. Fire up a web server, navigate to my dag_id, delete the DAG runs, it
appears I can finally start over.

Next thought was: "Alright looks like I need to go Linus on the mailing
list".

What am I missing? I'm really hoping these issues are specific to 1.8.2!

Backfilling is core to Airflow and should work very well. I want to restate
some reqs for Airflow backfill:
* when failing / interrupted, it should seamlessly be able to pick up where
it left off
* terminal logging at the INFO level should be a clear, human consumable,
indicator of progress
* backfill-related operations (including restarts) should be doable through
CLI interactions, and not require web server interactions as the typical
sandbox (dev environment) shouldn't assume the existence of a web server

Let's fix this.

Max


Re: Airflow cli to remote host

2018-05-23 Thread Maxime Beauchemin
A quick side note to say that it's common to deploy one or many Airflow
sandboxes which are effectively the same configuration as a worker without
an actual worker instance working on it. It's similar to the concept of a
"gateway node" in Hadoop.

Users typically work in user space with a modified `airflow.cfg` that may
point to an alternate metadata database (to insulate production) that may
or may not have alternate connections registered to staging / dev
counterparts if existing, depending on policy. You'll typically find the
same Airflow package and python environment as the one used in production
with similar connectivity to other systems and databases. From there you
can run any cli commands and even fire up your own Airflow webserver that
you can tunnel into if need be.

For example at Lyft there's a simple cli application that will prepare your
remote home and hook things up (provide a working airflow.cfg, sync/clone
the pipeline repo for you, ...) so that it all works and feels similar to
other development workflows specific to Lyft. It basically automated the
whole "setting up a dev env" with the proper policies.

At Airbnb, the "data sandboxes" act as Airflow sandboxes that you can ssh
into, AND JupyterHub nodes where you can find the same home whether you ssh
or you access Jupyter.

In the Kubernetes world, it seems like there should be an easy way to order
or "lease" an Airflow sandbox that would have your home directory persisted
and mounted on that pod just for the time that you need it.

Max

On Wed, May 23, 2018 at 3:12 PM Luke Diment 
wrote:

> Fabric looks perfect for this.
> 
> From: Kyle Hamlin 
> Sent: Thursday, May 24, 2018 6:22 AM
> To: dev@airflow.incubator.apache.org
> Subject: Re: Airflow cli to remote host
>
> I'd suggest using something like Fabric  for
> this.
> This is how I accomplish the same task.
>
> On Wed, May 23, 2018 at 2:19 PM Frank Maritato 
> wrote:
>
> > Hi All,
> >
> > I need to be able to run backfill for my jobs against our production
> > airflow server. Is there a way to run
> >
> > airflow backfill job_name -s 2018-05-01
> >
> > against a remote server? I didn’t see a -h option to specify a hostname.
> >
> > If not, is there a way through the ui to do this? I'd rather not have to
> > ssh into the production server to run these jobs.
> >
> > Thanks!
> > --
> > Frank Maritato
> >
> >
>
> --
> Kyle Hamlin
>
>
>
>


Re: Reply: Airflow REST API proof of concept.

2018-05-21 Thread Maxime Beauchemin
Personally I think we should keep the architecture as simple as possible
and use the same web server for REST and UI. As mentioned FAB (Flask App
Builder) manages authentication and RBAC, so we can have consistent access
rights in the UI and CLI.

Max

On Fri, May 11, 2018 at 5:42 AM Luke Diment 
wrote:

> Ok thanks that’s awesome we will have a look and check it out thanks...
>
> 🙂
>
> Sent from my iPhone
>
> > On 12/05/2018, at 12:38 AM, Driesprong, Fokko 
> wrote:
> >
> > Hi Luke,
> >
> > This is the REST api for the new UI:
> >
> https://github.com/apache/incubator-airflow/blob/master/airflow/www_rbac/api/experimental/endpoints.py
> >
> > RBAC = Role Based Access Control, the fine grained security model based
> on
> > the fabmanager. Recently we've added some endpoints to it. In the end
> also
> > all the GUI ajax calls should go by this API, instead of calling the
> flask
> > endpoints directly.
> >
> > Cheers, Fokko
> >
> >
> >
> >
> >
> >
> > 2018-05-11 14:34 GMT+02:00 Luke Diment :
> >
> >> Our build pipeline uses Jenkinsfile with Docker kubernetes and helm...we
> >> orchestrate deployment against our rest api and use junit to assert our
> >> results...fully programmatically against airflow...!
> >>
> >> Sent from my iPhone
> >>
> >>> On 12/05/2018, at 12:31 AM, Luke Diment 
> >> wrote:
> >>>
> >>> No it executes the backend airflow command line over HTTP giving
> >> developers room to freely interact with airflow programmatically...hence
> >> you can easily integration test your business logic...
> >>>
> >>> Sent from my iPhone
> >>>
>  On 12/05/2018, at 12:27 AM, Song Liu  wrote:
> 
>  So that this Java REST API server is talking to the meta db directly ?
>  
>  From: Luke Diment 
>  Sent: 11 May 2018 12:22
>  To: dev@airflow.incubator.apache.org
>  Subject: Fwd: Airflow REST API proof of concept.
> 
>  FYI.
> 
>  Sent from my iPhone
> 
>  Begin forwarded message:
> 
>  From: Luke Diment  >> luke.dim...@westpac.co.nz>>
>  Date: 11 May 2018 at 1:02:43 PM NZST
>  To: "dev-ow...@airflow.incubator.apache.org >> airflow.incubator.apache.org>"  >> >
>  Subject: Fw: Airflow REST API proof of concept.
> 
> 
>  FYI.
> 
>  
>  From: Luke Diment
>  Sent: Thursday, May 10, 2018 4:33 PM
>  To: dev-subscr...@airflow.incubator.apache.org >> v-subscr...@airflow.incubator.apache.org>
>  Subject: Airflow REST API proof of concept.
> 
> 
>  Hi Airflow contributors,
> 
> 
>  I am a Java developer/full stack and lots of other stuff at Westpac
> >> Bank New Zealand.
> 
> 
>  We currently use Airflow for task scheduling for a rather large
> >> integration project for financial risk assessment.
> 
> 
>  During our development phase we started to understand that a REST API
> >> in front of Airflow would be a great idea.
> 
> 
>  We realise that you guys have detailed there will be a REST API at some
> >> stage.
> 
> 
>  We have already built a proof of concept REST API implementation in
> >> Java (of course...;-))...
> 
> 
>  We were wondering if your contributor group would find this helpful or
> >> if there would be any reason to continue such an API in Java.
> 
> 
>  We look forward to your response.  We can share the code if needed...
> 
> 
>  Thanks,
> 
> 
>  Luke Diment.
> 
> 
> 
> 
> 

Re: Dags getting failed after 24 hours

2018-05-21 Thread Maxime Beauchemin
Even though it's possible to set an `execution_timeout` on any task and/or
a dagrun_timeout on DAG runs, by default it's all set to None (unless
you're somehow setting the DAG's default parameters in some other ways).
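For anyone who does want explicit limits, a minimal sketch (the dag id, script
and values are arbitrary):

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG('long_running',
              start_date=datetime(2018, 1, 1),
              schedule_interval='@daily',
              dagrun_timeout=timedelta(hours=72))  # limit for the whole DAG run

    BashOperator(task_id='crunch',
                 bash_command='python /opt/jobs/crunch.py',  # hypothetical job
                 execution_timeout=timedelta(hours=60),      # per-task limit
                 dag=dag)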

Maybe you have some OS-level policies on long-running processes in your
environment? Anything in the logs? SIGKILL?

Max

On Mon, May 21, 2018 at 5:13 AM ramandu...@gmail.com 
wrote:

> Hi All,
> We have a long running DAG which is expected to take around 48 hours. But
> we are observing that its get killed by Airflow scheduler after ~24 hrs. We
> are not setting any Dag/task execution timeout explicitly.
> Is there any default timeout value that get used. We are using
> LocalExecutor mode.
> We checked in the Airflow code but execution timeout values seem to be set
> to 'None'
>
> Thanks,
> Raman Gupta
>


Re: Improving Airflow SLAs

2018-05-03 Thread Maxime Beauchemin
About de-coupling the SLA management process, I've had conversations in the
direction of renaming the scheduler to "supervisor" to reflect the fact
that it's not just scheduling processes, it does a lot more tasks than just
that, SLA management being one of them.

I still think the default should be to require a single supervisor that
would do all the "supervision" work though. I'm generally against requiring
more types of nodes on the cluster. But perhaps the supervisor could have
switches to be started in modes where it would only do a subset of its
tasks, so that people can run multiple specialized supervisor nodes if they
want to.

For the record, I was thinking that renaming the scheduler to supervisor
would likely happen as we re-write it to enable multiple concurrent
supervisor processes. It turns out that parallelizing the scheduler hasn't
been as critical as I thought it would be originally, especially with the
current multi-process scheduler. Sounds like the community is getting a lot
of mileage out of this current multi-process scheduler.

Max

On Thu, May 3, 2018 at 7:31 AM, Jiening Wen  wrote:

> I would love to see this proposal get implemented in airflow.
> In our case duration based SLA makes much more sense and I ended up adding
> a decorator to the execute method in our custom operators.
>
> Best regards,
> Jiening
>
> -Original Message-
> From: James Meickle [mailto:jmeic...@quantopian.com]
> Sent: Wednesday 02 May 2018 9:00 PM
> To: dev@airflow.incubator.apache.org
> Subject: Improving Airflow SLAs [External]
>
> At Quantopian we use Airflow to produce artifacts based on the previous
> day's stock market data. These artifacts are required for us to trade on
> today's stock market. Therefore, I've been investing time in improving
> Airflow notifications (such as writing PagerDuty and Slack integrations).
> My attention has turned to Airflow's SLA system, which has some drawbacks
> for our use case:
>
> 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
> skipped for this execution date will still trigger emails/callbacks. This
> is a huge problem for us because we run almost no tasks on weekends (since
> the stock market isn't open).
>
> 2) Defining SLAs can be awkward because they are relative to the execution
> date instead of the task start time. There's no way to alert if a task runs
> for "more than an hour", for any non-trivial DAG. Instead you can only
> express "more than an hour from execution date".  The financial data we use
> varies in when it arrives, and how long it takes to process (data volume
> changes frequently); we also have tight timelines that make retries
> difficult, so we want to alert an operator while leaving the task running,
> rather than failing and then alerting.
>
> 3) SLA miss emails don't have a subject line containing the instance URL
> (important for us because we run the same DAGs in both staging/production)
> or the execution date they apply to. When opened, they can get hard to read
> for even a moderately sized DAG because they include a flat list of task
> instances that are unsorted (neither alpha nor topo). They are also lacking
> any links back to the Airflow instance.
>
> 4) SLA emails are not callbacks, and can't be turned off (other than either
> removing the SLA or removing the email attribute on the task instance). The
> way that SLA miss callbacks are defined is not intuitive, as in contrast to
> all other callbacks, they are DAG-level rather than task-level. Also, the
> call signature is poorly defined: for instance, two of the arguments are
> just strings produced from the other two arguments.
>
> I have some thoughts about ways to fix these issues:
>
> 1) I just consider this one a bug. If a task instance is skipped, that was
> intentional, and it should not trigger any alerts.
>
> 2) I think that the `sla=` parameter should be split into something like
> this:
>
> `expected_start`: Timedelta after execution date, representing when this
> task must have started by.
> `expected_finish`: Timedelta after execution date, representing when this
> task must have finished by.
> `expected_duration`: Timedelta after task start, representing how long it
> is expected to run including all retries.
>
> This would give better operator control over SLAs, particularly for tasks
> deeper in larger DAGs where exact ordering may be hard to predict.
>
> 3) The emails should be improved to be more operator-friendly, and take
> into account that someone may get a callback for a DAG they don't know very
> well, or be paged by this notification.
>
> 4.1) All Airflow callbacks should support a list, rather than requiring a
> single function. (I've written a wrapper that does this, but it would be
> better for Airflow to just handle this itself.)
>
> 4.2) SLA miss callbacks should be task callbacks that receive context, like
> all the other callbacks. Having a DAG figure out which tasks have missed
> SLAs collectively is fine, but ge
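For context, this is how an SLA is attached today: a single `sla` timedelta
measured from the execution date, which is exactly what points 2) and 4) above
push back on (ids are examples):

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG('market_data',
              start_date=datetime(2018, 1, 1),
              schedule_interval='@daily')

    # "finished within 4 hours of the execution date", not "ran for 4 hours"
    DummyOperator(task_id='build_artifacts',
                  sla=timedelta(hours=4),
                  email='oncall@example.com',
                  dag=dag)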

Re: Managed Apache Airflow Service on Google Cloud Platform

2018-05-01 Thread Maxime Beauchemin
I'm sure the community agrees when I say that we're happy and honored to
have Googlers on board.

Congrats on the launch!

Max

On Tue, May 1, 2018 at 9:58 AM, Feng Lu  wrote:

> Hello everyone,
>
> I want to let everyone know that today Google Cloud launched a new managed
> service based on Apache Airflow - Cloud Composer [1]. Now that we have
> launched into public beta, I wanted to connect with the community to share
> why we chose Airflow and our plans for Composer and involvement with the
> Airflow community.
>
> A year ago we set out to build a workflow orchestration product for Google
> Cloud. We strongly believe that such a system should be based on open
> source - it’s described as a core value on our public landing page [2]. We
> chose Airflow for many reasons, including the awesome community, its
> approachability for developers, and its core concepts. We built Cloud
> Composer because we wanted to make Airflow accessible to all Google Cloud
> customers. We’re also encouraging these customers to use Airflow outside of
> Google Cloud - whether it be another Cloud or on-premise.
>
> When we started building Cloud Composer we got involved in the Airflow
> community. You have probably seen a few Googlers submitting pull requests,
> including myself. We do not plan on forking Airflow with the release of
> Cloud Composer and it’s our commitment to remain involved in the Airflow
> community as we grow Composer. We will continue to actively contribute to
> Airflow and look forward to partnering with the community. You should
> expect to see myself and other Googlers involved in Airflow in the future.
>
> Best,
> Feng
>
> [1] https://cloud.google.com/composer
> [2] https://cloud.google.com/
>


Re: How to clear all failed tasks for a DAG with batch ?

2018-04-30 Thread Maxime Beauchemin
I'm not sure if it fits your use case, but the "Task Instances" list view
from the UI allows for somewhat complex composite filters and to alter the
state of the tasks for a set of task instances. It's easy to add more
"actions" if needed. The new RBAC UI should have the matching
filter->action workflow supported, but I'm not sure whether the existing
actions have been implemented.

Here's a screenshot (Apache mailing list doesn't let us embed images in
emails...) https://ibb.co/jvmcMx

Max

On Mon, Apr 30, 2018 at 7:37 AM, Alex Tronchin-James 949-412-7220 <
alex.n.ja...@gmail.com> wrote:

> It would be a good webui update to add a multiselect option to clear by
> task state. Or maybe clear anything but running/success by default and add
> an "include success" option.
> On Fri, Apr 27, 2018 at 06:47 Maxime Beauchemin <
> maximebeauche...@gmail.com>
> wrote:
>
> > https://airflow.apache.org/cli.html#clear
> >
> > `airflow clear mydagid --only_failed`
> >
> > You can specify a date range, a task_id regex and other flags as well
> using
> > this command.
> >
> > Max
> >
> > On Thu, Apr 26, 2018 at 11:04 PM, dong.yajun  wrote:
> >
> > > Hi list,
> > >
> > > We run a DAG with about 450 bash tasks which was generated by a program.
> > > But sometimes several tasks fail, and we must open the DAG UI, find the
> > > failed tasks and clear them to restart them one by one.
> > >
> > > is there a way with one step to clear(restart) all failed tasks for a
> > DAG?
> > >
> > >
> > > --
> > > *Ric Dong*
> > >
> >
>


Re: How to clear all failed tasks for a DAG with batch ?

2018-04-26 Thread Maxime Beauchemin
https://airflow.apache.org/cli.html#clear

`airflow clear mydagid --only_failed`

You can specify a date range, a task_id regex and other flags as well using
this command.
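For example (flags per the 1.x CLI; the dag id and task regex are placeholders):

    # clear only the failed task instances matching a task_id regex,
    # for a given date range
    airflow clear mydagid \
        -t 'load_.*' \
        -s 2018-01-01 -e 2018-02-01 \
        --only_failed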

Max

On Thu, Apr 26, 2018 at 11:04 PM, dong.yajun  wrote:

> Hi list,
>
> We run a DAG with about 450 bash tasks which was generated by a program. But
> sometimes several tasks fail, and we must open the DAG UI, find the
> failed tasks and clear them to restart them one by one.
>
> is there a way with one step to clear(restart) all failed tasks for a DAG?
>
>
> --
> *Ric Dong*
>


Re: About how to pause the running task

2018-04-26 Thread Maxime Beauchemin
There are no semantics or concept of pause within a task, though you can
clear/kill tasks which is essentially sending a poison pill to kill the
task subprocess.

Even if BaseOperator (common to all tasks) was to implement pausing
semantics, they probably wouldn't be implemented for most operators. Is it
even possible to pause a bash script externally (through SIGTSTP, but does
that always work)? A python script? Hive job? A Spark job? Personally I don't
think that feature
would get much use. You end up freezing and not releasing a worker slot.

You can pause a DAG though. FYI when a DAG is paused, it only means it will
stop scheduling new tasks.
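Pausing and unpausing can be done from the UI toggle or the CLI, e.g.:

    airflow pause my_dag_id
    airflow unpause my_dag_id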

Max

On Thu, Apr 26, 2018 at 10:14 AM, Song Liu  wrote:

> Hi,
>
> A DAG is composed of many tasks; when this DAG is started, how can I pause
> the currently running task?
>
> Thanks,
> Song
>


Re: About the project support in Airflow

2018-04-24 Thread Maxime Beauchemin
Related blog post about multi-tenant Airflow deployment out of Pandora:
https://engineering.pandora.com/apache-airflow-at-pandora-1d7a844d68ee

On Tue, Apr 24, 2018 at 10:20 AM, Bolke de Bruin  wrote:

> My suggestion would be to deploy airflow per project. You could even use
> airflow to manage your ci/cd pipeline.
>
> B.
>
> Sent from my iPhone
>
> > On 24 Apr 2018, at 18:33, Maxime Beauchemin 
> wrote:
> >
> > People have been talking about namespacing DAGs in the past. I'd
> recommend
> > using tags (many to many) instead of categories/projects (one to many).
> >
> > It should be fairly easy to add this feature. One question is whether
> tags
> > are defined as code or in the UI/db only.
> >
> > Max
> >
> >> On Tue, Apr 24, 2018 at 1:48 AM, Song Liu  wrote:
> >>
> >> Hi,
> >>
> >> Basically the DAGs are created for a project purpose, so if I have many
> >> different projects, will the Airflow support the Project concept and
> >> organize them separately ?
> >>
> >> Is this a known requirement or any plan for this already ?
> >>
> >> Thanks,
> >> Song
> >>
>


Re: About the project support in Airflow

2018-04-24 Thread Maxime Beauchemin
People have been talking about namespacing DAGs in the past. I'd recommend
using tags (many to many) instead of categories/projects (one to many).

It should be fairly easy to add this feature. One question is whether tags
are defined as code or in the UI/db only.

Max

On Tue, Apr 24, 2018 at 1:48 AM, Song Liu  wrote:

> Hi,
>
> Basically the DAGs are created for a project purpose, so if I have many
> different projects, will the Airflow support the Project concept and
> organize them separately ?
>
> Is this a known requirement or any plan for this already ?
>
> Thanks,
> Song
>


Re: Benchmarking of Airflow Scheduler with Celery Executor

2018-04-13 Thread Maxime Beauchemin
If you're concerned about scheduler scalability I'd go with a bigger box.
The scheduler uses multiprocessing so more CPU power means more throughput.

Also you may want to provision a beefy MySQL box to make sure that doesn't
become the bottleneck. 10k tasks heartbeating to the DB every 30 seconds is
significant load.

Perhaps Airbnb folks chime in about their scale and hardware setup?

Max

On Fri, Apr 13, 2018 at 9:14 AM, ramandu...@gmail.com 
wrote:

> Thanks Ry,
> Just wondering if there is any approximate number on concurrent tasks a
> scheduler can run on say 16 GB RAM and 8 core machine.
> If its already been done that would be useful.
> We did some benchmarking with local executor and observed that each
> TaskInstance was taking ~100MB of memory so we could only run ~130
> concurrent tasks on 16 GB RAM and 8 core machine.
>
> -Raman Gupta
>
>
>
> On 2018/04/12 16:32:37, Ry Walker  wrote:
> > Hi Raman -
> >
> > First, we’d be happy to help you test this out with Airflow. Or you could
> > do it yourself by using http://open.astronomer.io/airflow/ (w/ Docker
> > Engine + Docker Compose) to quickly spin up a test environment.
> Everything
> > is hooked to Prometheus/Grafana to monitor how the system reacts to your
> > workload.
> >
> > -Ry
> > CEO, Astronomer
> >
> > On April 12, 2018 at 12:23:46 PM, ramandu...@gmail.com (
> ramandu...@gmail.com)
> > wrote:
> >
> > Hi All,
> > We have requirement to run 10k(s) of concurrent tasks. We are exploring
> > Airflow's Celery Executor for same. Horizontally Scaling of worker nodes
> > seem possible but it can only have one active scheduler.
> > So will Airflow scheduler be able to handle these many concurrent tasks.
> > Is there any benchmarking number around airflow scheduler's scalability.
> > Thanks,
> > Raman
> >
>


Give it up for Fokko!

2018-04-13 Thread Maxime Beauchemin
Hey all,

I wanted to point out the amazing work that Fokko is doing,
reviewing/merging PRs and doing fantastic committer & maintainer work. It
takes a variety of contributions to make projects like Airflow thrive, but
without this kind of involvement it wouldn't be possible to keep shipping
better versions of the product steadily.

Cheers to that!

Max


Re: Slides / Presentations for Airflow

2018-04-12 Thread Maxime Beauchemin
I have some slides but they're very Airbnb-styled, I need to make a new
deck...

Max

On Thu, Apr 12, 2018 at 9:41 AM, Chris Riccomini 
wrote:

> > I suspect we will be hearing from Joy, Chris R., et al shortly.
>
> Correct. Video editing is going on as we speak. :)
>
> On Thu, Apr 12, 2018 at 9:30 AM, Sid Anand  wrote:
>
> > Great question.. currently, we don't.
> >
> > After meet-ups, usually the host of the meetup sends an email to this DL
> > with a link to slides and videos -- we update
> > https://cwiki.apache.org/confluence/display/AIRFLOW/Announcements and
> > share
> > it on Twitter.
> >
> > WePay is very good about sharing videos of the Airflow meet-ups they host
> > within a short time -- I suspect we will be hearing from Joy, Chris R.,
> et
> > al shortly.
> >
> > BTW, Apache Beam folks speak very often... maybe on average once or
> twice a
> > month.. and all of their talks are not up on their site. It's a best
> effort
> > thing.
> >
> > -s
> >
> > On Wed, Apr 11, 2018 at 1:56 AM, Naik Kaxil  wrote:
> >
> > > Hi all,
> > >
> > >
> > >
> > > Do we maintain slides or presentation materials like Apache Beam does ?
> > > https://beam.apache.org/contribute/presentation-materials/
> > >
> > >
> > > Or if anyone has just presented somewhere can I get the link to slides
> > and
> > > videos if possible?
> > >
> > >
> > >
> > > Regards,
> > >
> > > Kaxil
> > >
> > >
> > > Kaxil Naik
> > >
> > > Data Reply
> > > 38 Grosvenor Gardens
> > > London SW1W 0EB - UK
> > > phone: +44 (0)20 7730 6000
> > > k.n...@reply.com
> > > www.reply.com
> > >
> > >
> >
>


Re: schedule backfill jobs in reverse order

2018-04-04 Thread Maxime Beauchemin
It's a totally reasonable use case. As you said, currently it only moves
forward, though it could also move backwards. You could add a DAG argument
`schedule_past_dagruns=False` (we probably need a better name), that would
enable this feature.

Here's the code that creates DagRuns, it seems like it may be easy to add a
few lines to implement this feature.
https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L761

Max

On Mon, Apr 2, 2018 at 7:17 PM, David Capwell  wrote:

> Nothing I know of.  The scheduler finds the latest execution then creates
> the next based off the interval; this is also why updates to start_date have
> no effect (it doesn't try to fill gaps)
>
> On Mon, Apr 2, 2018, 11:26 AM Dennis O'Brien 
> wrote:
>
> > Hi folks,
> >
> > I recently asked this question on gitter but didn't get any feedback.
> >
> > Anyone know if there is a way to get the scheduler to reverse the order
> of
> > the dag runs? By default a new DAG starts at start_date then moves
> > sequentially forward in time until it is caught up (assuming
> catchup=True).
> > The same is true for a new DAG just enabled, or a DAG that is cleared,
> and
> > for a backfill.
> >
> > The behavior I'd like to get is for the scheduler to queue up the latest
> > available, so it starts most recent, then moves back in time. If while
> the
> > backfill is running a more recent DAG run is eligible, that one should be
> > queued next.
> >
> > Is there anyway to accomplish this?  Is this a feature that others would
> > find useful?
> >
> > For some background, I have some jobs that make predictions and do a long
> > backfill for historical backtesting, and that can mean no new predictions
> > for a week depending on the job and the time to backfill.  Ideally the
> most
> > recent jobs would take precedence over the historical jobs.
> >
> > thanks,
> > Dennis
> >
>


Re: slow scheduler

2018-04-04 Thread Maxime Beauchemin
As a batch scheduler Airflow doesn't currently guarantee super low latency.
The aim for the project has been to make it possible to do sub-minute
latency at scale, but it's common for this go up to a few minutes in larger
environments.

I'd recommend making it clear to your users what your Airflow environment
can currently guarantee, and that should inform how they define their
tasks. If latency is 1-2 minutes, it doesn't make sense to do a chain of
1-2 seconds tasks. Typically the duration of an Airflow task should be
counted in minutes, not seconds (there are exceptions though). In a way the
current latencies aren't bad as it weeds out things that probably shouldn't
be done in Airflow in the first place. Airflow is not Amazon Lambda.

Latency guarantees may get lower in the future as the community may tackle
"distributed scheduling", which wouldn't be hard to do. Essentially the
worker would evaluate and trigger child tasks as tasks succeed. Even when
that will be the case, I'd recommend avoiding small tasks as they would be
very overhead-heavy. You don't want the system to spend a too significant
portion of its effort on overhead (fetching and loading DAGs in memory,
maintaining/altering state in the database, ...).

Cranking up `max_threads` should help significantly (number of concurrent
DAGs that are getting scheduled), and I'd raise `job_heartbeat_sec`
and `scheduler_heartbeat_sec`
as the 5 seconds defaults are very low. They are set low to allow for
playing around with the examples with LocalExecutor. In production
environments you may want to raise this to 30 to 60 seconds. max_threads
depends on the box you're running the scheduler on, but you could run as
many threads as you have CPU cores on that box. Larger environments
typically "beef-up" their scheduler box and crank up the number of threads.

job_heartbeat_sec = 30
scheduler_heartbeat_sec = 30
max_threads = 32

We should clarify this in the documentation, and provide more guidance
around what a production `airflow.cfg` should look like.
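For reference, a sketch of how those values sit in the [scheduler] section of
airflow.cfg (tune to your own hardware):

    [scheduler]
    # the 5s defaults are meant for playing with the examples on LocalExecutor
    job_heartbeat_sec = 30
    scheduler_heartbeat_sec = 30
    # roughly one scheduler thread per CPU core on the scheduler box
    max_threads = 32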

Max

On Wed, Apr 4, 2018 at 5:08 AM, Cieplucha, Michal <
michal.cieplu...@intel.com> wrote:

> Hello all,
>
> Our automation with airflow is getting bigger and bigger (airflow 1.8,
> ~150 DAGs, 3xinstances of scheduler) . Sometimes our users are triggering
> DAG runs based on some external events, so we exposed an API endpoint to
> run a DAG. Those DAGs that are run manually should give fast feedback to
> the user, but we see that it takes few minutes to schedule first task, and
> often next few minutes between tasks. So the most time is consumed between
> tasks, task durations are just some seconds. Does anybody have those
> issues? It looks like scheduler often have empty loops with logs like:
> 2018-04-04 12:05:45,004:DEBUG:airflow.jobs.SchedulerJob:[CT=None]
> Starting Loop...
> 2018-04-04 12:05:45,005:INFO:airflow.jobs.SchedulerJob:[CT=None]
> Heartbeating the process manager
> 2018-04-04 12:05:45,005:INFO:airflow.jobs.SchedulerJob:[CT=None]
> Heartbeating the executor
> 2018-04-04 
> 12:05:45,005:DEBUG:airflow.executors.celery_executor.CeleryExecutor:[CT=None]
> 44 running task instances
> 2018-04-04 
> 12:05:45,005:DEBUG:airflow.executors.celery_executor.CeleryExecutor:[CT=None]
> 0 in queue
> 2018-04-04 
> 12:05:45,006:DEBUG:airflow.executors.celery_executor.CeleryExecutor:[CT=None]
> 340 open slots
> 2018-04-04 
> 12:05:45,006:DEBUG:airflow.executors.celery_executor.CeleryExecutor:[CT=None]
> Calling the 
> sync method
> 2018-04-04 
> 12:05:45,006:DEBUG:airflow.executors.celery_executor.CeleryExecutor:[CT=None]
> Inquiring about 44 celery task(s)
> 2018-04-04 12:05:45,744:DEBUG:airflow.jobs.SchedulerJob:[CT=None] Ran
> scheduling loop in 0.74s
> 2018-04-04 12:05:45,745:DEBUG:airflow.jobs.SchedulerJob:[CT=None]
> Sleeping for 1.00s
>
> Maybe we need to tune airflow settings?
> We have up to 250 unacked messages on rabbit queue, which translates to
> number of running task instances, there is a lot going on in our airflow
> instance but apart from that scheduling issue everything looks fine
> (cpu/memory usage, etc).
> Our general settings:
> 6x dockers with workers, parallelism is 384, dag concurrency 128 and
> celeryd_concurrency 64
>
> Our scheduler config section:
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> max_threads = 2
>
>
> thanks
> mC
>
>
> I am an Intel employee. All comments and opinions are my own and do not
> represent the views of Intel.
>
>
> 
>
