Re: Triggering behavior around DAG state changes

2018-11-08 Thread Brian Greene
What you describe doesn’t sound like a state change so much as just a dag 
that has an http operator as its first task, and another at the end that 
communicates final state?
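
A minimal sketch of that pattern (Airflow 1.x; the connection id, endpoints, 
and payloads below are illustrative assumptions, not anything from this thread):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.http_operator import SimpleHttpOperator

    dag = DAG('notify_example', start_date=datetime(2018, 1, 1),
              schedule_interval='@daily')

    notify_start = SimpleHttpOperator(
        task_id='notify_start',
        http_conn_id='my_rest_api',              # hypothetical connection
        endpoint='dags/notify_example/started',  # hypothetical endpoint
        method='POST',
        dag=dag)

    notify_done = SimpleHttpOperator(
        task_id='notify_done',
        http_conn_id='my_rest_api',
        endpoint='dags/notify_example/finished',
        method='POST',
        trigger_rule='all_done',  # fire whether upstream succeeded or failed
        dag=dag)

    # the real work goes between the two notification tasks
    notify_start >> notify_done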



Sent from a device with less than stellar autocorrect

> On Nov 7, 2018, at 11:12 AM, Aleks Shulman wrote:
> 
> Hi there,
> 
> I'm new to Airflow and I'm working on an integration between it and another
> system.
> 
> I'm curious about best practices around adding some programmatic behavior
> that specifically occurs when a DAG's state changes. For example, when a
> DAG is started, I'd like to perform some action (in this case make a REST
> call), and also when it is marked as failed or success.
> 
> Is there an API for doing that? If not, might there be interest for someone
> contributing it?
> 
> Thanks!
> -- 
> *WeWork | Aleksandr Shulman*
> Senior Data Platform Engineer
> M: 847-814-5804
> wework.com 
> 
> Create Your Life's Work


Re: Duplicate key unique constraint error

2018-10-30 Thread Brian Greene
How do you trigger it externally?

We have several custom operators that trigger other jobs and we had to be 
really careful or we’d get duplicate keys for the dag run and it would fail to 
kick off.

One scheduler, but we saw it repeatedly and have it noted as a thing to watch 
out for.
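
For illustration, a sketch of the kind of care that helps, assuming Airflow 
1.x's experimental trigger API (the dag id and run_id scheme are made up): give 
every externally triggered run a unique run_id, since the dag_run table 
enforces uniqueness per dag_id.

    import uuid

    from airflow.api.common.experimental.trigger_dag import trigger_dag

    # A unique run_id per trigger.  If two triggers can land in the same
    # second, pass a distinct execution_date too, or the dag_run unique
    # constraint on (dag_id, execution_date) may still reject the insert.
    trigger_dag(dag_id='target_dag',
                run_id='manual__%s' % uuid.uuid4())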

Brian

Sent from a device with less than stellar autocorrect

> On Oct 29, 2018, at 2:03 PM, Abhishek Sinha  wrote:
> 
> Attaching the scheduler crash logs as well.
> 
> https://pastebin.com/B2WEJKRB
> 
> 
> 
> 
> Regards,
> 
> Abhishek Sinha | m: +919035191078 | e: abhis...@infoworks.io
> 
> 
> On Tue, Oct 30, 2018 at 12:19 AM Abhishek Sinha 
> wrote:
> 
>> Max,
>> 
>> We always trigger the DAG externally. I am not sure if there is still any
>> backfill involved.
>> 
>> Is there a way where I can find out in logs, if more than one instance of
>> scheduler is running?
>> 
>> 
>> On 29 October 2018 at 10:43:19 PM, Maxime Beauchemin (
>> maximebeauche...@gmail.com) wrote:
>> 
>> The stacktrace seems to be pointing in that direction. I'd check that
>> first. It seems like it **could** be a race condition with a backfill as
>> well, unclear.
>> 
>> It's still a bug though, and the scheduler should make sure to handle this
>> and not raise/crash.
>> 
>> On Mon, Oct 29, 2018, 10:05 AM Abhishek Sinha 
>> wrote:
>> 
>>> Max,
>>> 
>>> I do not think there was more than one instance of scheduler running.
>>> Since the scheduler crashed and it has been restarted, I cannot confirm it
>>> now. Is there any log that can provide this information?
>>> 
>>> Could there be a different cause apart from multiple scheduler instances
>>> running?
>>> 
>>> 
>>> On 29 October 2018 at 9:30:56 PM, Maxime Beauchemin (
>>> maximebeauche...@gmail.com) wrote:
>>> 
>>> Abhishek, are you running more than one scheduler instance at once?
>>> 
>>> Max
>>> 
>>> On Mon, Oct 29, 2018 at 8:17 AM Abhishek Sinha 
>>> wrote:
>>> 
 The issue is happening more frequently now. Can someone please look into
 this?
 
 
 
 
 On 24 September 2018 at 12:42:49 PM, Abhishek Sinha (abhis...@infoworks.io) wrote:
 
 Can someone please help in looking into this issue? It is critical since
 this has come up in one of our production environments. Also, this issue
 has appeared only once till now.
 
 
 
 
 Regards,
 
 Abhishek
 
 On 20-Sep-2018, at 10:18 PM, Abhishek Sinha 
>>> wrote:
 
 Any update on this?
 
 
 
 
 Regards,
 
 Abhishek
 
 On 18-Sep-2018, at 12:48 AM, Abhishek Sinha 
>>> wrote:
 
 Pastebin: https://pastebin.com/K6BMTb5K
 
 
 
 
 Regards,
 
 Abhishek
 
 On 18-Sep-2018, at 12:31 AM, Stefan Seelmann 
 wrote:
 
 On 9/17/18 8:19 PM, Abhishek Sinha wrote:
 
 Any update on this?
 
 Please find the scheduler error log attached.
 
 Can you share the full python stack trace?
 
 
 Seems the mailing list doesn't allow attachments. Either post the
 stacktrace inline, or post it somewhere at pastebin or so.
 
>>> 
>>> 


Re: Manual validation operator

2018-10-05 Thread Brian Greene
My first thought was this, but my understanding is that if you had a large 
number of dags “waiting”, the sensors would consume all the concurrency.

And what if the user doesn’t approve?

How about: the dag you have writes the status to an api/db as its last step.

Then 2 other dags (or one with a branch) can each have a sensor that’s watching 
for approved/unapproved values.  When it finds one (or a batch, depending on how 
you write it), it triggers the “next” dag.  

This leaves only 1-2 sensors running and would enable your process without 
anyone using the airflow UI (assuming they have some other way to mark 
“approval”).  This avoids the “process by error and recover” logic it seems 
like you’d like to get out of.  (Which makes sense to me)
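
A rough sketch of what such a sensor could look like (Airflow 1.x; the approval 
endpoint and its JSON shape are assumptions about your own service):

    import requests

    from airflow.operators.sensors import BaseSensorOperator
    from airflow.utils.decorators import apply_defaults

    class ApprovalSensor(BaseSensorOperator):
        """Pokes a hypothetical approval API until it reports approved."""

        @apply_defaults
        def __init__(self, status_url, *args, **kwargs):
            super(ApprovalSensor, self).__init__(*args, **kwargs)
            self.status_url = status_url

        def poke(self, context):
            resp = requests.get(self.status_url)
            resp.raise_for_status()
            return resp.json().get('status') == 'approved'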

B

Sent from a device with less than stellar autocorrect

> On Oct 4, 2018, at 10:17 AM, Alek Storm  wrote:
> 
> Hi Björn,
> 
> We also sometimes require manual validation, and though we haven't yet
> implemented this, I imagine you could store the approved/unapproved status
> of the job in a database, expose it via an API, and write an Airflow sensor
> that continuously polls that API until the status becomes "approved", at
> which point the DAG execution will continue.
> 
> Best,
> Alek Storm
> 
> On Thu, Oct 4, 2018 at 10:05 AM Björn Pollex
>  wrote:
> 
>> Hi all,
>> 
>> In some of our workflows we require a manual validation step, where some
>> generated data has to be reviewed by an authorised person before the
>> workflow can continue. We currently model this by using a custom dummy
>> operator that always fails. After the review, we manually mark it as
>> success and clear the downstream tasks. This works, but it would be nice to
>> have better representation of this in the UI. The customisation points for
>> plugins don’t seem to offer any way of customising UI for specific
>> operators.
>> 
>> Does anyone else have similar use cases? How are you handling this?
>> 
>> Cheers,
>> 
>>Björn Pollex
>> 
>> 


Re: Solved: suppress PendingDeprecationWarning messages in airflow logs

2018-09-29 Thread Brian Greene
The operator is a thin wrapper around their SDK.  The Qubole SDK makes heavy 
use of **kwargs, and the operator just passes them through.

Short of writing our own operator with named params to keep the base operator 
happy and then delegating to the Qubole SDK, there’s no other way to silence the 
warnings to my knowledge.


B

Sent from a device with less than stellar autocorrect

> On Sep 28, 2018, at 7:56 PM, Chris Palmer  wrote:
> 
> Doesn't this warning come from the BaseOperator class -
> https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L2511
> ?
> 
> Are you passing unused arguments to the QuboleOperator, or do you not
> control the instantiation of those operators?
> 
> Chris
> 
> On Fri, Sep 28, 2018 at 7:39 PM Sean Carey 
> wrote:
> 
>> Haha yes, good point.  Thanks!
>> 
>> Sean
>> 
>> 
>> On 9/28/18, 6:13 PM, "Ash Berlin-Taylor"  wrote:
>> 
>>Sounds good for your use, certainly.
>> 
>>I mainly wanted to make sure other people knew before blindly
>> equipping a foot-cannon :)
>> 
>>-ash
>> 
>>> On 29 Sep 2018, at 00:09, Sean Carey 
>> wrote:
>>> 
>>> Thanks, Ash.  I understand what you're saying.  The warnings are
>> coming from the Qubole operator.  We get a lot of this:
>>> 
>>> PendingDeprecationWarning: Invalid arguments were passed to
>> QuboleOperator. Support for passing such arguments will be dropped in
>> Airflow 2.0. Invalid arguments were:
>>> *args: ()
>>> **kwargs: {...}
>>> category=PendingDeprecationWarning
>>> 
>>> We've spoken to Qubole about this and they plan to address it.  In
>> the meantime, it generates a ton of noise in our logs which we'd like to
>> suppress -- we are aware of the issue, and don't need to be told about it
>> every 5 seconds.
>>> 
>>> As for it suddenly breaking, being that this is pending Airflow 2.0
>> I feel the risk is low and when we do upgrade it will be thoroughly tested.
>>> 
>>> Thanks!
>>> 
>>> Sean
>>> 
>>> 
>>> On 9/28/18, 5:01 PM, "Ash Berlin-Taylor"  wrote:
>>> 
>>>   What deprecation warnings are you getting? Are they from Airflow
>> itself (i.e. things Airflow is calling like flask_wtf, etc) or of your use
>> of Airflow?
>>> 
>>>   If it is the former, could you check and see if someone has already
>> reported a Jira issue so we can fix them?
>> https://issues.apache.org/jira/issues/?jql=project%3DAIRFLOW
>>> 
>>>   If it is the latter PLEASE DO NOT IGNORE THESE.
>>> 
>>>   Deprecation warnings are how we, the Airflow community tell users
>> that you need to make a change to your DAG/code/config to upgrade things.
>> If you silence these warnings you will have a much harder time upgrading to
>> new versions of Airflow (read: you might suddenly find that things stop
>> working because you turned off the warnings.)
>>> 
>>>   -ash
>>> 
 On 28 Sep 2018, at 22:52, Sean Carey 
>> wrote:
 
 Hello,
 
 I’ve been looking for a way to suppress the
>> PendingDeprecationWarning messages cluttering our airflow logs and I have a
>> working solution which I thought I would share.
 
 In order to do this, you first need to configure airflow for custom
>> logging using steps 1-4 here:
 
 
>> https://airflow.readthedocs.io/en/stable/howto/write-logs.html#writing-logs-to-azure-blob-storage
 
 (note that although the document is for Azure remote logging you
>> don’t actually need azure for this)
 
 Next, modify the log_config.py script created in the step above as
>> follows:
 
 
 1.  Import logging
 2.  Define the filter class:

     class DeprecationWarningFilter(logging.Filter):
         def filter(self, record):
             allow = 'DeprecationWarning' not in record.msg
             return allow

 3.  Add a "filters" section to the LOGGING_CONFIG beneath "formatters":

     'filters': {
         'noDepWarn': {
             '()': DeprecationWarningFilter,
         }
     },

 4.  For each of the handlers where you want to suppress the warnings
 (console, task, processor, or any of the remote log handlers you may be
 using), add the following line to its configuration:

     'filters': ['noDepWarn'],

 Restart airflow and your logs should be clean.
 
 
 Sean Carey
 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 


Re: execution_date - can we stop the confusion?

2018-09-27 Thread Brian Greene
> On Sep 26, 2018 at 7:37 PM, George Leslie-Waksman wrote:
> 
>> This comes up a lot. I've seen it on this mailing list multiple times and
>> it's something that I have to explicitly call out to every single person
>> that I've helped train up on Airflow.
>> 
>> If we take a moment to set aside why things are the way they are, what the
>> documentation says, and how experienced users feel things should behave;
>> there still remains the fact that a lot of new users get confused by how
>> "execution_date" works.
>> 
>> Whether it's a problem, whether we need to do something, and what we could
>> do are all separate questions but I think it's important that we
>> acknowledge and start from:
>> 
>> A lot of new users get confused by how "execution_date" works.
>> 
>> I recognize that some of this is a learning curve issue and some of this is
>> a mindset issue but it begs the question: do enough users benefit from the
>> current structure to justify the harm to new users?
>> 
>> --George
>> 
>> On Wed, Sep 26, 2018 at 1:40 PM Brian Greene <
>> br...@heisenbergwoodworking.com> wrote:
>> 
>>> It took a minute to grok, but in the larger context of how af works it
>>> makes perfect sense the way it is.  Changing something so fundamentally
>>> breaking to every dag in existence should bring a comparable benefit.
>>> Beyond avoiding teaching a concept you disagree with, what benefits
>>> does the proposal bring to offset the cost of change?
>>> 
>>> I’m gonna make a meme - “do you even airflow bro?”
>>> 
>>> Sent from a device with less than stellar autocorrect
>>> 
>>>> On Sep 26, 2018, at 8:33 AM, Maxime Beauchemin <
>>> maximebeauche...@gmail.com> wrote:
>>>> 
>>>> I think if you have a functional mindset (as in "functional data
>>>> engineering <https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a>")
>>>> as opposed to a cron mindset, using the left bound of the time interval
>>>> makes a lot of sense. Things like your daily table partition keys align
>>>> with your Airflow execution_date.
>>>> 
>>>> The main thing is that whatever we do we cannot break backwards
>>>> compatibility. Offering both views (left bound/right bound), as it's
>> been
>>>> proposed before, either as an environment setting or a user personal
>>>> preference is even more confusing to me personally. Users would have to
>>>> switch context as they help each other or change environments.
>>>> 
>>>> Also note that your intuition may differ from other people's intuition,
>>> and
>>>> that "unlearning" something is way harder than learning something.
>>>> 
>>>> My personal take on this is to make this a rite of passage. This is just
>>>> one of the many things you have to learn when learning Airflow.
>>>> 
>>>> Max
>>>> 
>>>>> On Wed, Sep 26, 2018 at 8:18 AM Sam Elamin 
>>> wrote:
>>>>> 
>>>>> Hi Bolke
>>>>> 
>>>>> Speaking as a consultant who is constantly training other teams how to
>>> use
>>>>> airflow, I do frequently see this confusion.
>>>>> Another one is how the batch_date is always batch_date + interval, or as
>>>>> the docs make it quite clear:
>>>>> 
>>>>> "*Let’s Repeat That* The scheduler runs your job one schedule_interval
>>>>> AFTER
>>>>> the start date, at the END of the period."
>>>>> 
>>>>> Renaming it would make it simpler for newbies, but essentially they will
>>>>> need to understand how Airflow behaves, execution_date being the batch
>>>>> execution date, not the run_date of the DAG
>>>>> 
>>>>> I am actually in the process of writing a blog post
>>>>> <https://samelamin.github.io/2017/04/27/Building-A-Datapipeline-part1/>
>>>>> about this, on which I could use people's feedback
>>>>> 
>>>>> If it helps, I find that explaining how backfills work and why they
>> are
>>>>> important will drive home what the execution_date is :)
>>>>> 
>>>>> 
>>>>> 
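
For reference, a tiny illustration of the interval semantics being debated 
(the dates are examples only):

    from datetime import datetime

    from airflow import DAG

    dag = DAG('daily_job', start_date=datetime(2018, 1, 1),
              schedule_interval='@daily')

    # The first run covers the interval 2018-01-01 00:00 -> 2018-01-02 00:00.
    # The scheduler launches it shortly AFTER 2018-01-02 00:00, but its
    # execution_date is the LEFT bound of the interval: 2018-01-01 00:00.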

Re: execution_date - can we stop the confusion?

2018-09-26 Thread Brian Greene
It took a minute to grok, but in the larger context of how af works it makes 
perfect sense the way it is.  Changing something so fundamentally breaking to 
every dag in existence should bring a comparable benefit.  Beyond avoiding 
teaching a concept you disagree with, what benefits does the proposal bring to 
offset the cost of change?

I’m gonna make a meme - “do you even airflow bro?”

Sent from a device with less than stellar autocorrect

> On Sep 26, 2018, at 8:33 AM, Maxime Beauchemin  
> wrote:
> 
> I think if you have a functional mindset (as in "functional data engineering
> <https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a>")
> as opposed to a cron mindset, using the left bound of the time interval
> makes a lot of sense. Things like your daily table partition keys align
> with your Airflow execution_date.
> 
> The main thing is that whatever we do we cannot break backwards
> compatibility. Offering both views (left bound/right bound), as it's been
> proposed before, either as an environment setting or a user personal
> preference is even more confusing to me personally. Users would have to
> switch context as they help each other or change environments.
> 
> Also note that your intuition may differ from other people's intuition, and
> that "unlearning" something is way harder than learning something.
> 
> My personal take on this is to make this a rite of passage. This is just
> one of the many things you have to learn when learning Airflow.
> 
> Max
> 
>> On Wed, Sep 26, 2018 at 8:18 AM Sam Elamin  wrote:
>> 
>> Hi Bolke
>> 
>> Speaking as a consultant who is constantly training other teams how to use
>> airflow, I do frequently see this confusion.
>> Another one is how the batch_date is always batch_date + interval or as the
>> docs make it quite clear
>> 
>> "*Let’s Repeat That* The scheduler runs your job one schedule_interval
>> AFTER
>> the start date, at the END of the period."
>> 
>> Renaming it would make it simpler for newbies, but essentially they will
>> need to understand how Airflow behaves, execution_date being the batch
>> execution date not the run_date of the DAG
>> 
>> I am actually in the process of writing a blog post
>> <https://samelamin.github.io/2017/04/27/Building-A-Datapipeline-part1/>
>> about this, on which I could use people's feedback
>> 
>> If it helps, I find that explaining how backfills work and why they are
>> important will drive home what the execution_date is :)
>> 
>> 
>> Regards
>> Sam
>> 
>> 
>> 
>>> On Wed, Sep 26, 2018 at 4:10 PM Bolke de Bruin  wrote:
>>> 
>>> I don't think this makes sense and I don't think that anyone had a real
>>> issue with this. Execution date has been clearly documented and is part of
>>> the core principles of airflow. Renaming will create more confusion.
>>> 
>>> Please note that I do think that as an anonymous user you cannot speak
>> for
>>> any "new airflow user". That is a contradiction to me.
>>> 
>>> Thanks
>>> Bolke
>>> 
>>> Sent from my iPhone
>>> 
 On 26 Sep 2018, at 07:59, airflowuser wrote:
 
 One of the most annoying, hard-to-understand, and counterintuitive
>>> behaviors is execution_date. I assume that any new Airflow user
>>> has been struggling with it.
 The number of questions with answers referring to
>>> https://airflow.apache.org/scheduler.html?scheduling-triggers is
>>> uncountable.
 
 Most people mistakenly think that execution_date is the datetime at which
>>> the DAG started to run.
 
 I suggest the following changes:
 1. Rename execution_date to something else, like run_stamped.
>>> This name won't cause people to get confused.
 2. Add a new variable which indicates the actual datetime when the
>>> DAG run was generated; call it execution_start_date. People seem to want
>>> to know when the DAG actually started to be executed/run.
 
 These are only naming changes. No need to actually change the behavior;
>>> this will only make things simpler, as when users encounter run_stamped
>>> they won't be confused by a name like execution_date
>>> 
>> 


Re: Fundamental change - Separate DAG name and id.

2018-09-20 Thread Brian Greene
Prior to using airflow for much, on first inspection, I think I may have agreed 
with you.

After a bit of use I’d agree with Fokko and others - this isn’t really a 
problem, and separating them seems to do more harm than good related to 
deployment.  

I was gonna stop there, but why?

You can add a task to a dag that’s deployed and has run and still view history. 
 The “new” task shows up as white squares in the old dag runs.  Nobody said you’re 
required to also rename the dag when you do this.  If your process or desire 
or design determines you need to rename it, well then by definition... isn’t it 
a new thing without a history?  Airflow is implementing exactly that.

One could argue that renaming to reflect exact purpose is good practice.  Yes, 
I’d agree, but again following that logic if it’s a small enough change to 
“slip in” then the name likely shouldn’t change.  If it’s big enough I want to 
change the name then it’s a big enough change that I’m functionally running 
something “new”, and I expect to need to account for that.  Airflow is 
enforcing that logic by coupling the name to the deployment of what you said 
was a new process.

One might put forth that changing the name to be more descriptive in the UI 
makes it easier for support staff.  I think perhaps if that’s your challenge 
it’s not airflow that’s the problem.  Dags are of course documented elsewhere 
besides their name, right?  Yeah it’s self-documenting (and the graphs are 
cool), but I have to assume there’s something besides the NAME to tell people 
what it does.  Additionally, far more than the name is required for even an 
operator or monitor watcher to take action - you don’t expect them to know 
which tasks to rerun or how to troubleshoot failures just based on your “now 
most descriptive name in the UI”, do you?

I spent time in an Informatica shop where all the jobs were numbered.  
Numbered.  Let’s be more exact... their NAMES were NUMBERS like 56709. 
Terrible, but it 100% worked, because while a descriptive name would have been 
useful, the name is the thing that’s supposed to NOT CHANGE (see code of 
Abibarshim), and all the other information can attach to that in places where 
you write... other information.  People would curse a number - “F’ing 6291 failed 
again” - and everyone knew what they were talking about.  But I digress.

 You might decide to document “dag ID 12” or just “12” on your wiki - I’m 
going to document “daily_sales_import”.  And when things start failing at 3am 
it’s not my dag “56” that’s failing, it’s the sales_export dag.  But if you 
document “12”, that’s still its name, and it’d better be 12 in all your 
environments and documents.  This also means the actual db IDs from your 
proposal are almost certainly NOT the same across your environments, making 
“12” unusable as a name!

There are lots of languages (most of them) where the name of a thing is 
important and hard to change.  It’s not a bad thing, and I’d assume that 
deploying a thing by name has some significance in many systems.  Go rename a 
class in... pick a language... tell me how that should be easier to do 
willy-nilly so it’s easier in the UI.  

I suppose you could view it as a limitation, but I don’t think you’ve 
illuminated a single use case where it’s an actual technical constraint or 
limitation.

The BEST argument against the current implementation is db performance.  It’s a 
hogwash argument.  Basic key indexes on low-cardinality string columns are 
plenty fast for the airflow workload, and if your task load is so high airflow 
can’t keep up, or you’re seeing super-fast tasks where airflow db/tracking latency 
is too much... perhaps a messaging or queue processing solution is better 
suited to those workloads.  We see scheduler bottlenecks long before the 
database for our “quick task” scenarios.  Additionally, reading through this 
list you’ll find people running airflow at substantial scale - I’ve not seen 
anyone complaining of production performance issues based on this design 
decision.   At first I hated it.  String keys are dirty, we’re all taught that 
as good little programmers.  Except here performance won’t be a huge 
consideration since it’s not OLTP, and ease of queryability is more important 
because it’s a growing system... good decision - whoever made it.

How does filename matter?  Frankly I wish the filename was REQUIRED to be the 
dag name so people would quit confusing themselves by mismatching them!   
We’ve renamed dag files with no issue as long as the content doesn’t change, so 
again, not a real use case.  And really - name your stuff carefully before you 
get to prod, man.

I gotta ask - airflowuser - are you gonna use airflow for anything, or just 
poke it with a stick from a distance and ask semi-inane questions of these fine 
folks that wrote and spend time working on this cool piece of kit?

B

Sent from a device with less than stellar autocorrect

> On Sep 20, 2018, at 3:12 PM, Driesprong, Fokko  wrote:
> 
> I like

Re: S3keysonsor

2018-05-21 Thread Brian Greene
I suggest it’ll work for your needs.

Sent from a device with less than stellar autocorrect

> On May 21, 2018, at 10:16 AM, purna pradeep  wrote:
> 
> Hi ,
> 
> I’m trying to evaluate airflow to see if it suits my needs.
> 
> Basically i can have below steps in a DAG
> 
> 
> 
> 1) Look for a file arrival at a given s3 location (this path is not completely
> static) (I can use S3KeySensor in this step)
> 
>  I should be able to specify to look either for the latest folder, or for a
> folder 24hrs or n days old, which has a _SUCCESS file as mentioned below
> 
>  sample file location(s):
> 
>  s3a://mybucket/20180425_111447_data1/_SUCCESS
> 
>  s3a://mybucket/20180424_111241_data1/_SUCCESS
> 
>  s3a://mybucket/20180424_111035_data1/_SUCCESS
> 
> 
> 
> 2) Invoke a simple REST API using SimpleHttpOperator once the above
> dependency is met; I can set step 1 as upstream of step 2
> 
> 
> 
> Does S3KeySensor support step 1 out of the box?
> 
> Also in some cases I may have to have a DAG without a start date & end date; it
> just needs to be triggered once a file is available in a given s3 location
> 
> 
> 
> *Please suggest !*
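
A minimal sketch of steps 1 and 2 (Airflow 1.x; the bucket layout, connection 
id, and endpoint are illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.http_operator import SimpleHttpOperator
    from airflow.operators.sensors import S3KeySensor

    dag = DAG('s3_then_rest', start_date=datetime(2018, 1, 1),
              schedule_interval=None)  # externally triggered, no schedule

    wait_for_file = S3KeySensor(
        task_id='wait_for_success_file',
        # templated date prefix matching folders like 20180425_111447_data1
        bucket_key='s3://mybucket/{{ ds_nodash }}_*_data1/_SUCCESS',
        wildcard_match=True,
        dag=dag)

    call_api = SimpleHttpOperator(
        task_id='call_rest_api',
        http_conn_id='my_api',  # hypothetical connection id
        endpoint='notify',
        method='POST',
        dag=dag)

    wait_for_file >> call_api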


Re: How to know the DAG is starting to run

2018-05-11 Thread Brian Greene
Okay I’ll bite...  WT* does that mean?

one of the best things about airflow is how easy it is to connect disparate 
systems... some would even say that’s much of the reason it exists..

It records reasonable metadata into an RDBMS (I suppose you could argue for 
other designs, but it’s pretty straightforward and for this workload easy 
enough to scale).  

So... what did you mean?

Sent from a device with less than stellar autocorrect

> On May 11, 2018, at 7:19 AM, Luke Diment  wrote:
> 
> We should really instrument an interface for airflow so integration is more 
> succinct between disparate systems!!!
> 
> Sent from my iPhone
> 
>> On 12/05/2018, at 12:09 AM, James Meickle  wrote:
>> 
>> Song:
>> 
>> You can put an operator as the very first node in the DAG, and have
>> everything else in the DAG depend on it. For example, this is the approach
>> we use to only execute DAG tasks on stock market trading days.
>> 
>> -James M.
>> 
>>> On Fri, May 11, 2018 at 3:57 AM, Song Liu  wrote:
>>> 
>>> Hi,
>>> 
>>> I have something I want to be done only once when the DAG is constructed,
>>> but it seems that the DAG is instantiated every time each of its
>>> operators runs.
>>> 
>>> So is there a function in DAG that tells us it is starting to run now?
>>> 
>>> Thanks,
>>> Song
>>> 
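
A minimal sketch of James's approach (Airflow 1.x; names are illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def on_dag_start(**context):
        # runs exactly once per DAG run, before everything else
        print("DAG run %s is starting" % context['run_id'])

    dag = DAG('gated_dag', start_date=datetime(2018, 1, 1),
              schedule_interval='@daily')

    start_gate = PythonOperator(
        task_id='start_gate',
        python_callable=on_dag_start,
        provide_context=True,
        dag=dag)

    # every other task in the DAG is set downstream of the gate, e.g.:
    # start_gate >> task_a >> task_b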


Re: About the project support in Airflow

2018-04-25 Thread Brian Greene
+1

Sent from a device with less than stellar autocorrect

> On Apr 25, 2018, at 12:04 PM, James Meickle  wrote:
> 
> Another reason you would want separated infrastructure is that there are a
> lot of ways to exhaust Airflow resources or otherwise cause contention -
> like having too many sensors or sub-DAGs using up all available tasks.
> 
> Doesn't seem like a great idea to push for having different teams with
> co-tenancy until there is also per-team control over resource use...
> 
> On Tue, Apr 24, 2018 at 8:27 PM, Song Liu (Cycle++ dev team)
> wrote:
> 
>> It seems that all the current approaches point to multiple instances of
>> airflow, but the project concept is very natural since one user might handle
>> different types of tasks.
>> 
>> Another thing about multi-user support: one way is also to deploy
>> multiple instances, but it seems that airflow provides multi-user
>> functionality built in.
>> 
>> So I cannot be convinced that multiple instances are the way to support
>> multiple projects.
>> 
>> Thanks,
>> Song
>> 
>> 
>> 
>> 
>> On Wed, Apr 25, 2018 at 4:25 AM +0800, "Ace Haidrey" wrote:
>> 
>> 
>> Looks neat Taylor!
>> 
>> And regarding the original question, going off of what Maxime and Bolke
>> said, at Pandora, it made more sense for us to have an instance per team
>> since each team has its own system user for prod and the instance can run
>> all processes as that user. Alternatively you could have a super user that
>> can sudo as those other system users, and have many teams on a single
>> instance but that is a security concern (what if one team sudo's as the
>> other team and accidentally overwrites data - there is nothing stopping
>> them from doing it). It depends what your org set up is, but let me know if
>> there are any questions I can help with.
>> 
>> Ace
>> 
>> 
>>> On Apr 24, 2018, at 1:16 PM, Taylor Edmiston  wrote:
>>> 
>>> We use a similar approach like Bolke mentioned with running multiple
>>> Airflow instances.
>>> 
>>> I haven't read the Pandora article yet, but we have an Astronomer Open
>>> Edition (fully open source) that bundles similar tools like Prometheus,
>>> Grafana, Celery, etc with Airflow and a Docker Compose file if you're
>>> looking to get a setup like that up and running quickly.
>>> 
>>> https://github.com/astronomerio/astronomer/blob/master/examples/airflow-enterprise/docker-compose.yml
>>> https://github.com/astronomerio/astronomer
>>> 
>>> *Taylor Edmiston*
>>> Blog  | Stack Overflow CV
>>> | LinkedIn
>>> | AngelList
>>> 
>>> 
>>> 
>>> On Tue, Apr 24, 2018 at 3:30 PM, Maxime Beauchemin <
>>> maximebeauche...@gmail.com> wrote:
>>> 
 Related blog post about multi-tenant Airflow deployment out of Pandora:
 https://engineering.pandora.com/apache-airflow-at-pandora-1d7a844d68ee
 
 On Tue, Apr 24, 2018 at 10:20 AM, Bolke de Bruin
 wrote:
 
> My suggestion would be to deploy airflow per project. You could even
>> use
> airflow to manage your ci/cd pipeline.
> 
> B.
> 
> Sent from my iPhone
> 
>> On 24 Apr 2018, at 18:33, Maxime Beauchemin <
 maximebeauche...@gmail.com>
> wrote:
>> 
>> People have been talking about namespacing DAGs in the past. I'd
> recommend
>> using tags (many to many) instead of categories/projects (one to
>> many).
>> 
>> It should be fairly easy to add this feature. One question is whether
> tags
>> are defined as code or in the UI/db only.
>> 
>> Max
>> 
>>> On Tue, Apr 24, 2018 at 1:48 AM, Song Liu
 wrote:
>>> 
>>> Hi,
>>> 
>>> Basically the DAGs are created for a project purpose, so if I have
 many
>>> different projects, will the Airflow support the Project concept and
>>> organize them separately ?
>>> 
>>> Is this a known requirement or any plan for this already ?
>>> 
>>> Thanks,
>>> Song
>>> 
> 
 
>> 
>> 
>> 


Re: Slot pools correct usage

2018-04-07 Thread Brian Greene
So what’s it doing (your config)?  Does it work if you don’t use pools?  What 
about if the pool is of size 2?   What if just one dag runs?  Have you ever 
seen this query work, or is it just since you started messing with pools that 
it stopped working?

I use 1 pool, no priority (I don’t care about sequence), and it “throttles” 
fine...

Which executor are you using?  I’m not familiar enough with the intricacies to 
know if the pool settings are honored with different executors, but I’m using 
CeleryExecutor with success.

B

Sent from a device with less than stellar autocorrect

> On Apr 6, 2018, at 10:40 PM, Manish Trivedi  wrote:
> 
> Hi Brian,
> 
> Really appreciate your quick reply. Just to be clear, I did not intend to
> run them in particular order. as a matter of fact, these are expensive db
> queries that I cant afford to run in parallel.
> I think I have setup the tasks correctly to use pool but may be missing the
> priority_weight setting correctly. Appreciate if you could run by your
> configs just to see if I am not missing any simple point.
> 
> thanks much,
> Manish
> 
> On Fri, Apr 6, 2018 at 6:18 PM, Brian Greene <
> br...@heisenbergwoodworking.com> wrote:
> 
>> To be clear, you’re hoping that setting the slots to 1 will cause the
>> tasks across distinct dags to run in order based on the assumption that
>> they’ll queue up and then execute off the pool?
>> 
>> I don’t think it will quite work that way - there’s no guarantee the
>> scheduler will execute your tasks across dags in any particular sequence,
>> and if 1 is “faster” than the other for sure they don’t “line up”.  Thus,
>> no way to ensure they’ll queue in the right order.
>> 
>> I successfully use pools across many dags to limit access to an expensive
>> resource and it works really well, but my design doesn’t require they
>> execute in any particular order, each idempotent.
>> 
>> I’m curious as to your design/constraints - could you elaborate?
>> 
>> Brian
>> 
>> Sent from a device with less than stellar autocorrect
>> 
>>> On Apr 6, 2018, at 3:46 PM, Manish Trivedi  wrote:
>>> 
>>> Hi Airflow devs,
>>> 
>>> I have a use case to limit the # of calls to a certain database. I am
>> using
>>> the pool along with priority weight to schedule the tasks to the slot
>> pool.
>>> I have around 5 operators that I need to execute in serial order across
>>> different dags.
>>> 
>>> Slot pool is created with "1" slot to ensure sequential execution. I am
>> not
>>> able to achieve the desired function with current setup.
>> 
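
For reference, a minimal sketch of the pool pattern that throttles well (the 
pool name is illustrative and must first be created, e.g. with one slot, under 
Admin -> Pools in the UI):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG('expensive_queries', start_date=datetime(2018, 1, 1),
              schedule_interval='@daily')

    run_query = BashOperator(
        task_id='run_expensive_query',
        bash_command='echo "run the query here"',
        pool='expensive_db',   # tasks sharing this pool queue for its slots
        priority_weight=10,    # higher weight runs first among queued tasks
        dag=dag)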


Re: Slot pools correct usage

2018-04-06 Thread Brian Greene
To be clear, you’re hoping that setting the slots to 1 will cause the tasks 
across distinct dags to run in order based on the assumption that they’ll queue 
up and then execute off the pool?

I don’t think it will quite work that way - there’s no guarantee the scheduler 
will execute your tasks across dags in any particular sequence, and if 1 is 
“faster” than the other for sure they don’t “line up”.  Thus, no way to ensure 
they’ll queue in the right order.

I successfully use pools across many dags to limit access to an expensive 
resource and it works really well, but my design doesn’t require they execute 
in any particular order, each idempotent.

I’m curious as to your design/constraints - could you elaborate?

Brian

Sent from a device with less than stellar autocorrect

> On Apr 6, 2018, at 3:46 PM, Manish Trivedi  wrote:
> 
> Hi Airflow devs,
> 
> I have a use case to limit the # of calls to a certain database. I am using
> the pool along with priority weight to schedule the tasks to the slot pool.
> I have around 5 operators that I need to execute in serial order across
> different dags.
> 
> Slot pool is created with "1" slot to ensure sequential execution. I am not
> able to achieve the desired function with current setup.


Re: RBAC Update

2018-03-30 Thread Brian Greene
I’d think we’d have privilege ‘can_view’ etc, and then a join table (priv) <-> 
(dagid) <-> (user/group).  Then it’s a simple query to get the perms for a 
given dag (as you list in option 2 below).

It also makes a “secure by default” easy - a lack of entries in that table for 
a dag can mean only “admin” access or some such.

Then any dag can have any combo of permissions for any combo of users.  Adding 
the groups option raises complexity around nesting, so maybe skip it for r1?
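
A rough sketch of that shape in SQLAlchemy terms (table and column names are 
made up for illustration, not Airflow's actual schema):

    from sqlalchemy import Column, ForeignKey, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class DagPermission(Base):
        """One row grants one privilege on one dag to one user."""
        __tablename__ = 'dag_permission'

        id = Column(Integer, primary_key=True)
        dag_id = Column(String(250), nullable=False)
        privilege = Column(String(50), nullable=False)  # e.g. 'can_view'
        user_id = Column(Integer, ForeignKey('users.id'), nullable=False)

    # "Secure by default": a dag with no rows here is admin-only.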

My $.02 

Brian

Sent from a device with less than stellar autocorrect

> On Mar 29, 2018, at 10:27 AM, Maxime Beauchemin  
> wrote:
> 
> Hijacking the thread further here, any thoughts on how to breakdown per DAG
> access?
> 
> Tao & I are talking about introducing per-DAG permissions and one big
> question is whether we'll need to support different operation-types at a
> per-DAG level, which changes the way we need to model the perms.
> 
> First [simpler] option is to introduce one perm per DAG. If you have access
> to 5 DAGs, and you have `can_clear` and `can_run`, you'll have homogenous
> rights on the DAGs you have access to.
> 
> Second option is to have a breakdown per DAG. Meaning for each DAG we
> create a set of perms ({dag_id}_can_view, {dag_id}_can_modify, ...). So one
> user could have modify on some DAGs, view on others, and other DAGs would
> be invisible. This could be broken down further ({dag_id}_can_clear, ...)
> but it gets hard to manage.
> 
> Thoughts?
> 
> Max
> 
>> On Wed, Mar 28, 2018 at 10:02 PM, Tao Feng  wrote:
>> 
>> Great work Joy. This is awesome! I am interested in helping out the per dag
>> level access.  Just created a ticket to check (AIRFLOW-2267). Let me know if
>> you have any suggestions. I will share my proposal once I am ready.
>> 
>>> On Fri, Mar 23, 2018 at 6:45 PM, Joy Gao  wrote:
>>> 
>>> Hey guys!
>>> 
>>> The RBAC UI has
>>> been merged to master. I'm looking forward to early adopters' feedback
>> and
>>> bug reports. I also hope to have more folks helping out with the RBAC UI,
>>> especially with introducing DAG-Level access control, which is a feature
>>> that a lot of people have been asking. If you are interested in helping
>> out
>>> with this effort, let's talk more!
>>> 
>>> This commit will be in the 1.10.0 release, and we are going to maintain
>>> both UIs simultaneously for a short period of time. Once RBAC UI is
>> stable
>>> and battle-tested, we will deprecate the old UI and eventually remove it
>>> from the repo (around Airflow 2.0.0 or 2.1.0 release). This is to prevent
>>> two UIs from forking into separate paths, as that would become very
>>> difficult to maintain.
>>> 
>>> Going forward while both UIs are up, if you are making a change to any
>>> files in airflow/www/ (old UI), where applicable, please also make the
>>> change to the airflow/www_rbac/ (new UI). If you rather not make changes
>> in
>>> both UIs, it is recommended that you only make the changes to the RBAC
>> UI,
>>> since that is the one we are maintaining in the long term.
>>> 
>>> I'm excited that the RBAC UI will be able to bring additional security to
>>> Airflow, and with FAB framework in place we can look into leveraging it
>> for
>>> a unified set of APIs used by both UI and CLI.
>>> 
>>> Joy
>>> 
>>> 
>>> 
 On Thu, Feb 8, 2018 at 11:31 AM, Joy Gao  wrote:
 
 Hi folks,
 
 I have a PR out for the new UI. I've included instructions on how to test
 it out in the PR description. Looking forward to your feedback.
 
 Cheers,
 Joy
 
> On Fri, Dec 1, 2017 at 6:18 PM, Joy Gao  wrote:
> 
> Thanks for the background info. Would be really awesome for you to
>> have
> PyPi access :D I'll make the change to have Airflow Webserver's FAB
> dependency pointing to my fork for the mean time.
> 
> For folks who are interested in RBAC, I will be giving a talk/demo at
>>> the Airflow
> Meet-Up
> >> Meetup/events/244525050/>
> next Monday. Happy to chat afterwards about it as well :)
> 
> On Thu, Nov 30, 2017 at 8:36 AM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
> 
>> A bit of related history here:
>> https://github.com/dpgaspar/Flask-AppBuilder/issues/399
>> 
>> On Thu, Nov 30, 2017 at 8:33 AM, Maxime Beauchemin <
>> maximebeauche...@gmail.com> wrote:
>> 
>>> Given I have merge rights on FAB I could probably do another round
>> of
>>> review and get your PRs through. I would really like to get the
>> main
>>> maintainer's input on things that touch the core (composite-key
>> support) as
>>> he might have concerns/intuitions that we can't know about.
>>> 
>>> I do not have Pypi access though so I cannot push new releases
>> out. I
>>> could ask for that.
>>> 
>

Re: How to have dynamic downstream tasks that depend on data generated upstream

2018-03-15 Thread Brian Greene
My $.02 - posted to SO as well.

I fought this use case for a long time.  In short, a dag that’s built based on 
the state of a changing resource, especially a db table, doesn’t fly so well in 
airflow.

My solution was to write a small custom operator that’s a subclass of 
TriggerDagRunOperator; it does the query and then triggers dagruns for each of 
the subprocesses.

It makes the process “join” downstream more interesting, but in my use case I 
was able to work around it with another dag that polls and short-circuits if 
all the subprocesses for a given day have completed.  In other cases partition 
sensors can do the trick.

I have several use cases like this (iterative dag trigger based on a dynamic 
source), and after a lot of fighting with making dynamic Subdags work (a lot), 
I switched to this “trigger subprocess” strategy and have been doing well since.

Note - this may make a large number of dagruns for one target dag.  This 
makes the UI challenging in some places, but it’s workable (and I’ve started 
querying the db directly because I’m not ready to write a plugin that does UI 
stuff).
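
For illustration, a rough sketch of that operator shape, assuming Airflow 1.x 
internals; fetch_unprocessed_rows() is a made-up placeholder for the source 
query:

    from airflow import settings
    from airflow.models import DagBag
    from airflow.operators.dagrun_operator import TriggerDagRunOperator
    from airflow.utils.state import State

    class TriggerPerRowOperator(TriggerDagRunOperator):
        """Kick off one run of the target dag per unprocessed source row."""

        def execute(self, context):
            rows = fetch_unprocessed_rows()  # hypothetical query helper
            target = DagBag(settings.DAGS_FOLDER).get_dag(self.trigger_dag_id)
            for i, row in enumerate(rows):
                target.create_dagrun(
                    run_id='sub_%s_%d' % (context['ds'], i),
                    state=State.RUNNING,
                    conf={'row': row},
                    external_trigger=True)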

Sent from a device with less than stellar autocorrect

> On Mar 15, 2018, at 7:22 AM, James Meickle  wrote:
> 
> To my mind, it's not a great idea to clear a resource that you're
> dynamically using to determine the contents of a DAG. Is there any way that
> you can refactor the table to be immutable? Instead of querying all rows in
> the table, you would query records in an "unprocessed" state. Instead of
> truncating the table, you would mark everything in the table as
> "processed". (Though optional, it would be even better for each row to
> store the date it was processed, so that you can re-run this DAG in the
> future.)
> 
> If storing that much data or refactoring the table isn't possible, could
> you run this query once for the day, store the results in S3 (or Redis, or
> ...), and always fetch those results? That way the DAG always has the "most
> recent" view, even if you delete records mid-day.
> 
> On Wed, Mar 14, 2018 at 10:20 PM, Aaron Polhamus 
> wrote:
> 
>> Question for the community. Did some hunting around and didn't see any
>> compelling answers. SO link:
>> https://stackoverflow.com/questions/49290546/how-to-set-
>> up-a-dag-when-downstream-task-definitions-depend-on-upstream-outcomes
>> 
>> 
>> --
>> 
>> 
>> *Aaron Polhamus*
>> *Chief Technology Officer *
>> 
>> Cel (México): +52 (55) 1951-5612
>> Cell (USA): +1 (206) 380-3948
>> Tel: +52 (55) 1168 9757 - Ext. 181
>> 
>> --
>> 
>> 


how to pass parameters to subdag operator from triggered dag

2018-01-25 Thread Brian Greene
Good evening,

Here's the setup -

I'm following a "trigger dag -> called(processing) dag" kind of structure.
Then I set timespans, catchups, etc on the triggering dag.  This allows me
to separate the scheduling from the execution, and so far is working great.

The triggered dag uses parameters passed to it to execute.  This has been
easy so far as I've been using python operators and the context from the
operator is passed (including getting at the parameters passed from the
trigger in **kwargs).
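
For concreteness, a tiny sketch of that working pattern ('target_date' is an
illustrative key set by the triggering dag's conf):

    # inside a PythonOperator callable with provide_context=True
    def process(**kwargs):
        conf = kwargs['dag_run'].conf or {}
        target_date = conf.get('target_date')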

Now I'd like to use a subdag operator in the called dag.  I've passed
parameters to a subdag, but only when the parent wasn't "triggered".  Now
I'm in the called DAG, and I'd like access to the parameters from the
trigger to pass to the subdag... and I'm stuck.

I've looked at the subdag operator source code, the trigger operator,
python operator... all trying to find how to get at the context. (perhaps
the wrong word in this case, but it's the idea of the context from the
trigger that I'm after... )

I'm sure I'm missing something simple, and would appreciate any pointers.

Thanks,
Brian