ashb commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916396684


   Forgive me, it's late and it's been a long day, so I may not be as lucid as 
I'd hope.
   
   My main concern here is about being able to reason about what a DAG will do. 
If we allow arbitrary pre_execute code to run before any operator in a DAG, we 
end up in a world where it is very hard to look at a DAG and understand what 
it's going to do. 
   
   > So in this case, we can't start processing the data before we know it's 
come in (let's assume that this is entirely based on time of day, you can't 
"sense" it).
   
   I dispute the 'you can't "sense" it'. Strongly. And processing based on 
timing alone is the worst possible idea -- removing arbitrary time delays 
between tasks was one of the main reasons that Airflow has dependencies between 
tasks.
   
   The evolution of a data processing workflow often goes:
   
    - Oh, I've only got one thing to run, I can put it on cron
    - Oh and a second one, but it's unrelated to the first, I can cron that too
    - Now I want to combine those two outputs, it's okay I'll just delay it by 
an hour.
    
   That approach will work for months. Right up until you hit an inflection 
point (more users, more processing) and then suddenly your entire pipeline is 
in an inconsistent state: maybe you combined data from two different days, and 
maybe you don't notice it for a month. This is not hyperbole, but lived 
experience.
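   
   To make the contrast with "delay it by an hour" concrete, here is a minimal 
sketch of ordering by declared dependencies instead of by clock time. It uses 
only the Python standard library (`graphlib`, Python 3.9+), not Airflow, and 
the task names `job_a`, `job_b`, `combine` and the helper `run_order` are made 
up for illustration.

```python
from graphlib import TopologicalSorter

# Map each (hypothetical) task to the set of tasks it must wait for.
# "combine" depends on both upstream jobs; no time delay is involved.
DEPENDENCIES = {
    "job_a": set(),
    "job_b": set(),
    "combine": {"job_a", "job_b"},
}


def run_order(dependencies):
    """Return an order in which every task comes after all of its upstreams."""
    return list(TopologicalSorter(dependencies).static_order())
```

   However long `job_a` or `job_b` take, `combine` can only ever come after 
both of them, which is the guarantee a fixed delay cannot give you.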
   
   As for "you can't sense it": Either it's a file on disk/s3/blob store, or a 
table in a DB, but if you are about to have an operator process it (i.e. read 
it or copy it), then you can, by definition, sense if it's there or not. 
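   
   As a rough sketch of what "sensing" means for the file case: poll for the 
input, and only process it once it exists. This is plain Python using the 
standard library, not Airflow's sensor API; `wait_for_path`, 
`process_when_ready` and their parameters are illustrative names, and the same 
idea would apply to an object in a blob store or a table in a DB with a 
different existence check.

```python
import os
import time


def wait_for_path(path, poke_interval=1.0, timeout=10.0):
    """Poll until `path` exists. Return True if it appeared, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


def process_when_ready(path):
    """Read the input only after it has been sensed; fail loudly otherwise."""
    if not wait_for_path(path, poke_interval=0.1, timeout=1.0):
        raise TimeoutError(f"{path} never arrived")
    with open(path) as f:
        return f.read()
```

   The point is that the check uses the same access the processing step needs 
anyway, so a missing input becomes an explicit failure rather than a silent 
read of stale or partial data.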
   
   To the "skip expensive operation if dev": I've not seen anyone ask for that 
-- reading/writing to a different bucket in different envs, plenty of times, 
but never skipping an operation entirely based on env (cos if you've skipped 
one step, you have to skip the entire "branch" too.) I had a quick search on 
https://apache-airflow.slack-archives.org and couldn't find it -- you might 
have a better idea what to search for (it uses Postgres full text search so 
the stemming might be a bit simplistic).
   
   No, there's nothing I have planned in the AIPs I hinted at in my keynote 
(most of them are just ideas anyway at this stage)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

