Re: Speeding up the scheduler - request for comments

2016-06-03 Thread Maxime Beauchemin
Caching is a last-resort solution and probably not a good thing here. It would introduce lag and confusion. You seem to say that some things are evaluated twice within a scheduler cycle? What would that be? Another option is to reduce the number of database interactions and make sure indexes are in pl

Re: Speeding up the scheduler - request for comments

2016-06-03 Thread Dan Davydov
Scheduler loop times are definitely a concern (at least for Airbnb), and +1 for option 2 as well if it can be implemented correctly. What is important for me is that we should always be able to easily tell which of the dependencies are met and which aren't in the event-based model.

Re: Proposed Changes to 'airflow initdb' & a config file.

2016-06-03 Thread Chris Riccomini
Hey Paul, While I recognize the use case for this, I view it as more of a deployment-related thing. It adds some complexity to Airflow, and I think it's better suited to be run as part of a deployment system. A Python script can be written that uses Airflow's models and SQLAlchemy to initiate and

Re: Airflow Contributors Meeting (June 01, 2016) : Minutes

2016-06-03 Thread Chris Riccomini
Hey Jeremiah, Something that's been floating in my head is a basic assertion script for DAGs that will validate things are as expected. This can be used to monitor test DAGs (especially if we do nightly builds). The assertions could be things like: * This DAG should have an execution date every N

Re: Speeding up the scheduler - request for comments

2016-06-03 Thread Chris Riccomini
Hey Bolke,
> Are scheduler loop times a concern at all?
Yes, I strongly believe that they are. Especially as we add more DAGs/tasks. I am not a fan of (1). Caching is just going to create cache consistency issues, and be really annoying to manage, IMO. I agree that (2) seems more appealing. I c

Speeding up the scheduler - request for comments

2016-06-03 Thread Bolke de Bruin
Hi, I am looking at speeding up the scheduler. Currently loop times increase with the number of tasks in a DAG. This is due to TaskInstance.are_dependencies_met executing several aggregation functions on the database. These calls are expensive: between 0.05 and 0.15 s per task and for every scheduler

Re: separating DAGs and code and handling PYTHONPATH

2016-06-03 Thread Lance Norskog
About structuring memory use: we have some major chunks of code set up as web services. We have a separate machine that runs one service (a Java-based app) and is limited to running 20 at once so that we can't run out of RAM. Our installation uses a separate Docker container for each Airflow app.

Re: Airflow scheduler/worker inefficient time

2016-06-03 Thread Maxime Beauchemin
Note that in general, Airflow isn't designed to run thousands of small tasks per minute. The Celery library on its own does that well without any oversight from Airflow, though then you miss out on what Airflow has to provide (complex dependency management, state handling, logging, retries, ...).

RE: Airflow scheduler/worker inefficient time

2016-06-03 Thread Ryabchuk, Pavlo
Hey, Had a look at this Celery config option, but no luck. Also tried setting the executor to the Local executor: same result. Each task takes no more than 0.1 sec, but the overall time is huge. Thought that it could be due to disabled pickling; enabled it, almost no change :(

Re: separating DAGs and code and handling PYTHONPATH

2016-06-03 Thread Dennis O'Brien
Thanks very much for the help. It seems I had two errors happening here. First, as Mattias pointed out, I was doing it wrong with the jinja2.PackageLoader. (It's always embarrassing to email a dev list when the error is somewhere entirely different.) I switched to jinja2.FileLoader and it worke