[jira] [Created] (AIRFLOW-2184) Create a druid_checker operator
Tao Feng created AIRFLOW-2184: - Summary: Create a druid_checker operator Key: AIRFLOW-2184 URL: https://issues.apache.org/jira/browse/AIRFLOW-2184 Project: Apache Airflow Issue Type: Improvement Reporter: Tao Feng Assignee: Tao Feng Once we agree on the extended interface provided through druid_hook in AIRFLOW-2183, we would like to create a druid_checker operator to do basic data quality checking on data in druid. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2183) Refactor DruidHook to able to issue arbitrary query to druid broker
Tao Feng created AIRFLOW-2183: - Summary: Refactor DruidHook to able to issue arbitrary query to druid broker Key: AIRFLOW-2183 URL: https://issues.apache.org/jira/browse/AIRFLOW-2183 Project: Apache Airflow Issue Type: Improvement Reporter: Tao Feng Assignee: Tao Feng Currently the DruidHook only maintains a connection to the overlord and is used solely for ingestion purposes. We would like to extend the hook so that it can also be used to issue queries to the Druid broker. There are a couple of benefits: # Allow any operator to issue a query to the Druid broker. # Allow us later on to create a druid_checker for data quality purposes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
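The broker-query extension proposed here could be sketched roughly as below. This is an illustrative assumption, not the merged Airflow API: the function name, default host, and default port are made up for the example (Druid does expose a `/druid/v2/sql` endpoint on the broker for SQL queries).

```python
import json

# Hypothetical sketch of the AIRFLOW-2183 idea: build a SQL query request
# for the Druid broker. Names and defaults here are illustrative assumptions.
BROKER_SQL_ENDPOINT = "/druid/v2/sql"

def build_broker_query(sql, broker_host="localhost", broker_port=8082):
    """Return the broker URL and JSON payload for a Druid SQL query."""
    url = "http://{}:{}{}".format(broker_host, broker_port, BROKER_SQL_ENDPOINT)
    payload = json.dumps({"query": sql})
    return url, payload

# An extended DruidHook could POST `payload` to `url` (e.g. with requests)
# and hand the parsed rows to a druid_checker operator (AIRFLOW-2184).
url, payload = build_broker_query("SELECT COUNT(*) FROM my_datasource")
```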
[jira] [Commented] (AIRFLOW-2175) Failed to upgradedb 1.8.2 -> 1.9.0
[ https://issues.apache.org/jira/browse/AIRFLOW-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387346#comment-16387346 ] Damian Momot commented on AIRFLOW-2175: --- This DAG is no longer in the dag files but still exists in the DB. I'm 99% sure it wasn't a sub-DAG; it was probably just left in this strange state. As I remember, this installation was initialized on Airflow 1.8.1, then upgraded to 1.8.2. The field is nullable in the DB; the DB is MySQL:
{code:java}
DESCRIBE dag;
Field     Type           Null  Key  Default  Extra
...
fileloc   varchar(2000)  YES        NULL
{code}
As a workaround I'll just manually alter the record in the DB, but a null check is probably a good idea, especially since the field is nullable in the DB - it might affect others
> Failed to upgradedb 1.8.2 -> 1.9.0
> --
>
> Key: AIRFLOW-2175
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2175
> Project: Apache Airflow
> Issue Type: Bug
> Components: db
> Affects Versions: 1.9.0
> Reporter: Damian Momot
> Priority: Critical
>
> We've got an airflow installation with hundreds of DAGs and thousands of tasks.
> During the upgrade (1.8.2 -> 1.9.0) we got the following error.
> After analyzing the stacktrace I found that it's most likely caused by a None
> value in the 'fileloc' column of the dag table. 
I checked database and indeed we've > got one record with such value: > > > {code:java} > SELECT COUNT(*) FROM dag WHERE fileloc IS NULL; > 1 > SELECT COUNT(*) FROM dag; > 343 > {code} > > > {code:java} > Traceback (most recent call last): > File "/usr/local/bin/airflow", line 27, in > args.func(args) > File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 913, > in upgradedb > db_utils.upgradedb() > File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 320, > in upgradedb > command.upgrade(config, 'heads') > File "/usr/local/lib/python2.7/dist-packages/alembic/command.py", line 174, > in upgrade > script.run_env() > File "/usr/local/lib/python2.7/dist-packages/alembic/script/base.py", line > 416, in run_env > util.load_python_file(self.dir, 'env.py') > File "/usr/local/lib/python2.7/dist-packages/alembic/util/pyfiles.py", line > 93, in load_python_file > module = load_module_py(module_id, path) > File "/usr/local/lib/python2.7/dist-packages/alembic/util/compat.py", line > 79, in load_module_py > mod = imp.load_source(module_id, path, fp) > File "/usr/local/lib/python2.7/dist-packages/airflow/migrations/env.py", > line 86, in > run_migrations_online() > File "/usr/local/lib/python2.7/dist-packages/airflow/migrations/env.py", > line 81, in run_migrations_online > context.run_migrations() > File "", line 8, in run_migrations > File > "/usr/local/lib/python2.7/dist-packages/alembic/runtime/environment.py", line > 807, in run_migrations > self.get_context().run_migrations(**kw) > File "/usr/local/lib/python2.7/dist-packages/alembic/runtime/migration.py", > line 321, in run_migrations > step.migration_fn(**kw) > File > "/usr/local/lib/python2.7/dist-packages/airflow/migrations/versions/cc1e65623dc7_add_max_tries_column_to_task_instance.py", > line 63, in upgrade > dag = dagbag.get_dag(ti.dag_id) > File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 232, > in get_dag > filepath=orm_dag.fileloc, only_if_updated=False) > 
File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 249, > in process_file > if not os.path.isfile(filepath): > File "/usr/lib/python2.7/genericpath.py", line 29, in isfile > st = os.stat(path) > TypeError: coercing to Unicode: need string or buffer, NoneType found{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
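The null check suggested in the comment above could look roughly like the sketch below; `safe_isfile` is a hypothetical helper for illustration, not the actual patch.

```python
import os

# Hypothetical guard for the failure above: a dag row whose fileloc is NULL
# reaches os.path.isfile() as None, which raises TypeError on Python 2.
def safe_isfile(filepath):
    """Return False for a missing/NULL file location instead of crashing."""
    if filepath is None:
        return False
    return os.path.isfile(filepath)

# The manual DB workaround mentioned above amounts to something like:
#   DELETE FROM dag WHERE fileloc IS NULL;
# (or an UPDATE setting fileloc to the correct path, if it is known).
```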
[jira] [Closed] (AIRFLOW-2118) get_pandas_df does always pass a list of rows to be parsed
[ https://issues.apache.org/jira/browse/AIRFLOW-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diane Ivy closed AIRFLOW-2118. -- Resolution: Fixed Fixed with https://github.com/apache/incubator-airflow/pull/3066 > get_pandas_df does always pass a list of rows to be parsed > -- > > Key: AIRFLOW-2118 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2118 > Project: Apache Airflow > Issue Type: Bug > Components: contrib, hooks >Affects Versions: 1.9.0 > Environment: pandas-gbq 0.3.1 >Reporter: Diane Ivy >Assignee: Diane Ivy >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > While parsing the pages in get_pandas_df, if only one page is returned > it starts popping off each row and then gbq_parse_data works incorrectly. > {{while len(pages) > 0:}} > {{ page = pages.pop()}} > {{ dataframe_list.append(gbq_parse_data(schema, page))}} > Possible solution: > {{from google.cloud import bigquery}} > {{if isinstance(pages[0], bigquery.table.Row):}} > {{ pages = [pages]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
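The proposed fix quoted above normalizes a single page (a flat list of rows) into a list of pages. A self-contained illustration of that pattern, using a stand-in `Row` class instead of `google.cloud.bigquery.table.Row`:

```python
class Row(object):
    """Stand-in for google.cloud.bigquery.table.Row (illustration only)."""
    def __init__(self, values):
        self.values = values

def normalize_pages(pages):
    """Wrap a bare page of rows so callers can always iterate page-by-page,
    mirroring the isinstance(pages[0], Row) check proposed in the issue."""
    if pages and isinstance(pages[0], Row):
        return [pages]
    return pages

single_page = [Row([1]), Row([2])]
pages = normalize_pages(single_page)  # wrapped into a one-page list
```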
[jira] [Commented] (AIRFLOW-2118) get_pandas_df does always pass a list of rows to be parsed
[ https://issues.apache.org/jira/browse/AIRFLOW-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387075#comment-16387075 ] Diane Ivy commented on AIRFLOW-2118: [~Yuyin.Yang] This seems to be fixed in the latest version since it no longer uses gbq_parse_data. > get_pandas_df does always pass a list of rows to be parsed > -- > > Key: AIRFLOW-2118 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2118 > Project: Apache Airflow > Issue Type: Bug > Components: contrib, hooks >Affects Versions: 1.9.0 > Environment: pandas-gbq 0.3.1 >Reporter: Diane Ivy >Assignee: Diane Ivy >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > While parsing the pages in get_pandas_df, if only one page is returned > it starts popping off each row and then gbq_parse_data works incorrectly. > {{while len(pages) > 0:}} > {{ page = pages.pop()}} > {{ dataframe_list.append(gbq_parse_data(schema, page))}} > Possible solution: > {{from google.cloud import bigquery}} > {{if isinstance(pages[0], bigquery.table.Row):}} > {{ pages = [pages]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AIRFLOW-2181) Convert DOS formatted files to UNIX
[ https://issues.apache.org/jira/browse/AIRFLOW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387025#comment-16387025 ] Dan Fowler commented on AIRFLOW-2181: - PR: https://github.com/apache/incubator-airflow/pull/3102 > Convert DOS formatted files to UNIX > --- > > Key: AIRFLOW-2181 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2181 > Project: Apache Airflow > Issue Type: Task >Reporter: Dan Fowler >Assignee: Dan Fowler >Priority: Trivial > > While looking into an issue related to the password_auth backend I noticed > the following files are in DOS format: > > tests/www/api/experimental/test_password_endpoints.py > airflow/contrib/auth/backends/password_auth.py > > I can't think of a reason why these should be DOS formatted, but if there is > let me know and I can close this out. Otherwise, I'll submit a PR for this > fix. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (AIRFLOW-226) Create separate pip packages for webserver and hooks
[ https://issues.apache.org/jira/browse/AIRFLOW-226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Davydov reopened AIRFLOW-226: - > Create separate pip packages for webserver and hooks > > > Key: AIRFLOW-226 > URL: https://issues.apache.org/jira/browse/AIRFLOW-226 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Dan Davydov >Priority: Minor > > There are users who want only the airflow hooks, and others who may not need > the front-end. The hooks and webserver should be moved into their own > packages, with the current airflow package depending on these packages. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AIRFLOW-226) Create separate pip packages for webserver and hooks
[ https://issues.apache.org/jira/browse/AIRFLOW-226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387016#comment-16387016 ] Dan Davydov commented on AIRFLOW-226: - I feel strongly (at least for hooks) that they should be moved out. Things like storing secrets in the Airflow database, hooks, etc. are convenient, but they are equivalent to plugins and should have their own owners and maintainers. It doesn't make sense to, e.g., make the owner and expert of the HiveHook a committer in this repo, but they certainly should be the committer and maintainer of a HiveHook repo. Another point for decoupling hooks from the core is that supporting backwards-incompatible changes to all operators doesn't scale for the Airflow committers - we are effectively supporting many hooks of which we have no domain knowledge. Other systems such as Jenkins follow a similar plugin framework. > Create separate pip packages for webserver and hooks > > > Key: AIRFLOW-226 > URL: https://issues.apache.org/jira/browse/AIRFLOW-226 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Dan Davydov >Priority: Minor > > There are users who want only the airflow hooks, and others who may not need > the front-end. The hooks and webserver should be moved into their own > packages, with the current airflow package depending on these packages. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (AIRFLOW-232) Web UI shows inaccurate task counts on main dashboard
[ https://issues.apache.org/jira/browse/AIRFLOW-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor closed AIRFLOW-232. - Resolution: Not A Problem > Web UI shows inaccurate task counts on main dashboard > - > > Key: AIRFLOW-232 > URL: https://issues.apache.org/jira/browse/AIRFLOW-232 > Project: Apache Airflow > Issue Type: Bug >Affects Versions: Airflow 1.7.1.2 >Reporter: Sergei Iakhnin >Priority: Major > Attachments: screenshot-1.png > > > Postgres, celery, rabbitmq, 170 worker nodes, 1 master. > select count(*), state from task_instance where dag_id = 'freebayes' group by state; > upstream_failed 2134 > up_for_retry 520 > success 141421 > running 542 > failed 1165 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AIRFLOW-247) EMR Hook, Operators, Sensor
[ https://issues.apache.org/jira/browse/AIRFLOW-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor resolved AIRFLOW-247. --- Resolution: Fixed Fix Version/s: 1.8.0 Fixed in https://github.com/apache/incubator-airflow/commit/9f49f12853d83dd051f0f1ed58b5df20bfcfe087 > EMR Hook, Operators, Sensor > --- > > Key: AIRFLOW-247 > URL: https://issues.apache.org/jira/browse/AIRFLOW-247 > Project: Apache Airflow > Issue Type: New Feature >Reporter: Rob Froetscher >Assignee: Rob Froetscher >Priority: Minor > Fix For: 1.8.0 > > > Substory of https://issues.apache.org/jira/browse/AIRFLOW-115. It would be > nice to have an EMR hook and operators. > Hook to generally interact with EMR. > Operators to: > * setup and start a job flow > * add steps to an existing jobflow > A sensor to: > * monitor completion and status of EMR jobs -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AIRFLOW-236) Support passing S3 credentials through environmental variables
[ https://issues.apache.org/jira/browse/AIRFLOW-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor resolved AIRFLOW-236. --- Resolution: Fixed This is possible, both using the AWS standard {{AWS_ACCESS_KEY_ID}} and via specifying connections via env vars with {{AIRFLOW_CONN_S3=s3://}} > Support passing S3 credentials through environmental variables > -- > > Key: AIRFLOW-236 > URL: https://issues.apache.org/jira/browse/AIRFLOW-236 > Project: Apache Airflow > Issue Type: Improvement > Components: core >Reporter: Jakob Homan >Priority: Major > > Right now we expect S3 configs to be passed through one of a variety of > config files, or through extra parameters in the connection screen. It'd be > nice to be able to pass these through env variables and note as such through > the extra parameters. This would lessen the need to include credentials in > the webapp itself. > Alternatively, for logging (rather than as a connector), it might just be > better for Airflow to use the profile defined as AWS_DEFAULT and avoid needing > an explicit configuration at all. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
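A minimal sketch of the two mechanisms named in the resolution above; the connection id `s3_default` and all credential values are placeholders for illustration:

```python
import os

# Standard AWS credential variables, picked up by boto-based code:
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA-PLACEHOLDER"
os.environ["AWS_SECRET_ACCESS_KEY"] = "placeholder-secret"

# Airflow connection defined via environment variable: the variable name is
# AIRFLOW_CONN_ followed by the upper-cased connection id, and the value is
# a connection URI (here just the s3 scheme, as in the resolution comment).
os.environ["AIRFLOW_CONN_S3_DEFAULT"] = "s3://"
```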
[jira] [Resolved] (AIRFLOW-230) [HiveServer2Hook] adding multi statements support
[ https://issues.apache.org/jira/browse/AIRFLOW-230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor resolved AIRFLOW-230. --- Resolution: Fixed Fix Version/s: 1.8.0 Fixed as https://github.com/apache/incubator-airflow/commit/a599167c433246d96bea711d8bfd5710b2c9d3ff > [HiveServer2Hook] adding multi statements support > - > > Key: AIRFLOW-230 > URL: https://issues.apache.org/jira/browse/AIRFLOW-230 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Maxime Beauchemin >Priority: Major > Fix For: 1.8.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (AIRFLOW-229) new DAG runs 5 times when manually started from website
[ https://issues.apache.org/jira/browse/AIRFLOW-229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor closed AIRFLOW-229. - Resolution: Invalid Not an issue anymore. Feel free to re-open if anyone is still seeing this behaviour! > new DAG runs 5 times when manually started from website > --- > > Key: AIRFLOW-229 > URL: https://issues.apache.org/jira/browse/AIRFLOW-229 > Project: Apache Airflow > Issue Type: Bug > Components: scheduler >Affects Versions: Airflow 1.6.2 > Environment: celery, rabbitmq, mysql >Reporter: audubon >Priority: Minor > > version 1.6.2 > using celery, rabbitmq, mysql > example: > from airflow import DAG > from airflow.operators import BashOperator > from datetime import datetime, timedelta > import json > import sys > one_day_ahead = datetime.combine(datetime.today() + timedelta(1), > datetime.min.time()) > one_day_ahead = one_day_ahead.replace(hour=3, minute=31) > default_args = { > 'owner': 'airflow', > 'depends_on_past': False, > 'start_date': one_day_ahead, > 'email': ['m...@email.com'], > 'email_on_failure': True, > 'email_on_retry': False, > 'retries': 1, > 'retry_delay': timedelta(minutes=5), > } > dag = DAG('alpha', default_args=default_args , schedule_interval='15 6 * * *' > ) > task = BashOperator( > task_id='alphaV2', > bash_command='sleep 10', > dag=dag) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (AIRFLOW-226) Create separate pip packages for webserver and hooks
[ https://issues.apache.org/jira/browse/AIRFLOW-226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor closed AIRFLOW-226. - Resolution: Won't Fix Given that most of Airflow's dependencies are optional, installing Airflow itself is not that heavy - and the extra development overhead on an open-source project means this is not likely to happen -- especially given that the cost to the end user is a few extra packages installed. (Sorry to resurrect a really old ticket only to close it Won't Fix. If you feel strongly about this we can reopen and discuss it) > Create separate pip packages for webserver and hooks > > > Key: AIRFLOW-226 > URL: https://issues.apache.org/jira/browse/AIRFLOW-226 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Dan Davydov >Priority: Minor > > There are users who want only the airflow hooks, and others who may not need > the front-end. The hooks and webserver should be moved into their own > packages, with the current airflow package depending on these packages. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AIRFLOW-215) Airflow worker (CeleryExecutor) needs to be restarted to pick up tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor resolved AIRFLOW-215. --- Resolution: Fixed Doesn't apply on 1.9.0 or 1.8.2. Was fixed at some point > Airflow worker (CeleryExecutor) needs to be restarted to pick up tasks > -- > > Key: AIRFLOW-215 > URL: https://issues.apache.org/jira/browse/AIRFLOW-215 > Project: Apache Airflow > Issue Type: Bug > Components: celery, subdag >Affects Versions: Airflow 1.7.1.2 >Reporter: Cyril Scetbon >Priority: Major > > We have a main dag that dynamically creates subdags containing tasks using > BashOperator. Using CeleryExecutor we see Celery tasks been created with > *STARTED* status but they are not picked up by our worker. However, if we > restart our worker, then tasks are picked up. > Here you can find code if you want to try to reproduce it > https://www.dropbox.com/s/8u7xf8jt55v8zio/dags.zip. > We also tested using LocalExecutor and everything worked fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AIRFLOW-191) Database connection leak on Postgresql backend
[ https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor resolved AIRFLOW-191. --- Resolution: Fixed Fix Version/s: Airflow 1.8 Merged in as https://github.com/apache/incubator-airflow/commit/4905a5563d47b45e38b91661ee5aa7f3765a129b > Database connection leak on Postgresql backend > -- > > Key: AIRFLOW-191 > URL: https://issues.apache.org/jira/browse/AIRFLOW-191 > Project: Apache Airflow > Issue Type: Bug > Components: executor >Affects Versions: Airflow 1.7.1.2 >Reporter: Sergei Iakhnin >Priority: Major > Fix For: Airflow 1.8 > > Attachments: Sid_anands_airflow_idle_in_transaction.png > > > I raised this issue on github several months ago and there was even a PR but > it never made it into mainline. Basically, workers tend to hang onto DB > connections in Postgres for recording heartbeats. > I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is > configured to allow 1000 simultaneous connections. I should effectively be > able to run 920 tasks at the same time, but am actually limited to only about > 450-480 because of idle transactions from workers hanging on to DB > connections. 
> If I run the following query > select count(*), state, client_hostname from pg_stat_activity group by state, client_hostname > These are the results: > count state client_hostname > 1 active (null) > 1 idle localhost > 451 idle in transaction (null) > 446 idle (null) > 1 active localhost > The idle connections are all trying to run COMMIT > The "idle in transaction" connections are all trying to run > SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, > job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date > AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, > job.executor_class AS job_executor_class, job.hostname AS job_hostname, > job.unixname AS job_unixname > FROM job > WHERE job.id = 213823 > LIMIT 1 > with differing job.ids of course. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AIRFLOW-187) Make PR tool more user-friendly
[ https://issues.apache.org/jira/browse/AIRFLOW-187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor resolved AIRFLOW-187. --- Resolution: Fixed Fixed by the merged https://github.com/apache/incubator-airflow/pull/1565 > Make PR tool more user-friendly > --- > > Key: AIRFLOW-187 > URL: https://issues.apache.org/jira/browse/AIRFLOW-187 > Project: Apache Airflow > Issue Type: Improvement > Components: PR tool >Reporter: Jeremiah Lowin >Priority: Minor > > General JIRA improvement that can be referenced for any UX improvements to > the PR tool, including better or more prompts, documentation, etc. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AIRFLOW-184) Add clear/mark success to CLI
[ https://issues.apache.org/jira/browse/AIRFLOW-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386953#comment-16386953 ] Ash Berlin-Taylor commented on AIRFLOW-184: --- Is this issue still relevant? > Add clear/mark success to CLI > - > > Key: AIRFLOW-184 > URL: https://issues.apache.org/jira/browse/AIRFLOW-184 > Project: Apache Airflow > Issue Type: Bug > Components: cli >Reporter: Chris Riccomini >Assignee: Joy Gao >Priority: Major > > AIRFLOW-177 pointed out that the current CLI does not allow us to clear or > mark success a task (including upstream, downstream, past, future, and > recursive) the way that the UI widget does. Given a goal of keeping parity > between the UI and CLI, it seems like we should support this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (AIRFLOW-182) CLI command `airflow backfill` fails while CLI `airflow run` succeeds
[ https://issues.apache.org/jira/browse/AIRFLOW-182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor closed AIRFLOW-182. - Resolution: Cannot Reproduce Airflow 1.7 is now quite old. If this is still happening on the latest version please open another issue and we'd be happy to help solve it > CLI command `airflow backfill` fails while CLI `airflow run` succeeds > - > > Key: AIRFLOW-182 > URL: https://issues.apache.org/jira/browse/AIRFLOW-182 > Project: Apache Airflow > Issue Type: Bug > Components: celery >Affects Versions: Airflow 1.7.0 > Environment: Heroku Cedar 14, Heroku Redis as Celery Broker >Reporter: Hariharan Mohanraj >Priority: Minor > > When I run the backfill command, I get an error that claims there is no dag > in my dag folder with the name "unusual_prefix_dag1", although my dag is > actually named dag1. However, when I run the run command, the task is > scheduled and it works flawlessly. > {code} > $ airflow backfill -t task1 -s 2016-05-01 -e 2016-05-07 dag1 > 2016-05-26T23:22:28.816908+00:00 app[worker.1]: [2016-05-26 23:22:28,816] > {__init__.py:36} INFO - Using executor CeleryExecutor > 2016-05-26T23:22:29.214006+00:00 app[worker.1]: Traceback (most recent call > last): > 2016-05-26T23:22:29.214083+00:00 app[worker.1]: File > "/app/.heroku/python/bin/airflow", line 15, in > 2016-05-26T23:22:29.214121+00:00 app[worker.1]: args.func(args) > 2016-05-26T23:22:29.214151+00:00 app[worker.1]: File > "/app/.heroku/python/lib/python2.7/site-packages/airflow/bin/cli.py", line > 174, in run > 2016-05-26T23:22:29.214207+00:00 app[worker.1]: > DagPickle).filter(DagPickle.id == args.pickle).first() > 2016-05-26T23:22:29.214230+00:00 app[worker.1]: File > "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/orm/query.py", > line 2634, in first > 2016-05-26T23:22:29.214616+00:00 app[worker.1]: ret = list(self[0:1]) > 2016-05-26T23:22:29.214626+00:00 app[worker.1]: File > 
"/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/orm/query.py", > line 2457, in __getitem__ > 2016-05-26T23:22:29.214984+00:00 app[worker.1]: return list(res) > 2016-05-26T23:22:29.214992+00:00 app[worker.1]: File > "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/orm/loading.py", > line 86, in instances > 2016-05-26T23:22:29.215053+00:00 app[worker.1]: util.raise_from_cause(err) > 2016-05-26T23:22:29.215074+00:00 app[worker.1]: File > "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/util/compat.py", > line 200, in raise_from_cause > 2016-05-26T23:22:29.215121+00:00 app[worker.1]: reraise(type(exception), > exception, tb=exc_tb, cause=cause) > 2016-05-26T23:22:29.215142+00:00 app[worker.1]: File > "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/orm/loading.py", > line 71, in instances > 2016-05-26T23:22:29.215175+00:00 app[worker.1]: rows = [proc(row) for row > in fetch] > 2016-05-26T23:22:29.215200+00:00 app[worker.1]: File > "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/orm/loading.py", > line 428, in _instance > 2016-05-26T23:22:29.215274+00:00 app[worker.1]: loaded_instance, > populate_existing, populators) > 2016-05-26T23:22:29.215282+00:00 app[worker.1]: File > "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/orm/loading.py", > line 486, in _populate_full > 2016-05-26T23:22:29.215369+00:00 app[worker.1]: dict_[key] = getter(row) > 2016-05-26T23:22:29.215406+00:00 app[worker.1]: File > "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py", > line 1253, in process > 2016-05-26T23:22:29.215574+00:00 app[worker.1]: return loads(value) > 2016-05-26T23:22:29.215595+00:00 app[worker.1]: File > "/app/.heroku/python/lib/python2.7/site-packages/dill/dill.py", line 260, in > loads > 2016-05-26T23:22:29.215657+00:00 app[worker.1]: return load(file) > 2016-05-26T23:22:29.215678+00:00 app[worker.1]: File > "/app/.heroku/python/lib/python2.7/site-packages/dill/dill.py", line 250, in 
> load > 2016-05-26T23:22:29.215738+00:00 app[worker.1]: obj = pik.load() > 2016-05-26T23:22:29.215758+00:00 app[worker.1]: File > "/app/.heroku/python/lib/python2.7/pickle.py", line 858, in load > 2016-05-26T23:22:29.215895+00:00 app[worker.1]: dispatch[key](self) > 2016-05-26T23:22:29.215902+00:00 app[worker.1]: File > "/app/.heroku/python/lib/python2.7/pickle.py", line 1090, in load_global > 2016-05-26T23:22:29.216069+00:00 app[worker.1]: klass = > self.find_class(module, name) > 2016-05-26T23:22:29.216077+00:00 app[worker.1]: File > "/app/.heroku/python/lib/python2.7/site-packages/dill/dill.py", line 406, in > find_class > 2016-05-26T23:22:29.216181+00:00 app[worker.1]: return > StockU
[jira] [Closed] (AIRFLOW-181) Travis builds fail due to corrupt cache
[ https://issues.apache.org/jira/browse/AIRFLOW-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor closed AIRFLOW-181. - Resolution: Fixed Closed by https://github.com/apache/incubator-airflow/commit/afcd4fcf01696ee26911640cdeb481defd93c3aa > Travis builds fail due to corrupt cache > --- > > Key: AIRFLOW-181 > URL: https://issues.apache.org/jira/browse/AIRFLOW-181 > Project: Apache Airflow > Issue Type: Bug >Reporter: Bolke de Bruin >Assignee: Bolke de Bruin >Priority: Major > > Corrupt cache is preventing from unpacking hadoop. It needs to redownload the > distribution without checking the cache -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (AIRFLOW-160) Parse DAG files through child processes
[ https://issues.apache.org/jira/browse/AIRFLOW-160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor closed AIRFLOW-160. - Resolution: Fixed Fix Version/s: Airflow 1.8 Fixed by https://github.com/apache/incubator-airflow/commit/fdb7e949140b735b8554ae5b22ad752e86f6ebaf > Parse DAG files through child processes > --- > > Key: AIRFLOW-160 > URL: https://issues.apache.org/jira/browse/AIRFLOW-160 > Project: Apache Airflow > Issue Type: Improvement > Components: scheduler >Reporter: Paul Yang >Assignee: Paul Yang >Priority: Major > Fix For: Airflow 1.8 > > > Currently, the Airflow scheduler parses all user DAG files in the same > process as the scheduler itself. We've seen issues in production where bad > DAG files cause scheduler to fail. A simple example is if the user script > calls `sys.exit(1)`, the scheduler will exit as well. We've also seen an > unusual case where modules loaded by the user DAG affect operation of the > scheduler. For better uptime, the scheduler should be resistant to these > problematic user DAGs. > The proposed solution is to parse and schedule user DAGs through child > processes. This way, the main scheduler process is more isolated from bad > DAGs. There's a side benefit as well - since parsing is distributed among > multiple processes, it's possible to parse the DAG files more frequently, > reducing the latency between when a DAG is modified and when the changes are > picked up. > Another issue right now is that all DAGs must be scheduled before any tasks > are sent to the executor. This means that the frequency of task scheduling is > limited by the slowest DAG to schedule. The changes needed for scheduling > DAGs through child processes will also make it easy to decouple this process > and allow tasks to be scheduled and sent to the executor in a more > independent fashion. This way, overall scheduling won't be held back by a > slow DAG. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
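The isolation argument in the issue above can be illustrated with a minimal sketch (not the actual scheduler code): if parsing runs in a child process, a DAG file that calls `sys.exit()` kills only the child, and the parent keeps running.

```python
import multiprocessing
import sys

def parse_dag_file():
    """Stand-in for parsing a user DAG file; this one is hostile."""
    sys.exit(1)  # a bad DAG file calling sys.exit() only kills the child

def parse_isolated():
    """Parse in a child process; the parent (the scheduler) survives and
    can inspect the child's exit code afterwards."""
    child = multiprocessing.Process(target=parse_dag_file)
    child.start()
    child.join()
    return child.exitcode

if __name__ == "__main__":
    exitcode = parse_isolated()  # the parent process is still alive here
```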
[jira] [Closed] (AIRFLOW-147) HiveServer2Hook.to_csv() writing one row at a time and causing excessive logging
[ https://issues.apache.org/jira/browse/AIRFLOW-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor closed AIRFLOW-147. - Resolution: Fixed Fixed by https://github.com/apache/incubator-airflow/commit/a5c00b3f1581580818b585b21abd3df3fa68af64 > HiveServer2Hook.to_csv() writing one row at a time and causing excessive > logging > > > Key: AIRFLOW-147 > URL: https://issues.apache.org/jira/browse/AIRFLOW-147 > Project: Apache Airflow > Issue Type: Bug > Components: hooks >Affects Versions: Airflow 1.7.0 >Reporter: Michael Musson >Priority: Minor > > The default behavior of fetchmany() in impala dbapi (which airflow switched > to recently) is to return a single row at a time. This causes HiveServer2's > to_csv() method to output one row of logging for each row of data in the > results. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
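The DB-API issue above generalizes: `fetchmany()` falls back to the driver's default batch size (a single row here), so one remedy is to pass an explicit size. A generic sketch using sqlite3 as a stand-in driver; the helper name is hypothetical:

```python
import sqlite3

def fetch_in_batches(cursor, batch_size=1000):
    """Yield rows in explicit batches so a to_csv()-style loop can log once
    per batch rather than once per row."""
    while True:
        rows = cursor.fetchmany(batch_size)  # explicit size, not driver default
        if not rows:
            break
        yield rows

# Demo: 2500 rows fetched as batches of 1000, 1000, and 500.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(2500)])
cur = conn.execute("SELECT x FROM t")
batches = list(fetch_in_batches(cur, batch_size=1000))
```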
[jira] [Resolved] (AIRFLOW-129) Allow CELERYD_PREFETCH_MULTIPLIER to be configurable
[ https://issues.apache.org/jira/browse/AIRFLOW-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor resolved AIRFLOW-129. --- Resolution: Fixed Fix Version/s: Airflow 1.9.0 Not the nicest interface for configuring, but it is now possible to do without patching Airflow. > Allow CELERYD_PREFETCH_MULTIPLIER to be configurable > > > Key: AIRFLOW-129 > URL: https://issues.apache.org/jira/browse/AIRFLOW-129 > Project: Apache Airflow > Issue Type: Improvement > Components: celery >Affects Versions: Airflow 1.7.0 >Reporter: Nam Ngo >Priority: Major > Fix For: Airflow 1.9.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Airflow needs to allow everyone to customise their prefetch limit. Some might > have short-running tasks and don't want the overhead of celery latency. > More on that here: > http://docs.celeryproject.org/en/latest/userguide/optimizing.html#optimizing-prefetch-limit -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (AIRFLOW-135) Clean up git branches (remove old + implement versions)
[ https://issues.apache.org/jira/browse/AIRFLOW-135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor closed AIRFLOW-135. - Resolution: Fixed There are now only 6 branches. Nice and clean :) > Clean up git branches (remove old + implement versions) > --- > > Key: AIRFLOW-135 > URL: https://issues.apache.org/jira/browse/AIRFLOW-135 > Project: Apache Airflow > Issue Type: Improvement > Components: project-management >Reporter: Jeremiah Lowin >Priority: Minor > Labels: git > Fix For: Airflow 1.8 > > > We have a large number of branches in the git repo, most of which are old > features -- I would bet hardly any of them are active. I think they should be > deleted if possible. In addition, we should begin using branches (as opposed > to tags) to allow easy switching between Airflow versions. Spark > (https://github.com/apache/spark) uses the format {{branch-X.X}}; others like > Kafka (https://github.com/apache/kafka) simply use a version number. But this > is an important way to browse the history and, most importantly, can't be > overwritten like a tag (since tags point at commits and commits can be > rebased away). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AIRFLOW-110) Point people to the appropriate process to submit PRs in the repository's CONTRIBUTING.md
[ https://issues.apache.org/jira/browse/AIRFLOW-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor resolved AIRFLOW-110. --- Resolution: Fixed With the addition of the {{.github}} folder this is now quite obvious on GitHub. > Point people to the appropriate process to submit PRs in the repository's > CONTRIBUTING.md > > > Key: AIRFLOW-110 > URL: https://issues.apache.org/jira/browse/AIRFLOW-110 > Project: Apache Airflow > Issue Type: Task > Components: docs >Reporter: Arthur Wiedmer >Priority: Trivial > Labels: documentation, newbie > > The current process to contribute code could be made more accessible. I am > assuming that the entry point to the project is GitHub and the repository. We > could modify the CONTRIBUTING.md as well as the README to point to the > proper way to do this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AIRFLOW-2123) Install CI Dependencies from setup.py
[ https://issues.apache.org/jira/browse/AIRFLOW-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor resolved AIRFLOW-2123. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request #3054 [https://github.com/apache/incubator-airflow/pull/3054] > Install CI Dependencies from setup.py > - > > Key: AIRFLOW-2123 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2123 > Project: Apache Airflow > Issue Type: Bug >Reporter: Fokko Driesprong >Priority: Major > Fix For: 2.0.0 > > > Right now we have two places where we keep our dependencies. This is setup.py > for installation and requirements.txt for the CI. These files run terribly > out of sync and therefore I think it is a good idea to install the CI's > dependencies using this setup.py so we have everything in one single place. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
incubator-airflow git commit: [AIRFLOW-2123] Install CI dependencies from setup.py
Repository: incubator-airflow Updated Branches: refs/heads/master f1df3de9b -> 976fd1245 [AIRFLOW-2123] Install CI dependencies from setup.py Install the dependencies from setup.py so we keep all the dependencies in one single place Closes #3054 from Fokko/fd-fix-ci-2 Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/976fd124 Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/976fd124 Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/976fd124 Branch: refs/heads/master Commit: 976fd1245a981b37957e4e35367b0e504d8e3d67 Parents: f1df3de Author: Fokko Driesprong Authored: Mon Mar 5 22:46:07 2018 + Committer: Ash Berlin-Taylor Committed: Mon Mar 5 22:46:45 2018 + -- scripts/ci/requirements.txt | 97 scripts/ci/travis_script.sh | 2 + setup.py| 17 +-- tox.ini | 4 +- 4 files changed, 17 insertions(+), 103 deletions(-) -- http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/976fd124/scripts/ci/requirements.txt -- diff --git a/scripts/ci/requirements.txt b/scripts/ci/requirements.txt deleted file mode 100644 index 9c028d5..000 --- a/scripts/ci/requirements.txt +++ /dev/null @@ -1,97 +0,0 @@ -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -alembic -azure-storage>=0.34.0 -bcrypt -bleach -boto -boto3 -celery -cgroupspy -chartkick -cloudant<2.0 -coverage -coveralls -croniter>=0.3.17 -cryptography -datadog -dill -distributed -docker-py -filechunkio -flake8 -flask -flask-admin -flask-bcrypt -flask-cache -flask-login==0.2.11 -Flask-WTF -flower -freezegun -future -google-api-python-client>=1.5.0,<1.6.0 -gunicorn -hdfs -hive-thrift-py -impyla -ipython -jaydebeapi -jinja2<2.9.0 -jira -ldap3 -lxml -markdown -mock -moto==1.1.19 -mysqlclient -nose -nose-exclude -nose-ignore-docstring==0.2 -nose-timer -oauth2client>=2.0.2,<2.1.0 -pandas -pandas-gbq -parameterized -paramiko>=2.1.1 -pendulum>=1.3.2 -psutil>=4.2.0, <5.0.0 -psycopg2 -pygments -pyhive -pykerberos -PyOpenSSL -PySmbClient -python-daemon -python-dateutil -python-jenkins -qds-sdk>=1.9.6 -redis -rednose -requests -requests-kerberos -requests_mock -sendgrid -setproctitle -slackclient -sphinx -sphinx-argparse -Sphinx-PyPI-upload -sphinx_rtd_theme -sqlalchemy>=1.1.15, <1.2.0 -statsd -thrift -thrift_sasl -unicodecsv -zdesk -kubernetes http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/976fd124/scripts/ci/travis_script.sh -- diff --git a/scripts/ci/travis_script.sh b/scripts/ci/travis_script.sh index 86c086a..8766e94 100755 --- a/scripts/ci/travis_script.sh +++ b/scripts/ci/travis_script.sh @@ -1,3 +1,5 @@ +#!/usr/bin/env bash + # Licensed to the Apache Software Foundation (ASF) under one * # or more contributor license agreements. 
See the NOTICE file * # distributed with this work for additional information* http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/976fd124/setup.py -- diff --git a/setup.py b/setup.py index d3f48e3..254aa3e 100644 --- a/setup.py +++ b/setup.py @@ -27,6 +27,7 @@ logger = logging.getLogger(__name__) version = imp.load_source( 'airflow.version', os.path.join('airflow', 'version.py')).version +PY3 = sys.version_info[0] == 3 class Tox(TestCommand): user_options = [('tox-args=', None, "Arguments to pass to tox")] @@ -153,8 +154,7 @@ ldap = ['ldap3>=0.9.9.1'] kerberos = ['pykerberos>=1.1.13', 'requests_kerberos>=0.10.0', 'thrift_sasl>=0.2.0', -'snakebite[kerberos]>=2.7.8', -'kerberos>=1.2.5'] +'snakebite[kerberos]>=2.7.8'] password = [ 'bcrypt>=2.0.0', 'flask-bcrypt>=0.7.1', @@ -166,6 +166,8 @@ redis = ['redis>=2.10.5'] kubernetes = ['kubernetes>=3.0.0', 'cryptography>=2.0.0'] +zendesk = ['zdesk'] + all_dbs = postgres + mysql + hive + mssql + hdfs + vertica + cloudant devel = [ 'click', @@ -185,9 +187,15 @@ devel = [ ] devel_minreq = devel + kubernetes + mysql + doc + password + s3 + cgroups devel_hadoop = devel_minreq +
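The setup.py hunk above composes CI dependency sets from shared extras lists instead of a separate requirements.txt. A toy sketch of that single-source-of-truth pattern (the concrete package lists and extra names here are illustrative, not the exact final setup.py):

```python
# Toy sketch: each dependency is declared once in a shared list, and the
# extras_require entries are built by concatenating those lists, so the CI
# can install via "pip install -e .[devel_minreq]" rather than a drifting
# requirements.txt. Package lists are illustrative only.
devel = ["click", "mock", "nose"]
mysql = ["mysqlclient"]
doc = ["sphinx"]

extras = {
    "devel": devel,
    "devel_minreq": sorted(set(devel + mysql + doc)),
}
```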
[jira] [Created] (AIRFLOW-2182) Configured
Richard Ferrer created AIRFLOW-2182: --- Summary: Configured Key: AIRFLOW-2182 URL: https://issues.apache.org/jira/browse/AIRFLOW-2182 Project: Apache Airflow Issue Type: New Feature Components: authentication Reporter: Richard Ferrer Assignee: Richard Ferrer -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AIRFLOW-2181) Convert DOS formatted files to UNIX
[ https://issues.apache.org/jira/browse/AIRFLOW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386816#comment-16386816 ] Ash Berlin-Taylor commented on AIRFLOW-2181: No reason at all - PR welcomed! > Convert DOS formatted files to UNIX > --- > > Key: AIRFLOW-2181 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2181 > Project: Apache Airflow > Issue Type: Task >Reporter: Dan Fowler >Assignee: Dan Fowler >Priority: Trivial > > While looking into an issue related to the password_auth backend I noticed > the following files are in DOS format: > > tests/www/api/experimental/test_password_endpoints.py > airflow/contrib/auth/backends/password_auth.py > > I can't think of a reason why these should be DOS formatted, but if there is > let me know and I can close this out. Otherwise, I'll submit a PR for this > fix. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AIRFLOW-97) "airflow" "DAG" strings in file necessary to import dag
[ https://issues.apache.org/jira/browse/AIRFLOW-97?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor updated AIRFLOW-97: - Affects Version/s: Airflow 1.9.0 > "airflow" "DAG" strings in file necessary to import dag > --- > > Key: AIRFLOW-97 > URL: https://issues.apache.org/jira/browse/AIRFLOW-97 > Project: Apache Airflow > Issue Type: Bug > Components: scheduler >Affects Versions: Airflow 1.7.0, Airflow 1.9.0 >Reporter: Etiene Dalcol >Priority: Minor > > Hello airflow team! Thanks for the awesome tool! > We made a small module to automate our DAG building process and we are using > this module on our DAG definition. Our airflow version is 1.7.0. > However, airflow will not import this file because it doesn't have the words > DAG and airflow on it. (The imports etc are done inside our little module). > Apparently there's a safe_mode that skips files without these strings. > (https://github.com/apache/incubator-airflow/blob/1.7.0/airflow/models.py#L197) > This safe_mode is default to True but is not passed to the process_file > function, so it is always True and there's no apparent way to disable it. > (https://github.com/apache/incubator-airflow/blob/1.7.0/airflow/models.py#L177) > (https://github.com/apache/incubator-airflow/blob/1.7.0/airflow/models.py#L313) > Putting this comment on the top of the file makes it work for the moment and > brought me a good laugh today 👯 > #DAG airflow —> DO NOT REMOVE. the world will explode -- This message was sent by Atlassian JIRA (v7.6.3#76005)
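The safe_mode heuristic the reporter describes can be sketched as a simple substring check — a file is only parsed as a DAG definition when both magic strings appear in its source (function name is illustrative, not Airflow's internal API):

```python
# Sketch of the safe_mode heuristic from AIRFLOW-97: a file is skipped
# unless both "airflow" and "DAG" occur somewhere in its source text,
# which is why the reporter's magic comment makes the import work.
def looks_like_dag_file(source, safe_mode=True):
    if not safe_mode:
        return True
    return "airflow" in source and "DAG" in source
```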
[jira] [Closed] (AIRFLOW-42) Adding logging.debug DagBag loading stats
[ https://issues.apache.org/jira/browse/AIRFLOW-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor closed AIRFLOW-42. Resolution: Fixed Fix Version/s: 1.8.0 Merged in May 2016 via https://github.com/apache/incubator-airflow/commit/3c3f5a67ff80f3e8942aef441f481c62baf97184 > Adding logging.debug DagBag loading stats > - > > Key: AIRFLOW-42 > URL: https://issues.apache.org/jira/browse/AIRFLOW-42 > Project: Apache Airflow > Issue Type: Bug >Reporter: Maxime Beauchemin >Assignee: Maxime Beauchemin >Priority: Major > Fix For: 1.8.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (AIRFLOW-19) How can I have an Operator B iterate over a list returned from upstream by Operator A?
[ https://issues.apache.org/jira/browse/AIRFLOW-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor closed AIRFLOW-19. Resolution: Not A Bug As discussed, the mailing list (http://mail-archives.apache.org/mod_mbox/incubator-airflow-dev/) is the best place for questions like this. > How can I have an Operator B iterate over a list returned from upstream by > Operator A? > -- > > Key: AIRFLOW-19 > URL: https://issues.apache.org/jira/browse/AIRFLOW-19 > Project: Apache Airflow > Issue Type: Bug > Components: operators >Reporter: Praveenkumar Venkatesan >Priority: Minor > Labels: support > > Here is what I am trying to do exactly: > https://gist.github.com/praveev/7b93b50746f8e965f7139ecba028490a > the python operator log just returns the following > [2016-04-28 11:56:22,296] {models.py:1041} INFO - Executing > on 2016-04-28 11:56:12 > [2016-04-28 11:56:22,350] {python_operator.py:66} INFO - Done. Returned value > was: None > it didn't even print my kwargs and to_process data > To simplify this. Lets say t1 returns 3 elements. I want to iterate over the > list and run t2 -> t3 for each element. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2181) Convert DOS formatted files to UNIX
Dan Fowler created AIRFLOW-2181: --- Summary: Convert DOS formatted files to UNIX Key: AIRFLOW-2181 URL: https://issues.apache.org/jira/browse/AIRFLOW-2181 Project: Apache Airflow Issue Type: Task Reporter: Dan Fowler Assignee: Dan Fowler While looking into an issue related to the password_auth backend I noticed the following files are in DOS format: tests/www/api/experimental/test_password_endpoints.py airflow/contrib/auth/backends/password_auth.py I can't think of a reason why these should be DOS formatted, but if there is let me know and I can close this out. Otherwise, I'll submit a PR for this fix. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
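The conversion the ticket asks for is a mechanical line-ending rewrite. A minimal sketch in Python (the actual PR may simply run dos2unix; this is the same CRLF-to-LF transformation, in place):

```python
# Minimal sketch of the DOS -> UNIX conversion AIRFLOW-2181 asks for:
# rewrite a file's CRLF line endings as bare LF, in place. Reading and
# writing in binary mode avoids any platform newline translation.
def crlf_to_lf(path):
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(data.replace(b"\r\n", b"\n"))
```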
[jira] [Work started] (AIRFLOW-2181) Convert DOS formatted files to UNIX
[ https://issues.apache.org/jira/browse/AIRFLOW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on AIRFLOW-2181 started by Dan Fowler. --- > Convert DOS formatted files to UNIX > --- > > Key: AIRFLOW-2181 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2181 > Project: Apache Airflow > Issue Type: Task >Reporter: Dan Fowler >Assignee: Dan Fowler >Priority: Trivial > > While looking into an issue related to the password_auth backend I noticed > the following files are in DOS format: > > tests/www/api/experimental/test_password_endpoints.py > airflow/contrib/auth/backends/password_auth.py > > I can't think of a reason why these should be DOS formatted, but if there is > let me know and I can close this out. Otherwise, I'll submit a PR for this > fix. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AIRFLOW-2180) Import Errors on Custom Logging Produce Unhelpful Messages
[ https://issues.apache.org/jira/browse/AIRFLOW-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Lawrence Pamplona updated AIRFLOW-2180: - Attachment: Screen Shot 2018-03-05 at 1.19.07 PM.png > Import Errors on Custom Logging Produce Unhelpful Messages > -- > > Key: AIRFLOW-2180 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2180 > Project: Apache Airflow > Issue Type: Bug >Reporter: Kevin Lawrence Pamplona >Priority: Minor > Attachments: Screen Shot 2018-03-05 at 1.19.07 PM.png > > > Repro Steps: > 1. Use airflow.cfg with missing [core/remote_logging] field > 2. Start airflow or run `PYTHONPATH=config/ python -c 'import log_conf'`given > that custom logging config is in config/log_conf.py > Execution will produce an irrelevant error: > 'Unable to load custom logging from {}'.format(logging_class_path) > ImportError: Unable to load custom logging from log_config.LOGGING_CONFIG > No handlers could be found for logger > "airflow.utils.log.logging_mixin.LoggingMixin" -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2180) Import Errors on Custom Logging Produce Unhelpful Messages
Kevin Lawrence Pamplona created AIRFLOW-2180: Summary: Import Errors on Custom Logging Produce Unhelpful Messages Key: AIRFLOW-2180 URL: https://issues.apache.org/jira/browse/AIRFLOW-2180 Project: Apache Airflow Issue Type: Bug Reporter: Kevin Lawrence Pamplona Repro Steps: 1. Use airflow.cfg with missing [core/remote_logging] field 2. Start airflow or run `PYTHONPATH=config/ python -c 'import log_conf'`given that custom logging config is in config/log_conf.py Execution will produce an irrelevant error: 'Unable to load custom logging from {}'.format(logging_class_path) ImportError: Unable to load custom logging from log_config.LOGGING_CONFIG No handlers could be found for logger "airflow.utils.log.logging_mixin.LoggingMixin" -- This message was sent by Atlassian JIRA (v7.6.3#76005)
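One way to make the error message helpful is to chain the underlying cause into it. A hedged sketch (this is not Airflow's actual loader, just the shape of the fix the report suggests):

```python
import importlib

# Hedged sketch of surfacing the real ImportError instead of only the
# generic "Unable to load custom logging from ..." message described above.
def load_logging_config(path):
    """Import 'package.module.ATTR' and return ATTR, with a useful error."""
    module_name, attr = path.rsplit(".", 1)
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    except (ImportError, AttributeError) as err:
        # include the underlying cause so the operator can see *why*
        # the import failed (missing module, bad attribute name, ...)
        raise ImportError(
            "Unable to load custom logging from {}: {}".format(path, err))
```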
[jira] [Commented] (AIRFLOW-2179) Make parametrable the IP on which the worker log server binds to
[ https://issues.apache.org/jira/browse/AIRFLOW-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386692#comment-16386692 ] Ash Berlin-Taylor commented on AIRFLOW-2179: Sounds like a sensible change. > Make parametrable the IP on which the worker log server binds to > > > Key: AIRFLOW-2179 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2179 > Project: Apache Airflow > Issue Type: Improvement > Components: celery, webserver >Reporter: Albin Gilles >Priority: Minor > > Hello, > I'd be glad if the tiny web server subprocess to serve the workers local log > files could be set to bind to localhost only as could be done for Gunicorn or > Flower. See > [cli.py#L865|https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L865] > If you don't see any issue with that possibility, I'll be happy to propose a > PR on github. > Regards, > Albin. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AIRFLOW-2163) Add HBC Digital to list of companies using Airflow
[ https://issues.apache.org/jira/browse/AIRFLOW-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor resolved AIRFLOW-2163. Resolution: Fixed > Add HBC Digital to list of companies using Airflow > -- > > Key: AIRFLOW-2163 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2163 > Project: Apache Airflow > Issue Type: Bug >Reporter: Terry McCartan >Assignee: Terry McCartan >Priority: Trivial > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[1/2] incubator-airflow git commit: [AIRFLOW-2163] Add HBC Digital to users of airflow
Repository: incubator-airflow Updated Branches: refs/heads/master 1ac4d07d0 -> f1df3de9b [AIRFLOW-2163] Add HBC Digital to users of airflow Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/8b6eab7a Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/8b6eab7a Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/8b6eab7a Branch: refs/heads/master Commit: 8b6eab7a269c7e74fb30cdf7efe7070c38bdc1b3 Parents: 2511c46 Author: Terry McCartan Authored: Fri Mar 2 12:47:51 2018 + Committer: Terry McCartan Committed: Fri Mar 2 12:47:51 2018 + -- README.md | 1 + 1 file changed, 1 insertion(+) -- http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/8b6eab7a/README.md -- diff --git a/README.md b/README.md index fa7bb77..b3ba1b7 100644 --- a/README.md +++ b/README.md @@ -135,6 +135,7 @@ Currently **officially** using Airflow: 1. [Gusto](https://gusto.com) [[@frankhsu](https://github.com/frankhsu)] 1. [Handshake](https://joinhandshake.com/) [[@mhickman](https://github.com/mhickman)] 1. [Handy](http://www.handy.com/careers/73115?gh_jid=73115&gh_src=o5qcxn) [[@marcintustin](https://github.com/marcintustin) / [@mtustin-handy](https://github.com/mtustin-handy)] +1. [HBC Digital](http://tech.hbc.com) [[@tmccartan](https://github.com/tmccartan) & [@dmateusp](https://github.com/dmateusp)] 1. [Healthjump](http://www.healthjump.com/) [[@miscbits](https://github.com/miscbits)] 1. [HBO](http://www.hbo.com/)[[@yiwang](https://github.com/yiwang)] 1. [HelloFresh](https://www.hellofresh.com) [[@tammymendt](https://github.com/tammymendt) & [@davidsbatista](https://github.com/davidsbatista) & [@iuriinedostup](https://github.com/iuriinedostup)]
[2/2] incubator-airflow git commit: Merge pull request #3084 from tmccartan/master
Merge pull request #3084 from tmccartan/master Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/f1df3de9 Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/f1df3de9 Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/f1df3de9 Branch: refs/heads/master Commit: f1df3de9bb3fa5c8206ed9e7f0b089a92785b81a Parents: 1ac4d07 8b6eab7 Author: Ash Berlin-Taylor Authored: Mon Mar 5 20:17:14 2018 + Committer: Ash Berlin-Taylor Committed: Mon Mar 5 20:17:14 2018 + -- README.md | 1 + 1 file changed, 1 insertion(+) --
[jira] [Updated] (AIRFLOW-2179) Make parametrable the IP on which the worker log server binds to
[ https://issues.apache.org/jira/browse/AIRFLOW-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Albin Gilles updated AIRFLOW-2179: -- Description: Hello, I'd be glad if the tiny web server subprocess to serve the workers local log files could be set to bind to localhost only as could be done for Gunicorn or Flower. See [cli.py#L865|https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L865] If you don't see any issue with that possibility, I'll be happy to propose a PR on github. Regards, Albin. was: Hello, I'd be glad if the tiny web server subprocess to serve the workers local log files could be set to bind to localhost only as could be done for Gunicorn or Flower. See [cli.py#L865|https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L865 If you don't see any issue with that possibility, I'll be happy to propose a PR on github. Regards, Albin. > Make parametrable the IP on which the worker log server binds to > > > Key: AIRFLOW-2179 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2179 > Project: Apache Airflow > Issue Type: Improvement > Components: celery, webserver >Reporter: Albin Gilles >Priority: Minor > > Hello, > I'd be glad if the tiny web server subprocess to serve the workers local log > files could be set to bind to localhost only as could be done for Gunicorn or > Flower. See > [cli.py#L865|https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L865] > If you don't see any issue with that possibility, I'll be happy to propose a > PR on github. > Regards, > Albin. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2179) Make parametrable the IP on which the worker log server binds to
Albin Gilles created AIRFLOW-2179: - Summary: Make parametrable the IP on which the worker log server binds to Key: AIRFLOW-2179 URL: https://issues.apache.org/jira/browse/AIRFLOW-2179 Project: Apache Airflow Issue Type: Improvement Components: celery, webserver Reporter: Albin Gilles Hello, I'd be glad if the tiny web server subprocess to serve the workers local log files could be set to bind to localhost only as could be done for Gunicorn or Flower. See [cli.py#L865|https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L865] If you don't see any issue with that possibility, I'll be happy to propose a PR on github. Regards, Albin. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
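The requested change amounts to reading the bind address from configuration instead of hard-coding 0.0.0.0. A hypothetical sketch — the option name "worker_log_server_ip" is invented for illustration and is not an actual Airflow setting:

```python
# Hypothetical sketch: resolve the bind address for the worker's
# log-serving sub-process from configuration, falling back to today's
# hard-coded 0.0.0.0. "worker_log_server_ip" is an invented option name.
def resolve_log_server_bind(conf, default="0.0.0.0"):
    return conf.get("celery", {}).get("worker_log_server_ip", default)
```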
[jira] [Updated] (AIRFLOW-2178) Scheduler can't get past SLA check if SMTP settings are incorrect
[ https://issues.apache.org/jira/browse/AIRFLOW-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Meickle updated AIRFLOW-2178: --- Attachment: log.txt > Scheduler can't get past SLA check if SMTP settings are incorrect > - > > Key: AIRFLOW-2178 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2178 > Project: Apache Airflow > Issue Type: Bug > Components: scheduler >Affects Versions: 1.9.0 > Environment: 16.04 >Reporter: James Meickle >Priority: Major > Attachments: log.txt > > > After testing Airflow for a while in staging, I provisioned our prod cluster > and enabled the first DAG on it. The "backfill" for this DAG performed just > fine, so I assumed everything was working and left it over the weekend. > However, when the last "backfill" period completed and the scheduler > transitioned to the most recent execution date, it began failing in the > `manage_slas` method. Due to a configuration difference, SMTP was timing out > in production, preventing the SLA check from ever completing; this both > blocked SLA notifications and prevented further tasks in this DAG > from ever getting scheduled. > As an operator, I would expect Airflow to treat scheduling tasks as a > higher-priority concern, and to do so even if the SLA feature fails to work. I > would also expect Airflow to notify me in the web UI that email sending is > not currently working. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2178) Scheduler can't get past SLA check if SMTP settings are incorrect
James Meickle created AIRFLOW-2178: -- Summary: Scheduler can't get past SLA check if SMTP settings are incorrect Key: AIRFLOW-2178 URL: https://issues.apache.org/jira/browse/AIRFLOW-2178 Project: Apache Airflow Issue Type: Bug Components: scheduler Affects Versions: 1.9.0 Environment: 16.04 Reporter: James Meickle After testing Airflow for a while in staging, I provisioned our prod cluster and enabled the first DAG on it. The "backfill" for this DAG performed just fine, so I assumed everything was working and left it over the weekend. However, when the last "backfill" period completed and the scheduler transitioned to the most recent execution date, it began failing in the `manage_slas` method. Due to a configuration difference, SMTP was timing out in production, preventing the SLA check from ever completing; this both blocked SLA notifications and prevented further tasks in this DAG from ever getting scheduled. As an operator, I would expect Airflow to treat scheduling tasks as a higher-priority concern, and to do so even if the SLA feature fails to work. I would also expect Airflow to notify me in the web UI that email sending is not currently working. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
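The behaviour the reporter expects can be sketched as isolating the notification step so an SMTP failure is logged rather than fatal. This is illustrative only — the function shape is not Airflow's actual `manage_slas` signature:

```python
import logging

log = logging.getLogger(__name__)

# Sketch of the expectation in AIRFLOW-2178: a failure while sending an
# SLA-miss email (e.g. an SMTP timeout) is logged and swallowed, so the
# scheduler can keep scheduling tasks. send_email is any callable that
# performs the notification; the shape here is illustrative.
def notify_sla_misses(sla_misses, send_email):
    for miss in sla_misses:
        try:
            send_email(miss)
        except Exception:
            log.exception("Could not send SLA notification for %s", miss)
    # scheduling continues regardless of notification outcome
```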
[jira] [Commented] (AIRFLOW-2175) Failed to upgradedb 1.8.2 -> 1.9.0
[ https://issues.apache.org/jira/browse/AIRFLOW-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386502#comment-16386502 ] Joy Gao commented on AIRFLOW-2175: -- Perhaps the fileloc attribute didn't get saved to db successfully. Curious is this a subdag? Maybe add a null check prior to os.path.isfile(filepath) to avoid this TypeError. > Failed to upgradedb 1.8.2 -> 1.9.0 > -- > > Key: AIRFLOW-2175 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2175 > Project: Apache Airflow > Issue Type: Bug > Components: db >Affects Versions: 1.9.0 >Reporter: Damian Momot >Priority: Critical > > We've got airflow installation with hundreds of DAGs and thousands of tasks. > During upgrade (1.8.2 -> 1.9.0) we've got following error. > After analyzing stacktrace i've found that it's most likely caused by None > value in 'fileloc' field of Dag column. I checked database and indeed we've > got one record with such value: > > > {code:java} > SELECT COUNT(*) FROM dag WHERE fileloc IS NULL; > 1 > SELECT COUNT(*) FROM dag; > 343 > {code} > > > {code:java} > Traceback (most recent call last): > File "/usr/local/bin/airflow", line 27, in > args.func(args) > File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 913, > in upgradedb > db_utils.upgradedb() > File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 320, > in upgradedb > command.upgrade(config, 'heads') > File "/usr/local/lib/python2.7/dist-packages/alembic/command.py", line 174, > in upgrade > script.run_env() > File "/usr/local/lib/python2.7/dist-packages/alembic/script/base.py", line > 416, in run_env > util.load_python_file(self.dir, 'env.py') > File "/usr/local/lib/python2.7/dist-packages/alembic/util/pyfiles.py", line > 93, in load_python_file > module = load_module_py(module_id, path) > File "/usr/local/lib/python2.7/dist-packages/alembic/util/compat.py", line > 79, in load_module_py > mod = imp.load_source(module_id, path, fp) > File 
"/usr/local/lib/python2.7/dist-packages/airflow/migrations/env.py", > line 86, in > run_migrations_online() > File "/usr/local/lib/python2.7/dist-packages/airflow/migrations/env.py", > line 81, in run_migrations_online > context.run_migrations() > File "", line 8, in run_migrations > File > "/usr/local/lib/python2.7/dist-packages/alembic/runtime/environment.py", line > 807, in run_migrations > self.get_context().run_migrations(**kw) > File "/usr/local/lib/python2.7/dist-packages/alembic/runtime/migration.py", > line 321, in run_migrations > step.migration_fn(**kw) > File > "/usr/local/lib/python2.7/dist-packages/airflow/migrations/versions/cc1e65623dc7_add_max_tries_column_to_task_instance.py", > line 63, in upgrade > dag = dagbag.get_dag(ti.dag_id) > File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 232, > in get_dag > filepath=orm_dag.fileloc, only_if_updated=False) > File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 249, > in process_file > if not os.path.isfile(filepath): > File "/usr/lib/python2.7/genericpath.py", line 29, in isfile > st = os.stat(path) > TypeError: coercing to Unicode: need string or buffer, NoneType found{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
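The null check suggested in the comment above can be sketched as a guard in front of the `os.path.isfile` call that raises in the stack trace:

```python
import os

# Sketch of the suggested workaround: treat a NULL fileloc as "no file"
# instead of letting os.path.isfile(None) raise
# "TypeError: coercing to Unicode: need string or buffer, NoneType found".
def dag_file_exists(fileloc):
    if fileloc is None:
        return False
    return os.path.isfile(fileloc)
```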
[jira] [Assigned] (AIRFLOW-2118) get_pandas_df does always pass a list of rows to be parsed
[ https://issues.apache.org/jira/browse/AIRFLOW-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anonymous reassigned AIRFLOW-2118: -- Assignee: Diane Ivy > get_pandas_df does always pass a list of rows to be parsed > -- > > Key: AIRFLOW-2118 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2118 > Project: Apache Airflow > Issue Type: Bug > Components: contrib, hooks >Affects Versions: 1.9.0 > Environment: pandas-gbp 0.3.1 >Reporter: Diane Ivy >Assignee: Diane Ivy >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > While trying to parse the pages in get_pandas_df if only one page is returned > it starts popping off each row and then the gbq_parse_data works incorrectly. > {{while len(pages) > 0:}} > {{ page = pages.pop()}} > {{ dataframe_list.append(gbq_parse_data(schema, page))}} > Possible solution: > {{from google.cloud import bigquery}} > {{if isinstance(pages[0], bigquery.table.Row):}} > {{ pages = [pages]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
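The reporter's proposed fix normalizes the pages structure before the pop-and-parse loop. A sketch, with `row_type` standing in for `google.cloud.bigquery.table.Row`:

```python
# Sketch of the proposed fix for get_pandas_df: when the BigQuery client
# hands back a single page (a flat list of rows) instead of a list of
# pages, wrap it so the loop pops whole pages, not individual rows.
def normalize_pages(pages, row_type):
    if pages and isinstance(pages[0], row_type):
        return [pages]
    return pages
```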
[jira] [Created] (AIRFLOW-2177) Add test for GCS download operator
Kaxil Naik created AIRFLOW-2177: --- Summary: Add test for GCS download operator Key: AIRFLOW-2177 URL: https://issues.apache.org/jira/browse/AIRFLOW-2177 Project: Apache Airflow Issue Type: Task Components: contrib, gcp Reporter: Kaxil Naik Assignee: Kaxil Naik Add mock tests for GCS Download operator -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (AIRFLOW-2158) Airflow should not store logs as raw ISO timestamps
[ https://issues.apache.org/jira/browse/AIRFLOW-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ash Berlin-Taylor closed AIRFLOW-2158. -- Resolution: Duplicate > Airflow should not store logs as raw ISO timestamps > --- > > Key: AIRFLOW-2158 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2158 > Project: Apache Airflow > Issue Type: Improvement > Environment: 1.9.0 >Reporter: Christian D >Priority: Minor > Labels: easyfix, windows > Fix For: Airflow 2.0 > > > Problem: > When Airflow writes logs to disk, it uses a ISO-8601 timestamp as the > filename. In a Linux filesystem this works completely fine (because all > characters in a ISO-8601 timestamp is allowed). However, it doesn't work on > Windows based systems (including Azure File Storage) because {{:}} is a > disallowed character. > Solution: > Ideally, Airflow should store logs such that they're somewhat compatible > across file systems. An easy way of fixing this would therefore be to always > replace {{:}} with underscores. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
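The easy fix the issue proposes is a one-line substitution on the log filename. A sketch:

```python
# Sketch of the proposed fix in AIRFLOW-2158: ':' is not allowed in
# filenames on Windows or Azure File Storage, so replace it with '_'
# when an ISO-8601 execution date becomes a log filename.
def portable_log_filename(execution_date_iso):
    return execution_date_iso.replace(":", "_")
```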
[jira] [Created] (AIRFLOW-2176) Change the way logging is carried out in BigQuery Get Data Operator
Kaxil Naik created AIRFLOW-2176: --- Summary: Change the way logging is carried out in BigQuery Get Data Operator Key: AIRFLOW-2176 URL: https://issues.apache.org/jira/browse/AIRFLOW-2176 Project: Apache Airflow Issue Type: Task Components: contrib, gcp, logging Reporter: Kaxil Naik Assignee: Kaxil Naik Currently, the logging is done by importing logging package. This should be changed to `self.log.info`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AIRFLOW-2128) 'Tall' DAGs scale worse than 'wide' DAGs
[ https://issues.apache.org/jira/browse/AIRFLOW-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Máté Szabó updated AIRFLOW-2128: Description: Tall DAG = a DAG with long chains of dependencies, e.g.: 0 -> 1 -> 2 -> ... -> 998 -> 999 Wide DAG = a DAG with many short, parallel dependencies e.g. 0 -> 1; 0 -> 2; ... 0 -> 999 Take a super simple case where both graphs are of 1000 tasks, and all the tasks are just "sleep 0.03" bash commands (see the attached files). With the default SequentialExecutor (without paralellism), I would expect my 2 example DAGs to take (approximately) the same time to run, but apparently this is not the case. For the wide DAG it was about 80 successfully executed tasks in 10 minutes, for the tall one it was 0. This anomaly also seem to affect the web UI. Opening up the graph view or the tree view for the wide DAG takes about 6 seconds on my machine, but for the tall one it takes significantly longer, in fact currently it does not load at all. was: Tall DAG = a DAG with long chains of dependencies, e.g.: 0 -> 1 -> 2 -> ... -> 998 -> 999 Wide DAG = a DAG with many short, parallel dependencies e.g. 0 -> 1; 0 -> 2; ... 0 -> 999 Take a super simple case where both graphs are of 1000 tasks, and all the tasks are just "sleep 0.03" bash commands (see the attached files). With the default SequentialExecutor (without paralellism), I would expect my 2 example DAGs to take (approximately) the same time to run, but apprently this is not the case. For the wide DAG it was about 80 successfully executed tasks in 10 minutes, for the tall one it was 0. This anomaly also seem to affect the web UI. Opening up the graph view or the tree view for the wide DAG takes about 6 seconds on my machine, but for the tall one it takes significantly longer, in fact currently it does not load at all. 
> 'Tall' DAGs scale worse than 'wide' DAGs
>
> Key: AIRFLOW-2128
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2128
> Project: Apache Airflow
> Issue Type: Bug
> Components: DAG, DagRun, scheduler
> Affects Versions: 1.9.0
> Reporter: Máté Szabó
> Priority: Major
> Labels: performance, usability
> Attachments: tall_dag.py, wide_dag.py
>
> Tall DAG = a DAG with long chains of dependencies, e.g.: 0 -> 1 -> 2 -> ... -> 998 -> 999
> Wide DAG = a DAG with many short, parallel dependencies, e.g.: 0 -> 1; 0 -> 2; ... 0 -> 999
> Take a super simple case where both graphs are of 1000 tasks, and all the tasks are just "sleep 0.03" bash commands (see the attached files).
> With the default SequentialExecutor (without parallelism), I would expect my 2 example DAGs to take (approximately) the same time to run, but apparently this is not the case.
> For the wide DAG it was about 80 successfully executed tasks in 10 minutes, for the tall one it was 0.
> This anomaly also seems to affect the web UI. Opening up the graph view or the tree view for the wide DAG takes about 6 seconds on my machine, but for the tall one it takes significantly longer, in fact currently it does not load at all.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
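The structural difference between the two attached DAG shapes can be illustrated without Airflow at all; the helper below is purely hypothetical and just measures the longest dependency chain in each shape:

```python
def chain_depth(edges, n_tasks):
    """Length of the longest dependency chain, for tasks 0..n_tasks-1 and
    (upstream, downstream) edges listed in topological order."""
    depth = [1] * n_tasks
    for up, down in edges:
        depth[down] = max(depth[down], depth[up] + 1)
    return max(depth)

N = 1000
tall = [(i, i + 1) for i in range(N - 1)]   # 0 -> 1 -> ... -> 999
wide = [(0, i) for i in range(1, N)]        # 0 -> 1; 0 -> 2; ... 0 -> 999

print(chain_depth(tall, N), chain_depth(wide, N))  # 1000 2
```

Both graphs hold the same 1000 tasks, but the tall one forces 1000 sequential scheduling rounds while the wide one needs only 2, which is consistent with the reported slowdown being in the scheduler rather than in task failures.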
[jira] [Commented] (AIRFLOW-2128) 'Tall' DAGs scale worse than 'wide' DAGs
[ https://issues.apache.org/jira/browse/AIRFLOW-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386148#comment-16386148 ] Máté Szabó commented on AIRFLOW-2128: - Yes, that's what I meant. But I'd like to emphasize it does not fail, it's just really slow. If I let it run for a sufficiently long time it does execute the tasks, but I haven't measured the exact time that takes.
> 'Tall' DAGs scale worse than 'wide' DAGs
>
> Key: AIRFLOW-2128
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2128
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2175) Failed to upgradedb 1.8.2 -> 1.9.0
Damian Momot created AIRFLOW-2175: - Summary: Failed to upgradedb 1.8.2 -> 1.9.0 Key: AIRFLOW-2175 URL: https://issues.apache.org/jira/browse/AIRFLOW-2175 Project: Apache Airflow Issue Type: Bug Components: db Affects Versions: 1.9.0 Reporter: Damian Momot

We've got an Airflow installation with hundreds of DAGs and thousands of tasks. During the upgrade (1.8.2 -> 1.9.0) we got the following error. After analyzing the stacktrace I found that it is most likely caused by a None value in the 'fileloc' column of the dag table. I checked the database and indeed we've got one record with such a value:

{code:java}
SELECT COUNT(*) FROM dag WHERE fileloc IS NULL;
1
SELECT COUNT(*) FROM dag;
343
{code}

{code:java}
Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 27, in <module>
    args.func(args)
  File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 913, in upgradedb
    db_utils.upgradedb()
  File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 320, in upgradedb
    command.upgrade(config, 'heads')
  File "/usr/local/lib/python2.7/dist-packages/alembic/command.py", line 174, in upgrade
    script.run_env()
  File "/usr/local/lib/python2.7/dist-packages/alembic/script/base.py", line 416, in run_env
    util.load_python_file(self.dir, 'env.py')
  File "/usr/local/lib/python2.7/dist-packages/alembic/util/pyfiles.py", line 93, in load_python_file
    module = load_module_py(module_id, path)
  File "/usr/local/lib/python2.7/dist-packages/alembic/util/compat.py", line 79, in load_module_py
    mod = imp.load_source(module_id, path, fp)
  File "/usr/local/lib/python2.7/dist-packages/airflow/migrations/env.py", line 86, in <module>
    run_migrations_online()
  File "/usr/local/lib/python2.7/dist-packages/airflow/migrations/env.py", line 81, in run_migrations_online
    context.run_migrations()
  File "<string>", line 8, in run_migrations
  File "/usr/local/lib/python2.7/dist-packages/alembic/runtime/environment.py", line 807, in run_migrations
    self.get_context().run_migrations(**kw)
  File "/usr/local/lib/python2.7/dist-packages/alembic/runtime/migration.py", line 321, in run_migrations
    step.migration_fn(**kw)
  File "/usr/local/lib/python2.7/dist-packages/airflow/migrations/versions/cc1e65623dc7_add_max_tries_column_to_task_instance.py", line 63, in upgrade
    dag = dagbag.get_dag(ti.dag_id)
  File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 232, in get_dag
    filepath=orm_dag.fileloc, only_if_updated=False)
  File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 249, in process_file
    if not os.path.isfile(filepath):
  File "/usr/lib/python2.7/genericpath.py", line 29, in isfile
    st = os.stat(path)
TypeError: coercing to Unicode: need string or buffer, NoneType found
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
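The null check proposed as a workaround could be sketched as a guard before the migration touches the file path; this is a hypothetical illustration (`dags_to_refresh` and the sample rows are invented for the sketch), not the actual Airflow migration code:

```python
# Hypothetical sketch: skip dag rows whose fileloc is NULL instead of
# letting os.path.isfile(None) raise the TypeError from the traceback.

def dags_to_refresh(orm_dags):
    """Yield (dag_id, fileloc) pairs, skipping rows with a missing fileloc."""
    for dag_id, fileloc in orm_dags:
        if fileloc is None:
            # fileloc is a nullable column, so a NULL value is possible;
            # passing None to os.stat() is what crashes the upgrade.
            continue
        yield dag_id, fileloc

rows = [("etl_daily", "/dags/etl_daily.py"), ("orphan_dag", None)]
print(list(dags_to_refresh(rows)))  # the NULL-fileloc row is filtered out
```

Skipping (or logging) such rows would let `upgradedb` finish without the manual `ALTER`/`UPDATE` on the affected record.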
[jira] [Commented] (AIRFLOW-2165) XCOM values are being saved as bytestring
[ https://issues.apache.org/jira/browse/AIRFLOW-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386030#comment-16386030 ] Kaxil Naik commented on AIRFLOW-2165: - It has been mentioned here: https://github.com/apache/incubator-airflow/blob/master/UPDATING.md#deprecated-features > XCOM values are being saved as bytestring > - > > Key: AIRFLOW-2165 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2165 > Project: Apache Airflow > Issue Type: Bug > Components: xcom >Affects Versions: 1.9.0 > Environment: Ubuntu > Airflow 1.9.0 from PIP >Reporter: Cong Qin >Priority: Major > Attachments: Screen Shot 2018-03-02 at 11.09.15 AM.png > > > I noticed after upgrading to 1.9.0 that XCOM values are now being saved as > byte strings that cannot be decoded. Once I downgraded back to 1.8.2 the > "old" behavior is back. > It means that when I'm storing certain values inside I cannot pull those > values back out sometimes. I'm not sure if this was a documented change > anywhere (I looked at the changelog between 1.8.2 and 1.9.0) and I couldn't > find out if this was a config level change or something. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
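The UPDATING.md entry linked above concerns XCom pickling being deprecated in favor of JSON serialization. Assuming that is the change at play here, the round trip below only illustrates why a value stored as UTF-8-encoded JSON shows up as a byte string until it is decoded; it is a sketch with invented helper names, not Airflow's actual XCom code:

```python
import json

# Hypothetical sketch of a JSON-based XCom round trip: the stored form is
# bytes (what the reporter sees in the DB/UI), and reading it back requires
# decoding before parsing.
def xcom_serialize(value):
    return json.dumps(value).encode("utf-8")   # -> bytes

def xcom_deserialize(blob):
    return json.loads(blob.decode("utf-8"))    # decode, then parse

stored = xcom_serialize({"rows": 42})
assert isinstance(stored, bytes)
print(xcom_deserialize(stored))  # {'rows': 42}
```

If a consumer pulls the raw stored value without the decode step, it sees a byte string such as `b'{"rows": 42}'`, which matches the reported symptom.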