[jira] [Commented] (AIRFLOW-2557) Reduce time spent in S3 tests

2018-06-03 Thread Ash Berlin-Taylor (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499406#comment-16499406
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2557:


The moto library (which is already in partial use) should probably be used 
everywhere in the S3 tests. This might speed things up.
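
For illustration, a minimal moto-backed S3 test could look like the sketch below (the 
bucket and key names are made up, not taken from the Airflow test suite); every S3 call 
is served from moto's in-memory mock rather than the network, which is where the time 
saving would come from.

{code}
import boto3
from moto import mock_s3

@mock_s3
def test_read_key():
    # Everything below talks to moto's in-memory S3, not to AWS
    conn = boto3.resource('s3', region_name='us-east-1')
    conn.create_bucket(Bucket='test-bucket')
    conn.Object('test-bucket', 'my_key').put(Body=b'hello')

    assert conn.Object('test-bucket', 'my_key').get()['Body'].read() == b'hello'
{code}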

> Reduce time spent in S3 tests
> -
>
> Key: AIRFLOW-2557
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2557
> Project: Apache Airflow
>  Issue Type: Sub-task
>Reporter: Bolke de Bruin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2556) Reduce time spent on unit tests

2018-06-03 Thread Ash Berlin-Taylor (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499403#comment-16499403
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2556:


Does this actually cost Apache money, or is it free because it's an open-source 
project? Reducing test time is definitely still worthwhile though, just curious.


> Reduce time spent on unit tests
> ---
>
> Key: AIRFLOW-2556
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2556
> Project: Apache Airflow
>  Issue Type: Improvement
>Reporter: Bolke de Bruin
>Priority: Major
>
> Unit tests are taking up way too much time. This costs time and also money 
> from the Apache Foundation. We need to reduce this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AIRFLOW-2238) Update dev/airflow-pr to work with github for merge targets

2018-05-31 Thread Ash Berlin-Taylor (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-2238:
---
External issue URL: https://github.com/apache/incubator-airflow/pull/3413

> Update dev/airflow-pr to work with github for merge targets
> --
>
> Key: AIRFLOW-2238
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2238
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: PR tool
>Reporter: Ash Berlin-Taylor
>Priority: Major
>
> We are planning on migrating to the Apache "GitBox" project, which lets 
> committers work directly on github. This will mean we might not _need_ to use 
> the pr tool, but we should update it so that it merges and pushes back to 
> github, not the ASF repo.
> I think we need to do this before we ask the ASF infra team to migrate our 
> repo over.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2238) Update dev/airflow-pr to work with github for merge targets

2018-05-31 Thread Ash Berlin-Taylor (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496391#comment-16496391
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2238:


Requested migration https://issues.apache.org/jira/browse/INFRA-16602

> Update dev/airflow-pr to work with github for merge targets
> --
>
> Key: AIRFLOW-2238
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2238
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: PR tool
>Reporter: Ash Berlin-Taylor
>Priority: Major
>
> We are planning on migrating to the Apache "GitBox" project, which lets 
> committers work directly on github. This will mean we might not _need_ to use 
> the pr tool, but we should update it so that it merges and pushes back to 
> github, not the ASF repo.
> I think we need to do this before we ask the ASF infra team to migrate our 
> repo over.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-1730) The value of XCom queried from the DB is not unpickled.

2018-05-25 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-1730.

   Resolution: Fixed
Fix Version/s: (was: 1.10)
   1.10.0

Issue resolved by pull request #2701
[https://github.com/apache/incubator-airflow/pull/2701]

> The value of XCom queried from the DB is not unpickled.
> 
>
> Key: AIRFLOW-1730
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1730
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: models, xcom
>Affects Versions: 1.9.0
>Reporter: Shintaro Murakami
>Priority: Major
> Fix For: 1.10.0
>
> Attachments: xcoms_by_example_xcom.png
>
>
> If enable_xcom_pickling is True, the value of XCom queried from the DB is 
> not unpickled.
> The list of XComs is not rendered correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2238) Update dev/airflow-pr to work with github for merge targets

2018-05-24 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489050#comment-16489050
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2238:


Oh wait - I just noticed the code in question is guarded by {{if merge_commits 
and False}} so it would never run :)

> Update dev/airflow-pr to work with github for merge targets
> --
>
> Key: AIRFLOW-2238
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2238
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: PR tool
>Reporter: Ash Berlin-Taylor
>Priority: Major
>
> We are planning on migrating to the Apache "GitBox" project, which lets 
> committers work directly on github. This will mean we might not _need_ to use 
> the pr tool, but we should update it so that it merges and pushes back to 
> github, not the ASF repo.
> I think we need to do this before we ask the ASF infra team to migrate our 
> repo over.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2238) Update dev/airflow-pr to work with github for merge targets

2018-05-24 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489030#comment-16489030
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2238:


Done a little bit of digging into this: the previous script has "duplicate 
merge prevention" that relies on actions happening as {{asfgit}}, which we won't 
be able to replicate. I don't think that is a huge problem though.

> Update dev/airflow-pr to work with github for merge targets
> --
>
> Key: AIRFLOW-2238
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2238
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: PR tool
>Reporter: Ash Berlin-Taylor
>Priority: Major
>
> We are planning on migrating to the Apache "GitBox" project, which lets 
> committers work directly on github. This will mean we might not _need_ to use 
> the pr tool, but we should update it so that it merges and pushes back to 
> github, not the ASF repo.
> I think we need to do this before we ask the ASF infra team to migrate our 
> repo over.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AIRFLOW-1730) The value of XCom queried from the DB is not unpickled.

2018-05-24 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-1730:
---
Fix Version/s: 1.10

> The value of XCom queried from the DB is not unpickled.
> 
>
> Key: AIRFLOW-1730
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1730
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: models, xcom
>Affects Versions: 1.9.0
>Reporter: Shintaro Murakami
>Priority: Major
> Fix For: 1.10
>
> Attachments: xcoms_by_example_xcom.png
>
>
> If enable_xcom_pickling is True, the value of XCom queried from the DB is 
> not unpickled.
> The list of XComs is not rendered correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2506) Dag Status evaluated incorrectly when last task is dummy operator and downstream of multiple tasks

2018-05-22 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483917#comment-16483917
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2506:


Can you provide a sample dag that illustrates this?

> Dag Status evaluated incorrectly when last task is dummy operator and 
> downstream of multiple tasks
> --
>
> Key: AIRFLOW-2506
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2506
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: models
>Affects Versions: Airflow 1.9.0
>Reporter: Debika Mukherjee
>Assignee: Debika Mukherjee
>Priority: Minor
> Fix For: Airflow 2.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2238) Update dev/airflow-pr to work with github for merge targets

2018-03-21 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408066#comment-16408066
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2238:


Result of vote calling for this 
https://lists.apache.org/thread.html/f55014f097849be4e5d86269e4b7b2dfd569348a5c71d1589d6fe706@%3Cdev.airflow.apache.org%3E

> Update dev/airflow-pr to work with github for merge targets
> --
>
> Key: AIRFLOW-2238
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2238
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: PR tool
>Reporter: Ash Berlin-Taylor
>Priority: Major
>
> We are planning on migrating to the Apache "GitBox" project, which lets 
> committers work directly on github. This will mean we might not _need_ to use 
> the pr tool, but we should update it so that it merges and pushes back to 
> github, not the ASF repo.
> I think we need to do this before we ask the ASF infra team to migrate our 
> repo over.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRFLOW-2238) Update dev/airflow-pr to work with github for merge targets

2018-03-21 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-2238:
--

 Summary: Update dev/airflow-pr to work with github for merge targets
 Key: AIRFLOW-2238
 URL: https://issues.apache.org/jira/browse/AIRFLOW-2238
 Project: Apache Airflow
  Issue Type: Improvement
  Components: PR tool
Reporter: Ash Berlin-Taylor


We are planning on migrating to the Apache "GitBox" project, which lets 
committers work directly on github. This will mean we might not _need_ to use 
the pr tool, but we should update it so that it merges and pushes back to 
github, not the ASF repo.

I think we need to do this before we ask the ASF infra team to migrate our repo 
over.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-1966) Password authentication setup broken by SqlAlchemy version 1.2

2018-03-10 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-1966.

   Resolution: Fixed
Fix Version/s: 1.9.1
   1.10.0

> Password authentication setup broken by SqlAlchemy version 1.2 
> ---
>
> Key: AIRFLOW-1966
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1966
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: authentication
>Affects Versions: 1.8.1, 1.8.0, 1.9.0, 1.9.1
>Reporter: Daniel Thomas
>Priority: Major
> Fix For: 1.10.0, 1.9.1
>
>
> An update of SqlAlchemy to version 1.2 broke the authentication setup as 
> described in the [docs | https://airflow.apache.org/security.html].
> Setting the password failed with an exception "AttributeError: can't set 
> attribute".
> This was due to Airflow not setting a specific version requirement on 
> SqlAlchemy. This is fixed in master, but that is not sufficient, as it still 
> breaks every new deployment of older versions, for example for users running 
> Airflow in Docker. Took me hours to debug this. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRFLOW-1466) Hostname does not match domain name on ec2

2018-03-09 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor closed AIRFLOW-1466.
--
Resolution: Fixed

> Hostname does not match domain name on ec2
> --
>
> Key: AIRFLOW-1466
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1466
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: core
>Affects Versions: Airflow 1.8
> Environment: AWS EMR 5.6, Amazon Linux AMI release 2017.03
>Reporter: Lulu Cheng
>Assignee: Lulu Cheng
>Priority: Major
>
> When running jobs on EMR master, it'll throw the following error
> {code}
> AirflowException("Hostname of job runner does not match")
> {code}
> Upon investigation, taskinstance.hostname is set to the EC2 host name whereas 
> socket.getfqdn returns the full domain name. They are essentially the same, for 
> example ip-1-2-3-4 versus ip-1-2-3-4.ec2.internal.
> The fix is to call gethostname instead of fqdn.
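
A quick illustration of the mismatch described above (the host names are examples only):

{code}
import socket

print(socket.getfqdn())      # e.g. ip-1-2-3-4.ec2.internal
print(socket.gethostname())  # e.g. ip-1-2-3-4
{code}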



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-1588) Incorrect json parsing when importing integer variable values

2018-03-09 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-1588.

   Resolution: Fixed
Fix Version/s: 1.10.0

Issue resolved by pull request #3037
[https://github.com/apache/incubator-airflow/pull/3037]

> Incorrect json parsing when importing integer variable values
> -
>
> Key: AIRFLOW-1588
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1588
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: configuration
>Affects Versions: 1.8.0
>Reporter: Georges Kohnen
>Priority: Major
> Fix For: 1.10.0
>
>
> When exporting the airflow variables to a json file, integer variables get 
> exported without a quote, e.g.:
> "dataproc_default_timeout": 60,
> However, when importing that same json file again, these values without 
> quotes get ignored, and the variables are not set. When adding quotes around 
> the integer values in the json file, parsing happens correctly, e.g.:
> "dataproc_default_timeout": "60",



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-1084) `/health` endpoint on each component

2018-03-09 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393079#comment-16393079
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1084:


> How else will the operations team know whether a worker is running fine

Because it's up and you have a process supervisor and monitoring? And because it 
is simpler and doesn't have an otherwise-unused http server in a component 
adding complexity? ;)

Yes, a healthcheck endpoint would be a useful addition. The only issue is how to 
integrate a webserver into the scheduler and worker processes such that it can 
actually find out about the health of the process.

The scheduler spends most of its time in an inner loop (processing dags or 
sleeping).

The worker just hands off control to celery.

A PR for this would be great - if you can think of 1) a definition of "healthy" 
that is more than the fact that the process is alive (which a well-structured docker 
container gives us for free without needing a health check) and 2) how to 
handle the http component nicely so that it doesn't run the risk of getting 
decoupled from the state of the thing it's trying to monitor.
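
As a rough sketch of point 2 only, one option would be a tiny stdlib HTTP server on a 
background thread (the port and the "healthy" logic below are placeholders; the hard 
part remains wiring the handler to the real state of the scheduler/worker loop):

{code}
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/health':
            # Placeholder: a real check must reflect the scheduler/worker state,
            # not just the fact that this thread is still serving requests.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'OK')
        else:
            self.send_response(404)
            self.end_headers()

def start_health_server(port=8080):
    server = HTTPServer(('0.0.0.0', port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
{code}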

> `/health` endpoint on each component
> 
>
> Key: AIRFLOW-1084
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1084
> Project: Apache Airflow
>  Issue Type: Improvement
>Reporter: Semet
>Priority: Major
>
> Please provide a {{/health}} endpoint of each of the following component: 
> - webservice (to avoid pinging the {{/}} root endpoint)
> - worker
> - scheduler
> This would ease integration in Mesos/Marathon framework. 
> If you agree, I volunteer to add this change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-2199) Invalid logger reference in Jobs

2018-03-08 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-2199.

   Resolution: Fixed
Fix Version/s: Airflow 2.0

PR tool errored after pushing but before closing. 

> Invalid logger reference in Jobs
> 
>
> Key: AIRFLOW-2199
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2199
> Project: Apache Airflow
>  Issue Type: Improvement
>Reporter: Fokko Driesprong
>Priority: Major
> Fix For: Airflow 2.0
>
>
> There is still an invalid reference to self.logger, which should be 
> self.log.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRFLOW-2198) Heuristic in dag_processing list_py_file_paths sometimes ignores files containing DAG definitions

2018-03-08 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor closed AIRFLOW-2198.
--
Resolution: Duplicate

> Heuristic in dag_processing list_py_file_paths sometimes ignores files 
> containing DAG definitions
> -
>
> Key: AIRFLOW-2198
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2198
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.8.2
>Reporter: Jarosław Bojar
>Priority: Minor
>
> In the function list_py_file_paths in the dag_processing module there is a heuristic 
> checking whether a file contains the words 'airflow' and 'DAG'. If the file does not 
> contain both words it is excluded from further processing:
> {code:java}
> # Heuristic that guesses whether a Python file contains an
> # Airflow DAG definition.
> might_contain_dag = True
> if safe_mode and not zipfile.is_zipfile(file_path):
>     with open(file_path, 'rb') as f:
>         content = f.read()
>         might_contain_dag = all(
>             [s in content for s in (b'DAG', b'airflow')])
> if not might_contain_dag:
>     continue
> {code}
> If the DAG instantiation is in a different file than the DAG definition (for example 
> the definition may live in some factory method), the file instantiating the DAG is 
> ignored by this heuristic, and the DAG is not processed.
> For example:
> dag_factory.py:
> {code:java}
> from airflow import DAG
> def create_dag(dag_id, other_params...):
>   ...
>   return DAG(dag_id, ...){code}
> dag_instantiation.py
> {code:java}
> from dag_factory import create_dag
> first_dag = create_dag('first', other_params...)
> second_dag = create_dag('second', other_params...){code}
> In this case the file dag_factory.py is processed but it does not contain a DAG 
> instantiation, and the file dag_instantiation.py is ignored by the heuristic. 
> Consequently no DAGs are created.
>  
> The function list_py_file_paths has a parameter safe_mode which may be used to 
> turn off this heuristic, but it is never set when the function is called.
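
A possible stop-gap, assuming no code change, is to make the instantiating file pass 
the heuristic, since the check is a plain substring match on the file contents:

{code}
# dag_instantiation.py
# The next comment mentions both trigger words ("airflow" and "DAG"),
# which is enough for the safe-mode heuristic to keep this file:
# airflow DAG
from dag_factory import create_dag

first_dag = create_dag('first')
second_dag = create_dag('second')
{code}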



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AIRFLOW-2110) Enhance Http Hook to use a header passed in the "extra" argument and add tenacity retry

2018-03-06 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-2110:
---
External issue URL: https://github.com/apache/incubator-airflow/pull/3071
 Fix Version/s: (was: Airflow 1.9.0)
Airflow 2.0

> Enhance Http Hook to use a header passed in the "extra" argument and add 
> tenacity retry
> --
>
> Key: AIRFLOW-2110
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2110
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: hooks
>Affects Versions: Airflow 1.8
>Reporter: Alberto Calderari
>Assignee: Alberto Calderari
>Priority: Minor
> Fix For: Airflow 2.0
>
>
> Add the possibility to supply a JSON header via the "extra" field of the connection:
> {"Authorization": "Bearer Here1sMyT0k3N"}
> Also add a tenacity retry so the operator won't fall over in case of a bad 
> handshake.
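
Assuming the change lands roughly as described, usage might look like the sketch below 
(the connection id and endpoint are made up, and the exact behaviour depends on the PR):

{code}
from airflow.hooks.http_hook import HttpHook

# Connection "my_api" would carry Extra = {"Authorization": "Bearer Here1sMyT0k3N"}
hook = HttpHook(method='GET', http_conn_id='my_api')
response = hook.run('v1/status')  # the header from Extra would be sent with the request
print(response.status_code)
{code}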



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-226) Create separate pip packages for webserver and hooks

2018-03-06 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387534#comment-16387534
 ] 

Ash Berlin-Taylor commented on AIRFLOW-226:
---

Ah yeah, for hooks and (possibly) operators I can see an argument for some of 
the contrib ones being separated out.

Not _quite_ sure how we might go about doing that, mind :)

> Create separate pip packages for webserver and hooks
> 
>
> Key: AIRFLOW-226
> URL: https://issues.apache.org/jira/browse/AIRFLOW-226
> Project: Apache Airflow
>  Issue Type: Improvement
>Reporter: Dan Davydov
>Priority: Minor
>
> There are users who want only the airflow hooks, and others who may not need 
> the front-end. The hooks and webserver should be moved into their own 
> packages, with the current airflow package depending on these packages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRFLOW-232) Web UI shows inaccurate task counts on main dashboard

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor closed AIRFLOW-232.
-
Resolution: Not A Problem

> Web UI shows inaccurate task counts on main dashboard
> -
>
> Key: AIRFLOW-232
> URL: https://issues.apache.org/jira/browse/AIRFLOW-232
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: Airflow 1.7.1.2
>Reporter: Sergei Iakhnin
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Postgres, celery, rabbitmq, 170 worker nodes, 1 master.
> select count(*), state from task_instance where dag_id = 'freebayes' group by 
> state;
> upstream_failed   2134
> up_for_retry  520
> success   141421
> running   542
> failed    1165



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-247) EMR Hook, Operators, Sensor

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-247.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

Fixed in 
https://github.com/apache/incubator-airflow/commit/9f49f12853d83dd051f0f1ed58b5df20bfcfe087

> EMR Hook, Operators, Sensor
> ---
>
> Key: AIRFLOW-247
> URL: https://issues.apache.org/jira/browse/AIRFLOW-247
> Project: Apache Airflow
>  Issue Type: New Feature
>Reporter: Rob Froetscher
>Assignee: Rob Froetscher
>Priority: Minor
> Fix For: 1.8.0
>
>
> Substory of https://issues.apache.org/jira/browse/AIRFLOW-115. It would be 
> nice to have an EMR hook and operators.
> Hook to generally interact with EMR.
> Operators to:
> * setup and start a job flow
> * add steps to an existing jobflow 
> A sensor to:
> * monitor completion and status of EMR jobs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-236) Support passing S3 credentials through environmental variables

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-236.
---
Resolution: Fixed

This is possible, both by using the standard AWS {{AWS_ACCESS_KEY_ID}} environment 
variable and by specifying connections via env vars with {{AIRFLOW_CONN_S3=s3://}}.
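
For example (the values below are placeholders; normally these would simply be exported 
in the environment before starting Airflow):

{code}
import os

# Standard AWS credential variables, picked up by boto
os.environ['AWS_ACCESS_KEY_ID'] = 'AKIA...'
os.environ['AWS_SECRET_ACCESS_KEY'] = '...'

# Or a whole Airflow connection defined as a URI in an env var
os.environ['AIRFLOW_CONN_S3'] = 's3://'
{code}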

> Support passing S3 credentials through environmental variables
> --
>
> Key: AIRFLOW-236
> URL: https://issues.apache.org/jira/browse/AIRFLOW-236
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: core
>Reporter: Jakob Homan
>Priority: Major
>
> Right now we expect S3 configs to be passed through one of a variety of 
> config files, or through extra parameters in the connection screen.  It'd be 
> nice to be able to pass these through env variables and note as such through 
> the extra parameters.  This would lessen the need to include credentials in 
> the webapp itself.
> Alternatively, for logging (rather than as a connector), it might just be 
> better for Airflow to use the profile defined as AWS_DEFAULT and avoid needing 
> an explicit configuration at all.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-230) [HiveServer2Hook] adding multi statements support

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-230.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

Fixed as 
https://github.com/apache/incubator-airflow/commit/a599167c433246d96bea711d8bfd5710b2c9d3ff

> [HiveServer2Hook] adding multi statements support
> -
>
> Key: AIRFLOW-230
> URL: https://issues.apache.org/jira/browse/AIRFLOW-230
> Project: Apache Airflow
>  Issue Type: Improvement
>Reporter: Maxime Beauchemin
>Priority: Major
> Fix For: 1.8.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRFLOW-229) new DAG runs 5 times when manually started from website

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor closed AIRFLOW-229.
-
Resolution: Invalid

Not an issue anymore. Feel free to re-open if anyone is still seeing this 
behaviour!

> new DAG runs 5 times when manually started from website
> ---
>
> Key: AIRFLOW-229
> URL: https://issues.apache.org/jira/browse/AIRFLOW-229
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: Airflow 1.6.2
> Environment: celery, rabbitmq, mysql
>Reporter: audubon
>Priority: Minor
>
> version 1.6.2
> using celery, rabbitmq, mysql
> example:
> from airflow import DAG
> from airflow.operators import BashOperator
> from datetime import datetime, timedelta
> import json
> import sys
> one_day_ahead = datetime.combine(datetime.today() + timedelta(1), 
> datetime.min.time())
> one_day_ahead = one_day_ahead.replace(hour=3, minute=31)
> default_args = {
> 'owner': 'airflow',
> 'depends_on_past': False,
> 'start_date': one_day_ahead,
> 'email': ['m...@email.com'],
> 'email_on_failure': True,
> 'email_on_retry': False,
> 'retries': 1,
> 'retry_delay': timedelta(minutes=5),
> }
> dag = DAG('alpha', default_args=default_args , schedule_interval='15 6 * * *' 
> )
> task = BashOperator(
> task_id='alphaV2',
> bash_command='sleep 10',
> dag=dag)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRFLOW-226) Create separate pip packages for webserver and hooks

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor closed AIRFLOW-226.
-
Resolution: Won't Fix

Given that most of Airflow's dependencies are optional, installing Airflow itself is 
not that heavy, and the extra development overhead on an open-source project means 
this is not likely to happen, especially as the cost to the end user is only a 
few extra packages installed.

(Sorry to resurrect a really old ticket only to close it as Won't Fix. If you feel 
strongly about this we can reopen and discuss it.)

> Create separate pip packages for webserver and hooks
> 
>
> Key: AIRFLOW-226
> URL: https://issues.apache.org/jira/browse/AIRFLOW-226
> Project: Apache Airflow
>  Issue Type: Improvement
>Reporter: Dan Davydov
>Priority: Minor
>
> There are users who want only the airflow hooks, and others who may not need 
> the front-end. The hooks and webserver should be moved into their own 
> packages, with the current airflow package depending on these packages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-215) Airflow worker (CeleryExecutor) needs to be restarted to pick up tasks

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-215.
---
Resolution: Fixed

Doesn't apply on 1.9.0 or 1.8.2; it was fixed at some point.

> Airflow worker (CeleryExecutor) needs to be restarted to pick up tasks
> --
>
> Key: AIRFLOW-215
> URL: https://issues.apache.org/jira/browse/AIRFLOW-215
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: celery, subdag
>Affects Versions: Airflow 1.7.1.2
>Reporter: Cyril Scetbon
>Priority: Major
>
> We have a main dag that dynamically creates subdags containing tasks using 
> BashOperator. Using CeleryExecutor we see Celery tasks being created with 
> *STARTED* status but they are not picked up by our worker. However, if we 
> restart our worker, then tasks are picked up. 
> Here you can find code if you want to try to reproduce it 
> https://www.dropbox.com/s/8u7xf8jt55v8zio/dags.zip.
> We also tested using LocalExecutor and everything worked fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-191) Database connection leak on Postgresql backend

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-191.
---
   Resolution: Fixed
Fix Version/s: Airflow 1.8

Merged in as 
https://github.com/apache/incubator-airflow/commit/4905a5563d47b45e38b91661ee5aa7f3765a129b

> Database connection leak on Postgresql backend
> --
>
> Key: AIRFLOW-191
> URL: https://issues.apache.org/jira/browse/AIRFLOW-191
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor
>Affects Versions: Airflow 1.7.1.2
>Reporter: Sergei Iakhnin
>Priority: Major
> Fix For: Airflow 1.8
>
> Attachments: Sid_anands_airflow_idle_in_transaction.png
>
>
> I raised this issue on github several months ago and there was even a PR, but 
> it never made it into mainline. Basically, workers tend to hang onto DB 
> connections in Postgres for recording heartbeats.
> I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is 
> configured to allow 1000 simultaneous connections. I should effectively be 
> able to run 920 tasks at the same time, but am actually limited to only about 
> 450-480 because of idle transactions from workers hanging on to DB 
> connections.
> If I run the following query
> select count(*),state, client_hostname from pg_stat_activity group by state, 
> client_hostname
> These are the results:
> count   state                 client_hostname
> 1       active                (null)
> 1       idle                  localhost
> 451     idle in transaction   (null)
> 446     idle                  (null)
> 1       active                localhost
> The idle connections are all trying to run COMMIT
> The "idle in transaction" connections are all trying to run 
> SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, 
> job.job_type AS job_job_type, job.start_date AS job_start_date, job.end_date 
> AS job_end_date, job.latest_heartbeat AS job_latest_heartbeat, 
> job.executor_class AS job_executor_class, job.hostname AS job_hostname, 
> job.unixname AS job_unixname 
> FROM job 
> WHERE job.id = 213823 
>  LIMIT 1
> with differing job.ids of course.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-187) Make PR tool more user-friendly

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-187.
---
Resolution: Fixed

Fixed by the merged https://github.com/apache/incubator-airflow/pull/1565 

> Make PR tool more user-friendly
> ---
>
> Key: AIRFLOW-187
> URL: https://issues.apache.org/jira/browse/AIRFLOW-187
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: PR tool
>Reporter: Jeremiah Lowin
>Priority: Minor
>
> General JIRA improvement that can be referenced for any UX improvements to 
> the PR tool, including better or more prompts, documentation, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-184) Add clear/mark success to CLI

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386953#comment-16386953
 ] 

Ash Berlin-Taylor commented on AIRFLOW-184:
---

Is this issue still relevant?

> Add clear/mark success to CLI
> -
>
> Key: AIRFLOW-184
> URL: https://issues.apache.org/jira/browse/AIRFLOW-184
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: cli
>Reporter: Chris Riccomini
>Assignee: Joy Gao
>Priority: Major
>
> AIRFLOW-177 pointed out that the current CLI does not allow us to clear or 
> mark success on a task (including upstream, downstream, past, future, and 
> recursive) the way that the UI widget does. Given a goal of keeping parity 
> between the UI and CLI, it seems like we should support this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRFLOW-182) CLI command `airflow backfill` fails while CLI `airflow run` succeeds

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor closed AIRFLOW-182.
-
Resolution: Cannot Reproduce

Airflow 1.7 is now quite old. If this is still happening on the latest 
version, please open another issue and we'd be happy to help solve it.

> CLI command `airflow backfill` fails while CLI `airflow run` succeeds
> -
>
> Key: AIRFLOW-182
> URL: https://issues.apache.org/jira/browse/AIRFLOW-182
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: celery
>Affects Versions: Airflow 1.7.0
> Environment: Heroku Cedar 14, Heroku Redis as Celery Broker
>Reporter: Hariharan Mohanraj
>Priority: Minor
>
> When I run the backfill command, I get an error that claims there is no dag 
> in my dag folder with the name "unusual_prefix_dag1", although my dag is 
> actually named dag1. However when I run the run command, the task is 
> scheduled and it works flawlessly.
> {code}
> $ airflow backfill -t task1 -s 2016-05-01 -e 2016-05-07 dag1
> 2016-05-26T23:22:28.816908+00:00 app[worker.1]: [2016-05-26 23:22:28,816] 
> {__init__.py:36} INFO - Using executor CeleryExecutor
> 2016-05-26T23:22:29.214006+00:00 app[worker.1]: Traceback (most recent call 
> last):
> 2016-05-26T23:22:29.214083+00:00 app[worker.1]:   File 
> "/app/.heroku/python/bin/airflow", line 15, in 
> 2016-05-26T23:22:29.214121+00:00 app[worker.1]: args.func(args)
> 2016-05-26T23:22:29.214151+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/site-packages/airflow/bin/cli.py", line 
> 174, in run
> 2016-05-26T23:22:29.214207+00:00 app[worker.1]: 
> DagPickle).filter(DagPickle.id == args.pickle).first()
> 2016-05-26T23:22:29.214230+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/orm/query.py", 
> line 2634, in first
> 2016-05-26T23:22:29.214616+00:00 app[worker.1]: ret = list(self[0:1])
> 2016-05-26T23:22:29.214626+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/orm/query.py", 
> line 2457, in __getitem__
> 2016-05-26T23:22:29.214984+00:00 app[worker.1]: return list(res)
> 2016-05-26T23:22:29.214992+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/orm/loading.py", 
> line 86, in instances
> 2016-05-26T23:22:29.215053+00:00 app[worker.1]: util.raise_from_cause(err)
> 2016-05-26T23:22:29.215074+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/util/compat.py", 
> line 200, in raise_from_cause
> 2016-05-26T23:22:29.215121+00:00 app[worker.1]: reraise(type(exception), 
> exception, tb=exc_tb, cause=cause)
> 2016-05-26T23:22:29.215142+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/orm/loading.py", 
> line 71, in instances
> 2016-05-26T23:22:29.215175+00:00 app[worker.1]: rows = [proc(row) for row 
> in fetch]
> 2016-05-26T23:22:29.215200+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/orm/loading.py", 
> line 428, in _instance
> 2016-05-26T23:22:29.215274+00:00 app[worker.1]: loaded_instance, 
> populate_existing, populators)
> 2016-05-26T23:22:29.215282+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/orm/loading.py", 
> line 486, in _populate_full
> 2016-05-26T23:22:29.215369+00:00 app[worker.1]: dict_[key] = getter(row)
> 2016-05-26T23:22:29.215406+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py", 
> line 1253, in process
> 2016-05-26T23:22:29.215574+00:00 app[worker.1]: return loads(value)
> 2016-05-26T23:22:29.215595+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/site-packages/dill/dill.py", line 260, in 
> loads
> 2016-05-26T23:22:29.215657+00:00 app[worker.1]: return load(file)
> 2016-05-26T23:22:29.215678+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/site-packages/dill/dill.py", line 250, in 
> load
> 2016-05-26T23:22:29.215738+00:00 app[worker.1]: obj = pik.load()
> 2016-05-26T23:22:29.215758+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/pickle.py", line 858, in load
> 2016-05-26T23:22:29.215895+00:00 app[worker.1]: dispatch[key](self)
> 2016-05-26T23:22:29.215902+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/pickle.py", line 1090, in load_global
> 2016-05-26T23:22:29.216069+00:00 app[worker.1]: klass = 
> self.find_class(module, name)
> 2016-05-26T23:22:29.216077+00:00 app[worker.1]:   File 
> "/app/.heroku/python/lib/python2.7/site-packages/dill/dill.py", line 406, in 
> find_class
> 2016-05-26T23:22:29.216181+00:00 app[worker.1]: return 
> StockU

[jira] [Closed] (AIRFLOW-181) Travis builds fail due to corrupt cache

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor closed AIRFLOW-181.
-
Resolution: Fixed

Closed by 
https://github.com/apache/incubator-airflow/commit/afcd4fcf01696ee26911640cdeb481defd93c3aa

> Travis builds fail due to corrupt cache
> ---
>
> Key: AIRFLOW-181
> URL: https://issues.apache.org/jira/browse/AIRFLOW-181
> Project: Apache Airflow
>  Issue Type: Bug
>Reporter: Bolke de Bruin
>Assignee: Bolke de Bruin
>Priority: Major
>
> A corrupt cache is preventing hadoop from being unpacked. It needs to redownload the 
> distribution without checking the cache.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRFLOW-160) Parse DAG files through child processes

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor closed AIRFLOW-160.
-
   Resolution: Fixed
Fix Version/s: Airflow 1.8

Fixed by 
https://github.com/apache/incubator-airflow/commit/fdb7e949140b735b8554ae5b22ad752e86f6ebaf

> Parse DAG files through child processes
> ---
>
> Key: AIRFLOW-160
> URL: https://issues.apache.org/jira/browse/AIRFLOW-160
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: scheduler
>Reporter: Paul Yang
>Assignee: Paul Yang
>Priority: Major
> Fix For: Airflow 1.8
>
>
> Currently, the Airflow scheduler parses all user DAG files in the same 
> process as the scheduler itself. We've seen issues in production where bad 
> DAG files cause scheduler to fail. A simple example is if the user script 
> calls `sys.exit(1)`, the scheduler will exit as well. We've also seen an 
> unusual case where modules loaded by the user DAG affect operation of the 
> scheduler. For better uptime, the scheduler should be resistant to these 
> problematic user DAGs.
> The proposed solution is to parse and schedule user DAGs through child 
> processes. This way, the main scheduler process is more isolated from bad 
> DAGs. There's a side benefit as well - since parsing is distributed among 
> multiple processes, it's possible to parse the DAG files more frequently, 
> reducing the latency between when a DAG is modified and when the changes are 
> picked up.
> Another issue right now is that all DAGs must be scheduled before any tasks 
> are sent to the executor. This means that the frequency of task scheduling is 
> limited by the slowest DAG to schedule. The changes needed for scheduling 
> DAGs through child processes will also make it easy to decouple this process 
> and allow tasks to be scheduled and sent to the executor in a more 
> independent fashion. This way, overall scheduling won't be held back by a 
> slow DAG.
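
A toy illustration of the isolation argument above (not the scheduler's actual 
implementation; bad_dag.py is a placeholder file): a sys.exit(1) executed in the parsed 
file kills only the child process, and the parent carries on.

{code}
import multiprocessing

def parse_dag_file(path):
    # Stand-in for DagBag.process_file(): execute the file's code in the child.
    with open(path) as f:
        exec(f.read(), {'__name__': '__dag_file__'})

if __name__ == '__main__':
    p = multiprocessing.Process(target=parse_dag_file, args=('bad_dag.py',))
    p.start()
    p.join()
    print('parent still alive, child exit code:', p.exitcode)
{code}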



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRFLOW-147) HiveServer2Hook.to_csv() writing one row at a time and causing excessive logging

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor closed AIRFLOW-147.
-
Resolution: Fixed

Fixed by 
https://github.com/apache/incubator-airflow/commit/a5c00b3f1581580818b585b21abd3df3fa68af64

> HiveServer2Hook.to_csv() writing one row at a time and causing excessive 
> logging
> 
>
> Key: AIRFLOW-147
> URL: https://issues.apache.org/jira/browse/AIRFLOW-147
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: hooks
>Affects Versions: Airflow 1.7.0
>Reporter: Michael Musson
>Priority: Minor
>
> The default behavior of fetchmany() in the impala dbapi (which airflow switched 
> to recently) is to return a single row at a time. This causes HiveServer2Hook's 
> to_csv() method to output one line of logging for each row of data in the 
> results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-129) Allow CELERYD_PREFETCH_MULTIPLIER to be configurable

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-129.
---
   Resolution: Fixed
Fix Version/s: Airflow 1.9.0

Not the nicest interface to configure, but it is now possible to do this without 
patching Airflow.

> Allow CELERYD_PREFETCH_MULTIPLIER to be configurable
> 
>
> Key: AIRFLOW-129
> URL: https://issues.apache.org/jira/browse/AIRFLOW-129
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: celery
>Affects Versions: Airflow 1.7.0
>Reporter: Nam Ngo
>Priority: Major
> Fix For: Airflow 1.9.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Airflow needs to allow everyone to customise their prefetch limit. Some might 
> have short-running tasks and don't want the overhead of celery latency.
> More on that here:
> http://docs.celeryproject.org/en/latest/userguide/optimizing.html#optimizing-prefetch-limit



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRFLOW-135) Clean up git branches (remove old + implement versions)

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor closed AIRFLOW-135.
-
Resolution: Fixed

There are now only 6 branches. Nice and clean :)

> Clean up git branches (remove old + implement versions)
> ---
>
> Key: AIRFLOW-135
> URL: https://issues.apache.org/jira/browse/AIRFLOW-135
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: project-management
>Reporter: Jeremiah Lowin
>Priority: Minor
>  Labels: git
> Fix For: Airflow 1.8
>
>
> We have a large number of branches in the git repo, most of which are old 
> features -- I would bet hardly any of them are active. I think they should be 
> deleted if possible. In addition, we should begin using branches (as opposed 
> to tags) to allow easy switching between Airflow versions. Spark 
> (https://github.com/apache/spark) uses the format {{branch-X.X}}; others like 
> Kafka (https://github.com/apache/kafka) simply use a version number. But this 
> is an important way to browse the history and, most importantly, can't be 
> overwritten like a tag (since tags point at commits and commits can be 
> rebased away). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-110) Point people to the appropriate process to submit PRs in the repository's CONTRIBUTING.md

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-110.
---
Resolution: Fixed

With the addition of the {{.github}} folder this is now quite obvious on GitHub.

> Point people to the appropriate process to submit PRs in the repository's 
> CONTRIBUTING.md
> 
>
> Key: AIRFLOW-110
> URL: https://issues.apache.org/jira/browse/AIRFLOW-110
> Project: Apache Airflow
>  Issue Type: Task
>  Components: docs
>Reporter: Arthur Wiedmer
>Priority: Trivial
>  Labels: documentation, newbie
>
> The current process to contribute code could be made more accessible. I am 
> assuming that the entry point to the project is Github and the repository. We 
> could modify the CONTRIBUTING.md as well as the README to point to the 
> proper way to do this. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-2123) Install CI Dependencies from setup.py

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-2123.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request #3054
[https://github.com/apache/incubator-airflow/pull/3054]

> Install CI Dependencies from setup.py
> -
>
> Key: AIRFLOW-2123
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2123
> Project: Apache Airflow
>  Issue Type: Bug
>Reporter: Fokko Driesprong
>Priority: Major
> Fix For: 2.0.0
>
>
> Right now we have two places where we keep our dependencies: setup.py 
> for installation and requirements.txt for the CI. These files get terribly 
> out of sync, and therefore I think it is a good idea to install the CI's 
> dependencies using setup.py so we have everything in one single place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2181) Convert DOS formatted files to UNIX

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386816#comment-16386816
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2181:


No reason at all - PR welcomed!

> Convert DOS formatted files to UNIX
> ---
>
> Key: AIRFLOW-2181
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2181
> Project: Apache Airflow
>  Issue Type: Task
>Reporter: Dan Fowler
>Assignee: Dan Fowler
>Priority: Trivial
>
> While looking into an issue related to the password_auth backend I noticed 
> the following files are in DOS format:
>  
> tests/www/api/experimental/test_password_endpoints.py
>  airflow/contrib/auth/backends/password_auth.py
>  
> I can't think of a reason why these should be DOS formatted, but if there is 
> let me know and I can close this out. Otherwise, I'll submit a PR for this 
> fix.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AIRFLOW-97) "airflow" "DAG" strings in file necessary to import dag

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-97?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-97:
-
Affects Version/s: Airflow 1.9.0

> "airflow" "DAG" strings in file necessary to import dag
> ---
>
> Key: AIRFLOW-97
> URL: https://issues.apache.org/jira/browse/AIRFLOW-97
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: Airflow 1.7.0, Airflow 1.9.0
>Reporter: Etiene Dalcol
>Priority: Minor
>
> Hello airflow team! Thanks for the awesome tool!
> We made a small module to automate our DAG building process and we are using 
> this module in our DAG definition. Our airflow version is 1.7.0.
> However, airflow will not import this file because it doesn't have the words 
> DAG and airflow in it. (The imports etc. are done inside our little module.) 
> Apparently there's a safe_mode that skips files without these strings.
> (https://github.com/apache/incubator-airflow/blob/1.7.0/airflow/models.py#L197)
> This safe_mode defaults to True but is not passed to the process_file 
> function, so it is always True and there's no apparent way to disable it.
> (https://github.com/apache/incubator-airflow/blob/1.7.0/airflow/models.py#L177)
> (https://github.com/apache/incubator-airflow/blob/1.7.0/airflow/models.py#L313)
> Putting this comment on the top of the file makes it work for the moment and 
> brought me a good laugh today 👯 
> #DAG airflow —> DO NOT REMOVE. the world will explode



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRFLOW-42) Adding logging.debug DagBag loading stats

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor closed AIRFLOW-42.

   Resolution: Fixed
Fix Version/s: 1.8.0

Merged in May 2016 via 
https://github.com/apache/incubator-airflow/commit/3c3f5a67ff80f3e8942aef441f481c62baf97184
 

> Adding logging.debug DagBag loading stats
> -
>
> Key: AIRFLOW-42
> URL: https://issues.apache.org/jira/browse/AIRFLOW-42
> Project: Apache Airflow
>  Issue Type: Bug
>Reporter: Maxime Beauchemin
>Assignee: Maxime Beauchemin
>Priority: Major
> Fix For: 1.8.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRFLOW-19) How can I have an Operator B iterate over a list returned from upstream by Operator A?

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor closed AIRFLOW-19.

Resolution: Not A Bug

As discussed, the mailing list 
(http://mail-archives.apache.org/mod_mbox/incubator-airflow-dev/) is the best 
place for questions like this.

> How can I have an Operator B iterate over a list returned from upstream by 
> Operator A?
> --
>
> Key: AIRFLOW-19
> URL: https://issues.apache.org/jira/browse/AIRFLOW-19
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: operators
>Reporter: Praveenkumar Venkatesan
>Priority: Minor
>  Labels: support
>
> Here is what I am trying to do exactly: 
> https://gist.github.com/praveev/7b93b50746f8e965f7139ecba028490a
> the python operator log just returns the following
> [2016-04-28 11:56:22,296] {models.py:1041} INFO - Executing 
>  on 2016-04-28 11:56:12
> [2016-04-28 11:56:22,350] {python_operator.py:66} INFO - Done. Returned value 
> was: None
> it didn't even print my kwargs and to_process data
> To simplify this: let's say t1 returns 3 elements. I want to iterate over the 
> list and run t2 -> t3 for each element.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2179) Make parametrable the IP on which the worker log server binds to

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386692#comment-16386692
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2179:


Sounds like a sensible change.

> Make parametrable the IP on which the worker log server binds to
> 
>
> Key: AIRFLOW-2179
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2179
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: celery, webserver
>Reporter: Albin Gilles
>Priority: Minor
>
> Hello,
> I'd be glad if the tiny web server subprocess that serves the workers' local log 
> files could be set to bind to localhost only, as can be done for Gunicorn or 
> Flower. See 
> [cli.py#L865|https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L865]
> If you don't see any issue with that possibility, I'll be happy to propose a 
> PR on github.
> Regards,
>  Albin.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-2163) Add HBC Digital to list of companies using Airflow

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-2163.

Resolution: Fixed

> Add HBC Digital to list of companies using Airflow
> --
>
> Key: AIRFLOW-2163
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2163
> Project: Apache Airflow
>  Issue Type: Bug
>Reporter: Terry McCartan
>Assignee: Terry McCartan
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (AIRFLOW-2158) Airflow should not store logs as raw ISO timestamps

2018-03-05 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor closed AIRFLOW-2158.
--
Resolution: Duplicate

> Airflow should not store logs as raw ISO timestamps
> ---
>
> Key: AIRFLOW-2158
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2158
> Project: Apache Airflow
>  Issue Type: Improvement
> Environment: 1.9.0
>Reporter: Christian D
>Priority: Minor
>  Labels: easyfix, windows
> Fix For: Airflow 2.0
>
>
> Problem:
> When Airflow writes logs to disk, it uses an ISO-8601 timestamp as the 
> filename. In a Linux filesystem this works completely fine (because all 
> characters in an ISO-8601 timestamp are allowed). However, it doesn't work on 
> Windows-based systems (including Azure File Storage) because {{:}} is a 
> disallowed character.
> Solution:
> Ideally, Airflow should store logs such that they're somewhat compatible 
> across file systems. An easy way of fixing this would therefore be to always 
> replace {{:}} with underscores.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2158) Airflow should not store logs as raw ISO timestamps

2018-03-01 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381759#comment-16381759
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2158:


Duplicate of https://issues.apache.org/jira/browse/AIRFLOW-1564 which has 
already been fixed on master.

https://github.com/apache/incubator-airflow/commit/4c674ccffda1fbc38b8cc044b0e2c004422a2035
 was the commit that fixed it.

> Airflow should not store logs as raw ISO timestamps
> ---
>
> Key: AIRFLOW-2158
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2158
> Project: Apache Airflow
>  Issue Type: Improvement
> Environment: 1.9.0
>Reporter: Christian D
>Priority: Minor
>  Labels: easyfix, windows
> Fix For: Airflow 2.0
>
>
> Problem:
> When Airflow writes logs to disk, it uses an ISO-8601 timestamp as the 
> filename. In a Linux filesystem this works completely fine (because all 
> characters in an ISO-8601 timestamp are allowed). However, it doesn't work on 
> Windows-based systems (including Azure File Storage) because {{:}} is a 
> disallowed character.
> Solution:
> Ideally, Airflow should store logs such that they're somewhat compatible 
> across file systems. An easy way of fixing this would therefore be to always 
> replace {{:}} with underscores.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-87) Setup a development environment

2018-02-28 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380424#comment-16380424
 ] 

Ash Berlin-Taylor commented on AIRFLOW-87:
--

https://github.com/puckel/docker-airflow/ has some docker-compose files that 
might be a useful base to start from

> Setup a development environment
> ---
>
> Key: AIRFLOW-87
> URL: https://issues.apache.org/jira/browse/AIRFLOW-87
> Project: Apache Airflow
>  Issue Type: Task
>  Components: docs
>Reporter: Amikam Snir
>Priority: Minor
>  Labels: docuentation
>
> 1. Add developer guide, for example: how to build, create & deploy an 
> artifact.
> 2. Add a vagrant file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2151) Allow getting AWS Session from AwsHook

2018-02-27 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378685#comment-16378685
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2151:


Probably leaning towards {{get_session}}, but I haven't looked at the code 
recently, so if it makes more sense to just return credentials that could be 
okay too.
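
To make it concrete, a rough sketch of what a public accessor could look like (names and structure are illustrative only, not a final API):

{code}
import boto3


class MyAwsHook(object):
    """Hypothetical hook that builds a boto3 Session from stored credentials."""

    def __init__(self, aws_access_key_id, aws_secret_access_key, region_name=None):
        self._key = aws_access_key_id
        self._secret = aws_secret_access_key
        self._region = region_name

    def get_session(self):
        # Public accessor instead of the private _get_credentials() helper.
        return boto3.session.Session(
            aws_access_key_id=self._key,
            aws_secret_access_key=self._secret,
            region_name=self._region,
        )

    def get_credentials(self):
        # Frozen credentials are convenient for e.g. building a Redshift COPY
        # statement without going through a boto3 client.
        return self.get_session().get_credentials().get_frozen_credentials()
{code}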

> Allow getting AWS Session from AwsHook
> --
>
> Key: AIRFLOW-2151
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2151
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: aws, contrib
>Reporter: Pieter Mulder
>Assignee: Pieter Mulder
>Priority: Major
>
> I would like to be able to get the `session` object that `AwsHook` creates.
> In my case I want to use its credentials (I currently use `_get_credentials()` in 
> my code, but don't like using the private function) to do a `COPY` with 
> Redshift.
> I think the AWS Session could also be useful for people who want to use a 
> client or resource with other arguments than the default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2151) Allow getting AWS Session from AwsHook

2018-02-27 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378660#comment-16378660
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2151:


Ah right, I see. Yup making that a non-private method sounds valuable.

> Allow getting AWS Session from AwsHook
> --
>
> Key: AIRFLOW-2151
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2151
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: aws, contrib
>Reporter: Pieter Mulder
>Assignee: Pieter Mulder
>Priority: Major
>
> I would like to be able to get the `session` object that `AwsHook` creates.
> In my case I want to use its credentials (I currently use `_get_credentials()` in 
> my code, but don't like using the private function) to do a `COPY` with 
> Redshift.
> I think the AWS Session could also be useful for people who want to use a 
> client or resource with other arguments than the default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2151) Allow getting AWS Session from AwsHook

2018-02-27 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378635#comment-16378635
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2151:


Do you have an example of how you'd like to use this?

We use the AWS hook to call AWS API calls like this:

{code}
hook = AwsHook(aws_conn_id)
emr_client = hook.get_client_type('emr')
{code}

Could you do something similar?

> Allow getting AWS Session from AwsHook
> --
>
> Key: AIRFLOW-2151
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2151
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: aws, contrib
>Reporter: Pieter Mulder
>Assignee: Pieter Mulder
>Priority: Major
>
> I would like to be able to get the `session` object that `AwsHook` creates.
> In my case I want to use its credentials (I currently use `_get_credentials()` in 
> my code, but don't like using the private function) to do a `COPY` with 
> Redshift.
> I think the AWS Session could also be useful for people who want to use a 
> client or resource with other arguments than the default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2122) SSHOperator throws an error

2018-02-20 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369915#comment-16369915
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2122:


I think this is still a bug -- the hook should accept a boolean parameter, not 
just a string.

> SSHOperator throws an error
> ---
>
> Key: AIRFLOW-2122
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2122
> Project: Apache Airflow
>  Issue Type: Bug
>Reporter: sam sen
>Priority: Major
>
> Here's my code: 
> {code:java}
> dag = DAG('transfer_ftp_s3', 
> default_args=default_args, schedule_interval=None)
> task = SSHOperator(ssh_conn_id='ssh_node', 
>                    task_id="check_ftp_for_new_files", 
>                    command="echo 'hello world'", 
>                    do_xcom_push=True, dag=dag,)
> {code}
>  
> Here's the error
> {code:java}
> [2018-02-19 06:48:02,691] {{base_task_runner.py:98}} INFO - Subtask: 
> Traceback (most recent call last):
> [2018-02-19 06:48:02,691] {{base_task_runner.py:98}} INFO - Subtask:   File 
> "/usr/bin/airflow", line 27, in 
> [2018-02-19 06:48:02,692] {{base_task_runner.py:98}} INFO - Subtask: 
> args.func(args)
> [2018-02-19 06:48:02,693] {{base_task_runner.py:98}} INFO - Subtask:   File 
> "/usr/lib/python2.7/site-packages/airflow/bin/cli.py", line 392, in run
> [2018-02-19 06:48:02,695] {{base_task_runner.py:98}} INFO - Subtask: 
> pool=args.pool,
> [2018-02-19 06:48:02,695] {{base_task_runner.py:98}} INFO - Subtask:   File 
> "/usr/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
> [2018-02-19 06:48:02,696] {{base_task_runner.py:98}} INFO - Subtask: 
> result = func(*args, **kwargs)
> [2018-02-19 06:48:02,696] {{base_task_runner.py:98}} INFO - Subtask:   File 
> "/usr/lib/python2.7/site-packages/airflow/models.py", line 1496, in 
> _run_raw_task
> [2018-02-19 06:48:02,696] {{base_task_runner.py:98}} INFO - Subtask: 
> result = task_copy.execute(context=context)
> [2018-02-19 06:48:02,697] {{base_task_runner.py:98}} INFO - Subtask:   File 
> "/usr/lib/python2.7/site-packages/airflow/contrib/operators/ssh_operator.py", 
> line 146, in execute
> [2018-02-19 06:48:02,697] {{base_task_runner.py:98}} INFO - Subtask: 
> raise AirflowException("SSH operator error: {0}".format(str(e)))
> [2018-02-19 06:48:02,698] {{base_task_runner.py:98}} INFO - Subtask: 
> airflow.exceptions.AirflowException: SSH operator error: 'bool' object has no 
> attribute 'lower'
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2122) SSHOperator throws an error

2018-02-19 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369049#comment-16369049
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2122:


How have you defined the {{ssh_node}} connection? I think you (quite rightly) 
specified one of {{compress}} or {{no_host_key_check}} as a boolean, but it 
currently needs to be a string.

The workaround for you is to specify it as a string.

The fix is to make the SSHHook cope with either a string or a boolean for these 
parameters.
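
A minimal sketch of the sort of coercion the hook could do (illustrative only, not the actual SSHHook code):

{code}
def _to_bool(value):
    # Accept a real boolean as well as a string such as "true"/"false"
    # coming from the connection's extra JSON.
    if isinstance(value, bool):
        return value
    return str(value).lower() in ('true', '1', 'yes')

# e.g. inside the hook, instead of calling value.lower() directly:
# self.compress = _to_bool(extra_options.get('compress', True))
{code}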

> SSHOperator throws an error
> ---
>
> Key: AIRFLOW-2122
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2122
> Project: Apache Airflow
>  Issue Type: Bug
>Reporter: sam sen
>Priority: Major
>
> Here's my code: 
> {code:java}
> dag = DAG('transfer_ftp_s3', 
> default_args=default_args, schedule_interval=None)
> task = SSHOperator(ssh_conn_id='ssh_node', 
>                    task_id="check_ftp_for_new_files", 
>                    command="echo 'hello world'", 
>                    do_xcom_push=True, dag=dag,)
> {code}
>  
> Here's the error
> {code:java}
> [2018-02-19 06:48:02,691] {{base_task_runner.py:98}} INFO - Subtask: 
> Traceback (most recent call last):
> [2018-02-19 06:48:02,691] {{base_task_runner.py:98}} INFO - Subtask:   File 
> "/usr/bin/airflow", line 27, in 
> [2018-02-19 06:48:02,692] {{base_task_runner.py:98}} INFO - Subtask: 
> args.func(args)
> [2018-02-19 06:48:02,693] {{base_task_runner.py:98}} INFO - Subtask:   File 
> "/usr/lib/python2.7/site-packages/airflow/bin/cli.py", line 392, in run
> [2018-02-19 06:48:02,695] {{base_task_runner.py:98}} INFO - Subtask: 
> pool=args.pool,
> [2018-02-19 06:48:02,695] {{base_task_runner.py:98}} INFO - Subtask:   File 
> "/usr/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
> [2018-02-19 06:48:02,696] {{base_task_runner.py:98}} INFO - Subtask: 
> result = func(*args, **kwargs)
> [2018-02-19 06:48:02,696] {{base_task_runner.py:98}} INFO - Subtask:   File 
> "/usr/lib/python2.7/site-packages/airflow/models.py", line 1496, in 
> _run_raw_task
> [2018-02-19 06:48:02,696] {{base_task_runner.py:98}} INFO - Subtask: 
> result = task_copy.execute(context=context)
> [2018-02-19 06:48:02,697] {{base_task_runner.py:98}} INFO - Subtask:   File 
> "/usr/lib/python2.7/site-packages/airflow/contrib/operators/ssh_operator.py", 
> line 146, in execute
> [2018-02-19 06:48:02,697] {{base_task_runner.py:98}} INFO - Subtask: 
> raise AirflowException("SSH operator error: {0}".format(str(e)))
> [2018-02-19 06:48:02,698] {{base_task_runner.py:98}} INFO - Subtask: 
> airflow.exceptions.AirflowException: SSH operator error: 'bool' object has no 
> attribute 'lower'
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2101) pip install apache-airflow does not install minimum packages for tutorial

2018-02-12 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361198#comment-16361198
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2101:


Log uploaded as [^DMNx1dG4.txt] in case the pastebin goes away.

Fernet is not a _hard_ requirement, so the crypto extra shouldn't be needed just 
to run {{initdb}}. Judging from the log it might have continued anyway, as there 
is extra output afterwards (I'm not certain though).

It's not great behaviour to show stack traces though, so we should fix that.
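
A sketch of the kind of guard that would avoid the stack trace (names are illustrative, not the actual configuration code):

{code}
import logging

log = logging.getLogger(__name__)


def get_fernet_or_none():
    """Return a Fernet object if the crypto extra is installed, else None."""
    try:
        from cryptography.fernet import Fernet
    except ImportError:
        log.warning(
            "cryptography is not installed; connection passwords will be "
            "stored unencrypted. Install apache-airflow[crypto] to enable "
            "encryption.")
        return None
    # load_fernet_key() is a hypothetical helper; key handling is out of
    # scope for this sketch.
    return Fernet(load_fernet_key())
{code}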

> pip install apache-airflow does not install minimum packages for tutorial
> -
>
> Key: AIRFLOW-2101
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2101
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Paymahn Moghadasian
>Priority: Minor
> Attachments: DMNx1dG4.txt
>
>
> What's expected:
> running `pip install apache-airflow` should install the minimum requirements 
> for running `airflow initdb`
> What happens:
> `airflow initdb` errors out because Fernet cannot be imported.
> Solution:
> run `rm -rf $AIRFLOW_HOME && pip install "apache-airflow[crypto]" && airflow 
> initdb`
> Logs of my output can be seen at https://pastebin.com/DMNx1dG4



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AIRFLOW-2101) pip install apache-airflow does not install minimum packages for tutorial

2018-02-12 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-2101:
---
Attachment: DMNx1dG4.txt

> pip install apache-airflow does not install minimum packages for tutorial
> -
>
> Key: AIRFLOW-2101
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2101
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Paymahn Moghadasian
>Priority: Minor
> Attachments: DMNx1dG4.txt
>
>
> What's expected:
> running `pip install apache-airflow` should install the minimum requirements 
> for running `airflow initdb`
> What happens:
> `airflow initdb` errors out because Fernet cannot be imported.
> Solution:
> run `rm -rf $AIRFLOW_HOME && pip install "apache-airflow[crypto]" && airflow 
> initdb`
> Logs of my output can be seen at https://pastebin.com/DMNx1dG4



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-1667) Remote log handlers don't upload logs on task finish

2018-02-09 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358757#comment-16358757
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1667:


The process that writes to the log files is a sub-process of the celery worker 
itself – it just invokes {{airflow run --local}} – and that means the flush 
should happen as soon as the task instance finishes running.

I do not see this behaviour on Py3/1.9.0 – our task logs appear in S3 when the 
task instance is finished. Are you saying you have to stop the {{airflow worker}} 
process for the logs to appear in S3?
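
If uploading on flush did turn out to be needed, one (untested, illustrative) approach would be along these lines:

{code}
import logging


class RemoteFlushingFileHandler(logging.FileHandler):
    """Hypothetical handler that pushes the local log file to remote
    storage on every flush(), not only on close()."""

    def __init__(self, filename, upload_fn):
        super(RemoteFlushingFileHandler, self).__init__(filename)
        # upload_fn is any callable taking a local path, e.g. a wrapper
        # around S3Hook.load_file().
        self._upload_fn = upload_fn

    def flush(self):
        super(RemoteFlushingFileHandler, self).flush()
        # Re-uploads the whole file each time, which is exactly the
        # "hitting the target resources unnecessarily" trade-off the
        # issue description mentions.
        self._upload_fn(self.baseFilename)
{code}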

> Remote log handlers don't upload logs on task finish
> 
>
> Key: AIRFLOW-1667
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1667
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: logging
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Arthur Vigil
>Priority: Major
>
> AIRFLOW-1385 revised logging for configurability, but the provided remote log 
> handlers (S3TaskHandler and GCSTaskHandler) only upload on close (flush is 
> left at the default implementation provided by `logging.FileHandler`). A 
> handler will be closed on process exit by `logging.shutdown()`, but depending 
> on the Executor used worker processes may not regularly shutdown, and can 
> very likely persist between tasks. This means during normal execution log 
> files are never uploaded.
> Need to find a way to flush remote log handlers in a timely manner, but 
> without hitting the target resources unnecessarily.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2073) FileSensor always return True

2018-02-08 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356957#comment-16356957
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2073:


PR opened https://github.com/apache/incubator-airflow/pull/3017

> FileSensor always return True
> -
>
> Key: AIRFLOW-2073
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2073
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: contrib
>Affects Versions: 1.9.0
> Environment: Ubuntu 16.04
>Reporter: Pierre Payet
>Priority: Trivial
>
> When using a FileSensor, the path is tested with os.walk. However, this 
> function never raises an error if the path does not exist, so the poke will 
> always return True.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2073) FileSensor always return True

2018-02-08 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356919#comment-16356919
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2073:


Yeah okay, this Sensor is currently totally broken, and the correct behaviour 
is unclear.

{{os.walk}} basically never fails: {{[f for f in walk('/i/dont/exist')]}} 
doesn't raise and simply evaluates to an empty list.

Using {{os.path.exists(full_path)}} might be one solution here, but it's not 
clear what the behaviour of this sensor is meant to be when given a directory. 
If it's given a directory, is it meant to wait for any files to appear inside 
the directory?
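
For reference, a minimal sketch of a poke() based on os.path.exists (assuming the sensor should simply succeed once the path exists, and ignoring the fs_conn_id base path for brevity):

{code}
import os


# Sketch of a replacement for FileSensor.poke (self is the sensor instance):
def poke(self, context):
    self.log.info('Poking for file %s', self.filepath)
    # Unlike os.walk, which silently yields nothing for a missing path,
    # os.path.exists stays False until the file (or directory) appears.
    return os.path.exists(self.filepath)
{code}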

> FileSensor always return True
> -
>
> Key: AIRFLOW-2073
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2073
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: contrib
>Affects Versions: 1.9.0
> Environment: Ubuntu 16.04
>Reporter: Pierre Payet
>Priority: Trivial
>
> When using a FileSensor, the path is tested with os.walk. However, this 
> function never raises an error if the path does not exist, so the poke will 
> always return True.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2073) FileSensor always return True

2018-02-08 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356881#comment-16356881
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2073:


FileSensor in 1.9.0 lives here 
https://github.com/apache/incubator-airflow/blob/1.9.0/airflow/contrib/operators/fs_operator.py

> FileSensor always return True
> -
>
> Key: AIRFLOW-2073
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2073
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: contrib
>Affects Versions: 1.9.0
> Environment: Ubuntu 16.04
>Reporter: Pierre Payet
>Priority: Trivial
>
> When using a FileSensor, the path is tested with os.walk. However, this 
> function never raises an error if the path does not exist, so the poke will 
> always return True.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2047) airflow is insecure by default

2018-01-30 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345515#comment-16345515
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2047:


There is a small note: "Be sure to checkout [Experimental Rest 
API|https://airflow.apache.org/api.html] for securing the API." at the top of 
the security page. Maybe this does need to be made bigger.

 

See AIRFLOW-1765 for the previous decision on why the API isn't authenticated 
by default

> airflow is insecure by default
> --
>
> Key: AIRFLOW-2047
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2047
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: api
>Affects Versions: 1.9.0
>Reporter: Joost van Dorp
>Priority: Critical
>  Labels: easyfix, security
>
> #  [API Documentation|https://airflow.apache.org/api.html#authentication] 
> states that the API is open/insecure by default.
>  # [Security Documentation|https://airflow.apache.org/security.html#security] 
> does not mention the API.
> Can we either:
> a) Disable the API by default, and have instructions on how to enable the 
> experimental API?
> or
> b) Place a warning on the security documentation page that the API is open by 
> default.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2027) Remove unnecessary 1s sleep in scheduler loop

2018-01-25 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338996#comment-16338996
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2027:


Would removing this sleep not also significantly increase the CPU load of the 
scheduler by putting it in a busy loop?
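
For comparison, the usual middle ground is to sleep only for whatever is left of a minimum loop duration, rather than a fixed second or nothing at all (sketch, not the actual scheduler code):

{code}
import time

MIN_LOOP_SECONDS = 1.0

while True:
    loop_start = time.time()
    run_one_scheduler_iteration()   # placeholder for the real scheduling work
    elapsed = time.time() - loop_start
    # A busy scheduler is not slowed down, but an idle one does not
    # spin at 100% CPU either.
    if elapsed < MIN_LOOP_SECONDS:
        time.sleep(MIN_LOOP_SECONDS - elapsed)
{code}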

> Remove unnecessary 1s sleep in scheduler loop
> -
>
> Key: AIRFLOW-2027
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2027
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: scheduler
>Reporter: Dan Davydov
>Assignee: Dan Davydov
>Priority: Major
>
> The scheduler loop sleeps for 1 second every loop unnecessarily. Remove this 
> sleep to slightly speed up scheduling. It can add up since it runs to every 
> scheduler loop which runs # of dags to parse/scheduler parallelism times.
>  
> Also remove the unnecessary increased file processing interval in tests which 
> slows them down.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-1837) Differing start_dates on tasks not respected by scheduler.

2017-12-20 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298364#comment-16298364
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1837:


I've just tested this again, and can confirm that this is the case with 1.9.0rc8

> Differing start_dates on tasks not respected by scheduler.
> --
>
> Key: AIRFLOW-1837
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1837
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Ash Berlin-Taylor
>
> It is possible to specify start_date directly on tasks in a dag, as well as on 
> the DAG. This is correctly handled when creating dag runs, but it is 
> seemingly ignored when scheduling tasks.
> Given this example:
> {code}
> dag_args = {
> "start_date": datetime(2017, 9, 4),
> }
> dag = DAG(
> "my-dag",
> default_args=dag_args,
> schedule_interval="0 0 * * Mon",
> )
> # ...
> with dag:
> op = PythonOperator(
> python_callable=fetcher.run,
> task_id="fetch_all_respondents",
> provide_context=True,
> # The "unfiltered" API calls are a lot quicker, so lets put them
> # ahead of any other filtered job in the queue.
> priority_weight=10,
> start_date=datetime(2014, 9, 1),
> )
> op = PythonOperator(
> python_callable=fetcher.run,
> task_id="fetch_by_demographics",
> op_kwargs={
> 'demo_names': demo_names,
> },
> provide_context=True,
> priority_weight=5,
> )
> {code}
> I only want the fetch_all_respondents tasks to run for 2014..2017, and then 
> from September 2017 I also want the fetch_by_demographics task to run. 
> However right now both tasks are being scheduled from 2014-09-01.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (AIRFLOW-1931) Importing airflow module shouldn't affect logging config

2017-12-15 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292991#comment-16292991
 ] 

Ash Berlin-Taylor edited comment on AIRFLOW-1931 at 12/15/17 6:32 PM:
--

For example

{noformat}
building [html]: targets for 1 source files that are out of date
updating environment: 0 added, 1 changed, 0 removed
Reading the config from /root/airflow/airflow.cfg
[2017-12-15 18:31:22,599] {logging.py:97} INFO -
[2017-12-15 18:31:22,600] {logging.py:97} INFO - looking for now-outdated 
files...
[2017-12-15 18:31:22,600] {logging.py:97} INFO - none found
[2017-12-15 18:31:22,601] {logging.py:97} INFO - pickling environment...
{noformat}


was (Author: ashb):
For example

{{forformat}}
building [html]: targets for 1 source files that are out of date
updating environment: 0 added, 1 changed, 0 removed
Reading the config from /root/airflow/airflow.cfg
[2017-12-15 18:31:22,599] {logging.py:97} INFO -
[2017-12-15 18:31:22,600] {logging.py:97} INFO - looking for now-outdated 
files...
[2017-12-15 18:31:22,600] {logging.py:97} INFO - none found
[2017-12-15 18:31:22,601] {logging.py:97} INFO - pickling environment...
{{noformat}}

> Importing airflow module shouldn't affect logging config
> 
>
> Key: AIRFLOW-1931
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1931
> Project: Apache Airflow
>  Issue Type: Improvement
>Affects Versions: 1.9.0
>Reporter: Ash Berlin-Taylor
>Priority: Minor
> Fix For: 2.0.0
>
>
> Right now simply importing the airflow main module will alter the logging 
> config, which leads to some strange interactions with other python modules.
> (One such example is sphinx autodoc where half the lines are in one logging 
> format, and half are in airflow's style after it gets loaded by autodoc.)
> It would be nice if we only used airflow's logging format from within 
> airflow.bin.cli.
> More generally this might also be achieved by doing less at the top level of 
> modules (for instance importing airflow.configuration will end up creating 
> dirs on the filesystem.)
> None of this is a disaster or a bug, it's just a tiny bit annoying when 
> you use airflow programmatically



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1931) Importing airflow module shouldn't affect logging config

2017-12-15 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292991#comment-16292991
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1931:


For example

{{forformat}}
building [html]: targets for 1 source files that are out of date
updating environment: 0 added, 1 changed, 0 removed
Reading the config from /root/airflow/airflow.cfg
[2017-12-15 18:31:22,599] {logging.py:97} INFO -
[2017-12-15 18:31:22,600] {logging.py:97} INFO - looking for now-outdated 
files...
[2017-12-15 18:31:22,600] {logging.py:97} INFO - none found
[2017-12-15 18:31:22,601] {logging.py:97} INFO - pickling environment...
{{noformat}}

> Importing airflow module shouldn't affect logging config
> 
>
> Key: AIRFLOW-1931
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1931
> Project: Apache Airflow
>  Issue Type: Improvement
>Affects Versions: 1.9.0
>Reporter: Ash Berlin-Taylor
>Priority: Minor
> Fix For: 2.0.0
>
>
> Right now simply importing the airflow main module will alter the logging 
> config, which leads to some strange interactions with other python modules.
> (One such example is sphinx autodoc where half the lines are in one logging 
> format, and half are in airflow's style after it gets loaded by autodoc.)
> It would be nice if we only used airflow's logging format from within 
> airflow.bin.cli.
> More generally this might also be achieved by doing less at the top level of 
> modules (for instance importing airflow.configuration will end up creating 
> dirs on the filesystem.)
> None of this is a disaster or a bug, it's just a tiny bit annoying when 
> you use airflow programmatically



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AIRFLOW-1931) Importing airflow module shouldn't affect logging config

2017-12-15 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-1931:
--

 Summary: Importing airflow module shouldn't affect logging config
 Key: AIRFLOW-1931
 URL: https://issues.apache.org/jira/browse/AIRFLOW-1931
 Project: Apache Airflow
  Issue Type: Improvement
Affects Versions: 1.9.0
Reporter: Ash Berlin-Taylor
Priority: Minor
 Fix For: 2.0.0


Right now simply importing the airflow main module will alter the logging 
config, which leads to some strange interactions with other python modules.

(One such example is sphinx autodoc where half the lines are in one logging 
format, and half are in airflow's style after it gets loaded by autodoc.)

It would be nice if we only used airflow's logging format from within 
airflow.bin.cli.

More generally this might also be achieved by doing less at the top level of 
modules (for instance importing airflow.configuration will end up creating dirs 
on the filesystem.)

None of this is a disaster or a bug, it's just a tiny bit annoying when you 
use airflow programmatically.
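
The effect is easy to see from a plain Python session (sketch; exactly what changes depends on the logging config that gets loaded):

{code}
import logging

print(logging.getLogger().handlers)   # before importing airflow

import airflow  # noqa: F401  -- merely importing mutates global logging state

print(logging.getLogger().handlers)   # after: handlers/format may have changed
{code}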



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1916) S3 Task logs end up duplicated in the file.

2017-12-14 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290909#comment-16290909
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1916:


https://github.com/apache/incubator-airflow/pull/2880/files#diff-d06b55e8ca92fabc72372ac03c89704bL63
 :)

> S3 Task logs end up duplicated in the file.
> ---
>
> Key: AIRFLOW-1916
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1916
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Ash Berlin-Taylor
> Fix For: 1.9.1
>
>
> If using the S3TaskHandler logger the contents of the log file in the S3 
> bucket end up duplicated - once from when `airflow run --raw` finalizes the 
> task, and again from when `airflow run --local` finalizes its logger.
> Log from the UI included below. The file on disk does not have the repetition.
> There is a comment in `run()` in airflow.bin.cli implying that `--raw` is not 
> meant to upload, but something is.
> {noformat}
> *** Reading remote log from 
> s3://xxx/ash-test/tests/test-logging/2017-12-13T10:45:42.552705/1.log.
> [2017-12-13 10:45:49,764] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:49,895] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,905] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,906] {models.py:1407} INFO - 
> 
> Starting attempt 1 of 1
> 
> [2017-12-13 10:45:49,923] {models.py:1428} INFO - Executing 
>  on 2017-12-13 10:45:42.552705
> [2017-12-13 10:45:49,924] {base_task_runner.py:115} INFO - Running: ['bash', 
> '-c', 'airflow run tests test-logging 2017-12-13T10:45:42.552705 --job_id 5 
> --raw -sd /usr/local/airflow/dags/example/csv_to_parquet.py']
> [2017-12-13 10:45:53,622] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {configuration.py:206} WARNING - section/key 
> [celery/celery_ssl_active] not found in config
> [2017-12-13 10:45:53,625] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {default_celery.py:41} WARNING - Celery Executor 
> will run without SSL
> [2017-12-13 10:45:53,626] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,624] {__init__.py:45} INFO - Using executor 
> CeleryExecutor
> [2017-12-13 10:45:53,755] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,754] {models.py:189} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,859] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:53,901] {logging_mixin.py:84} INFO - Hi from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,902] {logging_mixin.py:84} INFO - Hi 2 from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,903] {csv_to_parquet.py:27} ERROR - Hello
> [2017-12-13 10:45:53,905] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,904] {python_operator.py:90} INFO - Done. Returned value 
> was: None
> [2017-12-13 10:45:49,764] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:49,895] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,905] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,906] {models.py:1407} INFO - 
> 
> Starting attempt 1 of 1
> 
> [2017-12-13 10:45:49,923] {models.py:1428} INFO - Executing 
>  on 2017-12-13 10:45:42.552705
> [2017-12-13 10:45:49,924] {base_task_runner.py:115} INFO - Running: ['bash', 
> '-c', 'airflow run tests test-logging 2017-12-13T10:45:42.552705 --job_id 5 
> --raw -sd /usr/local/airflow/dags/example/csv_to_parquet.py']
> [2017-12-13 10:45:53,622] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {configuration.py:206} WARNING - section/key 
> [celery/celery_ssl_active] not found in config
> [2017-12-13 10:45:53,625] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {default_celery.py:41} WARNING - Celery Executor 
> will run without SSL
> [2017-12-13 10:45:53,626] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,624] {__init__.py:45} INFO - Using executor 
> CeleryExecutor
> [2017-12-13 10:45:53,755] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,754] {models.py:189} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,859] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:53,901] {logging_mixin.

[jira] [Commented] (AIRFLOW-1916) S3 Task logs end up duplicated in the file.

2017-12-14 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290900#comment-16290900
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1916:


Doesn't help, as it's always called from an {{atexit}} handler registered when 
the logging stdlib module is imported.

> S3 Task logs end up duplicated in the file.
> ---
>
> Key: AIRFLOW-1916
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1916
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Ash Berlin-Taylor
> Fix For: 1.9.1
>
>
> If using the S3TaskHandler logger the contents of the log file in the S3 
> bucket end up duplicated - once from when `airflow run --raw` finalizes the 
> task, and again from when `airflow run --local` finalizes its logger.
> Log from the UI included below. The file on disk does not have the repetition.
> There is a comment in `run()` in airflow.bin.cli implying that `--raw` is not 
> meant to upload, but something is.
> {noformat}
> *** Reading remote log from 
> s3://xxx/ash-test/tests/test-logging/2017-12-13T10:45:42.552705/1.log.
> [2017-12-13 10:45:49,764] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:49,895] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,905] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,906] {models.py:1407} INFO - 
> 
> Starting attempt 1 of 1
> 
> [2017-12-13 10:45:49,923] {models.py:1428} INFO - Executing 
>  on 2017-12-13 10:45:42.552705
> [2017-12-13 10:45:49,924] {base_task_runner.py:115} INFO - Running: ['bash', 
> '-c', 'airflow run tests test-logging 2017-12-13T10:45:42.552705 --job_id 5 
> --raw -sd /usr/local/airflow/dags/example/csv_to_parquet.py']
> [2017-12-13 10:45:53,622] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {configuration.py:206} WARNING - section/key 
> [celery/celery_ssl_active] not found in config
> [2017-12-13 10:45:53,625] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {default_celery.py:41} WARNING - Celery Executor 
> will run without SSL
> [2017-12-13 10:45:53,626] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,624] {__init__.py:45} INFO - Using executor 
> CeleryExecutor
> [2017-12-13 10:45:53,755] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,754] {models.py:189} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,859] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:53,901] {logging_mixin.py:84} INFO - Hi from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,902] {logging_mixin.py:84} INFO - Hi 2 from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,903] {csv_to_parquet.py:27} ERROR - Hello
> [2017-12-13 10:45:53,905] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,904] {python_operator.py:90} INFO - Done. Returned value 
> was: None
> [2017-12-13 10:45:49,764] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:49,895] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,905] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,906] {models.py:1407} INFO - 
> 
> Starting attempt 1 of 1
> 
> [2017-12-13 10:45:49,923] {models.py:1428} INFO - Executing 
>  on 2017-12-13 10:45:42.552705
> [2017-12-13 10:45:49,924] {base_task_runner.py:115} INFO - Running: ['bash', 
> '-c', 'airflow run tests test-logging 2017-12-13T10:45:42.552705 --job_id 5 
> --raw -sd /usr/local/airflow/dags/example/csv_to_parquet.py']
> [2017-12-13 10:45:53,622] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {configuration.py:206} WARNING - section/key 
> [celery/celery_ssl_active] not found in config
> [2017-12-13 10:45:53,625] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {default_celery.py:41} WARNING - Celery Executor 
> will run without SSL
> [2017-12-13 10:45:53,626] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,624] {__init__.py:45} INFO - Using executor 
> CeleryExecutor
> [2017-12-13 10:45:53,755] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,754] {models.py:189} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,859] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:53,901] {l

[jira] [Commented] (AIRFLOW-1916) S3 Task logs end up duplicated in the file.

2017-12-13 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289160#comment-16289160
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1916:


I suspect this will happen on Py2 too:

https://github.com/python/cpython/blob/2.7/Lib/logging/__init__.py#L1660-L1692
https://github.com/python/cpython/blob/3.6/Lib/logging/__init__.py#L1928-L1960

Both versions call flush(), close() on all handlers for us.

> S3 Task logs end up duplicated in the file.
> ---
>
> Key: AIRFLOW-1916
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1916
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Ash Berlin-Taylor
> Fix For: 1.9.1
>
>
> If using the S3TaskHandler logger the contents of the log file in the S3 
> bucket end up duplicated - once from when `airflow run --raw` finalizes the 
> task, and again from when `airflow run --local` finalizes its logger.
> Log from the UI included below. The file on disk does not have the repetition.
> There is a comment in `run()` in airflow.bin.cli implying that `--raw` is not 
> meant to upload, but something is.
> {noformat}
> *** Reading remote log from 
> s3://xxx/ash-test/tests/test-logging/2017-12-13T10:45:42.552705/1.log.
> [2017-12-13 10:45:49,764] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:49,895] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,905] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,906] {models.py:1407} INFO - 
> 
> Starting attempt 1 of 1
> 
> [2017-12-13 10:45:49,923] {models.py:1428} INFO - Executing 
>  on 2017-12-13 10:45:42.552705
> [2017-12-13 10:45:49,924] {base_task_runner.py:115} INFO - Running: ['bash', 
> '-c', 'airflow run tests test-logging 2017-12-13T10:45:42.552705 --job_id 5 
> --raw -sd /usr/local/airflow/dags/example/csv_to_parquet.py']
> [2017-12-13 10:45:53,622] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {configuration.py:206} WARNING - section/key 
> [celery/celery_ssl_active] not found in config
> [2017-12-13 10:45:53,625] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {default_celery.py:41} WARNING - Celery Executor 
> will run without SSL
> [2017-12-13 10:45:53,626] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,624] {__init__.py:45} INFO - Using executor 
> CeleryExecutor
> [2017-12-13 10:45:53,755] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,754] {models.py:189} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,859] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:53,901] {logging_mixin.py:84} INFO - Hi from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,902] {logging_mixin.py:84} INFO - Hi 2 from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,903] {csv_to_parquet.py:27} ERROR - Hello
> [2017-12-13 10:45:53,905] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,904] {python_operator.py:90} INFO - Done. Returned value 
> was: None
> [2017-12-13 10:45:49,764] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:49,895] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,905] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,906] {models.py:1407} INFO - 
> 
> Starting attempt 1 of 1
> 
> [2017-12-13 10:45:49,923] {models.py:1428} INFO - Executing 
>  on 2017-12-13 10:45:42.552705
> [2017-12-13 10:45:49,924] {base_task_runner.py:115} INFO - Running: ['bash', 
> '-c', 'airflow run tests test-logging 2017-12-13T10:45:42.552705 --job_id 5 
> --raw -sd /usr/local/airflow/dags/example/csv_to_parquet.py']
> [2017-12-13 10:45:53,622] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {configuration.py:206} WARNING - section/key 
> [celery/celery_ssl_active] not found in config
> [2017-12-13 10:45:53,625] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {default_celery.py:41} WARNING - Celery Executor 
> will run without SSL
> [2017-12-13 10:45:53,626] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,624] {__init__.py:45} INFO - Using executor 
> CeleryExecutor
> [2017-12-13 10:45:53,755] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,754] {models.py:189} INFO - Filling up the DagBag from 
> /usr/local/airfl

[jira] [Commented] (AIRFLOW-1916) S3 Task logs end up duplicated in the file.

2017-12-13 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289152#comment-16289152
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1916:


I've added some debugging in, showing where each close was called from, and 
what the os args were. On python 3 this is what I see.

The raw shutdown happens automatically from the logging system:

{noformat}
Closing log from pid 445 / ['/usr/local/bin/airflow', 'run', 'tests', 
'test-logging', '2017-12-13T11:37:55.558651', '--job_id', '18', '--raw', '-sd', 
'/usr/local/airflow/dags/example/csv_to_parquet.py']
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1919, in shutdown
h.close()
  File 
"/usr/local/lib/python3.6/site-packages/airflow/utils/log/s3_task_handler.py", 
line 72, in close
traceback.print_stack(file=fh)
{noformat}

The non-raw shutdown is called by us.

{noformat}
Closing log from pid 440 / ['/usr/local/bin/airflow', 'run', 'tests', 
'test-logging', '2017-12-13T11:37:55.558651', '--local', '-sd', 
'/usr/local/airflow/dags/example/csv_to_parquet.py']
  File "/usr/local/bin/airflow", line 27, in 
args.func(args)
  File "/usr/local/lib/python3.6/site-packages/airflow/bin/cli.py", line 438, 
in run
handler.close()
  File 
"/usr/local/lib/python3.6/site-packages/airflow/utils/log/s3_task_handler.py", 
line 72, in close
traceback.print_stack(file=fh)
{noformat}
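
For the record, the instrumentation that produced the traces above was roughly this (sketch; where the output is written doesn't matter for the diagnosis):

{code}
# Sketch of the temporary debugging version of S3TaskHandler.close():
import os
import sys
import traceback


def close(self):
    sys.stderr.write('Closing log from pid %s / %s\n' % (os.getpid(), sys.argv))
    traceback.print_stack(file=sys.stderr)
    # ... the handler's normal close()/upload logic would follow here ...
{code}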

> S3 Task logs end up duplicated in the file.
> ---
>
> Key: AIRFLOW-1916
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1916
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Ash Berlin-Taylor
> Fix For: 1.9.1
>
>
> If using the S3TaskHandler logger the contents of the log file in the S3 
> bucket end up duplicated - once from when `airflow run --raw` finalizes the 
> task, and again from when `airflow run --local` finalizes its logger.
> Log from the UI included below. The file on disk does not have the repetition.
> There is a comment in `run()` in airflow.bin.cli implying that `--raw` is not 
> meant to upload, but something is.
> {noformat}
> *** Reading remote log from 
> s3://xxx/ash-test/tests/test-logging/2017-12-13T10:45:42.552705/1.log.
> [2017-12-13 10:45:49,764] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:49,895] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,905] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,906] {models.py:1407} INFO - 
> 
> Starting attempt 1 of 1
> 
> [2017-12-13 10:45:49,923] {models.py:1428} INFO - Executing 
>  on 2017-12-13 10:45:42.552705
> [2017-12-13 10:45:49,924] {base_task_runner.py:115} INFO - Running: ['bash', 
> '-c', 'airflow run tests test-logging 2017-12-13T10:45:42.552705 --job_id 5 
> --raw -sd /usr/local/airflow/dags/example/csv_to_parquet.py']
> [2017-12-13 10:45:53,622] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {configuration.py:206} WARNING - section/key 
> [celery/celery_ssl_active] not found in config
> [2017-12-13 10:45:53,625] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,622] {default_celery.py:41} WARNING - Celery Executor 
> will run without SSL
> [2017-12-13 10:45:53,626] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,624] {__init__.py:45} INFO - Using executor 
> CeleryExecutor
> [2017-12-13 10:45:53,755] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,754] {models.py:189} INFO - Filling up the DagBag from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,859] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:53,901] {logging_mixin.py:84} INFO - Hi from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,902] {logging_mixin.py:84} INFO - Hi 2 from 
> /usr/local/airflow/dags/example/csv_to_parquet.py
> [2017-12-13 10:45:53,903] {csv_to_parquet.py:27} ERROR - Hello
> [2017-12-13 10:45:53,905] {base_task_runner.py:98} INFO - Subtask: 
> [2017-12-13 10:45:53,904] {python_operator.py:90} INFO - Done. Returned value 
> was: None
> [2017-12-13 10:45:49,764] {cli.py:374} INFO - Running on host ac5d0787084d
> [2017-12-13 10:45:49,895] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,905] {models.py:1197} INFO - Dependencies all met for 
> 
> [2017-12-13 10:45:49,906] {models.py:1407} INFO - 
> 
> Starting attempt 1 of 1
> 
> [2017-12-13 10:45:49,923] {models.py:142

[jira] [Created] (AIRFLOW-1917) print() from python operators end up with extra new line

2017-12-13 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-1917:
--

 Summary: print() from python operators end up with extra new line
 Key: AIRFLOW-1917
 URL: https://issues.apache.org/jira/browse/AIRFLOW-1917
 Project: Apache Airflow
  Issue Type: Bug
Affects Versions: 1.9.0
Reporter: Ash Berlin-Taylor
 Fix For: 1.9.1


If I have the following as the callable for a PythonOperator:

{code}
def print_stuff(ti, **kwargs):
print("Hi from", __file__)
print("Hi 2 from", __file__)
{code}

I see the following in the log files

{noformat}
[2017-12-13 10:45:53,901] {logging_mixin.py:84} INFO - Hi from 
/usr/local/airflow/dags/example/csv_to_parquet.py

[2017-12-13 10:45:53,902] {logging_mixin.py:84} INFO - Hi 2 from 
/usr/local/airflow/dags/example/csv_to_parquet.py

[2017-12-13 10:45:53,905] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 
10:45:53,904] {python_operator.py:90} INFO - Done. Returned value was: None
{noformat}

Note the extra blank lines.
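
Most likely the stdout-to-logger redirection logs each write() call as its own record, so the trailing newline that print() writes separately becomes an empty record. A sketch of the usual fix (strip bare newlines before logging; illustrative, not the actual logging_mixin code):

{code}
class StreamToLogger(object):
    """Minimal file-like object that forwards writes to a logger."""

    def __init__(self, logger, level):
        self.logger = logger
        self.level = level

    def write(self, message):
        # print() calls write() twice: once with the text, once with '\n'.
        # Dropping the bare newline avoids the extra blank log lines.
        message = message.rstrip('\n')
        if message:
            self.logger.log(self.level, message)

    def flush(self):
        pass
{code}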



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AIRFLOW-1916) S3 Task logs end up duplicated in the file.

2017-12-13 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-1916:
--

 Summary: S3 Task logs end up duplicated in the file.
 Key: AIRFLOW-1916
 URL: https://issues.apache.org/jira/browse/AIRFLOW-1916
 Project: Apache Airflow
  Issue Type: Bug
Affects Versions: 1.9.0
Reporter: Ash Berlin-Taylor
 Fix For: 1.9.1


If using the S3TaskHandler logger the contents of the log file in the S3 bucket 
end up duplicated - once from when `airflow run --raw` finalizes the task, and 
again from when `airflow run --local` finalizes its logger.

Log from the UI included below. The file on disk does not have the repetition.

There is a comment in `run()` in airflow.bin.cli implying that `--raw` is not 
meant to upload, but something is.

{noformat}
*** Reading remote log from 
s3://xxx/ash-test/tests/test-logging/2017-12-13T10:45:42.552705/1.log.
[2017-12-13 10:45:49,764] {cli.py:374} INFO - Running on host ac5d0787084d
[2017-12-13 10:45:49,895] {models.py:1197} INFO - Dependencies all met for 

[2017-12-13 10:45:49,905] {models.py:1197} INFO - Dependencies all met for 

[2017-12-13 10:45:49,906] {models.py:1407} INFO - 

Starting attempt 1 of 1


[2017-12-13 10:45:49,923] {models.py:1428} INFO - Executing 
 on 2017-12-13 10:45:42.552705
[2017-12-13 10:45:49,924] {base_task_runner.py:115} INFO - Running: ['bash', 
'-c', 'airflow run tests test-logging 2017-12-13T10:45:42.552705 --job_id 5 
--raw -sd /usr/local/airflow/dags/example/csv_to_parquet.py']
[2017-12-13 10:45:53,622] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 
10:45:53,622] {configuration.py:206} WARNING - section/key 
[celery/celery_ssl_active] not found in config
[2017-12-13 10:45:53,625] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 
10:45:53,622] {default_celery.py:41} WARNING - Celery Executor will run without 
SSL
[2017-12-13 10:45:53,626] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 
10:45:53,624] {__init__.py:45} INFO - Using executor CeleryExecutor
[2017-12-13 10:45:53,755] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 
10:45:53,754] {models.py:189} INFO - Filling up the DagBag from 
/usr/local/airflow/dags/example/csv_to_parquet.py
[2017-12-13 10:45:53,859] {cli.py:374} INFO - Running on host ac5d0787084d
[2017-12-13 10:45:53,901] {logging_mixin.py:84} INFO - Hi from 
/usr/local/airflow/dags/example/csv_to_parquet.py

[2017-12-13 10:45:53,902] {logging_mixin.py:84} INFO - Hi 2 from 
/usr/local/airflow/dags/example/csv_to_parquet.py

[2017-12-13 10:45:53,903] {csv_to_parquet.py:27} ERROR - Hello
[2017-12-13 10:45:53,905] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 
10:45:53,904] {python_operator.py:90} INFO - Done. Returned value was: None

[2017-12-13 10:45:49,764] {cli.py:374} INFO - Running on host ac5d0787084d
[2017-12-13 10:45:49,895] {models.py:1197} INFO - Dependencies all met for 

[2017-12-13 10:45:49,905] {models.py:1197} INFO - Dependencies all met for 

[2017-12-13 10:45:49,906] {models.py:1407} INFO - 

Starting attempt 1 of 1


[2017-12-13 10:45:49,923] {models.py:1428} INFO - Executing 
 on 2017-12-13 10:45:42.552705
[2017-12-13 10:45:49,924] {base_task_runner.py:115} INFO - Running: ['bash', 
'-c', 'airflow run tests test-logging 2017-12-13T10:45:42.552705 --job_id 5 
--raw -sd /usr/local/airflow/dags/example/csv_to_parquet.py']
[2017-12-13 10:45:53,622] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 
10:45:53,622] {configuration.py:206} WARNING - section/key 
[celery/celery_ssl_active] not found in config
[2017-12-13 10:45:53,625] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 
10:45:53,622] {default_celery.py:41} WARNING - Celery Executor will run without 
SSL
[2017-12-13 10:45:53,626] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 
10:45:53,624] {__init__.py:45} INFO - Using executor CeleryExecutor
[2017-12-13 10:45:53,755] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 
10:45:53,754] {models.py:189} INFO - Filling up the DagBag from 
/usr/local/airflow/dags/example/csv_to_parquet.py
[2017-12-13 10:45:53,859] {cli.py:374} INFO - Running on host ac5d0787084d
[2017-12-13 10:45:53,901] {logging_mixin.py:84} INFO - Hi from 
/usr/local/airflow/dags/example/csv_to_parquet.py

[2017-12-13 10:45:53,902] {logging_mixin.py:84} INFO - Hi 2 from 
/usr/local/airflow/dags/example/csv_to_parquet.py

[2017-12-13 10:45:53,903] {csv_to_parquet.py:27} ERROR - Hello
[2017-12-13 10:45:53,905] {base_task_runner.py:98} INFO - Subtask: [2017-12-13 
10:45:53,904] {python_operator.py:90} INFO - Done. Returned value was: None
[2017-12-13 10:45:54,923] {base_task_runner.py:98} INFO 

[jira] [Commented] (AIRFLOW-1903) /admin/airflow should not be hardcoded

2017-12-09 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284808#comment-16284808
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1903:


Someone has made a PR for this on the current webserver: AIRFLOW-1755 - and 
there's an open PR for that on GitHub too.

> /admin/airflow should not be hardcoded
> --
>
> Key: AIRFLOW-1903
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1903
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: core
>Reporter: William Pursell
>Assignee: William Pursell
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Joy's new webserver (https://github.com/wepay/airflow-webserver) is changing 
> some of the endpoints.  Whether that package or some other is merged, we'll 
> need some flexibility.  In the short term, there needs to be a way to inject 
> a different path to logs in the email alerts, and adding a config option 
> seems easiest.  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1898) Large XComs are not supported and fail silently

2017-12-08 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283946#comment-16283946
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1898:


Maybe partly or wholly addressed via AIRFLOW-855 
https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;a=commitdiff;h=4cf904c;hp=984a87c0cb685ea4dfa765cc4f4a23c9058b3965

> Large XComs are not supported and fail silently
> ---
>
> Key: AIRFLOW-1898
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1898
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: xcom
>Affects Versions: Airflow 1.8
> Environment: MySQL
>Reporter: Len Frodgers
>
> I am using Airflow backed by MySQL and having problems with large XComs (> 64 
> KB). Xcom uses PickleType which is backed by BLOB on MySQL.
> Unfortunately, MySQL by default truncates anything longer than BLOB (64 KB) 
> when saving, so when unpickling such XComs, they are corrupt and an EOFError 
> is raised.
> Two things we need:
> 1) Validation when saving the XCom that it is not too big
> 2) Use MEDIUMBLOB as the underlying data type for the XCOM column on MySQL so 
> large XComs can be stored (supports up to 12 MB, I think)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AIRFLOW-1897) Task Logs for running instance not visible in webui

2017-12-08 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-1897:
--

 Summary: Task Logs for running instance not visible in webui
 Key: AIRFLOW-1897
 URL: https://issues.apache.org/jira/browse/AIRFLOW-1897
 Project: Apache Airflow
  Issue Type: Bug
Affects Versions: 1.9.0
Reporter: Ash Berlin-Taylor
 Fix For: 1.9.1


Task logs for currently running instances are not visible in the web UI. This is 
likely due to my change in AIRFLOW-1873.

I don't think this is a blocker for 1.9.0 to be released, but it's a small 
change which I'll open now.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AIRFLOW-1880) TaskInstance.log_filepath property doesn't account for custom format

2017-12-02 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-1880:
--

 Summary: TaskInstance.log_filepath property doesn't account for 
custom format
 Key: AIRFLOW-1880
 URL: https://issues.apache.org/jira/browse/AIRFLOW-1880
 Project: Apache Airflow
  Issue Type: Bug
Affects Versions: 1.9.0
Reporter: Ash Berlin-Taylor
Priority: Trivial
 Fix For: 1.9.1


The "log_filepath" property on a TaskInstance doesn't output the right path 
after [AIRFLOW-1582] which changed the default and let the format be customized.

This is a minor bug that doesn't affect very much - it's displayed in the admin 
UI, and is included in the failure email. Beyond that it isn't used anywhere by 
Airflow.

Also I wonder if the log_filepath should be an attribute of the Job model, so 
that if the format is changed in the future we can still find the right 
historic log file?





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AIRFLOW-1873) Task operator logs appear in wrong numbered log file

2017-11-30 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-1873:
--

 Summary: Task operator logs appear in wrong numbered log file
 Key: AIRFLOW-1873
 URL: https://issues.apache.org/jira/browse/AIRFLOW-1873
 Project: Apache Airflow
  Issue Type: Bug
Affects Versions: 1.9.0
Reporter: Ash Berlin-Taylor


The logs for the running operators appear in the log file for the "next" try number.

For example, for the first try for a given task instance the "collecting dag" 
etc appear in 1.log, but log messages from the operator itself appear in 2.log.

1.log:

{noformat}
[2017-11-30 23:14:44,189] {cli.py:374} INFO - Running on host 4f1698e8ae61
[2017-11-30 23:14:44,254] {models.py:1173} INFO - Dependencies all met for 

[2017-11-30 23:14:44,265] {models.py:1173} INFO - Dependencies all met for 

[2017-11-30 23:14:44,266] {models.py:1383} INFO -

Starting attempt 1 of 1


[2017-11-30 23:14:44,290] {models.py:1404} INFO - Executing 
 on 2017-11-20 00:00:00
[2017-11-30 23:14:44,291] {base_task_runner.py:115} INFO - Running: ['bash', 
'-c', 'airflow run tests test-logging 2017-11-20T00:00:00 --job_id 4 --raw -sd 
/usr/local/airflow/dags/example/csv_to_parquet.py']
[2017-11-30 23:14:50,054] {base_task_runner.py:98} INFO - Subtask: [2017-11-30 
23:14:50,052] {configuration.py:206} WARNING - section/key 
[celery/celery_ssl_active] not found in config
[2017-11-30 23:14:50,056] {base_task_runner.py:98} INFO - Subtask: [2017-11-30 
23:14:50,052] {default_celery.py:41} WARNING - Celery Executor will run without 
SSL
[2017-11-30 23:14:50,058] {base_task_runner.py:98} INFO - Subtask: [2017-11-30 
23:14:50,054] {__init__.py:45} INFO - Using executor CeleryExecutor
[2017-11-30 23:14:50,529] {base_task_runner.py:98} INFO - Subtask: [2017-11-30 
23:14:50,529] {models.py:189} INFO - Filling up the DagBag from 
/usr/local/airflow/dags/example/csv_to_parquet.py
[2017-11-30 23:14:50,830] {base_task_runner.py:98} INFO - Subtask: [2017-11-30 
23:14:50,825] {python_operator.py:90} INFO - Done. Returned value was: None
{noformat}

2.log:

{noformat}
[2017-11-30 23:14:50,749] {cli.py:374} INFO - Running on host 4f1698e8ae61
[2017-11-30 23:14:50,820] {logging_mixin.py:84} INFO - Hi from 
/usr/local/airflow/dags/example/csv_to_parquet.py

[2017-11-30 23:14:50,824] {csv_to_parquet.py:21} ERROR - Hello
{noformat}

Notice the timestamps - the contents of 2.log appear just before the last line 
of 1.log, and should be in the same log file (there is only a single run of 
this task instance)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Issue Comment Deleted] (AIRFLOW-1852) Allow hostname to be overridable

2017-11-27 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-1852:
---
Comment: was deleted

(was: Where is the hostname currently used in airflow? (I've been running fine 
without worrying about this, as I'm sure are lots of other people.)

{quote}
Since the web server calls out to the individual worker nodes to snag logs, 
what happens if one dies midway?
{quote}

There's support for writing task logs to GCS or S3 for more persistent storage.)

> Allow hostname to be overridable
> 
>
> Key: AIRFLOW-1852
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1852
> Project: Apache Airflow
>  Issue Type: Improvement
>Reporter: Trevor Joynson
>
> * https://github.com/apache/incubator-airflow/pull/2472
> This makes running Airflow tremendously easier in common
> production deployments that need a little more than just
> a bare `socket.getfqdn()` hostname for service discovery
> per running instance.
> Personally, I just place the Kubernetes Pod FQDN (or even IP) here.
> Question: Since the web server calls out to the individual
> worker nodes to snag logs, what happens if one dies midway?
> I may later look into that, because that scares me slightly.
> I feel like workers should not ever hold such state, but that's purely a 
> personal bias.
> Thanks,
> Trevor



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1852) Allow hostname to be overridable

2017-11-27 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267606#comment-16267606
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1852:


Where is the hostname currently used in airflow? (I've been running fine 
without worrying about this, as I'm sure are lots of other people.)

{quote}
Since the web server calls out to the individual worker nodes to snag logs, 
what happens if one dies midway?
{quote}

There's support for writing task logs to GCS or S3 for more persistent storage.

> Allow hostname to be overridable
> 
>
> Key: AIRFLOW-1852
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1852
> Project: Apache Airflow
>  Issue Type: Improvement
>Reporter: Trevor Joynson
>
> * https://github.com/apache/incubator-airflow/pull/2472
> This makes running Airflow tremendously easier in common
> production deployments that need a little more than just
> a bare `socket.getfqdn()` hostname for service discovery
> per running instance.
> Personally, I just place the Kubernetes Pod FQDN (or even IP) here.
> Question: Since the web server calls out to the individual
> worker nodes to snag logs, what happens if one dies midway?
> I may later look into that, because that scares me slightly.
> I feel like workers should not ever hold such state, but that's purely a 
> personal bias.
> Thanks,
> Trevor



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1852) Allow hostname to be overridable

2017-11-27 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267607#comment-16267607
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1852:


Where is the hostname currently used in airflow? (I've been running fine 
without worrying about this, as I'm sure are lots of other people.)

{quote}
Since the web server calls out to the individual worker nodes to snag logs, 
what happens if one dies midway?
{quote}

There's support for writing task logs to GCS or S3 for more persistent storage.

> Allow hostname to be overridable
> 
>
> Key: AIRFLOW-1852
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1852
> Project: Apache Airflow
>  Issue Type: Improvement
>Reporter: Trevor Joynson
>
> * https://github.com/apache/incubator-airflow/pull/2472
> This makes running Airflow tremendously easier in common
> production deployments that need a little more than just
> a bare `socket.getfqdn()` hostname for service discovery
> per running instance.
> Personally, I just place the Kubernetes Pod FQDN (or even IP) here.
> Question: Since the web server calls out to the individual
> worker nodes to snag logs, what happens if one dies midway?
> I may later look into that, because that scares me slightly.
> I feel like workers should not ever hold such state, but that's purely a 
> personal bias.
> Thanks,
> Trevor



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AIRFLOW-1845) Modal background doesn't cover wide or tall pages

2017-11-24 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-1845:
---
Fix Version/s: 1.10.0
  Component/s: ui

> Modal background doesn't cover wide or tall pages
> -
>
> Key: AIRFLOW-1845
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1845
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: ui
>Affects Versions: 1.8.1, 1.9.0
>Reporter: Ash Berlin-Taylor
>Assignee: Ash Berlin-Taylor
> Fix For: 1.10.0
>
> Attachments: Screen Shot 2017-11-24 at 12.19.50.png
>
>
> If there is any kind of scrolling on the page behind a modal pop up then the 
> grey background behind the modal dialog doesn't correctly cover outside the 
> "first" page view. For example see the attached screenshot (which for some 
> reason Jira isn't embedding...)
> !Screen Shot 2017-11-24 at 12.19.50.png|thumbnail!
> To reproduce: go to a long or tall dag tree view page, scroll first, then 
> click on a task instance to get the modal popup.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1837) Differing start_dates on tasks not respected by scheduler.

2017-11-24 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265460#comment-16265460
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1837:


Looking at the code I'm not quite sure how I saw this behaviour, or even if I 
really did. I will need to go and test this again.

> Differing start_dates on tasks not respected by scheduler.
> --
>
> Key: AIRFLOW-1837
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1837
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Ash Berlin-Taylor
>
> It is possible to specify start_date directly on tasks in a dag, as well as on 
> the DAG. This is correctly handled when creating dag runs, but it is 
> seemingly ignored when scheduling tasks.
> Given this example:
> {code}
> dag_args = {
> "start_date": datetime(2017, 9, 4),
> }
> dag = DAG(
> "my-dag",
> default_args=dag_args,
> schedule_interval="0 0 * * Mon",
> )
> # ...
> with dag:
> op = PythonOperator(
> python_callable=fetcher.run,
> task_id="fetch_all_respondents",
> provide_context=True,
> # The "unfiltered" API calls are a lot quicker, so lets put them
> # ahead of any other filtered job in the queue.
> priority_weight=10,
> start_date=datetime(2014, 9, 1),
> )
> op = PythonOperator(
> python_callable=fetcher.run,
> task_id="fetch_by_demographics",
> op_kwargs={
> 'demo_names': demo_names,
> },
> provide_context=True,
> priority_weight=5,
> )
> {code}
> I only want the fetch_all_respondents tasks to run for 2014..2017, and then 
> from September 2017 I also want the fetch_by_demographics task to run. 
> However right now both tasks are being scheduled from 2014-09-01.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AIRFLOW-1845) Modal background doesn't cover wide or tall pages

2017-11-24 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-1845:
---
Description: 
If there is any kind of scrolling on the page behind a modal pop up then the 
grey background behind the modal dialog doesn't correctly cover outside the 
"first" page view. For example see the attached screenshot (which for some 
reason Jira isn't embedding...)

!Screen Shot 2017-11-24 at 12.19.50.png|thumbnail!

To reproduce: go to a long or tall dag tree view page, scroll first, then click 
on a task instance to get the modal popup.

  was:
If there is any kind of scrolling on the page behind a modal pop up then the 
grey background behind the modal dialog doesn't correctly cover outside the 
"first" page view. For example:

!Screen Shot 2017-11-24 at 12.19.50.png|thumbnail!

To reproduce: go to a long or tall dag tree view page, scroll first, then click 
on a task instance to get the modal popup.


> Modal background doesn't cover wide or tall pages
> -
>
> Key: AIRFLOW-1845
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1845
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.8.1, 1.9.0
>Reporter: Ash Berlin-Taylor
>Assignee: Ash Berlin-Taylor
> Attachments: Screen Shot 2017-11-24 at 12.19.50.png
>
>
> If there is any kind of scrolling on the page behind a modal pop up then the 
> grey background behind the modal dialog doesn't correctly cover outside the 
> "first" page view. For example see the attached screenshot (which for some 
> reason Jira isn't embedding...)
> !Screen Shot 2017-11-24 at 12.19.50.png|thumbnail!
> To reproduce: go to a long or tall dag tree view page, scroll first, then 
> click on a task instance to get the modal popup.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AIRFLOW-1845) Modal background doesn't cover wide or tall pages

2017-11-24 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-1845:
--

 Summary: Modal background doesn't cover wide or tall pages
 Key: AIRFLOW-1845
 URL: https://issues.apache.org/jira/browse/AIRFLOW-1845
 Project: Apache Airflow
  Issue Type: Bug
Affects Versions: 1.8.1, 1.9.0
Reporter: Ash Berlin-Taylor
Assignee: Ash Berlin-Taylor
 Attachments: Screen Shot 2017-11-24 at 12.19.50.png

If there is any kind of scrolling on the page behind a modal pop up then the 
grey background behind the modal dialog doesn't correctly cover outside the 
"first" page view. For example:

!Screen Shot 2017-11-24 at 12.19.50.png|thumbnail!

To reproduce: go to a long or tall dag tree view page, scroll first, then click 
on a task instance to get the modal popup.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AIRFLOW-1839) S3Hook.list_keys throws exception

2017-11-21 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-1839:
---
External issue URL: https://github.com/apache/incubator-airflow/pull/2805

> S3Hook.list_keys throws exception
> -
>
> Key: AIRFLOW-1839
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1839
> Project: Apache Airflow
>  Issue Type: Bug
>Reporter: Ash Berlin-Taylor
>
> {noformat}
>   File "/usr/local/lib/python3.5/site-packages/airflow/hooks/S3_hook.py", 
> line 104, in list_keys
> return [k.Key for k in response['Contents']] if 
> response.get('Contents') else None
>   File "/usr/local/lib/python3.5/site-packages/airflow/hooks/S3_hook.py", 
> line 104, in 
> return [k.Key for k in response['Contents']] if 
> response.get('Contents') else None
> AttributeError: 'dict' object has no attribute 'Key'
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AIRFLOW-1839) S3Hook.list_keys throws exception

2017-11-21 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-1839:
--

 Summary: S3Hook.list_keys throws exception
 Key: AIRFLOW-1839
 URL: https://issues.apache.org/jira/browse/AIRFLOW-1839
 Project: Apache Airflow
  Issue Type: Bug
Reporter: Ash Berlin-Taylor


{noformat}
  File "/usr/local/lib/python3.5/site-packages/airflow/hooks/S3_hook.py", 
line 104, in list_keys
return [k.Key for k in response['Contents']] if 
response.get('Contents') else None
  File "/usr/local/lib/python3.5/site-packages/airflow/hooks/S3_hook.py", 
line 104, in 
return [k.Key for k in response['Contents']] if 
response.get('Contents') else None
AttributeError: 'dict' object has no attribute 'Key'
{noformat}
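The traceback suggests the fix is simply to treat each entry of {{Contents}} as the plain dict that boto3 returns rather than a boto2 Key object; a sketch of the corrected line (the actual change may differ):

{code}
# Each item in response['Contents'] is a dict under boto3, so the key name
# must be read with subscript access instead of attribute access.
return [k['Key'] for k in response['Contents']] if response.get('Contents') else None
{code}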



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-843) Store task exceptions in context

2017-11-21 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260949#comment-16260949
 ] 

Ash Berlin-Taylor commented on AIRFLOW-843:
---

I think option 1 is a "softer" approach that will mean existing on-failure hooks in 
people's code bases won't suddenly error (due to signature mismatch).

But we can go with option 2 so long as it's mentioned in UPDATING.md.
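For illustration, a minimal sketch of how option 1 could look from a DAG author's point of view, assuming the exception ends up in the context under an {{exception}} key as this ticket proposes (the key name, {{my_callable}} and {{dag}} here are placeholders, not an implemented API):

{code}
from airflow.operators.python_operator import PythonOperator


def notify_failure(context):
    # 'exception' would only be present when the task actually raised, so
    # existing callbacks that never look for it keep working (option 1).
    exc = context.get('exception')
    ti = context['task_instance']
    print("Task %s failed with: %r" % (ti.task_id, exc))


task = PythonOperator(
    task_id='my_task',
    python_callable=my_callable,
    on_failure_callback=notify_failure,
    dag=dag,
)
{code}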

> Store task exceptions in context
> 
>
> Key: AIRFLOW-843
> URL: https://issues.apache.org/jira/browse/AIRFLOW-843
> Project: Apache Airflow
>  Issue Type: Improvement
>Reporter: Scott Kruger
>Priority: Minor
>
> If a task encounters an exception during execution, it should store the 
> exception on the execution context so that other methods (namely 
> `on_failure_callback` can access it.  This would help with custom error 
> integrations, e.g. Sentry.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1753) Can't install on windows 10

2017-11-21 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260862#comment-16260862
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1753:


This is somewhat unrelated to Airflow -- one of the Python modules we depend 
upon needs to compile something, and Python isn't properly configured to find 
the toolchain.

The full output would include the name of the module that is being installed at 
the time of the error, but more generally: look for a guide about installing 
Python on Windows.

> Can't install on windows 10
> ---
>
> Key: AIRFLOW-1753
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1753
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: Lakshman Udayakantha
>
> When I installed airflow using the "pip install airflow" command, two errors 
> popped up.
> 1.  link.exe failed with exit status 1158
> 2.\x86_amd64\\cl.exe' failed with exit status 2
> The first issue can be solved by referring to 
> https://stackoverflow.com/questions/43858836/python-installing-clarifai-vs14-0-link-exe-failed-with-exit-status-1158/44563421#44563421.
> But the second issue is still there, and I couldn't find a solution by googling 
> either. How can I prevent that issue and install Airflow on Windows 10 x64?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AIRFLOW-1837) Differing start_dates on tasks not respected by scheduler.

2017-11-21 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-1837:
--

 Summary: Differing start_dates on tasks not respected by scheduler.
 Key: AIRFLOW-1837
 URL: https://issues.apache.org/jira/browse/AIRFLOW-1837
 Project: Apache Airflow
  Issue Type: Bug
Affects Versions: 1.9.0
Reporter: Ash Berlin-Taylor


It is possible to specify start_date directly on tasks in a dag, as well as on 
the DAG. This is correctly handled when creating dag runs, but it is seemingly 
ignored when scheduling tasks.

Given this example:

{code}
dag_args = {
    "start_date": datetime(2017, 9, 4),
}
dag = DAG(
    "my-dag",
    default_args=dag_args,
    schedule_interval="0 0 * * Mon",
)

# ...
with dag:
    op = PythonOperator(
        python_callable=fetcher.run,
        task_id="fetch_all_respondents",
        provide_context=True,
        # The "unfiltered" API calls are a lot quicker, so lets put them
        # ahead of any other filtered job in the queue.
        priority_weight=10,
        start_date=datetime(2014, 9, 1),
    )

    op = PythonOperator(
        python_callable=fetcher.run,
        task_id="fetch_by_demographics",
        op_kwargs={
            'demo_names': demo_names,
        },
        provide_context=True,
        priority_weight=5,
    )
{code}

I only want the fetch_all_respondents tasks to run for 2014..2017, and then 
from September 2017 I also want the fetch_by_demographics task to run. However 
right now both tasks are being scheduled from 2014-09-01.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AIRFLOW-1795) S3Hook no longer accepts s3_conn_id breaking build in ops/sensors and back-compat

2017-11-17 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-1795:
---
External issue URL: https://github.com/apache/incubator-airflow/pull/2795

> S3Hook no longer accepts s3_conn_id breaking build in ops/sensors and 
> back-compat 
> --
>
> Key: AIRFLOW-1795
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1795
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Ash Berlin-Taylor
> Fix For: 1.9.0
>
>
> Found whilst testing Airflow 1.9.0rc1
> Previously the S3Hook accepted a parameter of {{s3_conn_id}}. As part of 
> AIRFLOW-1520 we moved S3Hook to have a superclass of AWSHook, which accepts a 
> {{aws_conn_id}} parameter instead.
> This break back-compat generally, and more specifically it breaks the built 
> in S3KeySensor which does this:
> {code}
> def poke(self, context):
> from airflow.hooks.S3_hook import S3Hook
> hook = S3Hook(s3_conn_id=self.s3_conn_id)
> {code}
> There are a few other instances of s3_conn_id in the code base that will also 
> probably need updating/tweaking.
> My first though was to add a shim mapping s3_conn_id to aws_conn_id in the 
> S3Hook with a deprecation warning but the surface area with places where this 
> is exposed is larger. I could add such a deprecation warning to all of these. 
> Anyone have thoughts as to best way?
> - Rename all instances with deprecation warnings.
> - S3Hook accepts {{s3_conn_id}} and passes down to {{aws_conn_id}} in 
> superclass.
> - Update existing references in code base to {{aws_conn_id}}, and not in 
> updating about need to update in user code. (This is my least preferred 
> option.)
> {noformat}
> airflow/operators/redshift_to_s3_operator.py
> 33::param s3_conn_id: reference to a specific S3 connection
> 34::type s3_conn_id: string
> 51:s3_conn_id='s3_default',
> 62:self.s3_conn_id = s3_conn_id
> 69:self.s3 = S3Hook(s3_conn_id=self.s3_conn_id)
> airflow/operators/s3_file_transform_operator.py
> 40::param source_s3_conn_id: source s3 connection
> 41::type source_s3_conn_id: str
> 44::param dest_s3_conn_id: destination s3 connection
> 45::type dest_s3_conn_id: str
> 62:source_s3_conn_id='s3_default',
> 63:dest_s3_conn_id='s3_default',
> 68:self.source_s3_conn_id = source_s3_conn_id
> 70:self.dest_s3_conn_id = dest_s3_conn_id
> 75:source_s3 = S3Hook(s3_conn_id=self.source_s3_conn_id)
> 76:dest_s3 = S3Hook(s3_conn_id=self.dest_s3_conn_id)
> airflow/operators/s3_to_hive_operator.py
> 74::param s3_conn_id: source s3 connection
> 75::type s3_conn_id: str
> 102:s3_conn_id='s3_default',
> 119:self.s3_conn_id = s3_conn_id
> 130:self.s3 = S3Hook(s3_conn_id=self.s3_conn_id)
> airflow/operators/sensors.py
> 504::param s3_conn_id: a reference to the s3 connection
> 505::type s3_conn_id: str
> 514:s3_conn_id='s3_default',
> 531:self.s3_conn_id = s3_conn_id
> 535:hook = S3Hook(s3_conn_id=self.s3_conn_id)
> 568:s3_conn_id='s3_default',
> 576:self.s3_conn_id = s3_conn_id
> 582:hook = S3Hook(s3_conn_id=self.s3_conn_id)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1795) S3Hook no longer accepts s3_conn_id breaking build in ops/sensors and back-compat

2017-11-16 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16255672#comment-16255672
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1795:


Argh, I just noticed that the S3ToHiveOperator just won't work as it still 
expects the boto2 return types from the S3Hook (for similar reasons to those 
addressed in https://github.com/apache/incubator-airflow/pull/2773 - the boto2 
API was mocked, so tests still pass). I don't have time (today) to fix that.

> S3Hook no longer accepts s3_conn_id breaking build in ops/sensors and 
> back-compat 
> --
>
> Key: AIRFLOW-1795
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1795
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Ash Berlin-Taylor
> Fix For: 1.9.0
>
>
> Found whilst testing Airflow 1.9.0rc1
> Previously the S3Hook accepted a parameter of {{s3_conn_id}}. As part of 
> AIRFLOW-1520 we moved S3Hook to have a superclass of AWSHook, which accepts a 
> {{aws_conn_id}} parameter instead.
> This break back-compat generally, and more specifically it breaks the built 
> in S3KeySensor which does this:
> {code}
> def poke(self, context):
> from airflow.hooks.S3_hook import S3Hook
> hook = S3Hook(s3_conn_id=self.s3_conn_id)
> {code}
> There are a few other instances of s3_conn_id in the code base that will also 
> probably need updating/tweaking.
> My first though was to add a shim mapping s3_conn_id to aws_conn_id in the 
> S3Hook with a deprecation warning but the surface area with places where this 
> is exposed is larger. I could add such a deprecation warning to all of these. 
> Anyone have thoughts as to best way?
> - Rename all instances with deprecation warnings.
> - S3Hook accepts {{s3_conn_id}} and passes down to {{aws_conn_id}} in 
> superclass.
> - Update existing references in code base to {{aws_conn_id}}, and not in 
> updating about need to update in user code. (This is my least preferred 
> option.)
> {noformat}
> airflow/operators/redshift_to_s3_operator.py
> 33::param s3_conn_id: reference to a specific S3 connection
> 34::type s3_conn_id: string
> 51:s3_conn_id='s3_default',
> 62:self.s3_conn_id = s3_conn_id
> 69:self.s3 = S3Hook(s3_conn_id=self.s3_conn_id)
> airflow/operators/s3_file_transform_operator.py
> 40::param source_s3_conn_id: source s3 connection
> 41::type source_s3_conn_id: str
> 44::param dest_s3_conn_id: destination s3 connection
> 45::type dest_s3_conn_id: str
> 62:source_s3_conn_id='s3_default',
> 63:dest_s3_conn_id='s3_default',
> 68:self.source_s3_conn_id = source_s3_conn_id
> 70:self.dest_s3_conn_id = dest_s3_conn_id
> 75:source_s3 = S3Hook(s3_conn_id=self.source_s3_conn_id)
> 76:dest_s3 = S3Hook(s3_conn_id=self.dest_s3_conn_id)
> airflow/operators/s3_to_hive_operator.py
> 74::param s3_conn_id: source s3 connection
> 75::type s3_conn_id: str
> 102:s3_conn_id='s3_default',
> 119:self.s3_conn_id = s3_conn_id
> 130:self.s3 = S3Hook(s3_conn_id=self.s3_conn_id)
> airflow/operators/sensors.py
> 504::param s3_conn_id: a reference to the s3 connection
> 505::type s3_conn_id: str
> 514:s3_conn_id='s3_default',
> 531:self.s3_conn_id = s3_conn_id
> 535:hook = S3Hook(s3_conn_id=self.s3_conn_id)
> 568:s3_conn_id='s3_default',
> 576:self.s3_conn_id = s3_conn_id
> 582:hook = S3Hook(s3_conn_id=self.s3_conn_id)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1146) izip use in Python 3.4

2017-11-15 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253839#comment-16253839
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1146:


Fixed in 1.9.0

> izip use in Python 3.4
> --
>
> Key: AIRFLOW-1146
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1146
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: hive_hooks
>Affects Versions: Airflow 1.8
>Reporter: Alexander Panzhin
>
> Python 3 no longer has itertools.izip, but it is still used in 
> airflow/hooks/hive_hooks.py
> This causes all kinds of havoc.
> This needs to be fixed if this is to be used on Python 3+.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1791) Unexpected "AttributeError: 'unicode' object has no attribute 'val'" from Variable.setdefault

2017-11-15 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253833#comment-16253833
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1791:


Fixed by 1177, merged and will be included in 1.9.0.

> Unexpected "AttributeError: 'unicode' object has no attribute 'val'" from 
> Variable.setdefault
> -
>
> Key: AIRFLOW-1791
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1791
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: core
>Affects Versions: Airflow 1.8
> Environment: Python 2.7, Airflow 1.8.2
>Reporter: Shawn Wang
>
> In Variable.setdefault method,
> {code:python}
> obj = Variable.get(key, default_var=default_sentinel, deserialize_json=False)
> if obj is default_sentinel:
>     # ...
> else:
>     if deserialize_json:
>         return json.loads(obj.val)
>     else:
>         return obj.val
> {code}
> However, obj is retrieved by the "get" method, which has already returned the 
> val attribute's value, so this "obj.val" throws the AttributeError.
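Since {{Variable.get}} already returns the stored (and optionally deserialized) value rather than the ORM row, the shape of the fix is presumably to return that value directly; a sketch only, the merged change may differ:

{code}
obj = Variable.get(key, default_var=default_sentinel,
                   deserialize_json=deserialize_json)
if obj is default_sentinel:
    Variable.set(key, default, serialize_json=deserialize_json)
    return default
else:
    # obj is already the value (deserialized when requested), not a Variable
    # row, so there is no .val attribute to dereference here.
    return obj
{code}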



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1814) Add op_args and op_kwargs in PythonOperator templated fields

2017-11-13 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249969#comment-16249969
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1814:


They could be added, but there is a way already by using 
{{provide_context=True}}. When you set that to True, everything that is 
accessible from a Jinja template is accessible as a named parameter:

{code}
def consume_value(task_instance, **kwargs):
    my_xcom_value = task_instance.xcom_pull(task_ids=None, key='my_xcom_key')

value_consumer_task = PythonOperator(
    task_id='value_consumer_task',
    provide_context=True,
    python_callable=consume_value,
    dag=dag,
)
{code}

I can see how having it be templated directly might make some things nicer 
though.

> Add op_args and op_kwargs in PythonOperator templated fields
> 
>
> Key: AIRFLOW-1814
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1814
> Project: Apache Airflow
>  Issue Type: Wish
>  Components: operators
>Affects Versions: Airflow 1.8, 1.8.0
>Reporter: Galak
>Priority: Minor
>
> *I'm wondering if "_op_args_" and "_op_kwargs_" PythonOperator parameters 
> could be templated.*
> I have 2 different use cases where this change could help a lot:
> +1/ Provide some job execution information as a python callable argument:+
> let's explain it through a simple example:
> {code}
> simple_task = PythonOperator(
> task_id='simple_task',
> provide_context=True,
> python_callable=extract_data,
> op_args=[
>   "my_db_connection_id",
>   "select * from my_table",
>   "/data/{dag.dag_id}/{ts}/my_export.csv"
> ],
> dag=dag
> )
> {code}
> "extract_data" python function seems to be simple here, but it could be 
> anything re-usable in multiple dags...
> +2/ Provide some XCom value as a python callable argument:+
> Let's say I a have a task which is retrieving or calculating a value, and 
> then storing it in an XCom for further use by other tasks:
> {code}
> value_producer_task = PythonOperator(
> task_id='value_producer_task',
> provide_context=True,
> python_callable=produce_value,
> op_args=[
>   "my_db_connection_id",
>   "some_other_static_parameter",
>   "my_xcom_key"
> ],
> dag=dag
> )
> {code}
> Then I can just configure a PythonCallable task to use the produced value:
> {code}
> value_consumer_task = PythonOperator(
> task_id='value_consumer_task',
> provide_context=True,
> python_callable=consume_value,
> op_args=[
>   "{{ task_instance.xcom_pull(task_ids=None, key='my_xcom_key') }}"
> ],
> dag=dag
> )
> {code}
> I quickly tried the following class:
> {code}
> from airflow.operators.python_operator import PythonOperator
>
> class MyPythonOperator(PythonOperator):
>     template_fields = PythonOperator.template_fields + ('op_args', 'op_kwargs')
> {code}
> and it worked like a charm.
> So could these 2 arguments be added to template_fields? Or did I miss some 
> major drawback to this change?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1756) S3 Task Handler Cannot Read Logs With New S3Hook

2017-11-09 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245860#comment-16245860
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1756:


PR to fix underlying issue and expand tests 
https://github.com/apache/incubator-airflow/pull/2773

> S3 Task Handler Cannot Read Logs With New S3Hook
> 
>
> Key: AIRFLOW-1756
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1756
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Colin Son
>Priority: Critical
> Fix For: 1.9.0
>
>
> With the changes to the S3Hook, it seems like it cannot read the S3 task logs.
> In `s3_read` in S3TaskHandler.py:
> {code}
> s3_key = self.hook.get_key(remote_log_location)
> if s3_key:
>     return s3_key.get_contents_as_string().decode()
> {code}
> Since the s3_key object is now a dict, you cannot call 
> `get_contents_as_string()` on a dict object. You have to use the S3Hook's 
> `read_key()` method to read the contents of the task logs now.
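For illustration, a rough sketch of how {{s3_read}} could use the boto3-based hook (the {{parse_s3_url}} and {{read_key}} helpers exist on the new S3Hook, but this is not necessarily the change merged in the PR linked above):

{code}
def s3_read(self, remote_log_location, return_error=False):
    # remote_log_location is a full s3:// URL; split it and let the hook
    # return the object body as a string, instead of calling boto2's
    # get_contents_as_string() on what is now a plain dict.
    try:
        bucket, key = self.hook.parse_s3_url(remote_log_location)
        return self.hook.read_key(key, bucket_name=bucket)
    except Exception:
        err = 'Could not read logs from {}'.format(remote_log_location)
        return err if return_error else ''
{code}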



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AIRFLOW-1797) Cannot write task logs to S3 with Python3

2017-11-09 Thread Ash Berlin-Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor updated AIRFLOW-1797:
---
External issue URL: https://github.com/apache/incubator-airflow/pull/2771

> Cannot write task logs to S3 with Python3
> -
>
> Key: AIRFLOW-1797
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1797
> Project: Apache Airflow
>  Issue Type: Bug
>Reporter: Ash Berlin-Taylor
>
> {noformat}
> Traceback (most recent call last):
>   File 
> "/usr/local/lib/python3.5/dist-packages/airflow/utils/log/s3_task_handler.py",
>  line 161, in s3_write
> encrypt=configuration.getboolean('core', 'ENCRYPT_S3_LOGS'),
>   File "/usr/local/lib/python3.5/dist-packages/airflow/hooks/S3_hook.py", 
> line 253, in load_string
> client.upload_fileobj(filelike_buffer, bucket_name, key, 
> ExtraArgs=extra_args)
>   File "/usr/local/lib/python3.5/dist-packages/boto3/s3/inject.py", line 431, 
> in upload_fileobj
> return future.result()
>   File "/usr/local/lib/python3.5/dist-packages/s3transfer/futures.py", line 
> 73, in result
> return self._coordinator.result()
>   File "/usr/local/lib/python3.5/dist-packages/s3transfer/futures.py", line 
> 233, in result
> raise self._exception
>   File "/usr/local/lib/python3.5/dist-packages/s3transfer/tasks.py", line 
> 126, in __call__
> return self._execute_main(kwargs)
>   File "/usr/local/lib/python3.5/dist-packages/s3transfer/tasks.py", line 
> 150, in _execute_main
> return_value = self._main(**kwargs)
>   File "/usr/local/lib/python3.5/dist-packages/s3transfer/upload.py", line 
> 679, in _main
> client.put_object(Bucket=bucket, Key=key, Body=body, **extra_args)
>   File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 312, 
> in _api_call
> return self._make_api_call(operation_name, kwargs)
>   File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 586, 
> in _make_api_call
> request_signer=self._request_signer, context=request_context)
>   File "/usr/local/lib/python3.5/dist-packages/botocore/hooks.py", line 242, 
> in emit_until_response
> responses = self._emit(event_name, kwargs, stop_on_response=True)
>   File "/usr/local/lib/python3.5/dist-packages/botocore/hooks.py", line 210, 
> in _emit
> response = handler(**kwargs)
>   File "/usr/local/lib/python3.5/dist-packages/botocore/handlers.py", line 
> 201, in conditionally_calculate_md5
> calculate_md5(params, **kwargs)
>   File "/usr/local/lib/python3.5/dist-packages/botocore/handlers.py", line 
> 179, in calculate_md5
> binary_md5 = _calculate_md5_from_file(body)
>   File "/usr/local/lib/python3.5/dist-packages/botocore/handlers.py", line 
> 193, in _calculate_md5_from_file
> md5.update(chunk)
> TypeError: Unicode-objects must be encoded before hashing
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AIRFLOW-1797) Cannot write task logs to S3 with Python3

2017-11-09 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-1797:
--

 Summary: Cannot write task logs to S3 with Python3
 Key: AIRFLOW-1797
 URL: https://issues.apache.org/jira/browse/AIRFLOW-1797
 Project: Apache Airflow
  Issue Type: Bug
Reporter: Ash Berlin-Taylor


{noformat}
Traceback (most recent call last):
  File 
"/usr/local/lib/python3.5/dist-packages/airflow/utils/log/s3_task_handler.py", 
line 161, in s3_write
encrypt=configuration.getboolean('core', 'ENCRYPT_S3_LOGS'),
  File "/usr/local/lib/python3.5/dist-packages/airflow/hooks/S3_hook.py", line 
253, in load_string
client.upload_fileobj(filelike_buffer, bucket_name, key, 
ExtraArgs=extra_args)
  File "/usr/local/lib/python3.5/dist-packages/boto3/s3/inject.py", line 431, 
in upload_fileobj
return future.result()
  File "/usr/local/lib/python3.5/dist-packages/s3transfer/futures.py", line 73, 
in result
return self._coordinator.result()
  File "/usr/local/lib/python3.5/dist-packages/s3transfer/futures.py", line 
233, in result
raise self._exception
  File "/usr/local/lib/python3.5/dist-packages/s3transfer/tasks.py", line 126, 
in __call__
return self._execute_main(kwargs)
  File "/usr/local/lib/python3.5/dist-packages/s3transfer/tasks.py", line 150, 
in _execute_main
return_value = self._main(**kwargs)
  File "/usr/local/lib/python3.5/dist-packages/s3transfer/upload.py", line 679, 
in _main
client.put_object(Bucket=bucket, Key=key, Body=body, **extra_args)
  File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 312, 
in _api_call
return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 586, 
in _make_api_call
request_signer=self._request_signer, context=request_context)
  File "/usr/local/lib/python3.5/dist-packages/botocore/hooks.py", line 242, in 
emit_until_response
responses = self._emit(event_name, kwargs, stop_on_response=True)
  File "/usr/local/lib/python3.5/dist-packages/botocore/hooks.py", line 210, in 
_emit
response = handler(**kwargs)
  File "/usr/local/lib/python3.5/dist-packages/botocore/handlers.py", line 201, 
in conditionally_calculate_md5
calculate_md5(params, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/botocore/handlers.py", line 179, 
in calculate_md5
binary_md5 = _calculate_md5_from_file(body)
  File "/usr/local/lib/python3.5/dist-packages/botocore/handlers.py", line 193, 
in _calculate_md5_from_file
md5.update(chunk)
TypeError: Unicode-objects must be encoded before hashing
{noformat}
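The root cause is easy to reproduce outside Airflow: botocore MD5-hashes the upload body, and on Python 3 hashing requires bytes, so the text buffer handed to {{upload_fileobj}} has to be encoded first. A small self-contained illustration:

{code}
import hashlib

text = u"task log line\n"

# This is effectively what botocore does with the body and what fails on
# Python 3 when given a str:
#   hashlib.md5(text)  ->  TypeError: Unicode-objects must be encoded before hashing

# Encoding first (which is what load_string needs to do before building the
# file-like buffer for boto3) works fine:
print(hashlib.md5(text.encode('utf-8')).hexdigest())
{code}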



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AIRFLOW-1796) Badly configured logging config can throw exception in WWW logs view

2017-11-09 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-1796:
--

 Summary: Badly configured logging config can throw exception in 
WWW logs view
 Key: AIRFLOW-1796
 URL: https://issues.apache.org/jira/browse/AIRFLOW-1796
 Project: Apache Airflow
  Issue Type: Bug
Reporter: Ash Berlin-Taylor


It is possible to specify a custom logging config that changes the 
{{file.task}} handler to {{s3.task}} but forget to update the 
{{core.task_log_reader}} config setting.

This should be validated at start time, and the comments in the default 
logging config should mention that this setting needs updating too.
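For example, a custom logging config that renames the task handler to {{s3.task}} only works if airflow.cfg points the reader at the same handler name:

{code}
# airflow.cfg
[core]
# Must match the name of the task handler in the custom logging config;
# if it doesn't, the webserver's log view gets no handler back and raises
# the AttributeError shown below.
task_log_reader = s3.task
{code}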

This ends up as the following stack trace/error if you don't set it properly 
when trying to view task logs:

{noformat}
File "/usr/local/lib/python3.5/dist-packages/airflow/www/views.py", line 712, 
in log
   logs = handler.read(ti)
AttributeError: 'NoneType' object has no attribute 'read'
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AIRFLOW-1795) S3Hook no longer accepts s3_conn_id breaking build in ops/sensors and back-compat

2017-11-09 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-1795:
--

 Summary: S3Hook no longer accepts s3_conn_id breaking build in 
ops/sensors and back-compat 
 Key: AIRFLOW-1795
 URL: https://issues.apache.org/jira/browse/AIRFLOW-1795
 Project: Apache Airflow
  Issue Type: Bug
Affects Versions: 1.9.0
Reporter: Ash Berlin-Taylor
 Fix For: 1.9.0


Found whilst testing Airflow 1.9.0rc1

Previously the S3Hook accepted a parameter of {{s3_conn_id}}. As part of 
AIRFLOW-1520 we moved S3Hook to have a superclass of AWSHook, which accepts an 
{{aws_conn_id}} parameter instead.

This breaks back-compat generally, and more specifically it breaks the built-in 
S3KeySensor, which does this:

{code}
def poke(self, context):
    from airflow.hooks.S3_hook import S3Hook
    hook = S3Hook(s3_conn_id=self.s3_conn_id)
{code}

There are a few other instances of s3_conn_id in the code base that will also 
probably need updating/tweaking.

My first thought was to add a shim mapping s3_conn_id to aws_conn_id in the 
S3Hook with a deprecation warning, but the surface area where this is exposed 
is larger. I could add such a deprecation warning to all of these. Anyone have 
thoughts as to the best way?

- Rename all instances with deprecation warnings.
- S3Hook accepts {{s3_conn_id}} and passes it down to {{aws_conn_id}} in the 
superclass (a rough sketch of this option follows the listing below).
- Update existing references in the code base to {{aws_conn_id}}, and add a 
note in UPDATING.md about the need to update user code. (This is my least 
preferred option.)

{noformat}
airflow/operators/redshift_to_s3_operator.py
33::param s3_conn_id: reference to a specific S3 connection
34::type s3_conn_id: string
51:s3_conn_id='s3_default',
62:self.s3_conn_id = s3_conn_id
69:self.s3 = S3Hook(s3_conn_id=self.s3_conn_id)

airflow/operators/s3_file_transform_operator.py
40::param source_s3_conn_id: source s3 connection
41::type source_s3_conn_id: str
44::param dest_s3_conn_id: destination s3 connection
45::type dest_s3_conn_id: str
62:source_s3_conn_id='s3_default',
63:dest_s3_conn_id='s3_default',
68:self.source_s3_conn_id = source_s3_conn_id
70:self.dest_s3_conn_id = dest_s3_conn_id
75:source_s3 = S3Hook(s3_conn_id=self.source_s3_conn_id)
76:dest_s3 = S3Hook(s3_conn_id=self.dest_s3_conn_id)

airflow/operators/s3_to_hive_operator.py
74::param s3_conn_id: source s3 connection
75::type s3_conn_id: str
102:s3_conn_id='s3_default',
119:self.s3_conn_id = s3_conn_id
130:self.s3 = S3Hook(s3_conn_id=self.s3_conn_id)

airflow/operators/sensors.py
504::param s3_conn_id: a reference to the s3 connection
505::type s3_conn_id: str
514:s3_conn_id='s3_default',
531:self.s3_conn_id = s3_conn_id
535:hook = S3Hook(s3_conn_id=self.s3_conn_id)
568:s3_conn_id='s3_default',
576:self.s3_conn_id = s3_conn_id
582:hook = S3Hook(s3_conn_id=self.s3_conn_id)
{noformat}
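A rough sketch of the second option, accepting {{s3_conn_id}} on S3Hook and passing it down to {{aws_conn_id}} with a deprecation warning (illustrative only, not a final patch):

{code}
import warnings

from airflow.contrib.hooks.aws_hook import AwsHook


class S3Hook(AwsHook):
    def __init__(self, aws_conn_id='aws_default', s3_conn_id=None):
        if s3_conn_id is not None:
            # Keep old call sites working but nudge users towards the new name.
            warnings.warn(
                "The s3_conn_id parameter is deprecated, use aws_conn_id instead",
                DeprecationWarning,
                stacklevel=2,
            )
            aws_conn_id = s3_conn_id
        super(S3Hook, self).__init__(aws_conn_id=aws_conn_id)
{code}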



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1765) Default API auth backed should deny all.

2017-10-31 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226700#comment-16226700
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1765:


I have created two PRs that fix this in different ways. Only one should 
be used and the other closed unmerged.

- https://github.com/apache/incubator-airflow/pull/2736 - default backend 
denies all, with an allow_all backend added
- https://github.com/apache/incubator-airflow/pull/2737 - default backend still 
allows all, with a deny_all backend added.

In both cases there remains an airflow.api.auth.backend.default so that 
existing configs won't suddenly break.

> Default API auth backed should deny all.
> 
>
> Key: AIRFLOW-1765
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1765
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: api, authentication
>Affects Versions: 1.8.2
>Reporter: Ash Berlin-Taylor
>  Labels: security
> Fix For: 1.9.0
>
>
> It has been discovered that the experimental API in the default configuration 
> is not protected behind any authentication.
> This means that out of the box the Airflow webserver's /api/experimental/ can 
> be requested by anyone, meaning pools can be updated/deleted and task 
> instance variables can be read.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1765) Default API auth backed should deny all.

2017-10-30 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225789#comment-16225789
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1765:


Fair points.

I think the surprising thing to me was that it defaults to allow/isn't 
mentioned in https://airflow.apache.org/security.html -- and I think from the 
code there's no way to not make it open by default other than maybe to 
mis-configure an api auth backend? I don't have the code in front of me and 
it's late so I might be way off on this.

My plan will be to create 2 or 3 API auth backends: a "denyAll" (and make this 
the default now), and an "allowAll" to get the old behaviour back. It might also 
be worth creating a "sessionAuth" which just needs a valid login using whatever 
mechanism the front end allows. (#3 is probably optional for closing this hole.)

Suitable doc updates to go with this.

Sound reasonable?
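For illustration, a minimal sketch of what a "deny all" backend could look like, following the shape of the existing API auth backend modules (module contents and wording are illustrative, not the merged code):

{code}
from functools import wraps

from flask import Response

CLIENT_AUTH = None


def init_app(app):
    """Nothing to set up; this backend rejects every request."""
    pass


def requires_authentication(function):
    @wraps(function)
    def decorated(*args, **kwargs):
        # Deny by default: the experimental API stays closed unless the
        # operator explicitly configures a more permissive backend.
        return Response("Forbidden", 403)
    return decorated
{code}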

> Default API auth backed should deny all.
> 
>
> Key: AIRFLOW-1765
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1765
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: api, authentication
>Affects Versions: 1.8.2
>Reporter: Ash Berlin-Taylor
>  Labels: security
> Fix For: 1.9.0
>
>
> It has been discovered that the experimental API in the default configuration 
> is not protected behind any authentication.
> This means that out of the box the Airflow webserver's /api/experimental/ can 
> be requested by anyone, meaning pools can be updated/deleted and task 
> instance variables can be read.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AIRFLOW-1765) Default API auth backed should deny all.

2017-10-30 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225454#comment-16225454
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1765:


The /dags page needs to not use the experimental API before we can deny by 
default.

> Default API auth backed should deny all.
> 
>
> Key: AIRFLOW-1765
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1765
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: api, authentication
>Affects Versions: 1.8.2
>Reporter: Ash Berlin-Taylor
>Priority: Critical
>  Labels: security
> Fix For: 1.9.0
>
>
> It has been discovered that the experimental API in the default configuration 
> is not protected behind any authentication.
> This means that out of the box the Airflow webserver's /api/experimental/ can 
> be requested by anyone, meaning pools can be updated/deleted and task 
> instance variables can be read.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AIRFLOW-1765) Default API auth backed should deny all.

2017-10-30 Thread Ash Berlin-Taylor (JIRA)
Ash Berlin-Taylor created AIRFLOW-1765:
--

 Summary: Default API auth backed should deny all.
 Key: AIRFLOW-1765
 URL: https://issues.apache.org/jira/browse/AIRFLOW-1765
 Project: Apache Airflow
  Issue Type: Bug
  Components: api, authentication
Affects Versions: 1.8.2
Reporter: Ash Berlin-Taylor
Priority: Critical
 Fix For: 1.9.0


It has been discovered that the experimental API in the default configuration 
is not protected behind any authentication.

This means that out of the box the Airflow webserver's /api/experimental/ can 
be requested by anyone, meaning pools can be updated/deleted and task instance 
variables can be read.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

