[jira] [Commented] (AIRFLOW-2059) taskinstance query is awful, un-indexed, and does not scale

2018-03-01 Thread Tao Feng (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383269#comment-16383269
 ] 

Tao Feng commented on AIRFLOW-2059:
---

PR created: [https://github.com/apache/incubator-airflow/pull/3086]. I don't 
see any issues with indexing the job_id column, since its type is integer. Let 
me know if I missed anything.
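As a hedged illustration of why an index on job_id helps here: the quoted EXPLAIN shows a full scan plus filesort over ~2.5M rows just to serve `ORDER BY job_id DESC LIMIT 500`. The sketch below reproduces the effect with SQLite as a stand-in (the index name `ti_job_id` and the trimmed column list are assumptions, not necessarily what the PR uses).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down stand-in for the task_instance table from the quoted query.
conn.execute("""
    CREATE TABLE task_instance (
        task_id TEXT, dag_id TEXT, execution_date TEXT,
        job_id INTEGER
    )
""")

# The fix under discussion: an ordinary secondary index on job_id, so that
# ORDER BY job_id DESC LIMIT 500 can walk the index backwards instead of
# scanning every row and filesorting the result.
conn.execute("CREATE INDEX ti_job_id ON task_instance (job_id)")

# The planner now reports an index scan rather than a scan + temp sort.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM task_instance ORDER BY job_id DESC LIMIT 500"
).fetchall()
print(plan)
```

The same principle applies to MySQL: with the index present, the `Using filesort` step (248 seconds of "Creating sort index" in the quoted profile) disappears for this query shape.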


[jira] [Assigned] (AIRFLOW-2059) taskinstance query is awful, un-indexed, and does not scale

2018-03-01 Thread Tao Feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Feng reassigned AIRFLOW-2059:
-

Assignee: Tao Feng

> taskinstance query is awful, un-indexed, and does not scale
> ---
>
> Key: AIRFLOW-2059
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2059
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: db, webserver
>Affects Versions: Airflow 1.8
> Environment: [nhanlon@ ~]$ nproc
> 4
> [nhanlon@ ~]$ free -g
>  total   used   free sharedbuffers cached
> Mem: 7  5  1  0  0  1
> -/+ buffers/cache:  4  3 
> Swap:0  0  0 
> [nhanlon@ ~]$ cat /etc/*release*
> CentOS release 6.7 (Final)
> CentOS release 6.7 (Final)
> CentOS release 6.7 (Final)
> cpe:/o:centos:linux:6:GA
> [nhanlon@ ~]$ mysqld --version
> mysqld  Ver 5.6.31-77.0 for Linux on x86_64 (Percona Server (GPL), Release 
> 77.0, Revision 5c1061c)
>Reporter: Neil Hanlon
>Assignee: Tao Feng
>Priority: Critical
>
>  
> The page at /admin/taskinstance/ can reach a point where it blocks loading 
> the page and crushes the database. It appears this is because the 
> task_instance.job_id column is unindexed. On our database, getting the 
> results for this query took over four minutes, locking the table for the 
> duration.
>  
> 500 rows in set (4 min 8.93 sec)
>  
> Query:
>  
> {code:java}
> SELECT task_instance.task_id AS task_instance_task_id, task_instance.dag_id 
> AS task_instance_dag_id, task_instance.execution_date AS 
> task_instance_execution_date, task_instance.start_date AS 
> task_instance_start_date, task_instance.end_date AS task_instance_end_date, 
> task_instance.duration AS task_instance_duration, task_instance.state AS 
> task_instance_state, task_instance.try_number AS task_instance_try_number, 
> task_instance.hostname AS task_instance_hostname, task_instance.unixname AS 
> task_instance_unixname, task_instance.job_id AS task_instance_job_id, 
> task_instance.pool AS task_instance_pool, task_instance.queue AS 
> task_instance_queue, task_instance.priority_weight AS 
> task_instance_priority_weight, task_instance.operator AS 
> task_instance_operator, task_instance.queued_dttm AS 
> task_instance_queued_dttm, task_instance.pid AS task_instance_pid 
> FROM task_instance ORDER BY task_instance.job_id DESC 
> LIMIT 500;
> {code}
> Profile, explain:
>  
> {code:java}
> :airflow> EXPLAIN SELECT task_instance.task_id AS 
> task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, 
> task_instance.execution_date AS task_instance_execution_date, 
> task_instance.start_date AS task_instance_start_date, task_instance.end_date 
> AS task_instance_end_date, task_instance.duration AS task_instance_duration, 
> task_instance.state AS task_instance_state, task_instance.try_number AS 
> task_instance_try_number, task_instance.hostname AS task_instance_hostname, 
> task_instance.unixname AS task_instance_unixname, task_instance.job_id AS 
> task_instance_job_id, task_instance.pool AS task_instance_pool, 
> task_instance.queue AS task_instance_queue, task_instance.priority_weight AS 
> task_instance_priority_weight, task_instance.operator AS 
> task_instance_operator, task_instance.queued_dttm AS 
> task_instance_queued_dttm, task_instance.pid AS task_instance_pid 
> -> FROM task_instance ORDER BY task_instance.job_id DESC 
> -> LIMIT 500;
> +----+-------------+---------------+------+---------------+------+---------+------+---------+----------------+
> | id | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | Extra          |
> +----+-------------+---------------+------+---------------+------+---------+------+---------+----------------+
> |  1 | SIMPLE      | task_instance | ALL  | NULL          | NULL | NULL    | NULL | 2542776 | Using filesort |
> +----+-------------+---------------+------+---------------+------+---------+------+---------+----------------+
> 1 row in set (0.00 sec)
> :airflow> select count(*) from task_instance;
> +----------+
> | count(*) |
> +----------+
> |  2984749 |
> +----------+
> 1 row in set (1.67 sec)
> :airflow> show profile for query 2;
> +----------------------+------------+
> | Status               | Duration   |
> +----------------------+------------+
> | starting             | 0.000157   |
> | checking permissions | 0.17       |
> | Opening tables       | 0.33       |
> | init                 | 0.46       |
> | System lock          | 0.17       |
> | optimizing           | 0.10       |
> | statistics           | 0.22       |
> | preparing            | 0.20       |
> | Sorting result       | 0.10       |
> | executing            | 0.08       |
> | Sending data         | 0.000151   |
> | Creating sort index  | 248.955841 |
> | end                  | 0.015358   |
> | query end            | 0.12       |
> | closing tables       | 0.19       |
> | freeing items        | 0.000549   |
> | 

[jira] [Commented] (AIRFLOW-2159) Fix typos in salesforce_hook

2018-03-01 Thread Jakob Homan (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383201#comment-16383201
 ] 

Jakob Homan commented on AIRFLOW-2159:
--

Thanks, Dan! Resolving.

> Fix typos in salesforce_hook
> 
>
> Key: AIRFLOW-2159
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2159
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: hooks
>Reporter: Jakob Homan
>Assignee: Dan Fowler
>Priority: Major
> Fix For: Airflow 2.0
>
>
> There are several typos in the salesforce_hook file that would be a good 
> starter task to fix.
> {noformat}
> - ndjson:
> JSON array but each element is new-line deliminated
> instead of comman deliminated like in `json`
> This requires a significant amount of cleanup.
> Pandas doesn't handle output to CSV and json in a uniform way.
> This is especially painful for datetime types.
> Pandas wants to write them as strings in CSV,
> but as milisecond Unix timestamps.{noformat}
> To fix: comman, deliminated, milisecond.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AIRFLOW-2159) Fix typos in salesforce_hook

2018-03-01 Thread Jakob Homan (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Homan resolved AIRFLOW-2159.
--
Resolution: Fixed






[jira] [Updated] (AIRFLOW-2159) Fix typos in salesforce_hook

2018-03-01 Thread Jakob Homan (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Homan updated AIRFLOW-2159:
-
Fix Version/s: Airflow 2.0






[jira] [Commented] (AIRFLOW-2159) Fix typos in salesforce_hook

2018-03-01 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383198#comment-16383198
 ] 

ASF subversion and git services commented on AIRFLOW-2159:
--

Commit c7e39683d80caf89928f0002cf5214fe68d8775b in incubator-airflow's branch 
refs/heads/master from [~dfowler]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=c7e3968 ]

[AIRFLOW-2159] Fix a few typos in salesforce_hook







incubator-airflow git commit: [AIRFLOW-2159] Fix a few typos in salesforce_hook

2018-03-01 Thread jghoman
Repository: incubator-airflow
Updated Branches:
  refs/heads/master 2511c46c2 -> c7e39683d


[AIRFLOW-2159] Fix a few typos in salesforce_hook


Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/c7e39683
Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/c7e39683
Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/c7e39683

Branch: refs/heads/master
Commit: c7e39683d80caf89928f0002cf5214fe68d8775b
Parents: 2511c46
Author: dan-sf 
Authored: Thu Mar 1 09:46:37 2018 -0800
Committer: dan-sf 
Committed: Thu Mar 1 09:46:37 2018 -0800

--
 airflow/contrib/hooks/salesforce_hook.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/c7e39683/airflow/contrib/hooks/salesforce_hook.py
--
diff --git a/airflow/contrib/hooks/salesforce_hook.py 
b/airflow/contrib/hooks/salesforce_hook.py
index b82e4ca..bf03638 100644
--- a/airflow/contrib/hooks/salesforce_hook.py
+++ b/airflow/contrib/hooks/salesforce_hook.py
@@ -208,14 +208,14 @@ class SalesforceHook(BaseHook, LoggingMixin):
 - json:
 JSON array.  Each element in the array is a different row.
 - ndjson:
-JSON array but each element is new-line deliminated
-instead of comman deliminated like in `json`
+JSON array but each element is new-line delimited
+instead of comma delimited like in `json`
 
 This requires a significant amount of cleanup.
 Pandas doesn't handle output to CSV and json in a uniform way.
 This is especially painful for datetime types.
 Pandas wants to write them as strings in CSV,
-but as milisecond Unix timestamps.
+but as millisecond Unix timestamps.
 
 By default, this function will try and leave all values as
 they are represented in Salesforce.



[jira] [Commented] (AIRFLOW-2159) Fix typos in salesforce_hook

2018-03-01 Thread Dan Fowler (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382917#comment-16382917
 ] 

Dan Fowler commented on AIRFLOW-2159:
-

This PR resolves this ticket: 
https://github.com/apache/incubator-airflow/pull/3085






[jira] [Work started] (AIRFLOW-2150) Use get_partition_names() instead of get_partitions() in HiveMetastoreHook().max_partition()

2018-03-01 Thread Kevin Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on AIRFLOW-2150 started by Kevin Yang.
---
> Use get_partition_names() instead of get_partitions() in 
> HiveMetastoreHook().max_partition()
> 
>
> Key: AIRFLOW-2150
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2150
> Project: Apache Airflow
>  Issue Type: Bug
>Reporter: Kevin Yang
>Assignee: Kevin Yang
>Priority: Major
>
> get_partitions() is extremely expensive for large tables; max_partition() 
> should use get_partition_names() instead.
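A hedged sketch of the idea behind this change: partition *names* are cheap to fetch from the metastore, and the maximum partition value can be computed from them without materializing full partition objects. The helper below is illustrative, not the actual HiveMetastoreHook code; it assumes Hive's `key=value/key=value` partition-name format.

```python
def max_partition_from_names(partition_names, key="ds"):
    """Return the max value of `key` across Hive-style partition names."""
    values = []
    for name in partition_names:
        # Hive encodes partitions as "key1=val1/key2=val2/...".
        for part in name.split("/"):
            k, _, v = part.partition("=")
            if k == key:
                values.append(v)
    return max(values) if values else None

names = ["ds=2018-02-27", "ds=2018-02-28", "ds=2018-03-01"]
print(max_partition_from_names(names))  # -> 2018-03-01
```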





[jira] [Assigned] (AIRFLOW-2163) Add HBC Digital to list of companies using Airflow

2018-03-01 Thread Terry McCartan (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry McCartan reassigned AIRFLOW-2163:
---

Assignee: (was: Terry McCartan)

> Add HBC Digital to list of companies using Airflow
> --
>
> Key: AIRFLOW-2163
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2163
> Project: Apache Airflow
>  Issue Type: Bug
>Reporter: Terry McCartan
>Priority: Trivial
>






[jira] [Created] (AIRFLOW-2163) Add HBC Digital to list of companies using Airflow

2018-03-01 Thread Terry McCartan (JIRA)
Terry McCartan created AIRFLOW-2163:
---

 Summary: Add HBC Digital to list of companies using Airflow
 Key: AIRFLOW-2163
 URL: https://issues.apache.org/jira/browse/AIRFLOW-2163
 Project: Apache Airflow
  Issue Type: Bug
Reporter: Terry McCartan
Assignee: Terry McCartan








[jira] [Created] (AIRFLOW-2162) Run DAG as user other than airflow does NOT have access to AIRFLOW_ environment variables

2018-03-01 Thread Sebastian Radloff (JIRA)
Sebastian Radloff created AIRFLOW-2162:
--

 Summary: Run DAG as user other than airflow does NOT have access 
to AIRFLOW_ environment variables
 Key: AIRFLOW-2162
 URL: https://issues.apache.org/jira/browse/AIRFLOW-2162
 Project: Apache Airflow
  Issue Type: Bug
  Components: configuration
Reporter: Sebastian Radloff


When running Airflow with LocalExecutor, I inject Airflow environment variables 
that are supposed to override what is in airflow.cfg, according to the 
documentation (https://airflow.apache.org/configuration.html).

If you specify that your DAGs run 
as another Linux user, root for example, this is what Airflow executes under 
the hood:
{code:java}
['bash', '-c', u'sudo -H -u root airflow run docker_sample docker_op_tester 
2018-03-01T15:14:55.699668 --job_id 2 --raw -sd DAGS_FOLDER/docker-operator.py 
--cfg_path /tmp/tmpignV9B']
{code}
 

It uses sudo to switch to the root Linux user; unfortunately, that user won't 
have access to the environment variables injected to override the config. This 
is important for people who are trying to inject variables into a Docker 
container at run time while maintaining a level of security around database 
credentials.

I think a decent proposal, made by [~ashb] in Gitter, would be to automatically 
pass all environment variables starting with *AIRFLOW__* to any user. Please 
let me know if you want any help with the documentation, or point me in the 
right direction and I could create a PR. 
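A hedged sketch of that proposal: when building the `sudo -H -u <user> airflow run ...` command, explicitly forward every `AIRFLOW__*` variable, since sudo's default `env_reset` strips the caller's environment (whether `VAR=value` arguments are accepted depends on the sudoers `setenv` policy). The function and variable names below are illustrative, not Airflow's actual code.

```python
import os

def build_sudo_command(user, inner_cmd, environ=None):
    """Build a sudo command line that forwards AIRFLOW__* variables."""
    environ = environ if environ is not None else os.environ
    passthrough = [
        "{}={}".format(k, v)
        for k, v in sorted(environ.items())
        if k.startswith("AIRFLOW__")
    ]
    # "sudo VAR=value cmd ..." passes the listed variables to the child
    # process while everything else is still reset by sudo.
    return ["sudo", "-H", "-u", user] + passthrough + inner_cmd

cmd = build_sudo_command(
    "root",
    ["airflow", "run", "docker_sample", "docker_op_tester"],
    environ={
        "AIRFLOW__CORE__SQL_ALCHEMY_CONN": "postgres://...",  # forwarded
        "HOME": "/root",                                      # not forwarded
    },
)
print(cmd)
```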

 





[jira] [Commented] (AIRFLOW-2124) Allow local mainPythonFileUri

2018-03-01 Thread Kaxil Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381782#comment-16381782
 ] 

Kaxil Naik commented on AIRFLOW-2124:
-

I will also double-check this and update you, [~Fokko], once I am back from 
holidays.

> Allow local mainPythonFileUri
> -
>
> Key: AIRFLOW-2124
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2124
> Project: Apache Airflow
>  Issue Type: Wish
>Reporter: robbert van waardhuizen
>Assignee: Fokko Driesprong
>Priority: Major
>
> For our workflow, we currently are in the transition from using BashOperator 
> to using the DataProcPySparkOperators. While rewriting the DAG we came to the 
> conclusion that it is not possible to submit a (local) path as our main 
> Python file, and a Hadoop Compatible Filesystem (HCFS) is required.
> Our main Python drivers are located in a Git repository. Putting our main 
> Python files in a GS bucket would require manually updating/overwriting these 
> files.
> In terms of code, this works using the BashOperator:
>  
> {code:java}
> gcloud dataproc jobs submit pyspark \
>  /usr/local/airflow/git/airflow-dags/jobs/main_python_driver.py \
>  --cluster {cluster_name}{code}
>  
>  
> But cannot be replicated using the DataProcPySparkOperator:
> {code:java}
> DataProcPySparkOperator(main="/usr/local/airflow/git/airflow-dags/jobs/main_python_driver.py",
> cluster_name=cluster_name)
> {code}
> Error:
> {code:java}
> === Cloud Dataproc Agent Error ===
> java.lang.NullPointerException
> at sun.nio.fs.UnixPath.normalizeAndCheck(UnixPath.java:77)
> at sun.nio.fs.UnixPath.(UnixPath.java:71)
> at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
> at 
> com.google.cloud.hadoop.services.agent.job.AbstractJobHandler.registerResourceForDownload(AbstractJobHandler.java:442)
> at 
> com.google.cloud.hadoop.services.agent.job.PySparkJobHandler.buildCommand(PySparkJobHandler.java:93)
> at 
> com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:538)
> at 
> com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:532)
> at 
> com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:127)
> at 
> com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
> at 
> com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:80)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> === End of Cloud Dataproc Agent Error ===
> {code}
> What would be best practice in this case?
> Is it possible to add the ability to submit local paths as main Python file?
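One possible workaround, pending support for local paths, is to stage the local driver file into a GCS bucket automatically before calling DataProcPySparkOperator, then pass the resulting gs:// URI as `main`. The sketch below is hedged: the `upload` callable is injected so it stays self-contained (in practice it could be google-cloud-storage's `blob.upload_from_filename` or Airflow's GCS hook), and the bucket and prefix names are assumptions.

```python
import os

def stage_main_file(local_path, bucket, upload):
    """Copy a local PySpark driver into a bucket; return its gs:// URI."""
    blob_name = "dataproc-jobs/{}".format(os.path.basename(local_path))
    upload(local_path, bucket, blob_name)  # e.g. blob.upload_from_filename
    return "gs://{}/{}".format(bucket, blob_name)

# Record upload calls instead of touching the network, to keep this runnable.
uploaded = []
uri = stage_main_file(
    "/usr/local/airflow/git/airflow-dags/jobs/main_python_driver.py",
    "my-dataproc-staging",  # hypothetical staging bucket
    upload=lambda src, bkt, name: uploaded.append((src, bkt, name)),
)
print(uri)
```

The returned URI is then an HCFS path that DataProcPySparkOperator accepts, and the staging step can run in the DAG itself so the Git repository remains the source of truth.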





[jira] [Commented] (AIRFLOW-2158) Airflow should not store logs as raw ISO timestamps

2018-03-01 Thread Ash Berlin-Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381759#comment-16381759
 ] 

Ash Berlin-Taylor commented on AIRFLOW-2158:


Duplicate of https://issues.apache.org/jira/browse/AIRFLOW-1564 which has 
already been fixed on master.

https://github.com/apache/incubator-airflow/commit/4c674ccffda1fbc38b8cc044b0e2c004422a2035
 was the commit that fixed it.

> Airflow should not store logs as raw ISO timestamps
> ---
>
> Key: AIRFLOW-2158
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2158
> Project: Apache Airflow
>  Issue Type: Improvement
> Environment: 1.9.0
>Reporter: Christian D
>Priority: Minor
>  Labels: easyfix, windows
> Fix For: Airflow 2.0
>
>
> Problem:
> When Airflow writes logs to disk, it uses an ISO-8601 timestamp as the 
> filename. On a Linux filesystem this works completely fine (because all 
> characters in an ISO-8601 timestamp are allowed). However, it doesn't work on 
> Windows-based systems (including Azure File Storage), because {{:}} is a 
> disallowed character.
> Solution:
> Ideally, Airflow should store logs such that they're somewhat compatible 
> across file systems. An easy way of fixing this would therefore be to always 
> replace {{:}} with underscores.
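The suggested fix is small enough to sketch directly; the function name below is hypothetical, not Airflow's actual code (the issue notes this was already fixed on master).

```python
def safe_log_filename(execution_date_iso):
    """Turn an ISO-8601 execution date into a Windows-safe log filename.

    ":" is invalid in filenames on Windows and Azure File Storage, so it
    is replaced with "_"; all other ISO-8601 characters are portable.
    """
    return execution_date_iso.replace(":", "_") + ".log"

print(safe_log_filename("2018-03-01T15:14:55.699668"))
# -> 2018-03-01T15_14_55.699668.log
```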





[jira] [Commented] (AIRFLOW-2124) Allow local mainPythonFileUri

2018-03-01 Thread Fokko Driesprong (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381714#comment-16381714
 ] 

Fokko Driesprong commented on AIRFLOW-2124:
---

We would like to integrate this into the DataProcOperator; we don't want to 
have additional steps. We'll develop something internally that will take care 
of this and then push it back to Airflow. Cheers






[jira] [Assigned] (AIRFLOW-2124) Allow local mainPythonFileUri

2018-03-01 Thread Fokko Driesprong (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong reassigned AIRFLOW-2124:
-

Assignee: Fokko Driesprong

> Allow local mainPythonFileUri
> -
>
> Key: AIRFLOW-2124
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2124
> Project: Apache Airflow
>  Issue Type: Wish
>Reporter: robbert van waardhuizen
>Assignee: Fokko Driesprong
>Priority: Major
>
> For our workflow, we currently are in the transition from using BashOperator 
> to using the DataProcPySparkOperators. While rewriting the DAG we came to the 
> conclusion that it is not possible to submit a (local) path as our main 
> Python file, and a Hadoop Compatible Filesystem (HCFS) is required.
> Our main Python drivers are located in a Git repository. Putting our main 
> Python files in a GS bucket would require manual updating/overwriting these 
> files.
> In terms of code, this works using the BashOperator:
>  
> {code:java}
> gcloud dataproc jobs submit pyspark \
>  /usr/local/airflow/git/airflow-dags/jobs/main_python_driver.py \
>  --cluster {cluster_name}{code}
>  
>  
> But cannot be replicated using the DataProcPySparkOperator:
> {code:java}
> DataProcPySparkOperator(main="/usr/local/airflow/git/airflow-dags/jobs/main_python_driver.py",
> cluster_name=cluster_name)
> {code}
> Error:
> {code:java}
> === Cloud Dataproc Agent Error ===
> java.lang.NullPointerException
> at sun.nio.fs.UnixPath.normalizeAndCheck(UnixPath.java:77)
> at sun.nio.fs.UnixPath.(UnixPath.java:71)
> at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
> at 
> com.google.cloud.hadoop.services.agent.job.AbstractJobHandler.registerResourceForDownload(AbstractJobHandler.java:442)
> at 
> com.google.cloud.hadoop.services.agent.job.PySparkJobHandler.buildCommand(PySparkJobHandler.java:93)
> at 
> com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:538)
> at 
> com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:532)
> at 
> com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:127)
> at 
> com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
> at 
> com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:80)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> === End of Cloud Dataproc Agent Error ===
> {code}
> What would be best practice in this case?
> Is it possible to add the ability to submit local paths as main Python file?
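Until local paths are supported, one workaround is to stage the driver file into a GCS bucket at deploy time and hand the resulting gs:// URI to the operator's `main` argument. A minimal sketch of the URI construction only (the bucket name and `pyspark-jobs` prefix are hypothetical, and the actual upload, e.g. via gsutil or a storage client, is omitted):

```python
from pathlib import PurePosixPath

def staged_main_uri(bucket: str, local_path: str, prefix: str = "pyspark-jobs") -> str:
    """Map a local driver path to the gs:// URI it would be staged under.

    Only builds the HCFS-compatible URI; the upload itself is not shown.
    """
    name = PurePosixPath(local_path).name  # keep just the file name
    return f"gs://{bucket}/{prefix}/{name}"

uri = staged_main_uri(
    "my-dags-bucket",  # hypothetical bucket name
    "/usr/local/airflow/git/airflow-dags/jobs/main_python_driver.py",
)
# uri == "gs://my-dags-bucket/pyspark-jobs/main_python_driver.py"
```

The returned URI could then be passed as `main=` to the DataProcPySparkOperator instead of the local path.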



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AIRFLOW-2157) Builds in TravisCI are so unstable now

2018-03-01 Thread Sergio Herrera (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Herrera updated AIRFLOW-2157:

Description: 
At the time I write this, I have a PR that builds and passes the tests 
correctly. The problem is that sometimes, after rebasing with changes from the 
master branch, Travis CI builds fail because of a bad environment, but after 
recreating some commits, they pass the tests.

After studying some of those builds, I think the problem is that installing 
some things from scratch has a performance impact and causes other issues, 
such as unavailable services or bad package installations.

A possible solution would be to create a base image with some of the software 
preinstalled (e.g., databases or message queues), since the testing 
environment is the same for every build.

This may be related to an old task 
()
 about creating a development environment.

 

  was:
At the time I write this, I have a PR that builds and passes the tests 
correctly. The problem is that sometimes, after rebasing with changes from the 
master branch, Travis CI builds fail because of a bad environment, but after 
recreating some commits, they pass the tests.

After studying some of those builds, I think the problem is that installing 
some things from scratch has a performance impact and causes other issues, 
such as unavailable services or bad package installations.

A possible solution would be to create a base image with some of the software 
preinstalled (e.g., databases or message queues), since the testing 
environment is the same for every build.

This may be related to an old task about creating a development environment.

[AIRFLOW-87|https://issues.apache.org/jira/projects/AIRFLOW/issues/AIRFLOW-87]


> Builds in TravisCI are so unstable now
> --
>
> Key: AIRFLOW-2157
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2157
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: ci, travis
>Reporter: Sergio Herrera
>Priority: Major
>  Labels: CI, test
>
> At the time I write this, I have a PR that builds and passes the tests 
> correctly. The problem is that sometimes, after rebasing with changes from 
> the master branch, Travis CI builds fail because of a bad environment, but 
> after recreating some commits, they pass the tests.
> After studying some of those builds, I think the problem is that installing 
> some things from scratch has a performance impact and causes other issues, 
> such as unavailable services or bad package installations.
> A possible solution would be to create a base image with some of the 
> software preinstalled (e.g., databases or message queues), since the testing 
> environment is the same for every build.
> This may be related to an old task 
> ()
>  about creating a development environment.
>  
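The base-image idea could look roughly like the config fragment below. This is only a sketch: the base image, package names, and requirements file are assumptions, not Airflow's actual CI setup.

```dockerfile
# Hypothetical CI base image: preinstall the services the test suite talks to,
# so each Travis build does not install them from scratch.
FROM python:3.6-slim

RUN apt-get update && apt-get install -y --no-install-recommends \
        postgresql-client \
        mysql-client \
        rabbitmq-server \
    && rm -rf /var/lib/apt/lists/*

# Pre-bake the Python test dependencies into the image as well.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```

Each Travis build would then pull (or build `FROM`) this image instead of provisioning every service per run, which also removes the flakiness of per-build package downloads.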





[jira] [Updated] (AIRFLOW-2157) Builds in TravisCI are so unstable now

2018-03-01 Thread Sergio Herrera (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Herrera updated AIRFLOW-2157:

Description: 
At the time I write this, I have a PR that builds and passes the tests 
correctly. The problem is that sometimes, after rebasing with changes from the 
master branch, Travis CI builds fail because of a bad environment, but after 
recreating some commits, they pass the tests.

After studying some of those builds, I think the problem is that installing 
some things from scratch has a performance impact and causes other issues, 
such as unavailable services or bad package installations.

A possible solution would be to create a base image with some of the software 
preinstalled (e.g., databases or message queues), since the testing 
environment is the same for every build.

This may be related to an old task (AIRFLOW-87) about creating a development 
environment.

 

  was:
At the time I write this, I have a PR that builds and passes the tests 
correctly. The problem is that sometimes, after rebasing with changes from the 
master branch, Travis CI builds fail because of a bad environment, but after 
recreating some commits, they pass the tests.

After studying some of those builds, I think the problem is that installing 
some things from scratch has a performance impact and causes other issues, 
such as unavailable services or bad package installations.

A possible solution would be to create a base image with some of the software 
preinstalled (e.g., databases or message queues), since the testing 
environment is the same for every build.

This may be related to an old task 
()
 about creating a development environment.

 

 


> Builds in TravisCI are so unstable now
> --
>
> Key: AIRFLOW-2157
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2157
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: ci, travis
>Reporter: Sergio Herrera
>Priority: Major
>  Labels: CI, test
>
> At the time I write this, I have a PR that builds and passes the tests 
> correctly. The problem is that sometimes, after rebasing with changes from 
> the master branch, Travis CI builds fail because of a bad environment, but 
> after recreating some commits, they pass the tests.
> After studying some of those builds, I think the problem is that installing 
> some things from scratch has a performance impact and causes other issues, 
> such as unavailable services or bad package installations.
> A possible solution would be to create a base image with some of the 
> software preinstalled (e.g., databases or message queues), since the testing 
> environment is the same for every build.
> This may be related to an old task (AIRFLOW-87) about creating a development 
> environment.
>  





[jira] [Updated] (AIRFLOW-2157) Builds in TravisCI are so unstable now

2018-03-01 Thread Sergio Herrera (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergio Herrera updated AIRFLOW-2157:

Description: 
At the time I write this, I have a PR that builds and passes the tests 
correctly. The problem is that sometimes, after rebasing with changes from the 
master branch, Travis CI builds fail because of a bad environment, but after 
recreating some commits, they pass the tests.

After studying some of those builds, I think the problem is that installing 
some things from scratch has a performance impact and causes other issues, 
such as unavailable services or bad package installations.

A possible solution would be to create a base image with some of the software 
preinstalled (e.g., databases or message queues), since the testing 
environment is the same for every build.

This may be related to an old task about creating a development environment.

[AIRFLOW-87|https://issues.apache.org/jira/projects/AIRFLOW/issues/AIRFLOW-87]

  was:
At the time I write this, I have a PR that builds and passes the tests 
correctly. The problem is that sometimes, after rebasing with changes from the 
master branch, Travis CI builds fail because of a bad environment, but after 
recreating some commits, they pass the tests.

After studying some of those builds, I think the problem is that installing 
some things from scratch has a performance impact and causes other issues, 
such as unavailable services or bad package installations.

A possible solution would be to create a base image with some of the software 
preinstalled (e.g., databases or message queues), since the testing 
environment is the same for every build.

This may be related to an old 
[task](https://issues.apache.org/jira/browse/AIRFLOW-87) about creating a 
development environment.


> Builds in TravisCI are so unstable now
> --
>
> Key: AIRFLOW-2157
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2157
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: ci, travis
>Reporter: Sergio Herrera
>Priority: Major
>  Labels: CI, test
>
> At the time I write this, I have a PR that builds and passes the tests 
> correctly. The problem is that sometimes, after rebasing with changes from 
> the master branch, Travis CI builds fail because of a bad environment, but 
> after recreating some commits, they pass the tests.
> After studying some of those builds, I think the problem is that installing 
> some things from scratch has a performance impact and causes other issues, 
> such as unavailable services or bad package installations.
> A possible solution would be to create a base image with some of the 
> software preinstalled (e.g., databases or message queues), since the testing 
> environment is the same for every build.
> This may be related to an old task about creating a development environment.
> [AIRFLOW-87|https://issues.apache.org/jira/projects/AIRFLOW/issues/AIRFLOW-87]


