[jira] [Commented] (AIRFLOW-2059) taskinstance query is awful, un-indexed, and does not scale
[ https://issues.apache.org/jira/browse/AIRFLOW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383269#comment-16383269 ] Tao Feng commented on AIRFLOW-2059:
---
pr created: [https://github.com/apache/incubator-airflow/pull/3086] . I don't see any issues with indexing the job_id column, as its type is integer. Let me know if I missed anything.

> taskinstance query is awful, un-indexed, and does not scale
> ---
>
> Key: AIRFLOW-2059
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2059
> Project: Apache Airflow
> Issue Type: Bug
> Components: db, webserver
> Affects Versions: Airflow 1.8
> Environment: [nhanlon@ ~]$ nproc
> 4
> [nhanlon@ ~]$ free -g
>              total  used  free  shared  buffers  cached
> Mem:             7     5     1       0        0       1
> -/+ buffers/cache:          4     3
> Swap:            0     0     0
> [nhanlon@ ~]$ cat /etc/*release*
> CentOS release 6.7 (Final)
> CentOS release 6.7 (Final)
> CentOS release 6.7 (Final)
> cpe:/o:centos:linux:6:GA
> [nhanlon@ ~]$ mysqld --version
> mysqld Ver 5.6.31-77.0 for Linux on x86_64 (Percona Server (GPL), Release 77.0, Revision 5c1061c)
> Reporter: Neil Hanlon
> Assignee: Tao Feng
> Priority: Critical
>
> The page at /admin/taskinstance/ can reach a point where it blocks loading the page and crushes the database. It appears this is because the task_instance.job_id column is unindexed. On our database, getting the results for this query took over four minutes, locking the table for the duration.
> > 500 rows in set (4 min 8.93 sec) > > Query: > > {code:java} > SELECT task_instance.task_id AS task_instance_task_id, task_instance.dag_id > AS task_instance_dag_id, task_instance.execution_date AS > task_instance_execution_date, task_instance.start_date AS > task_instance_start_date, task_instance.end_date AS task_instance_end_date, > task_instance.duration AS task_instance_duration, task_instance.state AS > task_instance_state, task_instance.try_number AS task_instance_try_number, > task_instance.hostname AS task_instance_hostname, task_instance.unixname AS > task_instance_unixname, task_instance.job_id AS task_instance_job_id, > task_instance.pool AS task_instance_pool, task_instance.queue AS > task_instance_queue, task_instance.priority_weight AS > task_instance_priority_weight, task_instance.operator AS > task_instance_operator, task_instance.queued_dttm AS > task_instance_queued_dttm, task_instance.pid AS task_instance_pid > FROM task_instance ORDER BY task_instance.job_id DESC > LIMIT 500; > {code} > Profile, explain: > > {code:java} > :airflow> EXPLAIN SELECT task_instance.task_id AS > task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, > task_instance.execution_date AS task_instance_execution_date, > task_instance.start_date AS task_instance_start_date, task_instance.end_date > AS task_instance_end_date, task_instance.duration AS task_instance_duration, > task_instance.state AS task_instance_state, task_instance.try_number AS > task_instance_try_number, task_instance.hostname AS task_instance_hostname, > task_instance.unixname AS task_instance_unixname, task_instance.job_id AS > task_instance_job_id, task_instance.pool AS task_instance_pool, > task_instance.queue AS task_instance_queue, task_instance.priority_weight AS > task_instance_priority_weight, task_instance.operator AS > task_instance_operator, task_instance.queued_dttm AS > task_instance_queued_dttm, task_instance.pid AS task_instance_pid > -> FROM task_instance ORDER BY 
task_instance.job_id DESC
>     -> LIMIT 500;
> +----+-------------+---------------+------+---------------+------+---------+------+---------+----------------+
> | id | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | Extra          |
> +----+-------------+---------------+------+---------------+------+---------+------+---------+----------------+
> |  1 | SIMPLE      | task_instance | ALL  | NULL          | NULL | NULL    | NULL | 2542776 | Using filesort |
> +----+-------------+---------------+------+---------------+------+---------+------+---------+----------------+
> 1 row in set (0.00 sec)
> :airflow> select count(*) from task_instance;
> +----------+
> | count(*) |
> +----------+
> |  2984749 |
> +----------+
> 1 row in set (1.67 sec)
> :airflow> show profile for query 2;
> +----------------------+------------+
> | Status               | Duration   |
> +----------------------+------------+
> | starting             | 0.000157   |
> | checking permissions | 0.17       |
> | Opening tables       | 0.33       |
> | init                 | 0.46       |
> | System lock          | 0.17       |
> | optimizing           | 0.10       |
> | statistics           | 0.22       |
> | preparing            | 0.20       |
> | Sorting result       | 0.10       |
> | executing            | 0.08       |
> | Sending data         | 0.000151   |
> | Creating sort index  | 248.955841 |
> | end                  | 0.015358   |
> | query end            | 0.12       |
> | closing tables       | 0.19       |
> | freeing items        | 0.000549   |
> +----------------------+------------+
> {code}
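The profile above shows the full scan plus filesort dominating the query. The proposed fix is simply to index task_instance.job_id. The following is an illustrative sketch (not the actual Airflow migration from the PR): it uses SQLite as a stand-in for MySQL to show how an index on job_id changes the plan for `ORDER BY job_id DESC LIMIT n` from a scan-and-sort into an index walk. The index name `ti_job_id` is made up for the example.

```python
import sqlite3

# Stand-in schema: only the columns relevant to the slow query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (task_id TEXT, dag_id TEXT, job_id INTEGER)")
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?)",
    [("t%d" % i, "dag", i) for i in range(1000)],
)

query = "SELECT task_id, job_id FROM task_instance ORDER BY job_id DESC LIMIT 5"

# Without an index the planner must scan and sort the whole table
# (SQLite reports a temp B-tree; MySQL reports "Using filesort").
plan_before = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query))

# The proposed fix: a plain index on the integer job_id column.
conn.execute("CREATE INDEX ti_job_id ON task_instance (job_id)")

# With the index, the ORDER BY ... LIMIT is satisfied by walking the index.
plan_after = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query))

print(plan_before)
print(plan_after)
```

The same reasoning applies to MySQL: once `job_id` is indexed, the `LIMIT 500` query reads 500 index entries in reverse order instead of sorting ~3M rows.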
[jira] [Assigned] (AIRFLOW-2059) taskinstance query is awful, un-indexed, and does not scale
[ https://issues.apache.org/jira/browse/AIRFLOW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Feng reassigned AIRFLOW-2059:
---
Assignee: Tao Feng

> taskinstance query is awful, un-indexed, and does not scale
> ---
>
> Key: AIRFLOW-2059
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2059
> Project: Apache Airflow
> Issue Type: Bug
> Components: db, webserver
> Affects Versions: Airflow 1.8
> Environment: [nhanlon@ ~]$ nproc
> 4
> [nhanlon@ ~]$ free -g
>              total  used  free  shared  buffers  cached
> Mem:             7     5     1       0        0       1
> -/+ buffers/cache:          4     3
> Swap:            0     0     0
> [nhanlon@ ~]$ cat /etc/*release*
> CentOS release 6.7 (Final)
> CentOS release 6.7 (Final)
> CentOS release 6.7 (Final)
> cpe:/o:centos:linux:6:GA
> [nhanlon@ ~]$ mysqld --version
> mysqld Ver 5.6.31-77.0 for Linux on x86_64 (Percona Server (GPL), Release 77.0, Revision 5c1061c)
> Reporter: Neil Hanlon
> Assignee: Tao Feng
> Priority: Critical
>
> The page at /admin/taskinstance/ can reach a point where it blocks loading the page and crushes the database. It appears this is because the task_instance.job_id column is unindexed. On our database, getting the results for this query took over four minutes, locking the table for the duration.
> > 500 rows in set (4 min 8.93 sec) > > Query: > > {code:java} > SELECT task_instance.task_id AS task_instance_task_id, task_instance.dag_id > AS task_instance_dag_id, task_instance.execution_date AS > task_instance_execution_date, task_instance.start_date AS > task_instance_start_date, task_instance.end_date AS task_instance_end_date, > task_instance.duration AS task_instance_duration, task_instance.state AS > task_instance_state, task_instance.try_number AS task_instance_try_number, > task_instance.hostname AS task_instance_hostname, task_instance.unixname AS > task_instance_unixname, task_instance.job_id AS task_instance_job_id, > task_instance.pool AS task_instance_pool, task_instance.queue AS > task_instance_queue, task_instance.priority_weight AS > task_instance_priority_weight, task_instance.operator AS > task_instance_operator, task_instance.queued_dttm AS > task_instance_queued_dttm, task_instance.pid AS task_instance_pid > FROM task_instance ORDER BY task_instance.job_id DESC > LIMIT 500; > {code} > Profile, explain: > > {code:java} > :airflow> EXPLAIN SELECT task_instance.task_id AS > task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, > task_instance.execution_date AS task_instance_execution_date, > task_instance.start_date AS task_instance_start_date, task_instance.end_date > AS task_instance_end_date, task_instance.duration AS task_instance_duration, > task_instance.state AS task_instance_state, task_instance.try_number AS > task_instance_try_number, task_instance.hostname AS task_instance_hostname, > task_instance.unixname AS task_instance_unixname, task_instance.job_id AS > task_instance_job_id, task_instance.pool AS task_instance_pool, > task_instance.queue AS task_instance_queue, task_instance.priority_weight AS > task_instance_priority_weight, task_instance.operator AS > task_instance_operator, task_instance.queued_dttm AS > task_instance_queued_dttm, task_instance.pid AS task_instance_pid > -> FROM task_instance ORDER BY 
task_instance.job_id DESC
>     -> LIMIT 500;
> +----+-------------+---------------+------+---------------+------+---------+------+---------+----------------+
> | id | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | Extra          |
> +----+-------------+---------------+------+---------------+------+---------+------+---------+----------------+
> |  1 | SIMPLE      | task_instance | ALL  | NULL          | NULL | NULL    | NULL | 2542776 | Using filesort |
> +----+-------------+---------------+------+---------------+------+---------+------+---------+----------------+
> 1 row in set (0.00 sec)
> :airflow> select count(*) from task_instance;
> +----------+
> | count(*) |
> +----------+
> |  2984749 |
> +----------+
> 1 row in set (1.67 sec)
> :airflow> show profile for query 2;
> +----------------------+------------+
> | Status               | Duration   |
> +----------------------+------------+
> | starting             | 0.000157   |
> | checking permissions | 0.17       |
> | Opening tables       | 0.33       |
> | init                 | 0.46       |
> | System lock          | 0.17       |
> | optimizing           | 0.10       |
> | statistics           | 0.22       |
> | preparing            | 0.20       |
> | Sorting result       | 0.10       |
> | executing            | 0.08       |
> | Sending data         | 0.000151   |
> | Creating sort index  | 248.955841 |
> | end                  | 0.015358   |
> | query end            | 0.12       |
> | closing tables       | 0.19       |
> | freeing items        | 0.000549   |
> +----------------------+------------+
> {code}
[jira] [Commented] (AIRFLOW-2159) Fix typos in salesforce_hook
[ https://issues.apache.org/jira/browse/AIRFLOW-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383201#comment-16383201 ] Jakob Homan commented on AIRFLOW-2159: -- Thanks, Dan! Resolving. > Fix typos in salesforce_hook > > > Key: AIRFLOW-2159 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2159 > Project: Apache Airflow > Issue Type: Improvement > Components: hooks >Reporter: Jakob Homan >Assignee: Dan Fowler >Priority: Major > Fix For: Airflow 2.0 > > > There are several typos in the saleforce_hook file that would be a good > starter task to fix. > {noformat} > - ndjson: > JSON array but each element is new-line deliminated > instead of comman deliminated like in `json` > This requires a significant amount of cleanup. > Pandas doesn't handle output to CSV and json in a uniform way. > This is especially painful for datetime types. > Pandas wants to write them as strings in CSV, > but as milisecond Unix timestamps.{noformat} > To fix: comman, deliminated, milisecond. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AIRFLOW-2159) Fix typos in salesforce_hook
[ https://issues.apache.org/jira/browse/AIRFLOW-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jakob Homan resolved AIRFLOW-2159. -- Resolution: Fixed > Fix typos in salesforce_hook > > > Key: AIRFLOW-2159 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2159 > Project: Apache Airflow > Issue Type: Improvement > Components: hooks >Reporter: Jakob Homan >Assignee: Dan Fowler >Priority: Major > Fix For: Airflow 2.0 > > > There are several typos in the saleforce_hook file that would be a good > starter task to fix. > {noformat} > - ndjson: > JSON array but each element is new-line deliminated > instead of comman deliminated like in `json` > This requires a significant amount of cleanup. > Pandas doesn't handle output to CSV and json in a uniform way. > This is especially painful for datetime types. > Pandas wants to write them as strings in CSV, > but as milisecond Unix timestamps.{noformat} > To fix: comman, deliminated, milisecond. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AIRFLOW-2159) Fix typos in salesforce_hook
[ https://issues.apache.org/jira/browse/AIRFLOW-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jakob Homan updated AIRFLOW-2159: - Fix Version/s: Airflow 2.0 > Fix typos in salesforce_hook > > > Key: AIRFLOW-2159 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2159 > Project: Apache Airflow > Issue Type: Improvement > Components: hooks >Reporter: Jakob Homan >Assignee: Dan Fowler >Priority: Major > Fix For: Airflow 2.0 > > > There are several typos in the saleforce_hook file that would be a good > starter task to fix. > {noformat} > - ndjson: > JSON array but each element is new-line deliminated > instead of comman deliminated like in `json` > This requires a significant amount of cleanup. > Pandas doesn't handle output to CSV and json in a uniform way. > This is especially painful for datetime types. > Pandas wants to write them as strings in CSV, > but as milisecond Unix timestamps.{noformat} > To fix: comman, deliminated, milisecond. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AIRFLOW-2159) Fix typos in salesforce_hook
[ https://issues.apache.org/jira/browse/AIRFLOW-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383198#comment-16383198 ] ASF subversion and git services commented on AIRFLOW-2159: -- Commit c7e39683d80caf89928f0002cf5214fe68d8775b in incubator-airflow's branch refs/heads/master from [~dfowler] [ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=c7e3968 ] [AIRFLOW-2159] Fix a few typos in salesforce_hook > Fix typos in salesforce_hook > > > Key: AIRFLOW-2159 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2159 > Project: Apache Airflow > Issue Type: Improvement > Components: hooks >Reporter: Jakob Homan >Assignee: Dan Fowler >Priority: Major > > There are several typos in the saleforce_hook file that would be a good > starter task to fix. > {noformat} > - ndjson: > JSON array but each element is new-line deliminated > instead of comman deliminated like in `json` > This requires a significant amount of cleanup. > Pandas doesn't handle output to CSV and json in a uniform way. > This is especially painful for datetime types. > Pandas wants to write them as strings in CSV, > but as milisecond Unix timestamps.{noformat} > To fix: comman, deliminated, milisecond. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
incubator-airflow git commit: [AIRFLOW-2159] Fix a few typos in salesforce_hook
Repository: incubator-airflow
Updated Branches:
  refs/heads/master 2511c46c2 -> c7e39683d

[AIRFLOW-2159] Fix a few typos in salesforce_hook

Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/c7e39683
Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/c7e39683
Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/c7e39683

Branch: refs/heads/master
Commit: c7e39683d80caf89928f0002cf5214fe68d8775b
Parents: 2511c46
Author: dan-sf
Authored: Thu Mar 1 09:46:37 2018 -0800
Committer: dan-sf
Committed: Thu Mar 1 09:46:37 2018 -0800
--
 airflow/contrib/hooks/salesforce_hook.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/c7e39683/airflow/contrib/hooks/salesforce_hook.py
--
diff --git a/airflow/contrib/hooks/salesforce_hook.py b/airflow/contrib/hooks/salesforce_hook.py
index b82e4ca..bf03638 100644
--- a/airflow/contrib/hooks/salesforce_hook.py
+++ b/airflow/contrib/hooks/salesforce_hook.py
@@ -208,14 +208,14 @@ class SalesforceHook(BaseHook, LoggingMixin):
             - json:
                 JSON array.
                 Each element in the array is a different row.
             - ndjson:
-                JSON array but each element is new-line deliminated
-                instead of comman deliminated like in `json`
+                JSON array but each element is new-line delimited
+                instead of comma delimited like in `json`

         This requires a significant amount of cleanup.
         Pandas doesn't handle output to CSV and json in a uniform way.
         This is especially painful for datetime types.
         Pandas wants to write them as strings in CSV,
-        but as milisecond Unix timestamps.
+        but as millisecond Unix timestamps.

         By default, this function will try and leave all values as
         they are represented in Salesforce.
[jira] [Commented] (AIRFLOW-2159) Fix typos in salesforce_hook
[ https://issues.apache.org/jira/browse/AIRFLOW-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382917#comment-16382917 ] Dan Fowler commented on AIRFLOW-2159: - This PR resolves this ticket: https://github.com/apache/incubator-airflow/pull/3085 > Fix typos in salesforce_hook > > > Key: AIRFLOW-2159 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2159 > Project: Apache Airflow > Issue Type: Improvement > Components: hooks >Reporter: Jakob Homan >Assignee: Dan Fowler >Priority: Major > > There are several typos in the saleforce_hook file that would be a good > starter task to fix. > {noformat} > - ndjson: > JSON array but each element is new-line deliminated > instead of comman deliminated like in `json` > This requires a significant amount of cleanup. > Pandas doesn't handle output to CSV and json in a uniform way. > This is especially painful for datetime types. > Pandas wants to write them as strings in CSV, > but as milisecond Unix timestamps.{noformat} > To fix: comman, deliminated, milisecond. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work started] (AIRFLOW-2150) Use get_partition_names() instead of get_partitions() in HiveMetastoreHook().max_partition()
[ https://issues.apache.org/jira/browse/AIRFLOW-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on AIRFLOW-2150 started by Kevin Yang.
---

> Use get_partition_names() instead of get_partitions() in HiveMetastoreHook().max_partition()
> ---
>
> Key: AIRFLOW-2150
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2150
> Project: Apache Airflow
> Issue Type: Bug
> Reporter: Kevin Yang
> Assignee: Kevin Yang
> Priority: Major
>
> get_partitions() is extremely expensive for large tables; max_partition() should use get_partition_names() instead.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
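The idea behind this ticket can be sketched as follows: the metastore's `get_partition_names()` returns lightweight strings like `'ds=2018-03-01/hour=0'`, so the maximum partition value can be computed from those names without fetching full partition objects. The helper name and parsing details below are illustrative assumptions, not the actual Airflow patch.

```python
# Hypothetical sketch of the optimization: derive max_partition() from the
# partition-name strings that get_partition_names() returns, instead of the
# heavyweight partition objects returned by get_partitions().
def max_partition_from_names(part_names, partition_key):
    """part_names look like 'ds=2018-03-01' or 'ds=2018-03-01/hour=23'."""
    values = []
    for name in part_names:
        # A name is a '/'-separated list of key=value specs.
        for spec in name.split("/"):
            key, _, value = spec.partition("=")
            if key == partition_key:
                values.append(value)
    return max(values) if values else None

# Simulated output of a metastore get_partition_names() call:
names = ["ds=2018-02-27/hour=1", "ds=2018-03-01/hour=0", "ds=2018-02-28/hour=5"]
print(max_partition_from_names(names, "ds"))  # -> 2018-03-01
```

Note this relies on partition values (e.g. zero-padded dates) sorting correctly as strings, which is the common convention for `ds`-style partitions.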
[jira] [Assigned] (AIRFLOW-2163) Add HBC Digital to list of companies using Airflow
[ https://issues.apache.org/jira/browse/AIRFLOW-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Terry McCartan reassigned AIRFLOW-2163: --- Assignee: (was: Terry McCartan) > Add HBC Digital to list of companies using Airflow > -- > > Key: AIRFLOW-2163 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2163 > Project: Apache Airflow > Issue Type: Bug >Reporter: Terry McCartan >Priority: Trivial > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2163) Add HBC Digital to list of companies using Airflow
Terry McCartan created AIRFLOW-2163: --- Summary: Add HBC Digital to list of companies using Airflow Key: AIRFLOW-2163 URL: https://issues.apache.org/jira/browse/AIRFLOW-2163 Project: Apache Airflow Issue Type: Bug Reporter: Terry McCartan Assignee: Terry McCartan -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2162) Run DAG as user other than airflow does NOT have access to AIRFLOW_ environment variables
Sebastian Radloff created AIRFLOW-2162:
--

Summary: Run DAG as user other than airflow does NOT have access to AIRFLOW_ environment variables
Key: AIRFLOW-2162
URL: https://issues.apache.org/jira/browse/AIRFLOW-2162
Project: Apache Airflow
Issue Type: Bug
Components: configuration
Reporter: Sebastian Radloff

When running airflow with LocalExecutor, I inject airflow environment variables that are supposed to override what is in the airflow.cfg, according to the documentation ([https://airflow.apache.org/configuration.html]). If you specify that your DAGs should run as another Linux user, root for example, this is what airflow executes under the hood:

{code:java}
['bash', '-c', u'sudo -H -u root airflow run docker_sample docker_op_tester 2018-03-01T15:14:55.699668 --job_id 2 --raw -sd DAGS_FOLDER/docker-operator.py --cfg_path /tmp/tmpignV9B']
{code}

It uses sudo and switches to the root Linux user; unfortunately, the command won't have access to the environment variables injected to override the config. This is important for people who are trying to inject variables into a docker container at run time while wishing to maintain a level of security around database credentials. I think a decent proposal, made by [~ashb] in Gitter, would be to automatically pass all environment variables starting with *AIRFLOW__* to any user. Please let me know if y'all want any help on the documentation, or point me in the right direction and I can create a PR.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
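The proposal above (forward every `AIRFLOW__*` variable across the user switch) can be sketched like this. The helper name and exact command shape are hypothetical, not Airflow's actual implementation; note also that `sudo` only honors `VAR=value` settings placed before the command if the sudoers policy permits it (the `SETENV` tag), since `sudo` strips the environment by default.

```python
import os

# Hypothetical sketch: when building the "sudo -u <user>" wrapper, explicitly
# forward every AIRFLOW__* variable so the overridden config survives the
# user switch. VAR=value pairs before the command are sudo env settings.
def sudo_command_with_airflow_env(run_as_user, airflow_cmd, environ=None):
    environ = os.environ if environ is None else environ
    env_pairs = [
        "%s=%s" % (k, v)
        for k, v in sorted(environ.items())
        if k.startswith("AIRFLOW__")  # forward only Airflow config overrides
    ]
    return ["sudo", "-H", "-u", run_as_user] + env_pairs + airflow_cmd

cmd = sudo_command_with_airflow_env(
    "root",
    ["airflow", "run", "docker_sample", "docker_op_tester"],
    environ={
        "AIRFLOW__CORE__SQL_ALCHEMY_CONN": "postgresql://creds-elided",
        "PATH": "/usr/bin",  # non-AIRFLOW__ vars are deliberately not forwarded
    },
)
print(cmd)
```

Only the argv construction is shown; nothing is executed, so credentials never hit a shell history or log line.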
[jira] [Commented] (AIRFLOW-2124) Allow local mainPythonFileUri
[ https://issues.apache.org/jira/browse/AIRFLOW-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381782#comment-16381782 ] Kaxil Naik commented on AIRFLOW-2124:
---
I will also double check on this and update you [~Fokko] once I am back from holidays.

> Allow local mainPythonFileUri
> -
>
> Key: AIRFLOW-2124
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2124
> Project: Apache Airflow
> Issue Type: Wish
> Reporter: robbert van waardhuizen
> Assignee: Fokko Driesprong
> Priority: Major
>
> For our workflow, we currently are in the transition from using BashOperator to using the DataProcPySparkOperators. While rewriting the DAG we came to the conclusion that it is not possible to submit a (local) path as our main Python file, and a Hadoop Compatible Filesystem (HCFS) is required.
> Our main Python drivers are located in a Git repository. Putting our main Python files in a GS bucket would require manual updating/overwriting these files.
> In terms of code, this works using the BashOperator:
>
> {code:java}
> gcloud dataproc jobs submit pyspark \
>   /usr/local/airflow/git/airflow-dags/jobs/main_python_driver.py \
>   --cluster {cluster_name}{code}
>
> But cannot be replicated using the DataProcPySparkOperator:
> {code:java}
> DataProcPySparkOperator(main="/usr/local/airflow/git/airflow-dags/jobs/main_python_driver.py",
>                         cluster_name=cluster_name)
> {code}
> Error:
> {code:java}
> === Cloud Dataproc Agent Error ===
> java.lang.NullPointerException
> at sun.nio.fs.UnixPath.normalizeAndCheck(UnixPath.java:77)
> at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
> at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
> at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler.registerResourceForDownload(AbstractJobHandler.java:442)
> at com.google.cloud.hadoop.services.agent.job.PySparkJobHandler.buildCommand(PySparkJobHandler.java:93)
> at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:538)
> at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:532)
> at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:127)
> at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
> at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:80)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> End of Cloud Dataproc Agent Error
> {code}
> What would be best practice in this case?
> Is it possible to add the ability to submit local paths as main Python file?

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
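The failure above happens because Dataproc's mainPythonFileUri must be an HCFS URI (e.g. `gs://...`), while the value passed to DataProcPySparkOperator is a bare local path with no scheme. A hypothetical helper illustrating the distinction (a real workaround would stage the local file to GCS before submitting; only the detection step is shown, and the helper name is made up):

```python
from urllib.parse import urlparse

# Hypothetical check: a Dataproc job's main file must live on an HCFS
# filesystem (gs://, hdfs://, ...). A bare local path has no URI scheme,
# so it would need to be uploaded (staged) to a bucket first.
def needs_staging(main):
    scheme = urlparse(main).scheme
    return scheme in ("", "file")  # local path -> must be staged first

print(needs_staging("/usr/local/airflow/git/airflow-dags/jobs/main_python_driver.py"))  # True
print(needs_staging("gs://my-bucket/jobs/main_python_driver.py"))  # False
```

The BashOperator version works because `gcloud dataproc jobs submit` performs exactly this staging step on the client side before calling the API.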
[jira] [Commented] (AIRFLOW-2158) Airflow should not store logs as raw ISO timestamps
[ https://issues.apache.org/jira/browse/AIRFLOW-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381759#comment-16381759 ] Ash Berlin-Taylor commented on AIRFLOW-2158: Duplicate of https://issues.apache.org/jira/browse/AIRFLOW-1564 which has already been fixed on master. https://github.com/apache/incubator-airflow/commit/4c674ccffda1fbc38b8cc044b0e2c004422a2035 was the commit that fixed it. > Airflow should not store logs as raw ISO timestamps > --- > > Key: AIRFLOW-2158 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2158 > Project: Apache Airflow > Issue Type: Improvement > Environment: 1.9.0 >Reporter: Christian D >Priority: Minor > Labels: easyfix, windows > Fix For: Airflow 2.0 > > > Problem: > When Airflow writes logs to disk, it uses a ISO-8601 timestamp as the > filename. In a Linux filesystem this works completely fine (because all > characters in a ISO-8601 timestamp is allowed). However, it doesn't work on > Windows based systems (including Azure File Storage) because {{:}} is a > disallowed character. > Solution: > Ideally, Airflow should store logs such that they're somewhat compatible > across file systems. An easy way of fixing this would therefore be to always > replace {{:}} with underscores. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
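The easy fix suggested in the ticket (always replace `:` in the timestamp-based filename) can be sketched as below. The path layout is an assumption for illustration, not Airflow's exact log-naming scheme (the duplicate ticket AIRFLOW-1564 addressed the real one):

```python
# Minimal sketch, assuming a dag/task/timestamp/try layout: make the
# ISO-8601-based log path portable by replacing ':' (disallowed on Windows
# and Azure File Storage) with '_'.
def safe_log_filename(dag_id, task_id, execution_date_iso, try_number):
    ts = execution_date_iso.replace(":", "_")
    return "{}/{}/{}/{}.log".format(dag_id, task_id, ts, try_number)

name = safe_log_filename("my_dag", "my_task", "2018-03-01T15:14:55.699668", 1)
print(name)  # -> my_dag/my_task/2018-03-01T15_14_55.699668/1.log
```

The substitution is lossless for ISO-8601 timestamps, since `_` never appears in them, so the original timestamp can still be recovered from the filename.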
[jira] [Commented] (AIRFLOW-2124) Allow local mainPythonFileUri
[ https://issues.apache.org/jira/browse/AIRFLOW-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381714#comment-16381714 ] Fokko Driesprong commented on AIRFLOW-2124:
---
We would like to integrate this in the DataProcOperator; we don't want to have additional steps. We'll develop something internally that takes care of this and then push it back to Airflow. Cheers

> Allow local mainPythonFileUri
> -
>
> Key: AIRFLOW-2124
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2124
> Project: Apache Airflow
> Issue Type: Wish
> Reporter: robbert van waardhuizen
> Assignee: Fokko Driesprong
> Priority: Major

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (AIRFLOW-2124) Allow local mainPythonFileUri
[ https://issues.apache.org/jira/browse/AIRFLOW-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong reassigned AIRFLOW-2124:
---
Assignee: Fokko Driesprong

> Allow local mainPythonFileUri
> -
>
> Key: AIRFLOW-2124
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2124
> Project: Apache Airflow
> Issue Type: Wish
> Reporter: robbert van waardhuizen
> Assignee: Fokko Driesprong
> Priority: Major

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AIRFLOW-2157) Builds in TravisCI are so unstable now
[ https://issues.apache.org/jira/browse/AIRFLOW-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Herrera updated AIRFLOW-2157: Description: At the time i write this, I have a PR that builds and pass the tests correctly. The problem is sometimes, after rebasing with changes in master branch, TravisCI builds fails because a bad environment, but after some commit recreations, it passes the tests. After studying some of that builds, I think the problem is that installing some things from scratch has a performance impact and other issues like unavailable services or bad packages installations. A possible great solution is creating a base image that contains some of the software preinstalled (e.g, databases or messages queues) as the environment for testing is the same for every build. This can be related to an old task () about creating a development environment. was: At the time i write this, I have a PR that builds and pass the tests correctly. The problem is sometimes, after rebasing with changes in master branch, TravisCI builds fails because a bad environment, but after some commit recreations, it passes the tests. After studying some of that builds, I think the problem is that installing some things from scratch has a performance impact and other issues like unavailable services or bad packages installations. A possible great solution is creating a base image that contains some of the software preinstalled (e.g, databases or messages queues) as the environment for testing is the same for every build. This can be related to an old task about creating a development environment. 
[AIRFLOW-87|https://issues.apache.org/jira/projects/AIRFLOW/issues/AIRFLOW-87] > Builds in TravisCI are so unstable now > -- > > Key: AIRFLOW-2157 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2157 > Project: Apache Airflow > Issue Type: Improvement > Components: ci, travis > Reporter: Sergio Herrera > Priority: Major > Labels: CI, test > > At the time I write this, I have a PR that builds and passes the tests correctly. The problem is that sometimes, after rebasing on changes in the master branch, TravisCI builds fail because of a bad environment, but after recreating some commits, they pass the tests. > After studying some of those builds, I think the problem is that installing everything from scratch has a performance impact and causes other issues, such as unavailable services or bad package installations. > A great possible solution would be to create a base image with some of the software preinstalled (e.g., databases or message queues), as the testing environment is the same for every build. > This can be related to an old task ( ) about creating a development environment.
[jira] [Updated] (AIRFLOW-2157) Builds in TravisCI are so unstable now
[ https://issues.apache.org/jira/browse/AIRFLOW-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Herrera updated AIRFLOW-2157: Description: At the time I write this, I have a PR that builds and passes the tests correctly. The problem is that sometimes, after rebasing on changes in the master branch, TravisCI builds fail because of a bad environment, but after recreating some commits, they pass the tests. After studying some of those builds, I think the problem is that installing everything from scratch has a performance impact and causes other issues, such as unavailable services or bad package installations. A great possible solution would be to create a base image with some of the software preinstalled (e.g., databases or message queues), as the testing environment is the same for every build. This can be related to an old task (AIRFLOW-87) about creating a development environment. was: At the time I write this, I have a PR that builds and passes the tests correctly. The problem is that sometimes, after rebasing on changes in the master branch, TravisCI builds fail because of a bad environment, but after recreating some commits, they pass the tests. After studying some of those builds, I think the problem is that installing everything from scratch has a performance impact and causes other issues, such as unavailable services or bad package installations. A great possible solution would be to create a base image with some of the software preinstalled (e.g., databases or message queues), as the testing environment is the same for every build. This can be related to an old task () about creating a development environment. > Builds in TravisCI are so unstable now > -- > > Key: AIRFLOW-2157 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2157 > Project: Apache Airflow > Issue Type: Improvement > Components: ci, travis > Reporter: Sergio Herrera > Priority: Major > Labels: CI, test > > At the time I write this, I have a PR that builds and passes the tests correctly.
The problem is that sometimes, after rebasing on changes in the master > branch, TravisCI builds fail because of a bad environment, but after > recreating some commits, they pass the tests. > After studying some of those builds, I think the problem is that installing > everything from scratch has a performance impact and causes other issues, > such as unavailable services or bad package installations. > A great possible solution would be to create a base image with some of the > software preinstalled (e.g., databases or message queues), as the testing > environment is the same for every build. > This can be related to an old task (AIRFLOW-87) about creating a development > environment.
[jira] [Updated] (AIRFLOW-2157) Builds in TravisCI are so unstable now
[ https://issues.apache.org/jira/browse/AIRFLOW-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Herrera updated AIRFLOW-2157: Description: At the time I write this, I have a PR that builds and passes the tests correctly. The problem is that sometimes, after rebasing on changes in the master branch, TravisCI builds fail because of a bad environment, but after recreating some commits, they pass the tests. After studying some of those builds, I think the problem is that installing everything from scratch has a performance impact and causes other issues, such as unavailable services or bad package installations. A great possible solution would be to create a base image with some of the software preinstalled (e.g., databases or message queues), as the testing environment is the same for every build. This can be related to an old task about creating a development environment. [AIRFLOW-87|https://issues.apache.org/jira/projects/AIRFLOW/issues/AIRFLOW-87] was: At the time I write this, I have a PR that builds and passes the tests correctly. The problem is that sometimes, after rebasing on changes in the master branch, TravisCI builds fail because of a bad environment, but after recreating some commits, they pass the tests. After studying some of those builds, I think the problem is that installing everything from scratch has a performance impact and causes other issues, such as unavailable services or bad package installations. A great possible solution would be to create a base image with some of the software preinstalled (e.g., databases or message queues), as the testing environment is the same for every build. This can be related to an old [task](https://issues.apache.org/jira/browse/AIRFLOW-87) about creating a development environment.
> Builds in TravisCI are so unstable now > -- > > Key: AIRFLOW-2157 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2157 > Project: Apache Airflow > Issue Type: Improvement > Components: ci, travis > Reporter: Sergio Herrera > Priority: Major > Labels: CI, test > > At the time I write this, I have a PR that builds and passes the tests > correctly. The problem is that sometimes, after rebasing on changes in the master > branch, TravisCI builds fail because of a bad environment, but after > recreating some commits, they pass the tests. > After studying some of those builds, I think the problem is that installing > everything from scratch has a performance impact and causes other issues, > such as unavailable services or bad package installations. > A great possible solution would be to create a base image with some of the > software preinstalled (e.g., databases or message queues), as the testing > environment is the same for every build. > This can be related to an old task about creating a development environment. > [AIRFLOW-87|https://issues.apache.org/jira/projects/AIRFLOW/issues/AIRFLOW-87]
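The base-image idea described above could be sketched roughly as follows, assuming Docker is used. This is a minimal illustration only: the base image, package names, and requirements file are hypothetical, not what the Airflow project actually adopted.

```dockerfile
# Hypothetical base image for CI builds (sketch only).
FROM ubuntu:16.04

# Preinstall the services every build needs (databases, a message queue),
# so CI jobs do not have to fetch and configure them from scratch each run.
RUN apt-get update && apt-get install -y --no-install-recommends \
        mysql-server \
        postgresql \
        rabbitmq-server \
        python-pip \
    && rm -rf /var/lib/apt/lists/*

# Cache heavyweight Python dependencies in the image as well;
# requirements.txt here stands in for whatever the test suite needs.
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
```

TravisCI jobs would then pull this prebuilt image instead of provisioning databases and message queues in every build, which is the stability and speed gain the description argues for.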