[jira] [Resolved] (AIRFLOW-2335) Issue downloading oracle jdk8 is preventing travis builds from running
[ https://issues.apache.org/jira/browse/AIRFLOW-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arthur Wiedmer resolved AIRFLOW-2335. - Resolution: Fixed Fix Version/s: 1.10.0 Issue resolved by pull request #3236 [https://github.com/apache/incubator-airflow/pull/3236] > Issue downloading oracle jdk8 is preventing travis builds from running > -- > > Key: AIRFLOW-2335 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2335 > Project: Apache Airflow > Issue Type: Bug > Reporter: Daniel Imberman > Assignee: Daniel Imberman > Priority: Major > Fix For: 1.10.0 > > > Currently, all Airflow builds are dying after ~1 minute due to an issue with > how Travis pulls jdk8 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AIRFLOW-2335) Issue downloading oracle jdk8 is preventing travis builds from running
[ https://issues.apache.org/jira/browse/AIRFLOW-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441901#comment-16441901 ] ASF subversion and git services commented on AIRFLOW-2335: -- Commit 0f8507ae351787e086d1d1038f6f0ba52e6d9aaa in incubator-airflow's branch refs/heads/master from [~dimberman] [ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=0f8507a ] [AIRFLOW-2335] fix issue with jdk8 download for ci Make sure you have checked _all_ steps below. - [x] My PR addresses the following [Airflow JIRA](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR" - https://issues.apache.org/jira/browse/AIRFLOW-2335 - In case you are fixing a typo in the documentation you can prepend your commit with \[AIRFLOW-XXX\]; code changes always need a JIRA issue. - [x] Here are some details about my PR, including screenshots of any UI changes: There is an issue with Travis pulling jdk8 that is preventing CI jobs from running. This blocks further development of the project. Reference: https://github.com/travis-ci/travis-ci/issues/9512#issuecomment-382235301 - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: This PR can't be unit tested since it is just configuration. However, the fact that unit tests run successfully should show that it works. - [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 2. Subject is limited to 50 characters 3. Subject does not end with a period 4. Subject uses the imperative mood ("add", not "adding") 5. Body wraps at 72 characters 6. Body explains "what" and "why", not "how" - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added. - [ ] Passes `git diff upstream/master -u -- "*.py" | flake8 --diff` Closes #3236 from dimberman/AIRFLOW-2335_travis_issue > Issue downloading oracle jdk8 is preventing travis builds from running > -- > > Key: AIRFLOW-2335 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2335 > Project: Apache Airflow > Issue Type: Bug > Reporter: Daniel Imberman > Assignee: Daniel Imberman > Priority: Major > > Currently, all Airflow builds are dying after ~1 minute due to an issue with > how Travis pulls jdk8 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
incubator-airflow git commit: [AIRFLOW-2335] fix issue with jdk8 download for ci
Repository: incubator-airflow Updated Branches: refs/heads/master 3f1bfd38c -> 0f8507ae3 [AIRFLOW-2335] fix issue with jdk8 download for ci Make sure you have checked _all_ steps below. - [x] My PR addresses the following [Airflow JIRA](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR" - https://issues.apache.org/jira/browse/AIRFLOW-2335 - In case you are fixing a typo in the documentation you can prepend your commit with \[AIRFLOW-XXX\]; code changes always need a JIRA issue. - [x] Here are some details about my PR, including screenshots of any UI changes: There is an issue with Travis pulling jdk8 that is preventing CI jobs from running. This blocks further development of the project. Reference: https://github.com/travis-ci/travis-ci/issues/9512#issuecomment-382235301 - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: This PR can't be unit tested since it is just configuration. However, the fact that unit tests run successfully should show that it works. - [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 2. Subject is limited to 50 characters 3. Subject does not end with a period 4. Subject uses the imperative mood ("add", not "adding") 5. Body wraps at 72 characters 6. Body explains "what" and "why", not "how" - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.
- [ ] Passes `git diff upstream/master -u -- "*.py" | flake8 --diff` Closes #3236 from dimberman/AIRFLOW-2335_travis_issue Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/0f8507ae Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/0f8507ae Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/0f8507ae Branch: refs/heads/master Commit: 0f8507ae351787e086d1d1038f6f0ba52e6d9aaa Parents: 3f1bfd3 Author: Daniel Imberman Authored: Tue Apr 17 21:57:14 2018 -0700 Committer: Arthur Wiedmer Committed: Tue Apr 17 21:57:42 2018 -0700 -- .travis.yml | 17 ++--- 1 file changed, 14 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/0f8507ae/.travis.yml -- diff --git a/.travis.yml b/.travis.yml index d9a333d..883473d 100644 --- a/.travis.yml +++ b/.travis.yml @@ -6,9 +6,9 @@ # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License.
You may obtain a copy of the License at -# +# # http://www.apache.org/licenses/LICENSE-2.0 -# +# # Unless required by applicable law or agreed to in writing, # software distributed under the License is distributed on an # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY @@ -37,7 +37,6 @@ addons: - krb5-user - krb5-kdc - krb5-admin-server - - oracle-java8-installer - python-selinux postgresql: "9.2" python: @@ -93,7 +92,19 @@ before_install: - cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys - ln -s ~/.ssh/authorized_keys ~/.ssh/authorized_keys2 - chmod 600 ~/.ssh/* + - sudo add-apt-repository -y ppa:webupd8team/java + - sudo apt-get update + - sudo apt-get install -y oracle-java8-installer || true + #todo remove this kludge and the above || true when the ppa is fixed + - cd /var/lib/dpkg/info + - sudo sed -i 's|JAVA_VERSION=8u161|JAVA_VERSION=8u172|' oracle-java8-installer.* + - sudo sed -i 's|PARTNER_URL=http://download.oracle.com/otn-pub/java/jdk/8u161-b12/2f38c3b165be4555a1fa6e98c45e0808/|PARTNER_URL=http://download.oracle.com/otn-pub/java/jdk/8u172-b11/a58eab1ec242421181065cdc37240b08/|' oracle-java8-installer.* + - sudo sed -i 's|SHA256SUM_TGZ="6dbc56a0e3310b69e91bb64db63a485bd7b6a8083f08e48047276380a0e2021e"|SHA256SUM_TGZ="28a00b9400b6913563553e09e8024c286b506d8523334c93ddec6c9ec7e9d346"|' oracle-java8-installer.* + - sudo sed -i 's|J_DIR=jdk1.8.0_161|J_DIR=jdk1.8.0_172|' oracle-java8-installer.* + - sudo apt-get update + - sudo apt-get install -y oracle-java8-installer - jdk_switcher use oraclejdk8 + - cd $TRAVIS_BUILD_DIR install: - pip install --upgrade pip - pip install tox
[jira] [Created] (AIRFLOW-2335) Issue downloading oracle jdk8 is preventing travis builds from running
Daniel Imberman created AIRFLOW-2335: Summary: Issue downloading oracle jdk8 is preventing travis builds from running Key: AIRFLOW-2335 URL: https://issues.apache.org/jira/browse/AIRFLOW-2335 Project: Apache Airflow Issue Type: Bug Reporter: Daniel Imberman Assignee: Daniel Imberman Currently, all Airflow builds are dying after ~1 minute due to an issue with how Travis pulls jdk8 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2334) Add AWS Database Migration Service operators and sensors
Jordan Zucker created AIRFLOW-2334: -- Summary: Add AWS Database Migration Service operators and sensors Key: AIRFLOW-2334 URL: https://issues.apache.org/jira/browse/AIRFLOW-2334 Project: Apache Airflow Issue Type: New Feature Components: aws, contrib, operators Reporter: Jordan Zucker Assignee: Jordan Zucker [AWS Database Migration Service|https://aws.amazon.com/dms/] allows for long-running, asynchronous tasks that move, copy, and back up databases, which makes it a good candidate for management with Airflow. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2333) Add Segment Hook to Airflow
Jordan Zucker created AIRFLOW-2333: -- Summary: Add Segment Hook to Airflow Key: AIRFLOW-2333 URL: https://issues.apache.org/jira/browse/AIRFLOW-2333 Project: Apache Airflow Issue Type: New Feature Components: contrib, hooks Reporter: Jordan Zucker Assignee: Jordan Zucker [Segment|https://segment.com/] is used by many to track analytics. It would be nice to allow Airflow to interact with Segment and to store the username and password encrypted in its database. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AIRFLOW-1894) Rebase and migrate existing Airflow GCP operators to google-python-cloud
[ https://issues.apache.org/jira/browse/AIRFLOW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441527#comment-16441527 ] Guillermo Rodríguez Cano commented on AIRFLOW-1894: --- I did read your comment, [~yiga2], and I did not say anything about that in my comment. (In fact, I think what you did could complement the standard google-cloud-python library; I haven't checked it enough to conclude whether it is possible to stream a file or not.) I assume the google-cloud-python library performs better than the one currently used, but the changes required in the hooks are considerable, so I was requesting more information on that and offered my help to make the change (and to figure out whether it is possible to emulate the storage transfer service you point to, which I guess is the same as the transfer service offered in the UI of the Google Cloud Console). > Rebase and migrate existing Airflow GCP operators to google-python-cloud > > > Key: AIRFLOW-1894 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1894 > Project: Apache Airflow > Issue Type: Improvement > Components: contrib > Affects Versions: Airflow 2.0 > Reporter: Feng Lu > Assignee: Feng Lu > Priority: Minor > > [google-api-python-client|https://github.com/google/google-api-python-client] > is in maintenance mode and it's recommended that > [google-cloud-python|https://github.com/GoogleCloudPlatform/google-cloud-python] > should be used whenever possible. Given that we don't have feature parity > between the two libraries, this issue is created to track the long-term > migration efforts moving from google-api-python-client to > google-cloud-python. Here are some general guidelines we try to follow in > this cleanup process: > - add google-cloud-python dependency as part of gcp_api extra packages (make > sure there is no dependency conflict between the two). > - new operators shall be based on google-cloud-python if possible.
> - migrate existing GCP operators when the underlying GCP service is available > in google-cloud-python. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AIRFLOW-1894) Rebase and migrate existing Airflow GCP operators to google-python-cloud
[ https://issues.apache.org/jira/browse/AIRFLOW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441510#comment-16441510 ] Yannick Einsweiler commented on AIRFLOW-1894: - [~wileeam] read my initial comment: the idea was to leverage [https://cloud.google.com/storage/transfer/reference/rest/] *without* the need for a local or virtual machine passthrough. We ended up writing that operator (and a sensor that checks when the job is complete), all triggered from within GCloud, but we have since abandoned the project and now have our AWS Lambda save its output to GCS directly. > Rebase and migrate existing Airflow GCP operators to google-python-cloud > > > Key: AIRFLOW-1894 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1894 > Project: Apache Airflow > Issue Type: Improvement > Components: contrib > Affects Versions: Airflow 2.0 > Reporter: Feng Lu > Assignee: Feng Lu > Priority: Minor > > [google-api-python-client|https://github.com/google/google-api-python-client] > is in maintenance mode and it's recommended that > [google-cloud-python|https://github.com/GoogleCloudPlatform/google-cloud-python] > should be used whenever possible. Given that we don't have feature parity > between the two libraries, this issue is created to track the long-term > migration efforts moving from google-api-python-client to > google-cloud-python. Here are some general guidelines we try to follow in > this cleanup process: > - add google-cloud-python dependency as part of gcp_api extra packages (make > sure there is no dependency conflict between the two). > - new operators shall be based on google-cloud-python if possible. > - migrate existing GCP operators when the underlying GCP service is available > in google-cloud-python. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2332) S3 download operator
Guillermo Rodríguez Cano created AIRFLOW-2332: - Summary: S3 download operator Key: AIRFLOW-2332 URL: https://issues.apache.org/jira/browse/AIRFLOW-2332 Project: Apache Airflow Issue Type: New Feature Components: aws Reporter: Guillermo Rodríguez Cano Assignee: Guillermo Rodríguez Cano An operator that downloads a remote file from S3 to the machine where Airflow is running (or saves it in XCom if it is not too big). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AIRFLOW-2332) S3 Download Operator
[ https://issues.apache.org/jira/browse/AIRFLOW-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guillermo Rodríguez Cano updated AIRFLOW-2332: -- Summary: S3 Download Operator (was: S3 download operator) > S3 Download Operator > > > Key: AIRFLOW-2332 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2332 > Project: Apache Airflow > Issue Type: New Feature > Components: aws > Reporter: Guillermo Rodríguez Cano > Assignee: Guillermo Rodríguez Cano > Priority: Major > > An operator that downloads a remote file from S3 to the machine where > Airflow is running (or saves it in XCom if it is not too big). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
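The trade-off hinted at in the issue description (local file vs. XCom for small objects) can be sketched as follows. This is an illustrative sketch only: `choose_delivery` is a hypothetical helper, and `XCOM_MAX_BYTES` is an assumed threshold, not an Airflow constant.

```python
# Illustrative sketch of the delivery decision in AIRFLOW-2332: push small
# S3 objects to XCom, write larger ones to local disk on the worker.
# XCom values live in Airflow's metadata database, so they should stay small;
# the exact cutoff below is an assumption for illustration.
XCOM_MAX_BYTES = 48 * 1024


def choose_delivery(object_size_bytes):
    """Return "xcom" for small objects and "local_file" for large ones."""
    if object_size_bytes <= XCOM_MAX_BYTES:
        return "xcom"
    return "local_file"
```

An operator built this way could check the object's `ContentLength` before downloading and pick the destination accordingly.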
[jira] [Commented] (AIRFLOW-1894) Rebase and migrate existing Airflow GCP operators to google-python-cloud
[ https://issues.apache.org/jira/browse/AIRFLOW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441453#comment-16441453 ] Guillermo Rodríguez Cano commented on AIRFLOW-1894: --- Hello, do you have any update on this, [~fenglu]? I suggest breaking the work here into smaller chunks. I have recently implemented an S3 to GCS operator, but there are some performance flaws (the file cannot be streamed all the way from S3 to GCS, and a lot of memory is used on the machine doing the task, so this is not reasonable for very large files). I am happy to help, but I am not sure how much progress has already been made on this migration (that is not committed, obviously). > Rebase and migrate existing Airflow GCP operators to google-python-cloud > > > Key: AIRFLOW-1894 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1894 > Project: Apache Airflow > Issue Type: Improvement > Components: contrib > Affects Versions: Airflow 2.0 > Reporter: Feng Lu > Assignee: Feng Lu > Priority: Minor > > [google-api-python-client|https://github.com/google/google-api-python-client] > is in maintenance mode and it's recommended that > [google-cloud-python|https://github.com/GoogleCloudPlatform/google-cloud-python] > should be used whenever possible. Given that we don't have feature parity > between the two libraries, this issue is created to track the long-term > migration efforts moving from google-api-python-client to > google-cloud-python. Here are some general guidelines we try to follow in > this cleanup process: > - add google-cloud-python dependency as part of gcp_api extra packages (make > sure there is no dependency conflict between the two). > - new operators shall be based on google-cloud-python if possible. > - migrate existing GCP operators when the underlying GCP service is available > in google-cloud-python. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work started] (AIRFLOW-2331) Add support for initialization action timeout on dataproc cluster creation.
[ https://issues.apache.org/jira/browse/AIRFLOW-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on AIRFLOW-2331 started by Cristòfol Torrens. -- > Add support for initialization action timeout on dataproc cluster creation. > --- > > Key: AIRFLOW-2331 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2331 > Project: Apache Airflow > Issue Type: Improvement > Components: contrib >Reporter: Cristòfol Torrens >Assignee: Cristòfol Torrens >Priority: Minor > Labels: contrib, dataproc, operator > Fix For: Airflow 2.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Add support to customize timeout for initialization scripts. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2331) Add support for initialization action timeout on dataproc cluster creation.
Cristòfol Torrens created AIRFLOW-2331: -- Summary: Add support for initialization action timeout on dataproc cluster creation. Key: AIRFLOW-2331 URL: https://issues.apache.org/jira/browse/AIRFLOW-2331 Project: Apache Airflow Issue Type: Improvement Components: contrib Reporter: Cristòfol Torrens Assignee: Cristòfol Torrens Fix For: Airflow 2.0 Add support to customize timeout for initialization scripts. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work started] (AIRFLOW-2330) GoogleCloudStorageToGoogleCloudStorageOperator on wildcard appends destination_object even when not given
[ https://issues.apache.org/jira/browse/AIRFLOW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on AIRFLOW-2330 started by Berislav Lopac. --- > GoogleCloudStorageToGoogleCloudStorageOperator on wildcard appends > destination_object even when not given > - > > Key: AIRFLOW-2330 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2330 > Project: Apache Airflow > Issue Type: Bug >Reporter: Berislav Lopac >Assignee: Berislav Lopac >Priority: Major > > Currently, the operator builds the destination like this: > {code} > hook.copy(self.source_bucket, source_object, > self.destination_bucket, "{}/{}".format(self.destination_object, > source_object)) > {code} > If destination is {{None}} (the default) the file will land in > {{None/\{source_object\}}}, and if it's an empty string it goes to > {{/\{source_object\}}}. Basically, it should not prepend > {{destination_object}} if it's empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (AIRFLOW-2330) GoogleCloudStorageToGoogleCloudStorageOperator on wildcard appends destination_object even when not given
[ https://issues.apache.org/jira/browse/AIRFLOW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Berislav Lopac reassigned AIRFLOW-2330: --- Assignee: Berislav Lopac > GoogleCloudStorageToGoogleCloudStorageOperator on wildcard appends > destination_object even when not given > - > > Key: AIRFLOW-2330 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2330 > Project: Apache Airflow > Issue Type: Bug >Reporter: Berislav Lopac >Assignee: Berislav Lopac >Priority: Major > > Currently, the operator builds the destination like this: > {code} > hook.copy(self.source_bucket, source_object, > self.destination_bucket, "{}/{}".format(self.destination_object, > source_object)) > {code} > If destination is {{None}} (the default) the file will land in > {{None/\{source_object\}}}, and if it's an empty string it goes to > {{/\{source_object\}}}. Basically, it should not prepend > {{destination_object}} if it's empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2330) GoogleCloudStorageToGoogleCloudStorageOperator on wildcard appends destination_object even when not given
Berislav Lopac created AIRFLOW-2330: --- Summary: GoogleCloudStorageToGoogleCloudStorageOperator on wildcard appends destination_object even when not given Key: AIRFLOW-2330 URL: https://issues.apache.org/jira/browse/AIRFLOW-2330 Project: Apache Airflow Issue Type: Bug Reporter: Berislav Lopac Currently, the operator builds the destination like this: {code} hook.copy(self.source_bucket, source_object, self.destination_bucket, "{}/{}".format(self.destination_object, source_object)) {code} If destination is {{None}} (the default) the file will land in {{None/\{source_object\}}}, and if it's an empty string it goes to {{/\{source_object\}}}. Basically, it should not prepend {{destination_object}} if it's empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
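The fix the report implies can be sketched as a small helper. This is a sketch under stated assumptions: `build_destination_object` is a hypothetical name, not the actual function in the operator.

```python
def build_destination_object(destination_object, source_object):
    """Hypothetical sketch of the fix for AIRFLOW-2330: prepend the
    destination prefix only when it is non-empty, so a None or "" default
    no longer yields "None/<object>" or "/<object>"."""
    if destination_object:
        # Strip a trailing slash so "backup/" and "backup" behave the same.
        return "{}/{}".format(destination_object.rstrip("/"), source_object)
    return source_object
```

With this helper, the `hook.copy(...)` call shown above would pass `build_destination_object(self.destination_object, source_object)` instead of the unconditional `"{}/{}".format(...)`.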
[jira] [Issue Comment Deleted] (AIRFLOW-1929) TriggerDagRunOperator should allow to set the execution_date
[ https://issues.apache.org/jira/browse/AIRFLOW-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sreenath Kamath updated AIRFLOW-1929: - Comment: was deleted (was: [~bolke] [~ashb] I have raised a PR for this JIRA. Can you please take a look https://github.com/apache/incubator-airflow/pull/3152) > TriggerDagRunOperator should allow to set the execution_date > > > Key: AIRFLOW-1929 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1929 > Project: Apache Airflow > Issue Type: Bug >Affects Versions: 1.9.0, 1.8.2 >Reporter: Bolke de Bruin >Assignee: Bolke de Bruin >Priority: Major > Fix For: 1.9.1 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2329) airflow initdb fails: Specified key was too long; max key length is 767 bytes
Andreas Költringer created AIRFLOW-2329: --- Summary: airflow initdb fails: Specified key was too long; max key length is 767 bytes Key: AIRFLOW-2329 URL: https://issues.apache.org/jira/browse/AIRFLOW-2329 Project: Apache Airflow Issue Type: Bug Components: db Affects Versions: 1.9.0 Reporter: Andreas Költringer It turns out that the default charset in MariaDB is {{utf8mb4}}, and the default maximum key length is 767 bytes ([MariaDB < 10.2.2|https://mariadb.com/kb/en/library/xtradbinnodb-server-system-variables/#innodb_large_prefix] // [MySQL < 5.7.7|https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_large_prefix]). The field {{dag_id}} is defined as {{VARCHAR(250)}}, and {{250 x 4 = 1000 > 767}}, hence the problem. Possible workarounds: * Avoid the database versions in question * Change the encoding to {{utf8}}, which is [not recommended, however|https://stackoverflow.com/a/766996/6699237] * Use a MariaDB/MySQL Docker container A solution in Airflow could be to turn on [{{innodb_large_prefix}} and related configs|https://wiki.archlinux.org/index.php/MySQL#Increase_character_limit]. However, this requires the option {{ROW_FORMAT=DYNAMIC}} (and maybe also {{ENGINE=InnoDB}}) to be set on each CREATE statement. [SqlAlchemy supports this|http://docs.sqlalchemy.org/en/latest/dialects/mysql.html#create-table-arguments-including-storage-engines], but the question is whether Airflow has a built-in mechanism to pass this option in via some config. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
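The arithmetic behind the failure, and the table options the report mentions, can be sketched as follows. The `mysql_*` keyword names follow SQLAlchemy's dialect-specific create-table arguments; whether Airflow exposes a configuration hook for them is exactly the open question in the report.

```python
# Why "airflow initdb" breaks under utf8mb4: an index over a VARCHAR(250)
# column may need up to 4 bytes per character, which exceeds InnoDB's
# default 767-byte index key prefix limit.
VARCHAR_CHARS = 250
UTF8MB4_BYTES_PER_CHAR = 4
INNODB_DEFAULT_KEY_LIMIT = 767

max_key_bytes = VARCHAR_CHARS * UTF8MB4_BYTES_PER_CHAR  # 1000 > 767

# Sketch of per-table options SQLAlchemy can forward to CREATE TABLE:
# each "mysql_<word>" key is emitted as "<WORD>=<value>". ROW_FORMAT=DYNAMIC
# (together with innodb_large_prefix on the server) allows longer index keys.
TABLE_KWARGS = {
    "mysql_engine": "InnoDB",
    "mysql_row_format": "DYNAMIC",
}
```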
[jira] [Resolved] (AIRFLOW-2184) Create a druid_checker operator
[ https://issues.apache.org/jira/browse/AIRFLOW-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong resolved AIRFLOW-2184. --- Resolution: Fixed Fix Version/s: 1.10.0 > Create a druid_checker operator > --- > > Key: AIRFLOW-2184 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2184 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Tao Feng >Assignee: Tao Feng >Priority: Major > Fix For: 1.10.0 > > > Once we agree on the extended interface provided through druid_hook in > AIRFLOW-2183, we would like to create a druid_checker operator to do basic > data quality checking on data in druid. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
incubator-airflow git commit: [AIRFLOW-2184] Add druid_checker_operator
Repository: incubator-airflow Updated Branches: refs/heads/master 6e82f1d7c -> 3f1bfd38c [AIRFLOW-2184] Add druid_checker_operator Closes #3228 from feng-tao/airflow-2184 Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/3f1bfd38 Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/3f1bfd38 Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/3f1bfd38 Branch: refs/heads/master Commit: 3f1bfd38cd5a1c9c58045004390a6a766bec5e8d Parents: 6e82f1d Author: Tao feng Authored: Tue Apr 17 11:12:41 2018 +0200 Committer: Fokko Driesprong Committed: Tue Apr 17 11:12:41 2018 +0200 -- airflow/hooks/druid_hook.py | 7 +- airflow/operators/druid_check_operator.py| 91 +++ docs/code.rst| 1 + tests/operators/test_druid_check_operator.py | 74 ++ 4 files changed, 170 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/3f1bfd38/airflow/hooks/druid_hook.py -- diff --git a/airflow/hooks/druid_hook.py b/airflow/hooks/druid_hook.py index 97f8c4d..e8b61c0 100644 --- a/airflow/hooks/druid_hook.py +++ b/airflow/hooks/druid_hook.py @@ -7,9 +7,9 @@ # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License. 
You may obtain a copy of the License at -# +# # http://www.apache.org/licenses/LICENSE-2.0 -# +# # Unless required by applicable law or agreed to in writing, # software distributed under the License is distributed on an # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY @@ -127,7 +127,8 @@ class DruidDbApiHook(DbApiHook): path=conn.extra_dejson.get('endpoint', '/druid/v2/sql'), scheme=conn.extra_dejson.get('schema', 'http') ) -self.log('Get the connection to druid broker on {host}'.format(host=conn.host)) +self.log.info('Get the connection to druid ' + 'broker on {host}'.format(host=conn.host)) return druid_broker_conn def get_uri(self): http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/3f1bfd38/airflow/operators/druid_check_operator.py -- diff --git a/airflow/operators/druid_check_operator.py b/airflow/operators/druid_check_operator.py new file mode 100644 index 000..73f7915 --- /dev/null +++ b/airflow/operators/druid_check_operator.py @@ -0,0 +1,91 @@ +# -*- coding: utf-8 -*- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +from airflow.exceptions import AirflowException +from airflow.hooks.druid_hook import DruidDbApiHook +from airflow.operators.check_operator import CheckOperator +from airflow.utils.decorators import apply_defaults + + +class DruidCheckOperator(CheckOperator): +""" +Performs checks against Druid. The ``DruidCheckOperator`` expects +a sql query that will return a single row. Each value on that +first row is evaluated using python ``bool`` casting. If any of the +values return ``False`` the check is failed and errors out. + +Note that Python bool casting evals the following as ``False``: + +* ``False`` +* ``0`` +* Empty string (``""``) +* Empty list (``[]``) +* Empty dictionary or set (``{}``) + +Given a query like ``SELECT COUNT(*) FROM foo``, it will fail only if +the count ``== 0``. You can craft much more complex query that could, +for instance, check that the table has the same number of rows as +the source table upstream, or that the count of today's partition is +greater than yesterday's partition, or that a set of metrics are less +than 3 standard deviation for the 7 day average. +This operator can be used as a data quality check in your pipeline, and +depending on where you put it in your DAG, you have the choice to +stop the critical path, preventing from
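The bool-casting semantics described in the docstring above can be illustrated with a small standalone sketch; `first_row_check` is a hypothetical helper for illustration, not part of the committed operator.

```python
def first_row_check(row):
    """Mimic the check described above: every value in the single row
    returned by the query must be truthy under Python bool() casting,
    otherwise the check fails."""
    return all(bool(value) for value in row)
```

For example, a `SELECT COUNT(*) FROM foo` returning `[0]` would fail the check, while `[42]` would pass.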
[jira] [Commented] (AIRFLOW-2184) Create a druid_checker operator
[ https://issues.apache.org/jira/browse/AIRFLOW-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16440642#comment-16440642 ] ASF subversion and git services commented on AIRFLOW-2184: -- Commit 3f1bfd38cd5a1c9c58045004390a6a766bec5e8d in incubator-airflow's branch refs/heads/master from Tao feng [ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=3f1bfd3 ] [AIRFLOW-2184] Add druid_checker_operator Closes #3228 from feng-tao/airflow-2184 > Create a druid_checker operator > --- > > Key: AIRFLOW-2184 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2184 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Tao Feng >Assignee: Tao Feng >Priority: Major > > Once we agree on the extended interface provided through druid_hook in > AIRFLOW-2183, we would like to create a druid_checker operator to do basic > data quality checking on data in druid. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work started] (AIRFLOW-2326) Duplicate GCS copy operator
[ https://issues.apache.org/jira/browse/AIRFLOW-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on AIRFLOW-2326 started by Berislav Lopac. --- > Duplicate GCS copy operator > --- > > Key: AIRFLOW-2326 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2326 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Berislav Lopac >Assignee: Berislav Lopac >Priority: Minor > > I apologise if this is a known thing, but I have been wondering if anyone can > give a rationale for why we have two separate operators that perform Google > Cloud Storage object copies -- specifically, > {{gcs_copy_operator.GoogleCloudStorageCopyOperator}} and > {{gcs_to_gcs.GoogleCloudStorageToGoogleCloudStorageOperator}}. As far as I > can tell they have nearly the same functionality, with the latter being a bit > more flexible (with the {{move_object}} flag). > If both are not needed, I would like to propose removing one of them > (specifically, the {{gcs_copy_operator}} one); if necessary it can be made > into a wrapper/subclass of the other one, marked for deprecation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work started] (AIRFLOW-2222) GoogleCloudStorageHook.copy fails for large files between locations
[ https://issues.apache.org/jira/browse/AIRFLOW-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on AIRFLOW-2222 started by Berislav Lopac. --- > GoogleCloudStorageHook.copy fails for large files between locations > --- > > Key: AIRFLOW-2222 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2222 > Project: Apache Airflow > Issue Type: Bug >Reporter: Berislav Lopac >Assignee: Berislav Lopac >Priority: Major > > When copying large files (confirmed for around 3GB) between buckets in > different projects, the operation fails and the Google API returns error > [413—Payload Too > Large|https://cloud.google.com/storage/docs/json_api/v1/status-codes#413_Payload_Too_Large]. > The documentation for the error says: > {quote}The Cloud Storage JSON API supports up to 5 TB objects. > This error may, alternatively, arise if copying objects between locations > and/or storage classes can not complete within 30 seconds. In this case, use > the > [Rewrite|https://cloud.google.com/storage/docs/json_api/v1/objects/rewrite] > method instead.{quote} > The reason seems to be that {{GoogleCloudStorageHook.copy}} uses the > API's {{copy}} method. > h3. Proposed Solution > There are two potential solutions: > # Implement a {{GoogleCloudStorageHook.rewrite}} method which can be called > from operators and other objects to ensure successful execution. This method > is more flexible but requires changes both in the {{GoogleCloudStorageHook}} > class and any other classes that use it for copying files, to ensure that they > explicitly call {{rewrite}} when needed. > # Modify {{GoogleCloudStorageHook.copy}} to determine when to use {{rewrite}} > instead of {{copy}} underneath. This requires updating only the > {{GoogleCloudStorageHook}} class, but the logic might not cover all the edge > cases and could be difficult to implement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
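The `Rewrite` method the issue proposes is not a single call: the GCS JSON API may return before the copy finishes, and the client must call again with the returned `rewriteToken` until the response reports `done`. A minimal sketch of that driving loop, with the API call injected as a plain callable so it runs without GCS credentials (`rewrite_until_done` and `fake_rewrite` are hypothetical names, not Airflow or google-api-client APIs):

```python
def rewrite_until_done(start_rewrite):
    """Drive the GCS JSON API 'rewrite' protocol: keep calling until the
    service reports done=True, threading rewriteToken through each call.

    ``start_rewrite`` is any callable taking the previous token (or None
    on the first call) and returning a dict shaped like the JSON API's
    rewrite response: {'done': bool, 'rewriteToken': str, ...}.
    Returns the number of API calls that were needed.
    """
    token = None
    calls = 0
    while True:
        response = start_rewrite(token)
        calls += 1
        if response.get('done'):
            return calls
        token = response['rewriteToken']


# Simulate a large cross-location copy that needs three rewrite calls.
responses = iter([
    {'done': False, 'rewriteToken': 't1'},
    {'done': False, 'rewriteToken': 't2'},
    {'done': True},
])
seen_tokens = []

def fake_rewrite(token):
    seen_tokens.append(token)  # record the token threading
    return next(responses)

calls_needed = rewrite_until_done(fake_rewrite)
print(calls_needed)   # 3
print(seen_tokens)    # [None, 't1', 't2']
```

This is why solution 1 in the issue touches every caller: each copy site must be prepared to loop, whereas a plain `copy` call returns once.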
[jira] [Commented] (AIRFLOW-2299) Add S3 Select functionality to S3FileTransformOperator
[ https://issues.apache.org/jira/browse/AIRFLOW-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16440620#comment-16440620 ] ASF subversion and git services commented on AIRFLOW-2299: -- Commit 6e82f1d7c9fa391c636a0155cdb19aa6cbda0821 in incubator-airflow's branch refs/heads/master from [~sekikn] [ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=6e82f1d ] [AIRFLOW-2299] Add S3 Select functionality to S3FileTransformOperator Currently, S3FileTransformOperator downloads the whole file from S3 before transforming and uploading it. Adding an extraction feature using S3 Select to this operator improves its efficiency and usability. Closes #3227 from sekikn/AIRFLOW-2299 > Add S3 Select functionality to S3FileTransformOperator > -- > > Key: AIRFLOW-2299 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2299 > Project: Apache Airflow > Issue Type: Improvement > Components: aws, operators >Reporter: Kengo Seki >Assignee: Kengo Seki >Priority: Major > Fix For: 2.0.0 > > > S3FileTransformOperator downloads the whole file from S3 before transforming > and uploading it, but it's inefficient if the original file is large but the > necessary part is small. > S3 Select, [which became GA > recently|https://aws.amazon.com/about-aws/whats-new/2018/04/amazon-s3-select-is-now-generally-available/], > can improve its efficiency and usability. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AIRFLOW-2299) Add S3 Select functionality to S3FileTransformOperator
[ https://issues.apache.org/jira/browse/AIRFLOW-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong resolved AIRFLOW-2299. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request #3227 [https://github.com/apache/incubator-airflow/pull/3227] > Add S3 Select functionality to S3FileTransformOperator > -- > > Key: AIRFLOW-2299 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2299 > Project: Apache Airflow > Issue Type: Improvement > Components: aws, operators >Reporter: Kengo Seki >Assignee: Kengo Seki >Priority: Major > Fix For: 2.0.0 > > > S3FileTransformOperator downloads the whole file from S3 before transforming > and uploading it, but it's inefficient if the original file is large but the > necessary part is small. > S3 Select, [which became GA > recently|https://aws.amazon.com/about-aws/whats-new/2018/04/amazon-s3-select-is-now-generally-available/], > can improve its efficiency and usability. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
incubator-airflow git commit: [AIRFLOW-2299] Add S3 Select functionality to S3FileTransformOperator
Repository: incubator-airflow
Updated Branches:
  refs/heads/master a14804310 -> 6e82f1d7c

[AIRFLOW-2299] Add S3 Select functionality to S3FileTransformOperator

Currently, S3FileTransformOperator downloads the whole file from S3
before transforming and uploading it. Adding an extraction feature
using S3 Select to this operator improves its efficiency and usability.

Closes #3227 from sekikn/AIRFLOW-2299

Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/6e82f1d7
Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/6e82f1d7
Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/6e82f1d7

Branch: refs/heads/master
Commit: 6e82f1d7c9fa391c636a0155cdb19aa6cbda0821
Parents: a148043
Author: Kengo Seki
Authored: Tue Apr 17 10:53:05 2018 +0200
Committer: Fokko Driesprong
Committed: Tue Apr 17 10:53:05 2018 +0200

--
 airflow/hooks/S3_hook.py                        | 40 +
 airflow/operators/s3_file_transform_operator.py | 59 ++--
 setup.py                                        |  2 +-
 tests/hooks/test_s3_hook.py                     |  8 +++
 .../test_s3_file_transform_operator.py          | 30 +-
 5 files changed, 121 insertions(+), 18 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/6e82f1d7/airflow/hooks/S3_hook.py
--
diff --git a/airflow/hooks/S3_hook.py b/airflow/hooks/S3_hook.py
index f75f5e6..7a4b8b0 100644
--- a/airflow/hooks/S3_hook.py
+++ b/airflow/hooks/S3_hook.py
@@ -177,6 +177,46 @@ class S3Hook(AwsHook):
         obj = self.get_key(key, bucket_name)
         return obj.get()['Body'].read().decode('utf-8')

+    def select_key(self, key, bucket_name=None,
+                   expression='SELECT * FROM S3Object',
+                   expression_type='SQL',
+                   input_serialization={'CSV': {}},
+                   output_serialization={'CSV': {}}):
+        """
+        Reads a key with S3 Select.
+
+        :param key: S3 key that will point to the file
+        :type key: str
+        :param bucket_name: Name of the bucket in which the file is stored
+        :type bucket_name: str
+        :param expression: S3 Select expression
+        :type expression: str
+        :param expression_type: S3 Select expression type
+        :type expression_type: str
+        :param input_serialization: S3 Select input data serialization format
+        :type input_serialization: str
+        :param output_serialization: S3 Select output data serialization format
+        :type output_serialization: str
+
+        .. seealso::
+            For more details about S3 Select parameters:
+            http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.select_object_content
+        """
+        if not bucket_name:
+            (bucket_name, key) = self.parse_s3_url(key)
+
+        response = self.get_conn().select_object_content(
+            Bucket=bucket_name,
+            Key=key,
+            Expression=expression,
+            ExpressionType=expression_type,
+            InputSerialization=input_serialization,
+            OutputSerialization=output_serialization)
+
+        return ''.join(event['Records']['Payload']
+                       for event in response['Payload']
+                       if 'Records' in event)
+
     def check_for_wildcard_key(self, wildcard_key, bucket_name=None,
                                delimiter=''):
         """

http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/6e82f1d7/airflow/operators/s3_file_transform_operator.py
--
diff --git a/airflow/operators/s3_file_transform_operator.py b/airflow/operators/s3_file_transform_operator.py
index 1d39ace..67286b0 100644
--- a/airflow/operators/s3_file_transform_operator.py
+++ b/airflow/operators/s3_file_transform_operator.py
@@ -36,10 +36,13 @@ class S3FileTransformOperator(BaseOperator):
     The locations of the source and the destination files in the local
     filesystem is provided as an first and second arguments to the
     transformation script. The transformation script is expected to read the
-    data from source , transform it and write the output to the local
+    data from source, transform it and write the output to the local
     destination file.
     The operator then takes over control and uploads the local
     destination file to S3.

+    S3 Select is also available to filter the source contents. Users can
+    omit the transformation script if S3 Select expression is specified.
+
     :param source_s3_key: The key to be retrieved from S3
     :type source_s3_key: str
     :param source_aws_conn_id: source s3 conne
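The `select_key` return expression in the S3_hook diff above stitches the streamed `Records` events from `select_object_content` back into one string, skipping the interleaved `Stats`/`End` events. A minimal standalone sketch of that join against a faked event stream; note that boto3 documents the payload as `bytes`, so this sketch decodes it before joining (`join_select_records` and `fake_stream` are illustrative names, not part of the committed hook):

```python
def join_select_records(events):
    """Collect the 'Records' payloads from an S3 Select event stream,
    mirroring the join at the end of S3Hook.select_key. Non-Records
    events (Stats, Progress, End) carry no row data and are skipped.
    """
    return ''.join(
        event['Records']['Payload'].decode('utf-8')
        for event in events
        if 'Records' in event
    )


# A fake select_object_content()['Payload'] stream: Records events
# interleaved with Stats/End events, which must be filtered out.
fake_stream = [
    {'Records': {'Payload': b'a,b\n'}},
    {'Stats': {'Details': {}}},
    {'Records': {'Payload': b'c,d\n'}},
    {'End': {}},
]
result = join_select_records(fake_stream)
# result == 'a,b\nc,d\n'
```

The skip-and-join shape matters because S3 Select can split one logical CSV row across two `Records` payloads, so the payloads must be concatenated before any line-oriented parsing.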