[jira] [Resolved] (AIRFLOW-2335) Issue downloading oracle jdk8 is preventing travis builds from running
[ https://issues.apache.org/jira/browse/AIRFLOW-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arthur Wiedmer resolved AIRFLOW-2335. - Resolution: Fixed Fix Version/s: 1.10.0 Issue resolved by pull request #3236 [https://github.com/apache/incubator-airflow/pull/3236] > Issue downloading oracle jdk8 is preventing travis builds from running > -- > > Key: AIRFLOW-2335 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2335 > Project: Apache Airflow > Issue Type: Bug > Reporter: Daniel Imberman > Assignee: Daniel Imberman > Priority: Major > Fix For: 1.10.0 > > > Currently, all Airflow builds are dying after ~1 minute due to an issue with > how Travis pulls jdk8 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AIRFLOW-2335) Issue downloading oracle jdk8 is preventing travis builds from running
[ https://issues.apache.org/jira/browse/AIRFLOW-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441901#comment-16441901 ] ASF subversion and git services commented on AIRFLOW-2335: -- Commit 0f8507ae351787e086d1d1038f6f0ba52e6d9aaa in incubator-airflow's branch refs/heads/master from [~dimberman] [ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=0f8507a ] [AIRFLOW-2335] fix issue with jdk8 download for ci Make sure you have checked _all_ steps below. - [x] My PR addresses the following [Airflow JIRA](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR" - https://issues.apache.org/jira/browse/AIRFLOW-2335 - In case you are fixing a typo in the documentation you can prepend your commit with \[AIRFLOW-XXX\]; code changes always need a JIRA issue. - [x] Here are some details about my PR, including screenshots of any UI changes: There is an issue with Travis pulling jdk8 that is preventing CI jobs from running. This blocks further development of the project. Reference: https://github.com/travis-ci/travis-ci/issues/9512#issuecomment-382235301 - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: This PR can't be unit tested since it is just configuration. However, the fact that unit tests run successfully should show that it works. - [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 2. Subject is limited to 50 characters 3. Subject does not end with a period 4. Subject uses the imperative mood ("add", not "adding") 5. Body wraps at 72 characters 6. Body explains "what" and "why", not "how" - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added. - [ ] Passes `git diff upstream/master -u -- "*.py" | flake8 --diff` Closes #3236 from dimberman/AIRFLOW-2335_travis_issue > Issue downloading oracle jdk8 is preventing travis builds from running > -- > > Key: AIRFLOW-2335 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2335 > Project: Apache Airflow > Issue Type: Bug > Reporter: Daniel Imberman > Assignee: Daniel Imberman > Priority: Major > > Currently, all Airflow builds are dying after ~1 minute due to an issue with > how Travis pulls jdk8 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
incubator-airflow git commit: [AIRFLOW-2335] fix issue with jdk8 download for ci
Repository: incubator-airflow Updated Branches: refs/heads/master 3f1bfd38c -> 0f8507ae3 [AIRFLOW-2335] fix issue with jdk8 download for ci Make sure you have checked _all_ steps below. - [x] My PR addresses the following [Airflow JIRA](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR" - https://issues.apache.org/jira/browse/AIRFLOW-2335 - In case you are fixing a typo in the documentation you can prepend your commit with \[AIRFLOW-XXX\]; code changes always need a JIRA issue. - [x] Here are some details about my PR, including screenshots of any UI changes: There is an issue with Travis pulling jdk8 that is preventing CI jobs from running. This blocks further development of the project. Reference: https://github.com/travis-ci/travis-ci/issues/9512#issuecomment-382235301 - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: This PR can't be unit tested since it is just configuration. However, the fact that unit tests run successfully should show that it works. - [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 2. Subject is limited to 50 characters 3. Subject does not end with a period 4. Subject uses the imperative mood ("add", not "adding") 5. Body wraps at 72 characters 6. Body explains "what" and "why", not "how" - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.
- [ ] Passes `git diff upstream/master -u -- "*.py" | flake8 --diff` Closes #3236 from dimberman/AIRFLOW-2335_travis_issue Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/0f8507ae Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/0f8507ae Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/0f8507ae Branch: refs/heads/master Commit: 0f8507ae351787e086d1d1038f6f0ba52e6d9aaa Parents: 3f1bfd3 Author: Daniel Imberman Authored: Tue Apr 17 21:57:14 2018 -0700 Committer: Arthur Wiedmer Committed: Tue Apr 17 21:57:42 2018 -0700 -- .travis.yml | 17 ++--- 1 file changed, 14 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/0f8507ae/.travis.yml -- diff --git a/.travis.yml b/.travis.yml index d9a333d..883473d 100644 --- a/.travis.yml +++ b/.travis.yml @@ -6,9 +6,9 @@ # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License.
You may obtain a copy of the License at -# +# # http://www.apache.org/licenses/LICENSE-2.0 -# +# # Unless required by applicable law or agreed to in writing, # software distributed under the License is distributed on an # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY @@ -37,7 +37,6 @@ addons: - krb5-user - krb5-kdc - krb5-admin-server - - oracle-java8-installer - python-selinux postgresql: "9.2" python: @@ -93,7 +92,19 @@ before_install: - cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys - ln -s ~/.ssh/authorized_keys ~/.ssh/authorized_keys2 - chmod 600 ~/.ssh/* + - sudo add-apt-repository -y ppa:webupd8team/java + - sudo apt-get update + - sudo apt-get install -y oracle-java8-installer || true + #todo remove this kludge and the above || true when the ppa is fixed + - cd /var/lib/dpkg/info + - sudo sed -i 's|JAVA_VERSION=8u161|JAVA_VERSION=8u172|' oracle-java8-installer.* + - sudo sed -i 's|PARTNER_URL=http://download.oracle.com/otn-pub/java/jdk/8u161-b12/2f38c3b165be4555a1fa6e98c45e0808/|PARTNER_URL=http://download.oracle.com/otn-pub/java/jdk/8u172-b11/a58eab1ec242421181065cdc37240b08/|' oracle-java8-installer.* + - sudo sed -i 's|SHA256SUM_TGZ="6dbc56a0e3310b69e91bb64db63a485bd7b6a8083f08e48047276380a0e2021e"|SHA256SUM_TGZ="28a00b9400b6913563553e09e8024c286b506d8523334c93ddec6c9ec7e9d346"|' oracle-java8-installer.* + - sudo sed -i 's|J_DIR=jdk1.8.0_161|J_DIR=jdk1.8.0_172|' oracle-java8-installer.* + - sudo apt-get update + - sudo apt-get install -y oracle-java8-installer - jdk_switcher use oraclejdk8 + - cd $TRAVIS_BUILD_DIR install: - pip install --upgrade pip - pip install tox
[jira] [Created] (AIRFLOW-2335) Issue downloading oracle jdk8 is preventing travis builds from running
Daniel Imberman created AIRFLOW-2335: Summary: Issue downloading oracle jdk8 is preventing travis builds from running Key: AIRFLOW-2335 URL: https://issues.apache.org/jira/browse/AIRFLOW-2335 Project: Apache Airflow Issue Type: Bug Reporter: Daniel Imberman Assignee: Daniel Imberman Currently, all Airflow builds are dying after ~1 minute due to an issue with how Travis pulls jdk8 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2334) Add AWS Database Migration Service operators and sensors
Jordan Zucker created AIRFLOW-2334: -- Summary: Add AWS Database Migration Service operators and sensors Key: AIRFLOW-2334 URL: https://issues.apache.org/jira/browse/AIRFLOW-2334 Project: Apache Airflow Issue Type: New Feature Components: aws, contrib, operators Reporter: Jordan Zucker Assignee: Jordan Zucker [AWS Database Migration Service|https://aws.amazon.com/dms/] allows for long-running, asynchronous tasks that move, copy, and back up databases, which makes it a good candidate for management with Airflow. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2333) Add Segment Hook to Airflow
Jordan Zucker created AIRFLOW-2333: -- Summary: Add Segment Hook to Airflow Key: AIRFLOW-2333 URL: https://issues.apache.org/jira/browse/AIRFLOW-2333 Project: Apache Airflow Issue Type: New Feature Components: contrib, hooks Reporter: Jordan Zucker Assignee: Jordan Zucker [Segment|https://segment.com/] is used by many to track analytics. It would be nice to allow Airflow to interact with Segment and to store the username and password encrypted in its database. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AIRFLOW-1894) Rebase and migrate existing Airflow GCP operators to google-python-cloud
[ https://issues.apache.org/jira/browse/AIRFLOW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441527#comment-16441527 ] Guillermo Rodríguez Cano commented on AIRFLOW-1894: --- I did read your comment, [~yiga2], and I did not say anything about that in my comment. (In fact, I think what you did could complement the standard google-cloud-python library; I haven't checked it enough to conclude whether it is possible to stream a file or not.) I assume the google-cloud-python library performs better than the one currently used, but the changes required in the hooks are considerable, so I was requesting more information on that and offered my help to make the change (and to figure out whether it is possible to emulate the storage transfer service you point to, which I guess is the same as the transfer service offered in the UI of the Google Cloud Console). > Rebase and migrate existing Airflow GCP operators to google-python-cloud > > > Key: AIRFLOW-1894 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1894 > Project: Apache Airflow > Issue Type: Improvement > Components: contrib > Affects Versions: Airflow 2.0 > Reporter: Feng Lu > Assignee: Feng Lu > Priority: Minor > > [google-api-python-client|https://github.com/google/google-api-python-client] > is in maintenance mode and it's recommended that > [google-cloud-python|https://github.com/GoogleCloudPlatform/google-cloud-python] > should be used whenever possible. Given that we don't have feature parity > between the two libraries, this issue is created to track the long-term > migration efforts moving from google-api-python-client to > google-cloud-python. Here are some general guidelines we try to follow in > this cleanup process: > - add google-cloud-python dependency as part of gcp_api extra packages (make > sure there is no dependency conflict between the two). > - new operators shall be based on google-cloud-python if possible.
> - migrate existing GCP operators when the underlying GCP service is available > in google-cloud-python. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AIRFLOW-1894) Rebase and migrate existing Airflow GCP operators to google-python-cloud
[ https://issues.apache.org/jira/browse/AIRFLOW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441510#comment-16441510 ] Yannick Einsweiler commented on AIRFLOW-1894: - [~wileeam] read my initial comment: the idea was to leverage [https://cloud.google.com/storage/transfer/reference/rest/] *without* the need for a local or virtual machine passthrough. We ended up writing that operator (and a sensor that checks when the job is complete), all triggered from within GCloud, but we have since abandoned the project and now have our AWS Lambda save its output to GCS directly. > Rebase and migrate existing Airflow GCP operators to google-python-cloud > > > Key: AIRFLOW-1894 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1894 > Project: Apache Airflow > Issue Type: Improvement > Components: contrib > Affects Versions: Airflow 2.0 > Reporter: Feng Lu > Assignee: Feng Lu > Priority: Minor > > [google-api-python-client|https://github.com/google/google-api-python-client] > is in maintenance mode and it's recommended that > [google-cloud-python|https://github.com/GoogleCloudPlatform/google-cloud-python] > should be used whenever possible. Given that we don't have feature parity > between the two libraries, this issue is created to track the long-term > migration efforts moving from google-api-python-client to > google-cloud-python. Here are some general guidelines we try to follow in > this cleanup process: > - add google-cloud-python dependency as part of gcp_api extra packages (make > sure there is no dependency conflict between the two). > - new operators shall be based on google-cloud-python if possible. > - migrate existing GCP operators when the underlying GCP service is available > in google-cloud-python. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2332) S3 download operator
Guillermo Rodríguez Cano created AIRFLOW-2332: - Summary: S3 download operator Key: AIRFLOW-2332 URL: https://issues.apache.org/jira/browse/AIRFLOW-2332 Project: Apache Airflow Issue Type: New Feature Components: aws Reporter: Guillermo Rodríguez Cano Assignee: Guillermo Rodríguez Cano An operator that downloads a remote file from S3 to the machine where Airflow is running (or saves it in XCom if it is not too big). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AIRFLOW-2332) S3 Download Operator
[ https://issues.apache.org/jira/browse/AIRFLOW-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guillermo Rodríguez Cano updated AIRFLOW-2332: -- Summary: S3 Download Operator (was: S3 download operator) > S3 Download Operator > > > Key: AIRFLOW-2332 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2332 > Project: Apache Airflow > Issue Type: New Feature > Components: aws > Reporter: Guillermo Rodríguez Cano > Assignee: Guillermo Rodríguez Cano > Priority: Major > > An operator that downloads a remote file from S3 to the machine where > Airflow is running (or saves it in XCom if it is not too big). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
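The trade-off hinted at in the issue description (local file vs. XCom for small objects) can be sketched as follows. This is an illustrative sketch only: `choose_delivery` is a hypothetical helper, and `XCOM_MAX_BYTES` is an assumed threshold, not an Airflow constant.

```python
# Illustrative sketch of the delivery decision in AIRFLOW-2332: push small
# S3 objects to XCom, write larger ones to local disk on the worker.
# XCom values live in Airflow's metadata database, so they should stay small;
# the exact cutoff below is an assumption for illustration.
XCOM_MAX_BYTES = 48 * 1024


def choose_delivery(object_size_bytes):
    """Return "xcom" for small objects and "local_file" for large ones."""
    if object_size_bytes <= XCOM_MAX_BYTES:
        return "xcom"
    return "local_file"
```

An operator built this way could check the object's `ContentLength` before downloading and pick the destination accordingly.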
[jira] [Commented] (AIRFLOW-1894) Rebase and migrate existing Airflow GCP operators to google-python-cloud
[ https://issues.apache.org/jira/browse/AIRFLOW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441453#comment-16441453 ] Guillermo Rodríguez Cano commented on AIRFLOW-1894: --- Hello, do you have any update on this, [~fenglu]? I suggest breaking the work here into smaller chunks. I have recently implemented an S3 to GCS operator, but there are some performance flaws (the file cannot be streamed all the way from S3 to GCS, and a lot of memory is used on the machine doing the task, so this is not reasonable for very large files). I am happy to help, but I am not sure how much progress has already been made on this migration (that is not committed, obviously). > Rebase and migrate existing Airflow GCP operators to google-python-cloud > > > Key: AIRFLOW-1894 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1894 > Project: Apache Airflow > Issue Type: Improvement > Components: contrib > Affects Versions: Airflow 2.0 > Reporter: Feng Lu > Assignee: Feng Lu > Priority: Minor > > [google-api-python-client|https://github.com/google/google-api-python-client] > is in maintenance mode and it's recommended that > [google-cloud-python|https://github.com/GoogleCloudPlatform/google-cloud-python] > should be used whenever possible. Given that we don't have feature parity > between the two libraries, this issue is created to track the long-term > migration efforts moving from google-api-python-client to > google-cloud-python. Here are some general guidelines we try to follow in > this cleanup process: > - add google-cloud-python dependency as part of gcp_api extra packages (make > sure there is no dependency conflict between the two). > - new operators shall be based on google-cloud-python if possible. > - migrate existing GCP operators when the underlying GCP service is available > in google-cloud-python. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work started] (AIRFLOW-2331) Add support for initialization action timeout on dataproc cluster creation.
[ https://issues.apache.org/jira/browse/AIRFLOW-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on AIRFLOW-2331 started by Cristòfol Torrens. -- > Add support for initialization action timeout on dataproc cluster creation. > --- > > Key: AIRFLOW-2331 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2331 > Project: Apache Airflow > Issue Type: Improvement > Components: contrib >Reporter: Cristòfol Torrens >Assignee: Cristòfol Torrens >Priority: Minor > Labels: contrib, dataproc, operator > Fix For: Airflow 2.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Add support to customize timeout for initialization scripts. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2331) Add support for initialization action timeout on dataproc cluster creation.
Cristòfol Torrens created AIRFLOW-2331: -- Summary: Add support for initialization action timeout on dataproc cluster creation. Key: AIRFLOW-2331 URL: https://issues.apache.org/jira/browse/AIRFLOW-2331 Project: Apache Airflow Issue Type: Improvement Components: contrib Reporter: Cristòfol Torrens Assignee: Cristòfol Torrens Fix For: Airflow 2.0 Add support to customize timeout for initialization scripts. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work started] (AIRFLOW-2330) GoogleCloudStorageToGoogleCloudStorageOperator on wildcard appends destination_object even when not given
[ https://issues.apache.org/jira/browse/AIRFLOW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on AIRFLOW-2330 started by Berislav Lopac. --- > GoogleCloudStorageToGoogleCloudStorageOperator on wildcard appends > destination_object even when not given > - > > Key: AIRFLOW-2330 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2330 > Project: Apache Airflow > Issue Type: Bug >Reporter: Berislav Lopac >Assignee: Berislav Lopac >Priority: Major > > Currently, the operator builds the destination like this: > {code} > hook.copy(self.source_bucket, source_object, > self.destination_bucket, "{}/{}".format(self.destination_object, > source_object)) > {code} > If destination is {{None}} (the default) the file will land in > {{None/\{source_object\}}}, and if it's an empty string it goes to > {{/\{source_object\}}}. Basically, it should not prepend > {{destination_object}} if it's empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (AIRFLOW-2330) GoogleCloudStorageToGoogleCloudStorageOperator on wildcard appends destination_object even when not given
[ https://issues.apache.org/jira/browse/AIRFLOW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Berislav Lopac reassigned AIRFLOW-2330: --- Assignee: Berislav Lopac > GoogleCloudStorageToGoogleCloudStorageOperator on wildcard appends > destination_object even when not given > - > > Key: AIRFLOW-2330 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2330 > Project: Apache Airflow > Issue Type: Bug >Reporter: Berislav Lopac >Assignee: Berislav Lopac >Priority: Major > > Currently, the operator builds the destination like this: > {code} > hook.copy(self.source_bucket, source_object, > self.destination_bucket, "{}/{}".format(self.destination_object, > source_object)) > {code} > If destination is {{None}} (the default) the file will land in > {{None/\{source_object\}}}, and if it's an empty string it goes to > {{/\{source_object\}}}. Basically, it should not prepend > {{destination_object}} if it's empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2330) GoogleCloudStorageToGoogleCloudStorageOperator on wildcard appends destination_object even when not given
Berislav Lopac created AIRFLOW-2330: --- Summary: GoogleCloudStorageToGoogleCloudStorageOperator on wildcard appends destination_object even when not given Key: AIRFLOW-2330 URL: https://issues.apache.org/jira/browse/AIRFLOW-2330 Project: Apache Airflow Issue Type: Bug Reporter: Berislav Lopac Currently, the operator builds the destination like this: {code} hook.copy(self.source_bucket, source_object, self.destination_bucket, "{}/{}".format(self.destination_object, source_object)) {code} If destination is {{None}} (the default) the file will land in {{None/\{source_object\}}}, and if it's an empty string it goes to {{/\{source_object\}}}. Basically, it should not prepend {{destination_object}} if it's empty. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
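The fix the report implies can be sketched as a small helper. This is a sketch under stated assumptions: `build_destination_object` is a hypothetical name, not the actual function in the operator.

```python
def build_destination_object(destination_object, source_object):
    """Hypothetical sketch of the fix for AIRFLOW-2330: prepend the
    destination prefix only when it is non-empty, so a None or "" default
    no longer yields "None/<object>" or "/<object>"."""
    if destination_object:
        # Strip a trailing slash so "backup/" and "backup" behave the same.
        return "{}/{}".format(destination_object.rstrip("/"), source_object)
    return source_object
```

With this helper, the `hook.copy(...)` call shown above would pass `build_destination_object(self.destination_object, source_object)` instead of the unconditional `"{}/{}".format(...)`.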
[jira] [Issue Comment Deleted] (AIRFLOW-1929) TriggerDagRunOperator should allow to set the execution_date
[ https://issues.apache.org/jira/browse/AIRFLOW-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sreenath Kamath updated AIRFLOW-1929: - Comment: was deleted (was: [~bolke] [~ashb] I have raised a PR for this JIRA. Can you please take a look https://github.com/apache/incubator-airflow/pull/3152) > TriggerDagRunOperator should allow to set the execution_date > > > Key: AIRFLOW-1929 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1929 > Project: Apache Airflow > Issue Type: Bug >Affects Versions: 1.9.0, 1.8.2 >Reporter: Bolke de Bruin >Assignee: Bolke de Bruin >Priority: Major > Fix For: 1.9.1 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AIRFLOW-2329) airflow initdb fails: Specified key was too long; max key length is 767 bytes
Andreas Költringer created AIRFLOW-2329: --- Summary: airflow initdb fails: Specified key was too long; max key length is 767 bytes Key: AIRFLOW-2329 URL: https://issues.apache.org/jira/browse/AIRFLOW-2329 Project: Apache Airflow Issue Type: Bug Components: db Affects Versions: 1.9.0 Reporter: Andreas Költringer It turns out that the default charset in MariaDB is {{utf8mb4}}, and the default maximum key length is 767 bytes ([MariaDB < 10.2.2|https://mariadb.com/kb/en/library/xtradbinnodb-server-system-variables/#innodb_large_prefix] // [MySQL < 5.7.7|https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_large_prefix]). The field {{dag_id}} is defined as {{VARCHAR(250)}}, and {{250 x 4 = 1000 > 767}}, hence the problem. Possible workarounds: * Avoid the database versions in question * Change the encoding to {{utf8}}, which is [not recommended, however|https://stackoverflow.com/a/766996/6699237] * Use a MariaDB/MySQL Docker container A solution in Airflow could be to turn on [{{innodb_large_prefix}} and related configs|https://wiki.archlinux.org/index.php/MySQL#Increase_character_limit]. However, this requires the option {{ROW_FORMAT=DYNAMIC}} (and maybe also {{ENGINE=InnoDB}}) to be set on each CREATE statement. [SqlAlchemy supports this|http://docs.sqlalchemy.org/en/latest/dialects/mysql.html#create-table-arguments-including-storage-engines], but the question is whether Airflow has a built-in mechanism to pass this option in via some config. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
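The arithmetic behind the failure, and the table options the report mentions, can be sketched as follows. The `mysql_*` keyword names follow SQLAlchemy's dialect-specific create-table arguments; whether Airflow exposes a configuration hook for them is exactly the open question in the report.

```python
# Why "airflow initdb" breaks under utf8mb4: an index over a VARCHAR(250)
# column may need up to 4 bytes per character, which exceeds InnoDB's
# default 767-byte index key prefix limit.
VARCHAR_CHARS = 250
UTF8MB4_BYTES_PER_CHAR = 4
INNODB_DEFAULT_KEY_LIMIT = 767

max_key_bytes = VARCHAR_CHARS * UTF8MB4_BYTES_PER_CHAR  # 1000 > 767

# Sketch of per-table options SQLAlchemy can forward to CREATE TABLE:
# each "mysql_<word>" key is emitted as "<WORD>=<value>". ROW_FORMAT=DYNAMIC
# (together with innodb_large_prefix on the server) allows longer index keys.
TABLE_KWARGS = {
    "mysql_engine": "InnoDB",
    "mysql_row_format": "DYNAMIC",
}
```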
[jira] [Resolved] (AIRFLOW-2184) Create a druid_checker operator
[ https://issues.apache.org/jira/browse/AIRFLOW-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong resolved AIRFLOW-2184. --- Resolution: Fixed Fix Version/s: 1.10.0 > Create a druid_checker operator > --- > > Key: AIRFLOW-2184 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2184 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Tao Feng >Assignee: Tao Feng >Priority: Major > Fix For: 1.10.0 > > > Once we agree on the extended interface provided through druid_hook in > AIRFLOW-2183, we would like to create a druid_checker operator to do basic > data quality checking on data in druid. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
incubator-airflow git commit: [AIRFLOW-2184] Add druid_checker_operator
Repository: incubator-airflow Updated Branches: refs/heads/master 6e82f1d7c -> 3f1bfd38c [AIRFLOW-2184] Add druid_checker_operator Closes #3228 from feng-tao/airflow-2184 Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/3f1bfd38 Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/3f1bfd38 Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/3f1bfd38 Branch: refs/heads/master Commit: 3f1bfd38cd5a1c9c58045004390a6a766bec5e8d Parents: 6e82f1d Author: Tao feng Authored: Tue Apr 17 11:12:41 2018 +0200 Committer: Fokko Driesprong Committed: Tue Apr 17 11:12:41 2018 +0200 -- airflow/hooks/druid_hook.py | 7 +- airflow/operators/druid_check_operator.py| 91 +++ docs/code.rst| 1 + tests/operators/test_druid_check_operator.py | 74 ++ 4 files changed, 170 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/3f1bfd38/airflow/hooks/druid_hook.py -- diff --git a/airflow/hooks/druid_hook.py b/airflow/hooks/druid_hook.py index 97f8c4d..e8b61c0 100644 --- a/airflow/hooks/druid_hook.py +++ b/airflow/hooks/druid_hook.py @@ -7,9 +7,9 @@ # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License. 
You may obtain a copy of the License at -# +# # http://www.apache.org/licenses/LICENSE-2.0 -# +# # Unless required by applicable law or agreed to in writing, # software distributed under the License is distributed on an # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY @@ -127,7 +127,8 @@ class DruidDbApiHook(DbApiHook): path=conn.extra_dejson.get('endpoint', '/druid/v2/sql'), scheme=conn.extra_dejson.get('schema', 'http') ) -self.log('Get the connection to druid broker on {host}'.format(host=conn.host)) +self.log.info('Get the connection to druid ' + 'broker on {host}'.format(host=conn.host)) return druid_broker_conn def get_uri(self): http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/3f1bfd38/airflow/operators/druid_check_operator.py -- diff --git a/airflow/operators/druid_check_operator.py b/airflow/operators/druid_check_operator.py new file mode 100644 index 000..73f7915 --- /dev/null +++ b/airflow/operators/druid_check_operator.py @@ -0,0 +1,91 @@ +# -*- coding: utf-8 -*- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +from airflow.exceptions import AirflowException +from airflow.hooks.druid_hook import DruidDbApiHook +from airflow.operators.check_operator import CheckOperator +from airflow.utils.decorators import apply_defaults + + +class DruidCheckOperator(CheckOperator): +""" +Performs checks against Druid. The ``DruidCheckOperator`` expects +a sql query that will return a single row. Each value on that +first row is evaluated using python ``bool`` casting. If any of the +values return ``False`` the check is failed and errors out. + +Note that Python bool casting evals the following as ``False``: + +* ``False`` +* ``0`` +* Empty string (``""``) +* Empty list (``[]``) +* Empty dictionary or set (``{}``) + +Given a query like ``SELECT COUNT(*) FROM foo``, it will fail only if +the count ``== 0``. You can craft much more complex query that could, +for instance, check that the table has the same number of rows as +the source table upstream, or that the count of today's partition is +greater than yesterday's partition, or that a set of metrics are less +than 3 standard deviation for the 7 day average. +This operator can be used as a data quality check in your pipeline, and +depending on where you put it in your DAG, you have the choice to +stop the critical path, preventing from
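The bool-casting semantics described in the docstring above can be illustrated with a small standalone sketch; `first_row_check` is a hypothetical helper for illustration, not part of the committed operator.

```python
def first_row_check(row):
    """Mimic the check described above: every value in the single row
    returned by the query must be truthy under Python bool() casting,
    otherwise the check fails."""
    return all(bool(value) for value in row)
```

For example, a `SELECT COUNT(*) FROM foo` returning `[0]` would fail the check, while `[42]` would pass.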
[jira] [Commented] (AIRFLOW-2184) Create a druid_checker operator
[ https://issues.apache.org/jira/browse/AIRFLOW-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16440642#comment-16440642 ] ASF subversion and git services commented on AIRFLOW-2184: -- Commit 3f1bfd38cd5a1c9c58045004390a6a766bec5e8d in incubator-airflow's branch refs/heads/master from Tao feng [ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=3f1bfd3 ] [AIRFLOW-2184] Add druid_checker_operator Closes #3228 from feng-tao/airflow-2184 > Create a druid_checker operator > --- > > Key: AIRFLOW-2184 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2184 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Tao Feng >Assignee: Tao Feng >Priority: Major > > Once we agree on the extended interface provided through druid_hook in > AIRFLOW-2183, we would like to create a druid_checker operator to do basic > data quality checking on data in druid. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work started] (AIRFLOW-2326) Duplicate GCS copy operator
[ https://issues.apache.org/jira/browse/AIRFLOW-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on AIRFLOW-2326 started by Berislav Lopac. --- > Duplicate GCS copy operator > --- > > Key: AIRFLOW-2326 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2326 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Berislav Lopac >Assignee: Berislav Lopac >Priority: Minor > > I apologise if this is a known thing, but I have been wondering if anyone can > give a rationale for why we have two separate operators that perform Google > Cloud Storage object copies -- specifically, > {{gcs_copy_operator.GoogleCloudStorageCopyOperator}} and > {{gcs_to_gcs.GoogleCloudStorageToGoogleCloudStorageOperator}}. As far as I > can tell they have nearly the same functionality, with the latter being a bit > more flexible (with the {{move_object}} flag). > If both are not needed, I would like to propose removing one of them > (specifically, the {{gcs_copy_operator}} one); if necessary it can be made > into a wrapper/subclass of the other one, marked for deprecation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work started] (AIRFLOW-2222) GoogleCloudStorageHook.copy fails for large files between locations
[ https://issues.apache.org/jira/browse/AIRFLOW-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on AIRFLOW-2222 started by Berislav Lopac. --- > GoogleCloudStorageHook.copy fails for large files between locations > --- > > Key: AIRFLOW-2222 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2222 > Project: Apache Airflow > Issue Type: Bug >Reporter: Berislav Lopac >Assignee: Berislav Lopac >Priority: Major > > When copying large files (confirmed for around 3GB) between buckets in > different projects, the operation fails and the Google API returns error > [413—Payload Too > Large|https://cloud.google.com/storage/docs/json_api/v1/status-codes#413_Payload_Too_Large]. > The documentation for the error says: > {quote}The Cloud Storage JSON API supports up to 5 TB objects. > This error may, alternatively, arise if copying objects between locations > and/or storage classes can not complete within 30 seconds. In this case, use > the > [Rewrite|https://cloud.google.com/storage/docs/json_api/v1/objects/rewrite] > method instead.{quote} > The reason seems to be that {{GoogleCloudStorageHook.copy}} uses the > API's {{copy}} method. > h3. Proposed Solution > There are two potential solutions: > # Implement a {{GoogleCloudStorageHook.rewrite}} method which can be called > from operators and other objects to ensure successful execution. This method > is more flexible but requires changes both in the {{GoogleCloudStorageHook}} > class and any other classes that use it for copying files, to ensure that they > explicitly call {{rewrite}} when needed. > # Modify {{GoogleCloudStorageHook.copy}} to determine when to use {{rewrite}} > instead of {{copy}} underneath. This requires updating only the > {{GoogleCloudStorageHook}} class, but the logic might not cover all the edge > cases and could be difficult to implement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
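The `Rewrite` method the issue proposes is not a single call: the GCS JSON API may return before the copy finishes, and the client must call again with the returned `rewriteToken` until the response reports `done`. A minimal sketch of that driving loop, with the API call injected as a plain callable so it runs without GCS credentials (`rewrite_until_done` and `fake_rewrite` are hypothetical names, not Airflow or google-api-client APIs):

```python
def rewrite_until_done(start_rewrite):
    """Drive the GCS JSON API 'rewrite' protocol: keep calling until the
    service reports done=True, threading rewriteToken through each call.

    ``start_rewrite`` is any callable taking the previous token (or None
    on the first call) and returning a dict shaped like the JSON API's
    rewrite response: {'done': bool, 'rewriteToken': str, ...}.
    Returns the number of API calls that were needed.
    """
    token = None
    calls = 0
    while True:
        response = start_rewrite(token)
        calls += 1
        if response.get('done'):
            return calls
        token = response['rewriteToken']


# Simulate a large cross-location copy that needs three rewrite calls.
responses = iter([
    {'done': False, 'rewriteToken': 't1'},
    {'done': False, 'rewriteToken': 't2'},
    {'done': True},
])
seen_tokens = []

def fake_rewrite(token):
    seen_tokens.append(token)  # record the token threading
    return next(responses)

calls_needed = rewrite_until_done(fake_rewrite)
print(calls_needed)   # 3
print(seen_tokens)    # [None, 't1', 't2']
```

This is why solution 1 in the issue touches every caller: each copy site must be prepared to loop, whereas a plain `copy` call returns once.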
[jira] [Commented] (AIRFLOW-2299) Add S3 Select functionality to S3FileTransformOperator
[ https://issues.apache.org/jira/browse/AIRFLOW-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16440620#comment-16440620 ] ASF subversion and git services commented on AIRFLOW-2299: -- Commit 6e82f1d7c9fa391c636a0155cdb19aa6cbda0821 in incubator-airflow's branch refs/heads/master from [~sekikn] [ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=6e82f1d ] [AIRFLOW-2299] Add S3 Select functionality to S3FileTransformOperator Currently, S3FileTransformOperator downloads the whole file from S3 before transforming and uploading it. Adding an extraction feature using S3 Select to this operator improves its efficiency and usability. Closes #3227 from sekikn/AIRFLOW-2299 > Add S3 Select functionality to S3FileTransformOperator > -- > > Key: AIRFLOW-2299 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2299 > Project: Apache Airflow > Issue Type: Improvement > Components: aws, operators >Reporter: Kengo Seki >Assignee: Kengo Seki >Priority: Major > Fix For: 2.0.0 > > > S3FileTransformOperator downloads the whole file from S3 before transforming > and uploading it, but it's inefficient if the original file is large but the > necessary part is small. > S3 Select, [which became GA > recently|https://aws.amazon.com/about-aws/whats-new/2018/04/amazon-s3-select-is-now-generally-available/], > can improve its efficiency and usability. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AIRFLOW-2299) Add S3 Select functionality to S3FileTransformOperator
[ https://issues.apache.org/jira/browse/AIRFLOW-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong resolved AIRFLOW-2299. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request #3227 [https://github.com/apache/incubator-airflow/pull/3227] > Add S3 Select functionality to S3FileTransformOperator > -- > > Key: AIRFLOW-2299 > URL: https://issues.apache.org/jira/browse/AIRFLOW-2299 > Project: Apache Airflow > Issue Type: Improvement > Components: aws, operators >Reporter: Kengo Seki >Assignee: Kengo Seki >Priority: Major > Fix For: 2.0.0 > > > S3FileTransformOperator downloads the whole file from S3 before transforming > and uploading it, but it's inefficient if the original file is large but the > necessary part is small. > S3 Select, [which became GA > recently|https://aws.amazon.com/about-aws/whats-new/2018/04/amazon-s3-select-is-now-generally-available/], > can improve its efficiency and usability. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
incubator-airflow git commit: [AIRFLOW-2299] Add S3 Select functionality to S3FileTransformOperator
Repository: incubator-airflow
Updated Branches:
  refs/heads/master a14804310 -> 6e82f1d7c

[AIRFLOW-2299] Add S3 Select functionality to S3FileTransformOperator

Currently, S3FileTransformOperator downloads the whole file from S3
before transforming and uploading it. Adding an extraction feature
using S3 Select to this operator improves its efficiency and usability.

Closes #3227 from sekikn/AIRFLOW-2299

Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/6e82f1d7
Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/6e82f1d7
Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/6e82f1d7

Branch: refs/heads/master
Commit: 6e82f1d7c9fa391c636a0155cdb19aa6cbda0821
Parents: a148043
Author: Kengo Seki
Authored: Tue Apr 17 10:53:05 2018 +0200
Committer: Fokko Driesprong
Committed: Tue Apr 17 10:53:05 2018 +0200

--
 airflow/hooks/S3_hook.py                        | 40 +
 airflow/operators/s3_file_transform_operator.py | 59 ++--
 setup.py                                        |  2 +-
 tests/hooks/test_s3_hook.py                     |  8 +++
 .../test_s3_file_transform_operator.py          | 30 +-
 5 files changed, 121 insertions(+), 18 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/6e82f1d7/airflow/hooks/S3_hook.py
--
diff --git a/airflow/hooks/S3_hook.py b/airflow/hooks/S3_hook.py
index f75f5e6..7a4b8b0 100644
--- a/airflow/hooks/S3_hook.py
+++ b/airflow/hooks/S3_hook.py
@@ -177,6 +177,46 @@ class S3Hook(AwsHook):
         obj = self.get_key(key, bucket_name)
         return obj.get()['Body'].read().decode('utf-8')

+    def select_key(self, key, bucket_name=None,
+                   expression='SELECT * FROM S3Object',
+                   expression_type='SQL',
+                   input_serialization={'CSV': {}},
+                   output_serialization={'CSV': {}}):
+        """
+        Reads a key with S3 Select.
+
+        :param key: S3 key that will point to the file
+        :type key: str
+        :param bucket_name: Name of the bucket in which the file is stored
+        :type bucket_name: str
+        :param expression: S3 Select expression
+        :type expression: str
+        :param expression_type: S3 Select expression type
+        :type expression_type: str
+        :param input_serialization: S3 Select input data serialization format
+        :type input_serialization: str
+        :param output_serialization: S3 Select output data serialization format
+        :type output_serialization: str
+
+        .. seealso::
+            For more details about S3 Select parameters:
+            http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.select_object_content
+        """
+        if not bucket_name:
+            (bucket_name, key) = self.parse_s3_url(key)
+
+        response = self.get_conn().select_object_content(
+            Bucket=bucket_name,
+            Key=key,
+            Expression=expression,
+            ExpressionType=expression_type,
+            InputSerialization=input_serialization,
+            OutputSerialization=output_serialization)
+
+        return ''.join(event['Records']['Payload']
+                       for event in response['Payload']
+                       if 'Records' in event)
+
     def check_for_wildcard_key(self, wildcard_key, bucket_name=None,
                                delimiter=''):
         """

http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/6e82f1d7/airflow/operators/s3_file_transform_operator.py
--
diff --git a/airflow/operators/s3_file_transform_operator.py b/airflow/operators/s3_file_transform_operator.py
index 1d39ace..67286b0 100644
--- a/airflow/operators/s3_file_transform_operator.py
+++ b/airflow/operators/s3_file_transform_operator.py
@@ -36,10 +36,13 @@ class S3FileTransformOperator(BaseOperator):
     The locations of the source and the destination files in the local
     filesystem is provided as an first and second arguments to the
     transformation script. The transformation script is expected to read the
-    data from source , transform it and write the output to the local
+    data from source, transform it and write the output to the local
     destination file.
     The operator then takes over control and uploads the local
     destination file to S3.

+    S3 Select is also available to filter the source contents. Users can
+    omit the transformation script if S3 Select expression is specified.
+
     :param source_s3_key: The key to be retrieved from S3
     :type source_s3_key: str
     :param source_aws_conn_id: source s3 conne
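The `select_key` return expression in the S3_hook diff above stitches the streamed `Records` events from `select_object_content` back into one string, skipping the interleaved `Stats`/`End` events. A minimal standalone sketch of that join against a faked event stream; note that boto3 documents the payload as `bytes`, so this sketch decodes it before joining (`join_select_records` and `fake_stream` are illustrative names, not part of the committed hook):

```python
def join_select_records(events):
    """Collect the 'Records' payloads from an S3 Select event stream,
    mirroring the join at the end of S3Hook.select_key. Non-Records
    events (Stats, Progress, End) carry no row data and are skipped.
    """
    return ''.join(
        event['Records']['Payload'].decode('utf-8')
        for event in events
        if 'Records' in event
    )


# A fake select_object_content()['Payload'] stream: Records events
# interleaved with Stats/End events, which must be filtered out.
fake_stream = [
    {'Records': {'Payload': b'a,b\n'}},
    {'Stats': {'Details': {}}},
    {'Records': {'Payload': b'c,d\n'}},
    {'End': {}},
]
result = join_select_records(fake_stream)
# result == 'a,b\nc,d\n'
```

The skip-and-join shape matters because S3 Select can split one logical CSV row across two `Records` payloads, so the payloads must be concatenated before any line-oriented parsing.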