potiuk commented on a change in pull request #6675: [AIRFLOW-6038] AWS DataSync example_dags added
URL: https://github.com/apache/airflow/pull/6675#discussion_r355336660
 
 

 ##########
 File path: airflow/providers/amazon/aws/example_dags/example_datasync_complex.py
 ##########
 @@ -0,0 +1,198 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""
+This is an example DAG for using some of the AWS DataSync operators in a more complex manner.
+
+- Try to get a TaskArn. If one exists, update it.
+- If no tasks exist, try to create a new DataSync Task.
+    - If source and destination locations don't exist for the new task, create them first
+- If multiple tasks exist, raise an Exception
+- After getting or creating a DataSync Task, run it
+
+Specific operators used:
+* `AWSDataSyncCreateTaskOperator`
+* `AWSDataSyncGetTasksOperator`
+* `AWSDataSyncTaskOperator`
+* `AWSDataSyncUpdateTaskOperator`
+
+This DAG relies on the following environment variables:
+
+* SOURCE_LOCATION_URI - Source location URI, usually an on-premises SMB or NFS share
+* DESTINATION_LOCATION_URI - Destination location URI, usually S3
+* CREATE_TASK_KWARGS - Passed to boto3.create_task(**kwargs)
+* CREATE_SOURCE_LOCATION_KWARGS - Passed to boto3.create_location(**kwargs)
+* CREATE_DESTINATION_LOCATION_KWARGS - Passed to boto3.create_location(**kwargs)
+* UPDATE_TASK_KWARGS - Passed to boto3.update_task(**kwargs)
+"""
+
+import json
+from os import getenv
+
+from airflow import models, utils
+from airflow.exceptions import AirflowException
+from airflow.operators.python_operator import BranchPythonOperator, PythonOperator
+from airflow.providers.amazon.aws.operators.datasync import (
+    AWSDataSyncCreateTaskOperator, AWSDataSyncGetTasksOperator, AWSDataSyncTaskOperator,
+    AWSDataSyncUpdateTaskOperator,
+)
+
+# [START howto_operator_datasync_complex_args]
+SOURCE_LOCATION_URI = getenv(
+    "SOURCE_LOCATION_URI", "smb://hostname/directory/")
+
+DESTINATION_LOCATION_URI = getenv(
+    "DESTINATION_LOCATION_URI", "s3://mybucket/prefix")
+
+default_create_task_kwargs = '{"Name": "Created by Airflow"}'
+CREATE_TASK_KWARGS = json.loads(
+    getenv("CREATE_TASK_KWARGS", default_create_task_kwargs)
+)
+
+default_create_source_location_kwargs = "{}"
+CREATE_SOURCE_LOCATION_KWARGS = json.loads(
+    getenv("CREATE_SOURCE_LOCATION_KWARGS",
+           default_create_source_location_kwargs)
+)
+
+bucket_access_role_arn = (
+    "arn:aws:iam::11112223344:role/r-11112223344-my-bucket-access-role"
+)
+default_destination_location_kwargs = json.dumps({
+    "S3BucketArn": "arn:aws:s3:::mybucket",
+    "S3Config": {"BucketAccessRoleArn": bucket_access_role_arn},
+})
+CREATE_DESTINATION_LOCATION_KWARGS = json.loads(
+    getenv("CREATE_DESTINATION_LOCATION_KWARGS",
+           default_destination_location_kwargs)
+)
+
+default_update_task_kwargs = '{"Name": "Updated by Airflow"}'
+UPDATE_TASK_KWARGS = json.loads(
+    getenv("UPDATE_TASK_KWARGS", default_update_task_kwargs)
+)
+
+default_args = {"start_date": utils.dates.days_ago(1)}
+# [END howto_operator_datasync_complex_args]
+
+
+# [START howto_operator_datasync_complex_decide_function]
+
+
+def decide(**kwargs):
 
 Review comment:
   Just a thought. Shouldn't this be part of the operators themselves? It makes more sense to have the operators idempotent rather than to build the logic into the DAGs. In this case I think it would be better to move the "decide" logic into the operator and have a single "Create" operator that checks whether the DataSync Task already exists and performs an update instead. That makes it super-easy to write DAGs without having to add the Branch operator.
   
   We have done that for pretty much all GCP operators and we think it makes much more sense: see for example the Dataproc operator, where we either create a cluster or use an existing one if it is already there.
   
   https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/operators/dataproc.py#L504
   
   This makes it so much better for idempotency and back-filling.
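   
   To illustrate the idea, here is a minimal sketch of what such a combined create-or-update operator could look like. This is only an illustration under assumptions: the class name and the hook methods used below (get_task_arns, create_task, update_task) are made up for the example and may not match the actual AWSDataSyncHook API in this PR.
   
       from airflow.exceptions import AirflowException
       from airflow.models import BaseOperator
       # Assumed import path for the hook added in this PR.
       from airflow.providers.amazon.aws.hooks.datasync import AWSDataSyncHook
       
       
       class AWSDataSyncCreateOrUpdateTaskOperator(BaseOperator):
           """Create a DataSync Task if none matches, otherwise update the existing one."""
       
           def __init__(self, source_location_uri, destination_location_uri,
                        create_task_kwargs=None, update_task_kwargs=None,
                        *args, **kwargs):
               super().__init__(*args, **kwargs)
               self.source_location_uri = source_location_uri
               self.destination_location_uri = destination_location_uri
               self.create_task_kwargs = create_task_kwargs or {}
               self.update_task_kwargs = update_task_kwargs or {}
       
           def execute(self, context):
               hook = AWSDataSyncHook()
               # Hypothetical lookup: find TaskArns matching both location URIs.
               task_arns = hook.get_task_arns(
                   self.source_location_uri, self.destination_location_uri)
               if len(task_arns) > 1:
                   raise AirflowException("More than one matching DataSync Task found")
               if task_arns:
                   # Idempotent path: the Task already exists, so update it in place.
                   task_arn = task_arns[0]
                   hook.update_task(task_arn, **self.update_task_kwargs)
               else:
                   # No matching Task yet - create one.
                   task_arn = hook.create_task(
                       self.source_location_uri, self.destination_location_uri,
                       **self.create_task_kwargs)
               return task_arn
   
   A DAG then needs only this one operator instead of the Get/Branch/Create/Update chain, and re-running it during a back-fill is safe.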
