[jira] [Comment Edited] (AIRFLOW-247) EMR Hook, Operators, Sensor
[ https://issues.apache.org/jira/browse/AIRFLOW-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998681#comment-15998681 ]

Al Johri edited comment on AIRFLOW-247 at 5/7/17 11:42 PM:
---

I'm searching for documentation on how Airflow works with EMR. I'm struggling to find anything here: https://airflow.incubator.apache.org/integration.html#aws

My main question is: can Airflow create an EMR cluster and bring it back down, the way AWS Data Pipeline does? Thanks!

EDIT: Found some information here:

Spark, EMR:
- (uses EMR hooks and operators) https://docs.google.com/presentation/d/1NG1P86HRlX43qTVucCTOsFqIbCvYdOhq_np90VlbVRc/edit#slide=id.gd4067_1_0
- (uses shell scripts to launch and terminate EMR clusters) https://www.agari.com/automated-model-building-emr-spark-airflow/
- (uses a shell script to spark-submit against a local Spark installation) https://blog.insightdatascience.com/scheduling-spark-jobs-with-airflow-4c66f3144660
- (installs Spark on each Airflow worker node and runs local Spark jobs without spark-submit) https://medium.com/@calvertmg/airflow-integrating-with-apache-spark-50a7704dcebd
- (alternative Mozilla implementation of an EMR Spark operator) https://github.com/mozilla/telemetry-airflow/blob/master/dags/operators/emr_spark_operator.py

EMR:
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/emr_hook.py
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/emr_create_job_flow_operator.py
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/emr_add_steps_operator.py
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/emr_terminate_job_flow_operator.py

Spark:
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_submit_hook.py
- https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/spark_submit_operator.py

> EMR Hook, Operators, Sensor
> ---
>
> Key: AIRFLOW-247
> URL: https://issues.apache.org/jira/browse/AIRFLOW-247
> Project: Apache Airflow
> Issue Type: New Feature
> Reporter: Rob Froetscher
> Assignee: Rob Froetscher
> Priority: Minor
>
> Substory of https://issues.apache.org/jira/browse/AIRFLOW-115. It would be
> nice to have an EMR hook and operators.
> Hook to generally interact with EMR.
> Operators to:
> * set up and start a job flow
> * add steps to an existing job flow
> A sensor to:
> * monitor completion and status of EMR jobs

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
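To answer the "create a cluster and bring it back down" question concretely, the contrib operators linked above can be chained inside one DAG. Below is a minimal, hedged sketch: the cluster name, instance types, and the operator wiring in the comments are illustrative assumptions, not taken from the ticket. The dict follows the shape of the `job_flow_overrides` that `EmrCreateJobFlowOperator` merges into boto's `RunJobFlow` call.

```python
# Illustrative job-flow config for EmrCreateJobFlowOperator's
# job_flow_overrides argument. All names/values here are assumptions.
JOB_FLOW_OVERRIDES = {
    "Name": "airflow-emr-example",  # hypothetical cluster name
    "ReleaseLabel": "emr-5.5.0",
    "Instances": {
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m3.xlarge", "InstanceCount": 2},
        ],
        # Keep the cluster alive between steps so Airflow, not EMR,
        # decides when it is torn down.
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
}

# The DAG would then wire the contrib operators roughly like this
# (sketched in comments, since it requires a running Airflow install):
#
#   create = EmrCreateJobFlowOperator(job_flow_overrides=JOB_FLOW_OVERRIDES, ...)
#   steps  = EmrAddStepsOperator(job_flow_id=..., steps=[...], ...)
#   sense  = EmrStepSensor(job_flow_id=..., step_id=..., ...)
#   kill   = EmrTerminateJobFlowOperator(job_flow_id=..., ...)
#   create >> steps >> sense >> kill
#
# i.e. the cluster comes up at the start of the DAG run and is
# terminated at the end -- the Data Pipeline-like behaviour asked about.
```

The key setting is `KeepJobFlowAliveWhenNoSteps: True`; without it, EMR auto-terminates the job flow after its initial steps finish, and the terminate operator would have nothing to do.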
[jira] [Created] (AIRFLOW-1178) @once may run more than one time
Ruslan Dautkhanov created AIRFLOW-1178:
--

Summary: @once may run more than one time
Key: AIRFLOW-1178
URL: https://issues.apache.org/jira/browse/AIRFLOW-1178
Project: Apache Airflow
Issue Type: Bug
Components: scheduler
Affects Versions: Airflow 1.8, 1.8.1, 1.8.0
Environment: Airflow master snapshot from May 05 2017
Reporter: Ruslan Dautkhanov
Priority: Critical
Attachments: onceDAG_got_scheduled_twice.png

My DAG is running a second (2nd) time although it is declared as @once. Here's the DAG definition:

{noformat}
main_dag = DAG(
    dag_id              = 'Test-DAG-1',
    default_args        = default_args,      # default operators' arguments - see above
    user_defined_macros = dag_macros,        # I do not get the difference between
    ## params           = dag_macros,        # user_defined_macros and params
    # start_date        = datetime.now(),    # or e.g. datetime(2015, 6, 1)
    # 'end_date'        = datetime(2016, 1, 1),
    catchup             = True,              # Perform scheduler catchup (or only run latest)?
                                             # - defaults to True
    schedule_interval   = '@once',           # '@once'=None? doesn't create multiple dag runs automatically
    concurrency         = 3,                 # task instances allowed to run concurrently
    max_active_runs     = 1,                 # only one DAG run at a time
    dagrun_timeout      = timedelta(days=4), # no way this dag should run for 4 days
    orientation         = 'TB',              # default graph view
)
{noformat}

As a workaround for AIRFLOW-1013 I changed catchup from False to True, as suggested on the dev list. It "worked around" AIRFLOW-1013, but broke the @once logic - the DAG got scheduled twice (!), which is a no-go for us. The DAG actually has to run no more than one time.

IMO, catchup=True should be explicitly disallowed for the @once schedule.
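To illustrate the invariant the reporter is asking for, here is a toy model of a catchup pass. This is NOT Airflow's actual scheduler code; the function name and signature are invented for illustration. The point is that an `@once` schedule must check for existing runs before creating a new one, regardless of the catchup flag:

```python
from datetime import datetime

def runs_to_schedule(schedule_interval, start_date, existing_runs):
    """Toy model of one scheduler catchup pass (illustration only).

    Returns the list of execution dates to create. For '@once' the
    guard against existing runs is what keeps the DAG from being
    scheduled a second time, whatever the catchup setting is.
    """
    if schedule_interval == "@once":
        return [] if existing_runs else [start_date]
    # Interval schedules would backfill missed dates here; out of scope
    # for this sketch.
    return []

start = datetime(2017, 5, 5)
first_pass = runs_to_schedule("@once", start, existing_runs=[])
second_pass = runs_to_schedule("@once", start, existing_runs=first_pass)
print(first_pass)   # one run created on the first pass
print(second_pass)  # nothing on any later pass
```

Under this model, the bug report amounts to the guard being bypassed when catchup=True, and the proposed fix (disallowing the combination) is one way to restore the invariant.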
[jira] [Updated] (AIRFLOW-1177) variable json deserialize does not work at set defaults
[ https://issues.apache.org/jira/browse/AIRFLOW-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

barak schoster updated AIRFLOW-1177:

Description:
At this line: https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L3685
obj has no attribute val; it is the val itself. It will throw an error like:

{noformat}
Variable.setdefault("some_key", deserialize_json=True, default=json.dumps(default_dag_variables))
  File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 3586, in setdefault
    return json.loads(obj.val)
AttributeError: 'unicode' object has no attribute 'val'
{noformat}

> variable json deserialize does not work at set defaults
> ---
>
> Key: AIRFLOW-1177
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1177
> Project: Apache Airflow
> Issue Type: Bug
> Components: models
> Affects Versions: Airflow 1.8
> Reporter: barak schoster
> Assignee: barak schoster
>
> At this line:
> https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L3685
> obj has no attribute val; it is the val itself.
> It will throw an error like:
> {noformat}
> Variable.setdefault("some_key", deserialize_json=True, default=json.dumps(default_dag_variables))
>   File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 3586, in setdefault
>     return json.loads(obj.val)
> AttributeError: 'unicode' object has no attribute 'val'
> {noformat}
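The reported bug is easy to reproduce in isolation. The sketch below is a stdlib-only mimic of the relevant logic, not the real `airflow.models.Variable` code: once `setdefault` hands back the stored string itself, calling `.val` on it raises exactly the `AttributeError` in the traceback, and the fix is to pass the string straight to `json.loads`:

```python
import json

def setdefault_broken(store, key, default, deserialize_json=False):
    # Mimics the reported bug: `obj` is already the stored string value,
    # so `obj.val` raises AttributeError (on Python 2, "'unicode' object
    # has no attribute 'val'").
    obj = store.setdefault(key, default)
    if deserialize_json:
        return json.loads(obj.val)
    return obj

def setdefault_fixed(store, key, default, deserialize_json=False):
    obj = store.setdefault(key, default)
    if deserialize_json:
        return json.loads(obj)  # obj *is* the stored JSON string
    return obj

default_dag_variables = {"pool": "emr", "retries": 2}

try:
    setdefault_broken({}, "some_key",
                      json.dumps(default_dag_variables),
                      deserialize_json=True)
except AttributeError as e:
    print("broken:", e)

value = setdefault_fixed({}, "some_key",
                         json.dumps(default_dag_variables),
                         deserialize_json=True)
print("fixed:", value)  # the deserialized dict round-trips correctly
```

The `store` dict stands in for the variable table; the real fix belongs at the line of models.py referenced in the description.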
[jira] [Created] (AIRFLOW-1177) variable json deserialize does not work at set defaults
barak schoster created AIRFLOW-1177:
---

Summary: variable json deserialize does not work at set defaults
Key: AIRFLOW-1177
URL: https://issues.apache.org/jira/browse/AIRFLOW-1177
Project: Apache Airflow
Issue Type: Bug
Components: models
Affects Versions: Airflow 1.8
Reporter: barak schoster
Assignee: barak schoster

At this line: https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L3685
obj has no attribute val; it is the val itself.
[jira] [Commented] (AIRFLOW-1163) Cannot Access Airflow Webserver Behind AWS ELB
[ https://issues.apache.org/jira/browse/AIRFLOW-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999691#comment-15999691 ]

Dennis O'Brien commented on AIRFLOW-1163:
-

I don't think it's the same issue you are running into, but I ran into a problem with redirects changing the https protocol to http. In case it helps your debugging, the issue was AIRFLOW-571.

> Cannot Access Airflow Webserver Behind AWS ELB
> --
>
> Key: AIRFLOW-1163
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1163
> Project: Apache Airflow
> Issue Type: Bug
> Affects Versions: Airflow 1.7.1
> Reporter: Tim
>
> Cannot access Airflow from behind a load balancer.
> If we go directly to the IP of the server, it loads just fine. When trying to
> use the load balancer CNAME and forward the request, it does not load.
> We updated the base_url to be the LB URL but it still does not work. The page
> sits and spins forever. Eventually it loads some UI elements.
> Here is what I see on the network tab:
> https://puu.sh/vC4Zp/e34131.png
> Here is what our config looks like:
> {code}
> [webserver]
> # The base url of your website as airflow cannot guess what domain or
> # cname you are using. This is used in automated emails that
> # airflow sends to point links to the right web server
> base_url = http://internal-st-airflow-lb-590109685.us-east-1.elb.amazonaws.com:80
> # The ip specified when starting the web server
> web_server_host = 0.0.0.0
> # The port on which to run the web server
> web_server_port = 8080
> {code}