http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/5e574012/_sources/license.rst.txt ---------------------------------------------------------------------- diff --git a/_sources/license.rst.txt b/_sources/license.rst.txt new file mode 100644 index 0000000..9da26c0 --- /dev/null +++ b/_sources/license.rst.txt @@ -0,0 +1,211 @@ +License +======= + +.. image:: img/apache.jpg + :width: 150 + +:: + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." 
+ + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. 
+ + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. 
We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright 2015 Apache Software Foundation + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License.
http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/5e574012/_sources/plugins.rst.txt ---------------------------------------------------------------------- diff --git a/_sources/plugins.rst.txt b/_sources/plugins.rst.txt new file mode 100644 index 0000000..8d2078f --- /dev/null +++ b/_sources/plugins.rst.txt @@ -0,0 +1,144 @@ +Plugins +======= + +Airflow has a simple plugin manager built-in that can integrate external +features to its core by simply dropping files in your +``$AIRFLOW_HOME/plugins`` folder. + +The python modules in the ``plugins`` folder get imported, +and **hooks**, **operators**, **macros**, **executors** and web **views** +get integrated to Airflow's main collections and become available for use. + +What for? +--------- + +Airflow offers a generic toolbox for working with data. Different +organizations have different stacks and different needs. Using Airflow +plugins can be a way for companies to customize their Airflow installation +to reflect their ecosystem. + +Plugins can be used as an easy way to write, share and activate new sets of +features. + +There's also a need for a set of more complex applications to interact with +different flavors of data and metadata. + +Examples: + +* A set of tools to parse Hive logs and expose Hive metadata (CPU /IO / phases/ skew /...) +* An anomaly detection framework, allowing people to collect metrics, set thresholds and alerts +* An auditing tool, helping understand who accesses what +* A config-driven SLA monitoring tool, allowing you to set monitored tables and at what time + they should land, alert people, and expose visualizations of outages +* ... + +Why build on top of Airflow? +---------------------------- + +Airflow has many components that can be reused when building an application: + +* A web server you can use to render your views +* A metadata database to store your models +* Access to your databases, and knowledge of how to connect to them +* An array of workers that your application can push workload to +* Airflow is deployed, you can just piggy back on it's deployment logistics +* Basic charting capabilities, underlying libraries and abstractions + + +Interface +--------- + +To create a plugin you will need to derive the +``airflow.plugins_manager.AirflowPlugin`` class and reference the objects +you want to plug into Airflow. Here's what the class you need to derive +looks like: + + +.. code:: python + + class AirflowPlugin(object): + # The name of your plugin (str) + name = None + # A list of class(es) derived from BaseOperator + operators = [] + # A list of class(es) derived from BaseHook + hooks = [] + # A list of class(es) derived from BaseExecutor + executors = [] + # A list of references to inject into the macros namespace + macros = [] + # A list of objects created from a class derived + # from flask_admin.BaseView + admin_views = [] + # A list of Blueprint object created from flask.Blueprint + flask_blueprints = [] + # A list of menu links (flask_admin.base.MenuLink) + menu_links = [] + + +Example +------- + +The code below defines a plugin that injects a set of dummy object +definitions in Airflow. + +.. 
code:: python + + # This is the class you derive to create a plugin + from airflow.plugins_manager import AirflowPlugin + + from flask import Blueprint + from flask_admin import BaseView, expose + from flask_admin.base import MenuLink + + # Importing base classes that we need to derive + from airflow.hooks.base_hook import BaseHook + from airflow.models import BaseOperator + from airflow.executors.base_executor import BaseExecutor + + # Will show up under airflow.hooks.test_plugin.PluginHook + class PluginHook(BaseHook): + pass + + # Will show up under airflow.operators.test_plugin.PluginOperator + class PluginOperator(BaseOperator): + pass + + # Will show up under airflow.executors.test_plugin.PluginExecutor + class PluginExecutor(BaseExecutor): + pass + + # Will show up under airflow.macros.test_plugin.plugin_macro + def plugin_macro(): + pass + + # Creating a flask admin BaseView + class TestView(BaseView): + @expose('/') + def test(self): + # in this example, put your test_plugin/test.html template at airflow/plugins/templates/test_plugin/test.html + return self.render("test_plugin/test.html", content="Hello galaxy!") + v = TestView(category="Test Plugin", name="Test View") + + # Creating a flask blueprint to intergrate the templates and static folder + bp = Blueprint( + "test_plugin", __name__, + template_folder='templates', # registers airflow/plugins/templates as a Jinja template folder + static_folder='static', + static_url_path='/static/test_plugin') + + ml = MenuLink( + category='Test Plugin', + name='Test Menu Link', + url='http://pythonhosted.org/airflow/') + + # Defining the plugin class + class AirflowTestPlugin(AirflowPlugin): + name = "test_plugin" + operators = [PluginOperator] + hooks = [PluginHook] + executors = [PluginExecutor] + macros = [plugin_macro] + admin_views = [v] + flask_blueprints = [bp] + menu_links = [ml] http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/5e574012/_sources/profiling.rst.txt ---------------------------------------------------------------------- diff --git a/_sources/profiling.rst.txt b/_sources/profiling.rst.txt new file mode 100644 index 0000000..93e6b6b --- /dev/null +++ b/_sources/profiling.rst.txt @@ -0,0 +1,39 @@ +Data Profiling +============== + +Part of being productive with data is having the right weapons to +profile the data you are working with. Airflow provides a simple query +interface to write SQL and get results quickly, and a charting application +letting you visualize data. + +Adhoc Queries +------------- +The adhoc query UI allows for simple SQL interactions with the database +connections registered in Airflow. + +.. image:: img/adhoc.png + +Charts +------ +A simple UI built on top of flask-admin and highcharts allows building +data visualizations and charts easily. Fill in a form with a label, SQL, +chart type, pick a source database from your environment's connectons, +select a few other options, and save it for later use. + +You can even use the same templating and macros available when writing +airflow pipelines, parameterizing your queries and modifying parameters +directly in the URL. + +These charts are basic, but they're easy to create, modify and share. + +Chart Screenshot +................ + +.. image:: img/chart.png + +----- + +Chart Form Screenshot +..................... + +.. 
image:: img/chart_form.png http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/5e574012/_sources/project.rst.txt ---------------------------------------------------------------------- diff --git a/_sources/project.rst.txt b/_sources/project.rst.txt new file mode 100644 index 0000000..2fbd516 --- /dev/null +++ b/_sources/project.rst.txt @@ -0,0 +1,49 @@ +Project +======= + +History +------- + +Airflow was started in October 2014 by Maxime Beauchemin at Airbnb. +It was open source from the very first commit and officially brought under +the Airbnb Github and announced in June 2015. + +The project joined the Apache Software Foundation's incubation program in March 2016. + + +Committers +---------- + +- @mistercrunch (Maxime "Max" Beauchemin) +- @r39132 (Siddharth "Sid" Anand) +- @criccomini (Chris Riccomini) +- @bolkedebruin (Bolke de Bruin) +- @artwr (Arthur Wiedmer) +- @jlowin (Jeremiah Lowin) +- @patrickleotardif (Patrick Leo Tardif) +- @aoen (Dan Davydov) +- @syvineckruyk (Steven Yvinec-Kruyk) + +For the full list of contributors, take a look at `Airflow's Github +Contributor page: +<https://github.com/apache/incubator-airflow/graphs/contributors>`_ + + +Resources & links +----------------- + +* `Airflow's official documentation <http://airflow.apache.org/>`_ +* Mailing list (send emails to + ``dev-subscr...@airflow.incubator.apache.org`` and/or + ``commits-subscr...@airflow.incubator.apache.org`` + to subscribe to each) +* `Issues on Apache's Jira <https://issues.apache.org/jira/browse/AIRFLOW>`_ +* `Gitter (chat) Channel <https://gitter.im/airbnb/airflow>`_ +* `More resources and links to Airflow related content on the Wiki <https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Links>`_ + + + +Roadmap +------- + +Please refer to the Roadmap on `the wiki <https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Home>`_ http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/5e574012/_sources/scheduler.rst.txt ---------------------------------------------------------------------- diff --git a/_sources/scheduler.rst.txt b/_sources/scheduler.rst.txt new file mode 100644 index 0000000..749d58a --- /dev/null +++ b/_sources/scheduler.rst.txt @@ -0,0 +1,153 @@ +Scheduling & Triggers +===================== + +The Airflow scheduler monitors all tasks and all DAGs, and triggers the +task instances whose dependencies have been met. Behind the scenes, +it monitors and stays in sync with a folder for all DAG objects it may contain, +and periodically (every minute or so) inspects active tasks to see whether +they can be triggered. + +The Airflow scheduler is designed to run as a persistent service in an +Airflow production environment. To kick it off, all you need to do is +execute ``airflow scheduler``. It will use the configuration specified in +``airflow.cfg``. + +Note that if you run a DAG on a ``schedule_interval`` of one day, +the run stamped ``2016-01-01`` will be trigger soon after ``2016-01-01T23:59``. +In other words, the job instance is started once the period it covers +has ended. + +**Let's Repeat That** The scheduler runs your job one ``schedule_interval`` AFTER the +start date, at the END of the period. + +The scheduler starts an instance of the executor specified in the your +``airflow.cfg``. If it happens to be the ``LocalExecutor``, tasks will be +executed as subprocesses; in the case of ``CeleryExecutor`` and +``MesosExecutor``, tasks are executed remotely. + +To start a scheduler, simply run the command: + +.. 
code:: bash

    airflow scheduler


DAG Runs
''''''''

A DAG Run is an object representing an instantiation of the DAG in time.

Each DAG may or may not have a schedule, which informs how ``DAG Runs`` are
created. ``schedule_interval`` is defined as a DAG argument, and receives,
preferably, a
`cron expression <https://en.wikipedia.org/wiki/Cron#CRON_expression>`_ as
a ``str``, or a ``datetime.timedelta`` object. Alternatively, you can also
use one of these cron "presets":

+--------------+----------------------------------------------------------------+---------------+
| preset       | meaning                                                        | cron          |
+==============+================================================================+===============+
| ``None``     | Don't schedule, use for exclusively "externally triggered"     |               |
|              | DAGs                                                           |               |
+--------------+----------------------------------------------------------------+---------------+
| ``@once``    | Schedule once and only once                                    |               |
+--------------+----------------------------------------------------------------+---------------+
| ``@hourly``  | Run once an hour at the beginning of the hour                  | ``0 * * * *`` |
+--------------+----------------------------------------------------------------+---------------+
| ``@daily``   | Run once a day at midnight                                     | ``0 0 * * *`` |
+--------------+----------------------------------------------------------------+---------------+
| ``@weekly``  | Run once a week at midnight on Sunday morning                  | ``0 0 * * 0`` |
+--------------+----------------------------------------------------------------+---------------+
| ``@monthly`` | Run once a month at midnight of the first day of the month     | ``0 0 1 * *`` |
+--------------+----------------------------------------------------------------+---------------+
| ``@yearly``  | Run once a year at midnight of January 1                       | ``0 0 1 1 *`` |
+--------------+----------------------------------------------------------------+---------------+


Your DAG will be instantiated for each schedule, and a ``DAG Run`` entry
will be created for each schedule.

DAG Runs have a state associated with them (running, failed, success) that
informs the scheduler which set of schedules should be evaluated for
task submissions. Without the metadata at the DAG Run level, the Airflow
scheduler would have much more work to do in order to figure out what tasks
should be triggered, and would slow to a crawl. It might also create undesired
processing when you change the shape of your DAG, by, say, adding new
tasks.

Backfill and Catchup
''''''''''''''''''''

An Airflow DAG with a ``start_date``, possibly an ``end_date``, and a ``schedule_interval`` defines a
series of intervals which the scheduler turns into individual DAG Runs and executes. A key capability of
Airflow is that these DAG Runs are atomic, idempotent items, and the scheduler, by default, will examine
the lifetime of the DAG (from start to end/now, one interval at a time) and kick off a DAG Run for any
interval that has not been run (or has been cleared). This concept is called Catchup.

If your DAG is written to handle its own catchup (i.e. not limited to a single interval, but instead
reaching up to "now"), then you will want to turn catchup off, either on the DAG itself with
``dag.catchup = False`` or by default at the configuration file level with ``catchup_by_default = False``.
This instructs the scheduler to create a DAG Run only for the most current instance of the DAG
interval series.

..
code:: python + """ + Code that goes along with the Airflow tutorial located at: + https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py + """ + from airflow import DAG + from airflow.operators.bash_operator import BashOperator + from datetime import datetime, timedelta + + + default_args = { + 'owner': 'airflow', + 'depends_on_past': False, + 'start_date': datetime(2015, 12, 1), + 'email': ['airf...@airflow.com'], + 'email_on_failure': False, + 'email_on_retry': False, + 'retries': 1, + 'retry_delay': timedelta(minutes=5), + 'schedule_interval': '@hourly', + } + + dag = DAG('tutorial', catchup=False, default_args=default_args) + +In the example above, if the DAG is picked up by the scheduler daemon on 2016-01-02 at 6 AM, (or from the +command line), a single DAG Run will be created, with an ``execution_date`` of 2016-01-01, and the next +one will be created just after midnight on the morning of 2016-01-03 with an execution date of 2016-01-02. + +If the ``dag.catchup`` value had been True instead, the scheduler would have created a DAG Run for each +completed interval between 2015-12-01 and 2016-01-02 (but not yet one for 2016-01-02, as that interval +hasn't completed) and the scheduler will execute them sequentially. This behavior is great for atomic +datasets that can easily be split into periods. Turning catchup off is great if your DAG Runs perform +backfill internally. + +External Triggers +''''''''''''''''' + +Note that ``DAG Runs`` can also be created manually through the CLI while +running an ``airflow trigger_dag`` command, where you can define a +specific ``run_id``. The ``DAG Runs`` created externally to the +scheduler get associated to the trigger's timestamp, and will be displayed +in the UI alongside scheduled ``DAG runs``. + + +To Keep in Mind +''''''''''''''' +* The first ``DAG Run`` is created based on the minimum ``start_date`` for the + tasks in your DAG. +* Subsequent ``DAG Runs`` are created by the scheduler process, based on + your DAG's ``schedule_interval``, sequentially. +* When clearing a set of tasks' state in hope of getting them to re-run, + it is important to keep in mind the ``DAG Run``'s state too as it defines + whether the scheduler should look into triggering tasks for that run. + +Here are some of the ways you can **unblock tasks**: + +* From the UI, you can **clear** (as in delete the status of) individual task instances from the task instances dialog, while defining whether you want to includes the past/future and the upstream/downstream dependencies. Note that a confirmation window comes next and allows you to see the set you are about to clear. +* The CLI command ``airflow clear -h`` has lots of options when it comes to clearing task instance states, including specifying date ranges, targeting task_ids by specifying a regular expression, flags for including upstream and downstream relatives, and targeting task instances in specific states (``failed``, or ``success``) +* Marking task instances as successful can be done through the UI. This is mostly to fix false negatives, or for instance when the fix has been applied outside of Airflow. +* The ``airflow backfill`` CLI subcommand has a flag to ``--mark_success`` and allows selecting subsections of the DAG as well as specifying date ranges. 
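To make the interval semantics above concrete, here is a minimal, illustrative sketch (the ``dag_id``, dates, and task name are made up) of a ``@daily`` DAG whose single task simply echoes its ``execution_date``. Whether the run is created by catchup, by an external trigger, or by clearing and re-running a task instance, the rendered ``{{ ds }}`` reflects the start of the interval that the DAG Run covers, not the wall-clock time at which the task happens to execute.

.. code:: python

    # Illustrative sketch only: a tiny DAG for observing which interval
    # each DAG Run covers. The dag_id, owner and start_date are arbitrary.
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from datetime import datetime

    default_args = {
        'owner': 'airflow',
        'start_date': datetime(2016, 1, 1),
    }

    dag = DAG('interval_demo', default_args=default_args,
              schedule_interval='@daily')

    # {{ ds }} renders to the DAG Run's execution_date (the start of the
    # covered interval), so the run created shortly after 2016-01-02T00:00
    # prints 2016-01-01.
    print_interval = BashOperator(
        task_id='print_interval',
        bash_command='echo "covering interval starting {{ ds }}"',
        dag=dag)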
http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/5e574012/_sources/security.rst.txt ---------------------------------------------------------------------- diff --git a/_sources/security.rst.txt b/_sources/security.rst.txt new file mode 100644 index 0000000..70db606 --- /dev/null +++ b/_sources/security.rst.txt @@ -0,0 +1,334 @@

Security
========

By default, all gates are open. An easy way to restrict access
to the web application is to do it at the network level, or by using
SSH tunnels.

It is however possible to switch on authentication by either using one of
the supplied backends or creating your own.

Web Authentication
------------------

Password
''''''''

One of the simplest mechanisms for authentication is requiring users to specify a password before logging in.
Password authentication requires the use of the ``password`` subpackage in your requirements file. Passwords are
hashed with bcrypt before being stored.

.. code-block:: bash

    [webserver]
    authenticate = True
    auth_backend = airflow.contrib.auth.backends.password_auth

When password auth is enabled, an initial user credential will need to be created before anyone can log in. An initial
user was not created in the migrations for this authentication backend, to prevent default Airflow installations from
attack. Creating a new user has to be done via a Python REPL on the same machine on which Airflow is installed.

.. code-block:: bash

    # navigate to the airflow installation directory
    $ cd ~/airflow
    $ python
    Python 2.7.9 (default, Feb 10 2015, 03:28:08)
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import airflow
    >>> from airflow import models, settings
    >>> from airflow.contrib.auth.backends.password_auth import PasswordUser
    >>> user = PasswordUser(models.User())
    >>> user.username = 'new_user_name'
    >>> user.email = 'new_user_em...@example.com'
    >>> user.password = 'set_the_password'
    >>> session = settings.Session()
    >>> session.add(user)
    >>> session.commit()
    >>> session.close()
    >>> exit()

LDAP
''''

To turn on LDAP authentication, configure your ``airflow.cfg`` as follows. Please note that the example uses
an encrypted connection to the LDAP server, as you probably do not want passwords to be readable at the network level.
It is however possible to configure it without encryption if you really want to.

Additionally, if you are using Active Directory and are not explicitly specifying an OU that your users are in,
you will need to change ``search_scope`` to "SUBTREE".

Valid search_scope options can be found in the `ldap3 Documentation <http://ldap3.readthedocs.org/searches.html?highlight=search_scope>`_

..
code-block:: bash + + [webserver] + authenticate = True + auth_backend = airflow.contrib.auth.backends.ldap_auth + + [ldap] + # set a connection without encryption: uri = ldap://<your.ldap.server>:<port> + uri = ldaps://<your.ldap.server>:<port> + user_filter = objectClass=* + # in case of Active Directory you would use: user_name_attr = sAMAccountName + user_name_attr = uid + superuser_filter = memberOf=CN=airflow-super-users,OU=Groups,OU=RWC,OU=US,OU=NORAM,DC=example,DC=com + data_profiler_filter = memberOf=CN=airflow-data-profilers,OU=Groups,OU=RWC,OU=US,OU=NORAM,DC=example,DC=com + bind_user = cn=Manager,dc=example,dc=com + bind_password = insecure + basedn = dc=example,dc=com + cacert = /etc/ca/ldap_ca.crt + # Set search_scope to one of them: BASE, LEVEL , SUBTREE + # Set search_scope to SUBTREE if using Active Directory, and not specifying an Organizational Unit + search_scope = LEVEL + +The superuser_filter and data_profiler_filter are optional. If defined, these configurations allow you to specify LDAP groups that users must belong to in order to have superuser (admin) and data-profiler permissions. If undefined, all users will be superusers and data profilers. + +Roll your own +''''''''''''' + +Airflow uses ``flask_login`` and +exposes a set of hooks in the ``airflow.default_login`` module. You can +alter the content and make it part of the ``PYTHONPATH`` and configure it as a backend in ``airflow.cfg```. + +.. code-block:: bash + + [webserver] + authenticate = True + auth_backend = mypackage.auth + +Multi-tenancy +------------- + +You can filter the list of dags in webserver by owner name, when authentication +is turned on, by setting webserver.filter_by_owner as true in your ``airflow.cfg`` +With this, when a user authenticates and logs into webserver, it will see only the dags +which it is owner of. A super_user, will be able to see all the dags although. +This makes the web UI a multi-tenant UI, where a user will only be able to see dags +created by itself. + + +Kerberos +-------- + +Airflow has initial support for Kerberos. This means that airflow can renew kerberos +tickets for itself and store it in the ticket cache. The hooks and dags can make use of ticket +to authenticate against kerberized services. + +Limitations +''''''''''' + +Please note that at this time not all hooks have been adjusted to make use of this functionality yet. +Also it does not integrate kerberos into the web interface and you will have to rely on network +level security for now to make sure your service remains secure. + +Celery integration has not been tried and tested yet. However if you generate a key tab for every host +and launch a ticket renewer next to every worker it will most likely work. + +Enabling kerberos +''''''''''''''''' + +#### Airflow + +To enable kerberos you will need to generate a (service) key tab. + +.. code-block:: bash + + # in the kadmin.local or kadmin shell, create the airflow principal + kadmin: addprinc -randkey airflow/fully.qualified.domain.n...@your-realm.com + + # Create the airflow keytab file that will contain the airflow principal + kadmin: xst -norandkey -k airflow.keytab airflow/fully.qualified.domain.name + +Now store this file in a location where the airflow user can read it (chmod 600). And then add the following to +your ``airflow.cfg`` + +.. code-block:: bash + + [core] + security = kerberos + + [kerberos] + keytab = /etc/airflow/airflow.keytab + reinit_frequency = 3600 + principal = airflow + +Launch the ticket renewer by + +.. 
code-block:: bash + + # run ticket renewer + airflow kerberos + +#### Hadoop + +If want to use impersonation this needs to be enabled in ``core-site.xml`` of your hadoop config. + +.. code-block:: bash + + <property> + <name>hadoop.proxyuser.airflow.groups</name> + <value>*</value> + </property> + + <property> + <name>hadoop.proxyuser.airflow.users</name> + <value>*</value> + </property> + + <property> + <name>hadoop.proxyuser.airflow.hosts</name> + <value>*</value> + </property> + +Of course if you need to tighten your security replace the asterisk with something more appropriate. + +Using kerberos authentication +''''''''''''''''''''''''''''' + +The hive hook has been updated to take advantage of kerberos authentication. To allow your DAGs to use it simply +update the connection details with, for example: + +.. code-block:: bash + + { "use_beeline": true, "principal": "hive/_h...@example.com"} + +Adjust the principal to your settings. The _HOST part will be replaced by the fully qualified domain name of +the server. + +You can specify if you would like to use the dag owner as the user for the connection or the user specified in the login +section of the connection. For the login user specify the following as extra: + +.. code-block:: bash + + { "use_beeline": true, "principal": "hive/_h...@example.com", "proxy_user": "login"} + +For the DAG owner use: + +.. code-block:: bash + + { "use_beeline": true, "principal": "hive/_h...@example.com", "proxy_user": "owner"} + +and in your DAG, when initializing the HiveOperator, specify + +.. code-block:: bash + + run_as_owner=True + +OAuth Authentication +-------------------- + +GitHub Enterprise (GHE) Authentication +'''''''''''''''''''''''''''''''''''''' + +The GitHub Enterprise authentication backend can be used to authenticate users +against an installation of GitHub Enterprise using OAuth2. You can optionally +specify a team whitelist (composed of slug cased team names) to restrict login +to only members of those teams. + +*NOTE* If you do not specify a team whitelist, anyone with a valid account on +your GHE installation will be able to login to Airflow. + +.. code-block:: bash + + [webserver] + authenticate = True + auth_backend = airflow.contrib.auth.backends.github_enterprise_auth + + [github_enterprise] + host = github.example.com + client_id = oauth_key_from_github_enterprise + client_secret = oauth_secret_from_github_enterprise + oauth_callback_route = /example/ghe_oauth/callback + allowed_teams = 1, 345, 23 + +Setting up GHE Authentication +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +An application must be setup in GHE before you can use the GHE authentication +backend. In order to setup an application: + +1. Navigate to your GHE profile +2. Select 'Applications' from the left hand nav +3. Select the 'Developer Applications' tab +4. Click 'Register new application' +5. Fill in the required information (the 'Authorization callback URL' must be fully qualifed e.g. http://airflow.example.com/example/ghe_oauth/callback) +6. Click 'Register application' +7. Copy 'Client ID', 'Client Secret', and your callback route to your airflow.cfg according to the above example + +Google Authentication +''''''''''''''''''''' + +The Google authentication backend can be used to authenticate users +against Google using OAuth2. You must specify a domain to restrict login +to only members of that domain. + +.. 
code-block:: bash + + [webserver] + authenticate = True + auth_backend = airflow.contrib.auth.backends.google_auth + + [google] + client_id = google_client_id + client_secret = google_client_secret + oauth_callback_route = /oauth2callback + domain = example.com + +Setting up Google Authentication +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +An application must be setup in the Google API Console before you can use the Google authentication +backend. In order to setup an application: + +1. Navigate to https://console.developers.google.com/apis/ +2. Select 'Credentials' from the left hand nav +3. Click 'Create credentials' and choose 'OAuth client ID' +4. Choose 'Web application' +5. Fill in the required information (the 'Authorized redirect URIs' must be fully qualifed e.g. http://airflow.example.com/oauth2callback) +6. Click 'Create' +7. Copy 'Client ID', 'Client Secret', and your redirect URI to your airflow.cfg according to the above example + +SSL +--- + +SSL can be enabled by providing a certificate and key. Once enabled, be sure to use +"https://" in your browser. + +.. code-block:: bash + + [webserver] + web_server_ssl_cert = <path to cert> + web_server_ssl_key = <path to key> + +Enabling SSL will not automatically change the web server port. If you want to use the +standard port 443, you'll need to configure that too. Be aware that super user privileges +(or cap_net_bind_service on Linux) are required to listen on port 443. + +.. code-block:: bash + + # Optionally, set the server to listen on the standard SSL port. + web_server_port = 443 + base_url = http://<hostname or IP>:443 + +Impersonation +''''''''''''' + +Airflow has the ability to impersonate a unix user while running task +instances based on the task's ``run_as_user`` parameter, which takes a user's name. + +*NOTE* For impersonations to work, Airflow must be run with `sudo` as subtasks are run +with `sudo -u` and permissions of files are changed. Furthermore, the unix user needs to +exist on the worker. Here is what a simple sudoers file entry could look like to achieve +this, assuming as airflow is running as the `airflow` user. Note that this means that +the airflow user must be trusted and treated the same way as the root user. + +.. code-block:: none + airflow ALL=(ALL) NOPASSWD: ALL + +Subtasks with impersonation will still log to the same folder, except that the files they +log to will have permissions changed such that only the unix user can write to it. + +*Default impersonation* To prevent tasks that don't use impersonation to be run with +`sudo` privileges, you can set the `default_impersonation` config in `core` which sets a +default user impersonate if `run_as_user` is not set. http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/5e574012/_sources/start.rst.txt ---------------------------------------------------------------------- diff --git a/_sources/start.rst.txt b/_sources/start.rst.txt new file mode 100644 index 0000000..cc41d4b --- /dev/null +++ b/_sources/start.rst.txt @@ -0,0 +1,49 @@ +Quick Start +----------- + +The installation is quick and straightforward. + +.. 
code-block:: bash + + # airflow needs a home, ~/airflow is the default, + # but you can lay foundation somewhere else if you prefer + # (optional) + export AIRFLOW_HOME=~/airflow + + # install from pypi using pip + pip install airflow + + # initialize the database + airflow initdb + + # start the web server, default port is 8080 + airflow webserver -p 8080 + +Upon running these commands, Airflow will create the ``$AIRFLOW_HOME`` folder +and lay an "airflow.cfg" file with defaults that get you going fast. You can +inspect the file either in ``$AIRFLOW_HOME/airflow.cfg``, or through the UI in +the ``Admin->Configuration`` menu. The PID file for the webserver will be stored +in ``$AIRFLOW_HOME/airflow-webserver.pid`` or in ``/run/airflow/webserver.pid`` +if started by systemd. + +Out of the box, Airflow uses a sqlite database, which you should outgrow +fairly quickly since no parallelization is possible using this database +backend. It works in conjunction with the ``SequentialExecutor`` which will +only run task instances sequentially. While this is very limiting, it allows +you to get up and running quickly and take a tour of the UI and the +command line utilities. + +Here are a few commands that will trigger a few task instances. You should +be able to see the status of the jobs change in the ``example1`` DAG as you +run the commands below. + +.. code-block:: bash + + # run your first task instance + airflow run example_bash_operator runme_0 2015-01-01 + # run a backfill over 2 days + airflow backfill example_bash_operator -s 2015-01-01 -e 2015-01-02 + +What's Next? +'''''''''''' +From this point, you can head to the :doc:`tutorial` section for further examples or the :doc:`configuration` section if you're ready to get your hands dirty. http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/5e574012/_sources/tutorial.rst.txt ---------------------------------------------------------------------- diff --git a/_sources/tutorial.rst.txt b/_sources/tutorial.rst.txt new file mode 100644 index 0000000..97bbe11 --- /dev/null +++ b/_sources/tutorial.rst.txt @@ -0,0 +1,429 @@ + +Tutorial +================ + +This tutorial walks you through some of the fundamental Airflow concepts, +objects, and their usage while writing your first pipeline. + +Example Pipeline definition +--------------------------- + +Here is an example of a basic pipeline definition. Do not worry if this looks +complicated, a line by line explanation follows below. + +.. 
code:: python + + """ + Code that goes along with the Airflow tutorial located at: + https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py + """ + from airflow import DAG + from airflow.operators.bash_operator import BashOperator + from datetime import datetime, timedelta + + + default_args = { + 'owner': 'airflow', + 'depends_on_past': False, + 'start_date': datetime(2015, 6, 1), + 'email': ['airf...@airflow.com'], + 'email_on_failure': False, + 'email_on_retry': False, + 'retries': 1, + 'retry_delay': timedelta(minutes=5), + # 'queue': 'bash_queue', + # 'pool': 'backfill', + # 'priority_weight': 10, + # 'end_date': datetime(2016, 1, 1), + } + + dag = DAG('tutorial', default_args=default_args) + + # t1, t2 and t3 are examples of tasks created by instantiating operators + t1 = BashOperator( + task_id='print_date', + bash_command='date', + dag=dag) + + t2 = BashOperator( + task_id='sleep', + bash_command='sleep 5', + retries=3, + dag=dag) + + templated_command = """ + {% for i in range(5) %} + echo "{{ ds }}" + echo "{{ macros.ds_add(ds, 7)}}" + echo "{{ params.my_param }}" + {% endfor %} + """ + + t3 = BashOperator( + task_id='templated', + bash_command=templated_command, + params={'my_param': 'Parameter I passed in'}, + dag=dag) + + t2.set_upstream(t1) + t3.set_upstream(t1) + + +It's a DAG definition file +-------------------------- + +One thing to wrap your head around (it may not be very intuitive for everyone +at first) is that this Airflow Python script is really +just a configuration file specifying the DAG's structure as code. +The actual tasks defined here will run in a different context from +the context of this script. Different tasks run on different workers +at different points in time, which means that this script cannot be used +to cross communicate between tasks. Note that for this +purpose we have a more advanced feature called ``XCom``. + +People sometimes think of the DAG definition file as a place where they +can do some actual data processing - that is not the case at all! +The script's purpose is to define a DAG object. It needs to evaluate +quickly (seconds, not minutes) since the scheduler will execute it +periodically to reflect the changes if any. + + +Importing Modules +----------------- + +An Airflow pipeline is just a Python script that happens to define an +Airflow DAG object. Let's start by importing the libraries we will need. + +.. code:: python + + # The DAG object; we'll need this to instantiate a DAG + from airflow import DAG + + # Operators; we need this to operate! + from airflow.operators.bash_operator import BashOperator + +Default Arguments +----------------- +We're about to create a DAG and some tasks, and we have the choice to +explicitly pass a set of arguments to each task's constructor +(which would become redundant), or (better!) we can define a dictionary +of default parameters that we can use when creating tasks. + +.. code:: python + + from datetime import datetime, timedelta + + default_args = { + 'owner': 'airflow', + 'depends_on_past': False, + 'start_date': datetime(2015, 6, 1), + 'email': ['airf...@airflow.com'], + 'email_on_failure': False, + 'email_on_retry': False, + 'retries': 1, + 'retry_delay': timedelta(minutes=5), + # 'queue': 'bash_queue', + # 'pool': 'backfill', + # 'priority_weight': 10, + # 'end_date': datetime(2016, 1, 1), + } + +For more information about the BaseOperator's parameters and what they do, +refer to the :py:class:``airflow.models.BaseOperator`` documentation. 
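If you'd rather browse those parameters from a Python shell than in the rendered docs, one quick and entirely optional sketch is to pull up the class docstring directly; any of the constructor arguments listed there can be supplied through ``default_args``:

.. code:: python

    # Optional: inspect BaseOperator's documented constructor arguments
    # interactively; these are the keys that make sense in default_args.
    from airflow.models import BaseOperator

    help(BaseOperator)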
+ +Also, note that you could easily define different sets of arguments that +would serve different purposes. An example of that would be to have +different settings between a production and development environment. + + +Instantiate a DAG +----------------- + +We'll need a DAG object to nest our tasks into. Here we pass a string +that defines the ``dag_id``, which serves as a unique identifier for your DAG. +We also pass the default argument dictionary that we just defined and +define a ``schedule_interval`` of 1 day for the DAG. + +.. code:: python + + dag = DAG( + 'tutorial', default_args=default_args, schedule_interval=timedelta(1)) + +Tasks +----- +Tasks are generated when instantiating operator objects. An object +instantiated from an operator is called a constructor. The first argument +``task_id`` acts as a unique identifier for the task. + +.. code:: python + + t1 = BashOperator( + task_id='print_date', + bash_command='date', + dag=dag) + + t2 = BashOperator( + task_id='sleep', + bash_command='sleep 5', + retries=3, + dag=dag) + +Notice how we pass a mix of operator specific arguments (``bash_command``) and +an argument common to all operators (``retries``) inherited +from BaseOperator to the operator's constructor. This is simpler than +passing every argument for every constructor call. Also, notice that in +the second task we override the ``retries`` parameter with ``3``. + +The precedence rules for a task are as follows: + +1. Explicitly passed arguments +2. Values that exist in the ``default_args`` dictionary +3. The operator's default value, if one exists + +A task must include or inherit the arguments ``task_id`` and ``owner``, +otherwise Airflow will raise an exception. + +Templating with Jinja +--------------------- +Airflow leverages the power of +`Jinja Templating <http://jinja.pocoo.org/docs/dev/>`_ and provides +the pipeline author +with a set of built-in parameters and macros. Airflow also provides +hooks for the pipeline author to define their own parameters, macros and +templates. + +This tutorial barely scratches the surface of what you can do with +templating in Airflow, but the goal of this section is to let you know +this feature exists, get you familiar with double curly brackets, and +point to the most common template variable: ``{{ ds }}``. + +.. code:: python + + templated_command = """ + {% for i in range(5) %} + echo "{{ ds }}" + echo "{{ macros.ds_add(ds, 7) }}" + echo "{{ params.my_param }}" + {% endfor %} + """ + + t3 = BashOperator( + task_id='templated', + bash_command=templated_command, + params={'my_param': 'Parameter I passed in'}, + dag=dag) + +Notice that the ``templated_command`` contains code logic in ``{% %}`` blocks, +references parameters like ``{{ ds }}``, calls a function as in +``{{ macros.ds_add(ds, 7)}}``, and references a user-defined parameter +in ``{{ params.my_param }}``. + +The ``params`` hook in ``BaseOperator`` allows you to pass a dictionary of +parameters and/or objects to your templates. Please take the time +to understand how the parameter ``my_param`` makes it through to the template. + +Files can also be passed to the ``bash_command`` argument, like +``bash_command='templated_command.sh'``, where the file location is relative to +the directory containing the pipeline file (``tutorial.py`` in this case). 
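As a rough sketch of that variant, ``t3`` could be redefined as follows (``templated_command.sh`` would sit next to ``tutorial.py`` and contain the same Jinja-templated bash shown earlier):

.. code:: python

    # Sketch: reference a script file instead of an inline command string.
    # BashOperator templates files with a .sh extension, so {{ ds }},
    # macros and params still render inside templated_command.sh.
    t3 = BashOperator(
        task_id='templated',
        bash_command='templated_command.sh',  # resolved relative to tutorial.py
        params={'my_param': 'Parameter I passed in'},
        dag=dag)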
This +may be desirable for many reasons, like separating your script's logic and +pipeline code, allowing for proper code highlighting in files composed in +different languages, and general flexibility in structuring pipelines. It is +also possible to define your ``template_searchpath`` as pointing to any folder +locations in the DAG constructor call. + +For more information on the variables and macros that can be referenced +in templates, make sure to read through the :ref:`macros` section + +Setting up Dependencies +----------------------- +We have two simple tasks that do not depend on each other. Here's a few ways +you can define dependencies between them: + +.. code:: python + + t2.set_upstream(t1) + + # This means that t2 will depend on t1 + # running successfully to run + # It is equivalent to + # t1.set_downstream(t2) + + t3.set_upstream(t1) + + # all of this is equivalent to + # dag.set_dependency('print_date', 'sleep') + # dag.set_dependency('print_date', 'templated') + +Note that when executing your script, Airflow will raise exceptions when +it finds cycles in your DAG or when a dependency is referenced more +than once. + +Recap +----- +Alright, so we have a pretty basic DAG. At this point your code should look +something like this: + +.. code:: python + + """ + Code that goes along with the Airflow located at: + http://airflow.readthedocs.org/en/latest/tutorial.html + """ + from airflow import DAG + from airflow.operators.bash_operator import BashOperator + from datetime import datetime, timedelta + + + default_args = { + 'owner': 'airflow', + 'depends_on_past': False, + 'start_date': datetime(2015, 6, 1), + 'email': ['airf...@airflow.com'], + 'email_on_failure': False, + 'email_on_retry': False, + 'retries': 1, + 'retry_delay': timedelta(minutes=5), + # 'queue': 'bash_queue', + # 'pool': 'backfill', + # 'priority_weight': 10, + # 'end_date': datetime(2016, 1, 1), + } + + dag = DAG( + 'tutorial', default_args=default_args, schedule_interval=timedelta(1)) + + # t1, t2 and t3 are examples of tasks created by instantiating operators + t1 = BashOperator( + task_id='print_date', + bash_command='date', + dag=dag) + + t2 = BashOperator( + task_id='sleep', + bash_command='sleep 5', + retries=3, + dag=dag) + + templated_command = """ + {% for i in range(5) %} + echo "{{ ds }}" + echo "{{ macros.ds_add(ds, 7)}}" + echo "{{ params.my_param }}" + {% endfor %} + """ + + t3 = BashOperator( + task_id='templated', + bash_command=templated_command, + params={'my_param': 'Parameter I passed in'}, + dag=dag) + + t2.set_upstream(t1) + t3.set_upstream(t1) + +Testing +-------- + +Running the Script +'''''''''''''''''' + +Time to run some tests. First let's make sure that the pipeline +parses. Let's assume we're saving the code from the previous step in +``tutorial.py`` in the DAGs folder referenced in your ``airflow.cfg``. +The default location for your DAGs is ``~/airflow/dags``. + +.. code-block:: bash + + python ~/airflow/dags/tutorial.py + +If the script does not raise an exception it means that you haven't done +anything horribly wrong, and that your Airflow environment is somewhat +sound. + +Command Line Metadata Validation +''''''''''''''''''''''''''''''''' +Let's run a few commands to validate this script further. + +.. 
code-block:: bash + + # print the list of active DAGs + airflow list_dags + + # prints the list of tasks the "tutorial" dag_id + airflow list_tasks tutorial + + # prints the hierarchy of tasks in the tutorial DAG + airflow list_tasks tutorial --tree + + +Testing +''''''' +Let's test by running the actual task instances on a specific date. The +date specified in this context is an ``execution_date``, which simulates the +scheduler running your task or dag at a specific date + time: + +.. code-block:: bash + + # command layout: command subcommand dag_id task_id date + + # testing print_date + airflow test tutorial print_date 2015-06-01 + + # testing sleep + airflow test tutorial sleep 2015-06-01 + +Now remember what we did with templating earlier? See how this template +gets rendered and executed by running this command: + +.. code-block:: bash + + # testing templated + airflow test tutorial templated 2015-06-01 + +This should result in displaying a verbose log of events and ultimately +running your bash command and printing the result. + +Note that the ``airflow test`` command runs task instances locally, outputs +their log to stdout (on screen), doesn't bother with dependencies, and +doesn't communicate state (running, success, failed, ...) to the database. +It simply allows testing a single task instance. + +Backfill +'''''''' +Everything looks like it's running fine so let's run a backfill. +``backfill`` will respect your dependencies, emit logs into files and talk to +the database to record status. If you do have a webserver up, you'll be able +to track the progress. ``airflow webserver`` will start a web server if you +are interested in tracking the progress visually as your backfill progresses. + +Note that if you use ``depends_on_past=True``, individual task instances +will depend on the success of the preceding task instance, except for the +start_date specified itself, for which this dependency is disregarded. + +The date range in this context is a ``start_date`` and optionally an ``end_date``, +which are used to populate the run schedule with task instances from this dag. + +.. code-block:: bash + + # optional, start a web server in debug mode in the background + # airflow webserver --debug & + + # start your backfill on a date range + airflow backfill tutorial -s 2015-06-01 -e 2015-06-07 + +What's Next? +------------- +That's it, you've written, tested and backfilled your very first Airflow +pipeline. Merging your code into a code repository that has a master scheduler +running against it should get it to get triggered and run every day. + +Here's a few things you might want to do next: + +* Take an in-depth tour of the UI - click all the things! +* Keep reading the docs! Especially the sections on: + + * Command line interface + * Operators + * Macros + +* Write your first pipeline! http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/5e574012/_sources/ui.rst.txt ---------------------------------------------------------------------- diff --git a/_sources/ui.rst.txt b/_sources/ui.rst.txt new file mode 100644 index 0000000..4b232fa --- /dev/null +++ b/_sources/ui.rst.txt @@ -0,0 +1,102 @@ +UI / Screenshots +================= +The Airflow UI make it easy to monitor and troubleshoot your data pipelines. +Here's a quick overview of some of the features and visualizations you +can find in the Airflow UI. + + +DAGs View +......... +List of the DAGs in your environment, and a set of shortcuts to useful pages. 
+You can see exactly how many tasks succeeded, failed, or are currently +running at a glance. + +------------ + +.. image:: img/dags.png + +------------ + + +Tree View +......... +A tree representation of the DAG that spans across time. If a pipeline is +late, you can quickly see where the different steps are and identify +the blocking ones. + +------------ + +.. image:: img/tree.png + +------------ + +Graph View +.......... +The graph view is perhaps the most comprehensive. Visualize your DAG's +dependencies and their current status for a specific run. + +------------ + +.. image:: img/graph.png + +------------ + +Variable View +............. +The variable view allows you to list, create, edit or delete the key-value pair +of a variable used during jobs. Value of a variable will be hidden if the key contains +any words in ('password', 'secret', 'passwd', 'authorization', 'api_key', 'apikey', 'access_token') +by default, but can be configured to show in clear-text. + +------------ + +.. image:: img/variable_hidden.png + +------------ + +Gantt Chart +........... +The Gantt chart lets you analyse task duration and overlap. You can quickly +identify bottlenecks and where the bulk of the time is spent for specific +DAG runs. + +------------ + +.. image:: img/gantt.png + +------------ + +Task Duration +............. +The duration of your different tasks over the past N runs. This view lets +you find outliers and quickly understand where the time is spent in your +DAG over many runs. + + +------------ + +.. image:: img/duration.png + +------------ + +Code View +......... +Transparency is everything. While the code for your pipeline is in source +control, this is a quick way to get to the code that generates the DAG and +provide yet more context. + +------------ + +.. image:: img/code.png + +------------ + +Task Instance Context Menu +.......................... +From the pages seen above (tree view, graph view, gantt, ...), it is always +possible to click on a task instance, and get to this rich context menu +that can take you to more detailed metadata, and perform some actions. + +------------ + +.. image:: img/context.png http://git-wip-us.apache.org/repos/asf/incubator-airflow-site/blob/5e574012/_static/fonts/Inconsolata.ttf ---------------------------------------------------------------------- diff --git a/_static/fonts/Inconsolata.ttf b/_static/fonts/Inconsolata.ttf new file mode 100644 index 0000000..4b8a36d Binary files /dev/null and b/_static/fonts/Inconsolata.ttf differ