Re: [Discussion] In Prep for AIP: Stateful XComs and Poke Rescheduling in Operators

2020-01-07 Thread Jacob Ferriero
Image not working on dev list here is link to the github review comment
containing said image:
https://github.com/apache/airflow/pull/6370#issuecomment-546582724.

On Tue, Jan 7, 2020 at 5:40 PM Jacob Ferriero  wrote:

> Hello Dev List,
>
> The inspiration for this is to allow operators to start a long running
> task on an external system and reschedule pokes for completion (e.g spark
> job on dataproc), instead of blocking a worker (sketched out in #6210
> ) to allow freeing up of
> slots between pokes. To do this requires supporting a method for storing
> task state between reschedules.
> It's worth noting that a task would maintain state only during reschedules
> but clear state on retries. In this way the task is idempotent before
> reaching a terminal state [SUCCES, FAIL, UP_FOR_RETRY]. This brings up a
> question of the scope of commitment to idempotency of operators. If it is
> deemed acceptable for reschedules to maintain some state, then we can free
> up workers between pokes.
>
> Because this is very similar to the purpose of XCom it's been postulated
> that we should support this behavior in XCom rather than provide a new
> model in the db for TaskState. (Though discussion here on which is more
> appropriate is more than welcome.)
>
> I'd like to put forward a proposal to resurrect the reverted #6370
>  in order to provide a
> modification to the lifetime of XComs under certain conditions. The diagram
> below helps illustrate the change originally proposed in #6370. There was
> concern about changing existing behavior (potentially breaking) and the
> fact that this makes operators stateful. Per the review comments and an
> informal discussion (meetings notes
> 
> and #sig-async-operators) I'd like to modify the approach #6370 to only
> skip clearing of XCom if the Xom key is prefixed with
> `airflow.models.xcom.DO_NOT_CLEAR_PREFIX = "_STATEFUL_"` or similar.
>
> [image: image.png]
> --
>
> *Jacob Ferriero*
>
> Strategic Cloud Engineer: Data Engineering
>
> jferri...@google.com
>
> 617-714-2509 <(617)%20714-2509>
>


-- 

*Jacob Ferriero*

Strategic Cloud Engineer: Data Engineering

jferri...@google.com

617-714-2509


[Discussion] In Prep for AIP: Stateful XComs and Poke Rescheduling in Operators

2020-01-07 Thread Jacob Ferriero
Hello Dev List,

The inspiration for this is to allow operators to start a long running task
on an external system and reschedule pokes for completion (e.g spark job on
dataproc), instead of blocking a worker (sketched out in #6210
) to allow freeing up of slots
between pokes. To do this requires supporting a method for storing task
state between reschedules.
It's worth noting that a task would maintain state only during reschedules
but clear state on retries. In this way the task is idempotent before
reaching a terminal state [SUCCES, FAIL, UP_FOR_RETRY]. This brings up a
question of the scope of commitment to idempotency of operators. If it is
deemed acceptable for reschedules to maintain some state, then we can free
up workers between pokes.

Because this is very similar to the purpose of XCom it's been postulated
that we should support this behavior in XCom rather than provide a new
model in the db for TaskState. (Though discussion here on which is more
appropriate is more than welcome.)

I'd like to put forward a proposal to resurrect the reverted #6370
 in order to provide a
modification to the lifetime of XComs under certain conditions. The diagram
below helps illustrate the change originally proposed in #6370. There was
concern about changing existing behavior (potentially breaking) and the
fact that this makes operators stateful. Per the review comments and an
informal discussion (meetings notes

and #sig-async-operators) I'd like to modify the approach #6370 to only
skip clearing of XCom if the Xom key is prefixed with
`airflow.models.xcom.DO_NOT_CLEAR_PREFIX = "_STATEFUL_"` or similar.

[image: image.png]
-- 

*Jacob Ferriero*

Strategic Cloud Engineer: Data Engineering

jferri...@google.com

617-714-2509 <(617)%20714-2509>


Re: Grouping tests using pytest markers

2020-01-07 Thread Jarek Potiuk
Since we need to tackle CI stability I would like to bump this one if
anyone wants to say something :) I am full speed ahead with implementing
those.

*Proposed Breeze changes:*
>
>- `./breeze` by default will start only the main 'airflow-testing'
>image. This way no huge resource usage will be needed when breeze is
>started by default
>- './breeze --all-integrations` will start all dependent images (so we
>will be able to run all tests)
>- './breeze --integrations [kubernetes,cassandra,mongo,
>rabbitmq,redis,openldap,kerberos] - you will be able to choose which
>integrations you want to start
>- When you run `breeze --backend postgres` it will only start postgres
>not mysql and the other way round.
>
> I have PR in progress: https://github.com/apache/airflow/pull/7091
(depends on few others)

After this is merged ./breeze will only start 'airflow-testing' image. You
will be able to launch other docker images (mongo/cassandra and others)
with *--integration mongo --integration cassandra* etc. (or --integration
all to launch all of them). This will be great for local testing (resource
usage!). This will also work in CI ( I will split the test jobs into
separate ones).


> *Proposed Pytest marks:*
>
>-
>
> pytest.mark.integrations('kubernetes'),pytest.mark.integrations('cassandra'),.
>- pytest,mark.backends("postgres"), pytest,mark.backends("mysql"),
>pytest.mark.backends("sqlite")
>
> During tests I will identify the tests that need particular integrations
and will mark/skip them appropriately and work out the right pytest
behaviour.

*Proposed Pytest behaviour:*
>
>- `pytest` -> in Breeze will run all tests that are applicable within
>the current environment:
>   - it will only run non-marked tests by default, applicable with
>   current selected backend
>   - when (for example) you stared cassandra is added it will
>   additionally run pytest.mark.integrations('cassandra')
>- `pytest` in local environment by default will only run non-marked
>tests
>- `pytest --integrations [kubernetes, ]` will only run the
>integration tests selected (will convert the switch into the corresponding
>markers (as explained in the example above)
>- `pytest --backends [postgres| mysql | sqlite] will only run the
>specific tests that use postgres/mysql/sqlite specific tests
>
> More details when I get to this one. Ideally all should be autodetected -
i.e. when you have no integration enabled, the corresponding tests should
be skipped, we should also be able to run tests for particular integration
or selected integrations with one command.

J.


Often failing tests in CI (and a way to fix them quickly and future-proof)

2020-01-07 Thread Jarek Potiuk
*TL;DR; I have a proposal how we can remedy often failing CI tests and I
have a kind request to other committers to help me to fix it in a good way.*

As we all noticed we had recently some often (far too often) failing tests
in Travis. The situation is not very good and we have to remedy it fairly
quickly.

I think we can do it without compromising the quality and without temporary
disabling some of the tests.

*Root cause of the problem*

The root cause of the problem seems to be memory used during tests. After
adding instafail we know that often the tests are failing because there is
not enough memory on Travis machines. This is a combination of the way how
our virtual machines are allocated in Travis infrastructure, more and more
tests we have, the fact that our tests require a lot of "integrations"
(running as separate images - cassandra, rabbitmq, postgres/mysql databases
etc) and the fact that running them with pytests (pytest apparently uses
more memory).

*Proposal*

One of the proposals on slack was to get rid of Cassandra tests and disable
cassandra temporarily - but I think we can do better and I can get it
merged in a day or two and get it sorted out for now (and good for the
future).

I already wrote an integration test proposal recently

(I
will resurrect that thread now) how we can split our integration tests
using pytest markers and get support from Breeze and our CI testing
framework into separate integrations. I already have working code for that
(it is a result of my resumed work on Production Image) and most of the
code is already Green in Travis and they need to get a review from other
committers. At the end of the message I copy the excerpt from the docs how
this will work.

Once we have that in, we will have a very easy (and maintainable for the
future) way that helps both with CI resources but also make Breeze far more
usable (and less resource-hungry):

   - add pytest markers so that we know which tests are "integration" ones.
   - start Breeze locally without any external integrations (right now only
   Kerberos is needed) - most of the tests works there. Far less resource usage
   - start Breeze easily with *--integration mongo --integration
cassandra *etc. whenever
   we need to run tests for that integration
   - run all the "non-integration" tests in CI without the integrations
   started
   - run only the "integration-related" tests in CI with the integrations
   started
   - we will have more jobs in CI but they should run much more reliably
   and faster in general
   - also one of the changes is to improve the way we build kind/kubernetes
   tests in order to unblock migration to GithubActions that Tomek works on -
   that might be our ultimate speedup/stabilisation
   - For those curious ones is updated documentation in my PR: "Launching
   Breeze integrations"
   
,
   "Running
   

   tests with Kubernetes in Breeze"
   


*PRs - kind request to other committers*

I have a series of PRs that are already implementing almost all of it (I
needed that in order to implement Production Image support). They are
depending on each other - I added unit test support for Bash scripts and
several improvements and added simplifications:

   - [AIRFLOW-6489] Add BATS support for Bash unit testing
    [ready for review] -
   needed to get more control over other changes.
   - [AIRFLOW-6475] Remove duplication of volume mount specs in Breeze.
   [ready for review]-
   improves the consistency on how we run Breeze/CI
   - [AIRFLOW-6491] improve parameter handling in breeze
    [ready for review] -
   tested and improved way how we handle --options in Breeze (needed for
   Kubernetes improvements
   - [AIRFLOW-5704] Improve Kind Kubernetes scripts for local testing
   . - [testing]  improve
   handling Kubernetes Kind testing (fixes issues with loading images/
   upgrades kind to latest version).
   - [AIRFLOW-6489] Separate integrations [WIP]
    - [WIP] this is the test
   introducing different integrations - it already works for Breeze and has
   support for deciding which integrations should be started  - I just need to
   separate out the "Integration tests" to separate jobs.

I have a kind request to other committers - can you please take a look at
those and help to merge it quickly?

I