Chesnay Schepler created FLINK-11463:
----------------------------------------

             Summary: Rework end-to-end tests in Java
                 Key: FLINK-11463
                 URL: https://issues.apache.org/jira/browse/FLINK-11463
             Project: Flink
          Issue Type: New Feature
          Components: E2E Tests
            Reporter: Chesnay Schepler
            Assignee: Chesnay Schepler


This is the (long-term) umbrella issue for reworking our end-to-end tests in Java 
on top of a new set of utilities.

Below are some areas where problems have been identified that I want to address 
with a prototype soon. This prototype primarily aims to introduce certain 
patterns to be built upon in the future.

h2. Environments

h4. Problem

Our current tests work directly against flink-dist and set up local clusters 
with/without HA. Similar issues apply to Kafka and Elasticsearch.
This prevents us from re-using tests for other environments (Yarn, Docker) and 
distributed settings.

We also frequently have issues with cleaning up resources as it is the 
responsibility of the test itself.

h4. Proposal

Introduce a common interface for a given resource type (i.e. Flink, Kafka) that 
tests will work against.
These resources should be implemented as jUnit external resources to allow 
reasonable life-cycle management.

Tests get access to an instance of this resource through a factory method.

Each resource implementation has a dedicated factory that is loaded with a 
{{ServiceLoader}}. Factories evaluate system-properties to determine whether 
the implementation should be loaded, and then optionally configure the resource.

Example:
{code}
public interface FlinkResource {
        ... common methods ...

        /**
         * Returns the configured FlinkResource implementation, or a
         * {@link LocalStandaloneFlinkResource} if none is configured.
         *
         * @return configured FlinkResource, or {@link LocalStandaloneFlinkResource} if none is configured
         */
        static FlinkResource get() {
                // load factories
                // evaluate system properties
                // return instance
        }
}

public interface FlinkResourceFactory {

        /**
         * Returns a {@link FlinkResource} instance. If the instance could not be
         * instantiated (for example, because a mandatory parameter was missing),
         * then an empty {@link Optional} should be returned.
         *
         * @return FlinkResource instance, or an empty Optional if the instance could not be instantiated
         */
        Optional<FlinkResource> create();
}
{code}

As an example, running {{mvn verify -De2e.flink.mode=localStandalone}} could load 
a FlinkResource that sets up a local standalone cluster, while for {{mvn verify 
-De2e.flink.mode=distributedStandalone -De2e.flink.hosts=...}} it would connect 
to the given hosts and set up a distributed cluster.
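The factory resolution described above could be sketched in plain Java along the following lines; all class names besides {{ServiceLoader}} and {{Optional}} are assumptions for illustration:

```java
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical sketch of how FlinkResource.get() could resolve the
// implementation; the names here are assumptions, not the final API.
interface FlinkResource {
    void startCluster();
}

interface FlinkResourceFactory {
    Optional<FlinkResource> create();
}

class FlinkResourceLocator {
    static FlinkResource get() {
        // Ask every factory registered on the classpath; each factory
        // inspects system properties (e.g. "e2e.flink.mode") itself and
        // returns an empty Optional if it does not apply.
        for (FlinkResourceFactory factory : ServiceLoader.load(FlinkResourceFactory.class)) {
            Optional<FlinkResource> resource = factory.create();
            if (resource.isPresent()) {
                return resource.get();
            }
        }
        // fall back to the default local standalone implementation
        return new LocalStandaloneFlinkResource();
    }
}

class LocalStandaloneFlinkResource implements FlinkResource {
    public void startCluster() { /* set up a local standalone cluster */ }
}
```

With no factory registered, the lookup falls through to the local standalone default, which keeps {{mvn verify}} with no extra properties working out of the box.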

Tests are not _required_ to work against the common interface, and may be 
hard-wired to run against specific implementations; to support this, the 
resource implementations should be public.

h4. Future considerations

The factory method may be extended to allow tests to specify a set of 
conditions that must be fulfilled, for example that HA is enabled. If these 
requirements cannot be fulfilled the test should be skipped.
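One possible shape for such a conditional lookup is sketched below; all names here are hypothetical, and the actual skip would presumably go through jUnit's {{Assume}} mechanism:

```java
import java.util.EnumSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: tests declare the properties they require, and the
// lookup returns an empty Optional if the configured resource cannot satisfy
// them, in which case the test would be skipped.
enum ResourceProperty { HIGH_AVAILABILITY, DISTRIBUTED }

class ConditionalFlinkResource {
    private final Set<ResourceProperty> supported;

    ConditionalFlinkResource(Set<ResourceProperty> supported) {
        this.supported = supported;
    }

    // Returns this resource only if every required property is supported.
    Optional<ConditionalFlinkResource> require(Set<ResourceProperty> required) {
        return supported.containsAll(required) ? Optional.of(this) : Optional.empty();
    }
}
```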

h2. Split Management

h4. Problem

End-to-end tests are run in separate {{cron-<version>-e2e}} branches. To 
accommodate the Travis time limits we run a total of 6 jobs each covering a 
subset of the tests.
These so-called splits are currently managed in the respective branches, and 
not on master/release branches.

This is a rather hidden detail that not everyone is aware of, nor is it easily 
discoverable. It has several times resulted in newly added tests not actually 
being run. Furthermore, if the arguments for tests are modified, these changes 
have to be replicated to each branch.

h4. Proposal

Use jUnit Categories to assign each test explicitly to one of the Travis jobs.
{code}
@Category(TravisGroup1.class)
public class MyTestRunningInTheFirstJob {
        ...
}
{code}

It's a bit on the nose but a rather simple solution.

A given group of tests could be executed by running {{mvn verify 
-Dcategories="org.apache.flink.tests.util.TravisGroup1"}}.
All tests can be executed by running {{mvn verify}} without specifying a 
category.
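Wiring the {{categories}} property through to the surefire plugin could look roughly like the following pom fragment; the exact property plumbing is an assumption, though surefire's {{groups}} parameter does support jUnit categories:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- empty groups value means: run all tests -->
    <groups>${categories}</groups>
  </configuration>
</plugin>
```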

h4. Future considerations

Tests may furthermore be categorized based on what they are testing (e.g. 
"Metrics", "Checkpointing", "Kafka") to allow running a certain subset of tests 
quickly.

h2. Caching of downloaded artifacts

h4. Problem

Several tests download archives for setting up systems like Kafka or 
Elasticsearch. We currently do not cache downloads in any way, resulting in 
less stable tests (as mirrors aren't always available) and overall increased 
test duration (since the downloads at times are quite slow). The duration issue 
becomes especially apparent when running tests in a loop for debugging or 
release-testing purposes.
Finally, it also puts unnecessary strain on the download mirrors.

h4. Proposal

Add a {{DownloadCache}} interface with a single {{Path getOrDownload(String 
url, Path targetDir)}} method.
Access to and loading of implementations are handled like resources (see above).

The caching behavior is implementation-dependent.

A reasonable implementation should allow files to be cached in a user-provided 
directory, with an optional time-to-live for long-term setups.
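A minimal sketch of such an implementation, assuming a TTL-based cache keyed by file name; the class name and everything beyond the proposed interface are assumptions:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.time.Duration;
import java.time.Instant;

// Interface as proposed above.
interface DownloadCache {
    Path getOrDownload(String url, Path targetDir) throws IOException;
}

// Hypothetical implementation: caches by file name in a user-provided
// directory and re-downloads once a file is older than the time-to-live.
class TtlDownloadCache implements DownloadCache {
    private final Path cacheDir;
    private final Duration timeToLive;

    TtlDownloadCache(Path cacheDir, Duration timeToLive) {
        this.cacheDir = cacheDir;
        this.timeToLive = timeToLive;
    }

    @Override
    public Path getOrDownload(String url, Path targetDir) throws IOException {
        String fileName = url.substring(url.lastIndexOf('/') + 1);
        Path cached = cacheDir.resolve(fileName);
        if (Files.notExists(cached) || isExpired(cached)) {
            Files.createDirectories(cacheDir);
            try (InputStream in = new URL(url).openStream()) {
                Files.copy(in, cached, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        // always hand the test a fresh copy so it may freely modify/delete it
        Files.createDirectories(targetDir);
        return Files.copy(cached, targetDir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
    }

    private boolean isExpired(Path file) throws IOException {
        Instant lastModified = Files.getLastModifiedTime(file).toInstant();
        return lastModified.plus(timeToLive).isBefore(Instant.now());
    }
}
```

Copying from the cache into the test's target directory keeps the cache itself read-only from the test's perspective, which sidesteps the cleanup problems mentioned in the Environments section.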





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
