[jira] [Created] (FLINK-20410) SQLClientSchemaRegistryITCase.testWriting failed with "Subject 'user_behavior' not found.; error code: 40401"

2020-11-27 Thread Dian Fu (Jira)
Dian Fu created FLINK-20410:
---

 Summary: SQLClientSchemaRegistryITCase.testWriting failed with 
"Subject 'user_behavior' not found.; error code: 40401"
 Key: FLINK-20410
 URL: https://issues.apache.org/jira/browse/FLINK-20410
 Project: Flink
  Issue Type: Bug
  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Affects Versions: 1.12.0
Reporter: Dian Fu
 Fix For: 1.12.0


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=10276&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=3425d8ba-5f03-540a-c64b-51b8481bf7d6

{code}
2020-11-28T01:14:08.6444305Z Nov 28 01:14:08 [ERROR] 
testWriting(org.apache.flink.tests.util.kafka.SQLClientSchemaRegistryITCase)  
Time elapsed: 74.818 s  <<< ERROR!
2020-11-28T01:14:08.6445353Z Nov 28 01:14:08 
io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: 
Subject 'user_behavior' not found.; error code: 40401
2020-11-28T01:14:08.6446071Z Nov 28 01:14:08at 
io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:292)
2020-11-28T01:14:08.6446910Z Nov 28 01:14:08at 
io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:352)
2020-11-28T01:14:08.6447522Z Nov 28 01:14:08at 
io.confluent.kafka.schemaregistry.client.rest.RestService.getAllVersions(RestService.java:769)
2020-11-28T01:14:08.6448352Z Nov 28 01:14:08at 
io.confluent.kafka.schemaregistry.client.rest.RestService.getAllVersions(RestService.java:760)
2020-11-28T01:14:08.6449091Z Nov 28 01:14:08at 
io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getAllVersions(CachedSchemaRegistryClient.java:364)
2020-11-28T01:14:08.6449878Z Nov 28 01:14:08at 
org.apache.flink.tests.util.kafka.SQLClientSchemaRegistryITCase.testWriting(SQLClientSchemaRegistryITCase.java:195)
{code}
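The 40401 response suggests a race: the test queries the subject before the writing job has registered its schema with the registry. One common mitigation is to poll with a bounded retry instead of a single call. The helper below is a hypothetical sketch (not Flink or Confluent client code) of what such a retry could look like:

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not part of Flink or the Confluent client): polls an
// action until it stops throwing or the attempt budget is exhausted.
class Retry {
    static <T> T retryUntilSuccess(Callable<T> action, int maxAttempts,
                                   long sleepMillis) throws Exception {
        Exception last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return action.call(); // succeeds once the subject exists
            } catch (Exception e) {
                last = e;             // e.g. a REST error with code 40401
                Thread.sleep(sleepMillis);
            }
        }
        throw last;
    }
}
```

The test could then wrap its `getAllVersions("user_behavior")` call in such a helper rather than calling it once.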



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Stop adding new bash-based e2e tests to Flink

2020-11-27 Thread Xingbo Huang
Thanks Matthias,

I have created sub-tasks for PyFlink related e2e tests.

Best,
Xingbo

> On Nov 27, 2020, at 10:53 PM, Jark Wu wrote:
> 
> Thanks Matthias,
> 
> I have created sub-tasks for the Table SQL related bash-based e2e tests.
> 
> Best,
> Jark
> 
> On Fri, 27 Nov 2020 at 21:25, Matthias Pohl  wrote:
> 
>> Thanks Robert for pushing this. +1 for creating java-based e2e tests.
>> 
>> In the engine team, we decided to work towards the goal of migrating the
>> bash-based e2e tests to Java/Docker. We plan to migrate the existing
>> bash-based e2e tests located in the Engine team's component space
>> step-by-step. I created an umbrella Jira issue [1] to collect and document
>> the migration efforts. Feel free to do the same by creating subtasks under
>> [1].
>> 
>> Best,
>> Matthias
>> 
>> [1] https://issues.apache.org/jira/browse/FLINK-20392
>> 
>> On Thu, Nov 19, 2020 at 8:02 AM Yun Tang  wrote:
>> 
>>> +1 for java-based E2E tests, as bash scripts lack the power to handle
>>> more complicated cases.
>>> 
>>> For the docker image improvement, I think we should be more cautious, as
>>> developers in China might suffer from network issues; at the least we
>>> should find some guides to speed up the image downloading.
>>> 
>>> Best
>>> Yun Tang
>>> 
>>> From: Xingbo Huang 
>>> Sent: Thursday, November 19, 2020 12:09
>>> To: dev 
>>> Subject: Re: [DISCUSS] Stop adding new bash-based e2e tests to Flink
>>> 
>>> Big +1 for java-based e2e tests. Currently, PyFlink related tests each
>>> take ~15 minutes in the bash e2e tests because we are using a secured
>>> YARN cluster, which is the only convenient way of starting a YARN cluster
>>> in the bash e2e tests. I think if we migrate these tests to the
>>> java-based testing framework, we will start a YARN cluster more
>>> conveniently, which will greatly reduce our testing time.
>>> 
>>> Best,
>>> Xingbo
>>> 
On Thu, Nov 19, 2020 at 10:47 AM, Rui Li wrote:
>>> 
 Big +1 to java-based e2e tests. It'll be much easier to write/debug
>> these
 tests.
 
 On Wed, Nov 18, 2020 at 9:44 PM Leonard Xu  wrote:
 
> +1 to stop using bash scripts.
> I have also experienced that bash scripts are really hard to
> maintain and debug. Thanks @Robert for the great work again.
> 
> I think testcontainers is a nice candidate.
> 
> Best,
> Leonard
> 
>> On Nov 18, 2020, at 19:46, Aljoscha Krettek wrote:
>> 
>> +1
>> 
>> And I want to second Arvid's mention of testcontainers [1].
>> 
>> [1] https://www.testcontainers.org/
>> 
>> On 18.11.20 10:43, Yang Wang wrote:
>>> Thanks Till and Jark for sharing the information.
>>> I am also +1 for this proposal and glad to wire the newly introduced
>>> K8s HA e2e tests to the java-based framework.
>>> Best,
>>> Yang
>>> On Wed, Nov 18, 2020 at 5:23 PM, Jark Wu wrote:
 +1 to use the Java-based testing framework and +1 for using
>> docker
> images
 in the future.
 IIUC, the Java-based testing framework refers to the
 `flink-end-to-end-tests-common` module.
 The java-based framework helped us a lot when debugging the
>>> unstable
> e2e
 tests.
 
 Best,
 Jark
 
 On Wed, 18 Nov 2020 at 14:42, Yang Wang 
 wrote:
 
> Thanks for starting this discussion.
> 
> In general, I agree with you that a java-based testing framework
>>> is
 better
> than the bash-based. It will
> help a lot for the commons and utilities.
> 
> Since I am trying to add a new bash-based Kubernetes HA test, I
>>> have
> some
> quick questions.
> * I am not sure where the java-based framework is. Do you mean
> "flink-jepsen" module or sth else?
> * Maybe it will be harder to run a cli command(e.g. flink run /
> run-application) to submit a Flink job in the java-based
>>> framework.
> * It will be harder to inject some operations. For example, kill
>>> the
> JobManager in Kubernetes. Currently, I
> am trying to use "kubectl exec" to do this.
> 
> 
> Best,
> Yang
> 
> On Tue, Nov 17, 2020 at 11:36 PM, Robert Metzger wrote:
> 
>> Hi all,
>> 
>> Since we are currently testing the 1.12 release, and
>> potentially
> adding
>> more automated e2e tests, I would like to bring up our end to
>> end
> tests
> for
>> discussion.
>> 
>> Some time ago, we introduced a Java-based testing framework,
>> with
 the
> idea
>> of replacing the current bash-based end to end tests.
>> Since the introduction of the java-based framework, more bash
>>> tests
 were
>> actually added, making a future migration even harder.
>> 
>> *For that reason, I would like to propose that we are 

[jira] [Created] (FLINK-20409) Migrate test_kubernetes_pyflink_application.sh

2020-11-27 Thread Huang Xingbo (Jira)
Huang Xingbo created FLINK-20409:


 Summary: Migrate test_kubernetes_pyflink_application.sh
 Key: FLINK-20409
 URL: https://issues.apache.org/jira/browse/FLINK-20409
 Project: Flink
  Issue Type: Sub-task
  Components: API / Python, Tests
Reporter: Huang Xingbo








[jira] [Created] (FLINK-20408) Migrate test_pyflink_yarn.sh

2020-11-27 Thread Huang Xingbo (Jira)
Huang Xingbo created FLINK-20408:


 Summary: Migrate test_pyflink_yarn.sh
 Key: FLINK-20408
 URL: https://issues.apache.org/jira/browse/FLINK-20408
 Project: Flink
  Issue Type: Sub-task
  Components: API / Python, Tests
Reporter: Huang Xingbo








[jira] [Created] (FLINK-20407) Migrate test_pyflink.sh

2020-11-27 Thread Huang Xingbo (Jira)
Huang Xingbo created FLINK-20407:


 Summary: Migrate test_pyflink.sh
 Key: FLINK-20407
 URL: https://issues.apache.org/jira/browse/FLINK-20407
 Project: Flink
  Issue Type: Sub-task
  Components: API / Python, Tests
Reporter: Huang Xingbo








[jira] [Created] (FLINK-20406) Return the Checkpoint ID of the restored Checkpoint in CheckpointCoordinator.restoreLatestCheckpointedStateToSubtasks()

2020-11-27 Thread Stephan Ewen (Jira)
Stephan Ewen created FLINK-20406:


 Summary: Return the Checkpoint ID of the restored Checkpoint in 
CheckpointCoordinator.restoreLatestCheckpointedStateToSubtasks()
 Key: FLINK-20406
 URL: https://issues.apache.org/jira/browse/FLINK-20406
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 1.12.0


To allow the scheduler to notify Operator Coordinators of subtask restores 
(local failover), we need to know which checkpoint ID was restored. 

This change does not adjust the other restore methods of the Checkpoint 
Coordinator, because the fact that the Scheduler needs to be involved in the 
subtask restore notification at all is only due to a shortcoming of the 
Checkpoint Coordinator: The CC is not aware of subtask restores, it always 
restores all subtasks and relies on the fact that assigning state to a running 
execution attempt has no effect.





[jira] [Created] (FLINK-20405) The LAG function in over window is not implemented correctly

2020-11-27 Thread Leonard Xu (Jira)
Leonard Xu created FLINK-20405:
--

 Summary: The LAG function in over window is not implemented 
correctly
 Key: FLINK-20405
 URL: https://issues.apache.org/jira/browse/FLINK-20405
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Runtime
Affects Versions: 1.12.0
Reporter: Leonard Xu


The LAG(input, offset, default) function in an over window always returns the 
current row's input, no matter how the offset is set.

After inspecting the generated code of the function, I think the 
implementation is incorrect and needs to be fixed.
{code:java}
// the offset and default value is never used
public UnboundedOverAggregateHelper$24(java.lang.Object[] references) throws Exception {
    constant$14 = ((int) 1);
    constant$14isNull = false;
    constant$15 = ((org.apache.flink.table.data.binary.BinaryStringData) str$13);
    constant$15isNull = false;
    typeSerializer$19 = (((org.apache.flink.table.runtime.typeutils.StringDataSerializer) references[0]));
}

public void accumulate(org.apache.flink.table.data.RowData accInput) throws Exception {
    org.apache.flink.table.data.binary.BinaryStringData field$21;
    boolean isNull$21;
    org.apache.flink.table.data.binary.BinaryStringData field$22;
    isNull$21 = accInput.isNullAt(2);
    field$21 = org.apache.flink.table.data.binary.BinaryStringData.EMPTY_UTF8;
    if (!isNull$21) {
        field$21 = ((org.apache.flink.table.data.binary.BinaryStringData) accInput.getString(2));
    }
    field$22 = field$21;
    if (!isNull$21) {
        field$22 = (org.apache.flink.table.data.binary.BinaryStringData) (typeSerializer$19.copy(field$22));
    }
    if (agg0_leadlag != field$22) {
        agg0_leadlag = ((org.apache.flink.table.data.binary.BinaryStringData) typeSerializer$19.copy(field$22));
    }
    agg0_leadlagIsNull = isNull$21;
}
{code}
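The generated accumulate() only ever copies the current row's value into agg0_leadlag; the offset and default constants are dead. For contrast, a minimal standalone model (illustrative only, not the Flink codegen) of what LAG(input, offset, default) has to do is buffer the last `offset` inputs:

```java
import java.util.ArrayDeque;

// Minimal standalone model of LAG(input, offset, default): remembers the
// last `offset` inputs and returns the value seen `offset` rows earlier,
// or the default while fewer than `offset` rows have been accumulated.
class Lag<T> {
    private final int offset;
    private final T defaultValue;
    private final ArrayDeque<T> buffer = new ArrayDeque<>();

    Lag(int offset, T defaultValue) {
        this.offset = offset;
        this.defaultValue = defaultValue;
    }

    /** Accumulates one row and returns the LAG value for that row. */
    T accumulate(T input) {
        T result = buffer.size() < offset ? defaultValue : buffer.removeFirst();
        buffer.addLast(input);
        return result;
    }
}
```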
 

The question was raised on the user mailing list [1].

[1] 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/FlinkSQL-kafka-gt-dedup-gt-kafka-td39335.html





[jira] [Created] (FLINK-20404) ZooKeeper quorum fails to start due to missing log4j library

2020-11-27 Thread Pedro Miguel Rainho Chaves (Jira)
Pedro Miguel Rainho Chaves created FLINK-20404:
--

 Summary: ZooKeeper quorum fails to start due to missing log4j 
library
 Key: FLINK-20404
 URL: https://issues.apache.org/jira/browse/FLINK-20404
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.11.2
Reporter: Pedro Miguel Rainho Chaves


Upon starting a ZooKeeper quorum using Flink's bundled ZooKeeper, the 
following exception is thrown.

 
{code:java}
2020-11-27 13:13:38,371 ERROR 
org.apache.flink.runtime.zookeeper.FlinkZooKeeperQuorumPeer  [] - Error running 
ZooKeeper quorum peer: org/apache/log4j/jmx/HierarchyDynamicMBean
java.lang.NoClassDefFoundError: org/apache/log4j/jmx/HierarchyDynamicMBean
at 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.jmx.ManagedUtil.registerLog4jMBeans(ManagedUtil.java:51)
 ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
at 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:125)
 ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
at 
org.apache.flink.runtime.zookeeper.FlinkZooKeeperQuorumPeer.runFlinkZkQuorumPeer(FlinkZooKeeperQuorumPeer.java:123)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]
at 
org.apache.flink.runtime.zookeeper.FlinkZooKeeperQuorumPeer.main(FlinkZooKeeperQuorumPeer.java:79)
 [flink-dist_2.11-1.11.2.jar:1.11.2]
Caused by: java.lang.ClassNotFoundException: 
org.apache.log4j.jmx.HierarchyDynamicMBean
at java.net.URLClassLoader.findClass(URLClassLoader.java:382) 
~[?:1.8.0_262]
at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_262]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) 
~[?:1.8.0_262]
at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_262]
... 4 more
{code}
This happens because the new Flink version no longer ships a log4j 1.x 
library. It can be worked around by adding log4j-1.2.17.jar to the classpath; 
nonetheless, the bundled ZooKeeper version should be compatible with the 
log4j2 libraries that come with Flink's default installation.

 

*Steps to reproduce:*
 # Fresh install of flink version 1.11.2 
 # Change the zookeeper config to start as a quorum
{code:java}
server.1=:2888:3888
server.2=:2888:3888{code}

 # Start zookeeper
 # /bin/zookeeper.sh start-foreground 1
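A quick way to confirm the diagnosis on a given installation is to check whether the log4j 1.x class is visible on the classpath at all. This small diagnostic helper (hypothetical, not shipped with Flink) mirrors the lookup that fails with NoClassDefFoundError in the stack trace above:

```java
// Small diagnostic (hypothetical, not shipped with Flink): reports whether a
// class is visible on the classpath, mirroring the class lookup that fails
// with NoClassDefFoundError in the ZooKeeper JMX registration above.
class ClasspathCheck {
    static boolean isPresent(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}
```

Running `ClasspathCheck.isPresent("org.apache.log4j.jmx.HierarchyDynamicMBean")` on an affected installation would return false until the log4j 1.x jar is added.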





Re: [DISCUSS] Stop adding new bash-based e2e tests to Flink

2020-11-27 Thread Jark Wu
Thanks Matthias,

I have created sub-tasks for the Table SQL related bash-based e2e tests.

Best,
Jark

On Fri, 27 Nov 2020 at 21:25, Matthias Pohl  wrote:

> Thanks Robert for pushing this. +1 for creating java-based e2e tests.
>
> In the engine team, we decided to work towards the goal of migrating the
> bash-based e2e tests to Java/Docker. We plan to migrate the existing
> bash-based e2e tests located in the Engine team's component space
> step-by-step. I created an umbrella Jira issue [1] to collect and document
> the migration efforts. Feel free to do the same by creating subtasks under
> [1].
>
> Best,
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-20392
>
> On Thu, Nov 19, 2020 at 8:02 AM Yun Tang  wrote:
>
> > +1 for java-based E2E tests, as bash scripts lack the power to handle
> > more complicated cases.
> >
> > For the docker image improvement, I think we should be more cautious, as
> > developers in China might suffer from network issues; at the least we
> > should find some guides to speed up the image downloading.
> >
> > Best
> > Yun Tang
> > 
> > From: Xingbo Huang 
> > Sent: Thursday, November 19, 2020 12:09
> > To: dev 
> > Subject: Re: [DISCUSS] Stop adding new bash-based e2e tests to Flink
> >
> > Big +1 for java-based e2e tests. Currently, PyFlink related tests each
> > take ~15 minutes in the bash e2e tests because we are using a secured
> > YARN cluster, which is the only convenient way of starting a YARN cluster
> > in the bash e2e tests. I think if we migrate these tests to the
> > java-based testing framework, we will start a YARN cluster more
> > conveniently, which will greatly reduce our testing time.
> >
> > Best,
> > Xingbo
> >
> > On Thu, Nov 19, 2020 at 10:47 AM, Rui Li wrote:
> >
> > > Big +1 to java-based e2e tests. It'll be much easier to write/debug
> these
> > > tests.
> > >
> > > On Wed, Nov 18, 2020 at 9:44 PM Leonard Xu  wrote:
> > >
> > > > +1 to stop using bash scripts.
> > > > I have also experienced that bash scripts are really hard to
> > > > maintain and debug. Thanks @Robert for the great work again.
> > > >
> > > > I think testcontainers is a nice candidate.
> > > >
> > > > Best,
> > > > Leonard
> > > >
> > > > > On Nov 18, 2020, at 19:46, Aljoscha Krettek wrote:
> > > > >
> > > > > +1
> > > > >
> > > > > And I want to second Arvid's mention of testcontainers [1].
> > > > >
> > > > > [1] https://www.testcontainers.org/
> > > > >
> > > > > On 18.11.20 10:43, Yang Wang wrote:
> > > > >> Thanks Till and Jark for sharing the information.
> > > > >> I am also +1 for this proposal and glad to wire the newly introduced
> > > > >> K8s HA e2e tests to the java-based framework.
> > > > >> Best,
> > > > >> Yang
> > > > >> On Wed, Nov 18, 2020 at 5:23 PM, Jark Wu wrote:
> > > > >>> +1 to use the Java-based testing framework and +1 for using
> docker
> > > > images
> > > > >>> in the future.
> > > > >>> IIUC, the Java-based testing framework refers to the
> > > > >>> `flink-end-to-end-tests-common` module.
> > > > >>> The java-based framework helped us a lot when debugging the
> > unstable
> > > > e2e
> > > > >>> tests.
> > > > >>>
> > > > >>> Best,
> > > > >>> Jark
> > > > >>>
> > > > >>> On Wed, 18 Nov 2020 at 14:42, Yang Wang 
> > > wrote:
> > > > >>>
> > > >  Thanks for starting this discussion.
> > > > 
> > > >  In general, I agree with you that a java-based testing framework
> > is
> > > > >>> better
> > > >  than the bash-based. It will
> > > >  help a lot for the commons and utilities.
> > > > 
> > > >  Since I am trying to add a new bash-based Kubernetes HA test, I
> > have
> > > > some
> > > >  quick questions.
> > > >  * I am not sure where the java-based framework is. Do you mean
> > > >  "flink-jepsen" module or sth else?
> > > >  * Maybe it will be harder to run a cli command(e.g. flink run /
> > > >  run-application) to submit a Flink job in the java-based
> > framework.
> > > >  * It will be harder to inject some operations. For example, kill
> > the
> > > >  JobManager in Kubernetes. Currently, I
> > > >  am trying to use "kubectl exec" to do this.
> > > > 
> > > > 
> > > >  Best,
> > > >  Yang
> > > > 
> > > >  On Tue, Nov 17, 2020 at 11:36 PM, Robert Metzger wrote:
> > > > 
> > > > > Hi all,
> > > > >
> > > > > Since we are currently testing the 1.12 release, and
> potentially
> > > > adding
> > > > > more automated e2e tests, I would like to bring up our end to
> end
> > > > tests
> > > >  for
> > > > > discussion.
> > > > >
> > > > > Some time ago, we introduced a Java-based testing framework,
> with
> > > the
> > > >  idea
> > > > > of replacing the current bash-based end to end tests.
> > > > > Since the introduction of the java-based framework, more bash
> > tests
> > > > >>> were
> > > > > actually added, making a future migration even harder.
> > > > >
> > > > > *For that reason, I 

[jira] [Created] (FLINK-20403) Migrate test_table_shaded_dependencies.sh

2020-11-27 Thread Jark Wu (Jira)
Jark Wu created FLINK-20403:
---

 Summary: Migrate test_table_shaded_dependencies.sh
 Key: FLINK-20403
 URL: https://issues.apache.org/jira/browse/FLINK-20403
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Ecosystem, Tests
Reporter: Jark Wu








[jira] [Created] (FLINK-20401) Migrate test_tpcds.sh

2020-11-27 Thread Jark Wu (Jira)
Jark Wu created FLINK-20401:
---

 Summary: Migrate test_tpcds.sh
 Key: FLINK-20401
 URL: https://issues.apache.org/jira/browse/FLINK-20401
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Ecosystem, Tests
Reporter: Jark Wu








[jira] [Created] (FLINK-20402) Migrate test_tpch.sh

2020-11-27 Thread Jark Wu (Jira)
Jark Wu created FLINK-20402:
---

 Summary: Migrate test_tpch.sh
 Key: FLINK-20402
 URL: https://issues.apache.org/jira/browse/FLINK-20402
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Ecosystem, Tests
Reporter: Jark Wu








[jira] [Created] (FLINK-20400) Migrate test_streaming_sql.sh

2020-11-27 Thread Jark Wu (Jira)
Jark Wu created FLINK-20400:
---

 Summary: Migrate test_streaming_sql.sh
 Key: FLINK-20400
 URL: https://issues.apache.org/jira/browse/FLINK-20400
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / API, Tests
Reporter: Jark Wu








[jira] [Created] (FLINK-20399) Migrate test_sql_client.sh

2020-11-27 Thread Jark Wu (Jira)
Jark Wu created FLINK-20399:
---

 Summary: Migrate test_sql_client.sh
 Key: FLINK-20399
 URL: https://issues.apache.org/jira/browse/FLINK-20399
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / API, Table SQL / Client, Tests
Reporter: Jark Wu








[jira] [Created] (FLINK-20398) Migrate test_batch_sql.sh

2020-11-27 Thread Jark Wu (Jira)
Jark Wu created FLINK-20398:
---

 Summary: Migrate test_batch_sql.sh
 Key: FLINK-20398
 URL: https://issues.apache.org/jira/browse/FLINK-20398
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / API, Tests
Reporter: Jark Wu








[DISCUSS] Moving to JUnit5

2020-11-27 Thread Arvid Heise
Dear devs,

I'd like to start a discussion to migrate to a higher JUnit version.

The main motivations are:
- Making full use of Java 8 lambdas for writing easier-to-read tests and a
better-performing way of composing failure messages.
- Improved test structures with nested and dynamic tests.
- Much better support for parameterized tests to avoid separating
parameterized and non-parameterized parts into different test classes.
- Composable dependencies and better hooks for advanced use cases
(TestLogger).
- Better exception verification
- More current infrastructure
- Better parallelizable
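To illustrate the first point: JUnit5's assertTrue(boolean, Supplier&lt;String&gt;) overload can be modeled in plain Java as below (a sketch, not the actual JUnit implementation). The failure message is only composed when the assertion actually fails, so an expensive message costs nothing on the happy path:

```java
import java.util.function.Supplier;

// Plain-Java model of the JUnit5-style lazy failure message: the (possibly
// expensive) message string is only built when the assertion actually fails.
class LazyAssert {
    static void assertTrue(boolean condition, Supplier<String> message) {
        if (!condition) {
            throw new AssertionError(message.get());
        }
    }
}
```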

Why now?
- JUnit5 is now mature enough to consider it for such a complex project
- We are porting more and more e2e tests to JUnit, and it would be a pity to
do all the work twice (okay, some of it has already been done and would
result in adjustments, but the sooner we migrate, the less needs to be
touched twice)

Why JUnit5?
There are other interesting alternatives, such as TestNG, and I'm happy to
hear specific proposals. For now, I'd like to focus on JUnit5, as it offers
the easier migration path from JUnit4.

Please discuss if you would also be interested in moving onward. To get
some overview, I'd like to see some informal +1 for the options:

[ ] Stick to JUnit4 for the time being
[ ] Move to JUnit5 (see migration path below)
[ ] Alternative idea + advantages over JUnit5 + some very rough migration
path

---

Migrating from JUnit4 to JUnit5 can be done in several steps, so that we can
move over gradually.

0. (There is a way to use JUnit4 + 5 at the same time in a project - you'd
use a specific JUnit4 runner to execute JUnit5. I'd like to skip this step
as it would slow down migration significantly)
1. Use JUnit5 with vintage runner. JUnit4 tests run mostly out of the box.
The most important difference is that only 3 base rules are supported and
the remainder needs to be migrated. Luckily, most of our rules derive from
the supported ExternalResource. So in this step, we would need to migrate
the rules.
2. Implement new tests in JUnit5.
3. Soft-migrate old tests to JUnit5. This is mostly a renaming of
annotations (@Before -> @BeforeEach, etc.). Adjust parameterized tests
(~400), replace rule usages (~670) with extensions, and adjust exception
handling (~1600 tests) and timeouts (~200). This can be done on a
test-class-by-test-class basis, and there is no hurry.
4. Remove vintage runner, once most tests are migrated by doing a final
push for lesser used modules.

Let me know what you think and I'm happy to answer all questions.

-- 

Arvid Heise | Senior Java Developer



Follow us @VervericaData

--

Join Flink Forward  - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
(Toni) Cheng


[jira] [Created] (FLINK-20397) Pass checkpointId to OperatorCoordinator.resetToCheckpoint().

2020-11-27 Thread Stephan Ewen (Jira)
Stephan Ewen created FLINK-20397:


 Summary: Pass checkpointId to 
OperatorCoordinator.resetToCheckpoint().
 Key: FLINK-20397
 URL: https://issues.apache.org/jira/browse/FLINK-20397
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.11.2
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 1.12.0, 1.11.3


The OperatorCoordinator.resetToCheckpoint() method currently lacks the 
information about which checkpoint it recovers to.

That forces implementers to assume strict ordering of method calls between 
restore and failure. While that is currently guaranteed in this case, it is not 
guaranteed in other places (see parent issue).

Because of that, we want implementations to not assume method order at all, but 
rely on explicit information passed to the methods (checkpoint IDs). Otherwise 
we end up with mixed implementations that partially infer context from the 
order of method calls, and partially use explicit information that was passed.





[jira] [Created] (FLINK-20396) Replace "OperatorCoordinator.subtaskFailed()" with "subtaskRestored()"

2020-11-27 Thread Stephan Ewen (Jira)
Stephan Ewen created FLINK-20396:


 Summary: Replace "OperatorCoordinator.subtaskFailed()" with 
"subtaskRestored()"
 Key: FLINK-20396
 URL: https://issues.apache.org/jira/browse/FLINK-20396
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.11.2
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 1.12.0, 1.11.3


There are no strong order guarantees between 
{{OperatorCoordinator.subtaskFailed()}} and 
{{OperatorCoordinator.notifyCheckpointComplete()}}.

It can happen that a checkpoint completes after the notification for task 
failure is sent:
  - {{OperatorCoordinator.checkpoint()}}
  - {{OperatorCoordinator.subtaskFailed()}}
  - {{OperatorCoordinator.checkpointComplete()}}

The subtask failure here does not know whether the previous checkpoint 
completed or not. It cannot decide what state the subtask will be in after 
recovery.
There is no easy fix right now to strictly guarantee the order of the method 
calls, so alternatively we need to provide the necessary information to reason 
about the status of tasks.

We should replace {{OperatorCoordinator.subtaskFailed(int subtask)}} with 
{{OperatorCoordinator.subtaskRestored(int subtask, long checkpoint)}}. That 
way, implementations get the explicit checkpoint ID for the subtask recovery 
and can align it with the IDs of checkpoints that were taken.

It is still (in rare cases) possible that for a specific checkpoint C, 
{{OperatorCoordinator.subtaskRestored(subtaskIndex, C)}} comes before 
{{OperatorCoordinator.checkpointComplete(C)}}.


h3. Background

The Checkpointing Procedure is partially asynchronous on the {{JobManager}} / 
{{CheckpointCoordinator}}: After all subtasks acknowledged the checkpoint, the 
finalization (writing out metadata and registering the checkpoint in ZooKeeper) 
happens in an I/O thread, and the checkpoint completes after that.

This sequence of events can happen:
  - tasks ack the checkpoint
  - checkpoint fully acknowledged, finalization starts
  - task fails
  - task failure notification is dispatched
  - checkpoint completes.

For task failures and checkpoint completion, no order is defined.

However, for task restore and checkpoint completion, the order is well defined: 
When a task is restored, pending checkpoints are either canceled or complete. 
None can be within finalization. That is currently guaranteed with a lock in 
the {{CheckpointCoordinator}}.
(An implication of that being that restores can be blocking operations in the 
scheduler, which is not ideal from the perspective of making the scheduler 
async/non-blocking, but it is currently essential for correctness).
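To illustrate why explicit checkpoint IDs remove the ordering problem, here is a hypothetical coordinator sketch (not the real OperatorCoordinator API): each taken checkpoint is stored under its ID, so a restore notification can look its snapshot up explicitly instead of inferring it from the order of method calls:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical coordinator sketch (not the real OperatorCoordinator API):
// each taken checkpoint is stored under its ID, so a restore notification
// can look its snapshot up explicitly instead of relying on call order.
class CoordinatorSketch {
    private final Map<Long, byte[]> takenCheckpoints = new HashMap<>();

    void checkpoint(long checkpointId, byte[] state) {
        takenCheckpoints.put(checkpointId, state);
    }

    /** Returns the snapshot the subtask restored to, or null if unknown. */
    byte[] subtaskRestored(int subtask, long checkpointId) {
        // No ordering assumption needed: the ID says exactly which snapshot
        // the subtask came back from, even if checkpointComplete() for that
        // ID has not been delivered yet.
        return takenCheckpoints.get(checkpointId);
    }
}
```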






Re: [DISCUSS] Stop adding new bash-based e2e tests to Flink

2020-11-27 Thread Matthias Pohl
Thanks Robert for pushing this. +1 for creating java-based e2e tests.

In the engine team, we decided to work towards the goal of migrating the
bash-based e2e tests to Java/Docker. We plan to migrate the existing
bash-based e2e tests located in the Engine team's component space
step-by-step. I created an umbrella Jira issue [1] to collect and document
the migration efforts. Feel free to do the same by creating subtasks under
[1].

Best,
Matthias

[1] https://issues.apache.org/jira/browse/FLINK-20392

On Thu, Nov 19, 2020 at 8:02 AM Yun Tang  wrote:

> +1 for java-based E2E tests, as bash scripts lack the power to handle
> more complicated cases.
>
> For the docker image improvement, I think we should be more cautious, as
> developers in China might suffer from network issues; at the least we
> should find some guides to speed up the image downloading.
>
> Best
> Yun Tang
> 
> From: Xingbo Huang 
> Sent: Thursday, November 19, 2020 12:09
> To: dev 
> Subject: Re: [DISCUSS] Stop adding new bash-based e2e tests to Flink
>
> Big +1 for java-based e2e tests. Currently, PyFlink related tests each take
> ~15 minutes in the bash e2e tests because we are using a secured YARN
> cluster, which is the only convenient way of starting a YARN cluster in the
> bash e2e tests. I think if we migrate these tests to the java-based testing
> framework, we will start a YARN cluster more conveniently, which will
> greatly reduce our testing time.
>
> Best,
> Xingbo
>
> On Thu, Nov 19, 2020 at 10:47 AM, Rui Li wrote:
>
> > Big +1 to java-based e2e tests. It'll be much easier to write/debug these
> > tests.
> >
> > On Wed, Nov 18, 2020 at 9:44 PM Leonard Xu  wrote:
> >
> > > +1 to stop using bash scripts.
> > > I have also experienced that bash scripts are really hard to
> > > maintain and debug. Thanks @Robert for the great work again.
> > >
> > > I think testcontainers is a nice candidate.
> > >
> > > Best,
> > > Leonard
> > >
> > > > On Nov 18, 2020, at 19:46, Aljoscha Krettek wrote:
> > > >
> > > > +1
> > > >
> > > > And I want to second Arvid's mention of testcontainers [1].
> > > >
> > > > [1] https://www.testcontainers.org/
> > > >
> > > > On 18.11.20 10:43, Yang Wang wrote:
> > > >> Thanks Till and Jark for sharing the information.
> > > >> I am also +1 for this proposal and glad to wire the newly introduced
> > > >> K8s HA e2e tests to the java-based framework.
> > > >> Best,
> > > >> Yang
> > > >> On Wed, Nov 18, 2020 at 5:23 PM, Jark Wu wrote:
> > > >>> +1 to use the Java-based testing framework and +1 for using docker
> > > images
> > > >>> in the future.
> > > >>> IIUC, the Java-based testing framework refers to the
> > > >>> `flink-end-to-end-tests-common` module.
> > > >>> The java-based framework helped us a lot when debugging the
> unstable
> > > e2e
> > > >>> tests.
> > > >>>
> > > >>> Best,
> > > >>> Jark
> > > >>>
> > > >>> On Wed, 18 Nov 2020 at 14:42, Yang Wang 
> > wrote:
> > > >>>
> > >  Thanks for starting this discussion.
> > > 
> > >  In general, I agree with you that a java-based testing framework
> is
> > > >>> better
> > >  than the bash-based. It will
> > >  help a lot for the commons and utilities.
> > > 
> > >  Since I am trying to add a new bash-based Kubernetes HA test, I
> have
> > > some
> > >  quick questions.
> > >  * I am not sure where the java-based framework is. Do you mean
> > >  "flink-jepsen" module or sth else?
> > >  * Maybe it will be harder to run a cli command(e.g. flink run /
> > >  run-application) to submit a Flink job in the java-based
> framework.
> > >  * It will be harder to inject some operations. For example, kill
> the
> > >  JobManager in Kubernetes. Currently, I
> > >  am trying to use "kubectl exec" to do this.
> > > 
> > > 
> > >  Best,
> > >  Yang
> > > 
> > >  On Tue, Nov 17, 2020 at 11:36 PM, Robert Metzger wrote:
> > > 
> > > > Hi all,
> > > >
> > > > Since we are currently testing the 1.12 release, and potentially
> > > adding
> > > > more automated e2e tests, I would like to bring up our end to end
> > > tests
> > >  for
> > > > discussion.
> > > >
> > > > Some time ago, we introduced a Java-based testing framework, with
> > the
> > >  idea
> > > > of replacing the current bash-based end to end tests.
> > > > Since the introduction of the java-based framework, more bash
> tests
> > > >>> were
> > > > actually added, making a future migration even harder.
> > > >
> > > > *For that reason, I would like to propose that we stop adding any
> > > > new bash end-to-end tests to Flink. All new end-to-end tests must be
> > > > written in Java and rely on the existing testing framework.*
> > > >
> > > > For the 1.13 release, I'm trying to find some time to revisit
> > > potential
> > > > improvements for the existing java e2e framework (such as using
> > > Docker
> > > > images 

[jira] [Created] (FLINK-20395) Migrate test_netty_shuffle_memory_control.sh

2020-11-27 Thread Matthias (Jira)
Matthias created FLINK-20395:


 Summary: Migrate test_netty_shuffle_memory_control.sh
 Key: FLINK-20395
 URL: https://issues.apache.org/jira/browse/FLINK-20395
 Project: Flink
  Issue Type: Sub-task
Reporter: Matthias






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-20394) Migrate test_mesos_multiple_submissions.sh

2020-11-27 Thread Matthias (Jira)
Matthias created FLINK-20394:


 Summary: Migrate test_mesos_multiple_submissions.sh
 Key: FLINK-20394
 URL: https://issues.apache.org/jira/browse/FLINK-20394
 Project: Flink
  Issue Type: Sub-task
Reporter: Matthias








[jira] [Created] (FLINK-20393) Migrate test_mesos_wordcount.sh

2020-11-27 Thread Matthias (Jira)
Matthias created FLINK-20393:


 Summary: Migrate test_mesos_wordcount.sh
 Key: FLINK-20393
 URL: https://issues.apache.org/jira/browse/FLINK-20393
 Project: Flink
  Issue Type: Sub-task
Reporter: Matthias








[jira] [Created] (FLINK-20392) Migrating bash e2e tests to Java/Docker

2020-11-27 Thread Matthias (Jira)
Matthias created FLINK-20392:


 Summary: Migrating bash e2e tests to Java/Docker
 Key: FLINK-20392
 URL: https://issues.apache.org/jira/browse/FLINK-20392
 Project: Flink
  Issue Type: Test
  Components: Test Infrastructure
Reporter: Matthias


This Jira issue serves as an umbrella ticket for single e2e test migration 
tasks. This should enable us to migrate all bash-based e2e tests step-by-step.

The goal is to utilize the e2e test framework (see 
[flink-end-to-end-tests-common|https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-end-to-end-tests-common]).
Ideally, the tests should use Docker containers as much as possible to disconnect 
the execution from the environment. A good tool to achieve that is 
[testcontainers.org|https://www.testcontainers.org/].





[jira] [Created] (FLINK-20391) Set FORWARD_EDGES_PIPELINED for BATCH ExecutionMode

2020-11-27 Thread Dawid Wysakowicz (Jira)
Dawid Wysakowicz created FLINK-20391:


 Summary: Set FORWARD_EDGES_PIPELINED for BATCH ExecutionMode
 Key: FLINK-20391
 URL: https://issues.apache.org/jira/browse/FLINK-20391
 Project: Flink
  Issue Type: Improvement
  Components: API / DataStream
Affects Versions: 1.12.0
Reporter: Dawid Wysakowicz
Assignee: Dawid Wysakowicz
 Fix For: 1.12.0


It would be better to treat the {{rescale}} operation similarly to {{keyBy}} or 
{{rebalance}} and make it a possible pipeline border.





[jira] [Created] (FLINK-20390) Programmatic access to the back-pressure

2020-11-27 Thread Jira
Gaël Renoux created FLINK-20390:
---

 Summary: Programmatic access to the back-pressure
 Key: FLINK-20390
 URL: https://issues.apache.org/jira/browse/FLINK-20390
 Project: Flink
  Issue Type: New Feature
  Components: API / Core
Reporter: Gaël Renoux


It would be useful to access the back-pressure monitoring from within functions.

Here is our use case: we have a real-time Flink job which makes decisions 
based on input data. Sometimes we have traffic spikes on the input, and the 
decision process cannot process records fast enough. Back-pressure starts 
mounting, all the way back to the source. In this case we want to start 
dropping records, because it's better to make decisions based on just a 
sample of the data than to accumulate too much lag.

Right now, the only way is to have a filter with a hard limit on the number of 
records per interval of time, and to drop records once we are over this limit. 
However, this requires a lot of tuning to find out what the correct limit is, 
especially since it may depend on the nature of the inputs (some decisions take 
longer to make than others). It's also heavily dependent on the buffers: the 
limit needs to be low enough that all records that pass the filter can fit in 
the downstream buffers, or the back-pressure will go back past the filtering 
task and we're back to square one. Finally, it's not very resilient to change: 
whenever we scale the infrastructure up, we need to redo the whole tuning.
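The hard-limit workaround described above could be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual job code: the class name and the fixed-window counting are hypothetical, and in a real Flink job this logic would live inside a {{FilterFunction}}.

```java
/**
 * Hypothetical sketch of the hard-limit workaround: admit at most
 * maxPerWindow records per fixed time window, drop the rest.
 */
class HardLimitFilter {
    private final long maxPerWindow;  // records allowed per window
    private final long windowMillis;  // window length in milliseconds
    private long windowStart;         // start of the current window
    private long countInWindow;       // records seen in the current window

    HardLimitFilter(long maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Returns true if the record should be kept, false if dropped. */
    boolean accept(long nowMillis) {
        if (nowMillis - windowStart >= windowMillis) {
            // The window elapsed: start a new one and reset the counter.
            windowStart = nowMillis;
            countInWindow = 0;
        }
        return ++countInWindow <= maxPerWindow;
    }
}
```

The ticket's point is visible in the sketch: {{maxPerWindow}} is exactly the magic number that has to be re-tuned whenever the input mix or the infrastructure changes.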

With programmatic access to the back-pressure, we could simply start dropping 
records based on its current level. No tuning, and adjusted to the actual 
issue. For performance, I assume it would be better if it reused the existing 
back-pressure monitoring mechanism, rather than looking directly into the 
buffer. A sampling of the back-pressure should be enough, and if more precision 
is needed you can simply change the existing back-pressure configuration.
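What the proposed feature could look like from a user's perspective, as a minimal sketch: a filter that consults a sampled back-pressure level in [0.0, 1.0] and drops records probabilistically once the level exceeds a threshold. The {{DoubleSupplier}} stands in for the runtime accessor this ticket asks for; no such API exists in Flink today, and all names here are illustrative.

```java
import java.util.Random;
import java.util.function.DoubleSupplier;

/**
 * Hypothetical sketch: drop records with probability equal to the sampled
 * back-pressure level (0.0 = idle, 1.0 = fully back-pressured) once that
 * level exceeds a threshold.
 */
class BackPressureDroppingFilter {
    private final DoubleSupplier backPressure; // stand-in for the requested API
    private final double threshold;            // start dropping above this level
    private final Random random;

    BackPressureDroppingFilter(DoubleSupplier backPressure, double threshold, long seed) {
        this.backPressure = backPressure;
        this.threshold = threshold;
        this.random = new Random(seed);
    }

    /** Returns true if the record should be kept, false if dropped. */
    boolean accept() {
        double level = backPressure.getAsDouble();
        if (level <= threshold) {
            return true; // no meaningful back-pressure: keep everything
        }
        return random.nextDouble() >= level; // drop with probability = level
    }
}
```

No record-rate limit appears anywhere: the drop rate adapts to the observed back-pressure, which is exactly the "no tuning" property the ticket is after.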


