[jira] [Commented] (FLINK-8915) CheckpointingStatisticsHandler fails to return PendingCheckpointStats

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403241#comment-16403241
 ] 

ASF GitHub Bot commented on FLINK-8915:
---

Github user yanghua commented on the issue:

https://github.com/apache/flink/pull/5703
  
cc @GJL: thanks for your suggestion, please review.


> CheckpointingStatisticsHandler fails to return PendingCheckpointStats 
> --
>
> Key: FLINK-8915
> URL: https://issues.apache.org/jira/browse/FLINK-8915
> Project: Flink
>  Issue Type: Bug
>  Components: REST
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Gary Yao
>Assignee: vinoyang
>Priority: Blocker
>  Labels: flip6
> Fix For: 1.5.0
>
>
> {noformat}
> 2018-03-10 21:47:52,487 ERROR 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler
>   - Implementation error: Unhandled exception.
> java.lang.IllegalArgumentException: Given checkpoint stats object of type 
> org.apache.flink.runtime.checkpoint.PendingCheckpointStats cannot be 
> converted.
>   at 
> org.apache.flink.runtime.rest.messages.checkpoints.CheckpointStatistics.generateCheckpointStatistics(CheckpointStatistics.java:276)
>   at 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleRequest(CheckpointingStatisticsHandler.java:146)
>   at 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleRequest(CheckpointingStatisticsHandler.java:54)
>   at 
> org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$0(AbstractExecutionGraphHandler.java:81)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>   at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>   at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}
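
The failure above is a missing branch in an instanceof dispatch: the converter recognizes completed and failed checkpoint stats but has no case for pending ones. A minimal, self-contained Java sketch of the pattern and the missing branch; all class names below are local stand-ins, not the actual Flink types:

{code}
// Stand-in hierarchy; the real types live in org.apache.flink.runtime.checkpoint.
abstract class CheckpointStats {}
class CompletedCheckpointStats extends CheckpointStats {}
class FailedCheckpointStats extends CheckpointStats {}
class PendingCheckpointStats extends CheckpointStats {}

public class StatsConverter {
    static String convert(CheckpointStats stats) {
        if (stats instanceof CompletedCheckpointStats) {
            return "completed";
        } else if (stats instanceof FailedCheckpointStats) {
            return "failed";
        } else if (stats instanceof PendingCheckpointStats) {
            // the branch that was missing: in-progress checkpoints must map too
            return "in_progress";
        }
        throw new IllegalArgumentException("Given checkpoint stats object of type "
                + stats.getClass().getName() + " cannot be converted.");
    }

    public static void main(String[] args) {
        // Without the pending branch, this call reproduces the exception above.
        System.out.println(convert(new PendingCheckpointStats()));
    }
}
{code}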



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5703: [FLINK-8915] CheckpointingStatisticsHandler fails to retu...

2018-03-16 Thread yanghua
Github user yanghua commented on the issue:

https://github.com/apache/flink/pull/5703
  
cc @GJL: thanks for your suggestion, please review.


---


[jira] [Commented] (FLINK-9011) YarnResourceManager spamming log file at INFO level

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403233#comment-16403233
 ] 

ASF GitHub Bot commented on FLINK-9011:
---

Github user yanghua commented on the issue:

https://github.com/apache/flink/pull/5712
  
Hi @GJL, please look at this if you have time, thanks!


> YarnResourceManager spamming log file at INFO level
> ---
>
> Key: FLINK-9011
> URL: https://issues.apache.org/jira/browse/FLINK-9011
> Project: Flink
>  Issue Type: Bug
>  Components: ResourceManager, YARN
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Nico Kruber
>Assignee: vinoyang
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0, 1.6.0
>
>
> For every requested resource, the {{YarnResourceManager}} spams the log at 
> INFO level with the following messages:
> {code}
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000102 - Remaining pending container 
> requests: 301
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000102 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
>  to 
> hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
> 2018-03-16 03:41:20,190 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Prepared local resource for modified yaml: resource { scheme: 
> "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
> "/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml"
>  } size: 595 timestamp: 1521171680190 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Starting TaskManagers with command: $JAVA_HOME/bin/java 
> -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m  
> -Dlog.file=<LOG_DIR>/taskmanager.log 
> -Dlogback.configurationFile=file:./logback.xml 
> -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> 
> <LOG_DIR>/taskmanager.out 2> <LOG_DIR>/taskmanager.err
> {code}
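
At a high level the linked PR demotes these per-container messages from INFO to DEBUG. A hedged slf4j sketch of that kind of change, an illustrative stand-in rather than the actual diff:

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ContainerLogging {
    private static final Logger LOG = LoggerFactory.getLogger(ContainerLogging.class);

    // Fires once per allocated container, so INFO floods the log at scale.
    void onContainerAllocated(String containerId, int pendingRequests) {
        // was LOG.info(...); DEBUG keeps the detail available without the spam
        LOG.debug("Received new container: {} - Remaining pending container requests: {}",
                containerId, pendingRequests);
    }
}
{code}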



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5712: [FLINK-9011] YarnResourceManager spamming log file at INF...

2018-03-16 Thread yanghua
Github user yanghua commented on the issue:

https://github.com/apache/flink/pull/5712
  
Hi @GJL, please look at this if you have time, thanks!


---


[jira] [Commented] (FLINK-9011) YarnResourceManager spamming log file at INFO level

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403232#comment-16403232
 ] 

ASF GitHub Bot commented on FLINK-9011:
---

GitHub user yanghua opened a pull request:

https://github.com/apache/flink/pull/5712

[FLINK-9011] YarnResourceManager spamming log file at INFO level

## What is the purpose of the change

*This pull request changes the level of some log messages*


## Brief change log

  - *changed the level of some log messages in `YarnResourceManager` and `Utils`*

## Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (yes / **no**)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
  - The serializers: (yes / **no** / don't know)
  - The runtime per-record code paths (performance sensitive): (yes / 
**no** / don't know)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
  - The S3 file system connector: (yes / **no** / don't know)

## Documentation

  - Does this pull request introduce a new feature? (yes / **no**)
  - If yes, how is the feature documented? (not applicable / docs / 
JavaDocs / **not documented**)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanghua/flink FLINK-9011

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5712.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5712


commit c2c45ba436842505ac602c3f684f058af18d00c7
Author: yanghua 
Date:   2018-03-17T02:54:41Z

[FLINK-9011] YarnResourceManager spamming log file at INFO level




> YarnResourceManager spamming log file at INFO level
> ---
>
> Key: FLINK-9011
> URL: https://issues.apache.org/jira/browse/FLINK-9011
> Project: Flink
>  Issue Type: Bug
>  Components: ResourceManager, YARN
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Nico Kruber
>Assignee: vinoyang
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0, 1.6.0
>
>
> For every requested resource, the {{YarnResourceManager}} spams the log at 
> INFO level with the following messages:
> {code}
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000102 - Remaining pending container 
> requests: 301
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000102 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
>  to 
> hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
> 2018-03-16 03:41:20,190 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Prepared local resource for modified yaml: resource { scheme: 
> "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
> "/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml"
>  } size: 595 timestamp: 1521171680190 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Starting TaskManagers with command: $JAVA_HOME/bin/java 
> -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m  
> -Dlog.file=<LOG_DIR>/taskmanager.log 
> -Dlogback.configurationFile=file:./logback.xml 
> -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> 
> <LOG_DIR>/taskmanager.out 2> <LOG_DIR>/taskmanager.err
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[GitHub] flink pull request #5712: [FLINK-9011] YarnResourceManager spamming log file...

2018-03-16 Thread yanghua
GitHub user yanghua opened a pull request:

https://github.com/apache/flink/pull/5712

[FLINK-9011] YarnResourceManager spamming log file at INFO level

## What is the purpose of the change

*This pull request changes the level of some log messages*


## Brief change log

  - *changed the level of some log messages in `YarnResourceManager` and `Utils`*

## Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (yes / **no**)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
  - The serializers: (yes / **no** / don't know)
  - The runtime per-record code paths (performance sensitive): (yes / 
**no** / don't know)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
  - The S3 file system connector: (yes / **no** / don't know)

## Documentation

  - Does this pull request introduce a new feature? (yes / **no**)
  - If yes, how is the feature documented? (not applicable / docs / 
JavaDocs / **not documented**)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanghua/flink FLINK-9011

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5712.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5712


commit c2c45ba436842505ac602c3f684f058af18d00c7
Author: yanghua 
Date:   2018-03-17T02:54:41Z

[FLINK-9011] YarnResourceManager spamming log file at INFO level




---


[jira] [Commented] (FLINK-8970) Add more automated end-to-end tests

2018-03-16 Thread mingleizhang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403210#comment-16403210
 ] 

mingleizhang commented on FLINK-8970:
-

Very nice to me.

> Add more automated end-to-end tests
> ---
>
> Key: FLINK-8970
> URL: https://issues.apache.org/jira/browse/FLINK-8970
> Project: Flink
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Till Rohrmann
>Priority: Critical
>
> In order to improve Flink's test coverage and make releasing easier, we 
> should add more automated end-to-end tests which test Flink more like a user 
> would interact with the system. Additionally, these end-to-end tests should 
> test the integration of various other systems with Flink.
> With FLINK-6539, we added a new module {{flink-end-to-end-tests}} which 
> contains the set of currently available end-to-end tests.
> With FLINK-8911, a script was added to trigger these tests.
>  
> This issue is an umbrella issue collecting all different end-to-end tests 
> which we want to add to the Flink repository.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-9011) YarnResourceManager spamming log file at INFO level

2018-03-16 Thread vinoyang (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang reassigned FLINK-9011:
---

Assignee: vinoyang

> YarnResourceManager spamming log file at INFO level
> ---
>
> Key: FLINK-9011
> URL: https://issues.apache.org/jira/browse/FLINK-9011
> Project: Flink
>  Issue Type: Bug
>  Components: ResourceManager, YARN
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Nico Kruber
>Assignee: vinoyang
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0, 1.6.0
>
>
> For every requested resource, the {{YarnResourceManager}} spams the log at 
> INFO level with the following messages:
> {code}
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000102 - Remaining pending container 
> requests: 301
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000102 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
>  to 
> hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
> 2018-03-16 03:41:20,190 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Prepared local resource for modified yaml: resource { scheme: 
> "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
> "/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml"
>  } size: 595 timestamp: 1521171680190 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Starting TaskManagers with command: $JAVA_HOME/bin/java 
> -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m  
> -Dlog.file=<LOG_DIR>/taskmanager.log 
> -Dlogback.configurationFile=file:./logback.xml 
> -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> 
> <LOG_DIR>/taskmanager.out 2> <LOG_DIR>/taskmanager.err
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-6105) Properly handle InterruptedException in HadoopInputFormatBase

2018-03-16 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307281#comment-16307281
 ] 

Ted Yu edited comment on FLINK-6105 at 3/17/18 1:54 AM:


In 
flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/RollingSink.java
 :

{code}
try {
  Thread.sleep(500);
} catch (InterruptedException e1) {
  // ignore it
}
{code}
The interrupt status should be restored, or an InterruptedIOException should be thrown.
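
A minimal sketch of the two remedies suggested above, illustrative rather than the actual Flink patch:

{code}
import java.io.InterruptedIOException;

class Backoff {
    static void sleepInterruptibly() throws InterruptedIOException {
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            // restore the interrupt status so callers can observe it ...
            Thread.currentThread().interrupt();
            // ... or, in I/O code, surface it as an InterruptedIOException
            throw new InterruptedIOException("interrupted while waiting");
        }
    }
}
{code}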


was (Author: yuzhih...@gmail.com):
In 
flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/RollingSink.java
 :
{code}
try {
  Thread.sleep(500);
} catch (InterruptedException e1) {
  // ignore it
}
{code}
The interrupt status should be restored, or an InterruptedIOException should be thrown.

> Properly handle InterruptedException in HadoopInputFormatBase
> -
>
> Key: FLINK-6105
> URL: https://issues.apache.org/jira/browse/FLINK-6105
> Project: Flink
>  Issue Type: Bug
>  Components: DataStream API
>Reporter: Ted Yu
>Assignee: mingleizhang
>Priority: Major
>
> When catching InterruptedException, we should throw InterruptedIOException 
> instead of IOException.
> The following example is from HadoopInputFormatBase :
> {code}
> try {
>   splits = this.mapreduceInputFormat.getSplits(jobContext);
> } catch (InterruptedException e) {
>   throw new IOException("Could not get Splits.", e);
> }
> {code}
> There may be other places where IOE is thrown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8843) Decouple bind REST address from advertised address

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403198#comment-16403198
 ] 

ASF GitHub Bot commented on FLINK-8843:
---

Github user yanghua closed the pull request at:

https://github.com/apache/flink/pull/5632


> Decouple bind REST address from advertised address
> --
>
> Key: FLINK-8843
> URL: https://issues.apache.org/jira/browse/FLINK-8843
> Project: Flink
>  Issue Type: Improvement
>  Components: REST
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Gary Yao
>Priority: Critical
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> The {{RestServerEndpoint}} is currently bound to the same address which is 
> also advertised to the client, namely {{RestOptions#REST_ADDRESS}}. It would 
> be better to start the {{RestServerEndpoint}} listening on all addresses by 
> binding to {{0.0.0.0}}.
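
To illustrate the distinction the issue draws, a self-contained sketch: the bind address controls which interfaces accept connections, while the advertised address is only what clients are told (the hostname below is hypothetical):

{code}
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class BindVsAdvertise {
    public static void main(String[] args) throws Exception {
        String bindAddress = "0.0.0.0";                      // listen on all interfaces
        String advertisedAddress = "jobmanager.example.com"; // hypothetical; sent to clients
        try (ServerSocket server = new ServerSocket()) {
            server.bind(new InetSocketAddress(bindAddress, 8081));
            System.out.println("listening on " + server.getLocalSocketAddress()
                    + ", advertising " + advertisedAddress + ":8081");
        }
    }
}
{code}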



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink pull request #5632: [FLINK-8843][REST] Decouple bind REST address from...

2018-03-16 Thread yanghua
Github user yanghua closed the pull request at:

https://github.com/apache/flink/pull/5632


---


[jira] [Updated] (FLINK-7917) The return of taskInformationOrBlobKey should be placed inside synchronized in ExecutionJobVertex

2018-03-16 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated FLINK-7917:
--
Description: 
Currently in ExecutionJobVertex#getTaskInformationOrBlobKey:
{code}
}

return taskInformationOrBlobKey;
{code}
The return should be placed inside the synchronized block.

  was:
Currently in ExecutionJobVertex#getTaskInformationOrBlobKey:

{code}
}

return taskInformationOrBlobKey;
{code}
The return should be placed inside the synchronized block.


> The return of taskInformationOrBlobKey should be placed inside synchronized 
> in ExecutionJobVertex
> -
>
> Key: FLINK-7917
> URL: https://issues.apache.org/jira/browse/FLINK-7917
> Project: Flink
>  Issue Type: Bug
>  Components: Local Runtime
>Reporter: Ted Yu
>Priority: Minor
>
> Currently in ExecutionJobVertex#getTaskInformationOrBlobKey:
> {code}
> }
> return taskInformationOrBlobKey;
> {code}
> The return should be placed inside the synchronized block.
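
The concern is a data race: the field is written under the lock, but the return statement re-reads it after the synchronized block has been left. A hedged sketch of the suggested shape; the class and initializer below are stand-ins for the ExecutionJobVertex internals:

{code}
class LazyHolder {
    private final Object lock = new Object();
    private Object taskInformationOrBlobKey; // lazily created under the lock

    Object getTaskInformationOrBlobKey() {
        synchronized (lock) {
            if (taskInformationOrBlobKey == null) {
                taskInformationOrBlobKey = create(); // hypothetical initializer
            }
            // returning inside the block keeps the read under the same lock
            return taskInformationOrBlobKey;
        }
    }

    private Object create() {
        return new Object();
    }
}
{code}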



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5700: [FLINK-8833] [sql-client] Create a SQL Client JSON format...

2018-03-16 Thread fhueske
Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/5700
  
Since we are not shading anything we could also use the Maven JAR plugin:

```

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <version>3.0.2</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>jar</goal>
            </goals>
            <configuration>
                <classifier>sql-jar</classifier>
            </configuration>
        </execution>
    </executions>
</plugin>
```


---


[jira] [Commented] (FLINK-8833) Create a SQL Client JSON format fat-jar

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403171#comment-16403171
 ] 

ASF GitHub Bot commented on FLINK-8833:
---

Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/5700
  
Since we are not shading anything we could also use the Maven JAR plugin:

```

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <version>3.0.2</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>jar</goal>
            </goals>
            <configuration>
                <classifier>sql-jar</classifier>
            </configuration>
        </execution>
    </executions>
</plugin>
```


> Create a SQL Client JSON format fat-jar
> ---
>
> Key: FLINK-8833
> URL: https://issues.apache.org/jira/browse/FLINK-8833
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Timo Walther
>Assignee: Timo Walther
>Priority: Major
>
> Create a fat-jar for flink-json.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8852) SQL Client does not work with new FLIP-6 mode

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403168#comment-16403168
 ] 

ASF GitHub Bot commented on FLINK-8852:
---

Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/5704
  
Since we are not shading anything we could also use the Maven JAR plugin:

```

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <version>3.0.2</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>jar</goal>
            </goals>
            <configuration>
                <classifier>sql-jar</classifier>
            </configuration>
        </execution>
    </executions>
</plugin>
```


> SQL Client does not work with new FLIP-6 mode
> -
>
> Key: FLINK-8852
> URL: https://issues.apache.org/jira/browse/FLINK-8852
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Affects Versions: 1.5.0
>Reporter: Fabian Hueske
>Assignee: Timo Walther
>Priority: Blocker
> Fix For: 1.5.0
>
>
> The SQL client does not submit queries to a local Flink cluster that runs in 
> FLIP-6 mode. It doesn't throw an exception either.
> Job submission works if the legacy Flink cluster mode is used (`mode: old`).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5704: [FLINK-8852] [sql-client] Add FLIP-6 support to SQL Clien...

2018-03-16 Thread fhueske
Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/5704
  
Since we are not shading anything we could also use the Maven JAR plugin:

```

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <version>3.0.2</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>jar</goal>
            </goals>
            <configuration>
                <classifier>sql-jar</classifier>
            </configuration>
        </execution>
    </executions>
</plugin>
```


---


[jira] [Commented] (FLINK-8903) Built-in agg functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP are broken in Group Windows

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403095#comment-16403095
 ] 

ASF GitHub Bot commented on FLINK-8903:
---

Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/5706
  
updated PR


> Built-in agg functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP are broken in 
> Group Windows
> ---
>
> Key: FLINK-8903
> URL: https://issues.apache.org/jira/browse/FLINK-8903
> Project: Flink
>  Issue Type: Bug
>  Components: Table API & SQL
>Affects Versions: 1.3.2, 1.5.0, 1.4.2
>Reporter: lilizhao
>Assignee: Fabian Hueske
>Priority: Critical
> Fix For: 1.5.0
>
> Attachments: QQ图片20180312180143.jpg, TableAndSQLTest.java
>
>
> The built-in aggregation functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP 
> are translated into regular AVG functions if they are applied in the context 
> of a Group Window aggregation ({{GROUP BY TUMBLE/HOP/SESSION}}).
> The reason is that these functions are internally represented as 
> {{SqlAvgAggFunction}} but with different {{SqlKind}}. When translating 
> Calcite aggregation functions to Flink Table agg functions, we only look at 
> the type of the class, not at the value of the {{kind}} field. We did not 
> notice that before, because in all other cases (regular {{GROUP BY}} without 
> windows or {{OVER}} windows), we have a translation rule 
> {{AggregateReduceFunctionsRule}} that decomposes the more complex functions 
> into expressions of {{COUNT}} and {{SUM}} functions such that we never 
> execute an {{AVG}} Flink function. That rule can only be applied on 
> {{LogicalAggregate}}, however, we represent group windows as 
> {{LogicalWindowAggregate}}, so the rule does not match.
> We should fix this by:
> 1. restrict the translation to Flink avg functions in {{AggregateUtil}} to 
> {{SqlKind.AVG}}. 
> 2. implement a rule (hopefully based on {{AggregateReduceFunctionsRule}}) 
> that decomposes the complex agg functions into the {{SUM}} and {{COUNT}}.
> Step 1. is easy and a quick fix but we would get an exception "Unsupported 
> Function" if {{VAR_POP}} is used in a {{GROUP BY}} window.
> Step 2. might be more involved, depending on how difficult it is to port the 
> rule.
>  
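
For step 2, the decomposition rests on standard identities: VAR_POP(x) = (SUM(x*x) - SUM(x)^2 / COUNT(x)) / COUNT(x), and VAR_SAMP divides by COUNT(x) - 1 instead. A small self-contained Java check of those formulas (plain arithmetic, not the output of the Calcite rule):

{code}
public class VarianceBySumAndCount {
    // VAR_POP(x) = (SUM(x*x) - SUM(x)^2 / COUNT(x)) / COUNT(x)
    static double varPop(double[] xs) {
        double sum = 0, sumSq = 0;
        for (double x : xs) { sum += x; sumSq += x * x; }
        long n = xs.length;
        return (sumSq - sum * sum / n) / n;
    }

    // VAR_SAMP(x) uses COUNT(x) - 1 in the final division
    static double varSamp(double[] xs) {
        double sum = 0, sumSq = 0;
        for (double x : xs) { sum += x; sumSq += x * x; }
        long n = xs.length;
        return (sumSq - sum * sum / n) / (n - 1);
    }

    public static void main(String[] args) {
        double[] xs = {1, 2, 3, 4};
        System.out.println(varPop(xs));  // 1.25
        System.out.println(varSamp(xs)); // 1.666...
        // STDDEV_POP / STDDEV_SAMP are the square roots of the above
    }
}
{code}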



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5706: [FLINK-8903] [table] Fix VAR_SAMP, VAR_POP, STDDEV_SAMP, ...

2018-03-16 Thread fhueske
Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/5706
  
updated PR


---


[jira] [Commented] (FLINK-8903) Built-in agg functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP are broken in Group Windows

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402691#comment-16402691
 ] 

ASF GitHub Bot commented on FLINK-8903:
---

Github user fhueske commented on a diff in the pull request:

https://github.com/apache/flink/pull/5706#discussion_r175230116
  
--- Diff: 
flink-libraries/flink-table/src/main/scala/org/apache/flink/table/plan/nodes/logical/FlinkLogicalWindowAggregate.scala
 ---
@@ -103,6 +106,19 @@ class FlinkLogicalWindowAggregateConverter
 FlinkConventions.LOGICAL,
 "FlinkLogicalWindowAggregateConverter") {
 
+  override def matches(call: RelOptRuleCall): Boolean = {
+    val agg = call.rel(0).asInstanceOf[LogicalWindowAggregate]
+
+    // we do not support these functions natively
+    // they have to be converted using the WindowAggregateReduceFunctionsRule
+    val supported = agg.getAggCallList.asScala.map(_.getAggregation.getKind).forall {
+      case SqlKind.STDDEV_POP | SqlKind.STDDEV_SAMP | SqlKind.VAR_POP | SqlKind.VAR_SAMP => false
--- End diff --

sounds good to me. Will update the PR


> Built-in agg functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP are broken in 
> Group Windows
> ---
>
> Key: FLINK-8903
> URL: https://issues.apache.org/jira/browse/FLINK-8903
> Project: Flink
>  Issue Type: Bug
>  Components: Table API & SQL
>Affects Versions: 1.3.2, 1.5.0, 1.4.2
>Reporter: lilizhao
>Assignee: Fabian Hueske
>Priority: Critical
> Fix For: 1.5.0
>
> Attachments: QQ图片20180312180143.jpg, TableAndSQLTest.java
>
>
> The built-in aggregation functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP 
> are translated into regular AVG functions if they are applied in the context 
> of a Group Window aggregation ({{GROUP BY TUMBLE/HOP/SESSION}}).
> The reason is that these functions are internally represented as 
> {{SqlAvgAggFunction}} but with different {{SqlKind}}. When translating 
> Calcite aggregation functions to Flink Table agg functions, we only look at 
> the type of the class, not at the value of the {{kind}} field. We did not 
> notice that before, because in all other cases (regular {{GROUP BY}} without 
> windows or {{OVER}} windows), we have a translation rule 
> {{AggregateReduceFunctionsRule}} that decomposes the more complex functions 
> into expressions of {{COUNT}} and {{SUM}} functions such that we never 
> execute an {{AVG}} Flink function. That rule can only be applied on 
> {{LogicalAggregate}}, however, we represent group windows as 
> {{LogicalWindowAggregate}}, so the rule does not match.
> We should fix this by:
> 1. restrict the translation to Flink avg functions in {{AggregateUtil}} to 
> {{SqlKind.AVG}}. 
> 2. implement a rule (hopefully based on {{AggregateReduceFunctionsRule}}) 
> that decomposes the complex agg functions into the {{SUM}} and {{COUNT}}.
> Step 1. is easy and a quick fix but we would get an exception "Unsupported 
> Function" if {{VAR_POP}} is used in a {{GROUP BY}} window.
> Step 2. might be more involved, depending on how difficult it is to port the 
> rule.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink pull request #5706: [FLINK-8903] [table] Fix VAR_SAMP, VAR_POP, STDDEV...

2018-03-16 Thread fhueske
Github user fhueske commented on a diff in the pull request:

https://github.com/apache/flink/pull/5706#discussion_r175230116
  
--- Diff: 
flink-libraries/flink-table/src/main/scala/org/apache/flink/table/plan/nodes/logical/FlinkLogicalWindowAggregate.scala
 ---
@@ -103,6 +106,19 @@ class FlinkLogicalWindowAggregateConverter
 FlinkConventions.LOGICAL,
 "FlinkLogicalWindowAggregateConverter") {
 
+  override def matches(call: RelOptRuleCall): Boolean = {
+    val agg = call.rel(0).asInstanceOf[LogicalWindowAggregate]
+
+    // we do not support these functions natively
+    // they have to be converted using the WindowAggregateReduceFunctionsRule
+    val supported = agg.getAggCallList.asScala.map(_.getAggregation.getKind).forall {
+      case SqlKind.STDDEV_POP | SqlKind.STDDEV_SAMP | SqlKind.VAR_POP | SqlKind.VAR_SAMP => false
--- End diff --

sounds good to me. Will update the PR


---


[jira] [Created] (FLINK-9015) Upgrade Calcite dependency to 1.17

2018-03-16 Thread Shuyi Chen (JIRA)
Shuyi Chen created FLINK-9015:
-

 Summary: Upgrade Calcite dependency to 1.17
 Key: FLINK-9015
 URL: https://issues.apache.org/jira/browse/FLINK-9015
 Project: Flink
  Issue Type: Task
Reporter: Shuyi Chen
Assignee: Shuyi Chen






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-8981) End-to-end test: Kerberos security

2018-03-16 Thread Shuyi Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyi Chen reassigned FLINK-8981:
-

Assignee: Shuyi Chen

> End-to-end test: Kerberos security
> --
>
> Key: FLINK-8981
> URL: https://issues.apache.org/jira/browse/FLINK-8981
> Project: Flink
>  Issue Type: Sub-task
>  Components: Security, Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Shuyi Chen
>Priority: Blocker
> Fix For: 1.5.0
>
>
> We should add an end-to-end test which verifies Flink's integration with 
> Kerberos security. In order to do this, we should start a Kerberos secured 
> Hadoop, ZooKeeper and Kafka cluster. Then we should start a Flink cluster 
> with HA enabled and run a job which reads from and writes to Kafka. We could 
> use a simple pipe job for that purpose which has some state for checkpointing 
> to HDFS.
> See [security docs| 
> https://ci.apache.org/projects/flink/flink-docs-master/ops/security-kerberos.html]
>  for more information about Flink's Kerberos integration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8988) End-to-end test: Cassandra connector

2018-03-16 Thread Shuyi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402515#comment-16402515
 ] 

Shuyi Chen commented on FLINK-8988:
---

[~till.rohrmann], wondering if the current CassandraConnectorITCase is enough?

> End-to-end test: Cassandra connector
> 
>
> Key: FLINK-8988
> URL: https://issues.apache.org/jira/browse/FLINK-8988
> Project: Flink
>  Issue Type: Sub-task
>  Components: Cassandra Connector, Tests
>Reporter: Till Rohrmann
>Priority: Major
>
> In order to test the integration with Cassandra, we should add an end-to-end 
> test which tests the Cassandra connector. In order to do this, we need to add 
> a script/function which sets up a {{Cassandra}} cluster. Then we can run a 
> simple job writing information to {{Cassandra}} using the 
> {{CassandraRowWriteAheadSink}} and the {{CassandraTupleWriteAheadSink}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-8988) End-to-end test: Cassandra connector

2018-03-16 Thread Shuyi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402515#comment-16402515
 ] 

Shuyi Chen edited comment on FLINK-8988 at 3/16/18 8:50 PM:


Hi [~till.rohrmann], wondering if the current CassandraConnectorITCase is 
enough?


was (Author: suez1224):
[~till.rohrmann], wondering if the current CassandraConnectorITCase is enough?

> End-to-end test: Cassandra connector
> 
>
> Key: FLINK-8988
> URL: https://issues.apache.org/jira/browse/FLINK-8988
> Project: Flink
>  Issue Type: Sub-task
>  Components: Cassandra Connector, Tests
>Reporter: Till Rohrmann
>Priority: Major
>
> In order to test the integration with Cassandra, we should add an end-to-end 
> test which tests the Cassandra connector. In order to do this, we need to add 
> a script/function which sets up a {{Cassandra}} cluster. Then we can run a 
> simple job writing information to {{Cassandra}} using the 
> {{CassandraRowWriteAheadSink}} and the {{CassandraTupleWriteAheadSink}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8894) CurrentJobIdsHandler fails to serialize response

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402416#comment-16402416
 ] 

ASF GitHub Bot commented on FLINK-8894:
---

GitHub user GJL opened a pull request:

https://github.com/apache/flink/pull/5711

[FLINK-8894][REST] CurrentJobIdsHandler fails to serialize response

## What is the purpose of the change

*This fixes a serialization error in `CurrentJobIdsHandler`.*

cc: @tillrohrmann 

## Brief change log

  - *Set object codec for `JsonGenerator` used in handler.*
  - *Add unit test.*


## Verifying this change

This change added tests and can be verified as follows:

  - *Added test to `CurrentJobIdsHandler`.*
  - *Manually verified the changes.*

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (yes / **no**)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
  - The serializers: (yes / **no** / don't know)
  - The runtime per-record code paths (performance sensitive): (yes / 
**no** / don't know)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
  - The S3 file system connector: (yes / **no** / don't know)

## Documentation

  - Does this pull request introduce a new feature? (yes / **no**)
  - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GJL/flink FLINK-8894

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5711.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5711


commit 559a80836c76cabcc687f4a7e129e4dec6e5de6e
Author: gyao 
Date:   2018-03-16T19:52:11Z

[FLINK-8894][REST] Set object codec for JsonGenerator used by 
CurrentJobIdsHandler




> CurrentJobIdsHandler fails to serialize response
> 
>
> Key: FLINK-8894
> URL: https://issues.apache.org/jira/browse/FLINK-8894
> Project: Flink
>  Issue Type: Bug
>  Components: REST
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Blocker
> Fix For: 1.5.0
>
>
> *Description*
> The non-flip6 {{CurrentJobIdsHandler}} fails to serialize instances of 
> {{JobIdWithStatus}}.
> *Steps to reproduce*
> # Add {{mode: old}} to {{flink-conf.yaml}}
> # {{bin/start-cluster.sh}}
> # {{nc -l 9001}}
> # {{bin/flink run  examples/streaming/SocketWindowWordCount.jar --port 9001}}
> # {{curl localhost:8081/jobs}}
> *Stacktrace*
> {noformat}
> java.util.concurrent.CompletionException: 
> org.apache.flink.util.FlinkException: Failed to fetch list of all running 
> jobs.
>   at 
> org.apache.flink.runtime.rest.handler.legacy.CurrentJobIdsHandler.lambda$handleJsonRequest$0(CurrentJobIdsHandler.java:93)
>   at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Failed to fetch list of all 
> running jobs.
>   ... 9 more
> Caused by: java.lang.IllegalStateException: No ObjectCodec defined for the 
> generator, can only serialize simple wrapper types (type passed 
> org.apache.flink.runtime.messages.webmonitor.JobIdsWithStatusOverview$JobIdWithStatus)
>   at 
> org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonGenerator._writeSimpleObject(JsonGenerator.java:1798)
>   at 
> org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:369)
>   at org.apache.
> {noformat}
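
The root cause is visible in the stack trace: a bare Jackson JsonGenerator has no ObjectCodec attached, so writeObject() on anything but a simple wrapper type throws. A minimal reproduction of the error and the fix the PR title names, using plain Jackson (Flink shades it under org.apache.flink.shaded.jackson2):

{code}
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.StringWriter;

public class CodecRepro {
    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        JsonGenerator gen = new JsonFactory().createGenerator(out);
        // Without this line, writeObject(...) on a non-wrapper type throws
        // IllegalStateException: No ObjectCodec defined for the generator.
        gen.setCodec(new ObjectMapper());
        gen.writeObject(new int[] {1, 2, 3});
        gen.flush();
        System.out.println(out); // [1,2,3]
    }
}
{code}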



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink pull request #5711: [FLINK-8894][REST] CurrentJobIdsHandler fails to s...

2018-03-16 Thread GJL
GitHub user GJL opened a pull request:

https://github.com/apache/flink/pull/5711

[FLINK-8894][REST] CurrentJobIdsHandler fails to serialize response

## What is the purpose of the change

*This fixes a serialization error in `CurrentJobIdsHandler`.*

cc: @tillrohrmann 

## Brief change log

  - *Set object codec for `JsonGenerator` used in handler.*
  - *Add unit test.*


## Verifying this change

This change added tests and can be verified as follows:

  - *Added test to `CurrentJobIdsHandler`.*
  - *Manually verified the changes.*

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (yes / **no**)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
  - The serializers: (yes / **no** / don't know)
  - The runtime per-record code paths (performance sensitive): (yes / 
**no** / don't know)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
  - The S3 file system connector: (yes / **no** / don't know)

## Documentation

  - Does this pull request introduce a new feature? (yes / **no**)
  - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GJL/flink FLINK-8894

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5711.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5711


commit 559a80836c76cabcc687f4a7e129e4dec6e5de6e
Author: gyao 
Date:   2018-03-16T19:52:11Z

[FLINK-8894][REST] Set object codec for JsonGenerator used by 
CurrentJobIdsHandler




---


[jira] [Assigned] (FLINK-8894) CurrentJobIdsHandler fails to serialize response

2018-03-16 Thread Gary Yao (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao reassigned FLINK-8894:
---

Assignee: Gary Yao

> CurrentJobIdsHandler fails to serialize response
> 
>
> Key: FLINK-8894
> URL: https://issues.apache.org/jira/browse/FLINK-8894
> Project: Flink
>  Issue Type: Bug
>  Components: REST
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Gary Yao
>Assignee: Gary Yao
>Priority: Blocker
> Fix For: 1.5.0
>
>
> *Description*
> The non-flip6 {{CurrentJobIdsHandler}} fails to serialize instances of 
> {{JobIdWithStatus}}.
> *Steps to reproduce*
> # Add {{mode: old}} to {{flink-conf.yaml}}
> # {{bin/start-cluster.sh}}
> # {{nc -l 9001}}
> # {{bin/flink run  examples/streaming/SocketWindowWordCount.jar --port 9001}}
> # {{curl localhost:8081/jobs}}
> *Stacktrace*
> {noformat}
> java.util.concurrent.CompletionException: 
> org.apache.flink.util.FlinkException: Failed to fetch list of all running 
> jobs.
>   at 
> org.apache.flink.runtime.rest.handler.legacy.CurrentJobIdsHandler.lambda$handleJsonRequest$0(CurrentJobIdsHandler.java:93)
>   at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Failed to fetch list of all 
> running jobs.
>   ... 9 more
> Caused by: java.lang.IllegalStateException: No ObjectCodec defined for the 
> generator, can only serialize simple wrapper types (type passed 
> org.apache.flink.runtime.messages.webmonitor.JobIdsWithStatusOverview$JobIdWithStatus)
>   at 
> org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonGenerator._writeSimpleObject(JsonGenerator.java:1798)
>   at 
> org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:369)
>   at org.apache.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8843) Decouple bind REST address from advertised address

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402379#comment-16402379
 ] 

ASF GitHub Bot commented on FLINK-8843:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5632#discussion_r175192336
  
--- Diff: 
flink-core/src/main/java/org/apache/flink/configuration/RestOptions.java ---
@@ -33,7 +33,7 @@
 */
public static final ConfigOption<String> REST_ADDRESS =
key("rest.address")
-   .defaultValue("localhost")
+   .defaultValue("0.0.0.0")
--- End diff --

yes you can close it


> Decouple bind REST address from advertised address
> --
>
> Key: FLINK-8843
> URL: https://issues.apache.org/jira/browse/FLINK-8843
> Project: Flink
>  Issue Type: Improvement
>  Components: REST
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Gary Yao
>Priority: Critical
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> The {{RestServerEndpoint}} is currently bound to the same address which is 
> also advertised to the client, namely {{RestOptions#REST_ADDRESS}}. It would 
> be better to start the {{RestServerEndpoint}} listening on all addresses by 
> binding to {{0.0.0.0}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink pull request #5632: [FLINK-8843][REST] Decouple bind REST address from...

2018-03-16 Thread GJL
Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5632#discussion_r175192336
  
--- Diff: 
flink-core/src/main/java/org/apache/flink/configuration/RestOptions.java ---
@@ -33,7 +33,7 @@
 */
public static final ConfigOption<String> REST_ADDRESS =
key("rest.address")
-   .defaultValue("localhost")
+   .defaultValue("0.0.0.0")
--- End diff --

yes you can close it


---


[jira] [Updated] (FLINK-8946) TaskManager stop sending metrics after JobManager failover

2018-03-16 Thread Aljoscha Krettek (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aljoscha Krettek updated FLINK-8946:

Fix Version/s: 1.5.0

> TaskManager stop sending metrics after JobManager failover
> --
>
> Key: FLINK-8946
> URL: https://issues.apache.org/jira/browse/FLINK-8946
> Project: Flink
>  Issue Type: Bug
>  Components: Metrics, TaskManager
>Affects Versions: 1.4.2
>Reporter: Truong Duc Kien
>Assignee: vinoyang
>Priority: Blocker
> Fix For: 1.5.0
>
>
> Running in Yarn-standalone mode, when the JobManager performs a failover, 
> all TaskManagers that are inherited from the previous JobManager will not send 
> metrics to the new JobManager and other registered metric reporters.
>  
> A cursory glance reveals that these lines of code might be the cause:
> [https://github.com/apache/flink/blob/a3478fdfa0f792104123fefbd9bdf01f5029de51/flink-runtime/src/main/scala/org/apache/flink/runtime/taskmanager/TaskManager.scala#L1082-L1086]
> Perhaps the TaskManager closes its metrics group when disassociating from the 
> JobManager, but does not create a new one on fail-over association?
>  
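
If the hypothesis above holds, the bug is a one-sided lifecycle: the metrics group is closed on disassociation with no matching re-creation on the next association. A speculative sketch with stand-in names, not the actual TaskManager code:

{code}
class JobManagerMetricsLifecycle {
    private AutoCloseable jobManagerMetricGroup;

    // Called when the old JobManager goes away; this side exists today.
    void onDisassociate() throws Exception {
        if (jobManagerMetricGroup != null) {
            jobManagerMetricGroup.close();
            jobManagerMetricGroup = null;
        }
    }

    // Suspected missing step: rebuild the group when the new JobManager
    // associates, otherwise nothing is reported against the closed group.
    void onAssociate() {
        jobManagerMetricGroup = () -> { }; // placeholder for a fresh metric group
    }
}
{code}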



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8946) TaskManager stop sending metrics after JobManager failover

2018-03-16 Thread Aljoscha Krettek (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aljoscha Krettek updated FLINK-8946:

Priority: Blocker  (was: Major)

> TaskManager stop sending metrics after JobManager failover
> --
>
> Key: FLINK-8946
> URL: https://issues.apache.org/jira/browse/FLINK-8946
> Project: Flink
>  Issue Type: Bug
>  Components: Metrics, TaskManager
>Affects Versions: 1.4.2
>Reporter: Truong Duc Kien
>Assignee: vinoyang
>Priority: Blocker
> Fix For: 1.5.0
>
>
> Running in Yarn-standalone mode, when the JobManager performs a failover, 
> all TaskManagers that are inherited from the previous JobManager will not send 
> metrics to the new JobManager and other registered metric reporters.
>  
> A cursory glance reveals that these lines of code might be the cause:
> [https://github.com/apache/flink/blob/a3478fdfa0f792104123fefbd9bdf01f5029de51/flink-runtime/src/main/scala/org/apache/flink/runtime/taskmanager/TaskManager.scala#L1082-L1086]
> Perhaps the TaskManager closes its metrics group when disassociating from the 
> JobManager, but does not create a new one on fail-over association?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8903) Built-in agg functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP are broken in Group Windows

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402345#comment-16402345
 ] 

ASF GitHub Bot commented on FLINK-8903:
---

Github user suez1224 commented on a diff in the pull request:

https://github.com/apache/flink/pull/5706#discussion_r175183811
  
--- Diff: 
flink-libraries/flink-table/src/main/scala/org/apache/flink/table/plan/nodes/logical/FlinkLogicalWindowAggregate.scala
 ---
@@ -103,6 +106,19 @@ class FlinkLogicalWindowAggregateConverter
 FlinkConventions.LOGICAL,
 "FlinkLogicalWindowAggregateConverter") {
 
+  override def matches(call: RelOptRuleCall): Boolean = {
+    val agg = call.rel(0).asInstanceOf[LogicalWindowAggregate]
+
+    // we do not support these functions natively
+    // they have to be converted using the WindowAggregateReduceFunctionsRule
+    val supported = agg.getAggCallList.asScala.map(_.getAggregation.getKind).forall {
+      case SqlKind.STDDEV_POP | SqlKind.STDDEV_SAMP | SqlKind.VAR_POP | SqlKind.VAR_SAMP => false
--- End diff --

How about SqlKind.AVG_AGG_FUNCTIONS.contains(kind) && kind != SqlKind.SUM 
&& kind != SqlKind.AVG?


> Built-in agg functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP are broken in 
> Group Windows
> ---
>
> Key: FLINK-8903
> URL: https://issues.apache.org/jira/browse/FLINK-8903
> Project: Flink
>  Issue Type: Bug
>  Components: Table API & SQL
>Affects Versions: 1.3.2, 1.5.0, 1.4.2
>Reporter: lilizhao
>Assignee: Fabian Hueske
>Priority: Critical
> Fix For: 1.5.0
>
> Attachments: QQ图片20180312180143.jpg, TableAndSQLTest.java
>
>
> The built-in aggregation functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP 
> are translated into regular AVG functions if they are applied in the context 
> of a Group Window aggregation ({{GROUP BY TUMBLE/HOP/SESSION}}).
> The reason is that these functions are internally represented as 
> {{SqlAvgAggFunction}} but with different {{SqlKind}}. When translating 
> Calcite aggregation functions to Flink Table agg functions, we only look at 
> the type of the class, not at the value of the {{kind}} field. We did not 
> notice that before, because in all other cases (regular {{GROUP BY}} without 
> windows or {{OVER}} windows), we have a translation rule 
> {{AggregateReduceFunctionsRule}} that decomposes the more complex functions 
> into expressions of {{COUNT}} and {{SUM}} functions such that we never 
> execute an {{AVG}} Flink function. That rule can only be applied on 
> {{LogicalAggregate}}, however, we represent group windows as 
> {{LogicalWindowAggregate}}, so the rule does not match.
> We should fix this by:
> 1. restrict the translation to Flink avg functions in {{AggregateUtil}} to 
> {{SqlKind.AVG}}. 
> 2. implement a rule (hopefully based on {{AggregateReduceFunctionsRule}}) 
> that decomposes the complex agg functions into the {{SUM}} and {{COUNT}}.
> Step 1. is easy and a quick fix but we would get an exception "Unsupported 
> Function" if {{VAR_POP}} is used in a {{GROUP BY}} window.
> Step 2. might be more involved, depending on how difficult it is to port the 
> rule.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink pull request #5706: [FLINK-8903] [table] Fix VAR_SAMP, VAR_POP, STDDEV...

2018-03-16 Thread suez1224
Github user suez1224 commented on a diff in the pull request:

https://github.com/apache/flink/pull/5706#discussion_r175183811
  
--- Diff: 
flink-libraries/flink-table/src/main/scala/org/apache/flink/table/plan/nodes/logical/FlinkLogicalWindowAggregate.scala
 ---
@@ -103,6 +106,19 @@ class FlinkLogicalWindowAggregateConverter
 FlinkConventions.LOGICAL,
 "FlinkLogicalWindowAggregateConverter") {
 
+  override def matches(call: RelOptRuleCall): Boolean = {
+    val agg = call.rel(0).asInstanceOf[LogicalWindowAggregate]
+
+    // we do not support these functions natively
+    // they have to be converted using the WindowAggregateReduceFunctionsRule
+    val supported = agg.getAggCallList.asScala.map(_.getAggregation.getKind).forall {
+      case SqlKind.STDDEV_POP | SqlKind.STDDEV_SAMP | SqlKind.VAR_POP | SqlKind.VAR_SAMP => false
--- End diff --

How about SqlKind.AVG_AGG_FUNCTIONS.contains(kind) && kind != SqlKind.SUM 
&& kind != SqlKind.AVG?


---


[jira] [Commented] (FLINK-8903) Built-in agg functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP are broken in Group Windows

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402335#comment-16402335
 ] 

ASF GitHub Bot commented on FLINK-8903:
---

Github user fhueske commented on a diff in the pull request:

https://github.com/apache/flink/pull/5706#discussion_r175181100
  
--- Diff: 
flink-libraries/flink-table/src/main/scala/org/apache/flink/table/plan/nodes/logical/FlinkLogicalWindowAggregate.scala
 ---
@@ -103,6 +106,19 @@ class FlinkLogicalWindowAggregateConverter
 FlinkConventions.LOGICAL,
 "FlinkLogicalWindowAggregateConverter") {
 
+  override def matches(call: RelOptRuleCall): Boolean = {
+    val agg = call.rel(0).asInstanceOf[LogicalWindowAggregate]
+
+    // we do not support these functions natively
+    // they have to be converted using the WindowAggregateReduceFunctionsRule
+    val supported = agg.getAggCallList.asScala.map(_.getAggregation.getKind).forall {
+      case SqlKind.STDDEV_POP | SqlKind.STDDEV_SAMP | SqlKind.VAR_POP | SqlKind.VAR_SAMP => false
--- End diff --

Replacing the current code by `SqlKind.AVG_AGG_FUNCTIONS.contains()` led 
to several test failures. These tests expected an `AVG` aggregation function 
that was now replaced by `SUM / COUNT`.


> Built-in agg functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP are broken in 
> Group Windows
> ---
>
> Key: FLINK-8903
> URL: https://issues.apache.org/jira/browse/FLINK-8903
> Project: Flink
>  Issue Type: Bug
>  Components: Table API & SQL
>Affects Versions: 1.3.2, 1.5.0, 1.4.2
>Reporter: lilizhao
>Assignee: Fabian Hueske
>Priority: Critical
> Fix For: 1.5.0
>
> Attachments: QQ图片20180312180143.jpg, TableAndSQLTest.java
>
>
> The built-in aggregation functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP 
> are translated into regular AVG functions if they are applied in the context 
> of a Group Window aggregation ({{GROUP BY TUMBLE/HOP/SESSION}}).
> The reason is that these functions are internally represented as 
> {{SqlAvgAggFunction}} but with different {{SqlKind}}. When translating 
> Calcite aggregation functions to Flink Table agg functions, we only look at 
> the type of the class, not at the value of the {{kind}} field. We did not 
> notice that before, because in all other cases (regular {{GROUP BY}} without 
> windows or {{OVER}} windows), we have a translation rule 
> {{AggregateReduceFunctionsRule}} that decomposes the more complex functions 
> into expressions of {{COUNT}} and {{SUM}} functions such that we never 
> execute an {{AVG}} Flink function. That rule can only be applied on 
> {{LogicalAggregate}}, however, we represent group windows as 
> {{LogicalWindowAggregate}}, so the rule does not match.
> We should fix this by:
> 1. restrict the translation to Flink avg functions in {{AggregateUtil}} to 
> {{SqlKind.AVG}}. 
> 2. implement a rule (hopefully based on {{AggregateReduceFunctionsRule}}) 
> that decomposes the complex agg functions into the {{SUM}} and {{COUNT}}.
> Step 1. is easy and a quick fix but we would get an exception "Unsupported 
> Function" if {{VAR_POP}} is used in a {{GROUP BY}} window.
> Step 2. might be more involved, depending on how difficult it is to port the 
> rule.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink pull request #5706: [FLINK-8903] [table] Fix VAR_SAMP, VAR_POP, STDDEV...

2018-03-16 Thread fhueske
Github user fhueske commented on a diff in the pull request:

https://github.com/apache/flink/pull/5706#discussion_r175181100
  
--- Diff: 
flink-libraries/flink-table/src/main/scala/org/apache/flink/table/plan/nodes/logical/FlinkLogicalWindowAggregate.scala
 ---
@@ -103,6 +106,19 @@ class FlinkLogicalWindowAggregateConverter
 FlinkConventions.LOGICAL,
 "FlinkLogicalWindowAggregateConverter") {
 
+  override def matches(call: RelOptRuleCall): Boolean = {
+    val agg = call.rel(0).asInstanceOf[LogicalWindowAggregate]
+
+    // we do not support these functions natively
+    // they have to be converted using the WindowAggregateReduceFunctionsRule
+    val supported = agg.getAggCallList.asScala.map(_.getAggregation.getKind).forall {
+      case SqlKind.STDDEV_POP | SqlKind.STDDEV_SAMP | SqlKind.VAR_POP | SqlKind.VAR_SAMP => false
--- End diff --

Replacing the current code with `SqlKind.AVG_AGG_FUNCTIONS.contains()` led 
to several test failures: those tests expected an `AVG` aggregation function 
that was now replaced by `SUM / COUNT`.


---


[jira] [Commented] (FLINK-8903) Built-in agg functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP are broken in Group Windows

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402332#comment-16402332
 ] 

ASF GitHub Bot commented on FLINK-8903:
---

Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/5706
  
Updated the PR with the Calcite issue. 


> Built-in agg functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP are broken in 
> Group Windows
> ---
>
> Key: FLINK-8903
> URL: https://issues.apache.org/jira/browse/FLINK-8903
> Project: Flink
>  Issue Type: Bug
>  Components: Table API  SQL
>Affects Versions: 1.3.2, 1.5.0, 1.4.2
>Reporter: lilizhao
>Assignee: Fabian Hueske
>Priority: Critical
> Fix For: 1.5.0
>
> Attachments: QQ图片20180312180143.jpg, TableAndSQLTest.java
>
>
> The built-in aggregation functions VAR_POP, VAR_SAMP, STDEV_POP, STDEV_SAMP 
> are translated into regular AVG functions if they are applied in the context 
> of a Group Window aggregation (\{{GROUP BY TUMBLE/HOP/SESSION}}).
> The reason is that these functions are internally represented as 
> {{SqlAvgAggFunction}} but with different {{SqlKind}}. When translating 
> Calcite aggregation functions to Flink Table agg functions, we only look at 
> the type of the class, not at the value of the {{kind}} field. We did not 
> notice that before, because in all other cases (regular {{GROUP BY}} without 
> windows or {{OVER}} windows), we have a translation rule 
> {{AggregateReduceFunctionsRule}} that decomposes the more complex functions 
> into expressions of {{COUNT}} and {{SUM}} functions such that we never 
> execute an {{AVG}} Flink function. That rule can only be applied on 
> {{LogicalAggregate}}, however, we represent group windows as 
> {{LogicalWindowAggregate}}, so the rule does not match.
> We should fix this by:
> 1. restrict the translation to Flink avg functions in {{AggregateUtil}} to 
> {{SqlKind.AVG}}. 
> 2. implement a rule (hopefully based on {{AggregateReduceFunctionsRule}}) 
> that decomposes the complex agg functions into the {{SUM}} and {{COUNT}}.
> Step 1. is easy and a quick fix but we would get an exception "Unsupported 
> Function" if {{VAR_POP}} is used in a {{GROUP BY}} window.
> Step 2. might be more involved, depending on how difficult it is to port the 
> rule.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5706: [FLINK-8903] [table] Fix VAR_SAMP, VAR_POP, STDDEV_SAMP, ...

2018-03-16 Thread fhueske
Github user fhueske commented on the issue:

https://github.com/apache/flink/pull/5706
  
Updated the PR with the Calcite issue. 


---


[jira] [Commented] (FLINK-8922) Revert FLINK-8859 because it causes segfaults in testing

2018-03-16 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402328#comment-16402328
 ] 

Stephan Ewen commented on FLINK-8922:
-

[~sihuazhou]'s findings could be the right fix.

Could it be that RocksDB itself releases, in some cases, the native object for 
the WriteOptions (or something that it internally points to)?

In that case, having a dedicated options object per state and for restore may 
be the best way to handle this.

Technically, I would still consider this a bug in RocksDB and this solution a 
workaround for that bug. But that is what we need to do sometimes...
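
A minimal Java sketch of the dedicated-options workaround (the class shape is 
illustrative, not Flink's actual backend code; disabling the WAL is an 
assumption about the backend's setup): each backend owns its own 
`WriteOptions` and releases it on close, so nothing else can free the native 
object underneath it.

```java
import org.rocksdb.WriteOptions;

// One WriteOptions per state backend instance (and a separate one for
// restore paths), closed together with its owner instead of being shared
// process-wide.
class BackendLocalWriteOptions implements AutoCloseable {
    private final WriteOptions writeOptions = new WriteOptions().setDisableWAL(true);

    WriteOptions get() {
        return writeOptions;
    }

    @Override
    public void close() {
        writeOptions.close(); // releases the native object exactly once
    }
}
```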

> Revert FLINK-8859 because it causes segfaults in testing
> 
>
> Key: FLINK-8922
> URL: https://issues.apache.org/jira/browse/FLINK-8922
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.5.0
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
> Fix For: 1.5.0
>
>
> We need to revert FLINK-8859 because it causes problems with RocksDB that 
> make our automated tests fail on Travis. The change actually looks good, and 
> it is currently unclear why it can introduce such a problem. This might 
> also be a bug in RocksDB. Nevertheless, for the sake of proper release 
> testing, we should revert the change for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-9009) Error| You are creating too many HashedWheelTimer instances. HashedWheelTimer is a shared resource that must be reused across the application, so that only a few

2018-03-16 Thread Pankaj (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402323#comment-16402323
 ] 

Pankaj edited comment on FLINK-9009 at 3/16/18 6:35 PM:


No, it is not related to Kafka. I have already tried and checked: the problem 
only occurs when we introduce more parallelism and Flink is writing to 
Cassandra with two clusters. In my case I introduced parallelism = 10 because 
I have 10 partitions in the Kafka topic.

I do not face any problem if I use parallelism = 1 with Cassandra writing 
from Flink, but it failed with more parallelism.

The problem can be replicated with the steps I shared in the description.

I'm not sure if Flink has the fixes for the two tickets below in the 
Cassandra connector API I mentioned:

https://issues.apache.org/jira/browse/CASSANDRA-11243

https://issues.apache.org/jira/browse/CASSANDRA-10837

 


was (Author: pmishra01):
No, it is not related to Kafka. I have already tried and checked: the problem 
only occurs when we introduce more parallelism and Flink is writing to 
Cassandra with two clusters. In my case I introduced parallelism = 10 because 
I have 10 partitions in the Kafka topic.

I do not face any problem in the same scenario with no Cassandra writing from 
Flink.

The problem can be replicated with the steps I shared in the description.

I'm not sure if Flink has the fixes for the two tickets below in the 
Cassandra connector API I mentioned:

https://issues.apache.org/jira/browse/CASSANDRA-11243

https://issues.apache.org/jira/browse/CASSANDRA-10837

 

> Error| You are creating too many HashedWheelTimer instances.  
> HashedWheelTimer is a shared resource that must be reused across the 
> application, so that only a few instances are created.
> -
>
> Key: FLINK-9009
> URL: https://issues.apache.org/jira/browse/FLINK-9009
> Project: Flink
>  Issue Type: Bug
> Environment: PaaS platform: OpenShift
>Reporter: Pankaj
>Priority: Blocker
>
> Steps to reproduce:
> 1- Flink with Kafka as a consumer -> writing the stream to Cassandra using 
> the Flink Cassandra sink.
> 2- In-memory job manager and task manager with a 5000 ms checkpointing interval.
> 3- env.setParallelism(10) -> as the Kafka topic has 10 partitions.
> 4- There are around 13 unique streams in a single Flink runtime environment 
> which are reading from Kafka -> processing and writing to Cassandra.
> Hardware: CPU 200 millicores. It is deployed on a PaaS platform on one node.
> Memory: 526 MB.
>  
> When I start the server, it starts Flink and all of a sudden it stops with 
> the above error. It also shows an out-of-memory error.
>  
> It would be nice if anybody could suggest if something is wrong.
>  
> Maven:
> flink-connector-cassandra_2.11: 1.3.2
> flink-streaming-java_2.11: 1.4.0
> flink-connector-kafka-0.11_2.11:1.4.0
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-9009) Error| You are creating too many HashedWheelTimer instances. HashedWheelTimer is a shared resource that must be reused across the application, so that only a few

2018-03-16 Thread Pankaj (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402323#comment-16402323
 ] 

Pankaj edited comment on FLINK-9009 at 3/16/18 6:34 PM:


No, it is not related to Kafka. I have already tried and checked: the problem 
only occurs when we introduce more parallelism and Flink is writing to 
Cassandra with two clusters. In my case I introduced parallelism = 10 because 
I have 10 partitions in the Kafka topic.

I do not face any problem in the same scenario with no Cassandra writing from 
Flink.

The problem can be replicated with the steps I shared in the description.

I'm not sure if Flink has the fixes for the two tickets below in the 
Cassandra connector API I mentioned:

https://issues.apache.org/jira/browse/CASSANDRA-11243

https://issues.apache.org/jira/browse/CASSANDRA-10837

 


was (Author: pmishra01):
No, it is not related to Kafka. I have already tried and checked: the problem 
only occurs when we introduce more parallelism and Flink is writing to 
Cassandra with two clusters. In my case I introduced parallelism = 10 because 
I have 10 partitions in the Kafka topic.

I do not face any problem in the same scenario with no Cassandra writing from 
Flink.

The problem can be replicated with the steps I shared in the description.

I'm not sure if Flink has the fixes for the two tickets below in the 
Cassandra connector API I mentioned:

https://issues.apache.org/jira/browse/CASSANDRA-11243

https://issues.apache.org/jira/browse/CASSANDRA-10837

 

> Error| You are creating too many HashedWheelTimer instances.  
> HashedWheelTimer is a shared resource that must be reused across the 
> application, so that only a few instances are created.
> -
>
> Key: FLINK-9009
> URL: https://issues.apache.org/jira/browse/FLINK-9009
> Project: Flink
>  Issue Type: Bug
> Environment: PaaS platform: OpenShift
>Reporter: Pankaj
>Priority: Blocker
>
> Steps to reproduce:
> 1- Flink with Kafka as a consumer -> writing the stream to Cassandra using 
> the Flink Cassandra sink.
> 2- In-memory job manager and task manager with a 5000 ms checkpointing interval.
> 3- env.setParallelism(10) -> as the Kafka topic has 10 partitions.
> 4- There are around 13 unique streams in a single Flink runtime environment 
> which are reading from Kafka -> processing and writing to Cassandra.
> Hardware: CPU 200 millicores. It is deployed on a PaaS platform on one node.
> Memory: 526 MB.
>  
> When I start the server, it starts Flink and all of a sudden it stops with 
> the above error. It also shows an out-of-memory error.
>  
> It would be nice if anybody could suggest if something is wrong.
>  
> Maven:
> flink-connector-cassandra_2.11: 1.3.2
> flink-streaming-java_2.11: 1.4.0
> flink-connector-kafka-0.11_2.11:1.4.0
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9009) Error| You are creating too many HashedWheelTimer instances. HashedWheelTimer is a shared resource that must be reused across the application, so that only a few insta

2018-03-16 Thread Pankaj (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402323#comment-16402323
 ] 

Pankaj commented on FLINK-9009:
---

No, it is not related to Kafka. I have already tried and checked: the problem 
only occurs when we introduce more parallelism and Flink is writing to 
Cassandra with two clusters. In my case I introduced parallelism = 10 because 
I have 10 partitions in the Kafka topic.

I do not face any problem in the same scenario with no Cassandra writing from 
Flink.

The problem can be replicated with the steps I shared in the description.

I'm not sure if Flink has the fixes for the two tickets below in the 
Cassandra connector API I mentioned:

https://issues.apache.org/jira/browse/CASSANDRA-11243

https://issues.apache.org/jira/browse/CASSANDRA-10837
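
The Netty warning in the title points at a sharing pattern. A hedged Java 
sketch (purely illustrative, not the connector's code): reuse one timer 
process-wide instead of creating one per cluster or stream.

```java
import io.netty.util.HashedWheelTimer;

// A single process-wide timer reused by all clients; creating one
// HashedWheelTimer per Cassandra cluster/stream triggers Netty's warning and
// costs a thread plus a timing wheel per instance.
final class SharedWheelTimer {
    static final HashedWheelTimer INSTANCE = new HashedWheelTimer();

    private SharedWheelTimer() {
    }
}
```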

 

> Error| You are creating too many HashedWheelTimer instances.  
> HashedWheelTimer is a shared resource that must be reused across the 
> application, so that only a few instances are created.
> -
>
> Key: FLINK-9009
> URL: https://issues.apache.org/jira/browse/FLINK-9009
> Project: Flink
>  Issue Type: Bug
> Environment: PaaS platform: OpenShift
>Reporter: Pankaj
>Priority: Blocker
>
> Steps to reproduce:
> 1- Flink with Kafka as a consumer -> writing the stream to Cassandra using 
> the Flink Cassandra sink.
> 2- In-memory job manager and task manager with a 5000 ms checkpointing interval.
> 3- env.setParallelism(10) -> as the Kafka topic has 10 partitions.
> 4- There are around 13 unique streams in a single Flink runtime environment 
> which are reading from Kafka -> processing and writing to Cassandra.
> Hardware: CPU 200 millicores. It is deployed on a PaaS platform on one node.
> Memory: 526 MB.
>  
> When I start the server, it starts Flink and all of a sudden it stops with 
> the above error. It also shows an out-of-memory error.
>  
> It would be nice if anybody could suggest if something is wrong.
>  
> Maven:
> flink-connector-cassandra_2.11: 1.3.2
> flink-streaming-java_2.11: 1.4.0
> flink-connector-kafka-0.11_2.11:1.4.0
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-8364) Add iterator() to ListState which returns empty iterator when it has no value

2018-03-16 Thread Aljoscha Krettek (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aljoscha Krettek closed FLINK-8364.
---
   Resolution: Invalid
Fix Version/s: (was: 1.5.0)

> Add iterator() to ListState which returns empty iterator when it has no value
> -
>
> Key: FLINK-8364
> URL: https://issues.apache.org/jira/browse/FLINK-8364
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing
>Affects Versions: 1.4.0
>Reporter: Bowen Li
>Assignee: Bowen Li
>Priority: Major
>
> Add iterator() to ListState which returns empty iterator when it has no value



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (FLINK-8364) Add iterator() to ListState which returns empty iterator when it has no value

2018-03-16 Thread Aljoscha Krettek (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aljoscha Krettek reopened FLINK-8364:
-

reopen to change fixVersion

> Add iterator() to ListState which returns empty iterator when it has no value
> -
>
> Key: FLINK-8364
> URL: https://issues.apache.org/jira/browse/FLINK-8364
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing
>Affects Versions: 1.4.0
>Reporter: Bowen Li
>Assignee: Bowen Li
>Priority: Major
>
> Add iterator() to ListState which returns empty iterator when it has no value



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-7488) TaskManagerHeapSizeCalculationJavaBashTest sometimes fails

2018-03-16 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402299#comment-16402299
 ] 

Ted Yu commented on FLINK-7488:
---

{code}
compareNetworkBufShellScriptWithJava(org.apache.flink.dist.TaskManagerHeapSizeCalculationJavaBashTest)
  Time elapsed: 0.163 sec  <<< FAILURE!
org.junit.ComparisonFailure: Different network buffer memory sizes with 
configuration: {taskmanager.network.memory.fraction=0.1, 
taskmanager.memory.off-heap=false, taskmanager.memory.fraction=0.7, 
taskmanager.memory.size=-1, taskmanager.network.memory.max=1073741824, 
taskmanager.heap.mb=1000, taskmanager.network.memory.min=67108864} 
expected:<[]104857600> but was:<[Setting HADOOP_CONF_DIR=/etc/hadoop/conf 
because no HADOOP_CONF_DIR was set.]104857600>
  at org.junit.Assert.assertEquals(Assert.java:115)
  at 
org.apache.flink.dist.TaskManagerHeapSizeCalculationJavaBashTest.compareNetworkBufJavaVsScript(TaskManagerHeapSizeCalculationJavaBashTest.java:235)
  at 
org.apache.flink.dist.TaskManagerHeapSizeCalculationJavaBashTest.compareNetworkBufShellScriptWithJava(TaskManagerHeapSizeCalculationJavaBashTest.java:81)

compareHeapSizeShellScriptWithJava(org.apache.flink.dist.TaskManagerHeapSizeCalculationJavaBashTest)
  Time elapsed: 0.076 sec  <<< FAILURE!
org.junit.ComparisonFailure: Different heap sizes with configuration: 
{taskmanager.network.memory.fraction=0.1, taskmanager.memory.off-heap=false, 
taskmanager.memory.fraction=0.7, taskmanager.memory.size=-1, 
taskmanager.network.memory.max=1073741824, taskmanager.heap.mb=1000, 
taskmanager.network.memory.min=67108864} expected:<[]900> but was:<[Setting 
HADOOP_CONF_DIR=/etc/hadoop/conf because no HADOOP_CONF_DIR was set.]900>
  at org.junit.Assert.assertEquals(Assert.java:115)
  at 
org.apache.flink.dist.TaskManagerHeapSizeCalculationJavaBashTest.compareHeapSizeJavaVsScript(TaskManagerHeapSizeCalculationJavaBashTest.java:275)
  at 
org.apache.flink.dist.TaskManagerHeapSizeCalculationJavaBashTest.compareHeapSizeShellScriptWithJava(TaskManagerHeapSizeCalculationJavaBashTest.java:110)
{code}
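
A hedged sketch of one way to harden the comparison (the helper is 
illustrative, not the test's actual code): keep only the final line of the 
script output, discarding informational shell lines such as the 
HADOOP_CONF_DIR notice, before asserting.

```java
// Hypothetical helper: the memory size is the last line the script prints,
// so everything before it (e.g. "Setting HADOOP_CONF_DIR=...") is dropped.
private static String extractLastLine(String scriptOutput) {
    String[] lines = scriptOutput.trim().split("\\R");
    return lines[lines.length - 1].trim();
}
```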

> TaskManagerHeapSizeCalculationJavaBashTest sometimes fails
> --
>
> Key: FLINK-7488
> URL: https://issues.apache.org/jira/browse/FLINK-7488
> Project: Flink
>  Issue Type: Test
>  Components: Tests
>Reporter: Ted Yu
>Priority: Minor
>
> {code}
> compareNetworkBufShellScriptWithJava(org.apache.flink.dist.TaskManagerHeapSizeCalculationJavaBashTest)
>   Time elapsed: 0.239 sec  <<< FAILURE!
> org.junit.ComparisonFailure: Different network buffer memory sizes with 
> configuration: {taskmanager.network.memory.fraction=0.1, 
> taskmanager.memory.off-heap=false, taskmanager.memory.fraction=0.7, 
> taskmanager.memory.size=-1, taskmanager.network.memory.max=1073741824, 
> taskmanager.heap.mb=1000, taskmanager.network.memory.min=67108864} 
> expected:<[]104857600> but was:<[Setting HADOOP_CONF_DIR=/etc/hadoop/conf 
> because no HADOOP_CONF_DIR was set.Using the result of 'hadoop classpath' to 
> augment the Hadoop classpath: 
> /usr/hdp/2.5.0.0-1245/hadoop/conf:/usr/hdp/2.5.0.0-1245/hadoop/lib/*:/usr/hdp/2.5.0.0-1245/hadoop/.//*:/usr/hdp/2.5.0.0-1245/hadoop-hdfs/./:/usr/hdp/2.5.0.0-1245/hadoop-hdfs/lib/*:/usr/hdp/2.5.0.0-1245/hadoop-hdfs/.//*:/usr/hdp/2.5.0.0-1245/hadoop-yarn/lib/*:/usr/hdp/2.5.0.0-1245/hadoop-yarn/.//*:/usr/hdp/2.5.0.0-1245/hadoop-mapreduce/lib/*:/usr/hdp/2.5.0.0-1245/hadoop-mapreduce/.//*:/usr/hdp/2.5.0.0-1245/tez/*:/usr/hdp/2.5.0.0-1245/tez/lib/*:/usr/hdp/2.5.0.0-1245/tez/conf]104857600>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at 
> org.apache.flink.dist.TaskManagerHeapSizeCalculationJavaBashTest.compareNetworkBufJavaVsScript(TaskManagerHeapSizeCalculationJavaBashTest.java:235)
>   at 
> org.apache.flink.dist.TaskManagerHeapSizeCalculationJavaBashTest.compareNetworkBufShellScriptWithJava(TaskManagerHeapSizeCalculationJavaBashTest.java:81)
> compareHeapSizeShellScriptWithJava(org.apache.flink.dist.TaskManagerHeapSizeCalculationJavaBashTest)
>   Time elapsed: 0.16 sec  <<< FAILURE!
> org.junit.ComparisonFailure: Different heap sizes with configuration: 
> {taskmanager.network.memory.fraction=0.1, taskmanager.memory.off-heap=false, 
> taskmanager.memory.fraction=0.7, taskmanager.memory.size=-1, 
> taskmanager.network.memory.max=1073741824, taskmanager.heap.mb=1000, 
> taskmanager.network.memory.min=67108864} expected:<[]1000> but was:<[Setting 
> HADOOP_CONF_DIR=/etc/hadoop/conf because no HADOOP_CONF_DIR was set.Using the 
> result of 'hadoop classpath' to augment the Hadoop classpath: 
> 

[jira] [Created] (FLINK-9014) Adapt BackPressureStatsTracker to work with credit-based flow control

2018-03-16 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-9014:
--

 Summary: Adapt BackPressureStatsTracker to work with credit-based 
flow control
 Key: FLINK-9014
 URL: https://issues.apache.org/jira/browse/FLINK-9014
 Project: Flink
  Issue Type: Sub-task
  Components: Network, Webfrontend
Affects Versions: 1.5.0, 1.6.0
Reporter: Nico Kruber
 Fix For: 1.5.0, 1.6.0


The {{BackPressureStatsTracker}} relies on sampling threads being blocked in 
{{LocalBufferPool#requestBufferBuilderBlocking}} to indicate backpressure but 
with credit-based flow control, we are also back-pressured if we did not get 
any credits (yet).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8922) Revert FLINK-8859 because it causes segfaults in testing

2018-03-16 Thread Aljoscha Krettek (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402251#comment-16402251
 ] 

Aljoscha Krettek commented on FLINK-8922:
-

[~sihuazhou] Unfortunately, I don't know why the segfaults occurred.

> Revert FLINK-8859 because it causes segfaults in testing
> 
>
> Key: FLINK-8922
> URL: https://issues.apache.org/jira/browse/FLINK-8922
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.5.0
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
> Fix For: 1.5.0
>
>
> We need to revert FLINK-8859 because it causes problems with RocksDB that 
> make our automated tests fail on Travis. The change actually looks good, and 
> it is currently unclear why it can introduce such a problem. This might 
> also be a bug in RocksDB. Nevertheless, for the sake of proper release 
> testing, we should revert the change for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8402) HadoopS3FileSystemITCase.testDirectoryListing fails on Travis

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402247#comment-16402247
 ] 

ASF GitHub Bot commented on FLINK-8402:
---

Github user NicoK commented on the issue:

https://github.com/apache/flink/pull/5624
  
Unfortunately, this is not entirely under our control, since we rely on the 
underlying `FileSystem` implementation. I can probably reduce some parts of the 
tests that may lead to eventually consistent operations, but looking through 
[S3's data consistency 
model](https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel),
 I figure we still (at least) have to deal with these scenarios:

> - A process writes a new object to Amazon S3 and immediately lists keys 
within its bucket. Until the change is fully propagated, the object might not 
appear in the list. 
> - A process deletes an existing object and immediately lists keys within 
its bucket. Until the deletion is fully propagated, Amazon S3 might list the 
deleted object.

This could, however, be handled on a case-by-case basis, and I could try to 
improve the tests in this regard.
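
As a hedged sketch of such a case-by-case fix (the helper itself is 
illustrative; `FileSystem#listStatus` and `FileStatus#getPath` are Flink's 
file system API): poll the listing until the expected entry appears instead 
of asserting immediately.

```java
import org.apache.flink.core.fs.FileStatus;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

// Poll listStatus() until the expected file shows up, tolerating S3's
// eventually consistent list-after-put behavior; returns false on timeout.
static boolean waitForListing(FileSystem fs, Path dir, Path expected, long timeoutMs)
        throws Exception {
    final long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
        for (FileStatus status : fs.listStatus(dir)) {
            if (expected.equals(status.getPath())) {
                return true;
            }
        }
        Thread.sleep(500L);
    }
    return false;
}
```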


> HadoopS3FileSystemITCase.testDirectoryListing fails on Travis
> -
>
> Key: FLINK-8402
> URL: https://issues.apache.org/jira/browse/FLINK-8402
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.5.0, 1.4.3
>
>
> The test {{HadoopS3FileSystemITCase.testDirectoryListing}} fails on Travis.
> https://travis-ci.org/tillrohrmann/flink/jobs/327021175



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5624: [FLINK-8402][s3][tests] fix hadoop/presto S3 IT cases for...

2018-03-16 Thread NicoK
Github user NicoK commented on the issue:

https://github.com/apache/flink/pull/5624
  
Unfortunately, this is not entirely under our control, since we rely on the 
underlying `FileSystem` implementation. I can probably reduce some parts of the 
tests that may lead to eventually consistent operations, but looking through 
[S3's data consistency 
model](https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel),
 I figure we still (at least) have to deal with these scenarios:

> - A process writes a new object to Amazon S3 and immediately lists keys 
within its bucket. Until the change is fully propagated, the object might not 
appear in the list. 
> - A process deletes an existing object and immediately lists keys within 
its bucket. Until the deletion is fully propagated, Amazon S3 might list the 
deleted object.

This could, however, be handled on a case-by-case basis, and I could try to 
improve the tests in this regard.


---


[jira] [Commented] (FLINK-8948) RescalingITCase fails on Travis

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402229#comment-16402229
 ] 

ASF GitHub Bot commented on FLINK-8948:
---

Github user NicoK commented on the issue:

https://github.com/apache/flink/pull/5710
  
It may be better to make `BufferBuilder#finish()` idempotent.
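
A hedged sketch of the idempotency pattern (a minimal stand-in class, not 
`BufferBuilder`'s actual internals):

```java
// Remember the finished size once; repeated finish() calls become no-ops
// instead of tripping a checkState(!isFinished()) precondition.
class IdempotentBufferBuilder {
    private int writtenBytes;
    private int finishedBytes = -1;

    int finish() {
        if (finishedBytes < 0) {          // first call: mark as finished
            finishedBytes = writtenBytes; // stand-in for the real bookkeeping
        }
        return finishedBytes;             // later calls just return the size
    }
}
```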


> RescalingITCase fails on Travis
> ---
>
> Key: FLINK-8948
> URL: https://issues.apache.org/jira/browse/FLINK-8948
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing, Tests, Travis
>Affects Versions: 1.5.0
>Reporter: Chesnay Schepler
>Assignee: Piotr Nowojski
>Priority: Blocker
> Fix For: 1.5.0
>
>
> https://travis-ci.org/apache/flink/jobs/353468272
> {code}
> testSavepointRescalingInKeyedStateDerivedMaxParallelism[0](org.apache.flink.test.checkpointing.RescalingITCase)
>   Time elapsed: 1.858 sec  <<< ERROR!
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply$mcV$sp(JobManager.scala:891)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:834)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:834)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.IllegalStateException: null
>   at 
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:179)
>   at 
> org.apache.flink.runtime.io.network.buffer.BufferBuilder.finish(BufferBuilder.java:105)
>   at 
> org.apache.flink.runtime.io.network.api.writer.RecordWriter.closeBufferBuilder(RecordWriter.java:218)
>   at 
> org.apache.flink.runtime.io.network.api.writer.RecordWriter.clearBuffers(RecordWriter.java:178)
>   at 
> org.apache.flink.streaming.runtime.io.StreamRecordWriter.close(StreamRecordWriter.java:99)
>   at 
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.close(RecordWriterOutput.java:161)
>   at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain.releaseOutputs(OperatorChain.java:239)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:402)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5710: [FLINK-8948][runtime] Fix IllegalStateException when clos...

2018-03-16 Thread NicoK
Github user NicoK commented on the issue:

https://github.com/apache/flink/pull/5710
  
It may be better to make `BufferBuilder#finish()` idempotent.


---


[jira] [Commented] (FLINK-8945) Allow customization of the KinesisProxy

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402218#comment-16402218
 ] 

ASF GitHub Bot commented on FLINK-8945:
---

Github user tweise commented on the issue:

https://github.com/apache/flink/pull/5698
  
@StephanEwen thanks for unblocking our work!


> Allow customization of the KinesisProxy
> ---
>
> Key: FLINK-8945
> URL: https://issues.apache.org/jira/browse/FLINK-8945
> Project: Flink
>  Issue Type: Improvement
>Reporter: Kailash Hassan Dayanand
>Assignee: Kailash Hassan Dayanand
>Priority: Minor
> Fix For: 1.5.0
>
>
> Currently the KinesisProxy interface here:
> [https://github.com/apache/flink/blob/310f3de62e52f1f977c217d918cc5aac79b87277/flink-connectors/flink-connector-kinesis/src/main/java/org/apache/flink/streaming/connectors/kinesis/proxy/KinesisProxy.java#L125]
> has a private constructor. This restricts extending the class and prevents 
> customizations on shard discovery. I am proposing to change this to protected.
> While the creating a new implementation of KinesisProxyInterface is possible, 
> I would like to continue to use implementation of getRecords and 
> getShardIterator.
> This will be a temporary workaround till FLINK-8944 is submitted. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5698: [FLINK-8945] [kinesis] Allow customization of KinesisProx...

2018-03-16 Thread tweise
Github user tweise commented on the issue:

https://github.com/apache/flink/pull/5698
  
@StephanEwen thanks for unblocking our work!


---


[jira] [Commented] (FLINK-8915) CheckpointingStatisticsHandler fails to return PendingCheckpointStats

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402053#comment-16402053
 ] 

ASF GitHub Bot commented on FLINK-8915:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5703#discussion_r175125460
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/rest/messages/checkpoints/CheckpointStatistics.java
 ---
@@ -438,4 +453,58 @@ public int hashCode() {
return Objects.hash(super.hashCode(), failureTimestamp, 
failureMessage);
}
}
+
+   /**
+* Statistics for a pending checkpoint.
+*/
+   public static final class PendingCheckpointStatistics extends 
CheckpointStatistics {
--- End diff --

Ser/des of this class should be tested in `CheckpointingStatisticsTest`.
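
A hedged sketch of such a ser/des round trip with plain Jackson (Flink's REST 
tests have their own marshalling test base; `pendingStatistics` is an assumed 
fixture, and this only shows the shape of the check):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

@Test
public void testPendingCheckpointStatisticsRoundTrip() throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    // Serialize the pending statistics, then read them back through the base
    // class to exercise the polymorphic type resolution.
    String json = mapper.writeValueAsString(pendingStatistics);
    CheckpointStatistics restored = mapper.readValue(json, CheckpointStatistics.class);
    assertEquals(pendingStatistics, restored);
}
```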


> CheckpointingStatisticsHandler fails to return PendingCheckpointStats 
> --
>
> Key: FLINK-8915
> URL: https://issues.apache.org/jira/browse/FLINK-8915
> Project: Flink
>  Issue Type: Bug
>  Components: REST
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Gary Yao
>Assignee: vinoyang
>Priority: Blocker
>  Labels: flip6
> Fix For: 1.5.0
>
>
> {noformat}
> 2018-03-10 21:47:52,487 ERROR 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler
>   - Implementation error: Unhandled exception.
> java.lang.IllegalArgumentException: Given checkpoint stats object of type 
> org.apache.flink.runtime.checkpoint.PendingCheckpointStats cannot be 
> converted.
>   at 
> org.apache.flink.runtime.rest.messages.checkpoints.CheckpointStatistics.generateCheckpointStatistics(CheckpointStatistics.java:276)
>   at 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleRequest(CheckpointingStatisticsHandler.java:146)
>   at 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleRequest(CheckpointingStatisticsHandler.java:54)
>   at 
> org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$0(AbstractExecutionGraphHandler.java:81)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>   at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>   at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8915) CheckpointingStatisticsHandler fails to return PendingCheckpointStats

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402055#comment-16402055
 ] 

ASF GitHub Bot commented on FLINK-8915:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5703#discussion_r175126066
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/rest/messages/checkpoints/CheckpointStatistics.java
 ---
@@ -273,7 +274,21 @@ public static CheckpointStatistics 
generateCheckpointStatistics(AbstractCheckpoi
failedCheckpointStats.getFailureTimestamp(),
failedCheckpointStats.getFailureMessage());
} else {
-   throw new IllegalArgumentException("Given checkpoint 
stats object of type " + checkpointStats.getClass().getName() + " cannot be 
converted.");
+   final PendingCheckpointStats pendingCheckpointStats = 
((PendingCheckpointStats) checkpointStats);
--- End diff --

The original `else` block should be kept in case the class hierarchy is 
extended.
```
else if (checkpointStats instanceof PendingCheckpointStats) {
...
} else {
throw new IllegalArgumentException("Given checkpoint stats object of 
type " + checkpointStats.getClass().getName() + " cannot be converted.");
}
```


> CheckpointingStatisticsHandler fails to return PendingCheckpointStats 
> --
>
> Key: FLINK-8915
> URL: https://issues.apache.org/jira/browse/FLINK-8915
> Project: Flink
>  Issue Type: Bug
>  Components: REST
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Gary Yao
>Assignee: vinoyang
>Priority: Blocker
>  Labels: flip6
> Fix For: 1.5.0
>
>
> {noformat}
> 2018-03-10 21:47:52,487 ERROR 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler
>   - Implementation error: Unhandled exception.
> java.lang.IllegalArgumentException: Given checkpoint stats object of type 
> org.apache.flink.runtime.checkpoint.PendingCheckpointStats cannot be 
> converted.
>   at 
> org.apache.flink.runtime.rest.messages.checkpoints.CheckpointStatistics.generateCheckpointStatistics(CheckpointStatistics.java:276)
>   at 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleRequest(CheckpointingStatisticsHandler.java:146)
>   at 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleRequest(CheckpointingStatisticsHandler.java:54)
>   at 
> org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$0(AbstractExecutionGraphHandler.java:81)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>   at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>   at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8915) CheckpointingStatisticsHandler fails to return PendingCheckpointStats

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402054#comment-16402054
 ] 

ASF GitHub Bot commented on FLINK-8915:
---

Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5703#discussion_r175126334
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/rest/messages/checkpoints/CheckpointStatistics.java
 ---
@@ -438,4 +453,58 @@ public int hashCode() {
return Objects.hash(super.hashCode(), failureTimestamp, 
failureMessage);
}
}
+
+   /**
+* Statistics for a pending checkpoint.
+*/
+   public static final class PendingCheckpointStatistics extends 
CheckpointStatistics {
--- End diff --

`@JsonSubTypes.Type(value = 
CheckpointStatistics.PendingCheckpointStatistics.class, name = "in_progress")` 
should be added to `CheckpointStatistics` otherwise deserialization will fail.
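
A hedged Jackson sketch of that registration (the type-info setup and the 
"completed"/"failed" names for the existing subtypes are assumptions for 
illustration; only the `in_progress` mapping comes from this comment):

```java
import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

// Registering the new subtype on the base class lets Jackson resolve it when
// deserializing; without it, reading an "in_progress" payload fails.
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, include = JsonTypeInfo.As.PROPERTY, property = "className")
@JsonSubTypes({
    @JsonSubTypes.Type(value = CheckpointStatistics.CompletedCheckpointStatistics.class, name = "completed"),
    @JsonSubTypes.Type(value = CheckpointStatistics.FailedCheckpointStatistics.class, name = "failed"),
    @JsonSubTypes.Type(value = CheckpointStatistics.PendingCheckpointStatistics.class, name = "in_progress")
})
public abstract class CheckpointStatistics implements ResponseBody { /* ... */ }
```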


> CheckpointingStatisticsHandler fails to return PendingCheckpointStats 
> --
>
> Key: FLINK-8915
> URL: https://issues.apache.org/jira/browse/FLINK-8915
> Project: Flink
>  Issue Type: Bug
>  Components: REST
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Gary Yao
>Assignee: vinoyang
>Priority: Blocker
>  Labels: flip6
> Fix For: 1.5.0
>
>
> {noformat}
> 2018-03-10 21:47:52,487 ERROR 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler
>   - Implementation error: Unhandled exception.
> java.lang.IllegalArgumentException: Given checkpoint stats object of type 
> org.apache.flink.runtime.checkpoint.PendingCheckpointStats cannot be 
> converted.
>   at 
> org.apache.flink.runtime.rest.messages.checkpoints.CheckpointStatistics.generateCheckpointStatistics(CheckpointStatistics.java:276)
>   at 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleRequest(CheckpointingStatisticsHandler.java:146)
>   at 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleRequest(CheckpointingStatisticsHandler.java:54)
>   at 
> org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$0(AbstractExecutionGraphHandler.java:81)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>   at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>   at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink pull request #5703: [FLINK-8915] CheckpointingStatisticsHandler fails ...

2018-03-16 Thread GJL
Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5703#discussion_r175125460
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/rest/messages/checkpoints/CheckpointStatistics.java
 ---
@@ -438,4 +453,58 @@ public int hashCode() {
return Objects.hash(super.hashCode(), failureTimestamp, 
failureMessage);
}
}
+
+   /**
+* Statistics for a pending checkpoint.
+*/
+   public static final class PendingCheckpointStatistics extends 
CheckpointStatistics {
--- End diff --

Ser/des of this class should be tested in `CheckpointingStatisticsTest`.


---


[GitHub] flink pull request #5703: [FLINK-8915] CheckpointingStatisticsHandler fails ...

2018-03-16 Thread GJL
Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5703#discussion_r175126334
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/rest/messages/checkpoints/CheckpointStatistics.java
 ---
@@ -438,4 +453,58 @@ public int hashCode() {
return Objects.hash(super.hashCode(), failureTimestamp, 
failureMessage);
}
}
+
+   /**
+* Statistics for a pending checkpoint.
+*/
+   public static final class PendingCheckpointStatistics extends 
CheckpointStatistics {
--- End diff --

`@JsonSubTypes.Type(value = 
CheckpointStatistics.PendingCheckpointStatistics.class, name = "in_progress")` 
should be added to `CheckpointStatistics` otherwise deserialization will fail.


---


[GitHub] flink pull request #5703: [FLINK-8915] CheckpointingStatisticsHandler fails ...

2018-03-16 Thread GJL
Github user GJL commented on a diff in the pull request:

https://github.com/apache/flink/pull/5703#discussion_r175126066
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/rest/messages/checkpoints/CheckpointStatistics.java
 ---
@@ -273,7 +274,21 @@ public static CheckpointStatistics 
generateCheckpointStatistics(AbstractCheckpoi
failedCheckpointStats.getFailureTimestamp(),
failedCheckpointStats.getFailureMessage());
} else {
-   throw new IllegalArgumentException("Given checkpoint 
stats object of type " + checkpointStats.getClass().getName() + " cannot be 
converted.");
+   final PendingCheckpointStats pendingCheckpointStats = 
((PendingCheckpointStats) checkpointStats);
--- End diff --

The original `else` block should be kept in case the class hierarchy is 
extended.
```
else if (checkpointStats instanceof PendingCheckpointStats) {
...
} else {
throw new IllegalArgumentException("Given checkpoint stats object of 
type " + checkpointStats.getClass().getName() + " cannot be converted.");
}
```


---


[jira] [Created] (FLINK-9013) Document yarn.containers.vcores only being effective when adapting YARN config

2018-03-16 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-9013:
--

 Summary: Document yarn.containers.vcores only being effective when 
adapting YARN config
 Key: FLINK-9013
 URL: https://issues.apache.org/jira/browse/FLINK-9013
 Project: Flink
  Issue Type: Improvement
  Components: Documentation, YARN
Affects Versions: 1.5.0, 1.6.0
Reporter: Nico Kruber
Assignee: Nico Kruber
 Fix For: 1.5.0, 1.6.0


Even after specifying {{yarn.containers.vcores}} and having Flink request such 
a container from YARN, YARN may not take the vcores into account at all and 
may return a container with 1 vcore.

The YARN configuration needs to be adapted to take the vcores into account, 
e.g. by setting the {{FairScheduler}} in {{yarn-site.xml}}:
{code}
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
{code}

This fact should be documented at least at the configuration parameter 
documentation of  {{yarn.containers.vcores}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8968) Fix native resource leak caused by ReadOptions

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402031#comment-16402031
 ] 

ASF GitHub Bot commented on FLINK-8968:
---

Github user sihuazhou commented on the issue:

https://github.com/apache/flink/pull/5705
  
@StefanRRichter I have updated the code, could you please have a look...


> Fix native resource leak caused by ReadOptions 
> ---
>
> Key: FLINK-8968
> URL: https://issues.apache.org/jira/browse/FLINK-8968
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.5.0
>Reporter: Sihua Zhou
>Assignee: Sihua Zhou
>Priority: Major
> Fix For: 1.5.0
>
>
> We should pull the creation of ReadOptions out of the loop in 
> {{RocksDBFullSnapshotOperation.writeKVStateMetaData()}}.
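
A hedged sketch of the fix pattern (method and variable names are 
illustrative; only the ReadOptions-outside-the-loop idea is the point):

{code}
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.Snapshot;

class SharedReadOptionsSketch {
    // One ReadOptions per snapshot operation, created before the per-state
    // loop and released afterwards, instead of one leaked native object per
    // iteration.
    static void writeKvStateMetaData(RocksDB db, int numKvStates) {
        Snapshot snapshot = db.getSnapshot();
        try (ReadOptions readOptions = new ReadOptions().setSnapshot(snapshot)) {
            for (int i = 0; i < numKvStates; i++) {
                // create each per-state iterator with the shared readOptions,
                // e.g. db.newIterator(columnFamilyHandle, readOptions)
            }
        } finally {
            db.releaseSnapshot(snapshot);
        }
    }
}
{code}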



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5705: [FLINK-8968][state]Fix native resource leak caused by Rea...

2018-03-16 Thread sihuazhou
Github user sihuazhou commented on the issue:

https://github.com/apache/flink/pull/5705
  
@StefanRRichter I have updated the code, could you please have a look...


---


[jira] [Commented] (FLINK-8976) End-to-end test: Resume with different parallelism

2018-03-16 Thread Sihua Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402023#comment-16402023
 ] 

Sihua Zhou commented on FLINK-8976:
---

Hi [~till.rohrmann], could I ask why this issue won't work with RocksDB 
incremental checkpoints?

> End-to-end test: Resume with different parallelism
> --
>
> Key: FLINK-8976
> URL: https://issues.apache.org/jira/browse/FLINK-8976
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Assignee: Tzu-Li (Gordon) Tai
>Priority: Blocker
> Fix For: 1.5.0
>
>
> Similar to FLINK-8975, we should have an end-to-end test which resumes a job 
> with a different parallelism after taking 
> a) a savepoint
> b) from the last retained checkpoint (this won't work with RocksDB 
> incremental checkpoints) 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-9010) NoResourceAvailableException with FLIP-6

2018-03-16 Thread Nico Kruber (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber reassigned FLINK-9010:
--

Assignee: Nico Kruber

> NoResourceAvailableException with FLIP-6 
> -
>
> Key: FLINK-9010
> URL: https://issues.apache.org/jira/browse/FLINK-9010
> Project: Flink
>  Issue Type: Bug
>  Components: ResourceManager
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0, 1.6.0
>
>
> I was trying to run a bigger program with 400 slots (100 TMs, 2 slots each) 
> in FLIP-6 mode with a checkpointing interval of 1000 ms and got the following 
> exception:
> {code}
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000101 - Remaining pending container 
> requests: 302
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000101 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,155 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
>  to 
> hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
> 2018-03-16 03:41:20,165 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Prepared local resource for modified yaml: resource { scheme: 
> "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
> "/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml"
>  } size: 595 timestamp: 1521171680164 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Starting TaskManagers with command: $JAVA_HOME/bin/java 
> -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m  
> -Dlog.file=/taskmanager.log 
> -Dlogback.configurationFile=file:./logback.xml 
> -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> 
> /taskmanager.out 2> /taskmanager.err
> 2018-03-16 03:41:20,176 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Opening proxy : ip-172-31-3-221.eu-west-1.compute.internal:8041
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000102 - Remaining pending container 
> requests: 301
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000102 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
>  to 
> 

[jira] [Commented] (FLINK-8948) RescalingITCase fails on Travis

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402014#comment-16402014
 ] 

ASF GitHub Bot commented on FLINK-8948:
---

GitHub user pnowojski opened a pull request:

https://github.com/apache/flink/pull/5710

[FLINK-8948][runtime] Fix IllegalStateException when closing StreamTask

All methods in 
org.apache.flink.streaming.runtime.tasks.OperatorChain#releaseOutputs shouldn't
throw any exceptions and should be able to release resources after 
interruption of the task's
thread.
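
A hedged Java sketch of the exception-safe release pattern described above 
(the method shape is illustrative, not the exact change): each output is 
closed independently, so one failing close() cannot prevent releasing the 
remaining outputs.

```java
import org.apache.flink.streaming.runtime.io.RecordWriterOutput;

class ReleaseSketch {
    // Close every output even if an earlier close() throws, e.g. after the
    // task's thread was interrupted; release must never propagate exceptions.
    static void releaseOutputs(Iterable<RecordWriterOutput<?>> streamOutputs) {
        for (RecordWriterOutput<?> output : streamOutputs) {
            try {
                output.close();
            } catch (Exception e) {
                // swallow (or log) and keep releasing the rest
            }
        }
    }
}
```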

## Verifying this change

No tests; this is a rare concurrency bug that requires a ~30 s sleep/freeze 
during task cancellation :( RescalingITCase could trigger this bug from time 
to time, as reported in the JIRA ticket.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (yes / **no**)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
  - The serializers: (yes / **no** / don't know)
  - The runtime per-record code paths (performance sensitive): (yes / 
**no** / don't know)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
  - The S3 file system connector: (yes / **no** / don't know)

## Documentation

  - Does this pull request introduce a new feature? (yes / **no**)
  - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pnowojski/flink f8948

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5710.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5710


commit bb97cf042be75c69dce41f1a76d4ca252d16f059
Author: Piotr Nowojski 
Date:   2018-03-16T14:56:07Z

[FLINK-8948][runtime] Fix IllegalStateException when closing StreamTask

All methods in 
org.apache.flink.streaming.runtime.tasks.OperatorChain#releaseOutputs shouldn't
throw any exceptions and should be able to release resources after 
interruption of the task's
thread.




> RescalingITCase fails on Travis
> ---
>
> Key: FLINK-8948
> URL: https://issues.apache.org/jira/browse/FLINK-8948
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing, Tests, Travis
>Affects Versions: 1.5.0
>Reporter: Chesnay Schepler
>Assignee: Piotr Nowojski
>Priority: Blocker
> Fix For: 1.5.0
>
>
> https://travis-ci.org/apache/flink/jobs/353468272
> {code}
> testSavepointRescalingInKeyedStateDerivedMaxParallelism[0](org.apache.flink.test.checkpointing.RescalingITCase)
>   Time elapsed: 1.858 sec  <<< ERROR!
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply$mcV$sp(JobManager.scala:891)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:834)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:834)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.IllegalStateException: null
>   at 
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:179)
>   at 
> org.apache.flink.runtime.io.network.buffer.BufferBuilder.finish(BufferBuilder.java:105)
>   at 
> org.apache.flink.runtime.io.network.api.writer.RecordWriter.closeBufferBuilder(RecordWriter.java:218)
>   at 
> org.apache.flink.runtime.io.network.api.writer.RecordWriter.clearBuffers(RecordWriter.java:178)
>   at 
> 

[GitHub] flink pull request #5710: [FLINK-8948][runtime] Fix IllegalStateException wh...

2018-03-16 Thread pnowojski
GitHub user pnowojski opened a pull request:

https://github.com/apache/flink/pull/5710

[FLINK-8948][runtime] Fix IllegalStateException when closing StreamTask

All methods in 
org.apache.flink.streaming.runtime.tasks.OperatorChain#releaseOutputs shouldn't
throw any exceptions and should be able to release resources after 
interruption of the task's
thread.

## Verifying this change

No tests; this is a rare concurrency bug that requires a ~30 s sleep/freeze 
during task cancellation :( RescalingITCase could trigger this bug from time 
to time, as reported in the JIRA ticket.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (yes / **no**)
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
  - The serializers: (yes / **no** / don't know)
  - The runtime per-record code paths (performance sensitive): (yes / 
**no** / don't know)
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
  - The S3 file system connector: (yes / **no** / don't know)

## Documentation

  - Does this pull request introduce a new feature? (yes / **no**)
  - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pnowojski/flink f8948

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5710.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5710


commit bb97cf042be75c69dce41f1a76d4ca252d16f059
Author: Piotr Nowojski 
Date:   2018-03-16T14:56:07Z

[FLINK-8948][runtime] Fix IllegalStateException when closing StreamTask

All methods in 
org.apache.flink.streaming.runtime.tasks.OperatorChain#releaseOutputs shouldn't
throw any exceptions and should be able to release resources after 
interruption of the task's
thread.




---


[jira] [Created] (FLINK-9012) Shaded Hadoop S3A end-to-end test failing with S3 bucket in Ireland

2018-03-16 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-9012:
--

 Summary: Shaded Hadoop S3A end-to-end test failing with S3 bucket 
in Ireland
 Key: FLINK-9012
 URL: https://issues.apache.org/jira/browse/FLINK-9012
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.5.0, 1.6.0
Reporter: Nico Kruber
Assignee: Nico Kruber
 Fix For: 1.5.0, 1.6.0


https://api.travis-ci.org/v3/job/354259892/log.txt

{code}
Found AWS bucket [secure], running Shaded Hadoop S3A e2e tests.
Flink dist directory: /home/travis/build/NicoK/flink/build-target
TEST_DATA_DIR: 
/home/travis/build/NicoK/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-05775180416
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0 00 00 0  0  0 --:--:-- --:--:-- --:--:-- 0
  0 00 00 0  0  0 --:--:-- --:--:-- --:--:-- 0
 91   4930   4490 0   2476  0 --:--:-- --:--:-- --:--:--  2467

TemporaryRedirect: Please re-send this request to the specified temporary 
endpoint. Continue to use the original request endpoint for future requests. 
Bucket: [secure], Endpoint: [secure].s3-eu-west-1.amazonaws.com, RequestId: 
1FCEC82C3EBF7C7E, HostId: 
NG5dxnXQ0Y5mK2X/m3bU+Z7Fqt0mNVL2JsjyVjGZUmpDmNuBDfKJACh7VI6tCTYEZsw65W057lA=
Starting cluster.
Starting standalonesession daemon on host 
travis-job-087822e3-2f4c-46b7-b9bd-b6d4aba6136c.
Starting taskexecutor daemon on host 
travis-job-087822e3-2f4c-46b7-b9bd-b6d4aba6136c.
Dispatcher/TaskManagers are not yet up
Waiting for dispatcher REST endpoint to come up...
Dispatcher/TaskManagers are not yet up
Waiting for dispatcher REST endpoint to come up...
Dispatcher/TaskManagers are not yet up
Waiting for dispatcher REST endpoint to come up...
Dispatcher/TaskManagers are not yet up
Waiting for dispatcher REST endpoint to come up...
Dispatcher/TaskManagers are not yet up
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Dispatcher REST endpoint is up.
Starting execution of program


 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve 
the execution result.
at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:246)
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:458)
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:446)
at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
at 
org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:86)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:398)
at 
org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:780)
at 
org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:274)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:209)
at 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1019)
at 
org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1095)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1095)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
submit JobGraph.
at 
org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$4(RestClusterClient.java:341)
at 
java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at 
java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
at 

[jira] [Updated] (FLINK-9010) NoResourceAvailableException with FLIP-6

2018-03-16 Thread Nico Kruber (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-9010:
---
Issue Type: Bug  (was: Improvement)

> NoResourceAvailableException with FLIP-6 
> -
>
> Key: FLINK-9010
> URL: https://issues.apache.org/jira/browse/FLINK-9010
> Project: Flink
>  Issue Type: Bug
>  Components: ResourceManager
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Nico Kruber
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0, 1.6.0
>
>
> I was trying to run a bigger program with 400 slots (100 TMs, 2 slots each) 
> with FLIP-6 mode and a checkpointing interval of 1000 and got the following 
> exception:
> {code}
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000101 - Remaining pending container 
> requests: 302
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000101 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,155 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
>  to 
> hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
> 2018-03-16 03:41:20,165 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Prepared local resource for modified yaml: resource { scheme: 
> "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
> "/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml"
>  } size: 595 timestamp: 1521171680164 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Starting TaskManagers with command: $JAVA_HOME/bin/java 
> -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m  
> -Dlog.file=/taskmanager.log 
> -Dlogback.configurationFile=file:./logback.xml 
> -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> 
> /taskmanager.out 2> /taskmanager.err
> 2018-03-16 03:41:20,176 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Opening proxy : ip-172-31-3-221.eu-west-1.compute.internal:8041
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000102 - Remaining pending container 
> requests: 301
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000102 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
>  to 
> 

[jira] [Created] (FLINK-9011) YarnResourceManager spamming log file at INFO level

2018-03-16 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-9011:
--

 Summary: YarnResourceManager spamming log file at INFO level
 Key: FLINK-9011
 URL: https://issues.apache.org/jira/browse/FLINK-9011
 Project: Flink
  Issue Type: Bug
  Components: ResourceManager, YARN
Affects Versions: 1.5.0, 1.6.0
Reporter: Nico Kruber
 Fix For: 1.5.0, 1.6.0


For every requested resource, the {{YarnResourceManager}} spams the log with 
log-level INFO and the following messages:

{code}
2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager 
- Received new container: container_1521038088305_0257_01_000102 - 
Remaining pending container requests: 301
2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager 
- TaskExecutor container_1521038088305_0257_01_000102 will be 
started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
limit 3072 MB
2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager 
- TM:remote keytab path obtained null
2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager 
- TM:remote keytab principal obtained null
2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager 
- TM:remote yarn conf path obtained null
2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager 
- TM:remote krb5 path obtained null
2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils   
- Copying from 
file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
 to 
hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
2018-03-16 03:41:20,190 INFO  org.apache.flink.yarn.YarnResourceManager 
- Prepared local resource for modified yaml: resource { scheme: 
"hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
"/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml"
 } size: 595 timestamp: 1521171680190 type: FILE visibility: APPLICATION
2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager 
- Creating container launch context for TaskManagers
2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager 
- Starting TaskManagers with command: $JAVA_HOME/bin/java -Xms5120m 
-Xmx5120m -XX:MaxDirectMemorySize=3072m  -Dlog.file=/taskmanager.log 
-Dlogback.configurationFile=file:./logback.xml 
-Dlog4j.configuration=file:./log4j.properties 
org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> 
/taskmanager.out 2> /taskmanager.err
{code}
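
A minimal sketch of the obvious remedy (an assumption on my part, not the 
actual fix): demote the per-container details from INFO to DEBUG so that INFO 
stays readable for jobs with hundreds of containers.

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: per-container details at DEBUG instead of INFO.
public class ContainerLogging {
    private static final Logger LOG = LoggerFactory.getLogger(ContainerLogging.class);

    void onContainerAllocated(String containerId, int remainingRequests) {
        // Parameterized logging: the message is only formatted when DEBUG is on.
        LOG.debug("Received new container: {} - remaining pending container requests: {}",
                containerId, remainingRequests);
    }
}
{code}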



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9011) YarnResourceManager spamming log file at INFO level

2018-03-16 Thread Nico Kruber (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-9011:
---
Labels: flip-6  (was: )

> YarnResourceManager spamming log file at INFO level
> ---
>
> Key: FLINK-9011
> URL: https://issues.apache.org/jira/browse/FLINK-9011
> Project: Flink
>  Issue Type: Bug
>  Components: ResourceManager, YARN
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Nico Kruber
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0, 1.6.0
>
>
> For every requested resource, the {{YarnResourceManager}} spams the log with 
> log-level INFO and the following messages:
> {code}
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000102 - Remaining pending container 
> requests: 301
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000102 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
>  to 
> hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
> 2018-03-16 03:41:20,190 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Prepared local resource for modified yaml: resource { scheme: 
> "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
> "/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml"
>  } size: 595 timestamp: 1521171680190 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,194 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Starting TaskManagers with command: $JAVA_HOME/bin/java 
> -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m  
> -Dlog.file=/taskmanager.log 
> -Dlogback.configurationFile=file:./logback.xml 
> -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> 
> /taskmanager.out 2> /taskmanager.err
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9010) NoResourceAvailableException with FLIP-6

2018-03-16 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-9010:
--

 Summary: NoResourceAvailableException with FLIP-6 
 Key: FLINK-9010
 URL: https://issues.apache.org/jira/browse/FLINK-9010
 Project: Flink
  Issue Type: Improvement
  Components: ResourceManager
Affects Versions: 1.5.0, 1.6.0
Reporter: Nico Kruber
 Fix For: 1.5.0, 1.6.0


I was trying to run a bigger program with 400 slots (100 TMs, 2 slots each) 
with FLIP-6 mode and a checkpointing interval of 1000 and got the following 
exception:

{code}
2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager 
- Received new container: container_1521038088305_0257_01_000101 - 
Remaining pending container requests: 302
2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager 
- TaskExecutor container_1521038088305_0257_01_000101 will be 
started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
limit 3072 MB
2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager 
- TM:remote keytab path obtained null
2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager 
- TM:remote keytab principal obtained null
2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager 
- TM:remote yarn conf path obtained null
2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager 
- TM:remote krb5 path obtained null
2018-03-16 03:41:20,155 INFO  org.apache.flink.yarn.Utils   
- Copying from 
file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
 to 
hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
2018-03-16 03:41:20,165 INFO  org.apache.flink.yarn.YarnResourceManager 
- Prepared local resource for modified yaml: resource { scheme: 
"hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
"/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml"
 } size: 595 timestamp: 1521171680164 type: FILE visibility: APPLICATION
2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager 
- Creating container launch context for TaskManagers
2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager 
- Starting TaskManagers with command: $JAVA_HOME/bin/java -Xms5120m 
-Xmx5120m -XX:MaxDirectMemorySize=3072m  -Dlog.file=/taskmanager.log 
-Dlogback.configurationFile=file:./logback.xml 
-Dlog4j.configuration=file:./log4j.properties 
org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> 
/taskmanager.out 2> /taskmanager.err
2018-03-16 03:41:20,176 INFO  
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
Opening proxy : ip-172-31-3-221.eu-west-1.compute.internal:8041
2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager 
- Received new container: container_1521038088305_0257_01_000102 - 
Remaining pending container requests: 301
2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager 
- TaskExecutor container_1521038088305_0257_01_000102 will be 
started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
limit 3072 MB
2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager 
- TM:remote keytab path obtained null
2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager 
- TM:remote keytab principal obtained null
2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager 
- TM:remote yarn conf path obtained null
2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager 
- TM:remote krb5 path obtained null
2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils   
- Copying from 
file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
 to 
hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
2018-03-16 03:41:20,190 INFO  org.apache.flink.yarn.YarnResourceManager 
- Prepared local resource for modified yaml: resource { scheme: 
"hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
"/user/hadoop/.flink/application_1521038088305_0257/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml"
 } size: 595 

[jira] [Updated] (FLINK-9010) NoResourceAvailableException with FLIP-6

2018-03-16 Thread Nico Kruber (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-9010:
---
Labels: flip-6  (was: )

> NoResourceAvailableException with FLIP-6 
> -
>
> Key: FLINK-9010
> URL: https://issues.apache.org/jira/browse/FLINK-9010
> Project: Flink
>  Issue Type: Improvement
>  Components: ResourceManager
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Nico Kruber
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0, 1.6.0
>
>
> I was trying to run a bigger program with 400 slots (100 TMs, 2 slots each) 
> with FLIP-6 mode and a checkpointing interval of 1000 and got the following 
> exception:
> {code}
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000101 - Remaining pending container 
> requests: 302
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000101 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,155 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
>  to 
> hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
> 2018-03-16 03:41:20,165 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Prepared local resource for modified yaml: resource { scheme: 
> "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
> "/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml"
>  } size: 595 timestamp: 1521171680164 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Starting TaskManagers with command: $JAVA_HOME/bin/java 
> -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m  
> -Dlog.file=/taskmanager.log 
> -Dlogback.configurationFile=file:./logback.xml 
> -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> 
> /taskmanager.out 2> /taskmanager.err
> 2018-03-16 03:41:20,176 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Opening proxy : ip-172-31-3-221.eu-west-1.compute.internal:8041
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000102 - Remaining pending container 
> requests: 301
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000102 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
>  to 
> 

[jira] [Commented] (FLINK-8968) Fix native resource leak caused by ReadOptions

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401994#comment-16401994
 ] 

ASF GitHub Bot commented on FLINK-8968:
---

Github user StefanRRichter commented on the issue:

https://github.com/apache/flink/pull/5705
  
You can cover it here and I will review it again  


> Fix native resource leak caused by ReadOptions 
> ---
>
> Key: FLINK-8968
> URL: https://issues.apache.org/jira/browse/FLINK-8968
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.5.0
>Reporter: Sihua Zhou
>Assignee: Sihua Zhou
>Priority: Major
> Fix For: 1.5.0
>
>
> We should pull the creation of ReadOptions out of the loop in 
> {{RocksDBFullSnapshotOperation.writeKVStateMetaData()}}.
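
For illustration, a hedged sketch of what pulling the creation out of the loop 
amounts to; {{ReadOptions}} is backed by native memory and must be closed 
explicitly (the class and method below are stand-ins, not the actual snapshot 
code):

{code:java}
import org.rocksdb.ReadOptions;

public class ReadOptionsSketch {

    void writeKVStateMetaData(int numKvStates) {
        // Leaky variant: 'new ReadOptions()' inside the loop, never closed,
        // leaks one native handle per registered state.
        try (ReadOptions readOptions = new ReadOptions()) {
            for (int i = 0; i < numKvStates; i++) {
                // reuse the single readOptions for each state's iterator
            }
        } // try-with-resources frees the native memory exactly once
    }
}
{code}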



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5705: [FLINK-8968][state]Fix native resource leak caused by Rea...

2018-03-16 Thread StefanRRichter
Github user StefanRRichter commented on the issue:

https://github.com/apache/flink/pull/5705
  
You can cover it here and I will review it again 👍 


---


[jira] [Commented] (FLINK-8872) Yarn detached mode via -yd does not detach

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401983#comment-16401983
 ] 

ASF GitHub Bot commented on FLINK-8872:
---

Github user NicoK commented on the issue:

https://github.com/apache/flink/pull/5672
  
thanks for the review, I also did not like the side-effect approach and 
after thinking a bit about your first message, I independently came up with the 
same thing as you proposed in the second one :p
-> rebased onto latest #5671 and added a fixup commit with that change


> Yarn detached mode via -yd does not detach
> --
>
> Key: FLINK-8872
> URL: https://issues.apache.org/jira/browse/FLINK-8872
> Project: Flink
>  Issue Type: Bug
>  Components: Client, YARN
>Affects Versions: 1.5.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: flip-6
> Fix For: 1.5.0
>
>
> Running yarn per-job cluster in detached mode currently does not work and 
> waits for the job to finish.
> Example:
> {code}
> ./bin/flink run -m yarn-cluster -yn 10 -yjm 768 -ytm 3072 -ys 2 -yd -p 20 -c 
> org.apache.flink.streaming.examples.wordcount.WordCount 
> ./examples/streaming/WordCount.jar --input
> {code}
> Output in case of an infinite program would then end with something like this:
> {code}
> 2018-03-05 13:41:23,311 INFO  
> org.apache.flink.yarn.AbstractYarnClusterDescriptor   - Waiting for 
> the cluster to be allocated
> 2018-03-05 13:41:23,313 INFO  
> org.apache.flink.yarn.AbstractYarnClusterDescriptor   - Deploying 
> cluster, current state ACCEPTED
> 2018-03-05 13:41:28,342 INFO  
> org.apache.flink.yarn.AbstractYarnClusterDescriptor   - YARN 
> application has been deployed successfully.
> 2018-03-05 13:41:28,343 INFO  
> org.apache.flink.yarn.AbstractYarnClusterDescriptor   - The Flink 
> YARN client has been started in detached mode. In order to stop Flink on 
> YARN, use the following command or a YARN web interface to stop it:
> yarn application -kill application_1519984124671_0006
> Please also note that the temporary files of the YARN session in the home 
> directoy will not be removed.
> Starting execution of program
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5672: [FLINK-8872][flip6] fix yarn detached mode command parsin...

2018-03-16 Thread NicoK
Github user NicoK commented on the issue:

https://github.com/apache/flink/pull/5672
  
thanks for the review, I also did not like the side-effect approach and 
after thinking a bit about your first message, I independently came up with the 
same thing as you proposed in the second one :p
-> rebased onto latest #5671 and added a fixup commit with that change


---


[jira] [Commented] (FLINK-8073) Test instability FlinkKafkaProducer011ITCase.testScaleDownBeforeFirstCheckpoint()

2018-03-16 Thread Gary Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401958#comment-16401958
 ] 

Gary Yao commented on FLINK-8073:
-

Another instance: https://travis-ci.org/apache/flink/jobs/354290449

> Test instability 
> FlinkKafkaProducer011ITCase.testScaleDownBeforeFirstCheckpoint()
> -
>
> Key: FLINK-8073
> URL: https://issues.apache.org/jira/browse/FLINK-8073
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector, Tests
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Kostas Kloudas
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.5.0
>
>
> Travis log: https://travis-ci.org/kl0u/flink/jobs/301985988



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8968) Fix native resource leak caused by ReadOptions

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401952#comment-16401952
 ] 

ASF GitHub Bot commented on FLINK-8968:
---

Github user sihuazhou commented on the issue:

https://github.com/apache/flink/pull/5705
  
@StefanRRichter Sorry for pinging you again in this PR, but I found there 
may be another issue for `full checkpoint`. It's about the 
`kvStateInformation`: we should create a deep copy of `kvStateInformation` when 
taking a snapshot. Currently we use the code `this.kvStateInformationCopy = new 
ArrayList<>(stateBackend.kvStateInformation.values());`, but this is not a deep 
copy, because the `Serializers` in `RegisteredKeyedBackendStateMetaInfo` 
may not be stateless. If you also think this is a problem, instead of making a 
new PR for that I'd like to change the code to cover it in this PR, WDYT?


> Fix native resource leak caused by ReadOptions 
> ---
>
> Key: FLINK-8968
> URL: https://issues.apache.org/jira/browse/FLINK-8968
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.5.0
>Reporter: Sihua Zhou
>Assignee: Sihua Zhou
>Priority: Major
> Fix For: 1.5.0
>
>
> We should pull the creation of ReadOptions out of the loop in 
> {{RocksDBFullSnapshotOperation.writeKVStateMetaData()}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5705: [FLINK-8968][state]Fix native resource leak caused by Rea...

2018-03-16 Thread sihuazhou
Github user sihuazhou commented on the issue:

https://github.com/apache/flink/pull/5705
  
@StefanRRichter Sorry for pinging you again in this PR, but I found there 
may be another issue for `full checkpoint`. It's about the 
`kvStateInformation`: we should create a deep copy of `kvStateInformation` when 
taking a snapshot. Currently we use the code `this.kvStateInformationCopy = new 
ArrayList<>(stateBackend.kvStateInformation.values());`, but this is not a deep 
copy, because the `Serializers` in `RegisteredKeyedBackendStateMetaInfo` 
may not be stateless. If you also think this is a problem, instead of making a 
new PR for that I'd like to change the code to cover it in this PR, WDYT?
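
For illustration, a hedged sketch of the kind of deep copy being proposed, 
built on Flink's `TypeSerializer#duplicate()` (the surrounding class and 
method names are made up):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.typeutils.TypeSerializer;

// Hypothetical sketch: snapshot-time copy that duplicates potentially
// stateful serializers instead of sharing the live instances.
public class MetaInfoCopy<T> {

    List<TypeSerializer<T>> deepCopySerializers(List<TypeSerializer<T>> live) {
        List<TypeSerializer<T>> copy = new ArrayList<>(live.size());
        for (TypeSerializer<T> serializer : live) {
            // duplicate() returns a deep copy if the serializer is stateful,
            // and may return the same instance for stateless serializers.
            copy.add(serializer.duplicate());
        }
        return copy;
    }
}
```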


---


[jira] [Commented] (FLINK-8922) Revert FLINK-8859 because it causes segfaults in testing

2018-03-16 Thread Sihua Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401945#comment-16401945
 ] 

Sihua Zhou commented on FLINK-8922:
---

[~srichter] Yes, the downside is obvious and I also feel it should be avoided, 
and the logic of the current code is actually good. Indeed, I also think that 
it may be a coincidence. [~aljoscha] Could you please tell us the story 
of how you found that way to avoid the segfaults and why it works? Is it a 
coincidence, or is there some theory behind it? 

> Revert FLINK-8859 because it causes segfaults in testing
> 
>
> Key: FLINK-8922
> URL: https://issues.apache.org/jira/browse/FLINK-8922
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.5.0
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
> Fix For: 1.5.0
>
>
> We need to revert FLINK-8859 because it causes problems with RocksDB that 
> make our automated tests fail on Travis. The change looks actually good and 
> it is currently unclear why this can introduce such a problem. This might 
> also be a bug in RocksDB itself. Nevertheless, for the sake of a proper release 
> testing, we should revert the change for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8941) SpanningRecordSerializationTest fails on Travis

2018-03-16 Thread Nico Kruber (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-8941:
---
Component/s: Network

> SpanningRecordSerializationTest fails on Travis
> ---
>
> Key: FLINK-8941
> URL: https://issues.apache.org/jira/browse/FLINK-8941
> Project: Flink
>  Issue Type: Improvement
>  Components: Network, Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.5.0, 1.6.0
>
>
> https://travis-ci.org/zentol/flink/jobs/353217791
> {code:java}
> testHandleMixedLargeRecords(org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest)
>   Time elapsed: 1.992 sec  <<< ERROR!
> java.nio.channels.ClosedChannelException: null
>   at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
>   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:199)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer$SpanningWrapper.addNextChunkFromMemorySegment(SpillingAdaptiveSpanningRecordDeserializer.java:528)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer$SpanningWrapper.access$200(SpillingAdaptiveSpanningRecordDeserializer.java:430)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.setNextBuffer(SpillingAdaptiveSpanningRecordDeserializer.java:75)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testSerializationRoundTrip(SpanningRecordSerializationTest.java:143)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testSerializationRoundTrip(SpanningRecordSerializationTest.java:109)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testHandleMixedLargeRecords(SpanningRecordSerializationTest.java:98){code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8941) SpanningRecordSerializationTest fails on Travis

2018-03-16 Thread Nico Kruber (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-8941:
---
Affects Version/s: 1.6.0
   1.5.0

> SpanningRecordSerializationTest fails on Travis
> ---
>
> Key: FLINK-8941
> URL: https://issues.apache.org/jira/browse/FLINK-8941
> Project: Flink
>  Issue Type: Improvement
>  Components: Network, Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.5.0, 1.6.0
>
>
> https://travis-ci.org/zentol/flink/jobs/353217791
> {code:java}
> testHandleMixedLargeRecords(org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest)
>   Time elapsed: 1.992 sec  <<< ERROR!
> java.nio.channels.ClosedChannelException: null
>   at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
>   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:199)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer$SpanningWrapper.addNextChunkFromMemorySegment(SpillingAdaptiveSpanningRecordDeserializer.java:528)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer$SpanningWrapper.access$200(SpillingAdaptiveSpanningRecordDeserializer.java:430)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.setNextBuffer(SpillingAdaptiveSpanningRecordDeserializer.java:75)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testSerializationRoundTrip(SpanningRecordSerializationTest.java:143)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testSerializationRoundTrip(SpanningRecordSerializationTest.java:109)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testHandleMixedLargeRecords(SpanningRecordSerializationTest.java:98){code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8941) SpanningRecordSerializationTest fails on Travis

2018-03-16 Thread Nico Kruber (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-8941:
---
Fix Version/s: 1.6.0

> SpanningRecordSerializationTest fails on Travis
> ---
>
> Key: FLINK-8941
> URL: https://issues.apache.org/jira/browse/FLINK-8941
> Project: Flink
>  Issue Type: Improvement
>  Components: Network, Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.5.0, 1.6.0
>
>
> https://travis-ci.org/zentol/flink/jobs/353217791
> {code:java}
> testHandleMixedLargeRecords(org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest)
>   Time elapsed: 1.992 sec  <<< ERROR!
> java.nio.channels.ClosedChannelException: null
>   at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
>   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:199)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer$SpanningWrapper.addNextChunkFromMemorySegment(SpillingAdaptiveSpanningRecordDeserializer.java:528)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer$SpanningWrapper.access$200(SpillingAdaptiveSpanningRecordDeserializer.java:430)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.setNextBuffer(SpillingAdaptiveSpanningRecordDeserializer.java:75)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testSerializationRoundTrip(SpanningRecordSerializationTest.java:143)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testSerializationRoundTrip(SpanningRecordSerializationTest.java:109)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testHandleMixedLargeRecords(SpanningRecordSerializationTest.java:98){code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink pull request #5709: [FLINK-8941][network][serializer] improve Spanning...

2018-03-16 Thread NicoK
GitHub user NicoK opened a pull request:

https://github.com/apache/flink/pull/5709

[FLINK-8941][network][serializer] improve SpanningRecordSerializationTest 
and ensure unique spilling files

## What is the purpose of the change

This PR contains two commits trying to tackle FLINK-8941 (which I could not 
reproduce).

## Brief change log

- let `SpanningRecordSerializationTest` extend from `TestLogger`
- use a `TemporaryFolder` in `SpanningRecordSerializationTest` for spilling 
files
- make sure `SpillingAdaptiveSpanningRecordDeserializer` does not work on 
an existing file, e.g. from another instance running on the same machine - 
allow 10 retries with 20 random bytes file names and fail otherwise

## Verifying this change

This change is already covered by existing tests, such as 
`SpanningRecordSerializationTest`.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): **no**
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **no**
  - The serializers: **yes** (well, partly - the code around the actual 
serializers)
  - The runtime per-record code paths (performance sensitive): **no**
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: **no**
  - The S3 file system connector: **no**

## Documentation

  - Does this pull request introduce a new feature? **no**
  - If yes, how is the feature documented? **not applicable**
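
To illustrate the retry scheme from the change log above, a self-contained 
sketch (the helper and the file-name suffix are hypothetical; only the 
20-random-bytes/10-retries scheme is taken from the description):

```java
import java.io.File;
import java.io.IOException;
import java.util.Random;

// Hypothetical sketch: pick a random 20-byte file name and retry (at most
// 10 times) if the file already exists, instead of silently reusing it.
public class UniqueSpillFile {
    private static final Random RND = new Random();

    static File createSpillFile(File tempDir) throws IOException {
        for (int attempt = 0; attempt < 10; attempt++) {
            byte[] bytes = new byte[20];
            RND.nextBytes(bytes);
            StringBuilder name = new StringBuilder(40);
            for (byte b : bytes) {
                name.append(String.format("%02x", b));
            }
            File candidate = new File(tempDir, name + ".inputchannel");
            // createNewFile() atomically fails if the name is already taken,
            // e.g. by another instance running on the same machine.
            if (candidate.createNewFile()) {
                return candidate;
            }
        }
        throw new IOException("Could not create a unique spilling file in 10 attempts.");
    }
}
```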


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/NicoK/flink flink-8941

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5709.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5709


commit 5086956c743273dc4cd379fc6a6ee97da5c20258
Author: Nico Kruber 
Date:   2018-03-16T10:43:43Z

[FLINK-8941][tests] use TestLogger and TemporaryFolder in 
SpanningRecordSerializationTest

commit c9ba08b1c92c2510676c3cd3116410c7a379d072
Author: Nico Kruber 
Date:   2018-03-16T10:51:29Z

[FLINK-8941][serialization] make sure we use unique spilling files

Although the spilling files were chosen with random names of 20 bytes, it 
could
rarely happen that these collide. In that case, have another try (at most 
10) at
selecting a unique file name.




---


[jira] [Commented] (FLINK-8941) SpanningRecordSerializationTest fails on Travis

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401936#comment-16401936
 ] 

ASF GitHub Bot commented on FLINK-8941:
---

GitHub user NicoK opened a pull request:

https://github.com/apache/flink/pull/5709

[FLINK-8941][network][serializer] improve SpanningRecordSerializationTest 
and ensure unique spilling files

## What is the purpose of the change

This PR contains two commits trying to tackle FLINK-8941 (which I could not 
reproduce).

## Brief change log

- let `SpanningRecordSerializationTest` extend from `TestLogger`
- use a `TemporaryFolder` in `SpanningRecordSerializationTest` for spilling 
files
- make sure `SpillingAdaptiveSpanningRecordDeserializer` does not work on 
an existing file, e.g. from another instance running on the same machine - 
allow 10 retries with 20 random bytes file names and fail otherwise

## Verifying this change

This change is already covered by existing tests, such as 
`SpanningRecordSerializationTest`.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): **no**
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **no**
  - The serializers: **yes** (well, partly - the code around the actual 
serializers)
  - The runtime per-record code paths (performance sensitive): **no**
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: **no**
  - The S3 file system connector: **no**

## Documentation

  - Does this pull request introduce a new feature? **no**
  - If yes, how is the feature documented? **not applicable**


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/NicoK/flink flink-8941

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5709.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5709


commit 5086956c743273dc4cd379fc6a6ee97da5c20258
Author: Nico Kruber 
Date:   2018-03-16T10:43:43Z

[FLINK-8941][tests] use TestLogger and TemporaryFolder in 
SpanningRecordSerializationTest

commit c9ba08b1c92c2510676c3cd3116410c7a379d072
Author: Nico Kruber 
Date:   2018-03-16T10:51:29Z

[FLINK-8941][serialization] make sure we use unique spilling files

Although the spilling files were chosen with random names of 20 bytes, it 
could
rarely happen that these collide. In that case, have another try (at most 
10) at
selecting a unique file name.




> SpanningRecordSerializationTest fails on Travis
> ---
>
> Key: FLINK-8941
> URL: https://issues.apache.org/jira/browse/FLINK-8941
> Project: Flink
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Chesnay Schepler
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.5.0
>
>
> https://travis-ci.org/zentol/flink/jobs/353217791
> {code:java}
> testHandleMixedLargeRecords(org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest)
>   Time elapsed: 1.992 sec  <<< ERROR!
> java.nio.channels.ClosedChannelException: null
>   at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
>   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:199)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer$SpanningWrapper.addNextChunkFromMemorySegment(SpillingAdaptiveSpanningRecordDeserializer.java:528)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer$SpanningWrapper.access$200(SpillingAdaptiveSpanningRecordDeserializer.java:430)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.setNextBuffer(SpillingAdaptiveSpanningRecordDeserializer.java:75)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testSerializationRoundTrip(SpanningRecordSerializationTest.java:143)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testSerializationRoundTrip(SpanningRecordSerializationTest.java:109)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testHandleMixedLargeRecords(SpanningRecordSerializationTest.java:98){code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8922) Revert FLINK-8859 because it causes segfaults in testing

2018-03-16 Thread Stefan Richter (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401933#comment-16401933
 ] 

Stefan Richter commented on FLINK-8922:
---

Hmm, I wonder why that should be a fix? Does that mean that {{WriteOptions}} 
cannot be shared, or is it just coincidence? In any case, the downside of this 
would be that the backend needs to remember and close all the {{WriteOptions}} 
objects in the states, which is exactly what I wanted to avoid :(

> Revert FLINK-8859 because it causes segfaults in testing
> 
>
> Key: FLINK-8922
> URL: https://issues.apache.org/jira/browse/FLINK-8922
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.5.0
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
> Fix For: 1.5.0
>
>
> We need to revert FLINK-8859 because it causes problems with RocksDB that 
> make our automated tests fail on Travis. The change looks actually good and 
> it is currently unclear why this can introduce such a problem. This might 
> also be a bug in RocksDB itself. Nevertheless, for the sake of a proper release 
> testing, we should revert the change for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8922) Revert FLINK-8859 because it causes segfaults in testing

2018-03-16 Thread Sihua Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401925#comment-16401925
 ] 

Sihua Zhou commented on FLINK-8922:
---

Ah, happy to tell you guys that I finally found the way to make it work 
correctly. Indeed, I think it's [~aljoscha] that found the way to overcome this: 
following the comments he left, I changed the code to re-introduce a {{private 
final WriteOptions}} for every state (not sharing only one {{WriteOptions}} 
across all {{State}}s), and also a dedicated {{WriteOptions}} for restoring. 
After the above changes, the code runs pretty fine on Travis. ;) Now I am a bit 
worried that the current code may be problematic and 
[FLINK-8845|https://issues.apache.org/jira/browse/FLINK-8845] is just a trigger 
that fires this issue. [~StephanEwen] [~srichter] WDYT of this, should we make a 
PR to address it for 1.5?
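
As an illustration of that layout (a hedged sketch, not the actual backend 
code):

{code:java}
import org.rocksdb.RocksDB;
import org.rocksdb.WriteOptions;

// Hypothetical sketch: one WriteOptions per state object instead of a single
// shared instance; the owner must close it when the state is disposed.
public class PerStateWriteOptions implements AutoCloseable {
    private final RocksDB db;
    private final WriteOptions writeOptions;

    PerStateWriteOptions(RocksDB db) {
        this.db = db;
        this.writeOptions = new WriteOptions().setDisableWAL(true);
    }

    void put(byte[] key, byte[] value) throws Exception {
        db.put(writeOptions, key, value);
    }

    @Override
    public void close() {
        // Native resource: must be freed explicitly, which is exactly the
        // bookkeeping downside discussed elsewhere in this thread.
        writeOptions.close();
    }
}
{code}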

> Revert FLINK-8859 because it causes segfaults in testing
> 
>
> Key: FLINK-8922
> URL: https://issues.apache.org/jira/browse/FLINK-8922
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.5.0
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
> Fix For: 1.5.0
>
>
> We need to revert FLINK-8859 because it causes problems with RocksDB that 
> make our automated tests fail on Travis. The change looks actually good and 
> it is currently unclear why this can introduce such a problem. This might 
> also be a bug in RocksDB itself. Nevertheless, for the sake of a proper release 
> testing, we should revert the change for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-7851) Improve scheduling balance in case of fewer sub tasks than input operator

2018-03-16 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-7851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-7851.

Resolution: Won't Fix

Reverted via
1.6.0: 91d346e9e7611be530509154cc7034cbde22653d
1.5.0: c74f80869f3407ca8d02e79fbcf8a5b267ea7253

> Improve scheduling balance in case of fewer sub tasks than input operator
> -
>
> Key: FLINK-7851
> URL: https://issues.apache.org/jira/browse/FLINK-7851
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.4.0, 1.3.2
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
> Fix For: 1.5.0
>
>
> When having a job where we have a mapper {{m1}} running with dop {{n}} 
> followed by a key by and a mapper {{m2}} (all-to-all communication) which 
> runs with dop {{m}} and {{n > m}}, it happens that the sub tasks of {{m2}} 
> are not uniformly spread out across all currently used {{TaskManagers}}.
> For example: {{n = 4}}, {{m = 2}} and we have 2 TaskManagers with 2 slots 
> each. The deployment would look the following:
> TM1: 
> Slot 1: {{m1_1}} -> {{m_2_1}}
> Slot 2: {{m1_3}} -> {{m_2_2}}
> TM2:
> Slot 1: {{m1_2}}
> Slot 2: {{m1_4}}
> The reason for this behaviour is that when there are too many preferred 
> locations (currently 8) due to an all-to-all communication pattern, we 
> simply poll the next slot from the MultiMap in 
> {{SlotSharingGroupAssignment}}. The polling algorithm first drains all 
> available slots of a single machine before it polls slots from another 
> machine. 
> I think it would be better to poll slots in a round-robin fashion with 
> respect to the machines. That way we would get better resource utilisation 
> by spreading the tasks more evenly, as sketched below.
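
A minimal sketch of the proposed round-robin polling (the data structures are 
simplified stand-ins for the MultiMap in {{SlotSharingGroupAssignment}}):

{code:java}
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Sketch: instead of draining one machine's slots completely, cycle over
// the machines and take at most one slot from each per round.
// (Not thread-safe; adding a machine invalidates the cursor in a real impl.)
public class RoundRobinSlotPolling {
    private final Map<String, Queue<String>> slotsByMachine = new LinkedHashMap<>();
    private Iterator<String> machineCursor;

    void addSlot(String machine, String slot) {
        slotsByMachine.computeIfAbsent(machine, k -> new ArrayDeque<>()).add(slot);
    }

    String pollNextSlot() {
        // Visit every machine at most once per call, starting after the
        // machine that served the previous request.
        for (int i = 0; i < slotsByMachine.size(); i++) {
            if (machineCursor == null || !machineCursor.hasNext()) {
                machineCursor = slotsByMachine.keySet().iterator();
            }
            Queue<String> slots = slotsByMachine.get(machineCursor.next());
            if (!slots.isEmpty()) {
                return slots.poll();
            }
        }
        return null; // no free slots on any machine
    }
}
{code}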



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8922) Revert FLINK-8859 because it causes segfaults in testing

2018-03-16 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401917#comment-16401917
 ] 

Stephan Ewen commented on FLINK-8922:
-

[~srichter] and I had a discussion about this today.

*Can we confirm that this only happens on Travis, but not on the developer 
machines (laptops)?*

If that is the case, it could be possible that this crash is in fact caused by 
the specific environment on Travis CI, like a specific version of glibc, or so. 
We saw in the past that RocksDB was sensitive to certain library versions 
(works with certain docker base images, but not others).

If we believe that is true, we could actually deactivate WAL by default, except 
for when we run on Travis (can check environment variables for that).

We need to ensure sufficient test runs on other machines, though. We (at data 
Artisans) are currently working towards continuous regression runs (on some 
cloud resources) which could cover that.


> Revert FLINK-8859 because it causes segfaults in testing
> 
>
> Key: FLINK-8922
> URL: https://issues.apache.org/jira/browse/FLINK-8922
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing
>Affects Versions: 1.5.0
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
> Fix For: 1.5.0
>
>
> We need to revert FLINK-8859 because it causes problems with RocksDB that 
> make our automated tests fail on Travis. The change looks actually good and 
> it is currently unclear why this can introduce such a problem. This might 
> also be a bug in RocksDB itself. Nevertheless, for the sake of a proper release 
> testing, we should revert the change for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8984) Disabling credit based flow control deadlocks Flink on checkpoint

2018-03-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401912#comment-16401912
 ] 

ASF GitHub Bot commented on FLINK-8984:
---

Github user zhijiangW commented on the issue:

https://github.com/apache/flink/pull/5708
  
Thanks Piotr, I agree with it. 


> Disabling credit based flow control deadlocks Flink on checkpoint
> -
>
> Key: FLINK-8984
> URL: https://issues.apache.org/jira/browse/FLINK-8984
> Project: Flink
>  Issue Type: Bug
>  Components: Network
>Affects Versions: 1.5.0
>Reporter: Piotr Nowojski
>Assignee: Piotr Nowojski
>Priority: Blocker
> Fix For: 1.5.0
>
>
> This is a configuration issue. There are two options: 
> taskmanager.network.credit-based-flow-control.enabled
> and
> taskmanager.exactly-once.blocking.data.enabled
> If we disable the first one but keep the default value for the second one, 
> deadlocks will occur. I think we can safely drop the second config value 
> altogether and always use a blocking BarrierBuffer for credit-based flow 
> control and a spilling BarrierBuffer for non-credit-based flow control.
> cc [~zjwang]
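
A sketch of the resulting selection logic (class names follow the description; 
the factory itself is hypothetical):

{code:java}
// Hypothetical sketch: with the second option dropped, the buffer choice is
// derived solely from the credit-based flow control setting, so there is no
// way to configure the deadlocking combination.
public class BarrierBufferSelection {

    interface CheckpointBarrierHandler {}
    static class BlockingBarrierBuffer implements CheckpointBarrierHandler {}
    static class SpillingBarrierBuffer implements CheckpointBarrierHandler {}

    static CheckpointBarrierHandler create(boolean creditBasedFlowControl) {
        return creditBasedFlowControl
                ? new BlockingBarrierBuffer()   // credit-based: block the channel
                : new SpillingBarrierBuffer();  // non-credit-based: spill to disk
    }
}
{code}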



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] flink issue #5708: [FLINK-8984][network] Drop taskmanager.exactly-once.block...

2018-03-16 Thread zhijiangW
Github user zhijiangW commented on the issue:

https://github.com/apache/flink/pull/5708
  
Thanks Piotr, I agree with it. 


---


[jira] [Commented] (FLINK-9009) Error| You are creating too many HashedWheelTimer instances. HashedWheelTimer is a shared resource that must be reused across the application, so that only a few insta

2018-03-16 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401901#comment-16401901
 ] 

Stephan Ewen commented on FLINK-9009:
-

I assume that error is reported by Kafka?
Can you paste the stack trace?


> Error| You are creating too many HashedWheelTimer instances.  
> HashedWheelTimer is a shared resource that must be reused across the 
> application, so that only a few instances are created.
> -
>
> Key: FLINK-9009
> URL: https://issues.apache.org/jira/browse/FLINK-9009
> Project: Flink
>  Issue Type: Bug
> Environment: PaaS platform: OpenShift
>Reporter: Pankaj
>Priority: Blocker
>
> Steps to reproduce:
> 1- Flink with Kafka as a consumer -> writing the stream to Cassandra using 
> the Flink Cassandra sink.
> 2- In-memory job manager and task manager with checkpointing every 5000 ms.
> 3- env.setParallelism(10) -> as the Kafka topic has 10 partitions.
> 4- There are around 13 unique streams in a single Flink runtime environment 
> which are reading from Kafka -> processing and writing to Cassandra.
> Hardware: CPU 200 millicores. It is deployed on a PaaS platform on one node.
> Memory: 526 MB.
>  
> When I start the server, it starts Flink and then all of a sudden stops with 
> the above error. It also shows an out-of-memory error.
>  
> It would be nice if anybody could suggest whether something is wrong.
>  
> Maven:
> flink-connector-cassandra_2.11: 1.3.2
> flink-streaming-java_2.11: 1.4.0
> flink-connector-kafka-0.11_2.11:1.4.0
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (FLINK-8945) Allow customization of the KinesisProxy

2018-03-16 Thread Stephan Ewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-8945.
-
   Resolution: Fixed
 Assignee: Kailash Hassan Dayanand
Fix Version/s: 1.5.0

Fixed in
  - 1.5.0 via 7763f7f680fec162e27855f792760bcc3820b799
  - 1.6.0 via cb60fd29e7e46f8015587e9be7ad88f2368fe85a

> Allow customization of the KinesisProxy
> ---
>
> Key: FLINK-8945
> URL: https://issues.apache.org/jira/browse/FLINK-8945
> Project: Flink
>  Issue Type: Improvement
>Reporter: Kailash Hassan Dayanand
>Assignee: Kailash Hassan Dayanand
>Priority: Minor
> Fix For: 1.5.0
>
>
> Currently the KinesisProxy class here:
> [https://github.com/apache/flink/blob/310f3de62e52f1f977c217d918cc5aac79b87277/flink-connectors/flink-connector-kinesis/src/main/java/org/apache/flink/streaming/connectors/kinesis/proxy/KinesisProxy.java#L125]
> has a private constructor. This restricts extending the class and prevents 
> customization of shard discovery. I am proposing to change this to protected.
> While creating a new implementation of KinesisProxyInterface is possible, 
> I would like to keep using the existing implementations of getRecords and 
> getShardIterator.
> This will be a temporary workaround until FLINK-8944 is submitted. 
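
For illustration, once the constructor is protected, a subclass customizing 
shard discovery could look roughly like this (a sketch; the constructor 
argument and the getShardList signature are assumptions based on the linked 
source, and CustomShardDiscoveryProxy is a hypothetical name):

{noformat}
import java.util.Map;
import java.util.Properties;

import org.apache.flink.streaming.connectors.kinesis.proxy.GetShardListResult;
import org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy;

public class CustomShardDiscoveryProxy extends KinesisProxy {

    // Only possible once the constructor is protected rather than private.
    protected CustomShardDiscoveryProxy(Properties configProps) {
        super(configProps);
    }

    @Override
    public GetShardListResult getShardList(
            Map<String, String> streamNamesWithLastSeenShardIds)
            throws InterruptedException {
        // Custom shard discovery would go here; getRecords() and
        // getShardIterator() are inherited unchanged, as the issue intends.
        return super.getShardList(streamNamesWithLastSeenShardIds);
    }
}
{noformat}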



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-8945) Allow customization of the KinesisProxy

2018-03-16 Thread Stephan Ewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-8945.
---

> Allow customization of the KinesisProxy
> ---
>
> Key: FLINK-8945
> URL: https://issues.apache.org/jira/browse/FLINK-8945
> Project: Flink
>  Issue Type: Improvement
>Reporter: Kailash Hassan Dayanand
>Assignee: Kailash Hassan Dayanand
>Priority: Minor
> Fix For: 1.5.0
>
>
> Currently the KinesisProxy class here:
> [https://github.com/apache/flink/blob/310f3de62e52f1f977c217d918cc5aac79b87277/flink-connectors/flink-connector-kinesis/src/main/java/org/apache/flink/streaming/connectors/kinesis/proxy/KinesisProxy.java#L125]
> has a private constructor. This restricts extending the class and prevents 
> customization of shard discovery. I am proposing to change this to protected.
> While creating a new implementation of KinesisProxyInterface is possible, 
> I would like to keep using the existing implementations of getRecords and 
> getShardIterator.
> This will be a temporary workaround until FLINK-8944 is submitted. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-8888) Upgrade AWS SDK in flink-connector-kinesis

2018-03-16 Thread Stephan Ewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen reassigned FLINK-8888:
---

Assignee: Kailash Hassan Dayanand

> Upgrade AWS SDK in flink-connector-kinesis
> --
>
> Key: FLINK-8888
> URL: https://issues.apache.org/jira/browse/FLINK-8888
> Project: Flink
>  Issue Type: Improvement
>Reporter: Kailash Hassan Dayanand
>Assignee: Kailash Hassan Dayanand
>Priority: Minor
> Fix For: 1.5.0
>
>
> Bump the Java AWS SDK version to 1.11.272. Also evaluate the impact of 
> this version upgrade on the KCL and KPL versions.
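
For illustration, if the module manages the SDK version through a Maven 
property (an assumption; the actual pom.xml layout may differ), the bump 
would amount to:

{noformat}
<!-- flink-connector-kinesis/pom.xml (hypothetical property name) -->
<properties>
  <aws.sdk.version>1.11.272</aws.sdk.version>
</properties>
{noformat}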



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-8888) Upgrade AWS SDK in flink-connector-kinesis

2018-03-16 Thread Stephan Ewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-8888.
---

> Upgrade AWS SDK in flink-connector-kinesis
> --
>
> Key: FLINK-8888
> URL: https://issues.apache.org/jira/browse/FLINK-8888
> Project: Flink
>  Issue Type: Improvement
>Reporter: Kailash Hassan Dayanand
>Assignee: Kailash Hassan Dayanand
>Priority: Minor
> Fix For: 1.5.0
>
>
> Bump the Java AWS SDK version to 1.11.272. Also evaluate the impact of 
> this version upgrade on the KCL and KPL versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (FLINK-8888) Upgrade AWS SDK in flink-connector-kinesis

2018-03-16 Thread Stephan Ewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-8888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-8888.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Fixed in
  - 1.5.0 via 5896840b17d6b39345c836ad2bba0481990432c4
  - 1.6.0 via c71f7e1badebd36e4bd55c90a011c04b51c2e9f1

> Upgrade AWS SDK in flink-connector-kinesis
> --
>
> Key: FLINK-8888
> URL: https://issues.apache.org/jira/browse/FLINK-8888
> Project: Flink
>  Issue Type: Improvement
>Reporter: Kailash Hassan Dayanand
>Assignee: Kailash Hassan Dayanand
>Priority: Minor
> Fix For: 1.5.0
>
>
> Bump the Java AWS SDK version to 1.11.272. Also evaluate the impact of 
> this version upgrade on the KCL and KPL versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9009) Error| You are creating too many HashedWheelTimer instances. HashedWheelTimer is a shared resource that must be reused across the application, so that only a few instanc

2018-03-16 Thread Pankaj (JIRA)
Pankaj created FLINK-9009:
-

 Summary: Error| You are creating too many HashedWheelTimer 
instances.  HashedWheelTimer is a shared resource that must be reused across 
the application, so that only a few instances are created.
 Key: FLINK-9009
 URL: https://issues.apache.org/jira/browse/FLINK-9009
 Project: Flink
  Issue Type: Bug
 Environment: PaaS platform: OpenShift
Reporter: Pankaj


Steps to reproduce:

1- Flink with Kafka as a consumer -> writing the stream to Cassandra using the 
Flink Cassandra sink.

2- In-memory job manager and task manager with checkpointing every 5000 ms.

3- env.setParallelism(10) -> as the Kafka topic has 10 partitions.

4- There are around 13 unique streams in a single Flink runtime environment 
which read from Kafka -> process and write to Cassandra.

Hardware: CPU 200 millicores. It is deployed on a PaaS platform on one node.

Memory: 526 MB.

 

When I start the server, it starts Flink and then all of a sudden it stops 
with the above error. It also shows an out-of-memory error.

 

It would be nice if anybody could suggest whether something is wrong.

 

Maven:

flink-connector-cassandra_2.11: 1.3.2

flink-streaming-java_2.11: 1.4.0

flink-connector-kafka-0.11_2.11:1.4.0
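
A minimal sketch of the reported setup (topic, keyspace/table, hosts and group 
id are hypothetical placeholders, not taken from the report):

{noformat}
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.cassandra.CassandraSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class KafkaToCassandraSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(10);        // matches the 10 Kafka partitions
        env.enableCheckpointing(5000); // checkpoint every 5000 ms

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "example-group");

        DataStream<Tuple2<String, Long>> records = env
                .addSource(new FlinkKafkaConsumer011<>(
                        "example-topic", new SimpleStringSchema(), props))
                .map(line -> Tuple2.of(line, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG));

        CassandraSink.addSink(records)
                .setQuery("INSERT INTO example.events (key, value) VALUES (?, ?);")
                .setHost("cassandra-host")
                .build();

        env.execute("kafka-to-cassandra-sketch");
    }
}
{noformat}

Note that the report mixes flink-connector-cassandra 1.3.2 with Flink 1.4.0 
artifacts; whether that version mismatch contributes to the error is unclear.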

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9008) End-to-end test: Quickstarts

2018-03-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-9008:


 Summary: End-to-end test: Quickstarts
 Key: FLINK-9008
 URL: https://issues.apache.org/jira/browse/FLINK-9008
 Project: Flink
  Issue Type: Sub-task
  Components: Quickstarts, Tests
Affects Versions: 1.5.0
Reporter: Till Rohrmann
 Fix For: 1.5.0


We could add an end-to-end test which verifies Flink's quickstarts. It should 
do the following (a command-line sketch follows the list):
# create a new Flink project using the quickstarts archetype 
# add a new Flink dependency to the {{pom.xml}} (e.g. Flink connector or 
library) 
# run {{mvn clean package -Pbuild-jar}}
# verify that no core dependencies are contained in the jar file
# Run the program
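
A rough command-line sketch of those steps (archetype coordinates follow the 
Flink documentation; groupId/artifactId/version here are placeholders):

{noformat}
mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=1.5.0 \
  -DgroupId=org.example -DartifactId=quickstart-test \
  -Dversion=0.1 -DinteractiveMode=false

cd quickstart-test
# (add the extra Flink connector/library dependency to pom.xml here)
mvn clean package -Pbuild-jar

# One possible check that no core Flink classes leaked into the jar:
jar tf target/quickstart-test-0.1.jar | grep 'org/apache/flink/api/java' \
  && echo "core dependencies leaked into the jar"

flink run target/quickstart-test-0.1.jar
{noformat}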



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9007) Cluster test: Kinesis connector

2018-03-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-9007:


 Summary: Cluster test: Kinesis connector
 Key: FLINK-9007
 URL: https://issues.apache.org/jira/browse/FLINK-9007
 Project: Flink
  Issue Type: Sub-task
  Components: Kinesis Connector, Tests
Reporter: Till Rohrmann


Ideally we would have an automated cluster test that allocates nodes from AWS 
and can read from and write to Kinesis using Flink's Kinesis connector. We could 
use a simple pipe job with simple state for checkpointing purposes. The 
checkpoints should then be written to S3 using {{flink-s3-fs-hadoop}} and 
{{flink-s3-fs-presto}}.
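
A minimal sketch of such a pipe job (stream names, bucket and region are 
hypothetical placeholders; the s3:// checkpoint URI assumes one of the two 
S3 filesystems above is on the classpath):

{noformat}
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;

public class KinesisPipeJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoints go to S3 via flink-s3-fs-hadoop or flink-s3-fs-presto.
        env.setStateBackend(new FsStateBackend("s3://example-bucket/checkpoints"));
        env.enableCheckpointing(10_000);

        Properties config = new Properties();
        config.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");

        FlinkKinesisProducer<String> producer =
                new FlinkKinesisProducer<>(new SimpleStringSchema(), config);
        producer.setDefaultStream("example-output-stream");
        producer.setDefaultPartition("0");

        env.addSource(new FlinkKinesisConsumer<>(
                        "example-input-stream", new SimpleStringSchema(), config))
           // trivial "pipe" step; the real test would add a simple stateful operator
           .map(String::toUpperCase)
           .addSink(producer);

        env.execute("kinesis-pipe-job-sketch");
    }
}
{noformat}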



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9006) Cluster test: Run general purpose job with failures with Yarn and per-job mode

2018-03-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-9006:


 Summary: Cluster test: Run general purpose job with failures with 
Yarn and per-job mode
 Key: FLINK-9006
 URL: https://issues.apache.org/jira/browse/FLINK-9006
 Project: Flink
  Issue Type: Sub-task
  Components: Tests, YARN
Reporter: Till Rohrmann


Similar to FLINK-9004, we should run the general purpose job (FLINK-8971) on 
YARN using Flink's per-job mode.

The job should also be ill-packaged, use RocksDB full checkpoints and write the 
checkpoints to HDFS.
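
For reference, the described checkpointing setup (full rather than incremental 
RocksDB checkpoints, written to HDFS) would be wired up roughly like this 
inside the job (a sketch; the HDFS path is a placeholder):

{noformat}
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbFullCheckpointsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // 'false' disables incremental checkpoints, so every checkpoint
        // is a full RocksDB snapshot written to the HDFS path below.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", false));
        env.enableCheckpointing(10_000);

        // Stand-in for the general purpose job (FLINK-8971):
        env.fromElements(1, 2, 3).map(i -> i * 2).print();

        env.execute("rocksdb-full-checkpoints-sketch");
    }
}
{noformat}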



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

