[jira] [Commented] (FLINK-10325) [State TTL] Refactor TtlListState to use only loops, no java stream API for performance

2018-09-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616593#comment-16616593
 ] 

ASF GitHub Bot commented on FLINK-10325:


Clarkkkkk commented on issue #6683: [FLINK-10325] [State TTL] Refactor 
TtlListState to use only loops, no java stream API for performance
URL: https://github.com/apache/flink/pull/6683#issuecomment-421695107
 
 
   @tillrohrmann Thanks for your reply. Sounds reasonable.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [State TTL] Refactor TtlListState to use only loops, no java stream API for 
> performance
> ---
>
> Key: FLINK-10325
> URL: https://issues.apache.org/jira/browse/FLINK-10325
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing
>Affects Versions: 1.6.0
>Reporter: Andrey Zagrebin
>Assignee: Andrey Zagrebin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.1, 1.7.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)



[jira] [Commented] (FLINK-9697) Provide connector for Kafka 2.0.0

2018-09-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616592#comment-16616592
 ] 

ASF GitHub Bot commented on FLINK-9697:
---

yanghua commented on issue #6703: [FLINK-9697] Provide connector for Kafka 2.0.0
URL: https://github.com/apache/flink/pull/6703#issuecomment-421693803
 
 
   Currently this PR is not ready to be reviewed; it still needs some 
refactoring.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Provide connector for Kafka 2.0.0
> -
>
> Key: FLINK-9697
> URL: https://issues.apache.org/jira/browse/FLINK-9697
> Project: Flink
>  Issue Type: Improvement
>Reporter: Ted Yu
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
>
> Kafka 2.0.0 will be released soon.
> Here is the vote thread:
> [http://search-hadoop.com/m/Kafka/uyzND1vxnEd23QLxb?subj=+VOTE+2+0+0+RC1]
> We should provide a connector for Kafka 2.0.0 once it is released.
> Upgrade to 2.0 documentation: 
> http://kafka.apache.org/20/documentation.html#upgrade_2_0_0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Updated] (FLINK-9697) Provide connector for Kafka 2.0.0

2018-09-15 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-9697:
--
Labels: pull-request-available  (was: )

> Provide connector for Kafka 2.0.0
> -
>
> Key: FLINK-9697
> URL: https://issues.apache.org/jira/browse/FLINK-9697
> Project: Flink
>  Issue Type: Improvement
>Reporter: Ted Yu
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
>
> Kafka 2.0.0 will be released soon.
> Here is the vote thread:
> [http://search-hadoop.com/m/Kafka/uyzND1vxnEd23QLxb?subj=+VOTE+2+0+0+RC1]
> We should provide a connector for Kafka 2.0.0 once it is released.
> Upgrade to 2.0 documentation: 
> http://kafka.apache.org/20/documentation.html#upgrade_2_0_0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9697) Provide connector for Kafka 2.0.0

2018-09-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616588#comment-16616588
 ] 

ASF GitHub Bot commented on FLINK-9697:
---

yanghua opened a new pull request #6703: [FLINK-9697] Provide connector for 
Kafka 2.0.0
URL: https://github.com/apache/flink/pull/6703
 
 
   ## What is the purpose of the change
   
   *This pull request provides a connector for Kafka 2.0.0*
   
   ## Brief change log
   
 - *Provide connector for Kafka 2.0.0*
   
   ## Verifying this change
   
   This change is already covered by existing tests.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (**yes** / no)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
 - The serializers: (yes / **no** / don't know)
 - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
 - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (**yes** / no)
 - If yes, how is the feature documented? (not applicable / docs / 
**JavaDocs** / not documented)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Provide connector for Kafka 2.0.0
> -
>
> Key: FLINK-9697
> URL: https://issues.apache.org/jira/browse/FLINK-9697
> Project: Flink
>  Issue Type: Improvement
>Reporter: Ted Yu
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
>
> Kafka 2.0.0 will be released soon.
> Here is the vote thread:
> [http://search-hadoop.com/m/Kafka/uyzND1vxnEd23QLxb?subj=+VOTE+2+0+0+RC1]
> We should provide a connector for Kafka 2.0.0 once it is released.
> Upgrade to 2.0 documentation: 
> http://kafka.apache.org/20/documentation.html#upgrade_2_0_0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Assigned] (FLINK-7360) Support Scala map type in Table API

2018-09-15 Thread xueyu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-7360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xueyu reassigned FLINK-7360:


Assignee: xueyu

> Support Scala map type in Table API
> ---
>
> Key: FLINK-7360
> URL: https://issues.apache.org/jira/browse/FLINK-7360
> Project: Flink
>  Issue Type: New Feature
>  Components: Table API & SQL
>Reporter: Timo Walther
>Assignee: xueyu
>Priority: Major
>
> Currently, Flink SQL supports only Java `java.util.Map`. Scala maps are 
> treated as blackboxes with Flink's `GenericTypeInfo`/SQL `ANY` data type. 
> Therefore, you can forward these blackboxes and use them within scalar 
> functions, but accessing them with the `['key']` operator is not supported.
> We should convert these special collections at the beginning, in order to use 
> them in SQL statements.
> See: 
> https://stackoverflow.com/questions/45471503/flink-table-api-sql-and-map-types-scala
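
As a stop-gap, users can perform by hand the conversion this issue asks Flink to do automatically. A minimal, illustrative sketch (the object and method names below are not from the issue):

{code:scala}
import scala.collection.JavaConverters._

// Sketch only: hand-convert the Scala map to a java.util.Map before the data
// enters the Table API, since that is the map type Flink SQL currently
// understands; the issue asks for this conversion to happen automatically at
// the type-extraction boundary.
object ScalaMapConversion {
  def toJavaMap[K, V](scalaMap: Map[K, V]): java.util.Map[K, V] =
    scalaMap.asJava
}

// e.g. toJavaMap(Map("color" -> "blue")), which SQL should then be able to
// access as map['color']
{code}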



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-10253) Run MetricQueryService with lower priority

2018-09-15 Thread vinoyang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang reassigned FLINK-10253:


Assignee: vinoyang

> Run MetricQueryService with lower priority
> --
>
> Key: FLINK-10253
> URL: https://issues.apache.org/jira/browse/FLINK-10253
> Project: Flink
>  Issue Type: Sub-task
>  Components: Metrics
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Assignee: vinoyang
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> We should run the {{MetricQueryService}} with a lower priority than the main 
> Flink components. An idea would be to start the underlying threads with a 
> lower priority.
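
A minimal sketch of that idea (illustrative only, not the actual Flink implementation): a ThreadFactory that hands out low-priority daemon threads, which the executor or dispatcher backing the MetricQueryService could use.

{code:scala}
import java.util.concurrent.{ExecutorService, Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

// Sketch only: every thread created by this factory runs below the default
// NORM_PRIORITY used by the main Flink components.
class LowPriorityThreadFactory(namePrefix: String) extends ThreadFactory {
  private val counter = new AtomicInteger(0)

  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, s"$namePrefix-${counter.incrementAndGet()}")
    t.setDaemon(true)
    t.setPriority(Thread.MIN_PRIORITY) // yield CPU time to the main components
    t
  }
}

object LowPriorityExample {
  val metricQueryExecutor: ExecutorService =
    Executors.newSingleThreadExecutor(new LowPriorityThreadFactory("metric-query-service"))
}
{code}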



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9824) Support IPv6 literal

2018-09-15 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated FLINK-9824:
--
Description: 
Currently we use colon as separator when parsing host and port.

We should support the usage of IPv6 literals in parsing.

  was:
Currently we use colon as separator when parsing host and port.


We should support the usage of IPv6 literals in parsing.


> Support IPv6 literal
> 
>
> Key: FLINK-9824
> URL: https://issues.apache.org/jira/browse/FLINK-9824
> Project: Flink
>  Issue Type: Bug
>  Components: Network
>Reporter: Ted Yu
>Assignee: vinoyang
>Priority: Minor
>
> Currently we use colon as separator when parsing host and port.
> We should support the usage of IPv6 literals in parsing.
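
The parsing concern, sketched (illustrative only, not Flink's actual parser): a bare colon split breaks for IPv6 literals, so the bracketed `[host]:port` notation from RFC 3986 has to be handled explicitly.

{code:scala}
// Sketch only: distinguish bracketed IPv6 literals from plain host:port pairs.
object HostPortParser {
  def parse(hostPort: String): (String, Int) = {
    if (hostPort.startsWith("[")) {
      // IPv6 literal, e.g. "[2001:db8::1]:6123"
      val closing = hostPort.indexOf(']')
      require(closing > 0 && closing + 1 < hostPort.length && hostPort.charAt(closing + 1) == ':',
        s"Invalid address: $hostPort")
      (hostPort.substring(1, closing), hostPort.substring(closing + 2).toInt)
    } else {
      // IPv4 address or hostname, e.g. "localhost:6123"
      val idx = hostPort.lastIndexOf(':')
      require(idx > 0, s"Invalid address: $hostPort")
      (hostPort.substring(0, idx), hostPort.substring(idx + 1).toInt)
    }
  }
}

// HostPortParser.parse("[2001:db8::1]:6123") == ("2001:db8::1", 6123)
// HostPortParser.parse("localhost:6123")     == ("localhost", 6123)
{code}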



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] fhueske commented on a change in pull request #6508: [Flink-10079] [table] Support external sink table in the INSERT INTO clause

2018-09-15 Thread GitBox
fhueske commented on a change in pull request #6508: [Flink-10079] [table] 
Support external sink table in the INSERT INTO clause 
URL: https://github.com/apache/flink/pull/6508#discussion_r217895759
 
 

 ##
 File path: 
flink-libraries/flink-table/src/main/scala/org/apache/flink/table/api/TableEnvironment.scala
 ##
 @@ -820,20 +820,59 @@ abstract class TableEnvironment(val config: TableConfig) 
{
 
   /**
 * Checks if a table is registered under the given name.
+* Internal and external catalogs are both checked.
 *
 * @param name The table name to check.
 * @return true, if a table is registered under the name, false otherwise.
 */
   protected[flink] def isRegistered(name: String): Boolean = {
 
 Review comment:
   Keep `isRegistered` as it is. It is used by other methods as well.
   We should change the logic in `insertInto` to directly call `getTable()`. 
This will also improve performance, because we only need to talk to the 
external catalog once (which might be backed by an external service).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] fhueske commented on a change in pull request #6508: [Flink-10079] [table] Support external sink table in the INSERT INTO clause

2018-09-15 Thread GitBox
fhueske commented on a change in pull request #6508: [Flink-10079] [table] 
Support external sink table in the INSERT INTO clause 
URL: https://github.com/apache/flink/pull/6508#discussion_r217895771
 
 

 ##
 File path: 
flink-libraries/flink-table/src/main/scala/org/apache/flink/table/api/TableEnvironment.scala
 ##
 @@ -820,20 +820,59 @@ abstract class TableEnvironment(val config: TableConfig) 
{
 
   /**
 * Checks if a table is registered under the given name.
+* Internal and external catalogs are both checked.
 *
 * @param name The table name to check.
 * @return true, if a table is registered under the name, false otherwise.
 */
   protected[flink] def isRegistered(name: String): Boolean = {
-rootSchema.getTableNames.contains(name)
+var isRegistered = rootSchema.getTableNames.contains(name)
+
+// check if the table exists in external catalogs
+if (!isRegistered) {
+  val (externalSchema, externalTableName) = resolveExternalTable(name)
+  if (externalSchema != null) {
+isRegistered = externalSchema.getTableNames.contains(externalTableName)
+  }
+}
+
+return isRegistered
   }
 
   protected def getTable(name: String): org.apache.calcite.schema.Table = {
-rootSchema.getTable(name)
+var table = rootSchema.getTable(name)
+
+// check if the table exists in external catalogs
+if (table == null) {
+  val (externalSchema, externalTableName) = resolveExternalTable(name)
+  if (externalSchema != null) {
+table = externalSchema.getTable(externalTableName)
+  }
+}
+
+return table
 
 Review comment:
   Scala discourages the use of `return`. The result of the last expression of 
a function is its return value.
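
For illustration, this is how the reviewed `isRegistered` could look without `return` and without the mutable `var` — a sketch only, reusing the members shown in the diff above (`rootSchema`, `resolveExternalTable`):

{code:scala}
// Sketch, not the merged code: the if/else expression is the method's result,
// so neither `return` nor a mutable `var` is needed.
protected[flink] def isRegistered(name: String): Boolean = {
  if (rootSchema.getTableNames.contains(name)) {
    true
  } else {
    // fall back to the external catalogs
    val (externalSchema, externalTableName) = resolveExternalTable(name)
    externalSchema != null && externalSchema.getTableNames.contains(externalTableName)
  }
}
{code}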


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] fhueske commented on a change in pull request #6508: [Flink-10079] [table] Support external sink table in the INSERT INTO clause

2018-09-15 Thread GitBox
fhueske commented on a change in pull request #6508: [Flink-10079] [table] 
Support external sink table in the INSERT INTO clause 
URL: https://github.com/apache/flink/pull/6508#discussion_r217896170
 
 

 ##
 File path: 
flink-libraries/flink-table/src/main/scala/org/apache/flink/table/api/TableEnvironment.scala
 ##
 @@ -820,20 +820,59 @@ abstract class TableEnvironment(val config: TableConfig) 
{
 
   /**
 * Checks if a table is registered under the given name.
+* Internal and external catalogs are both checked.
 *
 * @param name The table name to check.
 * @return true, if a table is registered under the name, false otherwise.
 */
   protected[flink] def isRegistered(name: String): Boolean = {
-rootSchema.getTableNames.contains(name)
+var isRegistered = rootSchema.getTableNames.contains(name)
+
+// check if the table exists in external catalogs
+if (!isRegistered) {
+  val (externalSchema, externalTableName) = resolveExternalTable(name)
+  if (externalSchema != null) {
+isRegistered = externalSchema.getTableNames.contains(externalTableName)
+  }
+}
+
+return isRegistered
   }
 
   protected def getTable(name: String): org.apache.calcite.schema.Table = {
-rootSchema.getTable(name)
+var table = rootSchema.getTable(name)
+
+// check if the table exists in external catalogs
+if (table == null) {
+  val (externalSchema, externalTableName) = resolveExternalTable(name)
+  if (externalSchema != null) {
+table = externalSchema.getTable(externalTableName)
+  }
+}
+
+return table
   }
 
   protected def getRowType(name: String): RelDataType = {
 
 Review comment:
   Method does not seem to be used and can be removed


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] fhueske commented on a change in pull request #6508: [Flink-10079] [table] Support external sink table in the INSERT INTO clause

2018-09-15 Thread GitBox
fhueske commented on a change in pull request #6508: [Flink-10079] [table] 
Support external sink table in the INSERT INTO clause 
URL: https://github.com/apache/flink/pull/6508#discussion_r217895776
 
 

 ##
 File path: 
flink-libraries/flink-table/src/main/scala/org/apache/flink/table/api/TableEnvironment.scala
 ##
 @@ -820,20 +820,59 @@ abstract class TableEnvironment(val config: TableConfig) 
{
 
   /**
 * Checks if a table is registered under the given name.
+* Internal and external catalogs are both checked.
 *
 * @param name The table name to check.
 * @return true, if a table is registered under the name, false otherwise.
 */
   protected[flink] def isRegistered(name: String): Boolean = {
-rootSchema.getTableNames.contains(name)
+var isRegistered = rootSchema.getTableNames.contains(name)
+
+// check if the table exists in external catalogs
+if (!isRegistered) {
+  val (externalSchema, externalTableName) = resolveExternalTable(name)
+  if (externalSchema != null) {
+isRegistered = externalSchema.getTableNames.contains(externalTableName)
+  }
+}
+
+return isRegistered
   }
 
   protected def getTable(name: String): org.apache.calcite.schema.Table = {
-rootSchema.getTable(name)
+var table = rootSchema.getTable(name)
 
 Review comment:
   use `val` when possible


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] fhueske commented on a change in pull request #6508: [Flink-10079] [table] Support external sink table in the INSERT INTO clause

2018-09-15 Thread GitBox
fhueske commented on a change in pull request #6508: [Flink-10079] [table] 
Support external sink table in the INSERT INTO clause 
URL: https://github.com/apache/flink/pull/6508#discussion_r217896469
 
 

 ##
 File path: 
flink-libraries/flink-table/src/main/scala/org/apache/flink/table/api/TableEnvironment.scala
 ##
 @@ -820,20 +820,59 @@ abstract class TableEnvironment(val config: TableConfig) 
{
 
   /**
 * Checks if a table is registered under the given name.
+* Internal and external catalogs are both checked.
 *
 * @param name The table name to check.
 * @return true, if a table is registered under the name, false otherwise.
 */
   protected[flink] def isRegistered(name: String): Boolean = {
-rootSchema.getTableNames.contains(name)
+var isRegistered = rootSchema.getTableNames.contains(name)
+
+// check if the table exists in external catalogs
+if (!isRegistered) {
+  val (externalSchema, externalTableName) = resolveExternalTable(name)
+  if (externalSchema != null) {
+isRegistered = externalSchema.getTableNames.contains(externalTableName)
+  }
+}
+
+return isRegistered
   }
 
   protected def getTable(name: String): org.apache.calcite.schema.Table = {
-rootSchema.getTable(name)
+var table = rootSchema.getTable(name)
+
+// check if the table exists in external catalogs
+if (table == null) {
+  val (externalSchema, externalTableName) = resolveExternalTable(name)
+  if (externalSchema != null) {
+table = externalSchema.getTable(externalTableName)
+  }
+}
+
+return table
   }
 
   protected def getRowType(name: String): RelDataType = {
-rootSchema.getTable(name).getRowType(typeFactory)
+val table = getTable(name)
+if (table != null) {
+  table.getRowType(typeFactory)
+}
+
+return null
+  }
+
+  protected def resolveExternalTable(name: String): (SchemaPlus, String) = {
 
 Review comment:
   this function can be inlined into `getTable()`


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] fhueske commented on a change in pull request #6508: [Flink-10079] [table] Support external sink table in the INSERT INTO clause

2018-09-15 Thread GitBox
fhueske commented on a change in pull request #6508: [Flink-10079] [table] 
Support external sink table in the INSERT INTO clause 
URL: https://github.com/apache/flink/pull/6508#discussion_r217897374
 
 

 ##
 File path: 
flink-libraries/flink-table/src/main/scala/org/apache/flink/table/api/TableEnvironment.scala
 ##
 @@ -820,20 +820,59 @@ abstract class TableEnvironment(val config: TableConfig) 
{
 
   /**
 * Checks if a table is registered under the given name.
+* Internal and external catalogs are both checked.
 *
 * @param name The table name to check.
 * @return true, if a table is registered under the name, false otherwise.
 */
   protected[flink] def isRegistered(name: String): Boolean = {
-rootSchema.getTableNames.contains(name)
+var isRegistered = rootSchema.getTableNames.contains(name)
+
+// check if the table exists in external catalogs
+if (!isRegistered) {
+  val (externalSchema, externalTableName) = resolveExternalTable(name)
+  if (externalSchema != null) {
+isRegistered = externalSchema.getTableNames.contains(externalTableName)
+  }
+}
+
+return isRegistered
   }
 
   protected def getTable(name: String): org.apache.calcite.schema.Table = {
 
 Review comment:
   The method should return `Option[org.apache.calcite.schema.Table]`. We need 
to adjust `StreamTableEnvironment` and `BatchTableEnvironment` for this, and not 
wrap the result of the `getTable()` calls in `Option` a second time.
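
A sketch of the suggested signature, again reusing the members from the diff above; `Option` encodes the "not found" case instead of `null`:

{code:scala}
// Sketch, not the merged code: callers map over or pattern-match on the Option
// instead of checking for null.
protected def getTable(name: String): Option[org.apache.calcite.schema.Table] = {
  Option(rootSchema.getTable(name)).orElse {
    // fall back to the external catalogs
    val (externalSchema, externalTableName) = resolveExternalTable(name)
    Option(externalSchema).flatMap(s => Option(s.getTable(externalTableName)))
  }
}
{code}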


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Closed] (FLINK-8855) SQL client result serving gets stuck in result-mode=table

2018-09-15 Thread Fabian Hueske (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Hueske closed FLINK-8855.

   Resolution: Cannot Reproduce
Fix Version/s: (was: 1.5.5)
   1.7.0
   1.6.1

Cannot reproduce the issue anymore. 
It was probably fixed by FLINK-10192 or by limiting the result size in FLINK-8686.

> SQL client result serving gets stuck in result-mode=table
> -
>
> Key: FLINK-8855
> URL: https://issues.apache.org/jira/browse/FLINK-8855
> Project: Flink
>  Issue Type: Bug
>  Components: Table API & SQL
>Affects Versions: 1.5.0
>Reporter: Fabian Hueske
>Priority: Major
> Fix For: 1.6.1, 1.7.0
>
>
> The result serving of a query in {{result-mode=table}} gets stuck after some 
> time when serving an updating result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8418) Kafka08ITCase.testStartFromLatestOffsets() times out on Travis

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8418:
-
Fix Version/s: 1.6.2

> Kafka08ITCase.testStartFromLatestOffsets() times out on Travis
> --
>
> Key: FLINK-8418
> URL: https://issues.apache.org/jira/browse/FLINK-8418
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector, Tests
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Tzu-Li (Gordon) Tai
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> Instance: https://travis-ci.org/kl0u/flink/builds/327733085



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10309) Cancel flink job occurs java.net.ConnectException

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10309:
--
Fix Version/s: (was: 1.6.1)

> Cancel flink job occurs java.net.ConnectException
> -
>
> Key: FLINK-10309
> URL: https://issues.apache.org/jira/browse/FLINK-10309
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination, REST
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: vinoyang
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The problem occurs when using the YARN per-job detached mode. Trying to 
> cancel with a savepoint fails with the following exception before being able to 
> retrieve the savepoint path:
> Exception stack trace:
> {code:java}
> org.apache.flink.util.FlinkException: Could not cancel job .
>         at 
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:585)
>         at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
>         at 
> org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:577)
>         at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1034)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not 
> complete the operation. Number of retries has been exhausted.
>         at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>         at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>         at 
> org.apache.flink.client.program.rest.RestClusterClient.cancelWithSavepoint(RestClusterClient.java:398)
>         at 
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:583)
>         ... 6 more
> Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: 
> Could not complete the operation. Number of retries has been exhausted.
>         at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
>         at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>         at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>         at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>         at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>         at 
> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:274)
>         at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
>         at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
>         at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
>         at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
>         at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268)
>         at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284)
>         at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>         at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>         ... 1 more
> Caused by: java.util.concurrent.CompletionException: 
> java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
>         at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>         at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>         at 
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
>         at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>         ... 16 more
> Caused by: 

[jira] [Updated] (FLINK-9749) Rework Bucketing Sink

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9749:
-
Fix Version/s: (was: 1.6.1)

> Rework Bucketing Sink
> -
>
> Key: FLINK-9749
> URL: https://issues.apache.org/jira/browse/FLINK-9749
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming Connectors
>Reporter: Stephan Ewen
>Assignee: Kostas Kloudas
>Priority: Major
> Fix For: 1.6.2
>
>
> The BucketingSink has a series of deficits at the moment.
> Due to the long list of issues, I would suggest adding a new 
> StreamingFileSink with a new and cleaner design.
> h3. Encoders, Parquet, ORC
>  - It only efficiently supports row-wise data formats (avro, json, sequence 
> files).
>  - Efforts to add (columnar) compression for blocks of data are inefficient, 
> because blocks cannot span checkpoints due to persistence-on-checkpoint.
>  - The encoders are part of the {{flink-connector-filesystem}} project, 
> rather than in orthogonal format projects. This blows up the dependencies of 
> the {{flink-connector-filesystem}} project. As an example, the 
> rolling file sink has dependencies on Hadoop and Avro, which messes up 
> dependency management.
> h3. Use of FileSystems
>  - The BucketingSink works only on Hadoop's FileSystem abstraction, does not 
> support Flink's own FileSystem abstraction, and cannot work with the packaged 
> S3, maprfs, and swift file systems.
>  - The sink hence needs Hadoop as a dependency
>  - The sink relies on "trying out" whether truncation works, which requires 
> write access to the users working directory
>  - The sink relies on enumerating and counting files, rather than maintaining 
> its own state, which makes it less efficient.
> h3. Correctness and Efficiency on S3
>  - The BucketingSink relies on strong consistency in the file enumeration, 
> hence may work incorrectly on S3.
>  - The BucketingSink relies on persisting streams at intermediate points. 
> This is not working properly on S3, hence there may be data loss on S3.
> h3. .valid-length companion file
>  - The valid length file makes it hard for consumers of the data and should 
> be dropped
> We track this design in a series of sub issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10291) Generate JobGraph with fixed/configurable JobID in StandaloneJobClusterEntrypoint

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10291:
--
Fix Version/s: (was: 1.6.1)

> Generate JobGraph with fixed/configurable JobID in 
> StandaloneJobClusterEntrypoint
> -
>
> Key: FLINK-10291
> URL: https://issues.apache.org/jira/browse/FLINK-10291
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Assignee: vinoyang
>Priority: Critical
> Fix For: 1.7.0, 1.6.2
>
>
> The {{StandaloneJobClusterEntrypoint}} currently generates the {{JobGraph}} 
> from the user code when being started. Due to the nature of how the 
> {{JobGraph}} is generated, it will get a random {{JobID}} assigned. This is 
> problematic in case of a failover because then, the {{JobMaster}} won't be 
> able to detect the checkpoints. In order to solve this problem, we need to 
> either fix the {{JobID}} assignment or make it configurable.
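
A sketch of the configurable-JobID half of that idea (illustrative only, not the implemented solution): derive the JobID from a configured value instead of letting the JobGraph generator pick a random one.

{code:scala}
import org.apache.flink.api.common.JobID

// Sketch only: a fixed JobID parsed from a configured 32-character hex string,
// so the same ID survives a failover of the StandaloneJobClusterEntrypoint.
object FixedJobIdExample {
  def fixedJobId(configuredHex: String): JobID =
    JobID.fromHexString(configuredHex)

  // e.g. fixedJobId("00000000000000000000000000000001")
}
{code}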



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7991) Cleanup kafka example jar filters

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7991:
-
Fix Version/s: 1.6.2

> Cleanup kafka example jar filters
> -
>
> Key: FLINK-7991
> URL: https://issues.apache.org/jira/browse/FLINK-7991
> Project: Flink
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Trivial
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The kafka example jar shading configuration includes the following files:
> {code}
> org/apache/flink/streaming/examples/kafka/**
> org/apache/flink/streaming/**
> org/apache/kafka/**
> org/apache/curator/**
> org/apache/zookeeper/**
> org/apache/jute/**
> org/I0Itec/**
> jline/**
> com/yammer/**
> kafka/**
> {code}
> Problems:
> * the inclusion of org.apache.flink.streaming causes large parts of the API 
> (and runtime...) to be included in the jar, along with the _Twitter_ source and 
> *all* other examples
> * the yammer, jline, I0Itec, jute, zookeeper and curator inclusions are 
> ineffective; none of these classes show up in the example jar



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8452) BucketingSinkTest.testNonRollingAvroKeyValueWithCompressionWriter instable on Travis

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8452:
-
Fix Version/s: 1.6.2

> BucketingSinkTest.testNonRollingAvroKeyValueWithCompressionWriter instable on 
> Travis
> 
>
> Key: FLINK-8452
> URL: https://issues.apache.org/jira/browse/FLINK-8452
> Project: Flink
>  Issue Type: Bug
>  Components: DataStream API, Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The {{BucketingSinkTest.testNonRollingAvroKeyValueWithCompressionWriter}} 
> seems to be instable on Travis:
>  
> https://travis-ci.org/tillrohrmann/flink/jobs/330261310



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8902) Re-scaling job sporadically fails with KeeperException

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8902:
-
Fix Version/s: (was: 1.6.1)

> Re-scaling job sporadically fails with KeeperException
> --
>
> Key: FLINK-8902
> URL: https://issues.apache.org/jira/browse/FLINK-8902
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.0, 1.6.0
> Environment: Commit: 80020cb
> Hadoop: 2.8.3
> YARN
>  
>Reporter: Gary Yao
>Priority: Critical
>  Labels: flip6
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> *Description*
>  Re-scaling a job with {{bin/flink modify -p }} sporadically 
> fails with a {{KeeperException}}
> *Steps to reproduce*
>  # Submit job to Flink cluster with flip6 enabled running on YARN (session 
> mode).
>  # Re-scale job (5-20 times)
> *Stacktrace (client)*
> {noformat}
> org.apache.flink.util.FlinkException: Could not rescale job 
> 61e2e99db2e959ebd94e40f9c5e816bc.
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$modify$8(CliFrontend.java:766)
>   at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:954)
>   at org.apache.flink.client.cli.CliFrontend.modify(CliFrontend.java:757)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1037)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1095)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1095)
> Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmaster.exceptions.JobModificationException: Could 
> not restore from temporary rescaling savepoint. This might indicate that the 
> savepoint 
> hdfs://172.31.33.72:9000/flinkha/savepoints/savepoint-61e2e9-fdb3d05a0035 got 
> corrupted. Deleting this savepoint as a precaution.
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.lambda$rescaleOperators$3(JobMaster.java:525)
>   at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
>   at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
>   at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:295)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:150)
>   at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleMessage(FencedAkkaRpcActor.java:66)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
>   at 
> akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>   at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>   at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: 
> org.apache.flink.runtime.jobmaster.exceptions.JobModificationException: Could 
> not restore from temporary rescaling savepoint. This might indicate that the 
> savepoint 
> hdfs://172.31.33.72:9000/flinkha/savepoints/savepoint-61e2e9-fdb3d05a0035 got 
> corrupted. Deleting this savepoint as a precaution.
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.lambda$restoreExecutionGraphFromRescalingSavepoint$17(JobMaster.java:1317)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>   at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at 

[jira] [Updated] (FLINK-9525) Missing META-INF/services/*FileSystemFactory in flink-hadoop-fs module

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9525:
-
Fix Version/s: (was: 1.6.1)

> Missing META-INF/services/*FileSystemFactory in flink-hadoop-fs module
> --
>
> Key: FLINK-9525
> URL: https://issues.apache.org/jira/browse/FLINK-9525
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem
>Affects Versions: 1.4.0, 1.5.0, 1.6.0
>Reporter: Hai Zhou
>Assignee: Hai Zhou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
> Attachments: wx20180605-142...@2x.png
>
>
> If the Flink job dependencies include `hadoop-common` and `hadoop-hdfs`, a 
> runtime error will be thrown,
> like in this case: 
> [https://stackoverflow.com/questions/47890596/java-util-serviceconfigurationerror-org-apache-hadoop-fs-filesystem-provider-o].
> The root cause: 
> see {{org.apache.flink.core.fs.FileSystem}}.
> This class loads all available file system factories via 
> {{ServiceLoader.load(FileSystemFactory.class)}}. 
> Since the {{META-INF/services/org.apache.flink.core.fs.FileSystemFactory}} 
> file in the classpath does not contain an 
> `org.apache.flink.runtime.fs.hdfs.HadoopFsFactory` entry, 
> only the {{LocalFileSystemFactory}} ends up being loaded.
> See the attached screenshot for more error messages.
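
The loading mechanism described above can be observed directly; a small sketch (illustrative only):

{code:scala}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

import org.apache.flink.core.fs.FileSystemFactory

// Sketch only: ServiceLoader sees exactly the factories listed in
// META-INF/services/org.apache.flink.core.fs.FileSystemFactory on the classpath.
// With the entry for HadoopFsFactory missing, only LocalFileSystemFactory shows up.
object ListFileSystemFactories {
  def main(args: Array[String]): Unit = {
    ServiceLoader.load(classOf[FileSystemFactory]).asScala
      .foreach(f => println(f.getClass.getName))
  }
}
{code}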



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9010) NoResourceAvailableException with FLIP-6

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9010:
-
Fix Version/s: 1.6.2

> NoResourceAvailableException with FLIP-6 
> -
>
> Key: FLINK-9010
> URL: https://issues.apache.org/jira/browse/FLINK-9010
> Project: Flink
>  Issue Type: Bug
>  Components: ResourceManager
>Affects Versions: 1.5.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: flip-6
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> I was trying to run a bigger program with 400 slots (100 TMs, 2 slots each) 
> with FLIP-6 mode and a checkpointing interval of 1000 and got the following 
> exception:
> {code}
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000101 - Remaining pending container 
> requests: 302
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000101 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,154 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,155 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
>  to 
> hdfs://ip-172-31-1-91.eu-west-1.compute.internal:8020/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml
> 2018-03-16 03:41:20,165 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Prepared local resource for modified yaml: resource { scheme: 
> "hdfs" host: "ip-172-31-1-91.eu-west-1.compute.internal" port: 8020 file: 
> "/user/hadoop/.flink/application_1521038088305_0257/3cd0c7d7-ccc1-4680-83a5-54e64dd637bc-taskmanager-conf.yaml"
>  } size: 595 timestamp: 1521171680164 type: FILE visibility: APPLICATION
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Creating container launch context for TaskManagers
> 2018-03-16 03:41:20,168 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Starting TaskManagers with command: $JAVA_HOME/bin/java 
> -Xms5120m -Xmx5120m -XX:MaxDirectMemorySize=3072m  
> -Dlog.file=/taskmanager.log 
> -Dlogback.configurationFile=file:./logback.xml 
> -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> 
> /taskmanager.out 2> /taskmanager.err
> 2018-03-16 03:41:20,176 INFO  
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
> Opening proxy : ip-172-31-3-221.eu-west-1.compute.internal:8041
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Received new container: 
> container_1521038088305_0257_01_000102 - Remaining pending container 
> requests: 301
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TaskExecutor container_1521038088305_0257_01_000102 will be 
> started with container size 8192 MB, JVM heap size 5120 MB, JVM direct memory 
> limit 3072 MB
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote keytab principal obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote yarn conf path obtained null
> 2018-03-16 03:41:20,180 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - TM:remote krb5 path obtained null
> 2018-03-16 03:41:20,181 INFO  org.apache.flink.yarn.Utils 
>   - Copying from 
> file:/mnt/yarn/usercache/hadoop/appcache/application_1521038088305_0257/container_1521038088305_0257_01_01/6766be70-82f7-4999-a371-11c27527fb6e-taskmanager-conf.yaml
>  to 
> 

[jira] [Updated] (FLINK-9680) Reduce heartbeat timeout for E2E tests

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9680:
-
Fix Version/s: 1.6.2

> Reduce heartbeat timeout for E2E tests
> --
>
> Key: FLINK-9680
> URL: https://issues.apache.org/jira/browse/FLINK-9680
> Project: Flink
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> Several end-to-end tests shut down job- and taskmanagers and wait for them 
> to come back up before continuing the testing process.
> {{heartbeat.timeout}} controls how long a container has to be unreachable to 
> be considered lost. The default for this option is 50 seconds, causing 
> significant idle times during the tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7915) Verify functionality of RollingSinkSecuredITCase

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7915:
-
Fix Version/s: (was: 1.6.1)

> Verify functionality of RollingSinkSecuredITCase
> 
>
> Key: FLINK-7915
> URL: https://issues.apache.org/jira/browse/FLINK-7915
> Project: Flink
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> I recently stumbled across the test {{RollingSinkSecuredITCase}}, which will 
> only be executed for Hadoop versions {{>= 3}}. When trying to run it from 
> IntelliJ I immediately ran into a class-not-found exception for 
> {{jdbm.helpers.CachePolicy}}, and even after fixing this problem, the test 
> would not run because it complained about wrong security settings.
> I think we should check whether this test works at all and, if not, 
> remove it or replace it with something that works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9457) Cancel container requests when cancelling pending slot allocations

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9457:
-
Fix Version/s: (was: 1.6.1)

> Cancel container requests when cancelling pending slot allocations
> --
>
> Key: FLINK-9457
> URL: https://issues.apache.org/jira/browse/FLINK-9457
> Project: Flink
>  Issue Type: Improvement
>  Components: ResourceManager, YARN
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> When cancelling a pending slot allocation request on the {{ResourceManager}}, 
> then we should also check whether we still need all the requested containers. 
> If it turns out that we no longer need them, then we should try to cancel the 
> unnecessary container requests. That way Flink will be, for example, a better 
> Yarn citizen which quickly releases resources and resource requests if no 
> longer needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8408) YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n unstable on Travis

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8408:
-
Fix Version/s: (was: 1.6.1)

> YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n unstable on Travis
> --
>
> Key: FLINK-8408
> URL: https://issues.apache.org/jira/browse/FLINK-8408
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The {{YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3n}} is unstable 
> on Travis.
> https://travis-ci.org/tillrohrmann/flink/jobs/327216460



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8820) FlinkKafkaConsumer010 reads too many bytes

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8820:
-
Fix Version/s: 1.6.2

> FlinkKafkaConsumer010 reads too many bytes
> --
>
> Key: FLINK-8820
> URL: https://issues.apache.org/jira/browse/FLINK-8820
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector
>Affects Versions: 1.4.0
>Reporter: Fabian Hueske
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> A user reported that the FlinkKafkaConsumer010 very rarely consumes too many 
> bytes, i.e., the returned message is too large. The application is running 
> for about a year and the problem started to occur after upgrading to Flink 
> 1.4.0.
> The user made a good effort in debugging the problem but was not able to 
> reproduce it in a controlled environment. It seems that the data is correctly 
> stored in Kafka.
> Here's the thread on the user mailing list with a detailed 
> description of the problem and the analysis so far: 
> https://lists.apache.org/thread.html/1d62f616d275e9e23a5215ddf7f5466051be7ea96897d827232fcb4e@%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10241) Reduce performance/stability impact of latency metrics

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10241:
--
Fix Version/s: (was: 1.6.1)

> Reduce performance/stability impact of latency metrics
> --
>
> Key: FLINK-10241
> URL: https://issues.apache.org/jira/browse/FLINK-10241
> Project: Flink
>  Issue Type: Improvement
>  Components: Metrics
>Affects Versions: 1.5.0, 1.6.0, 1.7.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> Umbrella issue for performance/stability improvements around the latency 
> metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-4052) Unstable test ConnectionUtilsTest

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-4052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-4052:
-
Fix Version/s: 1.6.2

> Unstable test ConnectionUtilsTest
> -
>
> Key: FLINK-4052
> URL: https://issues.apache.org/jira/browse/FLINK-4052
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.0.2, 1.3.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The error is the following:
> {code}
> ConnectionUtilsTest.testReturnLocalHostAddressUsingHeuristics:59 
> expected: but 
> was:
> {code}
> The probable cause for the failure is that the test tries to select an unused 
> closed port (from the ephemeral range), and then assumes that all connections 
> to that port fail.
> If a concurrent test actually uses that port, connections to the port will 
> succeed.
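
A sketch of the racy pattern described above (illustrative, not the test's actual code):

{code:scala}
import java.net.ServerSocket

// Sketch only: a port that is free at the time of the check can be taken by a
// concurrent test before the assertion runs, so "connections to it must fail"
// is not guaranteed.
object FreePortRace {
  def pickCurrentlyFreePort(): Int = {
    val socket = new ServerSocket(0) // the OS assigns a free ephemeral port
    try socket.getLocalPort
    finally socket.close()           // ...and releases it: anyone may bind it now
  }
}
{code}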



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8336) YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3 test instability

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8336:
-
Fix Version/s: 1.6.2

> YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3 test instability
> ---
>
> Key: FLINK-8336
> URL: https://issues.apache.org/jira/browse/FLINK-8336
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, Tests, YARN
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The {{YarnFileStageTestS3ITCase.testRecursiveUploadForYarnS3}} fails on 
> Travis. I suspect that this has something to do with the consistency 
> guarantees S3 gives us.
> https://travis-ci.org/tillrohrmann/flink/jobs/323930297



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9646) ExecutionGraphCoLocationRestartTest.testConstraintsAfterRestart failed on Travis

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9646:
-
Fix Version/s: 1.6.2

> ExecutionGraphCoLocationRestartTest.testConstraintsAfterRestart failed on 
> Travis
> 
>
> Key: FLINK-9646
> URL: https://issues.apache.org/jira/browse/FLINK-9646
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> {{ExecutionGraphCoLocationRestartTest.testConstraintsAfterRestart}} fails on 
> Travis.
> https://api.travis-ci.org/v3/job/395394863/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9481) FlinkKafkaProducer011ITCase deadlock in initializeState

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9481:
-
Fix Version/s: 1.6.2

> FlinkKafkaProducer011ITCase deadlock in initializeState 
> 
>
> Key: FLINK-9481
> URL: https://issues.apache.org/jira/browse/FLINK-9481
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector
>Affects Versions: 1.5.0, 1.4.2
>Reporter: Piotr Nowojski
>Priority: Critical
> Fix For: 1.7.0, 1.6.2
>
> Attachments: log.txt.zip
>
>
> FlinkKafkaProducer011ITCase.testRestoreToCheckpointAfterExceedingProducersPool(FlinkKafkaProducer011ITCase.java:152)
>  deadlocked on travis:
>  
> {noformat}
> "main" #1 prio=5 os_prio=0 tid=0x7fa36800a000 nid=0x5b85 waiting on 
> condition [0x7fa371c4d000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0xf54856c8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
>   at 
> org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.initTransactions(FlinkKafkaProducer.java:123)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.abortTransactions(FlinkKafkaProducer011.java:919)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.cleanUpUserContext(FlinkKafkaProducer011.java:903)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.finishRecoveringContext(FlinkKafkaProducer011.java:891)
>   at 
> org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.initializeState(TwoPhaseCommitSinkFunction.java:338)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.initializeState(FlinkKafkaProducer011.java:867)
>   at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
>   at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
>   at 
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:254)
>   at 
> org.apache.flink.streaming.util.AbstractStreamOperatorTestHarness.initializeState(AbstractStreamOperatorTestHarness.java:424)
>   at 
> org.apache.flink.streaming.util.AbstractStreamOperatorTestHarness.initializeState(AbstractStreamOperatorTestHarness.java:346)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testRestoreToCheckpointAfterExceedingProducersPool(FlinkKafkaProducer011ITCase.java:152){noformat}
>  
> https://api.travis-ci.org/v3/job/386021917/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9253) Make buffer count per InputGate always #channels*buffersPerChannel + ExclusiveBuffers

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9253:
-
Fix Version/s: (was: 1.6.1)

> Make buffer count per InputGate always #channels*buffersPerChannel + 
> ExclusiveBuffers
> -
>
> Key: FLINK-9253
> URL: https://issues.apache.org/jira/browse/FLINK-9253
> Project: Flink
>  Issue Type: Sub-task
>  Components: Network
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The credit-based flow control path assigns exclusive buffers only to remote 
> channels (which makes sense since local channels don't use any own buffers). 
> However, this is somewhat opaque with respect to how much data may be in 
> buffers, since this depends on the actual schedule of the job and not on the 
> job graph.
> By adapting the floating buffers to use a maximum of 
> {{#channels*buffersPerChannel + floatingBuffersPerGate - #exclusiveBuffers}}, 
> we would be channel-type agnostic and keep the old behaviour.
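To illustrate the arithmetic only (placeholder names, not Flink API): with the proposed cap, the total buffer budget per input gate stays at {{#channels*buffersPerChannel + floatingBuffersPerGate}} regardless of how many channels are remote and therefore hold exclusive buffers.

{code:java}
public class FloatingBufferBudget {

    // floating = #channels * buffersPerChannel + floatingBuffersPerGate - #exclusiveBuffers
    static int floatingBuffers(int numChannels, int buffersPerChannel,
                               int floatingBuffersPerGate, int exclusiveBuffers) {
        return numChannels * buffersPerChannel + floatingBuffersPerGate - exclusiveBuffers;
    }

    public static void main(String[] args) {
        // 10 channels, 2 exclusive buffers per channel, 8 floating buffers per gate.
        // All channels remote: 20 exclusive buffers -> 8 floating buffers (total 28).
        System.out.println(floatingBuffers(10, 2, 8, 20)); // 8
        // All channels local: 0 exclusive buffers -> 28 floating buffers (total 28).
        System.out.println(floatingBuffers(10, 2, 8, 0));  // 28
    }
}
{code}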



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10312) Wrong / missing exception when submitting job

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10312:
--
Fix Version/s: 1.6.2

> Wrong / missing exception when submitting job
> -
>
> Key: FLINK-10312
> URL: https://issues.apache.org/jira/browse/FLINK-10312
> Project: Flink
>  Issue Type: Bug
>  Components: JobManager
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Stephan Ewen
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> h3. Problem
> When submitting a job that cannot be created / initialized on the JobManager, 
> there is no proper error message. The exception says *"Could not retrieve the 
> execution result. (JobID: 5a7165e1260c6316fa11d2760bd3d49f)"*
> h3. Steps to Reproduce
> Create a streaming job, set a state backend with a non existing file system 
> scheme.
> h3. Full Stack Trace
> {code}
> Submitting a job where instantiation on the JM fails yields this, which seems 
> like a major regression from seeing the actual exception:
> org.apache.flink.client.program.ProgramInvocationException: Could not 
> retrieve the execution result. (JobID: 5a7165e1260c6316fa11d2760bd3d49f)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:260)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
>   at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
>   at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1511)
>   at 
> com.dataartisans.streamledger.examples.simpletrade.SimpleTradeExample.main(SimpleTradeExample.java:98)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:426)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:804)
>   at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$16(CliFrontend.java:1120)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$25(RestClusterClient.java:379)
>   at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>   at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>   at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>   at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$32(FutureUtils.java:213)
>   at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>   at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>   at 
> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
>   at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
>   at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: 

[jira] [Updated] (FLINK-9812) SpanningRecordSerializationTest fails on travis

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9812:
-
Fix Version/s: (was: 1.6.1)

> SpanningRecordSerializationTest fails on travis
> ---
>
> Key: FLINK-9812
> URL: https://issues.apache.org/jira/browse/FLINK-9812
> Project: Flink
>  Issue Type: Bug
>  Components: Network, Tests, Type Serialization System
>Affects Versions: 1.6.0
>Reporter: Chesnay Schepler
>Priority: Critical
> Fix For: 1.7.0, 1.6.2
>
>
> https://travis-ci.org/zentol/flink/jobs/402744191
> {code}
> testHandleMixedLargeRecords(org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest)
>   Time elapsed: 6.113 sec  <<< ERROR!
> java.nio.channels.ClosedChannelException: null
>   at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
>   at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:199)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer$SpanningWrapper.addNextChunkFromMemorySegment(SpillingAdaptiveSpanningRecordDeserializer.java:529)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer$SpanningWrapper.access$200(SpillingAdaptiveSpanningRecordDeserializer.java:431)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.setNextBuffer(SpillingAdaptiveSpanningRecordDeserializer.java:76)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testSerializationRoundTrip(SpanningRecordSerializationTest.java:149)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testSerializationRoundTrip(SpanningRecordSerializationTest.java:115)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializationTest.testHandleMixedLargeRecords(SpanningRecordSerializationTest.java:104)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10241) Reduce performance/stability impact of latency metrics

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10241:
--
Fix Version/s: 1.6.2

> Reduce performance/stability impact of latency metrics
> --
>
> Key: FLINK-10241
> URL: https://issues.apache.org/jira/browse/FLINK-10241
> Project: Flink
>  Issue Type: Improvement
>  Components: Metrics
>Affects Versions: 1.5.0, 1.6.0, 1.7.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> Umbrella issue for performance/stability improvements around the latency 
> metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10260) Confusing log during TaskManager registration

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10260:
--
Fix Version/s: (was: 1.6.1)

> Confusing log during TaskManager registration
> -
>
> Key: FLINK-10260
> URL: https://issues.apache.org/jira/browse/FLINK-10260
> Project: Flink
>  Issue Type: Bug
>  Components: ResourceManager
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Stephan Ewen
>Priority: Major
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> During startup, when TaskManagers register, I see a lot of confusing log 
> lines.
> The below case happened during startup of a cloud setup where TaskManagers 
> took a varying amount of time to start and might have started before the 
> JobManager.
> {code}
> -- Logs begin at Thu 2018-08-30 14:51:58 UTC, end at Thu 2018-08-30 14:55:39 
> UTC. --
> Aug 30 14:52:52 flink-belgium-1 systemd[1]: Started flink-jobmanager.service.
> -- Subject: Unit flink-jobmanager.service has finished start-up
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> -- 
> -- Unit flink-jobmanager.service has finished starting up.
> -- 
> -- The start-up result is RESULT.
> Aug 30 14:52:52 flink-belgium-1 jobmanager.sh[5416]: used deprecated key 
> `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: Starting 
> standalonesession as a console application on host flink-belgium-1.
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,221 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 
> 
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,222 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  
> Starting StandaloneSessionClusterEntrypoint (Version: 1.6.0, Rev:ff472b4, 
> Date:07.08.2018 @ 13:31:13 UTC)
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,222 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  OS 
> current user: flink
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,718 
> WARN  org.apache.hadoop.util.NativeCodeLoader   - Unable 
> to load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,847 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  
> Current Hadoop/Kerberos user: flink
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,848 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  JVM: 
> OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,848 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  
> Maximum heap size: 1963 MiBytes
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,848 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  
> JAVA_HOME: (not set)
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,853 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  Hadoop 
> version: 2.8.3
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,853 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  JVM 
> Options:
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,853 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 
> -Xms2048m
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,853 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 
> -Xmx2048m
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,853 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 
> -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,853 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 
> -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,854 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  
> Program Arguments:
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,854 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 
> --configDir
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,854 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint 

[jira] [Updated] (FLINK-4052) Unstable test ConnectionUtilsTest

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-4052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-4052:
-
Fix Version/s: (was: 1.6.1)

> Unstable test ConnectionUtilsTest
> -
>
> Key: FLINK-4052
> URL: https://issues.apache.org/jira/browse/FLINK-4052
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.0.2, 1.3.0
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The error is the following:
> {code}
> ConnectionUtilsTest.testReturnLocalHostAddressUsingHeuristics:59 
> expected: but 
> was:
> {code}
> The probable cause for the failure is that the test tries to select an unused 
> closed port (from the ephemeral range), and then assumes that all connections 
> to that port fail.
> If a concurrent test actually uses that port, connections to the port will 
> succeed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9457) Cancel container requests when cancelling pending slot allocations

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9457:
-
Fix Version/s: 1.6.2

> Cancel container requests when cancelling pending slot allocations
> --
>
> Key: FLINK-9457
> URL: https://issues.apache.org/jira/browse/FLINK-9457
> Project: Flink
>  Issue Type: Improvement
>  Components: ResourceManager, YARN
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> When cancelling a pending slot allocation request on the {{ResourceManager}}, 
> then we should also check whether we still need all the requested containers. 
> If it turns out that we no longer need them, then we should try to cancel the 
> unnecessary container requests. That way Flink will be, for example, a better 
> Yarn citizen which quickly releases resources and resource requests if no 
> longer needed.
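A rough, hypothetical sketch of the bookkeeping this implies (the names are illustrative, not the actual {{ResourceManager}} API): when a pending slot request is cancelled, compare the outstanding container requests with the slot requests that still need a container and cancel the surplus.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

public class PendingContainerRequests {

    // Each queued Runnable cancels one outstanding container request against YARN.
    private final Deque<Runnable> outstandingContainerRequests = new ArrayDeque<>();
    private int pendingSlotRequests;

    void onContainerRequested(Runnable cancelAction) {
        outstandingContainerRequests.push(cancelAction);
    }

    void onSlotRequestRegistered() {
        pendingSlotRequests++;
    }

    void onSlotRequestCancelled() {
        pendingSlotRequests--;
        // Release container requests that no pending slot request needs anymore.
        while (!outstandingContainerRequests.isEmpty()
                && outstandingContainerRequests.size() > pendingSlotRequests) {
            outstandingContainerRequests.pop().run();
        }
    }
}
{code}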



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9582) dist assemblies access jars outside of flink-dist

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9582:
-
Fix Version/s: 1.6.2

> dist assemblies access jars outside of flink-dist
> -
>
> Key: FLINK-9582
> URL: https://issues.apache.org/jira/browse/FLINK-9582
> Project: Flink
>  Issue Type: Bug
>  Components: Build System
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
> Fix For: 1.7.0, 1.6.2
>
>
> The flink-dist assemblies access compiled jars outside of flink-dist, for 
> example like this:
> {code:java}
> ../flink-libraries/flink-cep/target/flink-cep_${scala.binary.version}-${project.version}.jar{code}
> As usual, accessing files outside of the module that you're building is a 
> terrible idea.
> It's brittle as it relies on paths that aren't guaranteed to be stable, and 
> requires these modules to be built beforehand. There's also an inherent 
> potential for dependency conflicts when building flink-dist on its own, as 
> maven may download certain snapshot artifacts, but the assemblies ignore 
> these and bundle jars present in Flink.
> We can use the maven-dependency plugin to copy required dependencies into the 
> {{target}} directory of flink-dist, and point the assemblies to these jars.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10251) Handle oversized response messages in AkkaRpcActor

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10251:
--
Fix Version/s: 1.6.2

> Handle oversized response messages in AkkaRpcActor
> --
>
> Key: FLINK-10251
> URL: https://issues.apache.org/jira/browse/FLINK-10251
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The {{AkkaRpcActor}} should check that an RPC response which is sent to a 
> remote sender does not exceed the maximum framesize of the underlying 
> {{ActorSystem}}. If it does, we should fail fast instead. We can achieve this 
> by serializing the response and sending the serialized byte array.
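A sketch under assumptions (not the actual {{AkkaRpcActor}} code): serializing the response on the sender side makes its size known before it reaches the transport, so an oversized payload can be rejected with a descriptive exception instead of surfacing as a remote timeout.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ResponseSizeCheck {

    static byte[] serializeOrFail(Serializable response, long maxFrameSize) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(response);
        }
        byte[] payload = bos.toByteArray();
        if (payload.length > maxFrameSize) {
            // Fail fast on the sender side with a clear message.
            throw new IOException("RPC response of " + payload.length
                    + " bytes exceeds the maximum frame size of " + maxFrameSize + " bytes");
        }
        // The serialized bytes (rather than the object) are what gets sent to the remote sender.
        return payload;
    }
}
{code}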



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8779) ClassLoaderITCase.testKMeansJobWithCustomClassLoader fails on Travis

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8779:
-
Fix Version/s: (was: 1.6.1)

> ClassLoaderITCase.testKMeansJobWithCustomClassLoader fails on Travis
> 
>
> Key: FLINK-8779
> URL: https://issues.apache.org/jira/browse/FLINK-8779
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The {{ClassLoaderITCase.testKMeansJobWithCustomClassLoader}} fails on Travis 
> by producing no output for 300s. This might indicate a test instability or a 
> problem with Flink which was recently introduced.
> https://api.travis-ci.org/v3/job/344427688/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10247) Run MetricQueryService in separate thread pool

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10247:
--
Fix Version/s: (was: 1.6.1)

> Run MetricQueryService in separate thread pool
> --
>
> Key: FLINK-10247
> URL: https://issues.apache.org/jira/browse/FLINK-10247
> Project: Flink
>  Issue Type: Sub-task
>  Components: Metrics
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Assignee: Shimin Yang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> In order to make the {{MetricQueryService}} run independently of the main 
> Flink components, it should get its own dedicated thread pool assigned.
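A minimal sketch, assuming a plain {{ExecutorService}} is sufficient to isolate the metric query handling from the main components (names are illustrative only):

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DedicatedMetricThreadPool {

    public static void main(String[] args) {
        // A dedicated, named daemon thread so metric queries neither starve
        // nor are starved by the main RPC threads.
        ExecutorService metricQueryPool = Executors.newFixedThreadPool(1, runnable -> {
            Thread t = new Thread(runnable, "metric-query-service");
            t.setDaemon(true);
            return t;
        });

        metricQueryPool.submit(() -> System.out.println("serving metric query"));
        metricQueryPool.shutdown();
    }
}
{code}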



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9788) ExecutionGraph Inconsistency prevents Job from recovering

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9788:
-
Fix Version/s: 1.6.2

> ExecutionGraph Inconsistency prevents Job from recovering
> -
>
> Key: FLINK-9788
> URL: https://issues.apache.org/jira/browse/FLINK-9788
> Project: Flink
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 1.6.0
> Environment: Rev: 4a06160
> Hadoop 2.8.3
>Reporter: Gary Yao
>Priority: Critical
> Fix For: 1.7.0, 1.6.2
>
> Attachments: jobmanager_5000.log
>
>
> Deployment mode: YARN job mode with HA
> After killing many TaskManagers in succession, the ExecutionGraph ran into an 
> inconsistent state, which prevented job recovery. 
> The following stacktrace was logged in the JobManager log several hundred 
> times per second:
> {noformat}
> -08 16:47:18,855 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph 
>- Job General purpose test job (37a794195840700b98feb23e99f7ea24) 
> switched from state RESTARTING to RESTARTING.
> 2018-07-08 16:47:18,856 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Restarting 
> the job General purpose test job (37a794195840700b98feb23e99f7ea24).
> 2018-07-08 16:47:18,857 DEBUG 
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Resetting 
> execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for 
> new execution.
> 2018-07-08 16:47:18,857 WARN  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Failed to 
> restart the job.
> java.lang.IllegalStateException: Cannot reset a vertex that is in 
> non-terminal state CREATED
> at 
> org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
> at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
> at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
> at 
> org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
> at 
> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The resulting jobmanager log file was 4.7 GB in size. Find attached the first 
> 5000 lines of the log file. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10251) Handle oversized response messages in AkkaRpcActor

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10251:
--
Fix Version/s: (was: 1.6.1)

> Handle oversized response messages in AkkaRpcActor
> --
>
> Key: FLINK-10251
> URL: https://issues.apache.org/jira/browse/FLINK-10251
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Priority: Major
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The {{AkkaRpcActor}} should check that an RPC response which is sent to a 
> remote sender does not exceed the maximum framesize of the underlying 
> {{ActorSystem}}. If it does, we should fail fast instead. We can achieve this 
> by serializing the response and sending the serialized byte array.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8899) Submitting YARN job with FLIP-6 may lead to ApplicationAttemptNotFoundException

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8899:
-
Fix Version/s: 1.6.2

> Submitting YARN job with FLIP-6 may lead to 
> ApplicationAttemptNotFoundException
> ---
>
> Key: FLINK-8899
> URL: https://issues.apache.org/jira/browse/FLINK-8899
> Project: Flink
>  Issue Type: Bug
>  Components: ResourceManager, YARN
>Affects Versions: 1.5.0
>Reporter: Nico Kruber
>Priority: Major
>  Labels: flip-6
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> Occasionally, running a simple word count as this
> {code}
> ./bin/flink run -m yarn-cluster -yjm 768 -ytm 3072 -ys 2 -p 20 -c 
> org.apache.flink.streaming.examples.wordcount.WordCount 
> ./examples/streaming/WordCount.jar --input /usr/share/doc/rsync-3.0.6/COPYING
> {code}
> leads to an {{ApplicationAttemptNotFoundException}} in the logs:
> {code}
> 2018-03-08 16:18:08,507 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job Streaming 
> WordCount (df707a3c9817ddf5936efe56d427e2bd) switched from state RUNNING to 
> FINISHED.
> 2018-03-08 16:18:08,508 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping 
> checkpoint coordinator for job df707a3c9817ddf5936efe56d427e2bd
> 2018-03-08 16:18:08,508 INFO  
> org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore  - 
> Shutting down
> 2018-03-08 16:18:08,536 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job 
> df707a3c9817ddf5936efe56d427e2bd reached globally terminal state FINISHED.
> 2018-03-08 16:18:08,611 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>   - Stopping the JobMaster for job Streaming 
> WordCount(df707a3c9817ddf5936efe56d427e2bd).
> 2018-03-08 16:18:08,634 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>   - Close ResourceManager connection 
> dcfdc329d61aae0ace2de26292c8916b: JobManager is shutting down..
> 2018-03-08 16:18:08,634 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Disconnect job manager 
> 0...@akka.tcp://fl...@ip-172-31-2-0.eu-west-1.compute.internal:38555/user/jobmanager_0
>  for job df707a3c9817ddf5936efe56d427e2bd from the resource manager.
> 2018-03-08 16:18:08,664 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Suspending 
> SlotPool.
> 2018-03-08 16:18:08,664 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Stopping 
> SlotPool.
> 2018-03-08 16:18:08,664 INFO  
> org.apache.flink.runtime.jobmaster.JobManagerRunner   - 
> JobManagerRunner already shutdown.
> 2018-03-08 16:18:09,650 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register 
> TaskManager adc8090bdb3f7052943ff86bde7d2a7b at the SlotManager.
> 2018-03-08 16:18:09,654 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Replacing old instance of worker for ResourceID 
> container_1519984124671_0090_01_05
> 2018-03-08 16:18:09,654 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - 
> Unregister TaskManager adc8090bdb3f7052943ff86bde7d2a7b from the SlotManager.
> 2018-03-08 16:18:09,654 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register 
> TaskManager b975dbd16e0fd59c1168d978490a4b76 at the SlotManager.
> 2018-03-08 16:18:09,654 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - The target with resource ID 
> container_1519984124671_0090_01_05 is already been monitored.
> 2018-03-08 16:18:09,992 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register 
> TaskManager 73c258a0dbad236501b8391971c330ba at the SlotManager.
> 2018-03-08 16:18:10,000 INFO  
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - RECEIVED 
> SIGNAL 15: SIGTERM. Shutting down as requested.
> 2018-03-08 16:18:10,028 ERROR 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl  - Exception 
> on heartbeat
> org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: 
> Application attempt appattempt_1519984124671_0090_01 doesn't exist in 
> ApplicationMasterService cache.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:403)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>   at 
> 

[jira] [Updated] (FLINK-9788) ExecutionGraph Inconsistency prevents Job from recovering

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9788:
-
Fix Version/s: (was: 1.6.1)

> ExecutionGraph Inconsistency prevents Job from recovering
> -
>
> Key: FLINK-9788
> URL: https://issues.apache.org/jira/browse/FLINK-9788
> Project: Flink
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 1.6.0
> Environment: Rev: 4a06160
> Hadoop 2.8.3
>Reporter: Gary Yao
>Priority: Critical
> Fix For: 1.7.0, 1.6.2
>
> Attachments: jobmanager_5000.log
>
>
> Deployment mode: YARN job mode with HA
> After killing many TaskManagers in succession, the ExecutionGraph ran into an 
> inconsistent state, which prevented job recovery. 
> The following stacktrace was logged in the JobManager log several hundred 
> times per second:
> {noformat}
> -08 16:47:18,855 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph 
>- Job General purpose test job (37a794195840700b98feb23e99f7ea24) 
> switched from state RESTARTING to RESTARTING.
> 2018-07-08 16:47:18,856 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Restarting 
> the job General purpose test job (37a794195840700b98feb23e99f7ea24).
> 2018-07-08 16:47:18,857 DEBUG 
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Resetting 
> execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for 
> new execution.
> 2018-07-08 16:47:18,857 WARN  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Failed to 
> restart the job.
> java.lang.IllegalStateException: Cannot reset a vertex that is in 
> non-terminal state CREATED
> at 
> org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
> at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
> at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
> at 
> org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
> at 
> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The resulting jobmanager log file was 4.7 GB in size. Find attached the first 
> 5000 lines of the log file. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9485) Improving Flink’s timer management for large state

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9485:
-
Fix Version/s: (was: 1.6.1)

> Improving Flink’s timer management for large state
> --
>
> Key: FLINK-9485
> URL: https://issues.apache.org/jira/browse/FLINK-9485
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing, Streaming
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
> Fix For: 1.6.2
>
>
> See 
> https://docs.google.com/document/d/1XbhJRbig5c5Ftd77d0mKND1bePyTC26Pz04EvxdA7Jc/edit#heading=h.17v0k3363r6q



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7351) test instability in JobClientActorRecoveryITCase#testJobClientRecovery

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-7351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7351:
-
Fix Version/s: 1.6.2

> test instability in JobClientActorRecoveryITCase#testJobClientRecovery
> --
>
> Key: FLINK-7351
> URL: https://issues.apache.org/jira/browse/FLINK-7351
> Project: Flink
>  Issue Type: Bug
>  Components: Job-Submission, Tests
>Affects Versions: 1.3.2, 1.4.0
>Reporter: Nico Kruber
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> On a 16-core VM, the following test failed during {{mvn clean verify}}
> {code}
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.814 sec 
> <<< FAILURE! - in org.apache.flink.runtime.client.JobClientActorRecoveryITCase
> testJobClientRecovery(org.apache.flink.runtime.client.JobClientActorRecoveryITCase)
>   Time elapsed: 21.299 sec  <<< ERROR!
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:933)
> at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:876)
> at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:876)
> at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Not enough free slots available to run the job. You can decrease the operator 
> parallelism or increase the number of slots per TaskManager in the 
> configuration. Resources available to scheduler: Number of instances=0, total 
> number of slots=0, available slots=0
> at 
> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:334)
> at 
> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.allocateSlot(Scheduler.java:139)
> at 
> org.apache.flink.runtime.executiongraph.Execution.allocateSlotForExecution(Execution.java:368)
> at 
> org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:309)
> at 
> org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:596)
> at 
> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:450)
> at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleLazy(ExecutionGraph.java:834)
> at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:814)
> at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:1425)
> at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1372)
> at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1372)
> at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9101) HAQueryableStateRocksDBBackendITCase failed on travis

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9101:
-
Fix Version/s: 1.6.2

> HAQueryableStateRocksDBBackendITCase failed on travis
> -
>
> Key: FLINK-9101
> URL: https://issues.apache.org/jira/browse/FLINK-9101
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State
>Affects Versions: 1.5.0
>Reporter: Chesnay Schepler
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2
>
>
> The test deadlocks on travis.
> https://travis-ci.org/zentol/flink/jobs/358894950



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8779) ClassLoaderITCase.testKMeansJobWithCustomClassLoader fails on Travis

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8779:
-
Fix Version/s: 1.6.2

> ClassLoaderITCase.testKMeansJobWithCustomClassLoader fails on Travis
> 
>
> Key: FLINK-8779
> URL: https://issues.apache.org/jira/browse/FLINK-8779
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The {{ClassLoaderITCase.testKMeansJobWithCustomClassLoader}} fails on Travis 
> by producing no output for 300s. This might indicate a test instability or a 
> problem with Flink which was recently introduced.
> https://api.travis-ci.org/v3/job/344427688/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9101) HAQueryableStateRocksDBBackendITCase failed on travis

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9101:
-
Fix Version/s: (was: 1.6.1)

> HAQueryableStateRocksDBBackendITCase failed on travis
> -
>
> Key: FLINK-9101
> URL: https://issues.apache.org/jira/browse/FLINK-9101
> Project: Flink
>  Issue Type: Bug
>  Components: Queryable State
>Affects Versions: 1.5.0
>Reporter: Chesnay Schepler
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2
>
>
> The test deadlocks on travis.
> https://travis-ci.org/zentol/flink/jobs/358894950



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8623) ConnectionUtilsTest.testReturnLocalHostAddressUsingHeuristics unstable on Travis

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8623:
-
Fix Version/s: 1.6.2

> ConnectionUtilsTest.testReturnLocalHostAddressUsingHeuristics unstable on 
> Travis
> 
>
> Key: FLINK-8623
> URL: https://issues.apache.org/jira/browse/FLINK-8623
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> {{ConnectionUtilsTest.testReturnLocalHostAddressUsingHeuristics}} fails on 
> Travis: https://travis-ci.org/apache/flink/jobs/33932



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-7915) Verify functionality of RollingSinkSecuredITCase

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7915:
-
Fix Version/s: 1.6.2

> Verify functionality of RollingSinkSecuredITCase
> 
>
> Key: FLINK-7915
> URL: https://issues.apache.org/jira/browse/FLINK-7915
> Project: Flink
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 1.4.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> I recently stumbled across the test {{RollingSinkSecuredITCase}} which will 
> only be executed for Hadoop version {{>= 3}}. When trying to run it from 
> IntelliJ I immediately ran into a class not found exception for 
> {{jdbm.helpers.CachePolicy}} and even after fixing this problem, the test 
> would not run because it complained about wrong security settings.
> I think we should check whether this test is at all working and if not, then 
> we should remove or replace it with something working.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10260) Confusing log during TaskManager registration

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10260:
--
Fix Version/s: 1.6.2

> Confusing log during TaskManager registration
> -
>
> Key: FLINK-10260
> URL: https://issues.apache.org/jira/browse/FLINK-10260
> Project: Flink
>  Issue Type: Bug
>  Components: ResourceManager
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Stephan Ewen
>Priority: Major
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> During startup, when TaskManagers register, I see a lot of confusing log 
> lines.
> The below case happened during startup of a cloud setup where TaskManagers 
> took a varying amount of time to start and might have started before the 
> JobManager.
> {code}
> -- Logs begin at Thu 2018-08-30 14:51:58 UTC, end at Thu 2018-08-30 14:55:39 
> UTC. --
> Aug 30 14:52:52 flink-belgium-1 systemd[1]: Started flink-jobmanager.service.
> -- Subject: Unit flink-jobmanager.service has finished start-up
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> -- 
> -- Unit flink-jobmanager.service has finished starting up.
> -- 
> -- The start-up result is RESULT.
> Aug 30 14:52:52 flink-belgium-1 jobmanager.sh[5416]: used deprecated key 
> `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: Starting 
> standalonesession as a console application on host flink-belgium-1.
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,221 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 
> 
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,222 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  
> Starting StandaloneSessionClusterEntrypoint (Version: 1.6.0, Rev:ff472b4, 
> Date:07.08.2018 @ 13:31:13 UTC)
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,222 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  OS 
> current user: flink
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,718 
> WARN  org.apache.hadoop.util.NativeCodeLoader   - Unable 
> to load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,847 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  
> Current Hadoop/Kerberos user: flink
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,848 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  JVM: 
> OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,848 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  
> Maximum heap size: 1963 MiBytes
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,848 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  
> JAVA_HOME: (not set)
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,853 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  Hadoop 
> version: 2.8.3
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,853 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  JVM 
> Options:
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,853 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 
> -Xms2048m
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,853 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 
> -Xmx2048m
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,853 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 
> -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,853 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 
> -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,854 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  
> Program Arguments:
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,854 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 
> --configDir
> Aug 30 14:52:53 flink-belgium-1 jobmanager.sh[5416]: 2018-08-30 14:52:53,854 
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  

[jira] [Updated] (FLINK-9764) Failure in LocalRecoveryRocksDBFullITCase

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9764:
-
Fix Version/s: (was: 1.6.1)

> Failure in LocalRecoveryRocksDBFullITCase
> -
>
> Key: FLINK-9764
> URL: https://issues.apache.org/jira/browse/FLINK-9764
> Project: Flink
>  Issue Type: Bug
>  Components: State Backends, Checkpointing, Streaming, Tests
>Affects Versions: 1.6.0
>Reporter: Nico Kruber
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2
>
>
> {code}
> Running org.apache.flink.test.checkpointing.LocalRecoveryRocksDBFullITCase
> Starting null#executeTest.
> org.apache.flink.runtime.client.JobExecutionException: 
> java.lang.AssertionError: Window start: 0 end: 100 expected:<4950> but 
> was:<1209>
>   at 
> org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:623)
>   at 
> org.apache.flink.streaming.util.TestStreamEnvironment.execute(TestStreamEnvironment.java:79)
>   at org.apache.flink.test.util.TestUtils.tryExecute(TestUtils.java:35)
>   at 
> org.apache.flink.test.checkpointing.AbstractEventTimeWindowCheckpointingITCase.testTumblingTimeWindow(AbstractEventTimeWindowCheckpointingITCase.java:286)
>   at 
> org.apache.flink.test.checkpointing.AbstractLocalRecoveryITCase.executeTest(AbstractLocalRecoveryITCase.java:82)
>   at 
> org.apache.flink.test.checkpointing.AbstractLocalRecoveryITCase.executeTest(AbstractLocalRecoveryITCase.java:74)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
> Caused by: java.lang.AssertionError: Window start: 0 end: 100 expected:<4950> 
> but was:<1209>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at 
> org.apache.flink.test.checkpointing.AbstractEventTimeWindowCheckpointingITCase$ValidatingSink.invoke(AbstractEventTimeWindowCheckpointingITCase.java:733)
>   at 
> org.apache.flink.test.checkpointing.AbstractEventTimeWindowCheckpointingITCase$ValidatingSink.invoke(AbstractEventTimeWindowCheckpointingITCase.java:669)
>   at 
> org.apache.flink.streaming.api.functions.sink.SinkFunction.invoke(SinkFunction.java:52)
>   at 
> org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
>   at 
> 

[jira] [Updated] (FLINK-8820) FlinkKafkaConsumer010 reads too many bytes

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8820:
-
Fix Version/s: (was: 1.6.1)

> FlinkKafkaConsumer010 reads too many bytes
> --
>
> Key: FLINK-8820
> URL: https://issues.apache.org/jira/browse/FLINK-8820
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector
>Affects Versions: 1.4.0
>Reporter: Fabian Hueske
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> A user reported that the FlinkKafkaConsumer010 very rarely consumes too many 
> bytes, i.e., the returned message is too large. The application is running 
> for about a year and the problem started to occur after upgrading to Flink 
> 1.4.0.
> The user made a good effort in debugging the problem but was not able to 
> reproduce it in a controlled environment. It seems that the data is correctly 
> stored in Kafka.
> Here's the thread on the user mailing list for a detailed 
> description of the problem and analysis so far: 
> https://lists.apache.org/thread.html/1d62f616d275e9e23a5215ddf7f5466051be7ea96897d827232fcb4e@%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10292) Generate JobGraph in StandaloneJobClusterEntrypoint only once

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10292:
--
Fix Version/s: 1.6.2

> Generate JobGraph in StandaloneJobClusterEntrypoint only once
> -
>
> Key: FLINK-10292
> URL: https://issues.apache.org/jira/browse/FLINK-10292
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Assignee: vinoyang
>Priority: Major
> Fix For: 1.7.0, 1.6.2
>
>
> Currently the {{StandaloneJobClusterEntrypoint}} generates the {{JobGraph}} 
> from the given user code every time it starts/is restarted. This can be 
> problematic if the {{JobGraph}} generation has side effects. Therefore, it 
> would be better to generate the {{JobGraph}} only once, store it in HA 
> storage and retrieve it from there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9878) IO worker threads BLOCKED on SSL Session Cache while CMS full gc

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9878:
-
Fix Version/s: (was: 1.6.1)

> IO worker threads BLOCKED on SSL Session Cache while CMS full gc
> 
>
> Key: FLINK-9878
> URL: https://issues.apache.org/jira/browse/FLINK-9878
> Project: Flink
>  Issue Type: Bug
>  Components: Network
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> According to https://github.com/netty/netty/issues/832, there is a JDK issue 
> during garbage collection when the SSL session cache is not limited. We 
> should allow the user to configure this and further (advanced) SSL parameters 
> for fine-tuning to fix this and similar issues. In particular, the following 
> parameters should be configurable:
> - SSL session cache size
> - SSL session timeout
> - SSL handshake timeout
> - SSL close notify flush timeout



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
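
As a rough illustration of the first two parameters named in FLINK-9878 above (session cache size and session timeout) in plain JDK terms rather than Flink configuration: the values below are arbitrary examples, and the configuration keys Flink eventually exposes for them are left to the ticket itself.

{code:java}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSessionContext;

public class SslSessionCacheExample {
    public static void main(String[] args) throws Exception {
        SSLContext sslContext = SSLContext.getInstance("TLS");
        sslContext.init(null, null, null); // default key/trust material, for illustration only

        // Bounding the server-side session cache avoids the unbounded growth
        // that triggers the long GC pauses described above.
        SSLSessionContext serverSessions = sslContext.getServerSessionContext();
        serverSessions.setSessionCacheSize(10_000); // example value, not a recommended default
        serverSessions.setSessionTimeout(28_800);   // seconds; example value

        System.out.println("cache size = " + serverSessions.getSessionCacheSize()
                + ", timeout = " + serverSessions.getSessionTimeout() + "s");
    }
}
{code}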


[jira] [Updated] (FLINK-10292) Generate JobGraph in StandaloneJobClusterEntrypoint only once

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10292:
--
Fix Version/s: (was: 1.6.1)

> Generate JobGraph in StandaloneJobClusterEntrypoint only once
> -
>
> Key: FLINK-10292
> URL: https://issues.apache.org/jira/browse/FLINK-10292
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Assignee: vinoyang
>Priority: Major
> Fix For: 1.7.0, 1.6.2
>
>
> Currently the {{StandaloneJobClusterEntrypoint}} generates the {{JobGraph}} 
> from the given user code every time it starts/is restarted. This can be 
> problematic if the {{JobGraph}} generation has side effects. Therefore, it 
> would be better to generate the {{JobGraph}} only once, store it in HA 
> storage and retrieve it from there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9587) ContinuousFileMonitoringFunction crashes on short living files

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9587:
-
Fix Version/s: (was: 1.6.1)

> ContinuousFileMonitoringFunction crashes on short living files
> --
>
> Key: FLINK-9587
> URL: https://issues.apache.org/jira/browse/FLINK-9587
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, Streaming, Streaming Connectors
>Affects Versions: 1.5.0
> Environment: Flink 1.5 running as a standalone cluster.
>Reporter: Andrei Shumanski
>Priority: Critical
> Fix For: 1.7.0, 1.6.2
>
>
> Hi,
>  
> We use Flink to monitor a directory for new files. The filesystem is a MapR 
> Fuse mount that looks like a local FS.
> The files are copied to the directory by another process that uses the rsync 
> command. While a file is not completely written, rsync creates a temporary 
> file with a name like ".file.txt.uM6MfZ" where the last extension is a random 
> string.
> When the copying is done, the file is renamed to the final name "file.txt".
>  
> The bug is that Flink does not correctly handle this behavior and does not 
> take into account that files in the directory might be deleted.
>  
> We are getting error traces:
> {code:java}
> java.io.FileNotFoundException: File 
> file:/mapr/landingarea/cId=2075/.file_00231.cpio.gz.uM6MfZ does not exist or 
> the user running Flink ('root') has insufficient permissions to access it.
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.getFileStatus(LocalFileSystem.java:115)
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.listStatus(LocalFileSystem.java:177)
> at 
> org.apache.flink.core.fs.SafetyNetWrapperFileSystem.listStatus(SafetyNetWrapperFileSystem.java:92)
> at 
> org.apache.flink.api.common.io.FileInputFormat.addFilesInDir(FileInputFormat.java:707)
> at 
> org.apache.flink.api.common.io.FileInputFormat.addFilesInDir(FileInputFormat.java:710)
> at 
> org.apache.flink.api.common.io.FileInputFormat.addFilesInDir(FileInputFormat.java:710)
> at 
> org.apache.flink.api.common.io.FileInputFormat.addFilesInDir(FileInputFormat.java:710)
> at 
> org.apache.flink.api.common.io.FileInputFormat.createInputSplits(FileInputFormat.java:591)
> at 
> org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction.getInputSplitsSortedByModTime(ContinuousFileMonitoringFunction.java:270)
> at 
> org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction.monitorDirAndForwardSplits(ContinuousFileMonitoringFunction.java:242)
> at 
> org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction.run(ContinuousFileMonitoringFunction.java:206)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:56)
> at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> In LocalFileSystem.listStatus(final Path f) we read the list of files in a 
> directory and then create a LocalFileStatus object for each of the files. But a 
> file might be removed during the interval between these operations.
> I do not see any option to handle this exception in our code.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
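
A minimal sketch of the kind of tolerant listing the FLINK-9587 report above asks for, written against plain {{java.nio.file}} rather than Flink's {{FileSystem}} API: a file that disappears between the directory listing and the attribute lookup is simply skipped instead of failing the whole scan. The class name and directory argument are illustrative only.

{code:java}
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class TolerantDirectoryScan {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");

        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                try {
                    BasicFileAttributes attrs =
                            Files.readAttributes(p, BasicFileAttributes.class);
                    System.out.println(p.getFileName() + " (" + attrs.size() + " bytes)");
                } catch (NoSuchFileException e) {
                    // Deleted between listing and stat (e.g. an rsync temp file
                    // that was renamed to its final name) -- skip instead of failing.
                }
            }
        }
    }
}
{code}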


[jira] [Updated] (FLINK-10309) Cancel flink job occurs java.net.ConnectException

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10309:
--
Fix Version/s: 1.6.2

> Cancel flink job occurs java.net.ConnectException
> -
>
> Key: FLINK-10309
> URL: https://issues.apache.org/jira/browse/FLINK-10309
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination, REST
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: vinoyang
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The problem occurs when using the Yarn per-job detached mode. Trying to 
> cancel with savepoint fails with the following exception before being able to 
> retrieve the savepoint path:
> exception stack trace : 
> {code:java}
> org.apache.flink.util.FlinkException: Could not cancel job .
>         at 
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:585)
>         at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
>         at 
> org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:577)
>         at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1034)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not 
> complete the operation. Number of retries has been exhausted.
>         at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>         at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>         at 
> org.apache.flink.client.program.rest.RestClusterClient.cancelWithSavepoint(RestClusterClient.java:398)
>         at 
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:583)
>         ... 6 more
> Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: 
> Could not complete the operation. Number of retries has been exhausted.
>         at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
>         at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>         at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>         at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>         at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>         at 
> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:274)
>         at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
>         at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
>         at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
>         at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
>         at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268)
>         at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284)
>         at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>         at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>         ... 1 more
> Caused by: java.util.concurrent.CompletionException: 
> java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
>         at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>         at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>         at 
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
>         at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>         ... 16 more
> Caused by: 

[jira] [Updated] (FLINK-10317) Configure Metaspace size by default

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10317:
--
Fix Version/s: (was: 1.6.1)

> Configure Metaspace size by default
> ---
>
> Key: FLINK-10317
> URL: https://issues.apache.org/jira/browse/FLINK-10317
> Project: Flink
>  Issue Type: Bug
>  Components: Startup Shell Scripts
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Stephan Ewen
>Assignee: vinoyang
>Priority: Major
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> We should set the size of the JVM Metaspace to a sane default, like  
> {{-XX:MaxMetaspaceSize=256m}}.
> If not set, the JVM off-heap memory will grow indefinitely with repeated 
> classloading and JIT compilation, eventually exceeding the allowed memory on 
> Docker/YARN or similar setups.
> It is hard to come up with a good default. However, the error messages one 
> gets when the Metaspace is too small are easy to understand (and easy to act 
> on), while it is very hard to figure out why the memory footprint keeps 
> growing steadily and indefinitely.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
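
For illustration, a small self-contained sketch that reads the current Metaspace usage via the standard {{MemoryPoolMXBean}} API; it only shows how the growth described in FLINK-10317 above could be observed. The actual fix proposed there is passing {{-XX:MaxMetaspaceSize=256m}} to the JVM through Flink's JVM options, which this snippet does not do.

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MetaspaceUsagePrinter {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Metaspace")) {
                MemoryUsage usage = pool.getUsage();
                // 'max' is -1 when no -XX:MaxMetaspaceSize limit is configured,
                // which is exactly the unbounded situation described above.
                System.out.printf("%s: used=%d bytes, committed=%d bytes, max=%d%n",
                        pool.getName(), usage.getUsed(), usage.getCommitted(), usage.getMax());
            }
        }
    }
}
{code}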


[jira] [Updated] (FLINK-9469) Add tests that cover PatternStream#flatSelect

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9469:
-
Fix Version/s: 1.6.2

> Add tests that cover PatternStream#flatSelect
> -
>
> Key: FLINK-9469
> URL: https://issues.apache.org/jira/browse/FLINK-9469
> Project: Flink
>  Issue Type: Improvement
>  Components: CEP
>Reporter: Dawid Wysakowicz
>Assignee: Dawid Wysakowicz
>Priority: Major
> Fix For: 1.7.0, 1.6.2
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9253) Make buffer count per InputGate always #channels*buffersPerChannel + ExclusiveBuffers

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9253:
-
Fix Version/s: 1.6.2

> Make buffer count per InputGate always #channels*buffersPerChannel + 
> ExclusiveBuffers
> -
>
> Key: FLINK-9253
> URL: https://issues.apache.org/jira/browse/FLINK-9253
> Project: Flink
>  Issue Type: Sub-task
>  Components: Network
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The credit-based flow control path assigns exclusive buffers only to remote 
> channels (which makes sense since local channels do not use any buffers of 
> their own). However, this is somewhat opaque with respect to how much data may 
> be buffered, since this depends on the actual schedule of the job and not on 
> the job graph.
> By adapting the floating buffers to use a maximum of 
> {{#channels*buffersPerChannel + floatingBuffersPerGate - #exclusiveBuffers}}, 
> we would be channel-type agnostic and keep the old behaviour.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
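
Purely to illustrate the arithmetic of the cap proposed in FLINK-9253 above, a tiny example with made-up channel and buffer counts; it also shows that the resulting per-gate total no longer depends on how many of the channels happen to be remote.

{code:java}
public class GateBufferBoundExample {
    public static void main(String[] args) {
        // Made-up numbers, not defaults of any real job.
        int channels = 8;                // input channels of the gate
        int buffersPerChannel = 2;       // exclusive buffers per (remote) channel
        int floatingBuffersPerGate = 8;  // floatingBuffersPerGate in the formula above
        int remoteChannels = 6;          // only remote channels receive exclusive buffers

        int exclusiveBuffers = remoteChannels * buffersPerChannel; // 12

        // Proposed cap: #channels*buffersPerChannel + floatingBuffersPerGate - #exclusiveBuffers
        int maxFloating = channels * buffersPerChannel + floatingBuffersPerGate - exclusiveBuffers; // 12

        // Total stays at channels*buffersPerChannel + floatingBuffersPerGate (= 24),
        // regardless of the local/remote split -- i.e. channel-type agnostic.
        System.out.println("exclusive=" + exclusiveBuffers
                + ", max floating=" + maxFloating
                + ", total=" + (exclusiveBuffers + maxFloating));
    }
}
{code}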


[jira] [Updated] (FLINK-10311) HA end-to-end/Jepsen tests for standby Dispatchers

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10311:
--
Fix Version/s: 1.6.2

> HA end-to-end/Jepsen tests for standby Dispatchers
> --
>
> Key: FLINK-10311
> URL: https://issues.apache.org/jira/browse/FLINK-10311
> Project: Flink
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Gary Yao
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> We should add end-to-end or Jepsen tests to verify the HA behaviour when 
> using multiple standby Dispatchers. In particular, we should verify that jobs 
> get properly cleaned up after they finished successfully (see FLINK-10255 and 
> FLINK-10011):
> 1. Test that a standby Dispatcher does not affect job execution and resource 
> cleanup
> 2. Test that a standby Dispatcher recovers all submitted jobs after the 
> leader loses leadership



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9142) Lower the minimum number of buffers for incoming channels to 1

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9142:
-
Fix Version/s: 1.6.2

> Lower the minimum number of buffers for incoming channels to 1
> --
>
> Key: FLINK-9142
> URL: https://issues.apache.org/jira/browse/FLINK-9142
> Project: Flink
>  Issue Type: Sub-task
>  Components: Network
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Nico Kruber
>Priority: Major
> Fix For: 1.7.0, 1.6.2
>
>
> Even if we make the floating buffers optional, credit-based flow control still 
> requires {{taskmanager.network.memory.buffers-per-channel}} (exclusive) 
> buffers per incoming channel, whereas without it the minimum was 1 and only 
> the maximum number of buffers was influenced by this parameter.
> {{taskmanager.network.memory.buffers-per-channel}} is set to {{2}} by default 
> with the rationale that this way we will have one buffer available for 
> Netty to process while a worker thread is processing/deserializing the other 
> buffer. While this seems reasonable, it does increase our minimum 
> requirements. Instead, we could probably live with {{1}} exclusive buffer and 
> up to {{gate.getNumberOfInputChannels() * (networkBuffersPerChannel - 1) + 
> extraNetworkBuffersPerGate}} floating buffers. That way we will have the same 
> memory footprint as before with only slightly changed behaviour.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
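
As a quick sanity check of the FLINK-9142 proposal above, a small example with the same kind of made-up numbers (nothing here reflects a real deployment); it only verifies that the worst-case memory footprint stays the same while the guaranteed minimum drops.

{code:java}
public class MinExclusiveBufferExample {
    public static void main(String[] args) {
        // Made-up numbers again, only to compare current and proposed bounds.
        int channels = 8;
        int buffersPerChannel = 2;       // taskmanager.network.memory.buffers-per-channel
        int extraBuffersPerGate = 8;     // extraNetworkBuffersPerGate in the formula above

        // Today: every incoming channel needs buffersPerChannel exclusive buffers.
        int currentMinimum = channels * buffersPerChannel;                    // 16
        int currentMaximum = currentMinimum + extraBuffersPerGate;            // 24

        // Proposal: 1 exclusive buffer per channel, the rest becomes a floating cap.
        int proposedMinimum = channels;                                       // 8
        int proposedMaximum = proposedMinimum
                + channels * (buffersPerChannel - 1) + extraBuffersPerGate;   // 24

        // Worst-case footprint is unchanged; only the guaranteed minimum drops.
        System.out.println("current: min=" + currentMinimum + " max=" + currentMaximum
                + " | proposed: min=" + proposedMinimum + " max=" + proposedMaximum);
    }
}
{code}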


[jira] [Updated] (FLINK-10259) Key validation for GroupWindowAggregate is broken

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10259:
--
Fix Version/s: (was: 1.6.1)

> Key validation for GroupWindowAggregate is broken
> -
>
> Key: FLINK-10259
> URL: https://issues.apache.org/jira/browse/FLINK-10259
> Project: Flink
>  Issue Type: Bug
>  Components: Table API  SQL
>Reporter: Fabian Hueske
>Assignee: Fabian Hueske
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> WindowGroups have multiple equivalent keys (start, end) that should be 
> handled differently from other keys. The {{UpdatingPlanChecker}} uses 
> equivalence groups to identify equivalent keys but the keys of WindowGroups 
> are not correctly assigned to groups.
> This means that we cannot correctly extract keys from queries that use group 
> windows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10291) Generate JobGraph with fixed/configurable JobID in StandaloneJobClusterEntrypoint

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10291:
--
Fix Version/s: 1.6.2

> Generate JobGraph with fixed/configurable JobID in 
> StandaloneJobClusterEntrypoint
> -
>
> Key: FLINK-10291
> URL: https://issues.apache.org/jira/browse/FLINK-10291
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination
>Affects Versions: 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Assignee: vinoyang
>Priority: Critical
> Fix For: 1.7.0, 1.6.2
>
>
> The {{StandaloneJobClusterEntrypoint}} currently generates the {{JobGraph}} 
> from the user code when being started. Due to the nature of how the 
> {{JobGraph}} is generated, it will get a random {{JobID}} assigned. This is 
> problematic in case of a failover because the {{JobMaster}} will then not be 
> able to find the previous checkpoints. In order to solve this problem, we need 
> to either fix the {{JobID}} assignment or make it configurable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
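
One way the "fixed {{JobID}}" option from FLINK-10291 above could look, sketched with a made-up helper and built on Flink's {{JobID#fromByteArray}} factory: derive a stable 16-byte ID from a user-supplied string (here via MD5) so that every restart of the entrypoint produces the same {{JobID}}. This illustrates the idea only; it is not the API that was eventually added.

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import org.apache.flink.api.common.JobID;

public class FixedJobIdExample {

    /** Hypothetical helper: map a stable cluster/job name to a deterministic JobID. */
    static JobID deterministicJobId(String clusterId) throws Exception {
        // MD5 yields exactly 16 bytes, which is the size of a JobID.
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(clusterId.getBytes(StandardCharsets.UTF_8));
        return JobID.fromByteArray(digest);
    }

    public static void main(String[] args) throws Exception {
        // Same input -> same JobID on every restart, so checkpoints can be found again.
        System.out.println(deterministicJobId("my-standalone-job-cluster"));
        System.out.println(deterministicJobId("my-standalone-job-cluster"));
    }
}
{code}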


[jira] [Updated] (FLINK-5125) ContinuousFileProcessingCheckpointITCase is Flaky

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-5125:
-
Fix Version/s: 1.6.2

> ContinuousFileProcessingCheckpointITCase is Flaky
> -
>
> Key: FLINK-5125
> URL: https://issues.apache.org/jira/browse/FLINK-5125
> Project: Flink
>  Issue Type: Bug
>  Components: Streaming Connectors
>Reporter: Aljoscha Krettek
>Assignee: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2
>
>
> This is the travis log: 
> https://api.travis-ci.org/jobs/177402367/log.txt?deansi=true
> The relevant section is:
> {code}
> Running org.apache.flink.test.checkpointing.CoStreamCheckpointingITCase
> Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.571 sec - 
> in org.apache.flink.test.exampleJavaPrograms.EnumTriangleBasicITCase
> Running org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.704 sec - 
> in org.apache.flink.test.checkpointing.CoStreamCheckpointingITCase
> Running 
> org.apache.flink.test.checkpointing.EventTimeAllWindowCheckpointingITCase
> Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.805 sec - 
> in org.apache.flink.test.checkpointing.EventTimeAllWindowCheckpointingITCase
> Running 
> org.apache.flink.test.checkpointing.ContinuousFileProcessingCheckpointITCase
> org.apache.flink.client.program.ProgramInvocationException: The program 
> execution failed: Job execution failed.
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
>   at 
> org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:101)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:400)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:392)
>   at 
> org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.executeRemotely(RemoteStreamEnvironment.java:209)
>   at 
> org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:173)
>   at org.apache.flink.test.util.TestUtils.tryExecute(TestUtils.java:32)
>   at 
> org.apache.flink.test.checkpointing.StreamFaultToleranceTestBase.runCheckpointedProgram(StreamFaultToleranceTestBase.java:106)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
>   at 
> 

[jira] [Updated] (FLINK-4387) Instability in KvStateClientTest.testClientServerIntegration()

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-4387:
-
Fix Version/s: 1.6.2

> Instability in KvStateClientTest.testClientServerIntegration()
> --
>
> Key: FLINK-4387
> URL: https://issues.apache.org/jira/browse/FLINK-4387
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.1.0, 1.5.0, 1.6.0
>Reporter: Robert Metzger
>Assignee: Nico Kruber
>Priority: Major
>  Labels: test-stability
> Fix For: 1.2.0, 1.7.0, 1.6.2
>
>
> According to this log: 
> https://s3.amazonaws.com/archive.travis-ci.org/jobs/151491745/log.txt
> the {{KvStateClientTest}} didn't complete.
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7fb2b400a000 nid=0x29dc in Object.wait() 
> [0x7fb2bcb3b000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0xf7c049a0> (a 
> io.netty.util.concurrent.DefaultPromise)
>   at java.lang.Object.wait(Object.java:502)
>   at 
> io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:254)
>   - locked <0xf7c049a0> (a 
> io.netty.util.concurrent.DefaultPromise)
>   at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:32)
>   at 
> org.apache.flink.runtime.query.netty.KvStateServer.shutDown(KvStateServer.java:185)
>   at 
> org.apache.flink.runtime.query.netty.KvStateClientTest.testClientServerIntegration(KvStateClientTest.java:680)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}
> and
> {code}
> Exception in thread "globalEventExecutor-1-3" java.lang.AssertionError
>   at 
> io.netty.util.concurrent.AbstractScheduledEventExecutor.pollScheduledTask(AbstractScheduledEventExecutor.java:83)
>   at 
> io.netty.util.concurrent.GlobalEventExecutor.fetchFromScheduledTaskQueue(GlobalEventExecutor.java:110)
>   at 
> io.netty.util.concurrent.GlobalEventExecutor.takeTask(GlobalEventExecutor.java:95)
>   at 
> io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:226)
>   at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10246) Harden and separate MetricQueryService

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10246:
--
Fix Version/s: (was: 1.6.1)

> Harden and separate MetricQueryService
> --
>
> Key: FLINK-10246
> URL: https://issues.apache.org/jira/browse/FLINK-10246
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Coordination, Metrics
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> This is an umbrella issue to track the effort to harden Flink's 
> {{MetricQueryService}} and to separate it from the rest of the system.
> The idea is to set up the {{MetricQueryService}} and the metric system in 
> general in such a way that it cannot interfere with, or even bring down, the 
> main Flink components. Moreover, the metric system should not degrade 
> performance: it should use free CPU cycles, but not more. Ideally, the 
> user does not see a difference between running Flink with the metric query 
> service turned on or off.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9485) Improving Flink’s timer management for large state

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9485:
-
Fix Version/s: 1.6.2

> Improving Flink’s timer management for large state
> --
>
> Key: FLINK-9485
> URL: https://issues.apache.org/jira/browse/FLINK-9485
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing, Streaming
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
> Fix For: 1.6.2
>
>
> See 
> https://docs.google.com/document/d/1XbhJRbig5c5Ftd77d0mKND1bePyTC26Pz04EvxdA7Jc/edit#heading=h.17v0k3363r6q



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9228) log details about task fail/task manager is shutting down

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9228:
-
Fix Version/s: 1.6.2

> log details about task fail/task manager is shutting down
> -
>
> Key: FLINK-9228
> URL: https://issues.apache.org/jira/browse/FLINK-9228
> Project: Flink
>  Issue Type: Improvement
>  Components: Logging
>Affects Versions: 1.4.2
>Reporter: makeyang
>Assignee: makeyang
>Priority: Minor
> Fix For: 1.7.0, 1.6.2
>
>
> condition:
> flink version:1.4.2
> jdk version:1.8.0.20
> linux version:3.10.0
> problem description:
> one of my task managers dropped out of the cluster; I checked its log and 
> found the following: 
> 2018-04-19 22:34:47,441 INFO  org.apache.flink.runtime.taskmanager.Task       
>               
> - Attempting to fail task externally Process (115/120) 
> (19d0b0ce1ef3b8023b37bdfda643ef44). 
> 2018-04-19 22:34:47,441 INFO  org.apache.flink.runtime.taskmanager.Task       
>               
> - Process (115/120) (19d0b0ce1ef3b8023b37bdfda643ef44) switched from RUNNING 
> to FAILED. 
> java.lang.Exception: TaskManager is shutting down. 
>         at 
> org.apache.flink.runtime.taskmanager.TaskManager.postStop(TaskManager.scala:220)
>  
>         at akka.actor.Actor$class.aroundPostStop(Actor.scala:515) 
>         at 
> org.apache.flink.runtime.taskmanager.TaskManager.aroundPostStop(TaskManager.scala:121)
>  
>         at 
> akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
>  
>         at 
> akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172) 
>         at akka.actor.ActorCell.terminate(ActorCell.scala:374) 
>         at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:467) 
>         at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483) 
>         at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282) 
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260) 
>         at akka.dispatch.Mailbox.run(Mailbox.scala:224) 
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:234) 
>         at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
>         at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  
>         at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
>         at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>  
> suggestions:
>  # short-term suggestions:
>  ## log the reasons why the task failed: did it receive some event from the 
> job manager, could it not connect to the job manager, did an operator throw an 
> exception? The more clarity the better.
>  ## log the reasons why the task manager is shutting down: did it receive some 
> event from the job manager, could it not connect to the job manager, did an 
> operator exception prove unrecoverable?
>  # long-term suggestions:
>  ## define the state machine of a Flink node clearly. If nothing happens, the 
> node should stay in its current state, which means that if it is processing 
> events and nothing happens, it should keep processing events. In other words, 
> if its state changes from processing events to cancelled, then some event must 
> have happened.
>  ## define clearly the events which can cause a node's state to change, such as 
> user cancel, operator exception, heartbeat timeout, etc.
>  ## log the state change and the event which caused it clearly in the logs
>  ## show event details (time, node, event, state changed, etc.) in the web UI



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8899) Submitting YARN job with FLIP-6 may lead to ApplicationAttemptNotFoundException

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8899:
-
Fix Version/s: (was: 1.6.1)

> Submitting YARN job with FLIP-6 may lead to 
> ApplicationAttemptNotFoundException
> ---
>
> Key: FLINK-8899
> URL: https://issues.apache.org/jira/browse/FLINK-8899
> Project: Flink
>  Issue Type: Bug
>  Components: ResourceManager, YARN
>Affects Versions: 1.5.0
>Reporter: Nico Kruber
>Priority: Major
>  Labels: flip-6
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> Occasionally, running a simple word count like this
> {code}
> ./bin/flink run -m yarn-cluster -yjm 768 -ytm 3072 -ys 2 -p 20 -c 
> org.apache.flink.streaming.examples.wordcount.WordCount 
> ./examples/streaming/WordCount.jar --input /usr/share/doc/rsync-3.0.6/COPYING
> {code}
> leads to an {{ApplicationAttemptNotFoundException}} in the logs:
> {code}
> 2018-03-08 16:18:08,507 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job Streaming 
> WordCount (df707a3c9817ddf5936efe56d427e2bd) switched from state RUNNING to 
> FINISHED.
> 2018-03-08 16:18:08,508 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping 
> checkpoint coordinator for job df707a3c9817ddf5936efe56d427e2bd
> 2018-03-08 16:18:08,508 INFO  
> org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore  - 
> Shutting down
> 2018-03-08 16:18:08,536 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job 
> df707a3c9817ddf5936efe56d427e2bd reached globally terminal state FINISHED.
> 2018-03-08 16:18:08,611 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>   - Stopping the JobMaster for job Streaming 
> WordCount(df707a3c9817ddf5936efe56d427e2bd).
> 2018-03-08 16:18:08,634 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>   - Close ResourceManager connection 
> dcfdc329d61aae0ace2de26292c8916b: JobManager is shutting down..
> 2018-03-08 16:18:08,634 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Disconnect job manager 
> 0...@akka.tcp://fl...@ip-172-31-2-0.eu-west-1.compute.internal:38555/user/jobmanager_0
>  for job df707a3c9817ddf5936efe56d427e2bd from the resource manager.
> 2018-03-08 16:18:08,664 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Suspending 
> SlotPool.
> 2018-03-08 16:18:08,664 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Stopping 
> SlotPool.
> 2018-03-08 16:18:08,664 INFO  
> org.apache.flink.runtime.jobmaster.JobManagerRunner   - 
> JobManagerRunner already shutdown.
> 2018-03-08 16:18:09,650 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register 
> TaskManager adc8090bdb3f7052943ff86bde7d2a7b at the SlotManager.
> 2018-03-08 16:18:09,654 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - Replacing old instance of worker for ResourceID 
> container_1519984124671_0090_01_05
> 2018-03-08 16:18:09,654 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - 
> Unregister TaskManager adc8090bdb3f7052943ff86bde7d2a7b from the SlotManager.
> 2018-03-08 16:18:09,654 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register 
> TaskManager b975dbd16e0fd59c1168d978490a4b76 at the SlotManager.
> 2018-03-08 16:18:09,654 INFO  org.apache.flink.yarn.YarnResourceManager   
>   - The target with resource ID 
> container_1519984124671_0090_01_05 is already been monitored.
> 2018-03-08 16:18:09,992 INFO  
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Register 
> TaskManager 73c258a0dbad236501b8391971c330ba at the SlotManager.
> 2018-03-08 16:18:10,000 INFO  
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - RECEIVED 
> SIGNAL 15: SIGTERM. Shutting down as requested.
> 2018-03-08 16:18:10,028 ERROR 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl  - Exception 
> on heartbeat
> org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: 
> Application attempt appattempt_1519984124671_0090_01 doesn't exist in 
> ApplicationMasterService cache.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:403)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>   at 
> 

[jira] [Updated] (FLINK-9481) FlinkKafkaProducer011ITCase deadlock in initializeState

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9481:
-
Fix Version/s: (was: 1.6.1)

> FlinkKafkaProducer011ITCase deadlock in initializeState 
> 
>
> Key: FLINK-9481
> URL: https://issues.apache.org/jira/browse/FLINK-9481
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector
>Affects Versions: 1.5.0, 1.4.2
>Reporter: Piotr Nowojski
>Priority: Critical
> Fix For: 1.7.0, 1.6.2
>
> Attachments: log.txt.zip
>
>
> FlinkKafkaProducer011ITCase.testRestoreToCheckpointAfterExceedingProducersPool(FlinkKafkaProducer011ITCase.java:152)
>  deadlocked on travis:
>  
> {noformat}
> "main" #1 prio=5 os_prio=0 tid=0x7fa36800a000 nid=0x5b85 waiting on 
> condition [0x7fa371c4d000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0xf54856c8> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>   at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionalRequestResult.await(TransactionalRequestResult.java:50)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.initTransactions(KafkaProducer.java:537)
>   at 
> org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.initTransactions(FlinkKafkaProducer.java:123)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.abortTransactions(FlinkKafkaProducer011.java:919)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.cleanUpUserContext(FlinkKafkaProducer011.java:903)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.finishRecoveringContext(FlinkKafkaProducer011.java:891)
>   at 
> org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.initializeState(TwoPhaseCommitSinkFunction.java:338)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.initializeState(FlinkKafkaProducer011.java:867)
>   at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
>   at 
> org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
>   at 
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:254)
>   at 
> org.apache.flink.streaming.util.AbstractStreamOperatorTestHarness.initializeState(AbstractStreamOperatorTestHarness.java:424)
>   at 
> org.apache.flink.streaming.util.AbstractStreamOperatorTestHarness.initializeState(AbstractStreamOperatorTestHarness.java:346)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testRestoreToCheckpointAfterExceedingProducersPool(FlinkKafkaProducer011ITCase.java:152){noformat}
>  
> https://api.travis-ci.org/v3/job/386021917/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10253) Run MetricQueryService with lower priority

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10253:
--
Fix Version/s: (was: 1.6.1)

> Run MetricQueryService with lower priority
> --
>
> Key: FLINK-10253
> URL: https://issues.apache.org/jira/browse/FLINK-10253
> Project: Flink
>  Issue Type: Sub-task
>  Components: Metrics
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> We should run the {{MetricQueryService}} with a lower priority than the main 
> Flink components. An idea would be to start the underlying threads with a 
> lower priority.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
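
A minimal sketch of the "lower priority threads" idea from FLINK-10253 above in plain JDK terms (the names are illustrative; this is not the Flink code that implements it): a {{ThreadFactory}} that marks metric threads as daemon and {{Thread.MIN_PRIORITY}}, handed to a dedicated executor so that metric work never competes with the main components.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class LowPriorityMetricThreads {

    /** Illustrative factory: metric work should never compete with the main components. */
    static class LowPriorityThreadFactory implements ThreadFactory {
        private final AtomicInteger counter = new AtomicInteger();

        @Override
        public Thread newThread(Runnable r) {
            Thread t = new Thread(r, "metrics-query-" + counter.incrementAndGet());
            t.setDaemon(true);
            t.setPriority(Thread.MIN_PRIORITY); // lower than the default NORM_PRIORITY
            return t;
        }
    }

    public static void main(String[] args) {
        ExecutorService metricExecutor =
                Executors.newSingleThreadExecutor(new LowPriorityThreadFactory());
        metricExecutor.submit(() ->
                System.out.println(Thread.currentThread().getName()
                        + " priority=" + Thread.currentThread().getPriority()));
        metricExecutor.shutdown();
    }
}
{code}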


[jira] [Updated] (FLINK-7991) Cleanup kafka example jar filters

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-7991:
-
Fix Version/s: (was: 1.6.1)

> Cleanup kafka example jar filters
> -
>
> Key: FLINK-7991
> URL: https://issues.apache.org/jira/browse/FLINK-7991
> Project: Flink
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Trivial
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The kafka example jar shading configuration includes the following files:
> {code}
> org/apache/flink/streaming/examples/kafka/**
> org/apache/flink/streaming/**
> org/apache/kafka/**
> org/apache/curator/**
> org/apache/zookeeper/**
> org/apache/jute/**
> org/I0Itec/**
> jline/**
> com/yammer/**
> kafka/**
> {code}
> Problems:
> * the inclusion of org.apache.flink.streaming causes large parts of the API 
> (and runtime...) to be included in the jar, along with the _Twitter_ source and 
> *all* other examples
> * the yammer, jline, I0Itec, jute, zookeeper and curator inclusions are 
> ineffective; none of these classes show up in the example jar



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9776) Interrupt TaskThread only while in User/Operator code

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9776:
-
Fix Version/s: (was: 1.6.1)

> Interrupt TaskThread only while in User/Operator code
> -
>
> Key: FLINK-9776
> URL: https://issues.apache.org/jira/browse/FLINK-9776
> Project: Flink
>  Issue Type: Improvement
>  Components: Local Runtime
>Reporter: Stephan Ewen
>Assignee: Stephan Ewen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0, 1.6.2
>
>
> Upon cancellation, the task thread is periodically interrupted.
> This helps to pull the thread out of blocking operations in the user code.
> Once the thread leaves the user code, the repeated interrupts may interfere 
> with the shutdown cleanup logic, causing confusing exceptions.
> We should stop sending the periodic interrupts once the thread leaves the 
> user code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
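
A sketch of the general pattern behind FLINK-9776 above, not Flink's actual implementation: a cancellation watchdog that keeps interrupting the task thread only while a flag says the thread is still inside user code, and stops as soon as the thread has left it.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptOnlyUserCode {

    private static final AtomicBoolean inUserCode = new AtomicBoolean(true);

    public static void main(String[] args) throws InterruptedException {
        Thread taskThread = new Thread(() -> {
            try {
                // Simulated user function that blocks.
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                System.out.println("user code interrupted, leaving user code");
            } finally {
                // Leaving user code: from now on interrupts would only disturb cleanup.
                inUserCode.set(false);
            }
            System.out.println("running shutdown cleanup without further interrupts");
        }, "task-thread");
        taskThread.start();

        // Cancellation watchdog: interrupt periodically, but only while in user code.
        while (taskThread.isAlive() && inUserCode.get()) {
            taskThread.interrupt();
            Thread.sleep(100);
        }
        taskThread.join();
    }
}
{code}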


[jira] [Updated] (FLINK-9469) Add tests that cover PatternStream#flatSelect

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9469:
-
Fix Version/s: (was: 1.6.1)

> Add tests that cover PatternStream#flatSelect
> -
>
> Key: FLINK-9469
> URL: https://issues.apache.org/jira/browse/FLINK-9469
> Project: Flink
>  Issue Type: Improvement
>  Components: CEP
>Reporter: Dawid Wysakowicz
>Assignee: Dawid Wysakowicz
>Priority: Major
> Fix For: 1.7.0, 1.6.2
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10282) Provide separate thread-pool for REST endpoint

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10282:
--
Fix Version/s: (was: 1.6.1)

> Provide separate thread-pool for REST endpoint
> --
>
> Key: FLINK-10282
> URL: https://issues.apache.org/jira/browse/FLINK-10282
> Project: Flink
>  Issue Type: Improvement
>  Components: Local Runtime, REST
>Affects Versions: 1.5.1, 1.6.0, 1.7.0
>Reporter: Chesnay Schepler
>Assignee: Chesnay Schepler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> The REST endpoints currently share their thread-pools with the RPC system, 
> which can cause the Dispatcher to become unresponsive if the REST parts are 
> overloaded.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10184:
--
Fix Version/s: 1.6.2

> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> --
>
> Key: FLINK-10184
> URL: https://issues.apache.org/jira/browse/FLINK-10184
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Thomas Wozniakowski
>Priority: Blocker
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that, if you start a job, cancel it, restart it, cancel it, etc., 
> you will end up with many job graphs stored in ZooKeeper, but none of the 
> corresponding blobs in the Flink HA directory.
> When an HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from Zookeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the zookeeper jobgraphs 
> cleared out by hand, and all the jobmanagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10252) Handle oversized metric messages

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10252:
--
Fix Version/s: (was: 1.6.1)

> Handle oversized metric messages
> ---
>
> Key: FLINK-10252
> URL: https://issues.apache.org/jira/browse/FLINK-10252
> Project: Flink
>  Issue Type: Sub-task
>  Components: Metrics
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> Since the {{MetricQueryService}} is implemented as an Akka actor, it can only 
> send messages smaller than the current {{akka.framesize}}. We 
> should check similarly to FLINK-10251 whether the payload exceeds the maximum 
> framesize and fail fast if it is true.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
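
A hedged sketch of the "fail fast" check described in FLINK-10252 above (the constant and helper names are made up; the real limit would come from the Akka configuration): serialize the payload first, compare its size against the maximum frame size, and raise a descriptive error instead of letting the transport silently drop the message.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;

public class FramesizeCheckExample {

    // Made-up value; the real limit would be read from akka.framesize.
    private static final int MAX_FRAME_SIZE_BYTES = 10 * 1024 * 1024;

    static byte[] serializeOrFail(Serializable payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(payload);
        }
        byte[] bytes = bos.toByteArray();
        if (bytes.length > MAX_FRAME_SIZE_BYTES) {
            // Failing fast with a clear message beats a silently dropped actor message.
            throw new IOException("Metric dump of " + bytes.length
                    + " bytes exceeds maximum framesize of " + MAX_FRAME_SIZE_BYTES + " bytes.");
        }
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        ArrayList<String> metrics = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) {
            metrics.add("metric-" + i + "=42");
        }
        System.out.println("serialized size: " + serializeOrFail(metrics).length + " bytes");
    }
}
{code}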


[jira] [Updated] (FLINK-8073) Test instability FlinkKafkaProducer011ITCase.testScaleDownBeforeFirstCheckpoint()

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8073:
-
Fix Version/s: (was: 1.6.1)

> Test instability 
> FlinkKafkaProducer011ITCase.testScaleDownBeforeFirstCheckpoint()
> -
>
> Key: FLINK-8073
> URL: https://issues.apache.org/jira/browse/FLINK-8073
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector, Tests
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Kostas Kloudas
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> Travis log: https://travis-ci.org/kl0u/flink/jobs/301985988



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10253) Run MetricQueryService with lower priority

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10253:
--
Fix Version/s: 1.6.2

> Run MetricQueryService with lower priority
> --
>
> Key: FLINK-10253
> URL: https://issues.apache.org/jira/browse/FLINK-10253
> Project: Flink
>  Issue Type: Sub-task
>  Components: Metrics
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> We should run the {{MetricQueryService}} with a lower priority than the main 
> Flink components. An idea would be to start the underlying threads with a 
> lower priority.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9752) Add an S3 RecoverableWriter

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9752:
-
Fix Version/s: 1.6.2

> Add an S3 RecoverableWriter
> ---
>
> Key: FLINK-9752
> URL: https://issues.apache.org/jira/browse/FLINK-9752
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming Connectors
>Reporter: Stephan Ewen
>Assignee: Kostas Kloudas
>Priority: Major
> Fix For: 1.7.0, 1.6.2
>
>
> S3 offers persistence only when uploads are complete, i.e. at the end of a 
> simple upload or of an individual part of a MultiPartUpload.
> We should implement a RecoverableWriter for S3 that does a MultiPartUpload 
> with a Part per checkpoint.
> Recovering the writer needs the MultiPartUploadID and the list of ETags of 
> previous parts.
> We need additional staging of data in Flink state to work around the fact that
>  - Parts in a MultiPartUpload must be at least 5MB
>  - Part sizes must be known up front. (Note that data can still be streamed 
> in the upload)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
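
A deliberately S3-agnostic sketch of the staging idea from FLINK-9752 above (the uploader interface is hypothetical and no AWS SDK calls are shown): bytes are buffered locally and only handed over as a part once at least 5 MB have accumulated, while a smaller trailing remainder would have to be carried in state across a checkpoint.

{code:java}
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

public class MultiPartStagingSketch {

    /** Hypothetical stand-in for the actual multi-part upload client. */
    interface PartUploader {
        String uploadPart(int partNumber, byte[] data); // returns an ETag-like handle
    }

    static final int MIN_PART_SIZE = 5 * 1024 * 1024; // S3 minimum part size

    private final PartUploader uploader;
    private final ByteArrayOutputStream staging = new ByteArrayOutputStream();
    private final List<String> etags = new ArrayList<>();
    private int nextPartNumber = 1;

    MultiPartStagingSketch(PartUploader uploader) {
        this.uploader = uploader;
    }

    void write(byte[] data) {
        staging.write(data, 0, data.length);
        if (staging.size() >= MIN_PART_SIZE) {
            // Large enough for a real part: upload it and remember the ETag for recovery.
            etags.add(uploader.uploadPart(nextPartNumber++, staging.toByteArray()));
            staging.reset();
        }
    }

    /** On a checkpoint, anything below MIN_PART_SIZE must be kept in Flink state. */
    byte[] snapshotStagedBytes() {
        return staging.toByteArray();
    }

    List<String> snapshotEtags() {
        return new ArrayList<>(etags);
    }

    public static void main(String[] args) {
        MultiPartStagingSketch sketch = new MultiPartStagingSketch(
                (partNumber, data) -> "etag-" + partNumber + "-" + data.length);
        sketch.write(new byte[6 * 1024 * 1024]); // crosses the 5 MB threshold -> one part
        sketch.write(new byte[1024]);            // stays staged until the next checkpoint
        System.out.println("parts=" + sketch.snapshotEtags()
                + ", staged=" + sketch.snapshotStagedBytes().length + " bytes");
    }
}
{code}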


[jira] [Updated] (FLINK-9878) IO worker threads BLOCKED on SSL Session Cache while CMS full gc

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9878:
-
Fix Version/s: 1.6.2

> IO worker threads BLOCKED on SSL Session Cache while CMS full gc
> 
>
> Key: FLINK-9878
> URL: https://issues.apache.org/jira/browse/FLINK-9878
> Project: Flink
>  Issue Type: Bug
>  Components: Network
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> According to https://github.com/netty/netty/issues/832, there is a JDK issue 
> during garbage collection when the SSL session cache is not limited. We 
> should allow the user to configure this and further (advanced) SSL parameters 
> for fine-tuning to fix this and similar issues. In particular, the following 
> parameters should be configurable:
> - SSL session cache size
> - SSL session timeout
> - SSL handshake timeout
> - SSL close notify flush timeout



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9761) Potential buffer leak in PartitionRequestClientHandler during job failures

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9761:
-
Fix Version/s: 1.6.2

> Potential buffer leak in PartitionRequestClientHandler during job failures
> --
>
> Key: FLINK-9761
> URL: https://issues.apache.org/jira/browse/FLINK-9761
> Project: Flink
>  Issue Type: Bug
>  Components: Network
>Affects Versions: 1.5.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> {{PartitionRequestClientHandler#stagedMessages}} may be accessed from 
> multiple threads:
> 1) Netty's IO thread
> 2) During cancellation, 
> {{PartitionRequestClientHandler.BufferListenerTask#notifyBufferDestroyed}} is 
> called
> If {{PartitionRequestClientHandler.BufferListenerTask#notifyBufferDestroyed}} 
> thinks {{stagedMessages}} is empty, however, it will not install the 
> {{stagedMessagesHandler}} that consumes and releases buffers from received 
> messages.
> Unless some unexpected combination of code calls prevents this from 
> happening, this would leak the non-recycled buffers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9761) Potential buffer leak in PartitionRequestClientHandler during job failures

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9761:
-
Fix Version/s: (was: 1.6.1)

> Potential buffer leak in PartitionRequestClientHandler during job failures
> --
>
> Key: FLINK-9761
> URL: https://issues.apache.org/jira/browse/FLINK-9761
> Project: Flink
>  Issue Type: Bug
>  Components: Network
>Affects Versions: 1.5.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> {{PartitionRequestClientHandler#stagedMessages}} may be accessed from 
> multiple threads:
> 1) Netty's IO thread
> 2) During cancellation, 
> {{PartitionRequestClientHandler.BufferListenerTask#notifyBufferDestroyed}} is 
> called
> If {{PartitionRequestClientHandler.BufferListenerTask#notifyBufferDestroyed}} 
> thinks {{stagedMessages}} is empty, however, it will not install the 
> {{stagedMessagesHandler}} that consumes and releases buffers from received 
> messages.
> Unless some unexpected combination of code calls prevents this from 
> happening, this would leak the non-recycled buffers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10311) HA end-to-end/Jepsen tests for standby Dispatchers

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10311:
--
Fix Version/s: (was: 1.6.1)

> HA end-to-end/Jepsen tests for standby Dispatchers
> --
>
> Key: FLINK-10311
> URL: https://issues.apache.org/jira/browse/FLINK-10311
> Project: Flink
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Gary Yao
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> We should add end-to-end or Jepsen tests to verify the HA behaviour when 
> using multiple standby Dispatchers. In particular, we should verify that jobs 
> get properly cleaned up after they finished successfully (see FLINK-10255 and 
> FLINK-10011):
> 1. Test that a standby Dispatcher does not affect job execution and resource 
> cleanup
> 2. Test that a standby Dispatcher recovers all submitted jobs after the 
> leader loses leadership



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9879) Find sane defaults for (advanced) SSL session parameters

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9879:
-
Fix Version/s: 1.6.2

> Find sane defaults for (advanced) SSL session parameters
> 
>
> Key: FLINK-9879
> URL: https://issues.apache.org/jira/browse/FLINK-9879
> Project: Flink
>  Issue Type: Improvement
>  Components: Network
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Nico Kruber
>Priority: Major
> Fix For: 1.6.2
>
>
> After adding these configuration parameters with 
> https://issues.apache.org/jira/browse/FLINK-9878:
> - SSL session cache size
> - SSL session timeout
> - SSL handshake timeout
> - SSL close notify flush timeout
> We should try to find sane defaults that "just work" :)
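
For context, a hedged Java sketch of where these four knobs typically end up, assuming the JDK {{SSLSessionContext}} for the session cache and Netty 4.1's {{SslHandler}} for the handshake and close-notify flush timeouts; the concrete numbers are illustrative placeholders, not the defaults this ticket is looking for:

{code:java}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLSessionContext;

import io.netty.handler.ssl.SslHandler;

// Hedged sketch of where the four parameters from the ticket typically plug in.
public class SslDefaultsSketch {

    public static SslHandler configuredSslHandler() throws Exception {
        SSLContext sslContext = SSLContext.getInstance("TLS");
        sslContext.init(null, null, null);

        // SSL session cache size and SSL session timeout (JDK level).
        SSLSessionContext sessionContext = sslContext.getServerSessionContext();
        sessionContext.setSessionCacheSize(10_000);      // number of cached sessions
        sessionContext.setSessionTimeout(24 * 60 * 60);  // seconds

        SSLEngine engine = sslContext.createSSLEngine();
        engine.setUseClientMode(false);

        // SSL handshake timeout and close-notify flush timeout (Netty level).
        SslHandler sslHandler = new SslHandler(engine);
        sslHandler.setHandshakeTimeoutMillis(10_000);
        sslHandler.setCloseNotifyFlushTimeoutMillis(3_000);
        return sslHandler;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("handshake timeout (ms): "
                + configuredSslHandler().getHandshakeTimeoutMillis());
    }
}
{code}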



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10247) Run MetricQueryService in separate thread pool

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10247:
--
Fix Version/s: 1.6.2

> Run MetricQueryService in separate thread pool
> --
>
> Key: FLINK-10247
> URL: https://issues.apache.org/jira/browse/FLINK-10247
> Project: Flink
>  Issue Type: Sub-task
>  Components: Metrics
>Affects Versions: 1.5.3, 1.6.0, 1.7.0
>Reporter: Till Rohrmann
>Assignee: Shimin Yang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> In order to make the {{MetricQueryService}} run independently of the main 
> Flink components, it should get its own dedicated thread pool assigned.
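
A minimal sketch of the general technique, assuming nothing about Flink's actual wiring beyond what the description says: the metric query work gets its own small, clearly named thread pool so that a slow metric dump cannot block the threads serving the main components. The class and thread names are placeholders:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch: a dedicated, named thread pool for serving metric queries.
public class DedicatedMetricPoolSketch {

    public static ExecutorService createMetricQueryExecutor(int poolSize) {
        ThreadFactory threadFactory = new ThreadFactory() {
            private final AtomicInteger counter = new AtomicInteger();

            @Override
            public Thread newThread(Runnable runnable) {
                Thread thread = new Thread(runnable,
                        "flink-metric-query-" + counter.incrementAndGet());
                thread.setDaemon(true); // never block JVM shutdown on metric queries
                return thread;
            }
        };
        return Executors.newFixedThreadPool(poolSize, threadFactory);
    }

    public static void main(String[] args) {
        ExecutorService metricQueryExecutor = createMetricQueryExecutor(2);
        // A slow or blocked metric dump now only ties up this pool, not the
        // threads that run the main Flink components.
        metricQueryExecutor.submit(() -> System.out.println(
                "metrics served from " + Thread.currentThread().getName()));
        metricQueryExecutor.shutdown();
    }
}
{code}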



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8418) Kafka08ITCase.testStartFromLatestOffsets() times out on Travis

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8418:
-
Fix Version/s: (was: 1.6.1)

> Kafka08ITCase.testStartFromLatestOffsets() times out on Travis
> --
>
> Key: FLINK-8418
> URL: https://issues.apache.org/jira/browse/FLINK-8418
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector, Tests
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Tzu-Li (Gordon) Tai
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> Instance: https://travis-ci.org/kl0u/flink/builds/327733085



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-8623) ConnectionUtilsTest.testReturnLocalHostAddressUsingHeuristics unstable on Travis

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-8623:
-
Fix Version/s: (was: 1.6.1)

> ConnectionUtilsTest.testReturnLocalHostAddressUsingHeuristics unstable on 
> Travis
> 
>
> Key: FLINK-8623
> URL: https://issues.apache.org/jira/browse/FLINK-8623
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.5.0
>Reporter: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> {{ConnectionUtilsTest.testReturnLocalHostAddressUsingHeuristics}} fails on 
> Travis: https://travis-ci.org/apache/flink/jobs/33932



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10097) More tests to increase StreamingFileSink test coverage

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-10097:
--
Fix Version/s: (was: 1.6.1)

> More tests to increase StreamingFileSink test coverage
> --
>
> Key: FLINK-10097
> URL: https://issues.apache.org/jira/browse/FLINK-10097
> Project: Flink
>  Issue Type: Sub-task
>  Components: filesystem-connector
>Affects Versions: 1.6.0
>Reporter: Kostas Kloudas
>Assignee: Kostas Kloudas
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.2
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9646) ExecutionGraphCoLocationRestartTest.testConstraintsAfterRestart failed on Travis

2018-09-15 Thread Till Rohrmann (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-9646:
-
Fix Version/s: (was: 1.6.1)

> ExecutionGraphCoLocationRestartTest.testConstraintsAfterRestart failed on 
> Travis
> 
>
> Key: FLINK-9646
> URL: https://issues.apache.org/jira/browse/FLINK-9646
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Critical
> Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> {{ExecutionGraphCoLocationRestartTest.testConstraintsAfterRestart}} fails on 
> Travis.
> https://api.travis-ci.org/v3/job/395394863/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

