Re: [VOTE] Release 1.5.0, release candidate #5

2018-05-24 Thread Timo Walther
I tried to build the release locally but 3 tests failed. 2 are related to 
the environment in which the tests are executed (see [1]) and the other 
should have been caught by Travis, which did not happen (see the last 
comments in [2]).


[1] https://issues.apache.org/jira/browse/FLINK-9424
[2] https://issues.apache.org/jira/browse/FLINK-9234

Timo

On 23.05.18 at 17:33, Till Rohrmann wrote:

-1

Piotr just found a race condition between the TM registration at the RM and
slot requests coming from the SlotManager/RM [1]. This is a release blocker
since it affects all deployments. Consequently I have to cancel this RC :-(

[1] https://issues.apache.org/jira/browse/FLINK-9427

On Wed, May 23, 2018 at 5:15 PM, Gary Yao  wrote:


+1 (non-binding)

I have run all examples (batch & streaming), and only found a non-blocking
issue with TPCHQuery3 [1], which was introduced a year ago.

I have also deployed a cluster with HA enabled on YARN (Hadoop 2.8.3)
without problems.

[1] https://issues.apache.org/jira/browse/FLINK-9399

On Wed, May 23, 2018 at 4:50 PM, Ted Yu  wrote:


+1

Checked signatures
Ran test suite

Due to FLINK-9340 and FLINK-9091, I had to run tests in multiple rounds.

Cheers

On Wed, May 23, 2018 at 7:39 AM, Fabian Hueske wrote:

+1 (binding)

- checked hashes and signatures
- checked source archive and didn't find unexpected binary files
- built from source archive skipping the tests (mvn -DskipTests clean
install), started a local cluster, and ran an example program.

Thanks,
Fabian



2018-05-23 15:39 GMT+02:00 Till Rohrmann :


Fabian pointed me to the updated ASF release policy [1] and the changes it
implies for the checksum files. New releases should no longer provide an MD5
checksum file, and the sha checksum file should have a proper file name
extension `sha512` instead of `sha`. I've updated the release artifacts [2]
accordingly.

[1] https://www.apache.org/dev/release-distribution.html
[2] http://people.apache.org/~trohrmann/flink-1.5.0-rc5/

Cheers,
Till

On Wed, May 23, 2018 at 11:18 AM, Piotr Nowojski <pi...@data-artisans.com> wrote:


+1 from me.

Additionally for this RC5 I did some manual tests to double check backward
compatibility of the bug fix:
https://issues.apache.org/jira/browse/FLINK-9295

The issue with this bug fix was that it was merged after 1.5.0 RC4 but just
before 1.5.0 RC5, so it missed release testing.

Piotrek


On 23 May 2018, at 09:08, Till Rohrmann wrote:

Thanks for the pointer Sihua, I've properly closed FLINK-9070.

On Wed, May 23, 2018 at 4:49 AM, sihua zhou wrote:

Hi,
just one minor thing: I found the JIRA release notes seem a bit inconsistent
with this RC. For example, https://issues.apache.org/jira/browse/FLINK-9058
hasn't been merged yet but is included in the release notes, and
https://issues.apache.org/jira/browse/FLINK-9070 has been merged but is not
included in the release notes.

Best, Sihua


On 05/23/2018 09:18, Till Rohrmann wrote:

Hi everyone,

Please review and vote on the release candidate #5 for the version 1.5.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint 1F302569A96CFFD5 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-1.5.0-rc5" [5],

Please use this document for coordinating testing efforts: [6]

Since the newly included fixes affect only individual components and are
covered by tests, I will shorten the voting period; it ends tomorrow at
5:30pm CET. It is adopted by majority approval, with at least 3 PMC
affirmative votes.

Thanks,
Your friendly Release Manager

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12341764
[2] http://people.apache.org/~trohrmann/flink-1.5.0-rc5/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1160/
[5]
https://git-wip-us.apache.org/repos/asf?p=flink.git;a=commit;h=841bfe4cceecc9cd6ad3d568173fdc0149a5dc9b
[6]
https://docs.google.com/document/d/1C1WwUphQj597jExWAXFUVkLH9Bi7-ir6drW9BgB8Ezo/edit?usp=sharing

Pro-tip: you can create a settings.xml file with these contents:

<settings>
  <activeProfiles>
    <activeProfile>flink-1.5.0</activeProfile>
  </activeProfiles>
  <profiles>
    <profile>
      <id>flink-1.5.0</id>
      <repositories>
        <repository>
          <id>flink-1.5.0</id>
          <url>https://repository.apache.org/content/repositories/orgapacheflink-1160/</url>
        </repository>
        <repository>
          <id>archetype</id>
          <url>https://repository.apache.org/content/repositories/orgapacheflink-1160/</url>
        </repository>
      </repositories>
    </profile>
  </profiles>
</settings>

And reference that in your maven commands via --settings
path/to/settings.xml. This is useful for creating a quickstart based on the
staged release and for building against the staged jars.
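
For example, to create a quickstart against the staged repository (an
illustrative invocation; the coordinates are those of the standard Flink
quickstart archetype):

mvn archetype:generate --settings path/to/settings.xml \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=1.5.0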








[jira] [Created] (FLINK-9431) Introduce TimeEnd State to flink cep

2018-05-24 Thread aitozi (JIRA)
aitozi created FLINK-9431:
-

 Summary: Introduce TimeEnd State to flink cep
 Key: FLINK-9431
 URL: https://issues.apache.org/jira/browse/FLINK-9431
 Project: Flink
  Issue Type: Improvement
  Components: CEP
Affects Versions: 1.4.2
Reporter: aitozi
Assignee: aitozi


Currently Flink CEP has no support for reaching a final state after some time
has passed. If I use a pattern like
{code:java}Pattern.begin("A").notFollowedBy("B"){code}
and I want element A to be emitted after 5 minutes, there is no way to do it.

I want to introduce a timeEnd state that works together with notFollowedBy to
handle this scenario.
It could be used like this:
{code:java}Pattern.begin("A").notFollowedBy("B").timeEnd("end"){code}
[~dawidwys] [~kkl0u] Is this meaningful?
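
A slightly fuller sketch of the intended usage (timeEnd is the proposed API
and does not exist yet; the Event type and the 5 minute timeout are just
examples):
{code:java}
Pattern<Event, ?> pattern = Pattern.<Event>begin("A")
    .notFollowedBy("B")
    .timeEnd("end")            // proposed: reach a final state once time is up
    .within(Time.minutes(5));  // emit A if no B arrived within 5 minutes
{code}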



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9432) Support extract epoch, decade, millisecond, microsecond

2018-05-24 Thread Sergey Nuyanzin (JIRA)
Sergey Nuyanzin created FLINK-9432:
--

 Summary: Support extract epoch, decade, millisecond, microsecond
 Key: FLINK-9432
 URL: https://issues.apache.org/jira/browse/FLINK-9432
 Project: Flink
  Issue Type: Sub-task
Reporter: Sergey Nuyanzin
Assignee: Sergey Nuyanzin


The task is to separate the work that depends on
https://issues.apache.org/jira/browse/CALCITE-2303 from everything else that
can be done without upgrading Avatica/Calcite.

Currently the implementations of the following functions are blocked:
{code:sql}
extract(decade from ...)
extract(epoch from ...)
extract(millisecond from ...)
extract(microsecond from ...)
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Merging PR to release branches during RC voting

2018-05-24 Thread Piotr Nowojski
Hey,

I would like to raise an issue that happened a couple of times during the
1.5.0 RCx votings. There were some bug fixes like:
https://issues.apache.org/jira/browse/FLINK-9349
https://issues.apache.org/jira/browse/FLINK-9295

that were merged to the release branch after some RC had been created. They
were marked with fix version set to 1.5.1, and when the previous RC vote was
cancelled and a new RC was created, they sneaked into 1.5.0.

A minor thing is that the Jira issues should be updated accordingly. If you
merge something to a release branch, please pay attention to cancelled RCs
and to the commit from which the new RC is being created.

A much bigger issue is that when such fixes sneak into the release so late,
they do not go through the normal release testing procedure. I don't know how
we should handle such situations, but if something like that happened to your
commits during this 1.5.0 cycle, please double check them when the new RC
comes out. They might require some manual testing or running some end-to-end
tests.

I would propose that any merges to release branches during RC votes (after
release testing has kicked off) should go through the release manager, so
that they can decide whether to include such fixes or postpone them to the
next release. Maybe this would require creating separate release branches for
1.5.0 and 1.5.1.

Piotrek

[jira] [Created] (FLINK-9433) SystemProcessingTimeService does not work properly

2018-05-24 Thread Ruidong Li (JIRA)
Ruidong Li created FLINK-9433:
-

 Summary: SystemProcessingTimeService does not work properly
 Key: FLINK-9433
 URL: https://issues.apache.org/jira/browse/FLINK-9433
 Project: Flink
  Issue Type: Improvement
  Components: Streaming
Reporter: Ruidong Li
Assignee: Ruidong Li


If a WindowOperator and an AsyncWaitOperator are chained together
(WindowOperator --> AsyncWaitOperator), and the queue of the AsyncWaitOperator
is full when the time trigger of the WindowOperator fires and calls collect(),
the call blocks until the queue of the AsyncWaitOperator is no longer full. In
the meantime, the time trigger of the AsyncWaitOperator cannot fire, because
the SystemProcessingTimeService has a capacity of only one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Increasing Disk Read Throughput and IOPS

2018-05-24 Thread Piotr Nowojski
Hi,

This issue might have something to do with compaction. Problems with
compaction can especially degrade read performance (or just increase read
IO). Have you tried to enforce more compactions or change the
CompactionStyle?

Have you taken a look at
org.apache.flink.contrib.streaming.state.PredefinedOptions?
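
For reference, a minimal sketch of selecting one of the predefined profiles
(the checkpoint path is a placeholder; 'true' keeps incremental checkpoints
enabled):

import java.io.IOException;
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PredefinedOptionsExample {
    public static void main(String[] args) throws IOException {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        RocksDBStateBackend backend =
            new RocksDBStateBackend("file:///path/to/checkpoints", true);
        // Profile for spinning disks with more memory; among other things it
        // switches to level-style compaction and enables bloom filters.
        backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
        env.setStateBackend(backend);
    }
}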

Maybe Stefan or Andrey could share more input on this.

Piotrek


> On 22 May 2018, at 08:12, Govindarajan Srinivasaraghavan 
>  wrote:
> 
> Hi All,
> 
> We are running flink in AWS and we are observing a strange behavior. We are 
> using docker containers, EBS for storage and Rocks DB state backend. We have 
> a few map and value states with checkpointing every 30 seconds and 
> incremental checkpointing turned on. The issue we are noticing is that the 
> read IOPS and read throughput gradually increase over time and keep 
> constantly growing. The write throughput and write bytes are not increasing 
> as much as the reads. The checkpoints are written to durable NFS storage. 
> We are not sure what is causing this constant increase in read throughput, 
> but because of it we are running out of EBS burst balance and need to 
> restart the job every once in a while. I have attached the EBS read and 
> write metrics. Has anyone encountered this issue, and what could be a 
> possible solution?
> 
> We have also tried setting the below rocksdb options but didn't help.
> 
> DBOptions:
> currentOptions.setOptimizeFiltersForHits(true)
> .setWriteBufferSize(536870912)
> .setMaxWriteBufferNumber(5)
> .setMinWriteBufferNumberToMerge(2);
> ColumnFamilyOptions:
> 
> currentOptions.setMaxBackgroundCompactions(4)
> .setMaxManifestFileSize(1048576)
> .setMaxLogFileSize(1048576);
> 
> 
> 
> Thanks.



Re: [VOTE] Release 1.5.0, release candidate #5

2018-05-24 Thread Till Rohrmann
Thanks for reporting the issues Timo. I think neither of them is blocking
the release; they should be addressed in the following releases.

Cheers,
Till

On Thu, May 24, 2018 at 10:52 AM, Timo Walther  wrote:

> I tried to build the release locally but 3 tests failed. 2 are related to
> the environment in which the tests are executed (see [1]) and the other
> should have been caught by Travis, which did not happen (see the last
> comments in [2]).
>
> [1] https://issues.apache.org/jira/browse/FLINK-9424
> [2] https://issues.apache.org/jira/browse/FLINK-9234
>
> Timo

[jira] [Created] (FLINK-9434) Test instability in YARNSessionCapacitySchedulerITCase#

2018-05-24 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-9434:
--

 Summary: Test instability in YARNSessionCapacitySchedulerITCase#
 Key: FLINK-9434
 URL: https://issues.apache.org/jira/browse/FLINK-9434
 Project: Flink
  Issue Type: Bug
  Components: Tests, YARN
Affects Versions: 1.5.0, 1.6.0
Reporter: Nico Kruber


{code}
Test 
testNonexistingQueueWARNmessage(org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase)
 failed with:
java.lang.AssertionError: There is at least one application on the cluster is 
not finished.App application_1527164710351_0007 is in state RUNNING
at org.junit.Assert.fail(Assert.java:88)
at 
org.apache.flink.yarn.YarnTestBase.checkClusterEmpty(YarnTestBase.java:217)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
{code}

https://travis-ci.org/apache/flink/jobs/383117221



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9435) Remove per-key selection Tuple instantiation via reflection in KeySelectorUtil#ComparableKeySelector

2018-05-24 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-9435:
--

 Summary: Remove per-key selection Tuple instantiation via 
reflection in KeySelectorUtil#ComparableKeySelector
 Key: FLINK-9435
 URL: https://issues.apache.org/jira/browse/FLINK-9435
 Project: Flink
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.4.2, 1.4.1, 1.3.3, 1.3.2, 1.4.0, 1.3.1, 1.3.0, 1.5.0, 
1.6.0
Reporter: Nico Kruber
Assignee: Nico Kruber


Every {{ComparableKeySelector#getKey()}} call currently creates a new tuple 
via {{Tuple.getTupleClass(keyLength).newInstance();}}, which seems expensive. 
Instead, we could keep a template tuple and use {{Tuple#copy()}}, which copies 
the right subclass in a more efficient way.
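
A rough sketch of the idea (the names are illustrative, not the actual patch):
{code:java}
private final int keyLength;
private transient Tuple template;  // instantiated once, then only copied

public Tuple newKeyInstance() throws Exception {
    if (template == null) {
        // the reflective instantiation now happens once instead of per key
        template = Tuple.getTupleClass(keyLength).newInstance();
    }
    // Tuple#copy() dispatches to the concrete TupleN subclass directly
    return template.copy();
}
{code}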



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Flink 1.5 - Job fails to execute in multiple taskmanagers (parallelism > 1)

2018-05-24 Thread Edward Rojas
Hi all,

I was testing Flink 1.5 rc5 and I found this issue. I'm running a cluster in
HA mode with one jobmanager, several taskmanagers, each one with two task
slots and default parallelism set to 2.

I'm running two jobs: a simple one with a kafka consumer, a filter and a
sink; the other is a bit more complex, with a kafka consumer, filters,
flatmaps, keyed process functions and sinks.

Both jobs run correctly when they are assigned to run in the 2 slots of the
same taskmanager.

But when one slot is in one taskmanager and the other in a different one,
the simpler job runs correctly but the complex one fails with the
following error:

org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
java.lang.UnsupportedOperationException: Heap buffer
at
org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:170)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at
org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at
org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at
org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at
org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.exceptionCaught(SslHandler.java:697)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:809)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:341)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
at
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
at
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:748)
Caused by:
org.apache.flink.shaded.netty4.io.netty.handler.codec.DecoderException:
java.lang.UnsupportedOperationException: Heap buffer
at
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:346)
at
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:229)
at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
... 9 more
Caused by: java.lang.UnsupportedOperationException: Heap buffer
at
org.apache.flink.runtime.io.network.netty.NettyBufferPool.heapBuffer(NettyBufferPool.java:236)
at
org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.decode(Ssl

[VOTE] Release 1.5.0, release candidate #6

2018-05-24 Thread Till Rohrmann
Hi everyone,

Please review and vote on the release candidate #6 for the version 1.5.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint 1F302569A96CFFD5 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-1.5.0-rc6" [5],
* PR to update the community web site [6]

Please use this document for coordinating testing efforts: [7]

The voting period ends tomorrow at 5pm CET. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Your friendly Release Manager

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12341764
[2] http://people.apache.org/~trohrmann/flink-1.5.0-rc6/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1161/
[5]
https://git-wip-us.apache.org/repos/asf?p=flink.git;a=commit;h=c61b108b8eaa22eac3dc492b3c95c22b4177003f
[6] https://github.com/apache/flink-web/pull/106
[7]
https://docs.google.com/document/d/10KDBLzt25Z44pdZBSm8MTeKldc4UTUAbAynfuVQWOt0/edit?usp=sharing

Pro-tip: you can create a settings.xml file with these contents:

<settings>
  <activeProfiles>
    <activeProfile>flink-1.5.0</activeProfile>
  </activeProfiles>
  <profiles>
    <profile>
      <id>flink-1.5.0</id>
      <repositories>
        <repository>
          <id>flink-1.5.0</id>
          <url>https://repository.apache.org/content/repositories/orgapacheflink-1161/</url>
        </repository>
        <repository>
          <id>archetype</id>
          <url>https://repository.apache.org/content/repositories/orgapacheflink-1161/</url>
        </repository>
      </repositories>
    </profile>
  </profiles>
</settings>


And reference that in your maven commands via --settings
path/to/settings.xml. This is useful for creating a quickstart based on the
staged release and for building against the staged jars.


Re: Flink 1.5 - Job fails to execute in multiple taskmanagers (parallelism > 1)

2018-05-24 Thread Stephan Ewen
Two quick questions:

  - do you explicitly configure Flink memory on-heap / off-heap?
  - can you check whether this also happens when SSL is disabled?


On Thu, May 24, 2018 at 6:21 PM, Edward Rojas wrote:

> Hi all,
>
> I was testing Flink 1.5 rc5 and I found this issue. I'm running a cluster
> in
> HA mode with one jobmanager, several taskmanagers, each one with two task
> slots and default parallelism set to 2.
>
> I'm running two jobs: a simple one with a kafka consumer, a filter and a
> sink; the other is a bit more complex, with a kafka consumer, filters,
> flatmaps, keyed process functions and sinks.
>
> Both jobs run correctly when they are assigned to run in the 2 slots of the
> same taskmanager.
>
> But when one slot is in one taskmanager and the other in a different one,
> the simpler job runs correctly but the complex one fails with the
> following error:
>
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
> java.lang.UnsupportedOperationException: Heap buffer
> [...]

Re: Increasing Disk Read Throughput and IOPS

2018-05-24 Thread Stephan Ewen
One thing that you can always do is disable fsync, because Flink does not
rely on RocksDB's fsync for persistence.

If you disable incremental checkpoints, does that help?
If yes, it could be an issue with too many small SSTable files due to
incremental checkpoints (an issue we have on the roadmap to fix).
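
A minimal sketch of disabling fsync via an options factory (assuming the
RocksDB state backend is already in use; the class name is just an example):

import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

public class NoFsyncOptionsFactory implements OptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions) {
        // Durability comes from Flink's checkpoints, not RocksDB's own fsync.
        return currentOptions.setUseFsync(false);
    }

    @Override
    public ColumnFamilyOptions createColumnFamilyOptions(ColumnFamilyOptions currentOptions) {
        // Leave the column family options unchanged.
        return currentOptions;
    }
}

and then register it via rocksDbBackend.setOptions(new NoFsyncOptionsFactory());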

On Thu, May 24, 2018 at 3:52 PM, Piotr Nowojski wrote:

> Hi,
>
> This issue might have something to do with compaction. Problems with
> compaction can especially degrade read performance (or just increase read
> IO). Have you tried to enforce more compactions or change the
> CompactionStyle?
>
> Have you taken a look at
> org.apache.flink.contrib.streaming.state.PredefinedOptions?
>
> Maybe Stefan or Andrey could share more input on this.
>
> Piotrek
>
>
> > On 22 May 2018, at 08:12, Govindarajan Srinivasaraghavan <
> govindragh...@gmail.com> wrote:
> >
> > Hi All,
> >
> > We are running flink in AWS and we are observing a strange behavior. [...]
>


Re: Flink 1.5 - Job fails to execute in multiple taskmanagers (parallelism > 1)

2018-05-24 Thread Edward Rojas
Regarding heap, the only settings I configure explicitly are
`jobmanager.heap.mb`, `taskmanager.heap.mb` and
`taskmanager.memory.preallocate: false`. All other memory settings
have their default values.
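
For reference, in flink-conf.yaml these look like the following (the sizes
here are illustrative; the actual values are not stated above):

jobmanager.heap.mb: 1024
taskmanager.heap.mb: 4096
taskmanager.memory.preallocate: false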

I just tested and it fails only when SSL is enabled.





Re: Flink 1.5 - Job fails to execute in multiple taskmanagers (parallelism > 1)

2018-05-24 Thread Edward Rojas
This may help to target the issue: if I leave global SSL enabled but set
taskmanager.data.ssl.enabled: false, it works again.





Re: Merging PR to release branches during RC voting

2018-05-24 Thread Tzu-Li (Gordon) Tai
Hi Piotr,

Thanks for bringing the issue up. I agree that this is something we should look 
into improving.

Strictly speaking, I don't think it is an issue limited only to the time
after an RC vote has started.
Taking 1.5.0 as an example, release-1.5 was branched out at the announcement
of the feature freeze, nearly a month before the first RC for 1.5.0 was
created.
That period of time was meant for release testing before the actual votes
took place, but since commits still went through to release-1.5 during this
time without much restriction, there never really was a point in time where
individual testing efforts could have a commonly agreed scope.
We usually create RC0s for the initial testing efforts, but by the time RC1
is out, it is hard to say that the test efforts on RC0 were still valid,
because there was little restraint on what was merged in between.

As for the issue in the context of RC voting:
Up to now, the default way we have dealt with non-blocking bug fixes was to
"not block RC votes on them, merge them to the release branch, and see if
they happen to get in if the previous RC is cancelled."
Sometimes, critical bug fixes were classified as non-blocking simply because
they had already been introduced in a previously released version.
The fact that these critical bug fixes sneak into the release in the end, as
you mentioned, arguably changes the scope of the release across RCs.
Previous RC voters usually expect their test efforts to carry over to the new
RC, because additional commits in the new RC *should* only incrementally
affect the scope.
For the two bug fixes you pointed out (FLINK-9349 and FLINK-9295), it is
arguable that this wasn't the case, since a lot of our previous test efforts
involved Kafka substantially.

> A minor thing is that the Jira issues should be updated accordingly. If you
> merge something to a release branch, please pay attention to cancelled RCs
> and to the commit from which the new RC is being created.

I don't think this is as minor as it seems. The fact that this correction is
a very manual process that happens at an unanticipated point in time can
easily lead to incorrect release notes (which I think has already happened a
few times in the past, partially because of this issue).

> I would propose that any merges to release branches during RC votes (after
> release testing has kicked off) should go through the release manager, so
> that they can decide whether to include such fixes or postpone them to the
> next release. Maybe this would require creating separate release branches
> for 1.5.0 and 1.5.1.

I would +1 this. Just to rephrase it concretely, taking 1.5.0 as an example:
A specific release-1.5.0 release branch is created from release-1.5 as soon
as the feature freeze is announced.
All JIRAs resolved from this point on should have their 'fix version' set to
'1.5.1', and commits only go to the release-1.5 branch. They will only be
included in release-1.5.1 once that branches out to prepare for the 1.5.1
release.
Any commit that is to be included in release-1.5.0 strictly needs to go
through the release manager; a commit that does go through to release-1.5.0
means that a new RC should be created and currently ongoing votes are
cancelled.
Managing the change of such JIRAs' fix version from '1.5.1' to '1.5.0' should
then be much easier, since it would happen along with the action of bringing
the commits over to release-1.5.0, and not at the event of an RC being
cancelled with the commits just happening to sneak in.

For minor bug fix releases, it should be quite easy to execute this proposal.
For major releases, we might need to consider the overhead of cherry-picking
commits over.
Again, looking at 1.5.0: there were hundreds of commits that were merged to
both master (1.6-SNAPSHOT) and release-1.5.
Assuming we had this process in place, how many of those would also have
needed to be merged to release-1.5.0 (as decided by the release manager)?

OTOH, if such a process gives us more control over scope creep and more valid
and meaningful test efforts early on, right after the feature freeze, then
the overhead might be worth it.

Cheers,
Gordon

Re: Flink 1.5 - Job fails to execute in multiple taskmanagers (parallelism > 1)

2018-05-24 Thread Till Rohrmann
Thanks for reporting the issue Edward.

Taking a look at Netty's SslHandler, it looks like we introduced this
problem with the update of the cipher algorithms [1]. Apparently, the
SslHandler wants to use an inbound heap byte buffer when a cipher suite with
GCM is enabled [2]. This seems to be fixed in a later version of Netty 4 (we
are using 4.0.27 at the moment). The problem with heap byte buffers is that
our NettyBufferPool does not support allocating them, in order to keep the
memory consumption under control.

As a workaround, you could set `security.ssl.algorithms` to
`TLS_RSA_WITH_AES_128_CBC_SHA` in the Flink configuration. That should make
it work again, at the cost of using a cipher which is no longer recommended.
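
That is, in flink-conf.yaml:

security.ssl.algorithms: TLS_RSA_WITH_AES_128_CBC_SHA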

[1] https://issues.apache.org/jira/browse/FLINK-9310
[2]
https://github.com/netty/netty/blob/netty-4.0.27.Final/handler/src/main/java/io/netty/handler/ssl/SslHandler.java#L1218

Cheers,
Till

On Thu, May 24, 2018 at 8:49 PM, Edward Rojas wrote:

> This may help to target the issue: if I leave global SSL enabled but set
> taskmanager.data.ssl.enabled: false, it works again.
>
>
>
>