[jira] [Created] (FLINK-36556) Allow to configure starting buffer size when using buffer debloating

2024-10-16 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-36556:
--

 Summary: Allow to configure starting buffer size when using buffer 
debloating
 Key: FLINK-36556
 URL: https://issues.apache.org/jira/browse/FLINK-36556
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network, Runtime / Task
Affects Versions: 1.19.1
Reporter: Piotr Nowojski


Starting buffer size is 32KB, so during recovery/startup, before backpressure 
kicks in, Flink job can be flooded with large buffers, completely stalling the 
progress.

Proposed solution is to make this starting size configurable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-36555) Guarantee debloated buffer size grows even with very small alpha

2024-10-16 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-36555:
--

 Summary: Guarantee debloated buffer size grows even with very 
small alpha
 Key: FLINK-36555
 URL: https://issues.apache.org/jira/browse/FLINK-36555
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network, Runtime / Task
Affects Versions: 1.19.1
Reporter: Piotr Nowojski


With small buffer sizes and/or small EMA's alpha, currently it can happen 
buffer size won't grow at all. For example:
* current buffer size is 16b
* max new desired size is only capped at 2 * 16b = 32b
* so with alpha ~= 0.03 (or smaller) the new buffer size would be still 16b



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-36416) Enable splittable timers for temporal join, temporal sort and windowed aggregation in SQL/Table API

2024-10-01 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-36416:
--

 Summary: Enable splittable timers for temporal join, temporal sort 
and windowed aggregation in SQL/Table API
 Key: FLINK-36416
 URL: https://issues.apache.org/jira/browse/FLINK-36416
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / Runtime
Affects Versions: 2.0.0
Reporter: Piotr Nowojski
 Fix For: 2.0.0


This is a follow up for 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-443%3A+Interruptible+timers+firing
 . Temporal join, temporal sort and both windowed and table windowed 
aggregations in Table API/SQL can have large amount of registered/fired time, 
while at the same time enabling splittable timers shouldn't cause any side 
effects for those operators.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-36260) numBytesInLocal and numBuffersInLocal being reported as remote

2024-09-11 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-36260:
--

 Summary: numBytesInLocal and numBuffersInLocal being reported as 
remote
 Key: FLINK-36260
 URL: https://issues.apache.org/jira/browse/FLINK-36260
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics, Runtime / Network
Affects Versions: 1.19.1, 1.20.0, 1.18.1, 1.17.2, 1.16.3, 1.15.4, 1.14.6, 
1.13.6, 1.12.7, 1.11.6
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski


When {{UnknownInputChannel}} is converted to {{LocalInputChannel}}, that 
channel's metrics are incorrectly reported as remote.





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-36108) Wait for state download on cancellation to enforce cleanup

2024-08-20 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-36108:
--

 Summary: Wait for state download on cancellation to enforce cleanup
 Key: FLINK-36108
 URL: https://issues.apache.org/jira/browse/FLINK-36108
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35886) Incorrect watermark idleness timeout accounting when subtask is backpressured/blocked

2024-07-23 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35886:
--

 Summary: Incorrect watermark idleness timeout accounting when 
subtask is backpressured/blocked
 Key: FLINK-35886
 URL: https://issues.apache.org/jira/browse/FLINK-35886
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream, Runtime / Task
Affects Versions: 1.19.1, 1.18.1, 1.20.0
Reporter: Piotr Nowojski


Currently when using watermark with idleness in Flink, idleness can be 
incorrectly detected when reading records from a source that is blocked by the 
runtime. For example this can easily happen when source is either 
backpressured, or blocked by the watermark alignment. In those cases, despite 
there are more records to be read from the source (or source’s split), runtime 
is deciding not to poll (or being unable to) those records. In such case 
idleness timeout can kick in, marking source/source split as idle, which can 
lead to incorrect combined watermark calculations and dropping of incorrectly 
marked late records.

h4. Watermark alignment

If there are two source splits, A and B , and maxAllowedWatermarkDrift is set 
to 30s. 

# Partition A emitted watermark 1042 sec, while partition B sits at watermark 
1000 sec.
# {{1042s - 1000s > maxAllowedWatermarkDrift}}, so partition A is blocked by 
the watermark alignment.
# For the duration of idleTimeout, partition B is emitting some large batch of 
records, that do not advance watermark of that partition by much. For example 
either watermark for partition B stays 1000s, or is updated by a small amount 
to for example 1005s.
# idleTimeout kicks in, marking partition A as idle
# partition B finishes emitting large batch of those older records, and let's 
say now there is a gap in rowtimes. Previously partition B was emitting records 
with rowtime ~1000s, now it jumps to for example ~5000s.
# As partition A is idle, combined watermark jumps to ~5000s as well.
# Watermark alignment unblocks partition A, and it continues emitting records 
with rowtime ~1042s. But now all of those records are dropped due to being late.

h4. Backpressure

When there are two SourceOperator’s, A and B. Due to for example some data 
skew, it could happen that either only A gets backpressured, or A is 
backpressured quicker/sooner. Either way, during that time when A is 
backpressured, while B is not, B can bump the combined watermark high enough, 
so that when backpressure recedes, fresh records from A will be considered as 
late, leading to incorrect results.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35821) ResumeCheckpointManuallyITCase failed with File X does not exist or the user running Flink C has insufficient permissions to access it

2024-07-11 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35821:
--

 Summary: ResumeCheckpointManuallyITCase failed with File X does 
not exist or the user running Flink C has insufficient permissions to access it
 Key: FLINK-35821
 URL: https://issues.apache.org/jira/browse/FLINK-35821
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure
Affects Versions: 2.0.0, 1.20.0
Reporter: Piotr Nowojski


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60857&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba


{noformat}
Jul 11 13:49:46 13:49:46.412 [ERROR] Tests run: 48, Failures: 0, Errors: 1, 
Skipped: 0, Time elapsed: 309.7 s <<< FAILURE! -- in 
org.apache.flink.test.checkpointing.ResumeCheckpointManuallyITCase
Jul 11 13:49:46 13:49:46.412 [ERROR] 
org.apache.flink.test.checkpointing.ResumeCheckpointManuallyITCase.testExternalizedIncrementalRocksDBCheckpointsWithLocalRecoveryZookeeper[RestoreMode
 = CLAIM] -- Time elapsed: 2.722 s <<< ERROR!
Jul 11 13:49:46 org.apache.flink.runtime.JobException: Recovery is suppressed 
by NoRestartBackoffTimeStrategy
Jul 11 13:49:46 at 
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:219)
Jul 11 13:49:46 at 
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.handleFailureAndReport(ExecutionFailureHandler.java:166)
Jul 11 13:49:46 at 
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:121)
Jul 11 13:49:46 at 
org.apache.flink.runtime.scheduler.DefaultScheduler.recordTaskFailure(DefaultScheduler.java:281)
Jul 11 13:49:46 at 
org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:272)
Jul 11 13:49:46 at 
org.apache.flink.runtime.scheduler.DefaultScheduler.onTaskFailed(DefaultScheduler.java:265)
Jul 11 13:49:46 at 
org.apache.flink.runtime.scheduler.SchedulerBase.onTaskExecutionStateUpdate(SchedulerBase.java:800)
Jul 11 13:49:46 at 
org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:777)
Jul 11 13:49:46 at 
org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:83)
Jul 11 13:49:46 at 
org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:515)
Jul 11 13:49:46 at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown 
Source)
Jul 11 13:49:46 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Jul 11 13:49:46 at java.lang.reflect.Method.invoke(Method.java:498)
Jul 11 13:49:46 at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRpcInvocation$1(PekkoRpcActor.java:318)
Jul 11 13:49:46 at 
org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
Jul 11 13:49:46 at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcInvocation(PekkoRpcActor.java:316)
Jul 11 13:49:46 at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:229)
Jul 11 13:49:46 at 
org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor.handleRpcMessage(FencedPekkoRpcActor.java:88)
Jul 11 13:49:46 at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:174)
Jul 11 13:49:46 at 
org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33)
Jul 11 13:49:46 at 
org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29)
Jul 11 13:49:46 at 
scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
Jul 11 13:49:46 at 
scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
Jul 11 13:49:46 at 
org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29)
Jul 11 13:49:46 at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
Jul 11 13:49:46 at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
Jul 11 13:49:46 at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
Jul 11 13:49:46 at 
org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547)
Jul 11 13:49:46 at 
org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545)
Jul 11 13:49:46 at 
org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229)
Jul 11 13:49:46 at 
org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)
Jul 11 13:49:46 at 
org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)
Jul 11 13:49:46 at 
org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280)

{noformat}




--
This message was sent by Atlassian Jira
(v8.20.1

[jira] [Created] (FLINK-35773) Document s5cmd

2024-07-05 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35773:
--

 Summary: Document s5cmd
 Key: FLINK-35773
 URL: https://issues.apache.org/jira/browse/FLINK-35773
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35772) Deprecate/remove DuplicatingFileSystem

2024-07-05 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35772:
--

 Summary: Deprecate/remove DuplicatingFileSystem
 Key: FLINK-35772
 URL: https://issues.apache.org/jira/browse/FLINK-35772
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35771) Limit s5cmd resource usage

2024-07-05 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35771:
--

 Summary: Limit s5cmd resource usage
 Key: FLINK-35771
 URL: https://issues.apache.org/jira/browse/FLINK-35771
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35770) Interrupt s5cmd call on cancellation

2024-07-05 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35770:
--

 Summary: Interrupt s5cmd call on cancellation 
 Key: FLINK-35770
 URL: https://issues.apache.org/jira/browse/FLINK-35770
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35768) Use native file copy in RocksDBStateDownloader

2024-07-05 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35768:
--

 Summary: Use native file copy in RocksDBStateDownloader
 Key: FLINK-35768
 URL: https://issues.apache.org/jira/browse/FLINK-35768
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35767) Provide native file copy support for S3 using s5cmd

2024-07-05 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35767:
--

 Summary: Provide native file copy support for S3 using s5cmd
 Key: FLINK-35767
 URL: https://issues.apache.org/jira/browse/FLINK-35767
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / FileSystem
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35739) FLIP-444: Native file copy support

2024-07-02 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35739:
--

 Summary: FLIP-444: Native file copy support
 Key: FLINK-35739
 URL: https://issues.apache.org/jira/browse/FLINK-35739
 Project: Flink
  Issue Type: New Feature
  Components: Connectors / FileSystem
Reporter: Piotr Nowojski


https://cwiki.apache.org/confluence/display/FLINK/FLIP-444%3A+Native+file+copy+support

State downloading in Flink can be a time and CPU consuming operation, which is 
especially visible if CPU resources per task slot are strictly restricted to 
for example a single CPU. Downloading 1GB of state size can take significant 
amount of time, while the code doing so is quite inefficient.

Currently when downloading state files, Flink is creating an FSDataInputStream 
from the remote file, and copies its bytes, to an OutputStream pointing to a 
local file (in the RocksDBStateDownloader#downloadDataForStateHandle method). 
FSDataInputStream internally is being wrapped by many layers of abstractions 
and indirections and what’s worse, every file is being copied individually, 
which leads to quite high overheads for small files. Download times and 
download process CPU efficiency can be significantly improved if we introduced 
an API to allow org.apache.flink.core.fs.FileSystem to copy many files natively 
and all at once.

For S3, there are at least two potential implementations. The first one is 
using AWS SDKv2 directly (Flink currently is using AWS SDKv1 wrapped by 
hadoop/presto) and Amazon S3 Transfer Manager. Second option is to use a 3rd 
party tool called s5cmd. It is claimed to be a faster alternative to the 
official AWS clients, which was confirmed by our benchmarks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35528) Skip execution of interruptible mails when yielding

2024-06-05 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35528:
--

 Summary: Skip execution of interruptible mails when yielding
 Key: FLINK-35528
 URL: https://issues.apache.org/jira/browse/FLINK-35528
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing, Runtime / Task
Affects Versions: 1.20.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski


When operators are yielding, for example waiting for async state access to 
complete before a checkpoint, it would be beneficial to not execute 
interruptible mails. Otherwise continuation mail for firing timers would be 
continuously re-enqeueed. To achieve that MailboxExecutor must be aware which 
mails are interruptible.

The easiest way to achieve this is to set MIN_PRIORITY for interruptible mails.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35518) CI Bot doesn't run on PRs

2024-06-04 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35518:
--

 Summary: CI Bot doesn't run on PRs
 Key: FLINK-35518
 URL: https://issues.apache.org/jira/browse/FLINK-35518
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI
Affects Versions: 1.20.0
Reporter: Piotr Nowojski


Doesn't want to run on my PR/branch. I was doing force-pushes, rebases, asking 
flink bot to run, closed and opened new PR, but nothing helped
https://github.com/apache/flink/pull/24868
https://github.com/apache/flink/pull/24883

I've heard others were having similar problems recently.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35420) WordCountMapredITCase fails to compile in IntelliJ

2024-05-22 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35420:
--

 Summary: WordCountMapredITCase fails to compile in IntelliJ
 Key: FLINK-35420
 URL: https://issues.apache.org/jira/browse/FLINK-35420
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hadoop Compatibility
Affects Versions: 1.20.0
Reporter: Piotr Nowojski


{noformat}
flink-connectors/flink-hadoop-compatibility/src/test/scala/org/apache/flink/api/hadoopcompatibility/scala/WordCountMapredITCase.scala:42:8
value isFalse is not a member of ?0
possible cause: maybe a semicolon is missing before `value isFalse'?
  .isFalse()
{noformat}

Might be caused by:
https://youtrack.jetbrains.com/issue/SCL-20679




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35065) Add numFiredTimers and numFiredTimersPerSecond metrics

2024-04-09 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35065:
--

 Summary: Add numFiredTimers and numFiredTimersPerSecond metrics
 Key: FLINK-35065
 URL: https://issues.apache.org/jira/browse/FLINK-35065
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Metrics, Runtime / Task
Affects Versions: 1.19.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.20.0


Currently there is now way of knowing how many timers are being fired by Flink, 
so it's impossible to distinguish, even using code profiling, if operator is 
firing only a couple of heavy timers per second using ~100% of the CPU time, vs 
firing thousands of timer per seconds.

We could add the following metrics to address this issue:
* numFiredTimers - total number of fired timers per operator
* numFiredTimersPerSecond - per second rate of firing timers per operator



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35051) Weird priorities when processing unaligned checkpoints

2024-04-08 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-35051:
--

 Summary: Weird priorities when processing unaligned checkpoints
 Key: FLINK-35051
 URL: https://issues.apache.org/jira/browse/FLINK-35051
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing, Runtime / Network, Runtime / Task
Affects Versions: 1.18.1, 1.19.0, 1.17.2
Reporter: Piotr Nowojski


While looking through the code I noticed that `StreamTask` is processing 
unaligned checkpoints in strange order/priority. The end result is that 
unaligned checkpoint `Start Delay` time can be increased, and triggering 
checkpoints in `StreamTask` can be unnecessary delayed by other mailbox actions 
in the system, like for example:
* processing time timers
* `AsyncWaitOperator` results
* ... 

Incoming UC barrier is treated as a priority event by the network stack (it 
will be polled from the input before anything else). This is what we want, but 
polling elements from network stack has lower priority then processing enqueued 
mailbox actions.

Secondly, if AC barrier timeout to UC, that's done via a mailbox action, but 
this mailbox action is also not prioritised in any way, so other mailbox 
actions could be unnecessarily executed first. 

On top of that there is a clash of two separate concepts here:
# Mailbox priority. yieldToDownstream - so in a sense reverse to what we would 
like to have for triggering checkpoint, but that only kicks in #yield() calls, 
where it's actually correct, that operator in a middle of execution can not 
yield to checkpoint - it should only yield to downstream.
# Control mails in mailbox executor - cancellation is done via that, it 
bypasses whole mailbox queue.
# Priority events in the network stack.

It's unfortunate that 1. vs 3. has a naming clash, as priority name is used in 
both things, and highest network priority event containing UC barrier, when 
executed via mailbox has actually the lowest mailbox priority.

Control mails mechanism is a kind of priority mails executed out of order, but 
doesn't generalise well for use in checkpointing.

This whole thing should be re-worked at some point. Ideally what we would like 
have is that:
* mail to convert AC barriers to UC
* polling UC barrier from the network input
* checkpoint trigger via RPC for source tasks
should be processed first, with an exception of yieldToDownstream, where 
current mailbox priorities should be adhered.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34913) ConcurrentModificationException SubTaskInitializationMetricsBuilder.addDurationMetric

2024-03-22 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-34913:
--

 Summary: ConcurrentModificationException 
SubTaskInitializationMetricsBuilder.addDurationMetric
 Key: FLINK-34913
 URL: https://issues.apache.org/jira/browse/FLINK-34913
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics, Runtime / State Backends
Affects Versions: 1.19.0
Reporter: Piotr Nowojski
 Fix For: 1.19.1


The following failures can occur during job's recovery when using clip & ingest

{noformat}
java.util.ConcurrentModificationException
at java.base/java.util.HashMap.compute(HashMap.java:1230)
at 
org.apache.flink.runtime.checkpoint.SubTaskInitializationMetricsBuilder.addDurationMetric(SubTaskInitializationMetricsBuilder.java:49)
at 
org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.runAndReportDuration(RocksDBIncrementalRestoreOperation.java:988)
at 
org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.lambda$createAsyncCompactionTask$2(RocksDBIncrementalRestoreOperation.java:291)
at 
java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}

{noformat}
java.util.ConcurrentModificationException
at java.base/java.util.HashMap.compute(HashMap.java:1230)
at 
org.apache.flink.runtime.checkpoint.SubTaskInitializationMetricsBuilder.addDurationMetric(SubTaskInitializationMetricsBuilder.java:49)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateAndGates(StreamTask.java:794)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$restoreInternal$3(StreamTask.java:744)
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:744)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:704)
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:953)
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:922)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:746)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34430) Akka frame size exceeded with many ByteStreamStateHandle being used

2024-02-12 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-34430:
--

 Summary: Akka frame size exceeded with many ByteStreamStateHandle 
being used
 Key: FLINK-34430
 URL: https://issues.apache.org/jira/browse/FLINK-34430
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.18.1, 1.17.2, 1.16.3, 1.19.0
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33775) Report JobInitialization traces

2023-12-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33775:
--

 Summary: Report JobInitialization traces
 Key: FLINK-33775
 URL: https://issues.apache.org/jira/browse/FLINK-33775
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33709) Report CheckpointStats as Spans

2023-11-30 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33709:
--

 Summary: Report CheckpointStats as Spans
 Key: FLINK-33709
 URL: https://issues.apache.org/jira/browse/FLINK-33709
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33708) Add Span and TraceReporter concepts

2023-11-30 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33708:
--

 Summary: Add Span and TraceReporter concepts
 Key: FLINK-33708
 URL: https://issues.apache.org/jira/browse/FLINK-33708
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33696) FLIP-385: Add OpenTelemetryTraceReporter and OpenTelemetryMetricReporter

2023-11-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33696:
--

 Summary: FLIP-385: Add OpenTelemetryTraceReporter and 
OpenTelemetryMetricReporter
 Key: FLINK-33696
 URL: https://issues.apache.org/jira/browse/FLINK-33696
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Metrics
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.19.0


h1. Motivation

[FLIP-384|https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces]
 is adding TraceReporter interface. However with 
[FLIP-384|https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces]
 alone, Log4jTraceReporter would be the only available implementation of 
TraceReporter interface, which is not very helpful.

In this FLIP I’m proposing to contribute both MetricExporter and TraceReporter 
implementation using OpenTelemetry.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33697) FLIP-386: Support adding custom metrics in Recovery Spans

2023-11-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33697:
--

 Summary: FLIP-386: Support adding custom metrics in Recovery Spans
 Key: FLINK-33697
 URL: https://issues.apache.org/jira/browse/FLINK-33697
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Metrics, Runtime / State Backends
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.19.0


h1. Motivation

FLIP-386 is building on top of 
[FLIP-384|https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces].
 The intention here is to add a capability for state backends to attach custom 
attributes during recovery to recovery spans. For example 
RocksDBIncrementalRestoreOperation could report both remote download time and 
time to actually clip/ingest the RocksDB instances after rescaling.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33695) FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

2023-11-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33695:
--

 Summary: FLIP-384: Introduce TraceReporter and use it to create 
checkpointing and recovery traces
 Key: FLINK-33695
 URL: https://issues.apache.org/jira/browse/FLINK-33695
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Checkpointing, Runtime / Metrics
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.19.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33338) Bump up RocksDB version to 7.x

2023-10-23 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-8:
--

 Summary: Bump up RocksDB version to 7.x
 Key: FLINK-8
 URL: https://issues.apache.org/jira/browse/FLINK-8
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: Piotr Nowojski


We need to bump RocksDB in order to be able to use new IngestDB and ClipDB 
commands.

If some of the required changes haven't been merged to Facebook/RocksDB, we 
should cherry-pick and include them in our FRocksDB fork.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33337) Expose IngestDB and ClipDB in the official RocksDB API

2023-10-23 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-7:
--

 Summary: Expose IngestDB and ClipDB in the official RocksDB API
 Key: FLINK-7
 URL: https://issues.apache.org/jira/browse/FLINK-7
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: Piotr Nowojski
Assignee: Yue Ma


Remaining open PR: https://github.com/facebook/rocksdb/pull/11646



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33071) Log checkpoint statistics

2023-09-11 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-33071:
--

 Summary: Log checkpoint statistics 
 Key: FLINK-33071
 URL: https://issues.apache.org/jira/browse/FLINK-33071
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Checkpointing, Runtime / Metrics
Affects Versions: 1.18.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski


This is a stop gap solution until we have a proper way of solving FLINK-23411.

The plan is to dump JSON serialised checkpoint statistics into Flink JM's log, 
with a {{DEBUG}} level. This could be used to analyse what has happened with a 
certain checkpoint in the past.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32303) Incorrect error message in KafkaSource

2023-06-09 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-32303:
--

 Summary: Incorrect error message in KafkaSource 
 Key: FLINK-32303
 URL: https://issues.apache.org/jira/browse/FLINK-32303
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.17.1, 1.18.0
Reporter: Piotr Nowojski


When exception is thrown from an operator chained with a KafkaSource, 
KafkaSource is returning a misleading error, like shown below:

{noformat}
java.io.IOException: Failed to deserialize consumer record due to
at 
org.apache.flink.connector.kafka.source.reader.KafkaRecordEmitter.emitRecord(KafkaRecordEmitter.java:56)
 ~[classes/:?]
at 
org.apache.flink.connector.kafka.source.reader.KafkaRecordEmitter.emitRecord(KafkaRecordEmitter.java:33)
 ~[classes/:?]
at 
org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:160)
 ~[classes/:?]
at 
org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:417)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:562)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:852)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:801) 
~[classes/:?]
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
 ~[classes/:?]
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931) 
[classes/:?]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745) 
[classes/:?]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) 
[classes/:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: 
org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: 
Could not forward element to next operator
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:92)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:309)
 ~[classes/:?]
at 
org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110)
 ~[classes/:?]
at 
org.apache.flink.connector.kafka.source.reader.KafkaRecordEmitter$SourceOutputWrapper.collect(KafkaRecordEmitter.java:67)
 ~[classes/:?]
at 
org.apache.flink.api.common.serialization.DeserializationSchema.deserialize(DeserializationSchema.java:84)
 ~[classes/:?]
at 
org.apache.flink.connector.kafka.source.reader.deserializer.KafkaValueOnlyDeserializationSchemaWrapper.deserialize(KafkaValueOnlyDeserializationSchemaWrapper.java:51)
 ~[classes/:?]
at 
org.apache.flink.connector.kafka.source.reader.KafkaRecordEmitter.emitRecord(KafkaRecordEmitter.java:53)
 ~[classes/:?]
... 14 more
Caused by: org.apache.flink.runtime.operators.testutils.ExpectedTestException: 
Failover!
at 
org.apache.flink.test.ManySmallJobsBenchmarkITCase$ThrottlingAndFailingIdentityMap.map(ManySmallJobsBenchmarkITCase.java:263)
 ~[test-classes/:?]
at 
org.apache.flink.test.ManySmallJobsBenchmarkITCase$ThrottlingAndFailingIdentityMap.map(ManySmallJobsBenchmarkITCase.java:243)
 ~[test-classes/:?]
at 
org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:309)
 ~[classes/:?]
at 
org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110)
 ~[classes/:?]
   

[jira] [Created] (FLINK-31045) In IntelliJ: flink-clients cannot find symbol symbol: class TestUserClassLoaderJobLib

2023-02-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-31045:
--

 Summary: In IntelliJ: flink-clients cannot find symbol symbol: 
class TestUserClassLoaderJobLib
 Key: FLINK-31045
 URL: https://issues.apache.org/jira/browse/FLINK-31045
 Project: Flink
  Issue Type: Bug
  Components: Build System, Client / Job Submission
Affects Versions: 1.18.0
Reporter: Piotr Nowojski


When trying to build/run some tests in the IDE, IntelliJ is reporting the 
following compilation failure:

{noformat}
/XXX/flink/flink-clients/src/test/java/org/apache/flink/client/testjar/TestUserClassLoaderJob.java:33:38
java: cannot find symbol
  symbol:   class TestUserClassLoaderJobLib
  location: class org.apache.flink.client.testjar.TestUserClassLoaderJob
{noformat}

A workaround seems to be to:
# right click on {{flink-clients}}
# Rebuild module (flink-clients)

The issue is probably related to the comment from the flink-clients/pom.xml 
file:
{noformat}

{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30791) Codespeed machine is not responding

2023-01-25 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-30791:
--

 Summary: Codespeed machine is not responding
 Key: FLINK-30791
 URL: https://issues.apache.org/jira/browse/FLINK-30791
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks
Affects Versions: 1.16.0, 1.17.0
Reporter: Piotr Nowojski


Neither speedcenter: [http://codespeed.dak8s.net:8000/]

nor jenkins: [http://codespeed.dak8s.net:8080|http://codespeed.dak8s.net:8080/]

are responding



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30155) Pretty print MutatedConfigurationException

2022-11-22 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-30155:
--

 Summary: Pretty print MutatedConfigurationException
 Key: FLINK-30155
 URL: https://issues.apache.org/jira/browse/FLINK-30155
 Project: Flink
  Issue Type: New Feature
  Components: API / DataStream
Affects Versions: 1.17.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.17.0


Currently MutatedConfigurationException is printed as:

{noformat}
org.apache.flink.client.program.MutatedConfigurationException: 
Configuration execution.sorted-inputs.enabled:true not allowed.
Configuration execution.runtime-mode was changed from STREAMING to BATCH.
Configuration execution.checkpointing.interval:500 ms not allowed in the 
configuration object CheckpointConfig.
Configuration execution.checkpointing.mode:EXACTLY_ONCE not allowed in the 
configuration object CheckpointConfig.
Configuration pipeline.max-parallelism:1024 not allowed in the 
configuration object ExecutionConfig.
Configuration parallelism.default:25 not allowed in the configuration 
object ExecutionConfig.
at 
org.apache.flink.client.program.StreamContextEnvironment.checkNotAllowedConfigurations(StreamContextEnvironment.java:235)
at 
org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:175)
at 
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:115)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2049)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2027)
{noformat}
Which is slightly confusing. First not allowed configuration is listed in the 
same line as the exception name, which (especially if wrapped) can make it more 
difficult than necessary for user to understand that this is a list of 
violations. I'm proposing to change it to:
{noformat}
org.apache.flink.client.program.MutatedConfigurationException: Not allowed 
configuration change(s) were detected:
 - Configuration execution.sorted-inputs.enabled:true not allowed.
 - Configuration execution.runtime-mode was changed from STREAMING to BATCH.
 - Configuration execution.checkpointing.interval:500 ms not allowed in the 
configuration object CheckpointConfig.
 - Configuration execution.checkpointing.mode:EXACTLY_ONCE not allowed in 
the configuration object CheckpointConfig.
 - Configuration pipeline.max-parallelism:1024 not allowed in the 
configuration object ExecutionConfig.
 - Configuration parallelism.default:25 not allowed in the configuration 
object ExecutionConfig.
at 
org.apache.flink.client.program.StreamContextEnvironment.checkNotAllowedConfigurations(StreamContextEnvironment.java:235)
at 
org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:175)
at 
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:115)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2049)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2027)
{noformat} 
To make it more clear.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30070) Create savepoints without side effects

2022-11-17 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-30070:
--

 Summary: Create savepoints without side effects
 Key: FLINK-30070
 URL: https://issues.apache.org/jira/browse/FLINK-30070
 Project: Flink
  Issue Type: New Feature
  Components: API / DataStream, Runtime / Checkpointing
Affects Versions: 1.14.6, 1.15.2, 1.16.0
Reporter: Piotr Nowojski


Side effects are any external state - a state that is stored not in Flink, but 
in an external system, like for example connectors transactions (KafkaSink, 
...).

We shouldn't be relaying on the external systems for storing part of the job's 
state, especially for any long period of time. The most prominent issue is that 
Kafka transactions can time out, leading to a data loss if transaction hasn't 
been committed.

Stop-with-savepoint, currently  guarantee that {{notifyCheckpointCompleted}} 
call will be issued, so properly implemented operators are guaranteed to 
committed it's state. However this information is currently not stored in the 
checkpoint in any way ( FLINK-16419 ). Larger issue is how to deal with 
savepoints, since there we currently do not have any guarantees that 
transactions have been committed. 

Some potential solution might be to expand API (like {{CheckpointedFunction}} 
), to let the operators/functions know, that they should 
close/commit/clear/deal with external state differently and use that API during 
stop-with-savepoint + rework how regular savepoints are handled. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30068) Allow users to configure what to do with errors while committing transactions during recovery in KafkaSink

2022-11-17 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-30068:
--

 Summary: Allow users to configure what to do with errors while 
committing transactions during recovery in KafkaSink
 Key: FLINK-30068
 URL: https://issues.apache.org/jira/browse/FLINK-30068
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / Kafka
Affects Versions: 1.15.2, 1.16.0, 1.17.0
Reporter: Piotr Nowojski
 Fix For: 1.17.0


Currently it looks like {{KafkaSink}} fails the job on any failures to commit 
transactions. As [reported by the 
user|https://lists.apache.org/thread/4f6bb8j6qtvgp888y4dxgj86x3kw2b11], this 
makes impossible for jobs to recover from older Savepoints.

{noformat}
2022-11-16 10:01:07.168 [flink-akka.actor.default-dispatcher-13] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph  - Balances aggreagation 
ETH -> (Filter -> Map -> Save to Kafka realtime ETH: Writer -> Save to Kafka 
realtime ETH: Committer, Filter -> Map -> Save to Kafka daily ETH: Writer -> 
Save to Kafka daily ETH: Committer) (4/5) (6d4d91ab8657bba830695b9a011f7db6) 
switched from INITIALIZING to RUNNING.
2022-11-16 10:01:37.222 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering 
checkpoint 65436 (type=CheckpointType{name='Checkpoint', 
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1668592897201 for job 
.
2022-11-16 10:01:39.082 [flink-akka.actor.default-dispatcher-13] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph  - Balances aggreagation 
ETH -> (Filter -> Map -> Save to Kafka realtime ETH: Writer -> Save to Kafka 
realtime ETH: Committer, Filter -> Map -> Save to Kafka daily ETH: Writer -> 
Save to Kafka daily ETH: Committer) (1/5) (cfaca46e7f4dc89629cdcaed5b48c059) 
switched from RUNNING to FAILED on 10.42.145.181:33297-efc328 @ 
eth-top-holders-v2-flink-taskmanager-0.eth-top-holders-v2-flink-taskmanager.flink.svc.cluster.local
 (dataPort=43125).
java.io.IOException: Could not perform checkpoint 65436 for operator Balances 
aggreagation ETH -> (Filter -> Map -> Save to Kafka realtime ETH: Writer -> 
Save to Kafka realtime ETH: Committer, Filter -> Map -> Save to Kafka daily 
ETH: Writer -> Save to Kafka daily ETH: Committer) (1/5)#0.
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1210)
at 
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:147)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint(SingleCheckpointBarrierHandler.java:287)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100(SingleCheckpointBarrierHandler.java:64)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint(SingleCheckpointBarrierHandler.java:493)
at 
org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint(AbstractAlignedBarrierHandlerState.java:74)
at 
org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived(AbstractAlignedBarrierHandlerState.java:66)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2(SingleCheckpointBarrierHandler.java:234)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState(SingleCheckpointBarrierHandler.java:262)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier(SingleCheckpointBarrierHandler.java:231)
at 
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:181)
at 
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:159)
at 
org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:110)
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:519)
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:203)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:804)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:753)
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948)
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927

[jira] [Created] (FLINK-29888) Improve MutatedConfigurationException for disallowed changes in CheckpointConfig and ExecutionConfig

2022-11-04 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-29888:
--

 Summary: Improve MutatedConfigurationException for disallowed 
changes in CheckpointConfig and ExecutionConfig
 Key: FLINK-29888
 URL: https://issues.apache.org/jira/browse/FLINK-29888
 Project: Flink
  Issue Type: Improvement
  Components: API / DataStream
Affects Versions: 1.17.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.17.0


Currently if {{CheckpointConfig}} or {{ExecutionConfig}} are modified in a 
non-allowed way, user gets a generic error "Configuration object 
ExecutionConfig changed", without a hint of what has been modified. With 
FLINK-29379 we can improve this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29886) networkThroughput 1000,100ms,OpenSSL Benchmark is failing

2022-11-04 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-29886:
--

 Summary: networkThroughput 1000,100ms,OpenSSL Benchmark is failing
 Key: FLINK-29886
 URL: https://issues.apache.org/jira/browse/FLINK-29886
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks, Runtime / Network
Affects Versions: 1.17.0
Reporter: Piotr Nowojski


http://codespeed.dak8s.net:8080/job/flink-master-benchmarks-java8/837/console

{noformat}
04:09:40  java.lang.NoClassDefFoundError: 
org/apache/flink/shaded/netty4/io/netty/internal/tcnative/CertificateCompressionAlgo
04:09:40at 
org.apache.flink.shaded.netty4.io.netty.handler.ssl.OpenSslX509KeyManagerFactory$OpenSslKeyManagerFactorySpi.engineInit(OpenSslX509KeyManagerFactory.java:129)
04:09:40at 
javax.net.ssl.KeyManagerFactory.init(KeyManagerFactory.java:256)
04:09:40at 
org.apache.flink.runtime.net.SSLUtils.getKeyManagerFactory(SSLUtils.java:279)
04:09:40at 
org.apache.flink.runtime.net.SSLUtils.createInternalNettySSLContext(SSLUtils.java:324)
04:09:40at 
org.apache.flink.runtime.net.SSLUtils.createInternalNettySSLContext(SSLUtils.java:303)
04:09:40at 
org.apache.flink.runtime.net.SSLUtils.createInternalClientSSLEngineFactory(SSLUtils.java:119)
04:09:40at 
org.apache.flink.runtime.io.network.netty.NettyConfig.createClientSSLEngineFactory(NettyConfig.java:147)
04:09:40at 
org.apache.flink.runtime.io.network.netty.NettyClient.init(NettyClient.java:115)
04:09:40at 
org.apache.flink.runtime.io.network.netty.NettyConnectionManager.start(NettyConnectionManager.java:87)
04:09:40at 
org.apache.flink.runtime.io.network.NettyShuffleEnvironment.start(NettyShuffleEnvironment.java:349)
04:09:40at 
org.apache.flink.streaming.runtime.io.benchmark.StreamNetworkBenchmarkEnvironment.setUp(StreamNetworkBenchmarkEnvironment.java:133)
04:09:40at 
org.apache.flink.streaming.runtime.io.benchmark.StreamNetworkThroughputBenchmark.setUp(StreamNetworkThroughputBenchmark.java:108)
04:09:40at 
org.apache.flink.benchmark.StreamNetworkThroughputBenchmarkExecutor$MultiEnvironment.setUp(StreamNetworkThroughputBenchmarkExecutor.java:117)
{noformat}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29807) Drop TypeSerializerConfigSnapshot and savepoint support from Flink versions < 1.8.0

2022-10-31 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-29807:
--

 Summary: Drop TypeSerializerConfigSnapshot and savepoint support 
from Flink versions < 1.8.0
 Key: FLINK-29807
 URL: https://issues.apache.org/jira/browse/FLINK-29807
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Affects Versions: 1.17
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.17


The motivation behind this move is two fold. One reason is that it complicates 
our code base unnecessarily and creates confusion on how to actually implement 
custom serializers. The immediate reason is that I wanted to clean up Flink's 
configuration stack a bit and refactor the ExecutionConfig class [2]. This 
refactor would keep the API compatibility of the ExecutionConfig, but it would 
break savepoint compatibility with snapshots written with some of the old 
serializers, which had ExecutionConfig as a field and were serialized in the 
snapshot. This issue has been resolved by the introduction of 
TypeSerializerSnapshot in Flink 1.7 [3], where serializers are no longer part 
of the snapshot.

TypeSerializerConfigSnapshot has been deprecated and no longer used by built-in 
serializers since Flink 1.8 [4] and [5]. Users were encouraged to migrate to 
TypeSerializerSnapshot since then with their own custom serializers. That has 
been plenty of time for the migration.

This proposal would have the following impact for the users:
1. we would drop support for recovery from savepoints taken with Flink < 1.7.0 
for all built in types serializers
2. we would drop support for recovery from savepoints taken with Flink < 1.8.0 
for built in kryo serializers
3. we would drop support for recovery from savepoints taken with Flink < 1.17 
for custom serializers using deprecated TypeSerializerConfigSnapshot

1. and 2. would have a simple migration path. Users migrating from those old 
savepoints would have to first start his job using a Flink version from the 
[1.8, 1.16] range, and take a new savepoint that would be compatible with Flink 
1.17.
3. This is a bit more problematic, because users would have to first migrate 
their own custom serializers to use TypeSerializerSnapshot (using a Flink 
version from the [1.8, 1.16]), take a savepoint, and only then migrate to Flink 
1.17. However users had already 4 years to migrate, which in my opinion has 
been plenty of time to do so.

*As discussed and vote is currently in progress*: 
https://lists.apache.org/thread/x5d0p08pf2wx47njogsgqct0k5rpfrl4



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29662) PojoSerializerSnapshot is using incorrect ExecutionConfig when restoring serializer

2022-10-17 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-29662:
--

 Summary: PojoSerializerSnapshot is using incorrect ExecutionConfig 
when restoring serializer
 Key: FLINK-29662
 URL: https://issues.apache.org/jira/browse/FLINK-29662
 Project: Flink
  Issue Type: Bug
  Components: API / Type Serialization System
Affects Versions: 1.17.0
Reporter: Piotr Nowojski


{{org.apache.flink.api.java.typeutils.runtime.PojoSerializerSnapshot#restoreSerializer}}
 is using freshly created {{new ExecutionConfig()}} execution config for the 
restored serializer, which overrides any configuration choices made by the 
user. Most likely this is a dormant bug, since restored serializer shouldn't be 
used for serializing any new data, and the {{ExecutionConfig}} is only used for 
subclasses serializations that haven't been cached. If this is indeed the case, 
I would propose to change it's value to {{null}} and safeguard accesses to that 
field with an {{IllegalStateException}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29645) BatchExecutionKeyedStateBackend is using incorrect ExecutionConfig when creating serializer

2022-10-14 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-29645:
--

 Summary: BatchExecutionKeyedStateBackend is using incorrect 
ExecutionConfig when creating serializer
 Key: FLINK-29645
 URL: https://issues.apache.org/jira/browse/FLINK-29645
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.14.6, 1.15.2, 1.13.6, 1.12.7, 1.16.0, 1.17.0
Reporter: Piotr Nowojski


{{org.apache.flink.streaming.api.operators.sorted.state.BatchExecutionKeyedStateBackend#getOrCreateKeyedState}}
 is using freshly constructed {{ExecutionConfig}}, instead of the one 
configured by the user from the environment.


{code:java}
public  S getOrCreateKeyedState(
TypeSerializer namespaceSerializer, StateDescriptor 
stateDescriptor)
throws Exception {
checkNotNull(namespaceSerializer, "Namespace serializer");
checkNotNull(
keySerializer,
"State key serializer has not been configured in the config. "
+ "This operation cannot use partitioned state.");

if (!stateDescriptor.isSerializerInitialized()) {
stateDescriptor.initializeSerializerUnlessSet(new 
ExecutionConfig());
}
{code}

The correct one could be obtained from {{env.getExecutionConfig()}} in 
{{org.apache.flink.streaming.api.operators.sorted.state.BatchExecutionStateBackend#createKeyedStateBackend}}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29584) Upgrade java 11 version on the microbenchmark worker

2022-10-11 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-29584:
--

 Summary: Upgrade java 11 version on the microbenchmark worker
 Key: FLINK-29584
 URL: https://issues.apache.org/jira/browse/FLINK-29584
 Project: Flink
  Issue Type: Improvement
  Components: Benchmarks
Affects Versions: 1.17.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.17.0


It looks like the older JDK 11 builds have problems with JIT in the 
microbenchmarks, as for example visible 
[here|http://codespeed.dak8s.net:8000/timeline/?ben=globalWindow&env=2]. 
Locally I was able to reproduce this problem and the issue goes away after 
upgrading to a newer JDK 11 build.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28835) Savepoint and checkpoint capabilities and limitations table is incorrect

2022-08-05 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-28835:
--

 Summary: Savepoint and checkpoint capabilities and limitations 
table is incorrect
 Key: FLINK-28835
 URL: https://issues.apache.org/jira/browse/FLINK-28835
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.15.1, 1.16.0
Reporter: Piotr Nowojski
 Fix For: 1.16.0, 1.15.2


https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints_vs_savepoints/

is inconsistent with 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-203%3A+Incremental+savepoints#FLIP203:Incrementalsavepoints-Proposal.
 "Non-arbitrary job upgrade" for unaligned checkpoints should be officially 
supported. 

It looks like a typo in the original PR FLINK-26134



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-27640) Flink not compiling, flink-connector-hive_2.12 is missing pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde

2022-05-16 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-27640:
--

 Summary: Flink not compiling, flink-connector-hive_2.12 is missing 
pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde 
 Key: FLINK-27640
 URL: https://issues.apache.org/jira/browse/FLINK-27640
 Project: Flink
  Issue Type: Bug
  Components: Build System, Connectors / Hive
Affects Versions: 1.16.0
Reporter: Piotr Nowojski


When clean installing whole project after cleaning local {{.m2}} directory I 
encountered the following error when compiling flink-connector-hive_2.12:
{noformat}
[ERROR] Failed to execute goal on project flink-connector-hive_2.12: Could not 
resolve dependencies for project 
org.apache.flink:flink-connector-hive_2.12:jar:1.16-SNAPSHOT: Failed to collect 
dependencies at org.apache.hive:hive-exec:jar:2.3.9 -> 
org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Failed to read 
artifact descriptor for 
org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Could not transfer 
artifact org.pentaho:pentaho-aggdesigner-algorithm:pom:5.1.5-jhyde from/to 
maven-default-http-blocker (http://0.0.0.0/): Blocked mirror for repositories: 
[conjars (http://conjars.org/repo, default, releases+snapshots), 
apache.snapshots (http://repository.apache.org/snapshots, default, snapshots)] 
-> [Help 1]
{noformat}
I've solved this by adding 
{noformat}

spring-repo-plugins
https://repo.spring.io/ui/native/plugins-release/

{noformat}
to ~/.m2/settings.xml file. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27209) Half the Xmx and double the forkCount for unit tests

2022-04-12 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-27209:
--

 Summary: Half the Xmx and double the forkCount for unit tests
 Key: FLINK-27209
 URL: https://issues.apache.org/jira/browse/FLINK-27209
 Project: Flink
  Issue Type: Improvement
  Components: Build System
Affects Versions: 1.16.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.16.0


As per recent "Speeding up the test builds" discussion on the dev mailing.

I'm proposing to half the memory allocated for running unit tests (as defined 
per   {{**/*Test.*}} 
property) but at the same time double the forkCounts for those tests. The 
premise is that they shouldn't need as much memory as ITCases to be stable, 
while increasing number of forks, should provide us with a couple of minutes 
improved build times.

CC [~chesnay]




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27202) NullPointerException on stop-with-savepoint with AsyncWaitOperator followed by FlinkKafkaProducer

2022-04-12 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-27202:
--

 Summary: NullPointerException on stop-with-savepoint with 
AsyncWaitOperator followed by FlinkKafkaProducer 
 Key: FLINK-27202
 URL: https://issues.apache.org/jira/browse/FLINK-27202
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka, Runtime / Task
Affects Versions: 1.13.6, 1.12.7
Reporter: Piotr Nowojski


Some lingering mails from {{AsyncWaitOperator}} (or other operators using 
mailbox, or maybe even processing time timers), that are chained with 
{{FlinkKafkaProducer}} can cause the following exceptions when using 
stop-with-savepoint:

{noformat}
2022-04-11 15:46:19,781 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - static 
enrichment -> Map -> Sink: enriched events sink (179/256) 
(3fefa588ad05fa8d2a10a6ad4a740cc6) switched from RUNNING to FAILED on 
10.239.104.67:38149-12df6c @ 10.239.104.67 (dataPort=35745).
java.lang.NullPointerException: null
at 
org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction$TransactionHolder.access$000(TwoPhaseCommitSinkFunction.java:591)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:223)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:54)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:50)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:28)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:38)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:71)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:46)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:26)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:50)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:28)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:50)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.queue.StreamRecordQueueEntry.emitResult(StreamRecordQueueEntry.java:64)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.queue.UnorderedStreamElementQueue$Segment.emitCompleted(UnorderedStreamElementQueue.java:272)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.queue.UnorderedStreamElementQueue.emitCompletedElement(UnorderedStreamElementQueue.java:159)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:287)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:356)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:337)
 ~[flink-dist_2.12-1.12.7-stream2.jar:1.12.7-stream2]
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(St

[jira] [Created] (FLINK-27133) Performance regression in serializerHeavyString

2022-04-08 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-27133:
--

 Summary: Performance regression in serializerHeavyString
 Key: FLINK-27133
 URL: https://issues.apache.org/jira/browse/FLINK-27133
 Project: Flink
  Issue Type: Bug
  Components: API / Type Serialization System, Benchmarks
Affects Versions: 1.15.0, 1.16.0
Reporter: Piotr Nowojski


http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=200

Suspected range:
5f21d15a09..caa296b813

I've run a benchmark request before FLINK-26957 and it suggests that it is 
indeed the cause for this regression:
http://codespeed.dak8s.net:8080/job/flink-benchmark-request/77/artifact/jmh-result.csv/*view*/

CC [~mapohl]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27117) FLIP-217: Support watermark alignment of source splits

2022-04-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-27117:
--

 Summary: FLIP-217: Support watermark alignment of source splits
 Key: FLINK-27117
 URL: https://issues.apache.org/jira/browse/FLINK-27117
 Project: Flink
  Issue Type: New Feature
  Components: API / DataStream, Runtime / Task
Affects Versions: 1.14.4, 1.13.6, 1.15.0
Reporter: Piotr Nowojski


https://cwiki.apache.org/confluence/display/FLINK/FLIP-217+Support+watermark+alignment+of+source+splits



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26864) Performance regression on 25.03.2022

2022-03-25 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-26864:
--

 Summary: Performance regression on 25.03.2022
 Key: FLINK-26864
 URL: https://issues.apache.org/jira/browse/FLINK-26864
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks
Affects Versions: 1.16.0
Reporter: Piotr Nowojski


http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=arrayKeyBy&extr=on&quarts=on&equid=off&env=2&revs=200
http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=remoteFilePartition&extr=on&quarts=on&equid=off&env=2&revs=200
http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=remoteSortPartition&extr=on&quarts=on&equid=off&env=2&revs=200
http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=tupleKeyBy&extr=on&quarts=on&equid=off&env=2&revs=200



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26682) Migrate regression check script to python and to main repository

2022-03-16 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-26682:
--

 Summary: Migrate regression check script to python and to main 
repository
 Key: FLINK-26682
 URL: https://issues.apache.org/jira/browse/FLINK-26682
 Project: Flink
  Issue Type: New Feature
  Components: Benchmarks
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


Script provided by Roman 
https://github.com/rkhachatryan/flink-benchmarks/blob/regression-alerts/regression-alert.sh
 should be merge to the main repo and migrated to python.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26147) Deleting incremental savepoints directory in CLAIM mode

2022-02-15 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-26147:
--

 Summary: Deleting incremental savepoints directory in CLAIM mode
 Key: FLINK-26147
 URL: https://issues.apache.org/jira/browse/FLINK-26147
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing
Affects Versions: 1.15.0
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26146) Add test coverage for native format flink upgrades (minor versions)

2022-02-15 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-26146:
--

 Summary: Add test coverage for native format flink upgrades (minor 
versions)
 Key: FLINK-26146
 URL: https://issues.apache.org/jira/browse/FLINK-26146
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing, Tests
Reporter: Piotr Nowojski


Check test coverage for:

Flink minor (1.x → 1.y) version upgrade

Parametrised SavepointMigrationTestBase to work on: 

canonical savepoints

native savepoints 

aligned (retained) checkpoints

Assignee should contact release managers and decided whether to already create 
1.15 (snapshot) artefacts for native savepoints and aligned checkpoints. If we 
do so, we might make release manager work  more complicated. However if we 
don’t, the test will be a dead code until release manager creates those 
artefacts, which can also make the release manager work more difficult if test 
explodes during RC creation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26064) KinesisFirehoseSinkITCase IllegalStateException: Trying to access closed classloader

2022-02-09 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-26064:
--

 Summary: KinesisFirehoseSinkITCase IllegalStateException: Trying 
to access closed classloader
 Key: FLINK-26064
 URL: https://issues.apache.org/jira/browse/FLINK-26064
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kinesis
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=31044&view=logs&j=d44f43ce-542c-597d-bf94-b0718c71e5e8&t=ed165f3f-d0f6-524b-5279-86f8ee7d0e2d

(shortened stack trace, as full is too large)
{noformat}
Feb 09 20:05:04 java.util.concurrent.ExecutionException: 
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute 
HTTP request: Trying to access closed classloader. Please check if you store 
classloaders directly or indirectly in static fields. If the stacktrace 
suggests that the leak occurs in a third party library and cannot be fixed 
immediately, you can disable this check with the configuration 
'classloader.check-leaked-classloader'.
Feb 09 20:05:04 at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
Feb 09 20:05:04 at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
(...)
Feb 09 20:05:04 Caused by: 
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute 
HTTP request: Trying to access closed classloader. Please check if you store 
classloaders directly or indirectly in static fields. If the stacktrace 
suggests that the leak occurs in a third party library and cannot be fixed 
immediately, you can disable this check with the configuration 
'classloader.check-leaked-classloader'.
Feb 09 20:05:04 at 
software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:43)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:204)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:200)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:179)
Feb 09 20:05:04 at 
software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:159)
(...)
Feb 09 20:05:04 Caused by: java.lang.IllegalStateException: Trying to access 
closed classloader. Please check if you store classloaders directly or 
indirectly in static fields. If the stacktrace suggests that the leak occurs in 
a third party library and cannot be fixed immediately, you can disable this 
check with the configuration 'classloader.check-leaked-classloader'.
Feb 09 20:05:04 at 
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.ensureInner(FlinkUserCodeClassLoaders.java:164)
Feb 09 20:05:04 at 
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.getResources(FlinkUserCodeClassLoaders.java:188)
Feb 09 20:05:04 at 
java.util.ServiceLoader$LazyIterator.hasNextService(ServiceLoader.java:348)
Feb 09 20:05:04 at 
java.util.ServiceLoader$LazyIterator.hasNext(ServiceLoader.java:393)
Feb 09 20:05:04 at 
java.util.ServiceLoader$1.hasNext(ServiceLoader.java:474)
Feb 09 20:05:04 at 
javax.xml.stream.FactoryFinder$1.run(FactoryFinder.java:352)
Feb 09 20:05:04 at java.security.AccessController.doPrivileged(Native 
Method)
Feb 09 20:05:04 at 
javax.xml.stream.FactoryFinder.findServiceProvider(FactoryFinder.java:341)
Feb 09 20:05:04 at 
javax.xml.stream.FactoryFinder.find(FactoryFinder.java:313)
Feb 09 20:05:04 at 
javax.xml.stream.FactoryFinder.find(FactoryFinder.java:227)
Feb 09 20:05:04 at 
javax.xml.stream.XMLInputFactory.newInstance(XMLInputFactory.java:154)
Feb 09 20:05:04 at 
software.amazon.awssdk.protocols.query.unmarshall.XmlDomParser.createXmlInputFactory(XmlDomParser.java:124)
Feb 09 20:05:04 at 
java.lang.ThreadLocal$SuppliedThreadLocal.initialValue(ThreadLocal.java:284)
Feb 09 20:05:04 at 
java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:180)
Feb 09 20:05:04 at java.lang.ThreadLocal.get(ThreadLocal.java:170)
(...)
{noformat}




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25982) Support idleness with watermark alignment

2022-02-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25982:
--

 Summary: Support idleness with watermark alignment
 Key: FLINK-25982
 URL: https://issues.apache.org/jira/browse/FLINK-25982
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25914) Performance regression in checkpointSingleInput.UNALIGNED on 31.01.2022

2022-02-01 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25914:
--

 Summary: Performance regression in checkpointSingleInput.UNALIGNED 
on 31.01.2022
 Key: FLINK-25914
 URL: https://issues.apache.org/jira/browse/FLINK-25914
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks, Runtime / Checkpointing
Affects Versions: 1.15.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski


http://codespeed.dak8s.net:8000/timeline/?ben=checkpointSingleInput.UNALIGNED&env=2






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25912) HashMapStateBackend performance regression on 31.01.2022

2022-02-01 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25912:
--

 Summary: HashMapStateBackend performance regression on 31.01.2022
 Key: FLINK-25912
 URL: https://issues.apache.org/jira/browse/FLINK-25912
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks, Runtime / State Backends
Reporter: Piotr Nowojski
 Fix For: 1.15.0


http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3&ben=mapEntries.HEAP&env=2&revs=200&equid=off&quarts=on&extr=on
http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3&ben=mapIterator.HEAP&env=2&revs=200&equid=off&quarts=on&extr=on
http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3&ben=mapKeys.HEAP&env=2&revs=200&equid=off&quarts=on&extr=on
http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3&ben=mapValues.HEAP&env=2&revs=200&equid=off&quarts=on&extr=on



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25896) Buffer debloating issues with high parallelism

2022-01-31 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25896:
--

 Summary: Buffer debloating issues with high parallelism
 Key: FLINK-25896
 URL: https://issues.apache.org/jira/browse/FLINK-25896
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Network
Affects Versions: 1.14.3, 1.15.0
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25869) Create a stress test cron build

2022-01-28 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25869:
--

 Summary: Create a stress test cron build
 Key: FLINK-25869
 URL: https://issues.apache.org/jira/browse/FLINK-25869
 Project: Flink
  Issue Type: Improvement
  Components: Test Infrastructure, Tests
Affects Versions: 1.15.0
Reporter: Piotr Nowojski
 Fix For: 1.15.0


We propose creating some kind of stress test cron job dedicated for 
periodically running longer lasting stress tests. For example trying to expose 
memory leaks.

First idea to test could be to provide test coverage for FLINK-25728. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25745) Support RocksDB incremental native savepoints

2022-01-21 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25745:
--

 Summary: Support RocksDB incremental native savepoints
 Key: FLINK-25745
 URL: https://issues.apache.org/jira/browse/FLINK-25745
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: Piotr Nowojski
 Fix For: 1.15.0


Respect CheckpointType.SharingFilesStrategy#NO_SHARING flag in 
RocksIncrementalSnapshotStrategy. We also need to make sure that 
RocksDBIncrementalSnapshotStrategy is creating self contained/relocatable 
snapshots (using CheckpointedStateScope#EXCLUSIVE for native savepoints)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25744) Support native savepoints (w/o modifying the statebackend specific snapshot strategies)

2022-01-21 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25744:
--

 Summary: Support native savepoints (w/o modifying the statebackend 
specific snapshot strategies)
 Key: FLINK-25744
 URL: https://issues.apache.org/jira/browse/FLINK-25744
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing
Reporter: Piotr Nowojski
Assignee: Dawid Wysakowicz


For example w/o incremental RocksDB support. But HashMap and Full RocksDB 
should be working out of the box w/o extra changes.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25704) Performance regression on 18.01.2022 in batch network benchmarks

2022-01-19 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25704:
--

 Summary: Performance regression on 18.01.2022 in batch network 
benchmarks
 Key: FLINK-25704
 URL: https://issues.apache.org/jira/browse/FLINK-25704
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3&ben=compressedFilePartition&env=2&revs=200&equid=off&quarts=on&extr=on
http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3&ben=uncompressedFilePartition&env=2&revs=200&equid=off&quarts=on&extr=on
http://codespeed.dak8s.net:8000/timeline/#/?exe=1,3&ben=uncompressedMmapPartition&env=2&revs=200&equid=off&quarts=on&extr=on

Suspected range:
{code}
git ls eeec246677..f5c99c6f26
f5c99c6f26 [5 weeks ago] [FLINK-17321][table] Add support casting of map to map 
and multiset to multiset [Sergey Nuyanzin]
745cfec705 [24 hours ago] [hotfix][table-common] Fix InternalDataUtils for 
MapData tests [Timo Walther]
ed699b6ee6 [6 days ago] [FLINK-25637][network] Make sort-shuffle the default 
shuffle implementation for batch jobs [kevin.cyj]
4275525fed [6 days ago] [FLINK-25638][network] Increase the default write 
buffer size of sort-shuffle to 16M [kevin.cyj]
e1878fb899 [6 days ago] [FLINK-25639][network] Increase the default read buffer 
size of sort-shuffle to 64M [kevin.cyj]
{code}

It looks [~kevin.cyj], that most likely your change has caused that?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25688) Resolve performance degradation with high parallelism when using buffer debloating

2022-01-18 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25688:
--

 Summary: Resolve performance degradation with high parallelism 
when using buffer debloating
 Key: FLINK-25688
 URL: https://issues.apache.org/jira/browse/FLINK-25688
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Affects Versions: 1.14.3, 1.15.0
Reporter: Piotr Nowojski


As documented in FLINK-25646, currently buffer debloating in Flink, at least in 
the default configuration, has quite noticeable performance degradation at 
larger scale. For example throughput can drop by a factor of 4, or even 
checkpointing times can be increased. Currently it's not clear why is this 
happening. It looks like increasing the number of buffers per channel from the 
default ~2 to above 3 (for example via bumping number of floating buffers to 
value equal or higher then parallelism), seems to be solving this problem, at 
least on one cluster where buffer debloating has been tested at large scale.

Further investigation is required.

CC [~akalashnikov]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25414) Provide metrics to measure how long task has been blocked

2021-12-22 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25414:
--

 Summary: Provide metrics to measure how long task has been blocked
 Key: FLINK-25414
 URL: https://issues.apache.org/jira/browse/FLINK-25414
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Metrics, Runtime / Task
Affects Versions: 1.14.2
Reporter: Piotr Nowojski


Currently back pressured/busy metrics tell the user whether task is 
blocked/busy and how much % of the time it is blocked/busy. But they do not 
tell how for how long single block event is lasting. It can be 1ms or 1h and 
back pressure/busy would be still reporting 100%.

In order to improve this, we could provide two new metrics:
# maxSoftBackPressureDuration
# maxHardBackPressureDuration

The max would be reset to 0 periodically or on every access to the metric (via 
metric reporter). Soft back pressure would be if task is back pressured in a 
non blocking fashion (StreamTask detected in availability of the output). Hard 
back pressure would measure the time task is actually blocked.

Unfortunately I don't know how to efficiently provide similar metric for busy 
time, without impacting max throughput.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25407) Network stack deadlock when cancellation happens during initialisation

2021-12-21 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25407:
--

 Summary: Network stack deadlock when cancellation happens during 
initialisation
 Key: FLINK-25407
 URL: https://issues.apache.org/jira/browse/FLINK-25407
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.14.0, 1.15.0
Reporter: Piotr Nowojski


{noformat}
Java stack information for the threads listed above:
===
"Canceler for Source: Custom Source -> Filter (7/12)#14176 
(0fbb8a89616ca7a40e473adad51f236f).":
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:420)
   - waiting to lock <0x82937f28> (a java.lang.Object)
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276)
   at 
org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999)
   at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138)
   at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669)
   at java.lang.Thread.run(Thread.java:748)
"Canceler for Map -> Map (6/12)#14176 (6195862d199aa4d52c12f25b39904725).":
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:585)
   - waiting to lock <0x97108898> (a java.util.ArrayDeque)
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:544)
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:424)
   - locked <0x82937f28> (a java.lang.Object)
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:567)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.closeBufferPool(ResultPartition.java:264)
   at 
org.apache.flink.runtime.io.network.partition.ResultPartition.fail(ResultPartition.java:276)
   at 
org.apache.flink.runtime.taskmanager.Task.failAllResultPartitions(Task.java:999)
   at org.apache.flink.runtime.taskmanager.Task.access$100(Task.java:138)
   at org.apache.flink.runtime.taskmanager.Task$TaskCanceler.run(Task.java:1669)
   at java.lang.Thread.run(Thread.java:748)
"Map -> Sink: Unnamed (7/12)#14176":
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.recycleMemorySegments(NetworkBufferPool.java:256)
   - waiting to lock <0x82937f28> (a java.lang.Object)
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.ja
   at 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegmentsBlocking(NetworkBufferPool.ja
   at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
   - locked <0x97108898> (a java.util.ArrayDeque)
   at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:497)
   at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:276)
   at 
org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:105)
   at 
org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:965)
   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:652)
   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
   at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.
{noformat}
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28297&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=19003
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28306&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=19832



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25382) Failure in "Upload Logs" task

2021-12-20 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25382:
--

 Summary: Failure in "Upload Logs" task
 Key: FLINK-25382
 URL: https://issues.apache.org/jira/browse/FLINK-25382
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


I don't see any error message, but it seems like uploading the logs has failed:

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=27568&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=bb16d35c-fdfe-5139-f244-9492cbd2050b

for the following build:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=27568&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=2c7d57b9-7341-5a87-c9af-2cf7cc1a37dc



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25276) Support native and incremental savepoints

2021-12-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25276:
--

 Summary: Support native and incremental savepoints
 Key: FLINK-25276
 URL: https://issues.apache.org/jira/browse/FLINK-25276
 Project: Flink
  Issue Type: New Feature
Reporter: Piotr Nowojski


Motivation. Currently with non incremental canonical format savepoints, with 
very large state, both taking and recovery from savepoints can take very long 
time. Providing options to take native format and incremental savepoint would 
alleviate this problem.

In the past the main challenge lied in the ownership semantic and files clean 
up of such incremental savepoints. However with FLINK-25154 implemented some of 
those concerns can be solved. Incremental savepoint could leverage "force full 
snapshot" mode provided by FLINK-25192, to duplicate/copy all of the savepoint 
files out of the Flink's ownership scope.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25275) Weighted KeyGroup assignment

2021-12-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25275:
--

 Summary: Weighted KeyGroup assignment
 Key: FLINK-25275
 URL: https://issues.apache.org/jira/browse/FLINK-25275
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Network
Affects Versions: 1.14.0
Reporter: Piotr Nowojski


Currently key groups are split into key group ranges naively in the simplest 
way. Key groups are split into equally sized continuous ranges (number of 
ranges = parallelism = number of keygroups / size of single keygroup). Flink 
could avoid data skew between key groups, by assigning them to tasks based on 
their "weight". "Weight" could be defined as frequency of an access for the 
given key group. 

Arbitrary, non-continuous, key group assignment (for example TM1 is processing 
kg1 and kg3 while TM2 is processing only kg2) would require extensive changes 
to the state backends for example. However the data skew could be mitigated to 
some extent by creating key group ranges in a more clever way, while keeping 
the key group range continuity. For example TM1 processes range [kg1, kg9], 
while TM2 just [kg10, kg11].

[This branch shows a PoC of such 
approach.|https://github.com/pnowojski/flink/commits/antiskew]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25255) Consider/design implementing State Processor API (FC)

2021-12-10 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-25255:
--

 Summary: Consider/design implementing State Processor API (FC)
 Key: FLINK-25255
 URL: https://issues.apache.org/jira/browse/FLINK-25255
 Project: Flink
  Issue Type: Sub-task
Reporter: Piotr Nowojski






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-24919) UnalignedCheckpointITCase hangs on Azure

2021-11-16 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24919:
--

 Summary: UnalignedCheckpointITCase hangs on Azure
 Key: FLINK-24919
 URL: https://issues.apache.org/jira/browse/FLINK-24919
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


Extracted from FLINK-23466

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26304&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=13067

Nov 10 16:13:03 Starting 
org.apache.flink.test.checkpointing.UnalignedCheckpointITCase#execute[pipeline 
with mixed channels, p = 20, timeout = 0, buffersPerChannel = 1].

>From the log, we can see this case hangs. I guess this seems a new issue which 
>is different from the one reported in this ticket. From the stack, it seems 
>there is something wrong with the checkpoint coordinator, the following thread 
>locked 0x87db4fb8:
{code:java}
2021-11-10T17:14:21.0899474Z Nov 10 17:14:21 "jobmanager-io-thread-2" #12984 
daemon prio=5 os_prio=0 tid=0x7f12e000b800 nid=0x3fb6 runnable 
[0x7f0fcd6d4000]
2021-11-10T17:14:21.0899924Z Nov 10 17:14:21java.lang.Thread.State: RUNNABLE
2021-11-10T17:14:21.0900300Z Nov 10 17:14:21at 
java.util.HashMap$TreeNode.balanceDeletion(HashMap.java:2338)
2021-11-10T17:14:21.0900745Z Nov 10 17:14:21at 
java.util.HashMap$TreeNode.removeTreeNode(HashMap.java:2112)
2021-11-10T17:14:21.0901146Z Nov 10 17:14:21at 
java.util.HashMap.removeNode(HashMap.java:840)
2021-11-10T17:14:21.0901577Z Nov 10 17:14:21at 
java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:301)
2021-11-10T17:14:21.0902002Z Nov 10 17:14:21at 
java.util.HashMap.putVal(HashMap.java:664)
2021-11-10T17:14:21.0902531Z Nov 10 17:14:21at 
java.util.HashMap.putMapEntries(HashMap.java:515)
2021-11-10T17:14:21.0902931Z Nov 10 17:14:21at 
java.util.HashMap.putAll(HashMap.java:785)
2021-11-10T17:14:21.0903429Z Nov 10 17:14:21at 
org.apache.flink.runtime.checkpoint.ExecutionAttemptMappingProvider.getVertex(ExecutionAttemptMappingProvider.java:60)
2021-11-10T17:14:21.0904060Z Nov 10 17:14:21at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.reportStats(CheckpointCoordinator.java:1867)
2021-11-10T17:14:21.0904686Z Nov 10 17:14:21at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1152)
2021-11-10T17:14:21.0905372Z Nov 10 17:14:21- locked <0x87db4fb8> 
(a java.lang.Object)
2021-11-10T17:14:21.0905895Z Nov 10 17:14:21at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
2021-11-10T17:14:21.0906493Z Nov 10 17:14:21at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1368/705813936.accept(Unknown
 Source)
2021-11-10T17:14:21.0907086Z Nov 10 17:14:21at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
2021-11-10T17:14:21.0907698Z Nov 10 17:14:21at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1369/1447418658.run(Unknown
 Source)
2021-11-10T17:14:21.0908210Z Nov 10 17:14:21at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2021-11-10T17:14:21.0908735Z Nov 10 17:14:21at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2021-11-10T17:14:21.0909333Z Nov 10 17:14:21at 
java.lang.Thread.run(Thread.java:748) {code}
But other thread is waiting for the lock. I am not familiar with these logics 
and not sure if this is in the right state. Could anyone who is familiar with 
these logics take a look?

 

BTW, concurrent access of HashMap may cause infinite loop,I see in the stack 
that there are multiple threads are accessing HashMap, though I am not sure if 
they are the same instance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-24846) AsyncWaitOperator fails during stop-with-savepoint

2021-11-09 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24846:
--

 Summary: AsyncWaitOperator fails during stop-with-savepoint
 Key: FLINK-24846
 URL: https://issues.apache.org/jira/browse/FLINK-24846
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Reporter: Piotr Nowojski
 Attachments: log-jm.txt

{noformat}
Caused by: 
org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailbox$MailboxClosedException:
 Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
operations.
at 
org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.checkPutStateConditions(TaskMailboxImpl.java:269)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.put(TaskMailboxImpl.java:197)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.execute(MailboxExecutorImpl.java:74)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.api.common.operators.MailboxExecutor.execute(MailboxExecutor.java:103)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.outputCompletedElement(AsyncWaitOperator.java:304)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.access$100(AsyncWaitOperator.java:78)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.processResults(AsyncWaitOperator.java:370)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.lambda$processInMailbox$0(AsyncWaitOperator.java:351)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.drain(MailboxProcessor.java:177)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:854)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:767) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
 ~[flink-dist_2.11-1.14.0.jar:1.14.0]
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) 
~[flink-dist_2.11-1.14.0.jar:1.14.0]
at java.lang.Thread.run(Thread.java:829) ~[?:?]

{noformat}

As reported by a user on [the mailing 
list:|https://mail-archives.apache.org/mod_mbox/flink-user/202111.mbox/%3CCAO6dnLwtLNxkr9qXG202ysrnse18Wgvph4hqHZe3ar8cuXAfDw%40mail.gmail.com%3E]
{quote}
I failed to stop a job with savepoint with the following message:
Inconsistent execution state after stopping with savepoint. At least one 
execution is still in one of the following states: FAILED, CANCELED. A global 
fail-over is triggered to recover the job 452594f3ec5797f399e07f95c884a44b.

The job manager said
 A savepoint was created at 
hdfs://mobdata-flink-hdfs/driving-habits/svpts/savepoint-452594-f60305755d0e 
but the corresponding job 452594f3ec5797f399e07f95c884a44b didn't terminate 
successfully.
while complaining about
Mailbox is in state QUIESCED, but is required to be in state OPEN for put 
operations.

Is it okay to ignore this kind of error?

Please see the attached files for the detailed context.

FYI, 
- I used the latest 1.14.0
- I started the job with "$FLINK_HOME"/bin/flink run --target yarn-per-job
- I couldn't reproduce the exception using the same jar so I might not able to 
provide DUBUG messages
{quote}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-24696) Translate how to configure unaligned checkpoints into Chinese

2021-10-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24696:
--

 Summary: Translate how to configure unaligned checkpoints into 
Chinese
 Key: FLINK-24696
 URL: https://issues.apache.org/jira/browse/FLINK-24696
 Project: Flink
  Issue Type: Improvement
  Components: chinese-translation, Documentation
Affects Versions: 1.15.0, 1.14.1
Reporter: Piotr Nowojski
 Fix For: 1.15.0


Page {{content.zh/docs/ops/state/checkpointing_under_backpressure.md}} needs to 
be translated.

Reference english version ticket: FLINK-24670



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24695) Update how to configure unaligned checkpoints in the documentation

2021-10-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24695:
--

 Summary: Update how to configure unaligned checkpoints in the 
documentation
 Key: FLINK-24695
 URL: https://issues.apache.org/jira/browse/FLINK-24695
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.15.0, 1.14.1
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski


It looks like we don't have a code example how to enabled unaligned checkpoints 
anywhere in the docs. That should be fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24694) Translate "Checkpointing under backpressure" page into Chinese

2021-10-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24694:
--

 Summary: Translate "Checkpointing under backpressure" page into 
Chinese
 Key: FLINK-24694
 URL: https://issues.apache.org/jira/browse/FLINK-24694
 Project: Flink
  Issue Type: Improvement
  Components: chinese-translation, Documentation
Affects Versions: 1.15.0, 1.14.1
Reporter: Piotr Nowojski


Page {{content.zh/docs/ops/state/checkpoints.md}} is very much out of date and 
out of sync of it's english counterpart.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24693) Update Chinese version of "Checkpoints" page

2021-10-29 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24693:
--

 Summary: Update Chinese version of "Checkpoints" page
 Key: FLINK-24693
 URL: https://issues.apache.org/jira/browse/FLINK-24693
 Project: Flink
  Issue Type: Improvement
  Components: chinese-translation, Documentation
Affects Versions: 1.15.0, 1.14.1
Reporter: Piotr Nowojski


Page {{content.zh/docs/ops/state/checkpoints.md}} is very much out of date and 
out of sync of it's english counterpart.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24670) Restructure unaligned checkpoints documentation page to "Checkpointing under back pressure"

2021-10-27 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24670:
--

 Summary: Restructure unaligned checkpoints documentation page to 
"Checkpointing under back pressure"
 Key: FLINK-24670
 URL: https://issues.apache.org/jira/browse/FLINK-24670
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.14.1
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.15.0, 1.14.1


https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/unaligned_checkpoints/

* Should be renamed and restructured to "Checkpointing under back pressure"
* Should have a short section that cross references network mem tuning guide
* It looks like we don't have a code example how to enabled unaligned 
checkpoints anywhere in the docs? That should be fixed.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24531) CheckpointingTimeBenchmark can fail with connection timed out

2021-10-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24531:
--

 Summary: CheckpointingTimeBenchmark can fail with connection timed 
out
 Key: FLINK-24531
 URL: https://issues.apache.org/jira/browse/FLINK-24531
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks
Affects Versions: 1.15.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski


There is a race condition between obtaining REST API port and actually starting 
up the REST API server. If REST API server start is delayed, the 
{{CheckpointingTimeBenchmark}} can fail with

{noformat}
java.util.concurrent.ExecutionException: 
org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: 
connection timed out: localhost/127.0.0.1:60503
at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at 
org.apache.flink.benchmark.CheckpointingTimeBenchmark$CheckpointEnvironmentContext.waitForBackpressure(CheckpointingTimeBenchmark.java:251)
at 
org.apache.flink.benchmark.CheckpointingTimeBenchmark$CheckpointEnvironmentContext.setUp(CheckpointingTimeBenchmark.java:216)
at 
org.apache.flink.benchmark.generated.CheckpointingTimeBenchmark_checkpointSingleInput_jmhTest._jmh_tryInit_f_checkpointenvironmentcont
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24483) Document what is Public API and what compatibility guarantees Flink is providing

2021-10-08 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24483:
--

 Summary: Document what is Public API and what compatibility 
guarantees Flink is providing
 Key: FLINK-24483
 URL: https://issues.apache.org/jira/browse/FLINK-24483
 Project: Flink
  Issue Type: Improvement
  Components: API / DataStream, Documentation, Table SQL / API
Affects Versions: 1.13.2, 1.12.5, 1.14.0
Reporter: Piotr Nowojski
 Fix For: 1.15.0


We should document:
* What constitute of the Public API, what do 
Public/PublicEvolving/Experimental/Internal annotations mean.
* What compatibility guarantees we are providing (forward functional, compile 
and binary compatibility for {{@Public}} interfaces)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24475) Remove no longer used NestedMap* classes

2021-10-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24475:
--

 Summary: Remove no longer used NestedMap* classes
 Key: FLINK-24475
 URL: https://issues.apache.org/jira/browse/FLINK-24475
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.13.2, 1.14.0, 1.15.0
Reporter: Piotr Nowojski
 Fix For: 1.15.0


After FLINK-21935 all of the {{NestedMapsStateTable}} classes are no longer 
used in the production code. They are still however being used in benchmarks in 
some tests. Benchmarks/tests should be migrated to {{CopyOnWrite}} versions 
while the {{NestedMaps}} classes should be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24472) DispatcherTest is unstable

2021-10-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24472:
--

 Summary: DispatcherTest  is unstable
 Key: FLINK-24472
 URL: https://issues.apache.org/jira/browse/FLINK-24472
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Piotr Nowojski
 Fix For: 1.15.0


testCancellationOfNonCanceledTerminalJobFailsWithAppropriateException from 
DispatcherTest can fail with:

{noformat}
Oct 07 10:31:18 at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
Oct 07 10:31:18 at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
Oct 07 10:31:18 at 
org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
Oct 07 10:31:18 at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
Oct 07 10:31:18 at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
Oct 07 10:31:18 at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
Oct 07 10:31:18 at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
Oct 07 10:31:18 at 
org.junit.runners.ParentRunner.run(ParentRunner.java:413)
Oct 07 10:31:18 at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
Oct 07 10:31:18 at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
Oct 07 10:31:18 at 
org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43)
Oct 07 10:31:18 at 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
Oct 07 10:31:18 at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
Oct 07 10:31:18 at 
java.util.Iterator.forEachRemaining(Iterator.java:116)
Oct 07 10:31:18 at 
java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
Oct 07 10:31:18 at 
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
Oct 07 10:31:18 at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
Oct 07 10:31:18 at 
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
Oct 07 10:31:18 at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
Oct 07 10:31:18 at 
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
Oct 07 10:31:18 at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
Oct 07 10:31:18 at 
org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82)
Oct 07 10:31:18 at 
org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73)
Oct 07 10:31:18 at 
org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220)
Oct 07 10:31:18 at 
org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188)
Oct 07 10:31:18 at 
org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(DefaultLauncher.java:202)
Oct 07 10:31:18 at 
org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:181)
Oct 07 10:31:18 at 
org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:128)
Oct 07 10:31:18 at 
org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:150)
Oct 07 10:31:18 at 
org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:120)
Oct 07 10:31:18 at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
Oct 07 10:31:18 at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
Oct 07 10:31:18 at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
Oct 07 10:31:18 at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24471) Add microbenchmark for throughput with debloating enabled

2021-10-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24471:
--

 Summary: Add microbenchmark for throughput with debloating enabled
 Key: FLINK-24471
 URL: https://issues.apache.org/jira/browse/FLINK-24471
 Project: Flink
  Issue Type: Sub-task
  Components: Benchmarks, Runtime / Network
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.15.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24439) Introduce a CoordinatorStore

2021-10-01 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24439:
--

 Summary: Introduce a CoordinatorStore
 Key: FLINK-24439
 URL: https://issues.apache.org/jira/browse/FLINK-24439
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / Common
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.15.0


In order to allow {{SourceCoordinator}}s from different {{Source}}s (for 
example two different Kafka sources, or Kafka and Kinesis) to align watermarks, 
they have to be able to exchange information/aggregate watermarks from those 
different Sources. To enable this, we need to provide some {{CoordinatorStore}} 
concept, that would be a thread safe singleton.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24441) Block SourceReader when watermarks are out of alignment

2021-10-01 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24441:
--

 Summary: Block SourceReader when watermarks are out of alignment
 Key: FLINK-24441
 URL: https://issues.apache.org/jira/browse/FLINK-24441
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / Common
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.15.0


SourceReader should become unavailable once it's latest watermark is too far 
into the future



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24440) Announce and combine latest watermarks across SourceReaders

2021-10-01 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24440:
--

 Summary: Announce and combine latest watermarks across 
SourceReaders
 Key: FLINK-24440
 URL: https://issues.apache.org/jira/browse/FLINK-24440
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / Common
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.15.0


# Each SourceReader should inform it's SourceCoordinator about the latest 
watermark that it has emitted so far
# SourceCoordinators should combine those watermarks and broadcast the 
aggregated result back to all SourceReaders



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24357) ZooKeeperLeaderElectionConnectionHandlingTest#testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled fails with an Unhandled error

2021-09-22 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24357:
--

 Summary: 
ZooKeeperLeaderElectionConnectionHandlingTest#testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled
 fails with an Unhandled error
 Key: FLINK-24357
 URL: https://issues.apache.org/jira/browse/FLINK-24357
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Piotr Nowojski


In a [private 
azure|https://dev.azure.com/pnowojski/637f6470-2732-4605-a304-caebd40e284b/_apis/build/builds/517/logs/155]
 build when testing my own PR I've noticed the following error that looks 
unrelated to any of my changes (modifications to {{Task}} class 
error/cancellation handling logic):

{noformat}

2021-09-22T08:09:16.6244936Z Sep 22 08:09:16 [ERROR] 
testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled  Time 
elapsed: 28.753 s  <<< FAILURE!
2021-09-22T08:09:16.6245821Z Sep 22 08:09:16 java.lang.AssertionError: The 
TestingFatalErrorHandler caught an exception.
2021-09-22T08:09:16.6246513Z Sep 22 08:09:16at 
org.apache.flink.runtime.util.TestingFatalErrorHandlerResource.after(TestingFatalErrorHandlerResource.java:78)
2021-09-22T08:09:16.6247281Z Sep 22 08:09:16at 
org.apache.flink.runtime.util.TestingFatalErrorHandlerResource.access$300(TestingFatalErrorHandlerResource.java:33)
2021-09-22T08:09:16.6248167Z Sep 22 08:09:16at 
org.apache.flink.runtime.util.TestingFatalErrorHandlerResource$1.evaluate(TestingFatalErrorHandlerResource.java:57)
2021-09-22T08:09:16.6248862Z Sep 22 08:09:16at 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
2021-09-22T08:09:16.6249620Z Sep 22 08:09:16at 
org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
2021-09-22T08:09:16.6250210Z Sep 22 08:09:16at 
org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
2021-09-22T08:09:16.6250773Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
2021-09-22T08:09:16.6251375Z Sep 22 08:09:16at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
2021-09-22T08:09:16.6251951Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
2021-09-22T08:09:16.6252562Z Sep 22 08:09:16at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
2021-09-22T08:09:16.6253415Z Sep 22 08:09:16at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
2021-09-22T08:09:16.6254469Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
2021-09-22T08:09:16.6255039Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
2021-09-22T08:09:16.6256238Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
2021-09-22T08:09:16.6257109Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
2021-09-22T08:09:16.6257766Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
2021-09-22T08:09:16.6258406Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
2021-09-22T08:09:16.6259050Z Sep 22 08:09:16at 
org.junit.runners.ParentRunner.run(ParentRunner.java:413)
2021-09-22T08:09:16.6259827Z Sep 22 08:09:16at 
org.junit.runner.JUnitCore.run(JUnitCore.java:137)
2021-09-22T08:09:16.6260963Z Sep 22 08:09:16at 
org.junit.runner.JUnitCore.run(JUnitCore.java:115)
2021-09-22T08:09:16.6261796Z Sep 22 08:09:16at 
org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43)
2021-09-22T08:09:16.6262428Z Sep 22 08:09:16at 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
2021-09-22T08:09:16.6263268Z Sep 22 08:09:16at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
2021-09-22T08:09:16.6263875Z Sep 22 08:09:16at 
java.util.Iterator.forEachRemaining(Iterator.java:116)
2021-09-22T08:09:16.6265025Z Sep 22 08:09:16at 
java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
2021-09-22T08:09:16.6265940Z Sep 22 08:09:16at 
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
2021-09-22T08:09:16.6266767Z Sep 22 08:09:16at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
2021-09-22T08:09:16.6267470Z Sep 22 08:09:16at 
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
2021-09-22T08:09:16.6268165Z Sep 22 08:09:16at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
2021-09-22T08:09:16.6269341Z Sep 22 08:09:16at 
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
2021-09-22T08:09:16.6269928Z Sep 22 08:09:16at 
java.util.stream.ReferencePipeline.forEach(Refe

[jira] [Created] (FLINK-24344) Handling of IOExceptions when triggering checkpoints doesn't cause job failover

2021-09-21 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24344:
--

 Summary: Handling of IOExceptions when triggering checkpoints 
doesn't cause job failover
 Key: FLINK-24344
 URL: https://issues.apache.org/jira/browse/FLINK-24344
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
 Fix For: 1.14.0, 1.15.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24328) Long term fix for receiving new buffer size before network reader configured

2021-09-17 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24328:
--

 Summary: Long term fix for receiving new buffer size before 
network reader configured
 Key: FLINK-24328
 URL: https://issues.apache.org/jira/browse/FLINK-24328
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.14.0
Reporter: Anton Kalashnikov
Assignee: Anton Kalashnikov
 Fix For: 1.14.0


It happened on the big cluster(parallelism = 75, TM=5, task=8) just on the 
initialization moment.
{noformat}
2021-09-09 14:36:42,383 WARN  org.apache.flink.runtime.taskmanager.Task 
   [] - Map -> Flat Map (71/75)#0 (7a5b971e0cd57aa5d057a114e2679b03) 
switched from RUNNING to FAILED with failure c
ause: 
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
Fatal error at remote task manager 
'ip-172-31-22-183.eu-central-1.compute.internal/172.31.22.183:42085'.
at 
org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.decodeMsg(CreditBasedPartitionRequestClientHandler.java:339)
at 
org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelRead(CreditBasedPartitionRequestClientHandler.java:240)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelRead(NettyMessageClientDecoderDelegate.java:112)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at 
org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
at 
org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
at 
org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at 
org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: No reader for receiverId = 
296559f497c54a82534945f4549b9e2d exists.
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.obtainReader(PartitionRequestQueue.java:194)
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.notifyNewBufferSize(PartitionRequestQueue.java:188)
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestServerHandler.channelRead0(PartitionRequestServerHandler.java:134)
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestServerHandler.channelRead0(PartitionRequestServerHandler.java:42)
at 
org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at 
org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
at 
org.apache.flink.sha

[jira] [Created] (FLINK-24184) Potential race condition leading to incorrectly issued interruptions

2021-09-07 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24184:
--

 Summary: Potential race condition leading to incorrectly issued 
interruptions
 Key: FLINK-24184
 URL: https://issues.apache.org/jira/browse/FLINK-24184
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Affects Versions: 1.13.2, 1.12.5, 1.11.4, 1.10.3, 1.9.3, 1.8.3, 1.14.0
Reporter: Piotr Nowojski
 Fix For: 1.15.0


There is a  race condition in disabling interrupts while closing resources. 
Currently this is guarded by a volatile variable, but there might be a race 
condition when:
1. interrupter thread first checked the shouldInterruptOnCancel flag
2. shouldInterruptOnCancel flag switched to false as Task/StreamTask entered 
cleaning up phase
3. interrupter issued an interrupt while Task/StreamTask are closing/releasing 
resources, potentially causing a memory leak



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24155) Translate documentation for how to configure the CheckpointFailureManager

2021-09-03 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24155:
--

 Summary: Translate documentation for how to configure the 
CheckpointFailureManager
 Key: FLINK-24155
 URL: https://issues.apache.org/jira/browse/FLINK-24155
 Project: Flink
  Issue Type: Improvement
  Components: chinese-translation, Documentation
Affects Versions: 1.13.2, 1.14.0, 1.15.0
Reporter: Piotr Nowojski
 Fix For: 1.14.0


Documentation added in FLINK-23916 should be translated to it's Chinese 
counterpart. Note that this applies to three separate commits:
merged to master as cd01d4c0279
merged to release-1.14 as 2e769746bf2
merged to release-1.13 as e1a71219454



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24154) Translate documentation for how to configure the CheckpointFailureManager

2021-09-03 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24154:
--

 Summary: Translate documentation for how to configure the 
CheckpointFailureManager
 Key: FLINK-24154
 URL: https://issues.apache.org/jira/browse/FLINK-24154
 Project: Flink
  Issue Type: Sub-task
  Components: chinese-translation, Documentation
Reporter: Armstrong Nova
Assignee: wangzhao


Translate 
"[https://ci.apache.org/projects/flink/flink-docs-master/monitoring/historyserver.html]";
 page into Chinese.

This doc located in "flink/docs/monitoring/historyserver.zh.md"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24030) PulsarSourceITCase>SourceTestSuiteBase.testMultipleSplits failed

2021-08-27 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-24030:
--

 Summary: PulsarSourceITCase>SourceTestSuiteBase.testMultipleSplits 
failed
 Key: FLINK-24030
 URL: https://issues.apache.org/jira/browse/FLINK-24030
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Pulsar
Affects Versions: 1.14.0
Reporter: Piotr Nowojski


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22936&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461
root cause:

{noformat}
Aug 27 09:41:42 Caused by: 
org.apache.pulsar.client.api.PulsarClientException$BrokerMetadataException: 
Consumer not found
Aug 27 09:41:42 at 
org.apache.pulsar.client.api.PulsarClientException.unwrap(PulsarClientException.java:987)
Aug 27 09:41:42 at 
org.apache.pulsar.client.impl.PulsarClientImpl.close(PulsarClientImpl.java:658)
Aug 27 09:41:42 at 
org.apache.flink.connector.pulsar.source.reader.source.PulsarSourceReaderBase.close(PulsarSourceReaderBase.java:83)
Aug 27 09:41:42 at 
org.apache.flink.connector.pulsar.source.reader.source.PulsarOrderedSourceReader.close(PulsarOrderedSourceReader.java:170)
Aug 27 09:41:42 at 
org.apache.flink.streaming.api.operators.SourceOperator.close(SourceOperator.java:308)
Aug 27 09:41:42 at 
org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:141)
Aug 27 09:41:42 at 
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.closeAllOperators(RegularOperatorChain.java:127)
Aug 27 09:41:42 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.closeAllOperators(StreamTask.java:1015)
Aug 27 09:41:42 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:859)
Aug 27 09:41:42 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:747)
Aug 27 09:41:42 at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
Aug 27 09:41:42 at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937)
Aug 27 09:41:42 at 
org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
Aug 27 09:41:42 at 
org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
{noformat}
Top level error:
{noformat}

WARNING: The following warnings have been detected: WARNING: Return type, 
java.util.Map, of method, 
public java.util.Map 
org.apache.pulsar.broker.admin.impl.ClustersBase.getNamespaceIsolationPolicies(java.lang.String)
 throws java.lang.Exception, is not resolvable to a concrete type.

Aug 27 09:41:42 [ERROR] Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time 
elapsed: 357.849 s <<< FAILURE! - in 
org.apache.flink.connector.pulsar.source.PulsarSourceITCase
Aug 27 09:41:42 [ERROR] testMultipleSplits{TestEnvironment, ExternalContext}[1] 
 Time elapsed: 5.391 s  <<< ERROR!
Aug 27 09:41:42 java.lang.RuntimeException: Failed to fetch next result
Aug 27 09:41:42 at 
org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:109)
Aug 27 09:41:42 at 
org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:80)
Aug 27 09:41:42 at 
org.apache.flink.connectors.test.common.utils.TestDataMatchers$MultipleSplitDataMatcher.matchesSafely(TestDataMatchers.java:151)
Aug 27 09:41:42 at 
org.apache.flink.connectors.test.common.utils.TestDataMatchers$MultipleSplitDataMatcher.matchesSafely(TestDataMatchers.java:133)
Aug 27 09:41:42 at 
org.hamcrest.TypeSafeDiagnosingMatcher.matches(TypeSafeDiagnosingMatcher.java:55)
Aug 27 09:41:42 at 
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:12)
Aug 27 09:41:42 at 
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:8)
Aug 27 09:41:42 at 
org.apache.flink.connectors.test.common.testsuites.SourceTestSuiteBase.testMultipleSplits(SourceTestSuiteBase.java:156)
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23988) Test corner cases

2021-08-26 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23988:
--

 Summary: Test corner cases
 Key: FLINK-23988
 URL: https://issues.apache.org/jira/browse/FLINK-23988
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Network
Reporter: Piotr Nowojski


Check how debloating behaves in case of:
* data skew
* multiple/two/union inputs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23776) Performance regression on 14.08.2021 in FLIP-27

2021-08-14 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23776:
--

 Summary: Performance regression on 14.08.2021 in FLIP-27
 Key: FLINK-23776
 URL: https://issues.apache.org/jira/browse/FLINK-23776
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream, Benchmarks
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
Assignee: Arvid Heise
 Fix For: 1.14.0


http://codespeed.dak8s.net:8000/timeline/?ben=mapSink.F27_UNBOUNDED&env=2
http://codespeed.dak8s.net:8000/timeline/?ben=mapRebalanceMapSink.F27_UNBOUNDED&env=2


{noformat}
git ls 7b60a964b1..7f3636f6b4
7f3636f6b4f [2 days ago] [FLINK-23652][connectors] Adding common source 
metrics. [Arvid Heise]
97c8f72b813 [3 months ago] [FLINK-23652][connectors] Adding common sink 
metrics. [Arvid Heise]
48da20e8f88 [3 months ago] [FLINK-23652][test] Adding InMemoryMetricReporter 
and using it by default in MiniClusterResource. [Arvid Heise]
63ee60859ca [3 months ago] [FLINK-23652][core/metrics] Extract 
Operator(IO)MetricGroup interfaces and expose them in RuntimeContext [Arvid 
Heise]
5d5e39b614b [2 days ago] [refactor][connectors] Only use 
MockSplitReader.Builder for instantiation. [Arvid Heise]
b927035610c [3 months ago] [refactor][core] Extract common context creation in 
CollectionExecutor [Arvid Heise]
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23772) Double check if non-keyed FullyFinishedOperatorState can be mixed up with non finished OperatorState on recovery

2021-08-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23772:
--

 Summary: Double check if non-keyed FullyFinishedOperatorState can 
be mixed up with non finished OperatorState on recovery
 Key: FLINK-23772
 URL: https://issues.apache.org/jira/browse/FLINK-23772
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
 Fix For: 1.14.0


I'm not sure if with non-keyed state we have an issue that it can be reshuffled 
to different operators during recovery. Are there any guarantees that if 
subtask 1 has state A, while subtask 2 has B, that after recovery it won’t be 
rotated?
# is this an issue?
# if so, if we have partially finished tasks with some operators having, 
{{FullyFinishedOperatorState}}, what prevents 
{{VerticesFinishedCache.calculateIfFinished}} from failing if the 
{{FullyFinishedOperatorState}} gets assigned to an operator chain with non 
finished operator?

For example an operator chain with parallelism of two, non-keyed, before 
recovery:
{noformat}
src1 (finished state) -> foo1 (finished state)
src2 -> foo2
{noformat}
Can we end up after recovery with:
{noformat}
src1 (finished state) -> foo2
src2 -> foo1 (finished state)
{noformat}
?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23771) FLIP-147 Checkpoint N has not been started at all

2021-08-13 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23771:
--

 Summary: FLIP-147 Checkpoint N has not been started at all
 Key: FLINK-23771
 URL: https://issues.apache.org/jira/browse/FLINK-23771
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
 Fix For: 1.14.0


Observed when running WIP version of https://github.com/apache/flink/pull/16773/
{noformat}
6311 [flink-akka.actor.default-dispatcher-6] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - transform-2-keyed 
(4/4) (7f9e36a73792074afd7803e39f563fac) switched from RUNNING to FAILED on 
dac2cc5f-41f8-4067-99a9-ae56341e3735 @ localhost (dataPort=-1).
java.lang.Exception: Could not perform checkpoint 4 for operator 
transform-2-keyed (4/4)#1.
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:1170)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$11(StreamTask.java:1117)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(StreamTask.java:851)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:754)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:787)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:730) 
~[classes/:?]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:786) 
~[classes/:?]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:572) 
~[classes/:?]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]
Caused by: java.lang.IllegalStateException: Checkpoint 4 has not been started 
at all
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.getAllBarriersReceivedFuture(SingleCheckpointBarrierHandler.java:464)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.getAllBarriersReceivedFuture(CheckpointedInputGate.java:209)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.prepareSnapshot(StreamTaskNetworkInput.java:112)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.prepareSnapshot(StreamOneInputProcessor.java:83)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.prepareInputSnapshot(StreamTask.java:458)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.prepareInflightDataSnapshot(SubtaskCheckpointCoordinatorImpl.java:554)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.initInputsCheckpoint(SubtaskCheckpointCoordinatorImpl.java:423)
 ~[classes/:?]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:1154)
 ~[classes/:?]
... 13 more
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23741) Waiting for final checkpoint can deadlock job

2021-08-12 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23741:
--

 Summary: Waiting for final checkpoint can deadlock job
 Key: FLINK-23741
 URL: https://issues.apache.org/jira/browse/FLINK-23741
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing, Runtime / Task
Affects Versions: 1.14.0
Reporter: Piotr Nowojski


With {{ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH}} enabled, final checkpoint can 
deadlock (or timeout after very long time) if there is a race condition between 
selecting tasks to trigger checkpoint on and finishing tasks. FLINK-21246 was 
supposed to handle it, but it doesn't work as expected, because futures from:
org.apache.flink.runtime.taskexecutor.TaskExecutor#triggerCheckpoint
and
org.apache.flink.streaming.runtime.tasks.StreamTask#triggerCheckpointAsync
are not linked together. TaskExecutor#triggerCheckpoint reports that checkpoint 
has been successfully triggered, while {{StreamTask}} might have actually 
finished.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23708) When task is finishedOnRestore Operators shouldn't be used

2021-08-10 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23708:
--

 Summary: When task is finishedOnRestore Operators shouldn't be used
 Key: FLINK-23708
 URL: https://issues.apache.org/jira/browse/FLINK-23708
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Task
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.14.0


When task is finishedOnRestore Operators shouldn't be used, invoked or even 
initialised. Currently at least a couple of methods are being invoked like:
* endInput()
* getMetricGroup()



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23704) FLIP-27 sources are not generating LatencyMarkers

2021-08-10 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23704:
--

 Summary: FLIP-27 sources are not generating LatencyMarkers
 Key: FLINK-23704
 URL: https://issues.apache.org/jira/browse/FLINK-23704
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream, Runtime / Task
Affects Versions: 1.13.2, 1.12.5, 1.14.0
Reporter: Piotr Nowojski


Currently {{LatencyMarker}} is created only in 
{{StreamSource.LatencyMarksEmitter#LatencyMarksEmitter}}. FLIP-27 sources are 
never emitting it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23698) Pass watermarks in finished on restore operators

2021-08-10 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23698:
--

 Summary: Pass watermarks in finished on restore operators
 Key: FLINK-23698
 URL: https://issues.apache.org/jira/browse/FLINK-23698
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Task
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.14.0


After merging FLINK-23541 there is a bug on restore finished tasks that we are 
loosing an information that max watermark has been already emitted. As task is 
finished, it means no new watermark will be ever emitted, while downstream 
tasks (for example two/multiple input tasks) would be deadlocked waiting for a 
watermark from an already finished input.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23648) Drop StreamTwoInputProcessor in favour of StreamMultipleInputProcessor

2021-08-05 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23648:
--

 Summary: Drop StreamTwoInputProcessor in favour of 
StreamMultipleInputProcessor
 Key: FLINK-23648
 URL: https://issues.apache.org/jira/browse/FLINK-23648
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Affects Versions: 1.14.0
Reporter: Piotr Nowojski
Assignee: Piotr Nowojski
 Fix For: 1.14.0


StreamTwoInputProcessor is less efficient and can be simply replaced with 
StreamMultipleInputProcessor. This also solves a performance problem for 
FLINK-23541, where initial version was causing a performance regression in two 
input processor, while multiple input was working fine. Hence instead of 
optimising StreamTwoInputProcessor, it's just simpler to drop it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23593) Performance regression on 15.07.2021

2021-08-02 Thread Piotr Nowojski (Jira)
Piotr Nowojski created FLINK-23593:
--

 Summary: Performance regression on 15.07.2021
 Key: FLINK-23593
 URL: https://issues.apache.org/jira/browse/FLINK-23593
 Project: Flink
  Issue Type: Bug
  Components: Benchmarks
Affects Versions: 1.14.0
Reporter: Piotr Nowojski


http://codespeed.dak8s.net:8000/timeline/?ben=sortedMultiInput&env=2
http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput&env=2


{noformat}
pnowojski@piotr-mbp: [~/flink -  ((no branch, bisect started on pr/16589))] $ 
git ls f4afbf3e7de..eb8100f7afe
eb8100f7afe [4 weeks ago] (pn/bad, bad, refs/bisect/bad) 
[FLINK-22017][coordination] Allow BLOCKING result partition to be individually 
consumable [Thesharing]
d2005268b1e [4 weeks ago] (HEAD, pn/bisect-4, bisect-4) 
[FLINK-22017][coordination] Get the ConsumedPartitionGroup that 
IntermediateResultPartition and DefaultResultPartition belong to [Thesharing]
d8b1a6fd368 [3 weeks ago] [FLINK-23372][streaming-java] Disable 
AllVerticesInSameSlotSharingGroupByDefault in batch mode [Timo Walther]
4a78097d038 [3 weeks ago] (pn/bisect-3, bisect-3, 
refs/bisect/good-4a78097d0385749daceafd8326930c8cc5f26f1a) 
[FLINK-21928][clients][runtime] Introduce static method constructors of 
DuplicateJobSubmissionException for better readability. [David Moravek]
172b9e32215 [3 weeks ago] [FLINK-21928][clients] JobManager failover should 
succeed, when trying to resubmit already terminated job in application mode. 
[David Moravek]
f483008db86 [3 weeks ago] [FLINK-21928][core] Introduce 
org.apache.flink.util.concurrent.FutureUtils#handleException method, that 
allows future to recover from the specied exception. [David Moravek]
d7ac08c2ac0 [3 weeks ago] (pn/bisect-2, bisect-2, 
refs/bisect/good-d7ac08c2ac06b9ff31707f3b8f43c07817814d4f) 
[FLINK-22843][docs-zh] Document and code are inconsistent [ZhiJie Yang]
16c3ea427df [3 weeks ago] [hotfix] Split the final checkpoint related tests to 
a separate test class. [Yun Gao]
31b3d37a22c [7 weeks ago] [FLINK-21089][runtime] Skip the execution of new 
sources if finished on restore [Yun Gao]
20fe062e1b5 [3 weeks ago] [FLINK-21089][runtime] Skip execution for the legacy 
source task if finished on restore [Yun Gao]
874c627114b [3 weeks ago] [FLINK-21089][runtime] Skip the lifecycle method of 
operators if finished on restore [Yun Gao]
ceaf24b1d88 [3 weeks ago] (pn/bisect-1, bisect-1, 
refs/bisect/good-ceaf24b1d881c2345a43f305d40435519a09cec9) [hotfix] Fix 
isClosed() for operator wrapper and proxy operator close to the operator chain 
[Yun Gao]
41ea591a6db [3 weeks ago] [FLINK-22627][runtime] Remove unused slot request 
protocol [Yangze Guo]
489346b60f8 [3 months ago] [FLINK-22627][runtime] Remove PendingSlotRequest 
[Yangze Guo]
8ffb4d2af36 [3 months ago] [FLINK-22627][runtime] Remove TaskManagerSlot 
[Yangze Guo]
72073741588 [3 months ago] [FLINK-22627][runtime] Remove SlotManagerImpl and 
its related tests [Yangze Guo]
bdb3b7541b3 [3 months ago] [hotfix][yarn] Remove unused internal options in 
YarnConfigOptionsInternal [Yangze Guo]
a6a9b192eac [3 weeks ago] [FLINK-23201][streaming] Reset alignment only for the 
currently processed checkpoint [Anton Kalashnikov]
b35701a35c7 [3 weeks ago] [FLINK-23201][streaming] Calculate checkpoint 
alignment time only for last started checkpoint [Anton Kalashnikov]
3abec22c536 [3 weeks ago] [FLINK-23107][table-runtime] Separate implementation 
of deduplicate rank from other rank functions [Shuo Cheng]
1a195f5cc59 [3 weeks ago] [FLINK-16093][docs-zh] Translate "System Functions" 
page of "Functions" into Chinese (#16348) [ZhiJie Yang]
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


  1   2   3   4   >