Rui Fan created FLINK-32298:
-------------------------------
Summary: The outputQueueSize is negative
Key: FLINK-32298
URL: https://issues.apache.org/jira/browse/FLINK-32298
Project: Flink
Issue Type: Bug
Components: Runtime / Network
Affects Versions: 1.18.0
Reporter: Rui Fan
Assignee: Rui Fan
Attachments: image-2023-06-09-17-27-46-429.png
h1. Backgraound
The outputQueueSize indicates `The real size of queued output buffers in
bytes.`, so it shouldn't be negative. However, it may be negative in some cases.
h2. How outputQueueSize is generated?
TotalWrittenBytes: *_BufferWritingResultPartition#totalWrittenBytes_* records
how many data is written to ResultPartition.
TotalSentNumberOfBytes: *_PipelinedSubpartition#totalNumberOfBytes_* records
how many data is sent to downstream.
The outputQueueSize = TotalWrittenBytes - TotalSentNumberOfBytes.
h1. Bug
The TotalSentNumberOfBytes may be larger than TotalWrittenBytes due to some
data are written to the PipelinedSubpartition without the
BufferWritingResultPartition, such as :
# PipelinedSubpartition#finishReadRecoveredState writes the
`EndOfChannelStateEvent` even if the unaligned checkpoint is disable
# PipelinedSubpartition#addRecovered writes channel state(if the job recovered
from unaligned checkpoint, the outputQueueSize is totally wrong)
# PipelinedSubpartition#finish writes the `EndOfPartitionEvent`
!image-2023-06-09-17-27-46-429.png|width=1033,height=296!
h1. Solution
PipelinedSubpartition should is written through BufferWritingResultPartition,
and all writes should be counted.
By the way, outputQueueSize doesn't matter because it's just a metric, it
doesn't affect data processing. I found this bug because some of our flink
scenarios need to use adaptive rebalance (FLINK-31655), I'm developing it in
our internal version, which relies on the correct outputQueueSize to select the
low pressure channel.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)