[jira] [Created] (FLINK-32298) The outputQueueSize is negative

Rui Fan (Jira) Fri, 09 Jun 2023 02:43:07 -0700

Rui Fan created FLINK-32298:
-------------------------------

             Summary: The outputQueueSize is negative 
                 Key: FLINK-32298
                 URL: https://issues.apache.org/jira/browse/FLINK-32298
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.18.0
            Reporter: Rui Fan
            Assignee: Rui Fan
         Attachments: image-2023-06-09-17-27-46-429.png


h1. Backgraound

The outputQueueSize indicates `The real size of queued output buffers in 
bytes.`, so it shouldn't be negative. However, it may be negative in some cases.
h2. How outputQueueSize is generated?

TotalWrittenBytes: *_BufferWritingResultPartition#totalWrittenBytes_* records 
how many data is written to ResultPartition.

TotalSentNumberOfBytes: *_PipelinedSubpartition#totalNumberOfBytes_* records 
how many data is sent to downstream.

The outputQueueSize = TotalWrittenBytes - TotalSentNumberOfBytes.
h1. Bug

The TotalSentNumberOfBytes may be larger than TotalWrittenBytes due to some 
data are written to the PipelinedSubpartition without the 
BufferWritingResultPartition, such as : 
 # PipelinedSubpartition#finishReadRecoveredState writes the 
`EndOfChannelStateEvent` even if the unaligned checkpoint is disable
 # PipelinedSubpartition#addRecovered writes channel state(if the job recovered 
from unaligned checkpoint, the outputQueueSize is totally wrong)
 # PipelinedSubpartition#finish writes the `EndOfPartitionEvent`

!image-2023-06-09-17-27-46-429.png|width=1033,height=296!

 
h1. Solution

PipelinedSubpartition should is written through BufferWritingResultPartition, 
and all writes should be counted.

 

By the way, outputQueueSize doesn't matter because it's just a metric, it 
doesn't affect data processing. I found this bug because some of our flink 
scenarios need to use adaptive rebalance (FLINK-31655), I'm developing it in 
our internal version, which relies on the correct outputQueueSize to select the 
low pressure channel.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-32298) The outputQueueSize is negative

Reply via email to