[ 
https://issues.apache.org/jira/browse/HDFS-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080982#comment-18080982
 ] 

ASF GitHub Bot commented on HDFS-17916:
---------------------------------------

charlesconnell commented on PR #8466:
URL: https://github.com/apache/hadoop/pull/8466#issuecomment-4454207803

   Tests ran for 24 hours and were killed




> DataStreamer#processDatanodeOrExternalError() fails to return byte arrays to 
> ByteArrayManager
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17916
>                 URL: https://issues.apache.org/jira/browse/HDFS-17916
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 3.3.6, 3.5.0, 3.4.3
>            Reporter: Charles Connell
>            Priority: Major
>              Labels: pull-request-available
>
> A [certain code 
> path|https://github.com/apache/hadoop/blob/b322c3ce2c10b45cec2f9acbe6f00fb75c054caa/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1422]
>  in the DFS client DataStreamer discards DFSPacket objects without returning 
> their contained byte arrays to the ByteArrayManager. I discovered this bug at 
> my company after we had HBase server threads hung for hours at 
> {{ByteArrayManager#allocate()}}. Because the leak only happens in an 
> error-handling path, the problem requires an unhealthy HDFS cluster in order 
> to be exposed.
> I took a heap dump of a high-uptime but relatively healthy HBase server and 
> found evidence of leaked byte arrays there too. In the heap dump, the two 
> FixedLengthManagers both had {{numAllocated = 9}}, but there were zero 
> live {{DFSPacket}} objects. This suggests that the byte arrays and their 
> containing {{DFSPacket}} objects had been garbage collected, unbeknownst to 
> {{FixedLengthManager}}.
> In DataStreamer.java starting at line 1410, the {{DFSPacket}} that is 
> removed from {{dataQueue}} via {{remove()}} is allowed to be garbage collected 
> without its byte array being returned to the {{ByteArrayManager}}.
> {code:java}
>     if (!streamerClosed && dfsClient.clientRunning) {
>       if (stage == BlockConstructionStage.PIPELINE_CLOSE) {
>         // If we had an error while closing the pipeline, we go through a fast-path
>         // where the BlockReceiver does not run. Instead, the DataNode just finalizes
>         // the block immediately during the 'connect ack' process. So, we want to pull
>         // the end-of-block packet from the dataQueue, since we don't actually have
>         // a true pipeline to send it over.
>         //
>         // We also need to set lastAckedSeqno to the end-of-block Packet's seqno, so that
>         // a client waiting on close() will be aware that the flush finished.
>         synchronized (dataQueue) {
>           DFSPacket endOfBlockPacket = dataQueue.remove();  // remove the end of block packet
>           // Close any trace span associated with this Packet
>           Span span = endOfBlockPacket.getSpan();
>           if (span != null) {
>             span.finish();
>             endOfBlockPacket.setSpan(null);
>           }
>           assert endOfBlockPacket.isLastPacketInBlock();
>           assert lastAckedSeqno == endOfBlockPacket.getSeqno() - 1;
>           lastAckedSeqno = endOfBlockPacket.getSeqno();
>           pipelineRecoveryCount = 0;
>           dataQueue.notifyAll();
>         }
>         endBlock();
>       } else {
>         initDataStreaming();
>       }
>     }
> {code}
> This could be fixed by inserting this line inside the {{synchronized}} block, after {{endOfBlockPacket}} has been removed from {{dataQueue}}:
> {code:java}
> endOfBlockPacket.releaseBuffer(byteArrayManager);
> {code}
> Claude Opus 4.7 was used to assist in finding this bug. I verified the 
> findings and I stand by them.
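
The accounting problem described in the issue can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Hadoop code: ToyManager and ToyPacket are hypothetical stand-ins for FixedLengthManager and DFSPacket, and tryAllocate() returns null at the limit where, per the report, the real allocate() call blocks. It shows that dropping a packet without releasing its buffer leaves the manager's counter elevated even though the packet itself becomes garbage.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class LeakSketch {

  /** Hypothetical stand-in for a fixed-length array manager: it only keeps a count. */
  static final class ToyManager {
    private final int limit;
    private int numAllocated = 0;

    ToyManager(int limit) {
      this.limit = limit;
    }

    /** Hands out an array, or null once the limit is reached (simplification: no blocking). */
    synchronized byte[] tryAllocate(int length) {
      if (numAllocated >= limit) {
        return null;
      }
      numAllocated++;
      return new byte[length];
    }

    /** Decrements the count; the manager never learns about arrays that are merely dropped. */
    synchronized void release(byte[] array) {
      if (array != null) {
        numAllocated--;
      }
    }

    synchronized int numAllocated() {
      return numAllocated;
    }
  }

  /** Hypothetical stand-in for a packet that owns one managed buffer. */
  static final class ToyPacket {
    private byte[] buf;

    ToyPacket(byte[] buf) {
      this.buf = buf;
    }

    void releaseBuffer(ToyManager manager) {
      manager.release(buf);
      buf = null; // makes a second call a no-op
    }
  }

  /** Mimics the error path: remove the packet from the queue, then optionally release it. */
  static int allocatedAfterErrorPath(boolean releaseBuffer) {
    ToyManager manager = new ToyManager(2);
    Deque<ToyPacket> dataQueue = new ArrayDeque<>();
    dataQueue.add(new ToyPacket(manager.tryAllocate(64)));

    ToyPacket endOfBlockPacket = dataQueue.remove();
    if (releaseBuffer) {
      endOfBlockPacket.releaseBuffer(manager);
    }
    // The packet goes out of scope here; garbage collection does not touch the counter.
    return manager.numAllocated();
  }

  public static void main(String[] args) {
    System.out.println("without release: " + allocatedAfterErrorPath(false));
    System.out.println("with release: " + allocatedAfterErrorPath(true));
  }
}
```

In this model, each pass through the error path without a release permanently consumes one slot, so enough repetitions would exhaust the limit, which is consistent with the hung allocate() calls reported above.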


