[
https://issues.apache.org/jira/browse/HDFS-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083499#comment-18083499
]
ASF GitHub Bot commented on HDFS-17916:
---------------------------------------
charlesconnell opened a new pull request, #8516:
URL: https://github.com/apache/hadoop/pull/8516
### Description of PR
Backport of https://github.com/apache/hadoop/pull/8466
### How was this patch tested?
See the original PR, #8466.
### For code changes:
see other PR
- [ ] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
### AI Tooling
see other PR
If an AI tool was used:
- [ ] The PR includes the phrase "Contains content generated by <tool>"
where <tool> is the name of the AI tool used.
- [ ] My use of AI contributions follows the ASF legal policy
https://www.apache.org/legal/generative-tooling.html
> DataStreamer#processDatanodeOrExternalError() fails to return byte arrays to
> ByteArrayManager
> ---------------------------------------------------------------------------------------------
>
> Key: HDFS-17916
> URL: https://issues.apache.org/jira/browse/HDFS-17916
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Affects Versions: 3.3.6, 3.5.0, 3.4.3
> Reporter: Charles Connell
> Assignee: Charles Connell
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.6.0
>
>
> A [certain code
> path|https://github.com/apache/hadoop/blob/b322c3ce2c10b45cec2f9acbe6f00fb75c054caa/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1422]
> in the DFS client DataStreamer discards DFSPacket objects without returning
> their contained byte arrays to the ByteArrayManager. I discovered this bug at
> my company after we had HBase server threads hung for hours at
> {{ByteArrayManager#allocate()}}. Because the leak only happens in an
> error-handling path, the problem requires an unhealthy HDFS cluster in order
> to be exposed.
> I took a heap dump of a high-uptime but relatively healthy HBase server, and
> found evidence of leaked byte arrays there too. In the heap dump, the two
> FixedLengthManagers both had {{numAllocated = 9}}, but there were zero
> live {{DFSPacket}} objects. This suggests that the byte arrays and their
> containing {{DFSPacket}} objects had been garbage collected without
> {{FixedLengthManager}} being notified.
> In DataStreamer.java starting at line 1410, the {{DFSPacket}} that is
> {{remove()}}'d from {{dataQueue}} is allowed to be garbage collected
> without its buffer being released.
> {code:java}
> if (!streamerClosed && dfsClient.clientRunning) {
>   if (stage == BlockConstructionStage.PIPELINE_CLOSE) {
>     // If we had an error while closing the pipeline, we go through a fast-path
>     // where the BlockReceiver does not run. Instead, the DataNode just finalizes
>     // the block immediately during the 'connect ack' process. So, we want to pull
>     // the end-of-block packet from the dataQueue, since we don't actually have
>     // a true pipeline to send it over.
>     //
>     // We also need to set lastAckedSeqno to the end-of-block Packet's seqno, so that
>     // a client waiting on close() will be aware that the flush finished.
>     synchronized (dataQueue) {
>       DFSPacket endOfBlockPacket = dataQueue.remove();  // remove the end of block packet
>       // Close any trace span associated with this Packet
>       Span span = endOfBlockPacket.getSpan();
>       if (span != null) {
>         span.finish();
>         endOfBlockPacket.setSpan(null);
>       }
>       assert endOfBlockPacket.isLastPacketInBlock();
>       assert lastAckedSeqno == endOfBlockPacket.getSeqno() - 1;
>       lastAckedSeqno = endOfBlockPacket.getSeqno();
>       pipelineRecoveryCount = 0;
>       dataQueue.notifyAll();
>     }
>     endBlock();
>   } else {
>     initDataStreaming();
>   }
> }
> {code}
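> This could be fixed by inserting the following line in the code above, inside
> the {{synchronized}} block after {{endOfBlockPacket}} is removed from {{dataQueue}}: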
> {code:java}
> endOfBlockPacket.releaseBuffer(byteArrayManager);
> {code}
> Claude Opus 4.7 was used to assist in finding this bug. I verified the
> findings and I stand by them.
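
The accounting problem described in the issue can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `CountingPool`, `Packet`, and `drainOnePacket` are hypothetical stand-ins written for this example, not the Hadoop `ByteArrayManager`, `DFSPacket`, or `DataStreamer` classes. The toy pool also returns `null` at its limit, whereas the issue reports callers hanging in `ByteArrayManager#allocate()`.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of a counting buffer pool. NOT the Hadoop implementation. */
public class BufferLeakSketch {

  /** Hypothetical pool that only tracks how many arrays are outstanding. */
  static final class CountingPool {
    private final int limit;
    private final int arrayLength;
    private int numAllocated;

    CountingPool(int limit, int arrayLength) {
      this.limit = limit;
      this.arrayLength = arrayLength;
    }

    /** Hands out an array, or null once the limit is reached. */
    synchronized byte[] tryAllocate() {
      if (numAllocated >= limit) {
        return null;
      }
      numAllocated++;
      return new byte[arrayLength];
    }

    /** The only way the counter ever goes down; garbage collection does not call this. */
    synchronized void release(byte[] array) {
      if (array != null) {
        numAllocated--;
      }
    }

    synchronized int numAllocated() {
      return numAllocated;
    }
  }

  /** Hypothetical packet that owns one pooled buffer. */
  static final class Packet {
    private byte[] buf;

    Packet(byte[] buf) {
      this.buf = buf;
    }

    void releaseBuffer(CountingPool pool) {
      pool.release(buf);
      buf = null;
    }
  }

  /**
   * Removes one packet from a queue, optionally releasing its buffer,
   * and returns the pool's outstanding count afterwards.
   */
  static int drainOnePacket(boolean releaseBuffer) {
    CountingPool pool = new CountingPool(1, 64);
    Queue<Packet> dataQueue = new ArrayDeque<>();
    dataQueue.add(new Packet(pool.tryAllocate()));

    Packet endOfBlockPacket = dataQueue.remove();
    if (releaseBuffer) {
      endOfBlockPacket.releaseBuffer(pool);
    }
    // The packet becomes unreachable here either way; the pool is never told.
    return pool.numAllocated();
  }

  public static void main(String[] args) {
    System.out.println("without release: numAllocated = " + drainOnePacket(false));
    System.out.println("with release:    numAllocated = " + drainOnePacket(true));
  }
}
```

In this model the first call leaves the counter at 1 even though no packet is reachable any more, which mirrors the `numAllocated = 9` with zero live `DFSPacket` objects seen in the heap dump. The second call returns the counter to 0.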