[
https://issues.apache.org/jira/browse/HADOOP-19902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083181#comment-18083181
]
ASF GitHub Bot commented on HADOOP-19902:
-----------------------------------------
sunchao commented on PR #8513:
URL: https://github.com/apache/hadoop/pull/8513#issuecomment-4529722557
We encountered this while using Parquet 1.15.2, which does not include this PR:
https://github.com/apache/parquet-java/pull/3204
> [ABFS] Small write optimization fails hflush followed by close by retaining
> consumed block
> ------------------------------------------------------------------------------------------
>
> Key: HADOOP-19902
> URL: https://issues.apache.org/jira/browse/HADOOP-19902
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Chao Sun
> Priority: Major
> Labels: pull-request-available
>
> When `fs.azure.write.enableappendwithflush` is enabled, `AbfsOutputStream`
> fails for a short write followed by `hflush()` and `close()`.
> h3. Reproducer
> {code:java}
> try (FSDataOutputStream out = fs.create(path)) {
> out.write(new byte[1000]);
> out.hflush();
> }
> {code}
> Run with `fs.azure.write.enableappendwithflush=true` and a write buffer
> larger than the payload. The issue is present on current trunk and branch-3.4.
> h3. Actual behavior
> The `hflush()` call sends an append-with-flush request and consumes the
> underlying data block. The subsequent `close()` still sees the same block as
> active and attempts to upload it again, failing before a second append can be
> sent:
> {code}
> java.lang.IllegalStateException: Expected stream state Writing -but actual state is Closed in ByteBufferBlock{...}
> at org.apache.hadoop.fs.store.DataBlocks$DataBlock.verifyState(...)
> at org.apache.hadoop.fs.store.DataBlocks$ByteBufferBlock.startUpload(...)
> at org.apache.hadoop.fs.azurebfs.services.AbfsBlock.startUpload(...)
> at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.uploadBlockAsync(...)
> at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.smallWriteOptimizedflushInternal(...)
> at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.close(...)
> {code}
> h3. Expected behavior
> After an optimized `hflush()`, `close()` should complete successfully without
> attempting to re-upload the data already submitted by the flush-mode append.
> h3. Root cause
> `smallWriteOptimizedflushInternal()` calls `uploadBlockAsync()`, which
> invokes `startUpload()` and consumes the active block, but the optimized path
> does not clear that block from the block manager. The regular
> `uploadCurrentBlock()` path already clears the active block in a `finally`
> block after submission.
> h3. Proposed fix
> Clear the active block after submitting the optimized append-with-flush,
> matching the lifecycle used by regular uploads, and add a regression test for
> `write() -> hflush() -> close()` that verifies the payload is appended
> exactly once.
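
The root cause and proposed fix quoted above can be illustrated with a small
stand-alone model. This is only a sketch: `SmallWriteFlushSketch`, `Block`,
`Stream` and the `clearAfterOptimizedFlush` flag are hypothetical names made up
for illustration, not the real `AbfsOutputStream` or `DataBlocks` classes, and
the real flush and close paths do considerably more work. It shows only that a
block consumed by `startUpload()` cannot be uploaded a second time, and that
clearing the active block in a `finally` makes `write() -> hflush() -> close()`
send the payload exactly once.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of the block lifecycle; not the real ABFS code. */
final class SmallWriteFlushSketch {

    /** Stand-in for a data block: it may be handed off for upload only once. */
    static final class Block {
        private final byte[] data;
        private boolean writing = true;

        Block(byte[] data) {
            this.data = data;
        }

        byte[] startUpload() {
            if (!writing) {
                throw new IllegalStateException(
                    "Expected stream state Writing -but actual state is Closed");
            }
            writing = false; // the block is now consumed
            return data;
        }
    }

    /** Stand-in for an output stream that tracks a single active block. */
    static final class Stream {
        private final boolean clearAfterOptimizedFlush; // hypothetical toggle for the fix
        private final List<Integer> appends = new ArrayList<>();
        private Block activeBlock;

        Stream(boolean clearAfterOptimizedFlush) {
            this.clearAfterOptimizedFlush = clearAfterOptimizedFlush;
        }

        void write(byte[] payload) {
            activeBlock = new Block(payload.clone());
        }

        void hflush() {
            optimizedFlush();
        }

        void close() {
            optimizedFlush();
        }

        /** Models the optimized path: one append carrying the flush. */
        private void optimizedFlush() {
            if (activeBlock == null) {
                return; // nothing pending, so no second append
            }
            try {
                appends.add(activeBlock.startUpload().length);
            } finally {
                if (clearAfterOptimizedFlush) {
                    activeBlock = null; // mirrors the regular upload path
                }
            }
        }

        List<Integer> appends() {
            return appends;
        }
    }

    /** Runs write -> hflush -> close and returns the size of each append sent. */
    static List<Integer> writeFlushClose(boolean clearAfterOptimizedFlush) {
        Stream stream = new Stream(clearAfterOptimizedFlush);
        stream.write(new byte[1000]);
        stream.hflush();
        stream.close();
        return stream.appends();
    }

    public static void main(String[] args) {
        try {
            writeFlushClose(false);
        } catch (IllegalStateException e) {
            System.out.println("without clearing: " + e.getMessage());
        }
        System.out.println("with clearing: appends = " + writeFlushClose(true));
    }
}
```

With the flag off, `close()` hits the already consumed block and throws, as in
the stack trace above. With the flag on, `close()` finds no active block and the
model records a single 1000-byte append.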