[
https://issues.apache.org/jira/browse/HDDS-15341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siyao Meng updated HDDS-15341:
------------------------------
Description:
This can be hit when multiple EC key output streams in the same client JVM use
the Java raw EC encoder concurrently with different encode/reset lengths. Each
ECKeyOutputStream has its own encoder, but all Java encoders share the static
CoderUtil.emptyChunk cache. *If native ISA-L is unavailable or not selected*,
the Java RSRawEncoder clears parity output buffers through
CoderUtil.resetOutputBuffers(). Under concurrent close/flush paths, especially
with partial final stripes of different sizes, one stream can grow the shared
zero buffer for a larger encode while another smaller encode races and shrinks
it, causing the larger encode’s later System.arraycopy() to throw
ArrayIndexOutOfBoundsException.
This issue is avoided if native lib (ISA-L) is in-use. The issue can only be
hit when fallback builtin-java codec is being used, where you may see messages
like this printed on the client:
{code}
W20260513 08:22:52.526697 4325 ErasureCodeNative.java:55]
854bfd7fdbf38f0c:f3db82230000000c] ISA-L support is not available in your
platform... using builtin-java codec where applicable
{code}
Problem:
CoderUtil.resetBuffer(byte[] buffer, int offset, int len) gets a shared
zero-filled buffer from getEmptyChunk(len) and then calls:
{code}
System.arraycopy(empty, 0, buffer, offset, len);
{code}
The old getEmptyChunk() implementation checked emptyChunk.length before
entering the synchronized block, unconditionally replaced the shared static
buffer inside the lock, and returned the shared static field after leaving the
lock. This allowed a smaller concurrent caller to shrink the shared cached
buffer after a larger caller had grown it.
{code:title=Stacktrace}
ArrayIndexOutOfBoundsException: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at
org.apache.ozone.erasurecode.rawcoder.CoderUtil.resetBuffer(CoderUtil.java:76)
at
org.apache.ozone.erasurecode.rawcoder.CoderUtil.resetOutputBuffers(CoderUtil.java:96)
at
org.apache.ozone.erasurecode.rawcoder.RSRawEncoder.doEncode(RSRawEncoder.java:69)
at
org.apache.ozone.erasurecode.rawcoder.RawErasureEncoder.encode(RawErasureEncoder.java:88)
at
org.apache.hadoop.ozone.client.io.ECKeyOutputStream.generateParityCells(ECKeyOutputStream.java:305)
at
org.apache.hadoop.ozone.client.io.ECKeyOutputStream.close(ECKeyOutputStream.java:475)
at
org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:105)
at
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.close(OzoneFSOutputStream.java:70)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:77)
at
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
{code}
{code:title=Current logic w/o the fix}
static byte[] getEmptyChunk(int leastLength) {
if (emptyChunk.length >= leastLength) {
return emptyChunk; // In most time
}
synchronized (CoderUtil.class) {
emptyChunk = new byte[leastLength];
}
{code}
Repro:
1. emptyChunk starts as byte[4096].
2. Thread A calls getEmptyChunk(4097) and blocks before entering the
synchronized block.
3. Thread B calls getEmptyChunk(8194), enters the synchronized block, and sets
emptyChunk = byte[8194].
4. Thread A resumes and unconditionally sets emptyChunk = byte[4097].
5. Thread B returns the shared static emptyChunk, now byte[4097].
6. System.arraycopy(..., len=8194) throws ArrayIndexOutOfBoundsException.
This is a TOCTOU-style race on the shared emptyChunk cache.
was:
In Ozone, this can be hit when multiple EC key output streams in the same
client JVM use the Java raw EC encoder concurrently with different encode/reset
lengths. Each ECKeyOutputStream has its own encoder, but all Java encoders
share the static CoderUtil.emptyChunk cache. *If native ISA-L is unavailable or
not selected*, the Java RSRawEncoder clears parity output buffers through
CoderUtil.resetOutputBuffers(). Under concurrent close/flush paths, especially
with partial final stripes of different sizes, one stream can grow the shared
zero buffer for a larger encode while another smaller encode races and shrinks
it, causing the larger encode’s later System.arraycopy() to throw
ArrayIndexOutOfBoundsException.
This issue is avoided if native lib (ISA-L) is in-use. The issue can only be
hit when fallback builtin-java codec is being used, where you may see messages
like this printed on the client:
{code}
W20260513 08:22:52.526697 4325 ErasureCodeNative.java:55]
854bfd7fdbf38f0c:f3db82230000000c] ISA-L support is not available in your
platform... using builtin-java codec where applicable
{code}
Problem:
CoderUtil.resetBuffer(byte[] buffer, int offset, int len) gets a shared
zero-filled buffer from getEmptyChunk(len) and then calls:
{code}
System.arraycopy(empty, 0, buffer, offset, len);
{code}
The old getEmptyChunk() implementation checked emptyChunk.length before
entering the synchronized block, unconditionally replaced the shared static
buffer inside the lock, and returned the shared static field after leaving the
lock. This allowed a smaller concurrent caller to shrink the shared cached
buffer after a larger caller had grown it.
{code:title=Stacktrace}
ArrayIndexOutOfBoundsException: java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at
org.apache.ozone.erasurecode.rawcoder.CoderUtil.resetBuffer(CoderUtil.java:76)
at
org.apache.ozone.erasurecode.rawcoder.CoderUtil.resetOutputBuffers(CoderUtil.java:96)
at
org.apache.ozone.erasurecode.rawcoder.RSRawEncoder.doEncode(RSRawEncoder.java:69)
at
org.apache.ozone.erasurecode.rawcoder.RawErasureEncoder.encode(RawErasureEncoder.java:88)
at
org.apache.hadoop.ozone.client.io.ECKeyOutputStream.generateParityCells(ECKeyOutputStream.java:305)
at
org.apache.hadoop.ozone.client.io.ECKeyOutputStream.close(ECKeyOutputStream.java:475)
at
org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:105)
at
org.apache.hadoop.fs.ozone.OzoneFSOutputStream.close(OzoneFSOutputStream.java:70)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:77)
at
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
{code}
{code:title=Current logic w/o the fix}
static byte[] getEmptyChunk(int leastLength) {
if (emptyChunk.length >= leastLength) {
return emptyChunk; // In most time
}
synchronized (CoderUtil.class) {
emptyChunk = new byte[leastLength];
}
{code}
Repro:
1. emptyChunk starts as byte[4096].
2. Thread A calls getEmptyChunk(4097) and blocks before entering the
synchronized block.
3. Thread B calls getEmptyChunk(8194), enters the synchronized block, and sets
emptyChunk = byte[8194].
4. Thread A resumes and unconditionally sets emptyChunk = byte[4097].
5. Thread B returns the shared static emptyChunk, now byte[4097].
6. System.arraycopy(..., len=8194) throws ArrayIndexOutOfBoundsException.
This is a TOCTOU-style race on the shared emptyChunk cache.
> EC write can fail with ArrayIndexOutOfBoundsException due to CoderUtil
> emptyChunk resize race
> ---------------------------------------------------------------------------------------------
>
> Key: HDDS-15341
> URL: https://issues.apache.org/jira/browse/HDDS-15341
> Project: Apache Ozone
> Issue Type: Bug
> Components: EC Client
> Reporter: Siyao Meng
> Assignee: Siyao Meng
> Priority: Major
> Labels: pull-request-available
>
> This can be hit when multiple EC key output streams in the same client JVM
> use the Java raw EC encoder concurrently with different encode/reset lengths.
> Each ECKeyOutputStream has its own encoder, but all Java encoders share the
> static CoderUtil.emptyChunk cache. *If native ISA-L is unavailable or not
> selected*, the Java RSRawEncoder clears parity output buffers through
> CoderUtil.resetOutputBuffers(). Under concurrent close/flush paths,
> especially with partial final stripes of different sizes, one stream can grow
> the shared zero buffer for a larger encode while another smaller encode races
> and shrinks it, causing the larger encode’s later System.arraycopy() to throw
> ArrayIndexOutOfBoundsException.
> This issue is avoided if native lib (ISA-L) is in-use. The issue can only be
> hit when fallback builtin-java codec is being used, where you may see
> messages like this printed on the client:
> {code}
> W20260513 08:22:52.526697 4325 ErasureCodeNative.java:55]
> 854bfd7fdbf38f0c:f3db82230000000c] ISA-L support is not available in your
> platform... using builtin-java codec where applicable
> {code}
> Problem:
> CoderUtil.resetBuffer(byte[] buffer, int offset, int len) gets a shared
> zero-filled buffer from getEmptyChunk(len) and then calls:
> {code}
> System.arraycopy(empty, 0, buffer, offset, len);
> {code}
> The old getEmptyChunk() implementation checked emptyChunk.length before
> entering the synchronized block, unconditionally replaced the shared static
> buffer inside the lock, and returned the shared static field after leaving
> the lock. This allowed a smaller concurrent caller to shrink the shared
> cached buffer after a larger caller had grown it.
> {code:title=Stacktrace}
> ArrayIndexOutOfBoundsException: java.lang.ArrayIndexOutOfBoundsException
> at java.lang.System.arraycopy(Native Method)
> at
> org.apache.ozone.erasurecode.rawcoder.CoderUtil.resetBuffer(CoderUtil.java:76)
> at
> org.apache.ozone.erasurecode.rawcoder.CoderUtil.resetOutputBuffers(CoderUtil.java:96)
> at
> org.apache.ozone.erasurecode.rawcoder.RSRawEncoder.doEncode(RSRawEncoder.java:69)
> at
> org.apache.ozone.erasurecode.rawcoder.RawErasureEncoder.encode(RawErasureEncoder.java:88)
> at
> org.apache.hadoop.ozone.client.io.ECKeyOutputStream.generateParityCells(ECKeyOutputStream.java:305)
> at
> org.apache.hadoop.ozone.client.io.ECKeyOutputStream.close(ECKeyOutputStream.java:475)
> at
> org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:105)
> at
> org.apache.hadoop.fs.ozone.OzoneFSOutputStream.close(OzoneFSOutputStream.java:70)
> at
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:77)
> at
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
> {code}
> {code:title=Current logic w/o the fix}
> static byte[] getEmptyChunk(int leastLength) {
> if (emptyChunk.length >= leastLength) {
> return emptyChunk; // In most time
> }
> synchronized (CoderUtil.class) {
> emptyChunk = new byte[leastLength];
> }
> {code}
> Repro:
> 1. emptyChunk starts as byte[4096].
> 2. Thread A calls getEmptyChunk(4097) and blocks before entering the
> synchronized block.
> 3. Thread B calls getEmptyChunk(8194), enters the synchronized block, and
> sets emptyChunk = byte[8194].
> 4. Thread A resumes and unconditionally sets emptyChunk = byte[4097].
> 5. Thread B returns the shared static emptyChunk, now byte[4097].
> 6. System.arraycopy(..., len=8194) throws ArrayIndexOutOfBoundsException.
> This is a TOCTOU-style race on the shared emptyChunk cache.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]