[ 
https://issues.apache.org/jira/browse/HDDS-15341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyao Meng updated HDDS-15341:
------------------------------
    Summary: EC write can fail with ArrayIndexOutOfBoundsException due to 
CoderUtil emptyChunk resize race  (was: EC client write can fail with 
ArrayIndexOutOfBoundsException due to CoderUtil emptyChunk resize race)

> EC write can fail with ArrayIndexOutOfBoundsException due to CoderUtil 
> emptyChunk resize race
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDDS-15341
>                 URL: https://issues.apache.org/jira/browse/HDDS-15341
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: EC Client
>            Reporter: Siyao Meng
>            Assignee: Siyao Meng
>            Priority: Major
>              Labels: pull-request-available
>
> In Ozone, this can be hit when multiple EC key output streams in the same 
> client JVM use the Java raw EC encoder concurrently with different 
> encode/reset lengths. Each ECKeyOutputStream has its own encoder, but all 
> Java encoders share the static CoderUtil.emptyChunk cache. *If native ISA-L 
> is unavailable or not selected*, the Java RSRawEncoder clears parity output 
> buffers through CoderUtil.resetOutputBuffers(). Under concurrent close/flush 
> paths, especially with partial final stripes of different sizes, one stream 
> can grow the shared zero buffer for a larger encode while another smaller 
> encode races and shrinks it, causing the larger encode’s later 
> System.arraycopy() to throw ArrayIndexOutOfBoundsException.
> This issue is avoided if native lib (ISA-L) is in-use. The issue can only be 
> hit when fallback builtin-java codec is being used, where you may see 
> messages like this printed on the client:
> {code}
> W20260513 08:22:52.526697  4325 ErasureCodeNative.java:55] 
> 854bfd7fdbf38f0c:f3db82230000000c] ISA-L support is not available in your 
> platform... using builtin-java codec where applicable
> {code}
> Problem:
> CoderUtil.resetBuffer(byte[] buffer, int offset, int len) gets a shared 
> zero-filled buffer from getEmptyChunk(len) and then calls:
> {code}
> System.arraycopy(empty, 0, buffer, offset, len);
> {code}
> The old getEmptyChunk() implementation checked emptyChunk.length before 
> entering the synchronized block, unconditionally replaced the shared static 
> buffer inside the lock, and returned the shared static field after leaving 
> the lock. This allowed a smaller concurrent caller to shrink the shared 
> cached buffer after a larger caller had grown it.
> {code:title=Stacktrace}
> ArrayIndexOutOfBoundsException: java.lang.ArrayIndexOutOfBoundsException
>       at java.lang.System.arraycopy(Native Method)
>       at 
> org.apache.ozone.erasurecode.rawcoder.CoderUtil.resetBuffer(CoderUtil.java:76)
>       at 
> org.apache.ozone.erasurecode.rawcoder.CoderUtil.resetOutputBuffers(CoderUtil.java:96)
>       at 
> org.apache.ozone.erasurecode.rawcoder.RSRawEncoder.doEncode(RSRawEncoder.java:69)
>       at 
> org.apache.ozone.erasurecode.rawcoder.RawErasureEncoder.encode(RawErasureEncoder.java:88)
>       at 
> org.apache.hadoop.ozone.client.io.ECKeyOutputStream.generateParityCells(ECKeyOutputStream.java:305)
>       at 
> org.apache.hadoop.ozone.client.io.ECKeyOutputStream.close(ECKeyOutputStream.java:475)
>       at 
> org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:105)
>       at 
> org.apache.hadoop.fs.ozone.OzoneFSOutputStream.close(OzoneFSOutputStream.java:70)
>       at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:77)
>       at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
> {code}
> {code:title=Current logic w/o the fix}
>   static byte[] getEmptyChunk(int leastLength) {
>     if (emptyChunk.length >= leastLength) {
>       return emptyChunk; // In most time
>     }
>     synchronized (CoderUtil.class) {
>       emptyChunk = new byte[leastLength];
>     }
> {code}
> Repro:
> 1. emptyChunk starts as byte[4096].
> 2. Thread A calls getEmptyChunk(4097) and blocks before entering the 
> synchronized block.
> 3. Thread B calls getEmptyChunk(8194), enters the synchronized block, and 
> sets emptyChunk = byte[8194].
> 4. Thread A resumes and unconditionally sets emptyChunk = byte[4097].
> 5. Thread B returns the shared static emptyChunk, now byte[4097].
> 6. System.arraycopy(..., len=8194) throws ArrayIndexOutOfBoundsException.
> This is a TOCTOU-style race on the shared emptyChunk cache.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to