[ 
https://issues.apache.org/jira/browse/SPARK-43221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Yang updated SPARK-43221:
-------------------------------
    Description: 
Spark on Yarn Cluster

When multiple executors run on one node and the same block is held by more 
than one of them, in memory on some and on disk on others, an executor 
intermittently fails to obtain the block and throws:

java.lang.ArrayIndexOutOfBoundsException: 0

    at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBlocks$1(TorrentBroadcast.scala:183)

 

The steps below retrace how the problem occurs:

step 1:

The executor asks the driver for the block's locations and status 
(locationsAndStatusOption). The request carries the BlockId and the host of 
the executor's own node. Note that it carries no port, so it identifies the 
host but not the individual executor.

line: 1092

!image-2023-04-21-00-24-22-059.png!

step 2:

On the driver side, the driver looks up all BlockManagers that hold the block 
by BlockId. Outside the remote shuffle case, it takes the block status from 
the first BlockManager in the locations list.

Assume two BlockManagers on this node hold the block: BM-1 stores it in 
memory and BM-2 stores it on disk.

Assume also that the returned status is BM-1's, so its storage type is memory 
and its diskSize is 0.

line: 852, 856

!image-2023-04-21-00-30-41-851.png!

step 3:

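This method returns a BlockLocationsAndStatus object. If a BlockManager on 
the requester's host stores the block on disk, that BlockManager's disk paths 
are stored in localDirs. In the scenario above, localDirs therefore comes 
from BM-2 while status comes from BM-1.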

!image-2023-04-21-00-50-10-918.png!
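
To make steps 2 and 3 concrete, here is a minimal sketch of the lookup as 
described above. It is not Spark's code: the class DriverLookupSketch, its 
Location and Reply types, the host name and the directory paths are all 
hypothetical, and "diskSize > 0" is an assumed stand-in for "this BlockManager 
stores the block on disk".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Simplified, hypothetical model of the driver-side lookup in steps 2 and 3.
// These are illustrative types, not Spark's classes.
class DriverLookupSketch {
    // One block manager's record for the block.
    static final class Location {
        final String name;
        final String host;
        final long memSize;
        final long diskSize;
        final String localDir;

        Location(String name, String host, long memSize, long diskSize, String localDir) {
            this.name = name;
            this.host = host;
            this.memSize = memSize;
            this.diskSize = diskSize;
            this.localDir = localDir;
        }
    }

    // What the driver hands back to the requesting executor.
    static final class Reply {
        final String statusFrom;
        final long diskSize;
        final Optional<String> localDir;

        Reply(String statusFrom, long diskSize, Optional<String> localDir) {
            this.statusFrom = statusFrom;
            this.diskSize = diskSize;
            this.localDir = localDir;
        }
    }

    static Reply lookup(List<Location> locations, String requesterHost) {
        // Status: taken from the first location, whatever its storage type is.
        Location first = locations.get(0);
        // localDir: taken from the first same-host location that has the block on disk.
        // (Assumption: diskSize > 0 stands in for "storage level uses disk".)
        Optional<String> dir = locations.stream()
            .filter(l -> l.host.equals(requesterHost) && l.diskSize > 0)
            .map(l -> l.localDir)
            .findFirst();
        return new Reply(first.name, first.diskSize, dir);
    }

    // BM-1 (memory) is listed first, BM-2 (disk) second, both on the same host.
    static Reply example() {
        Location bm1 = new Location("BM-1", "node-a", 4096L, 0L, "/data/bm1");
        Location bm2 = new Location("BM-2", "node-a", 0L, 4096L, "/data/bm2");
        return lookup(Arrays.asList(bm1, bm2), "node-a");
    }

    public static void main(String[] args) {
        Reply r = example();
        System.out.println("status from " + r.statusFrom + ", diskSize=" + r.diskSize
            + ", localDir=" + r.localDir.orElse("none"));
    }
}
```

In this model the reply pairs BM-1's diskSize of 0 with BM-2's directory, 
which is the mismatch the executor later acts on.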

step 4:

When the executor receives locationsAndStatusOption, localDirs is non-empty 
but status.diskSize is 0.

line: 1102

!image-2023-04-21-00-54-11-968.png!
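
The branch in step 4 can be sketched as follows, assuming the executor 
attempts a same-host disk read whenever localDirs is present and passes 
status.diskSize as the size to read. ExecutorBranchSketch and its method 
names are hypothetical and only model the description above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical model of the executor-side branch in step 4; not Spark's code.
class ExecutorBranchSketch {
    // Returns the size that would be passed to the same-host disk read,
    // or -1 when no local disk read would be attempted.
    static long sizeForLocalDiskRead(Optional<List<String>> localDirs, long diskSize) {
        if (localDirs.isPresent() && !localDirs.get().isEmpty()) {
            // The size comes from the status, whichever block manager it describes.
            return diskSize;
        }
        return -1L;
    }

    // localDirs from a disk-backed block manager, diskSize from a memory-backed one.
    static long mismatchedExample() {
        Optional<List<String>> dirs = Optional.of(Arrays.asList("/data/bm2"));
        return sizeForLocalDiskRead(dirs, 0L);
    }

    // No same-host block manager has the block on disk.
    static long noLocalDirsExample() {
        Optional<List<String>> none = Optional.empty();
        return sizeForLocalDiskRead(none, 0L);
    }

    public static void main(String[] args) {
        System.out.println("mismatched reply -> read size " + mismatchedExample());
        System.out.println("no localDirs -> " + noLocalDirsExample());
    }
}
```

With the mismatched reply, the local disk read is attempted with a size of 0.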

step 5:

readDiskBlockFromSameHostExecutor only checks whether the block file exists 
and then reads a byte array of the size passed in (blockSize). If blockSize 
is 0, it returns an empty byte array.

line: 1234, 1240

!image-2023-04-21-00-57-29-140.png!

Reading an element from that empty array then causes the 
ArrayIndexOutOfBoundsException shown above.
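
The failure in step 5 can be reproduced in isolation with a small stand-in. 
This is a sketch under assumptions, not Spark's implementation: 
ZeroSizeReadSketch, readBlock and the temporary file are hypothetical, and the 
read is modelled as "return blockSize bytes of the file if it exists".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical stand-in for the same-host disk read in step 5; not Spark's code.
class ZeroSizeReadSketch {
    // Checks only that the file exists, then returns blockSize bytes from it.
    static byte[] readBlock(Path file, long blockSize) {
        if (!Files.exists(file)) {
            return null; // block file not found
        }
        try {
            byte[] all = Files.readAllBytes(file);
            // The caller-supplied size, not the file length, decides what is returned.
            return Arrays.copyOf(all, (int) blockSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a 4-byte block file, reads it with blockSize 0, then touches index 0.
    static String demo() {
        Path file = null;
        try {
            file = Files.createTempFile("block", ".bin");
            Files.write(file, new byte[] {1, 2, 3, 4});
            byte[] data = readBlock(file, 0L); // diskSize was reported as 0
            try {
                return "first byte: " + data[0];
            } catch (ArrayIndexOutOfBoundsException e) {
                return "ArrayIndexOutOfBoundsException";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (file != null) {
                try {
                    Files.deleteIfExists(file);
                } catch (IOException ignored) {
                    // best-effort cleanup of the temporary file
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The file holds four bytes, but because the size passed in is 0 the caller 
gets an empty array and the first index access fails.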


> Executor obtained error information 
> ------------------------------------
>
>                 Key: SPARK-43221
>                 URL: https://issues.apache.org/jira/browse/SPARK-43221
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 3.1.1, 3.2.0, 3.3.0
>            Reporter: Qiang Yang
>            Priority: Major
>         Attachments: image-2023-04-21-00-19-58-021.png, 
> image-2023-04-21-00-24-22-059.png, image-2023-04-21-00-30-41-851.png, 
> image-2023-04-21-00-50-10-918.png, image-2023-04-21-00-53-20-720.png, 
> image-2023-04-21-00-54-11-968.png, image-2023-04-21-00-57-29-140.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h


