liang yu created MAPREDUCE-7508:
-----------------------------------
Summary: FileInputFormat can throw ArrayIndexOutOfBoundsException
when a file is written concurrently.
Key: MAPREDUCE-7508
URL: https://issues.apache.org/jira/browse/MAPREDUCE-7508
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mapreduce-client
Reporter: liang yu
Environment: Spark Streaming 2.4.0, Hadoop mapreduce-client 2.6.5
Scenario:
I am using Spark Streaming to process files stored in HDFS. In my setup, the
upstream system sometimes starts two identical tasks that attempt to create and
write to the same HDFS file simultaneously. This can lead to conflicts where a
file is created and written to twice in quick succession.
Problem:
When Spark scans for files, it uses the FileInputFormat.getSplits() method to
split the file. The first step in getSplits is to retrieve the file's length.
If the file length is not zero, the next step is to get the block locations
array for that file. However, if the two upstream programs rapidly create and
write to the same file (i.e., the file is overwritten or appended to almost
simultaneously), a race condition may occur:
  The file's length is already non-zero,
  but calling getFileBlockLocations() returns an empty array because the file
  is being overwritten or is not yet fully written.
When this happens, subsequent logic in getSplits (such as accessing the last
element of the block locations array) will throw an
ArrayIndexOutOfBoundsException because the block locations array is
unexpectedly empty.
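To illustrate the failure mode, here is a minimal sketch in plain Java. It is a simplified stand-in for the block lookup described above, not Hadoop's actual code: the Block class, the method names unguardedIndex and guardedIndex, and the choice to return an empty OptionalInt are all assumptions made for illustration. It shows how reading the last element of an empty array fails, and one possible guard.

```java
import java.util.OptionalInt;

/** Simplified model of a block lookup; names here are hypothetical, not Hadoop's API. */
final class BlockIndexSketch {

    /** Hypothetical minimal model of a block: start offset and length in bytes. */
    static final class Block {
        final long offset;
        final long length;

        Block(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Unguarded lookup. If no block contains the offset, it reads the last
     * element to build an error message, so an empty array fails with
     * ArrayIndexOutOfBoundsException before any meaningful error is raised.
     */
    static int unguardedIndex(Block[] blocks, long offset) {
        for (int i = 0; i < blocks.length; i++) {
            long start = blocks[i].offset;
            if (start <= offset && offset < start + blocks[i].length) {
                return i;
            }
        }
        Block last = blocks[blocks.length - 1]; // index -1 when blocks is empty
        long lastByte = last.offset + last.length - 1;
        throw new IllegalArgumentException(
            "Offset " + offset + " is outside of file (0.." + lastByte + ")");
    }

    /**
     * Guarded lookup (an assumed mitigation, not a confirmed fix): report
     * "no block information yet" instead of touching an empty array.
     */
    static OptionalInt guardedIndex(Block[] blocks, long offset) {
        if (blocks == null || blocks.length == 0) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(unguardedIndex(blocks, offset));
    }

    public static void main(String[] args) {
        Block[] empty = new Block[0]; // what the race condition is reported to produce
        System.out.println(guardedIndex(empty, 0).isPresent());
        try {
            unguardedIndex(empty, 0);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unguarded lookup failed: " + e.getClass().getSimpleName());
        }
    }
}
```

With a guard of this kind, a caller could skip the file for the current scan or retry it later; which of those is appropriate for FileInputFormat is an open design question rather than something this sketch settles.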
Summary:
This issue can occur when multiple upstream writers operate on the same HDFS
file nearly simultaneously. As a result, Spark jobs may intermittently fail due
to an unhandled empty block locations array in FileInputFormat.getSplits() when
processing files that are in the process of being overwritten or not yet fully
written.