Hi team,
I have observed the following problem.
I have an application running in daemon mode. Within this application I use
Spark in local mode, initializing the SparkContext once per application start.
Spark jobs, however, can be triggered at very different times: sometimes once
per day, sometimes once per week.
When there is a long gap between job runs, the newly triggered job fails with
the following error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.nio.file.NoSuchFileException: /tmp/blockmgr-d8d04f03-ccad-4cae-8db2-cff0caea3ea3/0e
java.nio.file.NoSuchFileException: /tmp/blockmgr-d8d04f03-ccad-4cae-8db2-cff0caea3ea3/0e
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
    at java.nio.file.Files.createDirectory(Files.java:674)
    at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:108)
    at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:131)
    at org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:2008)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1489)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:1381)
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1870)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:154)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:99)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:38)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:78)
    at org.apache.spark.SparkContext.broadcastInternal(SparkContext.scala:1548)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1530)
    at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1535)
    at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1353)
    at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1295)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2931)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Why this happens: outdated folders in /tmp (let's say one week old) are being
cleaned up by the OS. It seems that SparkContext stores the blockId (and its
on-disk location) somewhere and tries to reuse it, while I would not expect a
new job run to depend on the previous one at all.
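For reference, here is a minimal standalone sketch of how I believe this can be
reproduced outside of our daemon (the object name and the manual deletion step
are mine, purely for illustration; in our setup the deletion is done by the
/tmp cleaner after about a week):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone reproduction, not our real daemon code.
object BlockMgrCleanupRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("blockmgr-cleanup-repro"))

    // Starting the SparkContext creates a blockmgr-<uuid> directory under
    // java.io.tmpdir (/tmp by default); the first job runs fine.
    println(sc.parallelize(1 to 10).sum())

    // Simulate the /tmp cleaner: while the SparkContext stays alive,
    // delete /tmp/blockmgr-* by hand before continuing.
    println("Delete /tmp/blockmgr-* now, then press Enter...")
    scala.io.StdIn.readLine()

    // The next job fails with the NoSuchFileException from the stack trace
    // above while broadcasting the task binary.
    println(sc.parallelize(1 to 10).sum())

    sc.stop()
  }
}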
The only workaround we have is to restart the application, but that is not
suitable for us.
After analyzing the code I found this in
org.apache.spark.storage.BlockManager#removeBlockInternal:
// Removals are idempotent in disk store and memory store. At worst, we get a warning.
val removedFromMemory = memoryStore.remove(blockId)
val removedFromDisk = diskStore.remove(blockId)
if (!removedFromMemory && !removedFromDisk) {
  logWarning(s"Block $blockId could not be removed as it was not found on disk or in memory")
}
So for an unsuccessful removal one would expect only a WARN log, but in fact
the job fails.
This happens because org.apache.spark.storage.DiskBlockManager#getFile uses
Files.createDirectory(path)
which in turn maps to mkdir, and mkdir cannot create missing parent
directories. So once the parent blockmgr-d8d04f03-ccad-4cae-8db2-cff0caea3ea3
directory has been removed, creating its 0e subdirectory fails. This operation
is therefore not idempotent!
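To double-check this, here is a small standalone snippet of my own (nothing to
do with Spark's code itself) showing that Files.createDirectory fails with
exactly this NoSuchFileException once the parent directory is gone, while
Files.createDirectories would recreate the whole chain:

import java.nio.file.Files

object CreateDirectoryDemo {
  def main(args: Array[String]): Unit = {
    // Stand-ins for blockmgr-<uuid> and one of its hashed subdirectories ("0e").
    val parent = Files.createTempDirectory("blockmgr-demo")
    val sub = parent.resolve("0e")

    Files.createDirectory(sub)      // fine: the parent exists
    Files.delete(sub)
    Files.delete(parent)            // simulate the /tmp cleaner removing everything

    // Files.createDirectories(sub) // would succeed: recreates parent + sub
    Files.createDirectory(sub)      // throws java.nio.file.NoSuchFileException
  }
}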
The only related issue I found on Stack Overflow is
https://stackoverflow.com/questions/41238121/spark-java-ioexception-failed-to-create-local-dir-in-tmp-blockmgr
and there is still no proper explanation or resolution there.
I'm using spark-core_2.12-3.4.1.jar
Can you suggest anything for this issue? Could it be reported as a bug?
Thanks
Regards,
--------------
Olga Averianova, Senior Software Engineer