Hi team,
I have observed the following problem.
I have an application running in daemon mode. Within this application I use
Spark in local mode, initializing the SparkContext once per application start.
Spark jobs, however, can be triggered at very different times: sometimes once
per day, sometimes once per week.
When there is a long gap between job runs, the newly triggered job fails with
the following error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.nio.file.NoSuchFileException: /tmp/blockmgr-d8d04f03-ccad-4cae-8db2-cff0caea3ea3/0e
java.nio.file.NoSuchFileException: /tmp/blockmgr-d8d04f03-ccad-4cae-8db2-cff0caea3ea3/0e
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
    at java.nio.file.Files.createDirectory(Files.java:674)
    at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:108)
    at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:131)
    at org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:2008)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1489)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:1381)
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1870)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:154)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:99)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:38)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:78)
    at org.apache.spark.SparkContext.broadcastInternal(SparkContext.scala:1548)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1530)
    at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1535)
    at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1353)
    at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1295)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2931)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Why this happens: outdated folders in /tmp (let's say one week old) are being
cleaned up by the OS. It seems that SparkContext stores the blockId (and its
on-disk location) somewhere and tries to reuse it, while I would not expect a
new job run to depend on the previous one at all.
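For reference, here is a minimal standalone sketch of how I believe this can be
reproduced outside of our daemon (the object name and the manual deletion step
are mine, purely for illustration; in our setup the deletion is done by the
/tmp cleaner after about a week):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone reproduction, not our real daemon code.
object BlockMgrCleanupRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("blockmgr-cleanup-repro"))

    // Starting the SparkContext creates a blockmgr-<uuid> directory under
    // java.io.tmpdir (/tmp by default); the first job runs fine.
    println(sc.parallelize(1 to 10).sum())

    // Simulate the /tmp cleaner: while the SparkContext stays alive,
    // delete /tmp/blockmgr-* by hand before continuing.
    println("Delete /tmp/blockmgr-* now, then press Enter...")
    scala.io.StdIn.readLine()

    // The next job fails with the NoSuchFileException from the stack trace
    // above while broadcasting the task binary.
    println(sc.parallelize(1 to 10).sum())

    sc.stop()
  }
}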
The only workaround we have is to restart the application, but that is not
suitable for us.
After analyzing the code I found this in
org.apache.spark.storage.BlockManager#removeBlockInternal:
// Removals are idempotent in disk store and memory store. At worst, we get a warning.
val removedFromMemory = memoryStore.remove(blockId)
val removedFromDisk = diskStore.remove(blockId)
if (!removedFromMemory && !removedFromDisk) {
  logWarning(s"Block $blockId could not be removed as it was not found on disk or in memory")
}
So for an unsuccessful removal one would expect only a WARN log, but in fact
the job fails.
This happens because org.apache.spark.storage.DiskBlockManager#getFile uses
Files.createDirectory(path)
which in turn maps to mkdir, and mkdir cannot create missing parent
directories. So once the parent blockmgr-d8d04f03-ccad-4cae-8db2-cff0caea3ea3
directory has been removed, creating its 0e subdirectory fails. This operation
is therefore not idempotent!
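To double-check this, here is a small standalone snippet of my own (nothing to
do with Spark's code itself) showing that Files.createDirectory fails with
exactly this NoSuchFileException once the parent directory is gone, while
Files.createDirectories would recreate the whole chain:

import java.nio.file.Files

object CreateDirectoryDemo {
  def main(args: Array[String]): Unit = {
    // Stand-ins for blockmgr-<uuid> and one of its hashed subdirectories ("0e").
    val parent = Files.createTempDirectory("blockmgr-demo")
    val sub = parent.resolve("0e")

    Files.createDirectory(sub)      // fine: the parent exists
    Files.delete(sub)
    Files.delete(parent)            // simulate the /tmp cleaner removing everything

    // Files.createDirectories(sub) // would succeed: recreates parent + sub
    Files.createDirectory(sub)      // throws java.nio.file.NoSuchFileException
  }
}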
The only related issue I found on Stack Overflow is
https://stackoverflow.com/questions/41238121/spark-java-ioexception-failed-to-create-local-dir-in-tmp-blockmgr
and there is still no proper explanation or resolution there.
I'm using spark-core_2.12-3.4.1.jar
Can you suggest anything for this issue? Could it be reported as a bug?
Thanks
Regards,
--------------
Olga Averianova, Senior Software Engineer