Jia Yu created SEDONA-495:
-----------------------------
Summary: Raster data source uses shared FileSystem connections,
which leads to a race condition
Key: SEDONA-495
URL: https://issues.apache.org/jira/browse/SEDONA-495
Project: Apache Sedona
Issue Type: Bug
Reporter: Jia Yu
The raster data source's OutputWriter calls `new
Path(savePath).getFileSystem(context.getConfiguration)` to obtain a Hadoop
FileSystem instance, and one OutputWriter is instantiated per task. Because
Hadoop caches FileSystem instances, this call returns the same shared
connection to all tasks on an executor.
https://github.com/apache/sedona/blob/master/spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterFileFormat.scala#L85
A multi-core executor commonly runs multiple tasks concurrently (one task
per core). In the current implementation, when one task completes, the
shared connection is closed and all other tasks still running on that
executor fail with an IO exception.
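The sharing described above can be checked with a small sketch. It assumes the default Hadoop configuration (FileSystem cache enabled) and uses the local file system. The object name `SharedFsDemo` and the path `file:///tmp/sedona-495-demo` are hypothetical and not part of Sedona.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper, for illustration only.
object SharedFsDemo {
  // Resolves the FileSystem for the same path twice, as two tasks on one
  // executor would, and reports whether both calls returned the same object.
  def sameInstance(savePath: String, conf: Configuration): Boolean = {
    val fsTaskA = new Path(savePath).getFileSystem(conf)
    val fsTaskB = new Path(savePath).getFileSystem(conf)
    fsTaskA eq fsTaskB // reference equality: true means the instance is shared
  }
}
```

If the two references are the same object, a `close()` issued by one task also closes the instance the other task is still using.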
The best practice is to use `FileSystem.newInstance` for each task, so that
closing one task's connection does not affect the others.
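A minimal sketch of the per-task approach follows. It is not the actual Sedona patch: the object `PerTaskFileSystem` and the helper `withTaskFileSystem` are hypothetical names, and only `FileSystem.newInstance(URI, Configuration)` and `FileSystem.close()` are Hadoop API calls.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper, for illustration only.
object PerTaskFileSystem {
  // Runs `body` with a FileSystem owned by the calling task and closes it
  // afterwards, whether `body` succeeds or throws.
  def withTaskFileSystem[T](savePath: String, conf: Configuration)(body: FileSystem => T): T = {
    val uri = new Path(savePath).toUri
    // newInstance bypasses the shared FileSystem cache, so closing this
    // instance does not close the one used by any other task.
    val fs = FileSystem.newInstance(uri, conf)
    try body(fs)
    finally fs.close()
  }
}
```

A writer could then wrap its output logic as `PerTaskFileSystem.withTaskFileSystem(savePath, conf) { fs => ... }`. In an OutputWriter the instance would more likely be created in the constructor and closed in `close()`, but the ownership rule is the same: each task closes only what it created.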