Hello everyone,
I'm using Scala and Spark 3.4.1 on Windows 10. While streaming with
Spark, I set the `cleanSource` option to "archive" and the
`sourceArchiveDir` option to "archived", as in the code below.
```scala
spark.readStream
.option("cleanSource", "archive")
.option("sourceArchiveDir", "archived")
.option("enforceSchema", false)
.option("header", includeHeader)
.option("inferSchema", inferSchema)
.options(otherOptions)
.schema(csvSchema.orNull)
.csv(FileUtils.getPath(sourceSettings.dataFolderPath,
mappingSource.path).toString)
```
The call `FileUtils.getPath(sourceSettings.dataFolderPath, mappingSource.path)`
returns a relative path like:
`test-data\streaming-folder\patients`
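For reference, a relative path like this gets qualified against the JVM's working directory before Spark uses it. A minimal sketch with plain `java.nio` (outside Spark, just to show the resolution; the exact absolute prefix depends on where the app runs):

```scala
import java.nio.file.Paths

object RelativePathDemo {
  def main(args: Array[String]): Unit = {
    // The path handed to .csv(...) is relative, not absolute
    val rel = Paths.get("test-data", "streaming-folder", "patients")
    println(rel.isAbsolute) // false
    // Spark/Hadoop qualify it against the working directory, which on
    // Windows yields a file:/C:/... URI (prefix depends on the run location)
    println(rel.toAbsolutePath.toUri)
  }
}
```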
When I start the stream, Spark does not move the source CSV files to the
archive folder. After working on it a bit, I started debugging the Spark
source code. I found the
`override protected def cleanTask(entry: FileEntry): Unit` method in the
`FileStreamSource.scala` file in the `org.apache.spark.sql.execution.streaming`
package.
On line 569, the `!fileSystem.rename(curPath, newPath)` call is supposed to
move the source file to the archive folder. However, while debugging I noticed
that the `curPath` and `newPath` values were as follows:
**curPath**:
`file:/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv`
**newPath**:
`file:/C:/dev/be/data-integration-suite/archived/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv`
It seems the absolute path of the CSV file was appended when creating
`newPath`, since `C:/dev/be/data-integration-suite` appears twice in it. This
is why Spark's archiving does not work. Instead, `newPath` should be:
`file:/C:/dev/be/data-integration-suite/archived/test-data/streaming-folder/patients/patients-success.csv`.
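To illustrate the doubling, the concatenation can be modeled with plain strings (a simplified sketch of the behavior I observed in the debugger, not Spark's actual implementation; `buildArchivePath` is my own name): the archive directory plus the URI path of the source file, which on Windows still begins with the drive letter:

```scala
object ArchivePathDemo {
  // Simplified model of the observed behavior: the source file's full
  // URI path is appended to the archive directory.
  def buildArchivePath(archiveDir: String, sourceUriPath: String): String =
    archiveDir.stripSuffix("/") + sourceUriPath

  def main(args: Array[String]): Unit = {
    val archiveDir = "file:/C:/dev/be/data-integration-suite/archived"
    // On Windows, URI.getPath for a file:/C:/... URI keeps the drive letter:
    val srcUriPath =
      "/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv"
    // The result contains C:/dev/be/data-integration-suite twice:
    println(buildArchivePath(archiveDir, srcUriPath))
  }
}
```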
I guess this is more related to the Spark library itself and may be a Spark
bug. Is there any workaround or Spark config to overcome this problem?
Thanks
Best regards,
Yunus Emre