Hello everyone,
I'm using Scala and Spark 3.4.1 on Windows 10. While streaming with
Spark, I set the `cleanSource` option to "archive" and the
`sourceArchiveDir` option to "archived", as in the code below.
```scala
spark.readStream
.option("cleanSource", "archive")
.option("sourceArchiveDir", "archived")
.option("enforceSchema", false)
.option("header", includeHeader)
.option("inferSchema", inferSchema)
.options(otherOptions)
.schema(csvSchema.orNull)
.csv(FileUtils.getPath(sourceSettings.dataFolderPath,
mappingSource.path).toString)
```
The call `FileUtils.getPath(sourceSettings.dataFolderPath, mappingSource.path)`
returns a relative path like:
`test-data\streaming-folder\patients`
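For reference, a relative path like this gets qualified against the JVM's working directory before Spark uses it. A minimal sketch with plain `java.nio` (outside Spark, just to show the resolution; the exact absolute prefix depends on where the app runs):

```scala
import java.nio.file.Paths

object RelativePathDemo {
  def main(args: Array[String]): Unit = {
    // The path handed to .csv(...) is relative, not absolute
    val rel = Paths.get("test-data", "streaming-folder", "patients")
    println(rel.isAbsolute) // false
    // Spark/Hadoop qualify it against the working directory, which on
    // Windows yields a file:/C:/... URI (prefix depends on the run location)
    println(rel.toAbsolutePath.toUri)
  }
}
```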
When I start the stream, Spark does not move the source CSV files to the
archive folder. After working on it a bit, I started debugging the Spark
source code. I found the
`override protected def cleanTask(entry: FileEntry): Unit` method in the
`FileStreamSource.scala` file in the `org.apache.spark.sql.execution.streaming`
package.
On line 569, the `!fileSystem.rename(curPath, newPath)` call is supposed to
move the source file to the archive folder. However, while debugging I noticed
that the `curPath` and `newPath` values were as follows:
**curPath**:
`file:/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv`
**newPath**:
`file:/C:/dev/be/data-integration-suite/archived/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv`
It seems the absolute path of the CSV file was appended when creating
`newPath`, since `C:/dev/be/data-integration-suite` appears twice in it. This
is why Spark's archiving does not work. Instead, `newPath` should be:
`file:/C:/dev/be/data-integration-suite/archived/test-data/streaming-folder/patients/patients-success.csv`.
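To illustrate the doubling, the concatenation can be modeled with plain strings (a simplified sketch of the behavior I observed in the debugger, not Spark's actual implementation; `buildArchivePath` is my own name): the archive directory plus the URI path of the source file, which on Windows still begins with the drive letter:

```scala
object ArchivePathDemo {
  // Simplified model of the observed behavior: the source file's full
  // URI path is appended to the archive directory.
  def buildArchivePath(archiveDir: String, sourceUriPath: String): String =
    archiveDir.stripSuffix("/") + sourceUriPath

  def main(args: Array[String]): Unit = {
    val archiveDir = "file:/C:/dev/be/data-integration-suite/archived"
    // On Windows, URI.getPath for a file:/C:/... URI keeps the drive letter:
    val srcUriPath =
      "/C:/dev/be/data-integration-suite/test-data/streaming-folder/patients/patients-success.csv"
    // The result contains C:/dev/be/data-integration-suite twice:
    println(buildArchivePath(archiveDir, srcUriPath))
  }
}
```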
I guess this is more related to the Spark library itself and may be a Spark
bug. Is there any workaround or Spark config to overcome this problem?
Thanks
Best regards,
Yunus Emre