We are trying to use Structured Streaming with a file source, but we are
having trouble getting Spark to read the files properly. Another process
generates data files continuously in the directory Spark uses as its
source. What we have observed is that the moment a data file is created,
before the producing process has finished writing it, Spark reads it
immediately without waiting for EOF, and it never revisits the file
afterwards. As a result we only end up with empty data. The only way we
have made it work is to produce the data files in a separate directory
(e.g. /tmp) and move them into Spark's file source directory after the
data generation completes.
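For reference, here is a minimal sketch of the setup and the
write-then-move workaround. The paths, file name, and schema are made up
for illustration; the key point is that the producer stages each file
outside the watched directory and moves it in only once it is complete:

    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths, StandardCopyOption}
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    // Reader side: Spark lists the source directory each micro-batch
    // and reads a file exactly once, when it first appears.
    val spark = SparkSession.builder.appName("file-source").getOrCreate()
    val schema = new StructType()
      .add("id", IntegerType)
      .add("value", StringType)
    val df = spark.readStream
      .schema(schema)
      .csv("/data/input")  // hypothetical watched directory

    // Producer side: write the complete file to a staging directory,
    // then move it into the watched directory in one step.
    val tmp = Paths.get("/tmp/staging/part-0001.csv")
    Files.createDirectories(tmp.getParent)
    Files.write(tmp, "id,value\n1,foo\n".getBytes(StandardCharsets.UTF_8))

    // Note: ATOMIC_MOVE is only atomic if /tmp/staging and /data/input
    // are on the same filesystem; across filesystems the move degrades
    // to a copy and Spark could again see a partial file.
    Files.move(tmp, Paths.get("/data/input/part-0001.csv"),
      StandardCopyOption.ATOMIC_MOVE)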
My questions: Is this behavior by design, or is there a way to tell the
Spark streaming process not to pick up a file while it is still being
written by another process? In other words, do we have to stage data
files in a tmp directory and move them, or can the data-producing
process and Spark share the same directory?
Thanks!
-- Nick