We were trying to use Structured Streaming with the file source, but had trouble getting Spark to read the files correctly.  Another process continuously generates data files in the directory Spark uses as its file source.  What we observed is that as soon as a data file is created, and before the producing process has finished writing it, Spark reads it immediately without waiting for EOF.  Spark then never revisits the file, so we end up with empty data.  The only way we could make it work was to produce the data files in a separate directory (e.g. /tmp) and move them into Spark's file source directory once the data generation completes (a minimal sketch of that move step is below).
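
For reference, here is a minimal sketch of the stage-then-move step we ended up using; the paths and file name are illustrative, not our real ones:

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}

object ProducerMove {
  def main(args: Array[String]): Unit = {
    // File is fully written and closed in a staging directory first (paths are examples).
    val stagedFile = Paths.get("/tmp/staging/part-0001.json")
    // Directory that Spark's file-source stream is watching.
    val watchedDir = Paths.get("/data/spark-input")

    Files.createDirectories(watchedDir)

    // Move (rename) the finished file into the watched directory.
    // On the same filesystem this is atomic, so Spark never observes a half-written file.
    Files.move(
      stagedFile,
      watchedDir.resolve(stagedFile.getFileName),
      StandardCopyOption.ATOMIC_MOVE
    )
  }
}
```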

My questions:  Is this behavior by design, or is there a way to tell the Spark streaming process not to ingest a file while it is still being written by another process?  In other words, do we have to stage data files in a tmp directory and move them over, or can the data producing process and Spark share the same directory?  (Our reader setup is roughly like the sketch below.)
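
For context, a minimal sketch of the kind of file-source stream we are running; the input format, paths, and console sink are placeholders, not our actual job:

```scala
import org.apache.spark.sql.SparkSession

object FileStreamReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("file-stream-example")
      .getOrCreate()

    // Structured Streaming file source watching the shared directory
    // (format "text" and the path are assumptions for illustration).
    val lines = spark.readStream
      .format("text")
      .load("/data/spark-input")

    // Write to the console just to observe what gets picked up.
    val query = lines.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```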

Thanks!

-- Nick

