Hi,

Yeah, we generally read files from HDFS or object stores like S3, GCS, etc., where files cannot be updated.
Regards,
Gourav

On Sun, 7 Jun 2020, 22:36 Jungtaek Lim, <kabhwan.opensou...@gmail.com> wrote:

> Hi Nick,
>
> I guess that's by design - Spark assumes the input file will not be
> modified once it is placed on the input path. This makes it easy for
> Spark to track the list of processed files vs. unprocessed files. If
> input files could be modified, Spark would have to enumerate all of the
> files and track how many lines/bytes it has read per file; in the worst
> case it might read an incomplete line (if the writer doesn't guarantee
> complete writes) and crash or produce incorrect results.
>
> Hope this helps.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Mon, Jun 8, 2020 at 2:43 AM ArtemisDev <arte...@dtechspace.com> wrote:
>
>> We were trying to use structured streaming from a file source, but had
>> problems getting the files read by Spark properly. We have another
>> process generating the data files in the Spark data source directory on
>> a continuous basis. What we observed was that the moment a data file was
>> created, before the data-producing process had finished, Spark read it
>> immediately without reaching EOF. Spark then never revisits the file, so
>> we ended up with empty data content. The only way to make it work is to
>> produce the data files in a separate directory (e.g. /tmp) and move them
>> to Spark's file source dir after the data generation completes.
>>
>> My questions: Is this behavior by design, or is there any way to control
>> the Spark streaming process so that it does not import a file while the
>> file is still being used by another process? In other words, do we have
>> to use the tmp dir to move data files around, or can the data-producing
>> process and Spark share the same directory?
>>
>> Thanks!
>>
>> -- Nick
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
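
A minimal sketch of the "write elsewhere, then move" workaround discussed above. This is illustrative, not code from the thread: the paths, the publish_atomically helper, and the record format are all assumptions made for the example.

import os

def publish_atomically(records, staging_dir, source_dir, filename):
    """Write a file completely in a staging directory, then move it into
    the directory Spark watches. os.rename() is atomic when both paths
    are on the same filesystem, so Spark never lists a partial file."""
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(source_dir, exist_ok=True)

    staged_path = os.path.join(staging_dir, filename)
    final_path = os.path.join(source_dir, filename)

    with open(staged_path, "w") as f:
        for record in records:
            f.write(record + "\n")
        f.flush()
        os.fsync(f.fileno())  # ensure the bytes are on disk before the move

    os.rename(staged_path, final_path)  # atomic only on the same filesystem

# Hypothetical usage: stage inside the watched directory so the rename stays
# on one mount. Spark's file listing normally skips paths starting with "_"
# or "." (the same rule that hides _SUCCESS markers), so the staging files
# are not picked up early.
publish_atomically(["line 1", "line 2"],
                   "/data/incoming/_staging",
                   "/data/incoming",
                   "batch-0001.txt")

One caveat with a literal /tmp staging directory: os.rename() fails across filesystems, and tools that fall back to copy-then-delete (shutil.move, or mv across mounts) are visible mid-write, which reintroduces the original problem. The reading side would then be the usual file stream source; again a sketch, with an assumed path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-stream").getOrCreate()

# Each file that appears in the directory is read once, whole; this is
# why a partially written file is never revisited.
lines = spark.readStream.format("text").load("/data/incoming")

query = lines.writeStream.format("console").start()
query.awaitTermination()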