Re: Structured Streaming using File Source - How to handle live files

2020-06-13 Thread Gourav Sengupta
Hi,

Yeah, we generally read files from HDFS or object stores like S3, GCS, etc., where files cannot be updated.

Regards,
Gourav

On Sun, 7 Jun 2020, 22:36 Jungtaek Lim wrote:
> Hi Nick,
>
> I guess that's by design - Spark assumes the input file will not be
> modified once it is placed on the

Re: Structured Streaming using File Source - How to handle live files

2020-06-07 Thread Jungtaek Lim
Hi Nick, I guess that's by design - Spark assumes the input file will not be modified once it is placed on the input path. This makes it easy for Spark to track the list of processed files versus unprocessed files. If input files could be modified, Spark would have to enumerate all of the files and track
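
One common way for a producer to respect that assumption is to finish writing each file somewhere else and only then move it into the input path, so the file source never sees a half-written file. The following is a minimal sketch of that idea, not something prescribed in this thread. The helper name publish_atomically, the staging directory, and the file names are all hypothetical, and it assumes the staging and input directories sit on the same local filesystem, where os.replace is an atomic rename. Object stores such as S3 do not offer the same rename semantics.

```python
import os
import tempfile


def publish_atomically(data: bytes, input_dir: str, final_name: str, staging_dir: str) -> str:
    """Write data to a temp file in staging_dir, then move it into input_dir.

    Hypothetical helper: assumes staging_dir and input_dir are on the same
    filesystem, so the final move is a single atomic rename.
    """
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(input_dir, exist_ok=True)

    # Create the file outside the directory that Spark is watching.
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before the move

        # The file appears in input_dir fully written, and is not touched again.
        final_path = os.path.join(input_dir, final_name)
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial temp file if anything went wrong before the move.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

A producer would call something like publish_atomically(payload, "/data/in", "part-0001.csv", "/data/staging"), with the streaming query reading only from the input directory (the paths here are placeholders).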

Structured Streaming using File Source - How to handle live files

2020-06-07 Thread ArtemisDev
We were trying to use Structured Streaming with the file source, but had problems getting Spark to read the files properly. We have another process generating the data files in the Spark data source directory on a continuous basis. What we observed was that the moment a data file is created