We have a single file directory that is shared by both the file
generator/publisher and the Spark job consumer. When consuming files in
micro-batches with Structured Streaming, we ran into the following problems:
1. We would like a Spark streaming job to consume only data files
   modified after a predefined date/time. You can specify the
   "modifiedAfter" option in DataFrameReader, but that option isn't
   available when reading files in Structured Streaming. Is there a
   specific reason why this option doesn't apply to Structured
   Streaming, given that the same reader API is used? Is there any way
   to work around this?
2. One common problem with Structured Streaming is that if a single
   file directory is used by both the file producer and the Spark
   consumer, the consumer may pick up a file as soon as it appears,
   before the producer has finished writing it (i.e., before the EOF
   marker is written), and it will not re-read the file after it is
   completed. You therefore end up with incomplete data in your
   DataFrame. We solved this by creating a separate consumer directory
   along with a custom script that moves each file from the producer
   directory to the consumer directory once its generation is complete.
   But in a real production environment this kind of customization may
   not be possible, as operations folks are usually reluctant to change
   any system configuration. Are there any options in DataFrameReader,
   or any other easy way, to solve this problem?
Thanks for your help in advance!