We have a single file directory that is shared by both the file
generator/publisher and the Spark job consumer. When consuming files in
micro-batches with Structured Streaming, we ran into the following problems:
1. We would like a Spark streaming job to consume only data files
   modified after a predefined date/time. You can specify the
   "modifiedAfter" option in DataFrameReader, but that option isn't
   available when reading files in Structured Streaming. Is there a
   specific reason why this option doesn't apply to Structured
   Streaming, given that the same reader API is used? Is there any way
   to work around this?
2. One common problem with Structured Streaming is that if a single
   file directory is used by both the file producer and the Spark
   consumer, the consumer may pick up a file as soon as it appears,
   before the producer has finished writing it (i.e., before the EOF
   marker is written), and it will not re-read the file after it is
   completed. You therefore end up with incomplete data in your
   DataFrame. We solved this by creating a separate consumer directory
   along with a custom script that moves each file from the producer
   directory to the consumer directory once its generation is complete.
   But in a real production environment this kind of customization may
   not be possible, as operations folks are usually reluctant to change
   any system configuration. Are there any options in DataFrameReader,
   or any other easy way, to solve this problem?
Thanks for your help in advance!