You're hitting an existing issue: https://issues.apache.org/jira/browse/SPARK-17604. While there's no active PR to address it, I've been planning to take a look at it sooner rather than later.
Btw, you may also want to take a look at my previous mail - the topic of that thread was the file stream sink metadata growing bigger, but it's basically the same issue, so you may get some useful information from there. (tl;dr: I have a bunch of PRs addressing multiple issues in the file stream source and sink; they're just lacking some love.)

https://lists.apache.org/thread.html/rb4ebf1d20d13db0a78694e8d301e51c326f803cb86fc1a1f66f2ae7e%40%3Cuser.spark.apache.org%3E

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Apr 21, 2020 at 8:23 PM Pappu Yadav <py.dc...@gmail.com> wrote:

> Hi Team,
>
> While running Spark, below are some findings.
>
> 1. FileStreamSourceLog is responsible for maintaining the input source
>    file list.
> 2. Spark Streaming deletes expired log files on the basis of
>    *spark.sql.streaming.fileSource.log.deletion* and
>    *spark.sql.streaming.minBatchesToRetain*.
> 3. But while compacting logs, Spark Streaming writes the complete list
>    of files the stream has seen so far into one single .compact file in HDFS.
> 4. Over the course of time this compact file grows to around 2GB-5GB
>    in HDFS, which delays creation of the compact file after every 10th
>    batch and also increases job restart time.
> 5. Why is Spark Streaming logging files that have already been deleted
>    from the system? While creating the compact file there should be some
>    configurable timeout so that Spark can skip writing the expired list
>    of input files.
>
> *Also kindly let me know if I missed something and there is some
> configuration already present to handle this.*
>
> Regards
> Pappu Yadav
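
To make the configuration knobs mentioned in the quoted mail concrete, here is a minimal sketch (not part of the original thread) of where they would be set on a SparkSession. The object name, input path, and values are illustrative assumptions; note that none of these settings bound the size of the .compact file itself, which is the gap tracked by SPARK-17604.

    import org.apache.spark.sql.SparkSession

    object FileStreamLogConfigSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("file-stream-log-config-sketch")
          // allow deletion of expired file source log entries (default: true)
          .config("spark.sql.streaming.fileSource.log.deletion", "true")
          // compact the file source log every N batches (default: 10,
          // matching the "every 10th batch" behaviour described above)
          .config("spark.sql.streaming.fileSource.log.compactInterval", "10")
          // minimum number of batches whose metadata must be retained (default: 100)
          .config("spark.sql.streaming.minBatchesToRetain", "100")
          .getOrCreate()

        // A file source stream like this is what populates FileStreamSourceLog;
        // the path below is a placeholder.
        val df = spark.readStream
          .format("text")
          .load("hdfs:///path/to/input")

        df.writeStream
          .format("console")
          .start()
          .awaitTermination()
      }
    }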