[ https://issues.apache.org/jira/browse/SPARK-20971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772632#comment-16772632 ]
Jungtaek Lim commented on SPARK-20971:
--------------------------------------

It may be better to clarify what we actually want to do here.

* If we are trying to reduce the number of log files, CompactibleFileStreamLog already covers that.
* If we are trying to remove entries for files that have been read and are guaranteed never to be re-read, CompactibleFileStreamLog makes things a bit more complicated, because compaction merges all existing entries together with the new entries into a single batch. `compactLogs` is the only place where entries are removed, which means entries can only be dropped in compacted batches.

Btw, calling `purge` breaks CompactibleFileStreamLog, since CompactibleFileStreamLog expects the non-compacted batches to still exist, while `purge` simply removes every metadata file matching the threshold. The safest approach seems to be to just disallow `purge` for CompactibleFileStreamLog; otherwise we have to reason about the caller's intention when `purge` is invoked, which is exactly the ambiguity I tried to clarify above. (A small sketch of disallowing `purge` follows below the quoted issue.)

> Purge the metadata log for FileStreamSource
> -------------------------------------------
>
>                 Key: SPARK-20971
>                 URL: https://issues.apache.org/jira/browse/SPARK-20971
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.1.1
>            Reporter: Shixiong Zhu
>            Priority: Major
>
> Currently
> [FileStreamSource.commit|https://github.com/apache/spark/blob/16186cdcbce1a2ec8f839c550e6b571bf5dc2692/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L258]
> is empty. We can delete unused metadata logs in this method to reduce the
> size of log files.
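To make the "disallow purge" suggestion above concrete, here is a minimal, self-contained Scala sketch. The names (`MetadataLogLike`, `CompactibleLogSketch`, `purgeCompactedBatches`) and the in-memory batch set are simplified stand-ins invented for illustration, not the actual Spark `HDFSMetadataLog`/`CompactibleFileStreamLog` classes: the point is only that a threshold-based `purge` is rejected, while cleanup is limited to batches already covered by a later compaction batch.

```scala
// Sketch only: simplified stand-ins, not the real Spark metadata log classes.
trait MetadataLogLike {
  def purge(thresholdBatchId: Long): Unit
}

class CompactibleLogSketch(compactInterval: Int) extends MetadataLogLike {
  // In-memory stand-in for the per-batch metadata files kept on disk.
  private val batches = scala.collection.mutable.SortedSet.empty[Long]

  def add(batchId: Long): Unit = batches += batchId

  // A compaction batch holds all entries from the batches before it
  // (assumed rule: every compactInterval-th batch is a compaction batch).
  private def isCompactionBatch(batchId: Long): Boolean =
    (batchId + 1) % compactInterval == 0

  // Disallow arbitrary purging: deleting non-compacted batches by threshold
  // would leave the log unable to build the next compaction batch.
  override def purge(thresholdBatchId: Long): Unit =
    throw new UnsupportedOperationException(
      "purge() would delete non-compacted batches that the next compaction still needs")

  // A safer cleanup: only drop batches already covered by the latest compaction batch.
  def purgeCompactedBatches(): Unit = {
    batches.filter(isCompactionBatch).lastOption.foreach { lastCompaction =>
      batches --= batches.filter(_ < lastCompaction)
    }
  }
}
```

The intent of the sketch is that callers never get to choose an arbitrary threshold; the log itself decides what is safe to delete, which removes the need to guess the caller's intention when `purge` is invoked.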