Jungtaek Lim created SPARK-30294:
------------------------------------

             Summary: Read-only state store unnecessarily creates and deletes the temp file for delta file every batch
                 Key: SPARK-30294
                 URL: https://issues.apache.org/jira/browse/SPARK-30294
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.0.0
            Reporter: Jungtaek Lim


[https://github.com/apache/spark/blob/d38f8167483d4d79e8360f24a8c0bffd51460659/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L143-L155]
{code:java}
    /** Abort all the updates made on this store. This store will not be usable any more. */
    override def abort(): Unit = {
      // This if statement is to ensure that files are deleted only if there are changes to the
      // StateStore. We have two StateStores for each task, one which is used only for reading, and
      // the other used for read+write. We don't want the read-only to delete state files.
      if (state == UPDATING) {
        state = ABORTED
        cancelDeltaFile(compressedStream, deltaFileStream)
      } else {
        state = ABORTED
      }
      logInfo(s"Aborted version $newVersion for $this")
    }
{code}
Despite the comment, the read-only state store still performs all the preparation for a write on every batch: it creates the temporary file, initializes output streams for that file, closes the streams, and then deletes the temporary file. This work is unnecessary, and it is also confusing, because the log messages make it look as if two different instances are writing to the same delta file.
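One possible direction (a simplified standalone sketch, not the actual Spark classes or a proposed patch) is to defer creation of the temp file and its streams until the first write, so a store that is only ever read from has nothing to create or delete on abort. The class and member names below are hypothetical stand-ins for the real `HDFSBackedStateStore` internals:

```scala
// Hypothetical sketch: lazily open the delta output stream on first write,
// so an abort() on a never-written (read-only) store touches no files.
import java.io.{ByteArrayOutputStream, OutputStream}

class LazyDeltaStore {
  private var state: String = "UPDATING"
  private var streamOpened = false

  // Created on first access instead of eagerly in the constructor.
  // A ByteArrayOutputStream stands in for the real temp-file stream.
  private lazy val deltaStream: OutputStream = {
    streamOpened = true
    new ByteArrayOutputStream()
  }

  def put(key: String, value: String): Unit = {
    deltaStream.write(s"$key=$value\n".getBytes("UTF-8"))
  }

  def abort(): Unit = {
    state = "ABORTED"
    // Only clean up if a write actually opened the stream; a read-only
    // store never opened it, so there is no temp file to delete.
    if (streamOpened) {
      deltaStream.close()
      // the real implementation would also delete the temp delta file here
    }
  }

  def openedStream: Boolean = streamOpened
  def currentState: String = state
}
```

With this shape, the `if (state == UPDATING)` guard in `abort()` no longer has to compensate for eagerly created files, because the read-only instance never creates them in the first place.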

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
