[jira] [Commented] (SPARK-30294) Read-only state store unnecessarily creates and deletes the temp file for delta file every batch
[ https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230321#comment-17230321 ]

Apache Spark commented on SPARK-30294:
--------------------------------------

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/30344

> Read-only state store unnecessarily creates and deletes the temp file for delta file every batch
>
> Key: SPARK-30294
> URL: https://issues.apache.org/jira/browse/SPARK-30294
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.1.0
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Minor
> Fix For: 3.1.0
>
> https://github.com/apache/spark/blob/d38f8167483d4d79e8360f24a8c0bffd51460659/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L143-L155
>
> {code:java}
> /** Abort all the updates made on this store. This store will not be usable any more. */
> override def abort(): Unit = {
>   // This if statement is to ensure that files are deleted only if there are changes to the
>   // StateStore. We have two StateStores for each task, one which is used only for reading, and
>   // the other used for read+write. We don't want the read-only to delete state files.
>   if (state == UPDATING) {
>     state = ABORTED
>     cancelDeltaFile(compressedStream, deltaFileStream)
>   } else {
>     state = ABORTED
>   }
>   logInfo(s"Aborted version $newVersion for $this")
> }
> {code}
>
> Despite the comment, the read-only state store still does all the work of preparing a write: it creates the temporary file, initializes output streams for the file, closes those streams, and then deletes the temporary file. That is unnecessary and causes confusion, since according to the log messages two different instances appear to write to the same delta file.
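For context, the wasted work described above comes from eagerly opening the delta-file streams when the store is created, so even a read-only store must later "cancel" a temp file it never needed. A minimal, self-contained sketch of the underlying idea (all names here, such as `LazyAbortSketch` and `openDelta`, are hypothetical and not Spark's actual API): defer the temp-file setup until the first write, so a store that only reads never touches the filesystem and its abort() is a pure state transition.

```scala
// Hypothetical sketch, not Spark code: lazily create the delta temp file
// so a read-only store never pays the create/delete cost on abort().
object LazyAbortSketch {
  sealed trait State
  case object Updating extends State
  case object Aborted extends State

  // Counters stand in for real filesystem side effects.
  var tempFilesCreated = 0
  var tempFilesDeleted = 0

  class StateStore(val readOnly: Boolean) {
    var state: State = Updating
    private var deltaOpened = false

    // Only forced on the first write, never for a read-only store.
    private def openDelta(): Unit = {
      deltaOpened = true
      tempFilesCreated += 1
    }

    def put(key: String, value: String): Unit = {
      require(!readOnly, "cannot write to a read-only store")
      if (!deltaOpened) openDelta()
      // ... write key/value to the delta stream ...
    }

    def abort(): Unit = {
      // Delete the temp file only if it was actually created.
      if (state == Updating && deltaOpened) tempFilesDeleted += 1
      state = Aborted
    }
  }
}
```

With this shape, calling abort() on a read-only store performs no file operations at all, while a store that actually wrote something still cleans up its temp file, which is the behavior the comment in abort() was trying to guarantee.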
--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198682#comment-17198682 ]

Jungtaek Lim commented on SPARK-30294:
--------------------------------------

Agree. Let me update the type.

> Read-only state store unnecessarily creates and deletes the temp file for delta file every batch
[ https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198600#comment-17198600 ]

L. C. Hsieh commented on SPARK-30294:
-------------------------------------

Since this doesn't cause an error or a correctness issue, it seems to me more like an improvement than a bug.

> Read-only state store unnecessarily creates and deletes the temp file for delta file every batch
[ https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998935#comment-16998935 ]

Jungtaek Lim commented on SPARK-30294:
--------------------------------------

Working on the fix. I'll first propose the solution that opens up the chance to optimize the read-only state store, and fall back to a workaround if the community is not happy with that approach.

> Read-only state store unnecessarily creates and deletes the temp file for delta file every batch