[jira] [Commented] (SPARK-30294) Read-only state store unnecessarily creates and deletes the temp file for delta file every batch

2020-11-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230321#comment-17230321
 ] 

Apache Spark commented on SPARK-30294:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/30344

> Read-only state store unnecessarily creates and deletes the temp file for 
> delta file every batch
> 
>
>                 Key: SPARK-30294
>                 URL: https://issues.apache.org/jira/browse/SPARK-30294
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Minor
>             Fix For: 3.1.0
>
>
> [https://github.com/apache/spark/blob/d38f8167483d4d79e8360f24a8c0bffd51460659/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L143-L155]
> {code:java}
> /** Abort all the updates made on this store. This store will not be usable any more. */
> override def abort(): Unit = {
>   // This if statement is to ensure that files are deleted only if there are changes to the
>   // StateStore. We have two StateStores for each task, one which is used only for reading, and
>   // the other used for read+write. We don't want the read-only to delete state files.
>   if (state == UPDATING) {
>     state = ABORTED
>     cancelDeltaFile(compressedStream, deltaFileStream)
>   } else {
>     state = ABORTED
>   }
>   logInfo(s"Aborted version $newVersion for $this")
> }
> {code}
> Despite the comment, the read-only state store also goes through the same 
> steps to prepare for a write: it creates the temporary file, initializes 
> output streams for the file, closes these output streams, and deletes the 
> temporary file. This is unnecessary and confusing, because according to the 
> log messages two different instances seem to write to the same delta file.
>  
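
To make the behaviour described above concrete, here is a minimal sketch of one way a store could avoid the wasted work: create the temporary delta file lazily, on the first write, so that a store used only for reading never touches the file system. This is an illustration under stated assumptions, not Spark's code and not necessarily the approach taken in the linked pull request. The class `LazyDeltaStore` and its members (`put`, `abort`, `tempFileCreated`) are hypothetical names, and it writes to a local directory through `java.nio.file` instead of a Hadoop file system.

```scala
import java.io.OutputStream
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical stand-in for a state store instance; not Spark's HDFSBackedStateStore.
class LazyDeltaStore(dir: Path, version: Long) {
  // Both stay empty until the first write, so a read-only store never sets them.
  private var tempFile: Option[Path] = None
  private var out: Option[OutputStream] = None
  private var aborted = false

  /** True while a temporary delta file created by this store exists. */
  def tempFileCreated: Boolean = tempFile.isDefined

  // Create the temp file and open its stream only when a write actually happens.
  private def stream(): OutputStream = out.getOrElse {
    val file = Files.createTempFile(dir, s"$version.delta.", ".tmp")
    val opened = Files.newOutputStream(file)
    tempFile = Some(file)
    out = Some(opened)
    opened
  }

  /** Record an update; this is the only path that touches the file system. */
  def put(key: String, value: String): Unit = {
    require(!aborted, "store was aborted")
    stream().write(s"$key=$value\n".getBytes(StandardCharsets.UTF_8))
  }

  /** Abort: close and delete the temp file only if one was ever created. */
  def abort(): Unit = {
    out.foreach(_.close())
    tempFile.foreach(file => Files.deleteIfExists(file))
    out = None
    tempFile = None
    aborted = true
  }
}
```

With this shape, calling `abort()` on an instance that never received a `put` creates and deletes nothing, while an instance that did write cleans up its own temporary file. Only the writing instance would then have a reason to log anything about the delta file.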



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30294) Read-only state store unnecessarily creates and deletes the temp file for delta file every batch

2020-09-19 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198682#comment-17198682
 ] 

Jungtaek Lim commented on SPARK-30294:
--

Agree. Let me update the type.

> Read-only state store unnecessarily creates and deletes the temp file for 
> delta file every batch
> 
>
>                 Key: SPARK-30294
>                 URL: https://issues.apache.org/jira/browse/SPARK-30294
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Minor
>






[jira] [Commented] (SPARK-30294) Read-only state store unnecessarily creates and deletes the temp file for delta file every batch

2020-09-18 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198600#comment-17198600
 ] 

L. C. Hsieh commented on SPARK-30294:
-

As this doesn't cause an error or a correctness issue, it seems to me more 
like an improvement than a bug.

> Read-only state store unnecessarily creates and deletes the temp file for 
> delta file every batch
> 
>
>                 Key: SPARK-30294
>                 URL: https://issues.apache.org/jira/browse/SPARK-30294
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Minor
>






[jira] [Commented] (SPARK-30294) Read-only state store unnecessarily creates and deletes the temp file for delta file every batch

2019-12-18 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998935#comment-16998935
 ] 

Jungtaek Lim commented on SPARK-30294:
--

Working on the fix. I may first propose the solution that opens up the chance 
to optimize for the read-only state store, and fall back to a workaround if 
the community is not happy with it.

> Read-only state store unnecessarily creates and deletes the temp file for 
> delta file every batch
> 
>
>                 Key: SPARK-30294
>                 URL: https://issues.apache.org/jira/browse/SPARK-30294
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Minor
>


