[ https://issues.apache.org/jira/browse/SPARK-23004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tathagata Das reassigned SPARK-23004: ------------------------------------- Assignee: Tathagata Das > Structured Streaming raise "llegalStateException: Cannot remove after already > committed or aborted" > --------------------------------------------------------------------------------------------------- > > Key: SPARK-23004 > URL: https://issues.apache.org/jira/browse/SPARK-23004 > Project: Spark > Issue Type: Bug > Components: Input/Output > Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.3.0 > Environment: Run on yarn or local both raise the exception. > Reporter: secfree > Assignee: Tathagata Das > Priority: Major > > A structured streaming query with a streaming aggregation can throw the > following error in rare cases. > {code:java} > java.lang.IllegalStateException: Cannot remove after already committed or > aborted at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$verify(HDFSBackedStateStoreProvider.scala:659 > ) at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.remove(HDFSBackedStateStoreProvider.scala:143) > at > org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3$$anon$1.hasNext(statefulOperators.scala:233) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:80) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:111) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:103) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) 18-01-05 13:29:57 WARN > TaskSetManager:66 Lost task 68.0 in stage 1933.0 (TID 196171, localhost, > executor driver): java.lang.IllegalStateException: Cannot remove after > already committed or aborted at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$verify(HDFSBackedStateStoreProvider.scala:659){code} > > This can happen when the following conditions are accidentally hit. > # Streaming aggregation with aggregation function that is a subset of > {{[TypedImperativeAggregation|https://github.com/apache/spark/blob/76b8b840ddc951ee6203f9cccd2c2b9671c1b5e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L473]}} > (for example, {{collect_set}}, {{collect_list}}, {{percentile}}, etc.). > # Query running in {{update}} mode > # After the shuffle, a partition has exactly 128 records. > This happens because of the following. > # The {{StateStoreSaveExec}} used in streaming aggregations has the > [following > logic|https://github.com/apache/spark/blob/76b8b840ddc951ee6203f9cccd2c2b9671c1b5e8/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L359] > when used in {{update}} mode. > ## There is an iterator that reads data from its parent iterator and updates > the StateStore. > ## When the parent iterator is fully consumed (i.e. {{baseIterator.hasNext}} > returns false) then all state changes are committed by calling > {{StateStore.commit}}. > ## The implementation of {{StateStore.commit()}} in {{HDFSBackedStateStore}} > does not allow itself to be called twice. However, the logic is such that, if > {{hasNext}} is called multiple times after {{baseIterator.hasNext}} has > returned false then each time it will call {{StateStore.commit}}. > ## For most aggregation functions, this is okay because {{hasNext}} is only > called once. But thats not the case with {{ImperativeTypedAggregates}}. > # {{ImperativeTypedAggregates}} are executed using > {{ObjectHashAggregateExec}} which will try to use two kinds of hashmaps for > aggregations. > ## It will first try to use an unsorted hashmap. If the size of the hashmap > increases beyond a certain threshold (default 128), then it will switch to > using a sorted hashmap. > ## The [switching > logic|https://github.com/apache/spark/blob/76b8b840ddc951ee6203f9cccd2c2b9671c1b5e8/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala] > in {{ObjectAggregationIterator}} (used by {{ObjectHashAggregateExec}}) is > such that when the number of records matches the threshold (i.e. 128), it > will end up calling the {{iterator.hasNext}} twice. > When combined with the above two conditions are combined, it leads to the > above error. This latent bug has existed since Spark 2.1 when > {{ObjectHashAggregateExec}} was introduced in Spark. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org