[jira] [Commented] (FLINK-20044) Disposal of RocksDB could last forever
[ https://issues.apache.org/jira/browse/FLINK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466708#comment-17466708 ]

Yuan Mei commented on FLINK-20044:
----------------------------------

This task seems to be a duplicate of FLINK-5463.

> Disposal of RocksDB could last forever
> --------------------------------------
>
>                 Key: FLINK-20044
>                 URL: https://issues.apache.org/jira/browse/FLINK-20044
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 1.9.0
>            Reporter: Jiayi Liao
>            Priority: Minor
>              Labels: auto-deprioritized-major, auto-deprioritized-minor
>
> The task cannot fail itself because it's stuck on the disposal of RocksDB,
> which also affects the job. I have seen this several times in recent months;
> most of the errors come from a broken disk. But I think we should also do
> something to deal with it more elegantly from Flink's perspective.
> {code:java}
> "LookUp_Join -> Sink_Unnamed (898/1777)- execution # 4" #411 prio=5 os_prio=0 tid=0x7fc9b0286800 nid=0xff6fc runnable [0x7fc966cfc000]
>    java.lang.Thread.State: RUNNABLE
>         at org.rocksdb.RocksDB.disposeInternal(Native Method)
>         at org.rocksdb.RocksObject.disposeInternal(RocksObject.java:37)
>         at org.rocksdb.AbstractImmutableNativeReference.close(AbstractImmutableNativeReference.java:57)
>         at org.apache.flink.util.IOUtils.closeQuietly(IOUtils.java:263)
>         at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.dispose(RocksDBKeyedStateBackend.java:349)
>         at org.apache.flink.streaming.api.operators.AbstractStreamOperator.dispose(AbstractStreamOperator.java:371)
>         at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:124)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:618)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:517)
>         at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:733)
>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:539)
>         at java.lang.Thread.run(Thread.java:748)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (FLINK-20044) Disposal of RocksDB could last forever
[ https://issues.apache.org/jira/browse/FLINK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436637#comment-17436637 ]

Yu Li commented on FLINK-20044:
-------------------------------

Thanks for the quick response [~wind_ljy]. Sure, let's keep watching.

>            Priority: Minor
>              Labels: auto-deprioritized-major, stale-minor
[jira] [Commented] (FLINK-20044) Disposal of RocksDB could last forever
[ https://issues.apache.org/jira/browse/FLINK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436634#comment-17436634 ]

Jiayi Liao commented on FLINK-20044:
------------------------------------

[~liyu] We haven't upgraded our Flink version recently, but after reviewing the code on the latest branch I think the problem is still valid. How about we keep watching the issue and see if there is any feedback from other users?

>            Priority: Minor
>              Labels: auto-deprioritized-major, stale-minor
[jira] [Commented] (FLINK-20044) Disposal of RocksDB could last forever
[ https://issues.apache.org/jira/browse/FLINK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436616#comment-17436616 ]

Yu Li commented on FLINK-20044:
-------------------------------

[~wind_ljy] Are we still observing the same issue in the production environment? The ticket is a little stale, but we will keep watching it if later releases still have the issue. Thanks.

>            Priority: Minor
>              Labels: auto-deprioritized-major, stale-minor
[jira] [Commented] (FLINK-20044) Disposal of RocksDB could last forever
[ https://issues.apache.org/jira/browse/FLINK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336038#comment-17336038 ]

Flink Jira Bot commented on FLINK-20044:
----------------------------------------

This issue was labeled "stale-major" 7 days ago and has not received any updates, so it is being deprioritized. If this ticket is actually Major, please raise the priority and ask a committer to assign you the issue or revive the public discussion.

>            Priority: Major
>              Labels: stale-major
[jira] [Commented] (FLINK-20044) Disposal of RocksDB could last forever
[ https://issues.apache.org/jira/browse/FLINK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327556#comment-17327556 ]

Flink Jira Bot commented on FLINK-20044:
----------------------------------------

This major issue is unassigned and itself and all of its Sub-Tasks have not been updated for 30 days. So, it has been labeled "stale-major". If this ticket is indeed "major", please either assign yourself or give an update. Afterwards, please remove the label. In 7 days the issue will be deprioritized.

>            Priority: Major
>              Labels: stale-major
[jira] [Commented] (FLINK-20044) Disposal of RocksDB could last forever
[ https://issues.apache.org/jira/browse/FLINK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228504#comment-17228504 ]

Yu Li commented on FLINK-20044:
-------------------------------

I could see you also reopened FLINK-5463 and from the description these two JIRAs are reporting the same issue. I'm linking these two together and I believe we should mark one as duplicate of the other once confirmed.

>            Priority: Major
[jira] [Commented] (FLINK-20044) Disposal of RocksDB could last forever
[ https://issues.apache.org/jira/browse/FLINK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228315#comment-17228315 ]

Jiayi Liao commented on FLINK-20044:
------------------------------------

[~sewen] I guess this is not a cancellation situation, because I didn't see any cancellation logs on the TaskExecutor. I think the task throws an exception and, in the try-finally block in {{StreamTask}}, the thread hangs on {{disposeAllOperators}}. We also cannot observe the exception, because it is logged in {{Task}}, which only runs after the try-finally block in {{StreamTask}} has completed.

>            Priority: Major
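The sequence described in this comment can be illustrated with a small, self-contained Java sketch. It is not Flink code: {{HangingFinallySketch}}, {{invoke}}, {{runTask}} and {{demo}} are hypothetical names, and a {{CountDownLatch}} stands in for the native RocksDB call that does not return. It only shows why a failure thrown inside a try block stays invisible to the caller for as long as the finally block is blocked.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for the StreamTask/Task interaction; not Flink code. */
class HangingFinallySketch {

    /** Mimics StreamTask.invoke: the body fails, then cleanup blocks until released. */
    static void invoke(CountDownLatch disposeMayFinish) throws Exception {
        try {
            throw new IllegalStateException("simulated failure while processing");
        } finally {
            // Stands in for disposeAllOperators() stuck inside a native call.
            disposeMayFinish.await();
        }
    }

    /** Mimics Task.run: the failure is only recorded after invoke() has returned. */
    static Thread runTask(CountDownLatch disposeMayFinish, AtomicReference<Throwable> logged) {
        Thread t = new Thread(() -> {
            try {
                invoke(disposeMayFinish);
            } catch (Throwable failure) {
                logged.set(failure); // the "log statement" of the outer layer
            }
        }, "task-thread");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /**
     * Returns {aliveWhileStuck, loggedWhileStuck, loggedAfterRelease}.
     */
    static boolean[] demo() {
        CountDownLatch release = new CountDownLatch(1);
        AtomicReference<Throwable> logged = new AtomicReference<>();
        Thread task = runTask(release, logged);
        try {
            task.join(200); // cleanup is blocked, so the thread cannot finish
            boolean aliveWhileStuck = task.isAlive();
            boolean loggedWhileStuck = logged.get() != null;
            release.countDown(); // let the simulated dispose return
            task.join(5_000);
            boolean loggedAfterRelease = logged.get() instanceof IllegalStateException;
            return new boolean[] {aliveWhileStuck, loggedWhileStuck, loggedAfterRelease};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

While the simulated dispose is blocked, the task thread is alive and nothing has been logged; the original exception only becomes visible once the cleanup returns. If the cleanup never returns, as in the thread dump above, the exception is never reported.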
[jira] [Commented] (FLINK-20044) Disposal of RocksDB could last forever
[ https://issues.apache.org/jira/browse/FLINK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228297#comment-17228297 ]

Stephan Ewen commented on FLINK-20044:
--------------------------------------

[~wind_ljy] The TaskManagers should kill the process after some time if the cancellation does not succeed. Is that not happening here?

>            Priority: Major
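As a rough illustration of the kind of watchdog this comment refers to, the sketch below waits a bounded time for a task thread to exit and escalates if it is still alive. It shows the general pattern only and is an assumption, not the TaskManager's actual implementation: {{CancellationWatchdogSketch}}, {{awaitOrEscalate}} and {{startStuckThread}} are hypothetical names, and the {{onFatal}} callback stands in for whatever a real process would do (typically terminating the JVM).

```java
/** Hypothetical watchdog sketch; names and behaviour are assumptions, not Flink's implementation. */
class CancellationWatchdogSketch {

    /**
     * Interrupts the task thread, waits up to timeoutMillis for it to exit and,
     * if it is still alive afterwards, runs onFatal.
     *
     * @return true if the fatal handler was invoked
     */
    static boolean awaitOrEscalate(Thread taskThread, long timeoutMillis, Runnable onFatal) {
        if (timeoutMillis <= 0) {
            // Thread.join(0) would wait forever, which defeats the purpose of a watchdog.
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        try {
            taskThread.interrupt(); // ask nicely first; a native call may ignore this
            taskThread.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (taskThread.isAlive()) {
            onFatal.run();
            return true;
        }
        return false;
    }

    /** Starts a daemon thread that ignores interrupts, standing in for a stuck native dispose. */
    static Thread startStuckThread() {
        Thread t = new Thread(() -> {
            while (true) {
                try {
                    Thread.sleep(1_000);
                } catch (InterruptedException ignored) {
                    // Swallow the interrupt, as a blocked native call effectively does.
                }
            }
        }, "stuck-dispose");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

For example, {{awaitOrEscalate(startStuckThread(), 100, handler)}} invokes the handler after roughly 100 ms, whereas a thread that exits in time never triggers it. Note that such a watchdog only helps if something starts it; it does nothing for a thread that hangs outside a cancellation path.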