[jira] [Commented] (HBASE-25720) Sync WAL stuck when prepare flush cache will prevent flush cache and cause OOM
[ https://issues.apache.org/jira/browse/HBASE-25720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617378#comment-17617378 ]

March Wang commented on HBASE-25720:

Hi [~Xiaolin Ha], sorry for the late response. I will do it. Thank you so much!

> Sync WAL stuck when prepare flush cache will prevent flush cache and cause OOM
>
>                 Key: HBASE-25720
>                 URL: https://issues.apache.org/jira/browse/HBASE-25720
>             Project: HBase
>          Issue Type: Improvement
>   Affects Versions: 1.4.13
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>         Attachments: prepare-flush-cache-stuck.png
>
> We call HRegion#doSyncOfUnflushedWALChanges when preparing to flush the cache.
> But this WAL sync may get stuck and abort the cache flush.
> !prepare-flush-cache-stuck.png|width=519,height=246!
> If we are not aware of this problem in time, the RS will OOM and be killed.
> I think we should force an abort of the RS when the sync is stuck while preparing,
> as is done when committing snapshots.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HBASE-25720) Sync WAL stuck when prepare flush cache will prevent flush cache and cause OOM
[ https://issues.apache.org/jira/browse/HBASE-25720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614598#comment-17614598 ]

Xiaolin Ha commented on HBASE-25720:

Hi [~MarchWang], I saw the problem you described in HBASE-27413. The idea here is to abort the RS as soon as a WAL sync fails, even while preparing to flush the memstore; currently, only a failure while committing the memstore flush aborts the RS. The PR was not accepted, but you can backport it; we have been running it smoothly in our production environment. For the stuck WAL sync problems, several issues are helpful, and I think they can solve most of your problems, especially HBASE-22301, HBASE-26347, and HBASE-25905.
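The idea described in this comment could be sketched roughly as follows. This is not the rejected PR and not HBase code: `Wal`, `Abortable`, `prepareFlush`, and `timeoutMs` are hypothetical stand-ins invented for illustration, and the real flush path in `HRegion` is considerably more involved. The sketch only shows the shape of the change, assuming the WAL sync can be waited on with a bounded timeout: if the sync fails or does not finish in time during the prepare phase, the server is aborted instead of leaving the flush blocked while the memstore keeps growing.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

class PrepareFlushSketch {

    /** Hypothetical stand-in for the WAL: the future completes once the sync is durable. */
    interface Wal {
        CompletableFuture<Void> sync();
    }

    /** Hypothetical stand-in for the region server's abort hook. */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    /**
     * Prepare phase of a flush: wait for unflushed WAL edits to sync, but only
     * for a bounded time. Returns true if the flush may proceed. On a failed or
     * timed-out sync it aborts the server and returns false.
     */
    static boolean prepareFlush(Wal wal, Abortable server, long timeoutMs) {
        try {
            wal.sync().get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            // Fail fast rather than letting unflushed data pile up in memory.
            server.abort("WAL sync failed or timed out while preparing flush", e);
            return false;
        } catch (InterruptedException e) {
            // Restore the interrupt flag and give up on this flush attempt.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        AtomicReference<String> abortReason = new AtomicReference<>();
        Abortable server = (why, cause) -> abortReason.set(why);

        Wal healthy = () -> CompletableFuture.completedFuture(null);
        Wal stuck = () -> new CompletableFuture<Void>(); // never completes

        System.out.println("healthy WAL, proceed = " + prepareFlush(healthy, server, 100));
        System.out.println("stuck WAL, proceed = " + prepareFlush(stuck, server, 100));
        System.out.println("abort reason = " + abortReason.get());
    }
}
```

Whether aborting is the right response, and what timeout is reasonable, are deployment decisions; the sketch only illustrates treating a prepare-phase sync failure the same way as a commit-phase one.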
[jira] [Commented] (HBASE-25720) Sync WAL stuck when prepare flush cache will prevent flush cache and cause OOM
[ https://issues.apache.org/jira/browse/HBASE-25720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613267#comment-17613267 ]

jason commented on HBASE-25720:

I'm confused about the status: the status is 'Resolved' but the resolution is 'Won't Fix'. I checked the GitHub commit, and it was not accepted. Interesting.
[jira] [Commented] (HBASE-25720) Sync WAL stuck when prepare flush cache will prevent flush cache and cause OOM
[ https://issues.apache.org/jira/browse/HBASE-25720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613248#comment-17613248 ]

March Wang commented on HBASE-25720:

Hi [~Xiaolin Ha], could you please let me know how to fix this issue? Thanks!
[jira] [Commented] (HBASE-25720) Sync WAL stuck when prepare flush cache will prevent flush cache and cause OOM
[ https://issues.apache.org/jira/browse/HBASE-25720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378168#comment-17378168 ]

Michael Stack commented on HBASE-25720:

[~Xiaolin Ha] I ask because I'm looking at a related issue around AsyncFSWAL, HBASE-26042. Thanks.
[jira] [Commented] (HBASE-25720) Sync WAL stuck when prepare flush cache will prevent flush cache and cause OOM
[ https://issues.apache.org/jira/browse/HBASE-25720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377934#comment-17377934 ]

Xiaolin Ha commented on HBASE-25720:

Hi [~stack], we have always noticed this problem only after the RS killed itself. Sorry, there is no jstack now, and we have no further idea why the WAL got stuck. But we have written a monitoring script for this problem. I'll attach the jstack files once we get them and will dig further into this problem. Thanks.
[jira] [Commented] (HBASE-25720) Sync WAL stuck when prepare flush cache will prevent flush cache and cause OOM
[ https://issues.apache.org/jira/browse/HBASE-25720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377532#comment-17377532 ]

Michael Stack commented on HBASE-25720:

Is there anything in the log before your png that perhaps shows how or why the WAL system is stuck? A jstack? Thanks [~Xiaolin Ha]