RE: Could not clear historyMap due to WAL reservation on cp

2020-12-08 Thread shivakumar
Hi Alexandr Shapkin
With master branch codebase also I'm able to re-produce this issue
(reproducible on 2.7.6, 2.9 and master)
Looks like there is a major bug in the way older WAL segment clean-up is
implemented during checkpoint.
Not sure how connecting to visor is causing this issue, any chance that
Visor puts some deadlock or lock on WAL segments ?

I grep for warning messages in server logs and i saw "history map size is"
growing like 80,81,82..
I don't know how WAL segment clean-up is implemented, do you suspect on
anything ??
There are no exceptions also in server logs.
I'm thinking to file a bug as infinitely growing WAL is a major concern. 



 
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:39:35,555Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
80"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:40:27,647Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
81"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:41:19,744Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
82"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:42:12,946Z","logger":"CheckpointPagesWriterFactory","timezone":"UTC","marker":"","log":"1
checkpoint pages were not written yet due to unsuccessful page write lock
acquisition and will be retried"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:42:12,946Z","logger":"CheckpointPagesWriterFactory","timezone":"UTC","marker":"","log":"2
checkpoint pages were not written yet due to unsuccessful page write lock
acquisition and will be retried"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:42:16,872Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
83"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:43:16,064Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
84"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:44:17,383Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
85"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:44:56,140Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
86"}
{"type":"log","host":"ignite-cluster-ignite-shiva-0","level":"WARN","systemid":"039c963b","system":"ignite-service","time":"2020-12-08T15:45:42,978Z","logger":"CheckpointHistory","timezone":"UTC","marker":"","log":"Could
not clear historyMap due to WAL reservation on cp: CheckpointEntry
[id=d9335266-e4d8-4c08-bc44-e08f191526d0, timestamp=1607438456546,
ptr=WALPointer [idx=300, fileOff=148642467, len=10370]], history map size is
87"}

RE: Could not clear historyMap due to WAL reservation on cp

2020-12-07 Thread shivakumar
Thanks for your reply Alexandr Shapkin

The ticket you shared https://issues.apache.org/jira/browse/IGNITE-13373
more related to preloading
releaseHistoryForPreloading()
But the issue am facing is somehow WAL segments which are not needed for
crash recovery after checkpoint are not getting cleaned, Do you think there
is something fixed in the above jira ticket related to my issue ?

I tried collecting logs, the logs are very huge, i couldn't find any
exceptions or error messages but I saw below warnings: 
"Partition states validation has failed for group"
"Could not clear historyMap due to WAL reservation on cp"

Please find the attached logs files (not full log but part of logs where i
saw above warnings) instance1.log
  
historymap.log
  

Shiva



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


RE: Could not clear historyMap due to WAL reservation on cp

2020-12-06 Thread Alexandr Shapkin
Hi, Could you share the full logs with the exception? There is an interesting  JIRA ticket that could be the reason for this [1] [1] - https://issues.apache.org/jira/browse/IGNITE-13373   From: shivakumarSent: Thursday, December 3, 2020 10:51 AMTo: user@ignite.apache.orgSubject: Could not clear historyMap due to WAL reservation on cp Hi I have deployed 5 node Ignite cluster on K8S with persistence enabled(version 2.9.0 on java 11)I started ingesting data to 3 tables and after ingesting large amount ofdata using JDBC batch insertion (around 20 million records to each of 3tables with backup set to 1), now i connected to visor shell (from one ofthe pod which i deployed just to use as visor shell) using the same igniteconfig file which is used for ignite servers and after visor shell connectsto ignite cluster the unwanted wal record cleanup stopped (which should runpost checkpoint ) and WAL started growing linearly as there is continuousdata ingestion. This is making WAL disk run out of space and pods crashes. when i see logs there are continuous warning messages saying Could not clear historyMap due to WAL reservation on cp: CheckpointEntry[id=e8bb9c22-0709-416f-88d6-16c5ca534024, timestamp=1606979158669,ptr=FileWALPointer [idx=1468, fileOff=45255321, len=9857]], history map sizeis 4  and if i see checkpoint finish messages  Checkpoint finished [cpId=ca254956-5550-45d6-87c5-892b7e07b13b,pages=494933, markPos=FileWALPointer [idx=1472, fileOff=72673736, len=9857],walSegmentsCleared=0, walSegmentsCovered=[1470 - 1471], markDuration=464ms,pagesWrite=8558ms, fsync=3597ms, total=14027ms] here you can see walSegmentsCleared=0   means there are no WAl segmentscleared even after checkpoint, not sure what is causing this behaviour. we are ingesting very large data (~25mb/s)please someone help in this issue   regards,Shiva --Sent from: http://apache-ignite-users.70518.x6.nabble.com/