[jira] [Updated] (IGNITE-11749) Implement automatic pages history dump on CorruptedTreeException

2019-05-11 Thread Dmitriy Govorukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Govorukhin updated IGNITE-11749:

Fix Version/s: 2.8

> Implement automatic pages history dump on CorruptedTreeException
> 
>
> Key: IGNITE-11749
> URL: https://issues.apache.org/jira/browse/IGNITE-11749
> Project: Ignite
> Issue Type: Improvement
> Reporter: Alexey Goncharuk
> Priority: Major
> Fix For: 2.8
>
>
> Currently, the only way to debug possible bugs in checkpointer/recovery 
> mechanics is to manually parse WAL files after the corruption has happened. 
> This is not practical for two reasons. First, it requires manual actions that 
> depend on the content of the exception. Second, it is not always possible to 
> obtain the WAL files (they may contain sensitive data).
> We need to add a mechanism that dumps all information required for primary 
> analysis of the corruption to the exception handler. For example, if an 
> exception happened when materializing a link {{0xabcd}} written on an index 
> page {{0xdcba}}, we need to dump the change history of both pages and the 
> checkpoint records within the analysis interval. Possibly, we should also 
> include the FreeList pages that contained the aforementioned pages.
> Example of output:
> {noformat}
> [2019-05-07 11:57:57,350][INFO 
> ][test-runner-#58%diagnostic.DiagnosticProcessorTest%][PageHistoryDiagnoster] 
> Next WAL record :: PageSnapshot [fullPageId = FullPageId 
> [pageId=0002, effectivePageId=, 
> grpId=-2100569601], page = [
> Header [
>   type=11 (PageMetaIO),
>   ver=1,
>   crc=0,
>   pageId=844420635164672(offset=0, flags=10, partId=65535, index=0)
> ],
> PageMeta[
>   treeRoot=844420635164675,
>   lastSuccessfulFullSnapshotId=0,
>   lastSuccessfulSnapshotId=0,
>   nextSnapshotTag=1,
>   lastSuccessfulSnapshotTag=0,
>   lastAllocatedPageCount=0,
>   candidatePageCount=0
> ]],
> super = [WALRecord [size=4129, chainSize=0, pos=FileWALPointer [idx=0, 
> fileOff=103, len=4129], type=PAGE_RECORD]]]
> Next WAL record :: CheckpointRecord 
> [cpId=c6ba7793-113b-4b54-8530-45e1708ca44c, end=false, cpMark=FileWALPointer 
> [idx=0, fileOff=29, len=29], super=WALRecord [size=1963, chainSize=0, 
> pos=FileWALPointer [idx=0, fileOff=39686, len=1963], type=CHECKPOINT_RECORD]]
> Next WAL record :: PageSnapshot [fullPageId = FullPageId 
> [pageId=0002, effectivePageId=, 
> grpId=-1368047378], page = [
> Header [
>   type=11 (PageMetaIO),
>   ver=1,
>   crc=0,
>   pageId=844420635164672(offset=0, flags=10, partId=65535, index=0)
> ],
> PageMeta[
>   treeRoot=844420635164675,
>   lastSuccessfulFullSnapshotId=0,
>   lastSuccessfulSnapshotId=0,
>   nextSnapshotTag=1,
>   lastSuccessfulSnapshotTag=0,
>   lastAllocatedPageCount=0,
>   candidatePageCount=0
> ]],
> super = [WALRecord [size=4129, chainSize=0, pos=FileWALPointer [idx=0, 
> fileOff=55961, len=4129], type=PAGE_RECORD]]]
> Next WAL record :: CheckpointRecord 
> [cpId=145e599e-66fc-45f5-bde4-b0c392125968, end=false, cpMark=null, 
> super=WALRecord [size=21409, chainSize=0, pos=FileWALPointer [idx=0, 
> fileOff=13101788, len=21409], type=CHECKPOINT_RECORD]]
> {noformat}
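
To read the pageId values in the sample output above, it can help to split the 64-bit number into its parts. The following is a minimal Python sketch, not Ignite code: the bit layout (32-bit page index, 16-bit partition ID, 8-bit flags) is an assumption inferred from the sample values, and decode_page_id is a hypothetical helper name.

```python
# Assumed layout, inferred from the sample dump (not taken from Ignite sources):
# [ 8 unused bits | 8 bits flags | 16 bits partition ID | 32 bits page index ]
INDEX_BITS = 32
PART_BITS = 16
FLAG_BITS = 8


def decode_page_id(page_id: int) -> dict:
    """Split a page ID into flags, partition ID and page index."""
    index = page_id & ((1 << INDEX_BITS) - 1)
    part_id = (page_id >> INDEX_BITS) & ((1 << PART_BITS) - 1)
    flags = (page_id >> (INDEX_BITS + PART_BITS)) & ((1 << FLAG_BITS) - 1)
    return {"flags": flags, "part_id": part_id, "index": index}


if __name__ == "__main__":
    # pageId from the Header section of the sample record
    print(decode_page_id(844420635164672))
    # treeRoot from the PageMeta section of the sample record
    print(decode_page_id(844420635164675))
```

Under this assumed layout, the header pageId decodes to partId=65535 and index=0, which agrees with the values printed in the dump, and treeRoot decodes to index 3 in the same partition. The dump prints flags=10, which would match the binary form of the decoded value 2; that reading is also an assumption.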



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (IGNITE-11749) Implement automatic pages history dump on CorruptedTreeException

2019-05-07 Thread Dmitriy Govorukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Govorukhin updated IGNITE-11749:

Description: 
Currently, the only way to debug possible bugs in checkpointer/recovery 
mechanics is to manually parse WAL files after the corruption has happened. 
This is not practical for two reasons. First, it requires manual actions that 
depend on the content of the exception. Second, it is not always possible to 
obtain the WAL files (they may contain sensitive data).

We need to add a mechanism that dumps all information required for primary 
analysis of the corruption to the exception handler. For example, if an 
exception happened when materializing a link {{0xabcd}} written on an index 
page {{0xdcba}}, we need to dump the change history of both pages and the 
checkpoint records within the analysis interval. Possibly, we should also 
include the FreeList pages that contained the aforementioned pages.

Example of output:
{noformat}
[2019-05-07 11:57:57,350][INFO 
][test-runner-#58%diagnostic.DiagnosticProcessorTest%][PageHistoryDiagnoster] 
Next WAL record :: PageSnapshot [fullPageId = FullPageId 
[pageId=0002, effectivePageId=, grpId=-2100569601], 
page = [
Header [
type=11 (PageMetaIO),
ver=1,
crc=0,
pageId=844420635164672(offset=0, flags=10, partId=65535, index=0)
],
PageMeta[
treeRoot=844420635164675,
lastSuccessfulFullSnapshotId=0,
lastSuccessfulSnapshotId=0,
nextSnapshotTag=1,
lastSuccessfulSnapshotTag=0,
lastAllocatedPageCount=0,
candidatePageCount=0
]],
super = [WALRecord [size=4129, chainSize=0, pos=FileWALPointer [idx=0, 
fileOff=103, len=4129], type=PAGE_RECORD]]]
Next WAL record :: CheckpointRecord [cpId=c6ba7793-113b-4b54-8530-45e1708ca44c, 
end=false, cpMark=FileWALPointer [idx=0, fileOff=29, len=29], super=WALRecord 
[size=1963, chainSize=0, pos=FileWALPointer [idx=0, fileOff=39686, len=1963], 
type=CHECKPOINT_RECORD]]
Next WAL record :: PageSnapshot [fullPageId = FullPageId 
[pageId=0002, effectivePageId=, grpId=-1368047378], 
page = [
Header [
type=11 (PageMetaIO),
ver=1,
crc=0,
pageId=844420635164672(offset=0, flags=10, partId=65535, index=0)
],
PageMeta[
treeRoot=844420635164675,
lastSuccessfulFullSnapshotId=0,
lastSuccessfulSnapshotId=0,
nextSnapshotTag=1,
lastSuccessfulSnapshotTag=0,
lastAllocatedPageCount=0,
candidatePageCount=0
]],
super = [WALRecord [size=4129, chainSize=0, pos=FileWALPointer [idx=0, 
fileOff=55961, len=4129], type=PAGE_RECORD]]]
Next WAL record :: CheckpointRecord [cpId=145e599e-66fc-45f5-bde4-b0c392125968, 
end=false, cpMark=null, super=WALRecord [size=21409, chainSize=0, 
pos=FileWALPointer [idx=0, fileOff=13101788, len=21409], 
type=CHECKPOINT_RECORD]]
{noformat}
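
For primary analysis it may be convenient to split such a dump into individual records. The following is a minimal Python sketch, assuming only that every record starts with the marker "Next WAL record :: " followed by the record type, as in the sample above. The helper names split_records and count_by_type are hypothetical and not part of Ignite.

```python
import re
from collections import Counter
from typing import List, Tuple

# Every record in the sample dump starts with this marker and a type name.
RECORD_START = re.compile(r"Next WAL record :: (\w+)")


def split_records(dump: str) -> List[Tuple[str, str]]:
    """Return (record_type, record_text) pairs in the order they appear."""
    matches = list(RECORD_START.finditer(dump))
    records = []
    for i, match in enumerate(matches):
        # A record runs until the next marker or the end of the dump.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(dump)
        records.append((match.group(1), dump[match.start():end].strip()))
    return records


def count_by_type(dump: str) -> Counter:
    """Count how many records of each type the dump contains."""
    return Counter(record_type for record_type, _ in split_records(dump))


if __name__ == "__main__":
    sample = (
        "[2019-05-07 11:57:57,350][INFO ][PageHistoryDiagnoster]\n"
        "Next WAL record :: PageSnapshot [fullPageId = ...]\n"
        "Next WAL record :: CheckpointRecord [cpId=..., end=false]\n"
    )
    for record_type, text in split_records(sample):
        print(record_type, len(text))
```

Anything before the first marker, such as the logger prefix, is ignored by this sketch.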


[jira] [Updated] (IGNITE-11749) Implement automatic pages history dump on CorruptedTreeException

2019-05-07 Thread Anton Kalashnikov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Kalashnikov updated IGNITE-11749:
---
Description: 
Currently, the only way to debug possible bugs in checkpointer/recovery 
mechanics is to manually parse WAL files after the corruption has happened. 
This is not practical for two reasons. First, it requires manual actions that 
depend on the content of the exception. Second, it is not always possible to 
obtain the WAL files (they may contain sensitive data).

We need to add a mechanism that dumps all information required for primary 
analysis of the corruption to the exception handler. For example, if an 
exception happened when materializing a link {{0xabcd}} written on an index 
page {{0xdcba}}, we need to dump the change history of both pages and the 
checkpoint records within the analysis interval. Possibly, we should also 
include the FreeList pages that contained the aforementioned pages.

 

control.sh command:
{noformat}
control.sh --diagnostic page_history print_to_log print_to_file [page_ids ] [dump_path ] [--yes]

--diagnostic - command for dumping diagnostic info.

page_history - subcommand for dumping only the page history. Required.

page_ids {list_of_page_ids} - list of page IDs to dump.

print_to_log, print_to_file - where to dump (log, file, or both). At least 
one of them is required.

dump_path  - custom path to the dump folder (absolute, or relative to 
work_dir).

{noformat}
Example of output:
{noformat}
[2019-05-07 11:57:57,350][INFO 
][test-runner-#58%diagnostic.DiagnosticProcessorTest%][PageHistoryDiagnoster] 
Next WAL record :: PageSnapshot [fullPageId = FullPageId 
[pageId=0002, effectivePageId=, grpId=-2100569601], 
page = [
Header [
type=11 (PageMetaIO),
ver=1,
crc=0,
pageId=844420635164672(offset=0, flags=10, partId=65535, index=0)
],
PageMeta[
treeRoot=844420635164675,
lastSuccessfulFullSnapshotId=0,
lastSuccessfulSnapshotId=0,
nextSnapshotTag=1,
lastSuccessfulSnapshotTag=0,
lastAllocatedPageCount=0,
candidatePageCount=0
]],
super = [WALRecord [size=4129, chainSize=0, pos=FileWALPointer [idx=0, 
fileOff=103, len=4129], type=PAGE_RECORD]]]
Next WAL record :: CheckpointRecord [cpId=c6ba7793-113b-4b54-8530-45e1708ca44c, 
end=false, cpMark=FileWALPointer [idx=0, fileOff=29, len=29], super=WALRecord 
[size=1963, chainSize=0, pos=FileWALPointer [idx=0, fileOff=39686, len=1963], 
type=CHECKPOINT_RECORD]]
Next WAL record :: PageSnapshot [fullPageId = FullPageId 
[pageId=0002, effectivePageId=, grpId=-1368047378], 
page = [
Header [
type=11 (PageMetaIO),
ver=1,
crc=0,
pageId=844420635164672(offset=0, flags=10, partId=65535, index=0)
],
PageMeta[
treeRoot=844420635164675,
lastSuccessfulFullSnapshotId=0,
lastSuccessfulSnapshotId=0,
nextSnapshotTag=1,
lastSuccessfulSnapshotTag=0,
lastAllocatedPageCount=0,
candidatePageCount=0
]],
super = [WALRecord [size=4129, chainSize=0, pos=FileWALPointer [idx=0, 
fileOff=55961, len=4129], type=PAGE_RECORD]]]
Next WAL record :: CheckpointRecord [cpId=145e599e-66fc-45f5-bde4-b0c392125968, 
end=false, cpMark=null, super=WALRecord [size=21409, chainSize=0, 
pos=FileWALPointer [idx=0, fileOff=13101788, len=21409], 
type=CHECKPOINT_RECORD]]
{noformat}


[jira] [Updated] (IGNITE-11749) Implement automatic pages history dump on CorruptedTreeException

2019-05-07 Thread Anton Kalashnikov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Kalashnikov updated IGNITE-11749:
---
Description: 
Currently, the only way to debug possible bugs in checkpointer/recovery 
mechanics is to manually parse WAL files after the corruption has happened. 
This is not practical for two reasons. First, it requires manual actions that 
depend on the content of the exception. Second, it is not always possible to 
obtain the WAL files (they may contain sensitive data).

We need to add a mechanism that dumps all information required for primary 
analysis of the corruption to the exception handler. For example, if an 
exception happened when materializing a link {{0xabcd}} written on an index 
page {{0xdcba}}, we need to dump the change history of both pages and the 
checkpoint records within the analysis interval. Possibly, we should also 
include the FreeList pages that contained the aforementioned pages.

 

control.sh command:
{noformat}
control.sh --diagnostic page_history [page_ids pageId1,pageId2] print_to_log print_to_file [--yes]

--diagnostic - command for dumping diagnostic info.

page_history - subcommand for dumping only the page history. Required.

page_ids {list_of_page_ids} - list of page IDs to dump.

print_to_log, print_to_file - where to dump (log, file, or both). At least 
one of them is required.

{noformat}
Example of output:
{noformat}
[2019-05-07 11:57:57,350][INFO 
][test-runner-#58%diagnostic.DiagnosticProcessorTest%][PageHistoryDiagnoster] 
Next WAL record :: PageSnapshot [fullPageId = FullPageId 
[pageId=0002, effectivePageId=, grpId=-2100569601], 
page = [
Header [
type=11 (PageMetaIO),
ver=1,
crc=0,
pageId=844420635164672(offset=0, flags=10, partId=65535, index=0)
],
PageMeta[
treeRoot=844420635164675,
lastSuccessfulFullSnapshotId=0,
lastSuccessfulSnapshotId=0,
nextSnapshotTag=1,
lastSuccessfulSnapshotTag=0,
lastAllocatedPageCount=0,
candidatePageCount=0
]],
super = [WALRecord [size=4129, chainSize=0, pos=FileWALPointer [idx=0, 
fileOff=103, len=4129], type=PAGE_RECORD]]]
Next WAL record :: CheckpointRecord [cpId=c6ba7793-113b-4b54-8530-45e1708ca44c, 
end=false, cpMark=FileWALPointer [idx=0, fileOff=29, len=29], super=WALRecord 
[size=1963, chainSize=0, pos=FileWALPointer [idx=0, fileOff=39686, len=1963], 
type=CHECKPOINT_RECORD]]
Next WAL record :: PageSnapshot [fullPageId = FullPageId 
[pageId=0002, effectivePageId=, grpId=-1368047378], 
page = [
Header [
type=11 (PageMetaIO),
ver=1,
crc=0,
pageId=844420635164672(offset=0, flags=10, partId=65535, index=0)
],
PageMeta[
treeRoot=844420635164675,
lastSuccessfulFullSnapshotId=0,
lastSuccessfulSnapshotId=0,
nextSnapshotTag=1,
lastSuccessfulSnapshotTag=0,
lastAllocatedPageCount=0,
candidatePageCount=0
]],
super = [WALRecord [size=4129, chainSize=0, pos=FileWALPointer [idx=0, 
fileOff=55961, len=4129], type=PAGE_RECORD]]]
Next WAL record :: CheckpointRecord [cpId=145e599e-66fc-45f5-bde4-b0c392125968, 
end=false, cpMark=null, super=WALRecord [size=21409, chainSize=0, 
pos=FileWALPointer [idx=0, fileOff=13101788, len=21409], 
type=CHECKPOINT_RECORD]]
{noformat}


[jira] [Updated] (IGNITE-11749) Implement automatic pages history dump on CorruptedTreeException

2019-05-07 Thread Anton Kalashnikov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Kalashnikov updated IGNITE-11749:
---
Description: 
Currently, the only way to debug possible bugs in checkpointer/recovery 
mechanics is to manually parse WAL files after the corruption has happened. 
This is not practical for two reasons. First, it requires manual actions that 
depend on the content of the exception. Second, it is not always possible to 
obtain the WAL files (they may contain sensitive data).

We need to add a mechanism that dumps all information required for primary 
analysis of the corruption to the exception handler. For example, if an 
exception happened when materializing a link {{0xabcd}} written on an index 
page {{0xdcba}}, we need to dump the change history of both pages and the 
checkpoint records within the analysis interval. Possibly, we should also 
include the FreeList pages that contained the aforementioned pages.

 

control.sh command:

{noformat}

--diagnostic page_history page_ids 234324,3455 print_to_log print_to_file

--diagnostic - command for dumping diagnostic info.

page_history - subcommand for dumping only the page history. Required.

page_ids {list_of_page_ids} - list of page IDs to dump.

print_to_log, print_to_file - where to dump (log, file, or both). At least 
one of them is required.

{noformat}

Example of output:
{noformat}
[2019-05-07 11:57:57,350][INFO 
][test-runner-#58%diagnostic.DiagnosticProcessorTest%][PageHistoryDiagnoster] 
Next WAL record :: PageSnapshot [fullPageId = FullPageId 
[pageId=0002, effectivePageId=, grpId=-2100569601], 
page = [
Header [
type=11 (PageMetaIO),
ver=1,
crc=0,
pageId=844420635164672(offset=0, flags=10, partId=65535, index=0)
],
PageMeta[
treeRoot=844420635164675,
lastSuccessfulFullSnapshotId=0,
lastSuccessfulSnapshotId=0,
nextSnapshotTag=1,
lastSuccessfulSnapshotTag=0,
lastAllocatedPageCount=0,
candidatePageCount=0
]],
super = [WALRecord [size=4129, chainSize=0, pos=FileWALPointer [idx=0, 
fileOff=103, len=4129], type=PAGE_RECORD]]]
Next WAL record :: CheckpointRecord [cpId=c6ba7793-113b-4b54-8530-45e1708ca44c, 
end=false, cpMark=FileWALPointer [idx=0, fileOff=29, len=29], super=WALRecord 
[size=1963, chainSize=0, pos=FileWALPointer [idx=0, fileOff=39686, len=1963], 
type=CHECKPOINT_RECORD]]
Next WAL record :: PageSnapshot [fullPageId = FullPageId 
[pageId=0002, effectivePageId=, grpId=-1368047378], 
page = [
Header [
type=11 (PageMetaIO),
ver=1,
crc=0,
pageId=844420635164672(offset=0, flags=10, partId=65535, index=0)
],
PageMeta[
treeRoot=844420635164675,
lastSuccessfulFullSnapshotId=0,
lastSuccessfulSnapshotId=0,
nextSnapshotTag=1,
lastSuccessfulSnapshotTag=0,
lastAllocatedPageCount=0,
candidatePageCount=0
]],
super = [WALRecord [size=4129, chainSize=0, pos=FileWALPointer [idx=0, 
fileOff=55961, len=4129], type=PAGE_RECORD]]]
Next WAL record :: CheckpointRecord [cpId=145e599e-66fc-45f5-bde4-b0c392125968, 
end=false, cpMark=null, super=WALRecord [size=21409, chainSize=0, 
pos=FileWALPointer [idx=0, fileOff=13101788, len=21409], 
type=CHECKPOINT_RECORD]]
{noformat}


[jira] [Updated] (IGNITE-11749) Implement automatic pages history dump on CorruptedTreeException

2019-05-07 Thread Anton Kalashnikov (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Kalashnikov updated IGNITE-11749:
---
Description: 
Currently, the only way to debug possible bugs in checkpointer/recovery 
mechanics is to manually parse WAL files after the corruption has happened. 
This is not practical for two reasons. First, it requires manual actions that 
depend on the content of the exception. Second, it is not always possible to 
obtain the WAL files (they may contain sensitive data).

We need to add a mechanism that dumps all information required for primary 
analysis of the corruption to the exception handler. For example, if an 
exception happened when materializing a link {{0xabcd}} written on an index 
page {{0xdcba}}, we need to dump the change history of both pages and the 
checkpoint records within the analysis interval. Possibly, we should also 
include the FreeList pages that contained the aforementioned pages.

Example of output:
{noformat}
[2019-05-07 11:57:57,350][INFO 
][test-runner-#58%diagnostic.DiagnosticProcessorTest%][PageHistoryDiagnoster] 
Next WAL record :: PageSnapshot [fullPageId = FullPageId 
[pageId=0002, effectivePageId=, grpId=-2100569601], 
page = [
Header [
type=11 (PageMetaIO),
ver=1,
crc=0,
pageId=844420635164672(offset=0, flags=10, partId=65535, index=0)
],
PageMeta[
treeRoot=844420635164675,
lastSuccessfulFullSnapshotId=0,
lastSuccessfulSnapshotId=0,
nextSnapshotTag=1,
lastSuccessfulSnapshotTag=0,
lastAllocatedPageCount=0,
candidatePageCount=0
]],
super = [WALRecord [size=4129, chainSize=0, pos=FileWALPointer [idx=0, 
fileOff=103, len=4129], type=PAGE_RECORD]]]
Next WAL record :: CheckpointRecord [cpId=c6ba7793-113b-4b54-8530-45e1708ca44c, 
end=false, cpMark=FileWALPointer [idx=0, fileOff=29, len=29], super=WALRecord 
[size=1963, chainSize=0, pos=FileWALPointer [idx=0, fileOff=39686, len=1963], 
type=CHECKPOINT_RECORD]]
Next WAL record :: PageSnapshot [fullPageId = FullPageId 
[pageId=0002, effectivePageId=, grpId=-1368047378], 
page = [
Header [
type=11 (PageMetaIO),
ver=1,
crc=0,
pageId=844420635164672(offset=0, flags=10, partId=65535, index=0)
],
PageMeta[
treeRoot=844420635164675,
lastSuccessfulFullSnapshotId=0,
lastSuccessfulSnapshotId=0,
nextSnapshotTag=1,
lastSuccessfulSnapshotTag=0,
lastAllocatedPageCount=0,
candidatePageCount=0
]],
super = [WALRecord [size=4129, chainSize=0, pos=FileWALPointer [idx=0, 
fileOff=55961, len=4129], type=PAGE_RECORD]]]
Next WAL record :: CheckpointRecord [cpId=145e599e-66fc-45f5-bde4-b0c392125968, 
end=false, cpMark=null, super=WALRecord [size=21409, chainSize=0, 
pos=FileWALPointer [idx=0, fileOff=13101788, len=21409], 
type=CHECKPOINT_RECORD]]
{noformat}



> Implement automatic pages history dump on CorruptedTreeException
> 
>
> Key: IGNITE-11749
> URL: https://issues.apache.org/jira/browse/IGNITE-11749
> Project: Ignite
> Issue Type: Improvement
> Reporter: Alexey Goncharuk
> Priority: Major
>
> Currently, the only way to debug possible bugs in checkpointer/recovery 
> mechanics is to manually parse WAL files after the corruption has happened. 
> This is not practical for two reasons. First, it requires manual actions that 
> depend on the content of the exception. Second, it is not always possible to 
> obtain the WAL files (they may contain sensitive data).
> We need to add a mechanism that dumps all information required for primary 
> analysis of the corruption to the exception handler. For example, if an 
> exception happened when materializing a link {{0xabcd}} written on an index 
> page {{0xdcba}}, we need to dump the change history of both pages and the 
> checkpoint records within the analysis interval. Possibly, we should also 
> include the FreeList pages that contained the aforementioned pages.

[jira] [Updated] (IGNITE-11749) Implement automatic pages history dump on CorruptedTreeException

2019-04-16 Thread Alexey Goncharuk (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Goncharuk updated IGNITE-11749:
--
Ignite Flags:   (was: Docs Required)

> Implement automatic pages history dump on CorruptedTreeException
> 
>
> Key: IGNITE-11749
> URL: https://issues.apache.org/jira/browse/IGNITE-11749
> Project: Ignite
> Issue Type: Improvement
> Reporter: Alexey Goncharuk
> Priority: Major
>
> Currently, the only way to debug possible bugs in checkpointer/recovery 
> mechanics is to manually parse WAL files after the corruption has happened. 
> This is not practical for two reasons. First, it requires manual actions that 
> depend on the content of the exception. Second, it is not always possible to 
> obtain the WAL files (they may contain sensitive data).
> We need to add a mechanism that dumps all information required for primary 
> analysis of the corruption to the exception handler. For example, if an 
> exception happened when materializing a link {{0xabcd}} written on an index 
> page {{0xdcba}}, we need to dump the change history of both pages and the 
> checkpoint records within the analysis interval. Possibly, we should also 
> include the FreeList pages that contained the aforementioned pages.


