Hello,
After a 24-hour test both servers are still working with no problems. I ran one more script on both servers:

TEST-MAIL1 ~ # cat terror3.sh
#!/bin/bash
while true
do
        du -sh /mnt/EMC/TEST-MAIL2
        find /mnt/EMC/TEST-MAIL2
        sleep 30
done;
TEST-MAIL2 ~ # cat terror3.sh
#!/bin/bash
while true
do
        du -sh /mnt/EMC/TEST-MAIL1
        find /mnt/EMC/TEST-MAIL1
        sleep 30
done;

This script runs find and du -sh on the files that the other machine uploads to the ocfs2 volume.

Cheers

-----Original Message-----
From: srinivas eeda
Sent: Thursday, December 22, 2011 9:12 PM
To: Marek Królikowski
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both

We need to know what happened to node 2. Was the node rebooted because of a network timeout or a kernel panic? Can you please configure netconsole and a serial console and rerun the test? (A minimal netconsole setup is sketched at the end of this thread.)

On 12/22/2011 8:08 AM, Marek Królikowski wrote:
> Hello
> After 24 hours I saw TEST-MAIL2 reboot (possible kernel panic), but TEST-MAIL1 got this in dmesg:
> TEST-MAIL1 ~ # dmesg
> [cut]
> o2net: accepted connection from node TEST-MAIL2 (num 1) at 172.17.1.252:7777
> o2dlm: Node 1 joins domain B24C4493BBC74FEAA3371E2534BB3611
> o2dlm: Nodes in domain B24C4493BBC74FEAA3371E2534BB3611: 0 1
> o2net: connection to node TEST-MAIL2 (num 1) at 172.17.1.252:7777 has been idle for 60.0 seconds, shutting it down.
> (swapper,0,0):o2net_idle_timer:1562 Here are some times that might help debug the situation: (Timer: 33127732045, Now 33187808090, DataReady 33127732039, Advance 33127732051-33127732051, Key 0xebb9cd47, Func 506, FuncTime 33127732045-33127732048)
> o2net: no longer connected to node TEST-MAIL2 (num 1) at 172.17.1.252:7777
> (du,5099,12):dlm_do_master_request:1324 ERROR: link to 1 went down!
> (du,5099,12):dlm_get_lock_resource:907 ERROR: status = -112
> (dlm_thread,14321,1):dlm_send_proxy_ast_msg:484 ERROR: B24C4493BBC74FEAA3371E2534BB3611: res M000000000000000000000cf023ef70, error -112 send AST to node 1
> (dlm_thread,14321,1):dlm_flush_asts:605 ERROR: status = -112
> (dlm_thread,14321,1):dlm_send_proxy_ast_msg:484 ERROR: B24C4493BBC74FEAA3371E2534BB3611: res P000000000000000000000000000000, error -107 send AST to node 1
> (dlm_thread,14321,1):dlm_flush_asts:605 ERROR: status = -107
> (kworker/u:3,5071,0):o2net_connect_expired:1724 ERROR: no connection established with node 1 after 60.0 seconds, giving up and returning errors.
> (o2hb-B24C4493BB,14310,0):o2dlm_eviction_cb:267 o2dlm has evicted node 1 from group B24C4493BBC74FEAA3371E2534BB3611
> (ocfs2rec,5504,6):dlm_get_lock_resource:834 B24C4493BBC74FEAA3371E2534BB3611:M0000000000000000000015f023ef70: at least one node (1) to recover before lock mastery can begin
> (ocfs2rec,5504,6):dlm_get_lock_resource:888 B24C4493BBC74FEAA3371E2534BB3611:M0000000000000000000015f023ef70: at least one node (1) to recover before lock mastery can begin
> (du,5099,12):dlm_restart_lock_mastery:1213 ERROR: node down! 1
> (du,5099,12):dlm_wait_for_lock_mastery:1030 ERROR: status = -11
> (du,5099,12):dlm_get_lock_resource:888 B24C4493BBC74FEAA3371E2534BB3611:N000000000020924f: at least one node (1) to recover before lock mastery can begin
> (dlm_reco_thread,14322,0):dlm_get_lock_resource:834 B24C4493BBC74FEAA3371E2534BB3611:$RECOVERY: at least one node (1) to recover before lock mastery can begin
> (dlm_reco_thread,14322,0):dlm_get_lock_resource:868 B24C4493BBC74FEAA3371E2534BB3611: recovery map is not empty, but must master $RECOVERY lock now
> (dlm_reco_thread,14322,0):dlm_do_recovery:523 (14322) Node 0 is the Recovery Master for the Dead Node 1 for Domain B24C4493BBC74FEAA3371E2534BB3611
> (ocfs2rec,5504,6):ocfs2_replay_journal:1549 Recovering node 1 from slot 1 on device (253,0)
> (ocfs2rec,5504,6):ocfs2_begin_quota_recovery:407 Beginning quota recovery in slot 1
> (kworker/u:0,2909,0):ocfs2_finish_quota_recovery:599 Finishing quota recovery in slot 1
>
> And I tried these commands:
> debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP allow
> debugfs.ocfs2: Unable to write log mask "ENTRY": No such file or directory
> debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP off
> debugfs.ocfs2: Unable to write log mask "ENTRY": No such file or directory
>
> But they are not working....
>
>
> -----Original Message----- From: Srinivas Eeda
> Sent: Wednesday, December 21, 2011 8:43 PM
> To: Marek Królikowski
> Cc: ocfs2-users@oss.oracle.com
> Subject: Re: [Ocfs2-users] ocfs2 - Kernel panic on many write/read from both
>
> Those numbers look good. Basically, with the fixes backed out and the other fix I gave, you are not seeing that many orphans hanging around and hence not seeing the stuck-process kernel stacks. You can run the test longer or, if you are satisfied, please enable quotas and re-run the test with the modified kernel. You might see a deadlock that needs to be fixed (I was not able to reproduce this yet). If the system hangs, please capture the following and provide me the output (a wrapper script for these steps is sketched after the list, at the end of this thread):
>
> 1. echo t > /proc/sysrq-trigger
> 2. debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP allow
> 3. wait for 10 minutes
> 4. debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP off
> 5. echo t > /proc/sysrq-trigger
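
For convenience, the five capture steps above can be wrapped in a small script. This is only a sketch, assuming the hung node can still execute commands; note that the task dumps and the ocfs2 debug messages go to the kernel log (dmesg/syslog), not to stdout, so the kernel log is what should be collected afterwards:

#!/bin/bash
# Run as root on the node that hangs.
echo t > /proc/sysrq-trigger                                                   # step 1: dump all task stacks
debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP allow   # step 2: enable the log masks
sleep 600                                                                      # step 3: wait for 10 minutes
debugfs.ocfs2 -l ENTRY EXIT DLM_GLUE QUOTA INODE DISK_ALLOC EXTENT_MAP off     # step 4: turn the log masks off
echo t > /proc/sysrq-trigger                                                   # step 5: dump task stacks again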
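
On enabling quotas for the re-run, a rough sketch, assuming the usrquota/grpquota feature flags and mount options supported by recent ocfs2-tools; the device path below is a placeholder, and the volume has to be unmounted on all nodes while the feature flags are changed:

umount /mnt/EMC                                                 # on every node
tunefs.ocfs2 --fs-features=usrquota,grpquota /dev/mapper/EMC    # add the quota feature flags (placeholder device)
mount -o usrquota,grpquota /dev/mapper/EMC /mnt/EMC             # remount with quota accounting enabled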
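
As for the netconsole request earlier in the thread, a minimal sketch follows. The source IP is taken from the log above (TEST-MAIL2 is 172.17.1.252); the interface name, the ports, and the receiving host's IP and MAC address are placeholders. A serial console would instead be set up with a console=ttyS0,... kernel boot parameter.

# On the node that panics/reboots (e.g. TEST-MAIL2): send kernel messages over UDP.
# Parameter syntax: netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6665@172.17.1.252/eth0,6666@172.17.1.10/aa:bb:cc:dd:ee:ff

# On the receiving machine: log everything arriving on that UDP port.
nc -u -l 6666 | tee netconsole-TEST-MAIL2.log      # or "nc -u -l -p 6666" with traditional netcat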