Re: [robinhood-support] [EXTERNAL] Re: Robinhood 3.0.1 rebooting unexpectedly on initial scan with Lustre 2.8.0.8

Mervini, Joseph A Wed, 30 Aug 2017 01:19:19 -0700

That is the weird thing about this problem. There is NO crash dump (even though 
it is enabled) and there is nothing in either the console or any system logs 
that points to why it’s rebooting. The machine just goes “Poof” and reboots as 
if power was switched off and then back on again. I would consider it a hardware 
problem but I am having identical problems on 3 different platforms, 2 being 
Dell R720s and one Penguin 1905E.


I have applied all the recommended tuning parameters for mariadb and lustre. 
The only thing that I haven’t messed with is the robinhood config file 
(modified from basic.conf), but I would expect any issue there to be exposed 
immediately. In any event this is what I am using:

[root@littlejohn log]# cat /etc/robinhood.d/scratch1.conf
General {
    fs_path = "/scratch1";
    # filesystem type, as displayed by 'mount' (e.g. ext4, xfs, lustre, ...)
    fs_type = lustre;
}

Log {
    log_file = "/var/log/robinhood.log";
    report_file = "/var/log/robinhood_actions.log";
    alert_file = "/var/log/robinhood_alerts.log";
    debug_level = full;
    stats_interval = 5min;
}

ListManager {
    MySQL {
        server = localhost;
        db = scratch1;
        user = robinhood;
        password_file = /etc/robinhood.d/.dbpassword;
    }
}

# Lustre 2.x only
ChangeLog {
    MDT {
        mdt_name = "MDT0000";
        reader_id = "cl1";
    }
}

Is there perhaps something I am missing?

Also I am building robinhood from the robinhood-master.zip archive (md5sum 
fbf96fddad156b69c3db5bbdf5e3840d). After unpacking the archive I change into 
the robinhood-master directory, initialize the git repo and commit everything, 
then run autogen.sh, configure and make rpms. Everything runs clean with the 
exception of the following warnings:

  CC       rbh_cfg.lo
conf_lex.c:1767:17: warning: 'yyunput' defined but not used [-Wunused-function]
     static void yyunput (int c, register char * yy_bp )
                 ^
conf_lex.c:1808:16: warning: 'input' defined but not used [-Wunused-function]
     static int input  (void)
                ^

make[3]: Entering directory 
`/root/robinhood-master/rpms/BUILD/robinhood-3.0/src/tools'
  CC       lhsmtool_cmd-lhsmtool_cmd.o
lhsmtool_cmd.c: In function 'ct_setup':
lhsmtool_cmd.c:891:2: warning: 'g_thread_init' is deprecated (declared at 
/usr/include/glib-2.0/glib/deprecated/gthread.h:265) [-Wdeprecated-declarations]
  g_thread_init(NULL);
  ^
lhsmtool_cmd.c:906:3: warning: 'g_thread_create' is deprecated (declared at 
/usr/include/glib-2.0/glib/deprecated/gthread.h:104): Use 'g_thread_new' 
instead [-Wdeprecated-declarations]
   g_thread_create(subproc_mgr_main, NULL, false, NULL);
   ^
  CCLD     lhsmtool_cmd
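
For anyone trying to reproduce this build from the same archive, the checksum 
quoted above can be verified before unpacking. This is only a minimal sketch: 
the archive path is passed on the command line and the helper name 
md5_of_file is made up for illustration.

```python
import hashlib

# Checksum quoted in this message for robinhood-master.zip.
EXPECTED_MD5 = "fbf96fddad156b69c3db5bbdf5e3840d"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    actual = md5_of_file(sys.argv[1])
    print("match" if actual == EXPECTED_MD5 else f"mismatch: {actual}")
```

A matching digest only confirms that the same archive is being built. It says 
nothing about whether the build itself is related to the reboots.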


I am baffled. Any help would be greatly appreciated.

====

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
[email protected]<mailto:[email protected]>



On Aug 23, 2017, at 9:02 AM, LEIBOVICI Thomas 
<[email protected]<mailto:[email protected]>> wrote:

If you are getting kernel crashes, a system dump, the system console or the 
system logs would give more information about the root cause than tracing a 
userland process.

On 08/23/17 16:55, Mervini, Joseph A wrote:
I just re-ran the scan using strace -f and this is the tail of the output up 
until the machine rebooted:

2017/08/23 08:39:44 [13249/7] ListMgr | SQL query: COMMIT
[pid 13254] <... futex resumed> )       = -1 EAGAIN (Resource temporarily 
unavailable)
[pid 13267] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13263] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13271] <... ioctl resumed> , 0x7f28351b9490) = 0
[pid 13257] <... write resumed> )       = 58
[pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13271] newfstatat(10, "ompi-f77.pc",  <unfinished ...>
[pid 13257] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 13254] <... futex resumed> )       = -1 EAGAIN (Resource temporarily 
unavailable)
[pid 13270] <... openat resumed> )      = 12
[pid 13257] <... futex resumed> )       = 1
[pid 13255] <... futex resumed> )       = 0
[pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 13270] ioctl(12, _IOC(_IOC_READ, 0x66, 0xad, 0x08), 0x7f284190a680 
<unfinished ...>
[pid 13257] poll([{fd=15, events=POLLIN|POLLPRI}], 1, 0 <unfinished ...>
[pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 13270] <... ioctl resumed> )       = 0
[pid 13257] <... poll resumed> )        = 0 (Timeout)
[pid 13255] <... futex resumed> )       = 1
[pid 13254] <... futex resumed> )       = 1
[pid 13253] <... futex resumed> )       = 0
[pid 13268] <... futex resumed> )       = 0
[pid 13270] close(12 <unfinished ...>
[pid 13257] write(15, "\7\0\0\0\3COMMIT", 11 <unfinished ...>
[pid 13255] write(2, "2017/08/23 08:39:44 [13249/6] Li"..., 108 <unfinished ...>
2017/08/23 08:39:44 [13249/6] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE 
id='0x200000405:0xd5b0:0x0'
[pid 13268] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13253] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13257] <... write resumed> )       = 11
[pid 13255] <... write resumed> )       = 108
[pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13271] <... newfstatat resumed> {st_mode=S_IFREG|0640, st_size=665, ...}, 
AT_SYMLINK_NOFOLLOW) = 0
[pid 13257] read(15,  <unfinished ...>
[pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 13254] <... futex resumed> )       = -1 EAGAIN (Resource temporarily 
unavailable)
[pid 13271] openat(10, "ompi-f77.pc", O_RDONLY|O_NONBLOCK|O_NOFOLLOW|O_NOATIME 
<unfinished ...>
[pid 13266] <... futex resumed> )       = 0
[pid 13255] <... futex resumed> )       = 1
[pid 13254] write(2, "2017/08/23 08:39:44 [13249/5] Li"..., 1082017/08/23 
08:39:44 [13249/5] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE 
id='0x200000405:0xcc68:0x0'
 <unfinished ...>
[pid 13266] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13255] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13254] <... write resumed> )       = 108
[pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 13270] <... close resumed> )       = 0
[pid 13259] <... futex resumed> )       = 0
Write failed: Broken pipe

NOTE: I captured the entire output from strace in a screen log.
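
Since the full strace output is in a screen log, one way to see what each 
thread was doing when the log stops is to keep only the last line recorded per 
pid. The following is a rough sketch, not a tested tool. It assumes the log is 
the raw strace -f output, one record per line and not re-wrapped as in this 
email, and the function name last_call_per_pid is made up for illustration.

```python
import re

# "[pid 13254] futex(...)" or "[pid 13254] <... futex resumed> ..."
PID_LINE = re.compile(r"^\[pid\s+(\d+)\]\s+(.*)$")
CALL_NAME = re.compile(r"^(?:<\.\.\. )?(\w+)")


def last_call_per_pid(lines):
    """Return {pid: (syscall_name, still_unfinished)} from the last line per pid."""
    state = {}
    for line in lines:
        match = PID_LINE.match(line.rstrip("\n"))
        if not match:
            continue  # interleaved robinhood log lines and continuations are skipped
        pid, rest = int(match.group(1)), match.group(2)
        name = CALL_NAME.match(rest)
        if name:
            unfinished = rest.rstrip().endswith("<unfinished ...>")
            state[pid] = (name.group(1), unfinished)
    return state


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as log:
        for pid, (call, unfinished) in sorted(last_call_per_pid(log).items()):
            print(pid, call, "unfinished" if unfinished else "returned")
```

Threads whose last entry is still unfinished were inside a system call when the 
trace ended, which may help narrow down which call, if any, was in flight at 
the time of the reboot.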

====

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
[email protected]<mailto:[email protected]>



On Aug 22, 2017, at 9:17 AM, Mervini, Joseph A 
<[email protected]<mailto:[email protected]>> wrote:

Hello,

I have not received a response to this posting but have continued to try and 
figure out why this problem persists.

Since I initially opened the request I have been able to duplicate it on three 
different machines. I have also tried multiple kernel versions and lustre 2.8 
client versions. I have also completely rebuilt my lustre file system with a 
newer version of 2.8 (2.8.0.9). The problem only presents itself when running a 
scan against the 2.8 lustre file system. A 2.5 lustre file system works fine.

I decided to run simultaneous scans against both file systems under valgrind. 
Again, the scan of the 2.5 file system completed, but the system rebooted 
before the scan of the 2.8 file system could complete. In both scans, however, 
valgrind’s output was similar, with reports like this:

==8883== Thread 7:
==8883== Conditional jump or move depends on uninitialised value(s)
==8883==    at 0x44B8E8: batch_insert_stripe_info (listmgr_stripe.c:168)
==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
==8883==    by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
==8883==    by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
==8883==    by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
==8883==    by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==8883==    by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
==8883==
==8883== Conditional jump or move depends on uninitialised value(s)
==8883==    at 0x44B9C5: batch_insert_stripe_info (listmgr_stripe.c:168)
==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
==8883==    by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
==8883==    by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
==8883==    by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
==8883==    by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==8883==    by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
==8883==
==8883==
==8883== More than 10000000 total errors detected.  I'm not reporting any more.
==8883== Final error counts will be inaccurate.  Go fix your program!
==8883== Rerun with --error-limit=no to disable this cutoff.  Note
==8883== that errors may occur in your program without prior warning from
==8883== Valgrind, because errors are no longer being displayed.
==8883==
====


Is this considered normal?
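
To judge whether these reports come from one call site or from many, the saved 
valgrind output can be grouped by innermost frame. This is a minimal sketch. It 
assumes the output was saved to a text file in the format shown above, and the 
function name count_top_frames is made up for illustration.

```python
import re
from collections import Counter

# "==PID==    at 0xADDR: function (file.c:line)" is the innermost frame of a report.
AT_FRAME = re.compile(r"^==\d+==\s+at 0x[0-9A-Fa-f]+: (\S+) \(([^)]*)\)")


def count_top_frames(lines):
    """Count valgrind reports by the (function, location) of their innermost frame."""
    counts = Counter()
    for line in lines:
        match = AT_FRAME.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as log:
        for (func, where), n in count_top_frames(log).most_common():
            print(f"{n:8d}  {func}  {where}")
```

Because valgrind stopped reporting after its error limit was reached, any 
counts obtained this way are a lower bound rather than a full tally.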


Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
[email protected]<mailto:[email protected]>



On Jul 11, 2017, at 10:17 AM, Mervini, Joseph A 
<[email protected]<mailto:[email protected]>> wrote:

Hello,

I have a problem similar to 
https://sourceforge.net/p/robinhood/mailman/message/35883907/ in which the 
robinhood server running mariadb-5.5.52-1.el7.x86_64 and lustre 2.8.0.8 client 
will reboot when the initial scan is run. I am running this in a testbed 
environment prior to deployment on our production system because I want to get 
a complete handle on it before I commit to the deployment. I have 2 separate 
lustre file systems that I am running against: One is a 408TB lustre 2.8 file 
system with ~16M inodes, the other is a 204TB lustre 2.5.5 file system with ~3M 
inodes.

The curious thing is that I had successfully scanned both file systems 
independently on the system with everything working (including web-gui) and 
then basically blew away the databases to get a datapoint on how the system 
performed and the time it took if I ran a scan on both file systems 
simultaneously. It appears that it is only impacting the 2.8 file system 
database. I just ran a fresh scan against the 2.5.5 file system without 
problem. I then started a new scan against the 2.8 file system and once again it 
rebooted.

Like the other support ticket above, when I ran the scan only on the 2.8 file 
system in debug mode it also reported messages similar to “2017/07/10 15:44:58 
[15191/6] FS_Scan | openat failed on <parent_fd=18>/libippch.so: Too many 
levels of symbolic links”. I checked a large number of the files that were being 
reported, and for the most part they were library files with only a couple of 
symlinks to the .so file in the same directory.
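
One possible explanation for these messages, offered as an assumption rather 
than a confirmed diagnosis: on Linux, opening a path whose final component is a 
symlink with O_NOFOLLOW fails with ELOOP, which is reported as “Too many levels 
of symbolic links”. The strace output elsewhere in this thread shows the 
scanner calling openat with O_NOFOLLOW. The sketch below reproduces that 
behaviour on any local Linux file system. The file names are hypothetical.

```python
import errno
import os
import tempfile


def open_nofollow_errno(path):
    """Try to open `path` with O_NOFOLLOW; return 0 on success, else the errno."""
    try:
        fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    except OSError as exc:
        return exc.errno
    os.close(fd)
    return 0


def demo():
    """Create a regular file and a symlink to it, then open both with O_NOFOLLOW."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "libexample.so.1")  # hypothetical file name
        link = os.path.join(tmp, "libexample.so")      # hypothetical symlink name
        with open(target, "w") as handle:
            handle.write("placeholder")
        os.symlink("libexample.so.1", link)
        return open_nofollow_errno(target), open_nofollow_errno(link)


if __name__ == "__main__":
    file_err, link_err = demo()
    print("regular file errno:", file_err)
    print("symlink errno:", errno.errorcode.get(link_err, link_err))
```

If that is what is happening here, the messages would be expected for every 
symlink encountered during the scan and would not by themselves account for the 
reboots.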

The only other thing of note that I was able to capture is this from the 
console output:

[ 3301.937577] LustreError: 15209:0:(linux-module.c:92:obd_ioctl_getdata()) 
Version mismatch kernel (10004) vs application (0)
[ 3301.950059] LustreError: 15209:0:(class_obd.c:230:class_handle_ioctl()) OBD 
ioctl: data error"

There was no indication of a fault in any of the log files. I was running 
top and htop during the process and neither CPU nor memory was exhausted, nor 
did I see anything suspicious happening on the file system itself.

Any help or clues as to why this is failing would be greatly appreciated. 
Thanks in advance.
====

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
[email protected]<mailto:[email protected]>



------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org<http://slashdot.org/>! 
http://sdm.link/slashdot
_______________________________________________
robinhood-support mailing list
[email protected]<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/robinhood-support





