I added the following block to my config file to deal with the watchdog
problem:
FS_Scan {
    Ignore {
        path == "**/dev/watchdog"
        or path == "**/dev/watchdog0"
    }
}
====
Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
[email protected]
> On Aug 23, 2017, at 11:50 AM, Mervini, Joseph A <[email protected]> wrote:
>
> I am able to verify that the cause of the failures was in fact
> /dev/watchdog and /dev/watchdog0 existing on the lustre file system. The
> problem was easily duplicated by confining the scan to the dev directory:
> in each case, before the individual files were removed, the system would
> reboot. After removing the files the scan ran to completion.
>
> Thanks again to Justin Miller for the information and explanation.
> ====
>
> Joe Mervini
> Sandia National Laboratories
> High Performance Computing
> 505.844.6770
> [email protected]
>
>
>
>> On Aug 23, 2017, at 10:28 AM, Mervini, Joseph A <[email protected]> wrote:
>>
>> Justin
>>
>> Yes - as a matter of fact I have copied the entire OS to the lustre file
>> system to create a ton of real files. I am testing that hypothesis now.
>>
>> Thanks so much for the suggestion! I will post my results.
>> ====
>>
>> Joe Mervini
>> Sandia National Laboratories
>> High Performance Computing
>> 505.844.6770
>> [email protected]
>>
>>
>>
>>> On Aug 23, 2017, at 10:03 AM, Justin Miller <[email protected]> wrote:
>>>
>>> Does the data you’re scanning include backups of OS root directories?
>>>
>>> We have seen the case where a scan causes a system to reboot with almost no
>>> explanation because the data being scanned included a copy of /dev from an
>>> OS root filesystem backup, and inside that copy of /dev was a character
>>> device file for a watchdog device handler. The Lustre client running the
>>> scan also had the modules loaded to handle the watchdog.
>>>
>>> The watchdog timer starts during the scan when path2fid does an open. The
>>> watchdog timer isn’t terminated until a special flag is specified, and
>>> Lustre doesn’t know about the flag, so the watchdog times out and reboots
>>> the system.
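For context, that matches how the Linux watchdog interface works: open(2) on the device arms the timer, and on drivers that support "magic close" the only way to disarm it is to write the character 'V' immediately before close(2). A minimal sketch of the magic-close sequence, using a temporary regular file as a safe stand-in for /dev/watchdog (opening the real device on a live node would arm the hardware timer):

```python
import os
import tempfile

# Stand-in path; in real use this would be /dev/watchdog, where open()
# arms the hardware timer. A plain file keeps the sketch safe to run.
watchdog_path = os.path.join(tempfile.mkdtemp(), "watchdog")

fd = os.open(watchdog_path, os.O_WRONLY | os.O_CREAT, 0o600)
try:
    # "Magic close": writing 'V' right before close tells a supporting
    # driver to disable the timer. A scanner that merely opens and closes
    # the device never writes this, so the timer it armed keeps counting
    # down and eventually reboots the node.
    os.write(fd, b"V")
finally:
    os.close(fd)
```

This is why a metadata scan touching a watchdog device node can reboot a machine with no error in any log: the open itself is harmless-looking, and the reboot happens later when the unattended timer expires.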
>>>
>>> - Justin Miller
>>>
>>>> On Aug 23, 2017, at 9:55 AM, Mervini, Joseph A <[email protected]> wrote:
>>>>
>>>> I just re-ran the scan using strace -f and this is the tail of the output
>>>> up until the machine rebooted:
>>>>
>>>> 2017/08/23 08:39:44 [13249/7] ListMgr | SQL query: COMMIT
>>>> [pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily
>>>> unavailable)
>>>> [pid 13267] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13263] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13271] <... ioctl resumed> , 0x7f28351b9490) = 0
>>>> [pid 13257] <... write resumed> ) = 58
>>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13271] newfstatat(10, "ompi-f77.pc", <unfinished ...>
>>>> [pid 13257] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>> [pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily
>>>> unavailable)
>>>> [pid 13270] <... openat resumed> ) = 12
>>>> [pid 13257] <... futex resumed> ) = 1
>>>> [pid 13255] <... futex resumed> ) = 0
>>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>> [pid 13270] ioctl(12, _IOC(_IOC_READ, 0x66, 0xad, 0x08), 0x7f284190a680
>>>> <unfinished ...>
>>>> [pid 13257] poll([{fd=15, events=POLLIN|POLLPRI}], 1, 0 <unfinished ...>
>>>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>> [pid 13270] <... ioctl resumed> ) = 0
>>>> [pid 13257] <... poll resumed> ) = 0 (Timeout)
>>>> [pid 13255] <... futex resumed> ) = 1
>>>> [pid 13254] <... futex resumed> ) = 1
>>>> [pid 13253] <... futex resumed> ) = 0
>>>> [pid 13268] <... futex resumed> ) = 0
>>>> [pid 13270] close(12 <unfinished ...>
>>>> [pid 13257] write(15, "\7\0\0\0\3COMMIT", 11 <unfinished ...>
>>>> [pid 13255] write(2, "2017/08/23 08:39:44 [13249/6] Li"..., 108
>>>> <unfinished ...>
>>>> 2017/08/23 08:39:44 [13249/6] ListMgr | SQL query: SELECT id FROM ENTRIES
>>>> WHERE id='0x200000405:0xd5b0:0x0'
>>>> [pid 13268] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13253] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13257] <... write resumed> ) = 11
>>>> [pid 13255] <... write resumed> ) = 108
>>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13271] <... newfstatat resumed> {st_mode=S_IFREG|0640, st_size=665,
>>>> ...}, AT_SYMLINK_NOFOLLOW) = 0
>>>> [pid 13257] read(15, <unfinished ...>
>>>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>> [pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily
>>>> unavailable)
>>>> [pid 13271] openat(10, "ompi-f77.pc",
>>>> O_RDONLY|O_NONBLOCK|O_NOFOLLOW|O_NOATIME <unfinished ...>
>>>> [pid 13266] <... futex resumed> ) = 0
>>>> [pid 13255] <... futex resumed> ) = 1
>>>> [pid 13254] write(2, "2017/08/23 08:39:44 [13249/5] Li"..., 1082017/08/23
>>>> 08:39:44 [13249/5] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE
>>>> id='0x200000405:0xcc68:0x0'
>>>> <unfinished ...>
>>>> [pid 13266] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13254] <... write resumed> ) = 108
>>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1) = 1
>>>> [pid 13270] <... close resumed> ) = 0
>>>> [pid 13259] <... futex resumed> ) = 0
>>>> Write failed: Broken pipe
>>>>
>>>> NOTE: I captured the entire output from strace in a screen log.
>>>>
>>>> ====
>>>>
>>>> Joe Mervini
>>>> Sandia National Laboratories
>>>> High Performance Computing
>>>> 505.844.6770
>>>> [email protected]
>>>>
>>>>
>>>>
>>>>> On Aug 22, 2017, at 9:17 AM, Mervini, Joseph A <[email protected]> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I have not received a response to this posting but have continued to try
>>>>> and figure out why this problem persists.
>>>>>
>>>>> Since I initially opened the request I have been able to duplicate it on
>>>>> three different machines. I have also tried multiple kernel versions and
>>>>> lustre 2.8 client versions. I have also completely rebuilt my lustre file
>>>>> system with a newer version of 2.8 (2.8.0.9). The problem only presents
>>>>> itself when running a scan against the 2.8 lustre file system. A 2.5
>>>>> lustre file system works fine.
>>>>>
>>>>> I decided to run simultaneous scans against both file systems under
>>>>> valgrind, and although (again) the scan of the 2.5 file system completed,
>>>>> the system rebooted before the 2.8 scan finished. However, in both scans
>>>>> valgrind’s output was similar, with entries like this:
>>>>>
>>>>> ==8883== Thread 7:
>>>>> ==8883== Conditional jump or move depends on uninitialised value(s)
>>>>> ==8883== at 0x44B8E8: batch_insert_stripe_info (listmgr_stripe.c:168)
>>>>> ==8883== by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
>>>>> ==8883== by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
>>>>> ==8883== by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
>>>>> ==8883== by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
>>>>> ==8883== by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
>>>>> ==8883== by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
>>>>> ==8883==
>>>>> ==8883== Conditional jump or move depends on uninitialised value(s)
>>>>> ==8883== at 0x44B9C5: batch_insert_stripe_info (listmgr_stripe.c:168)
>>>>> ==8883== by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
>>>>> ==8883== by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
>>>>> ==8883== by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
>>>>> ==8883== by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
>>>>> ==8883== by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
>>>>> ==8883== by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
>>>>> ==8883==
>>>>> ==8883==
>>>>> ==8883== More than 10000000 total errors detected. I'm not reporting any
>>>>> more.
>>>>> ==8883== Final error counts will be inaccurate. Go fix your program!
>>>>> ==8883== Rerun with --error-limit=no to disable this cutoff. Note
>>>>> ==8883== that errors may occur in your program without prior warning from
>>>>> ==8883== Valgrind, because errors are no longer being displayed.
>>>>> ==8883==
>>>>> ====
>>>>>
>>>>>
>>>>> Is this considered normal?
>>>>>
>>>>>
>>>>> Joe Mervini
>>>>> Sandia National Laboratories
>>>>> High Performance Computing
>>>>> 505.844.6770
>>>>> [email protected]
>>>>>
>>>>>
>>>>>
>>>>>> On Jul 11, 2017, at 10:17 AM, Mervini, Joseph A <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a problem similar to
>>>>>> https://sourceforge.net/p/robinhood/mailman/message/35883907/ in which
>>>>>> the robinhood server running mariadb-5.5.52-1.el7.x86_64 and lustre
>>>>>> 2.8.0.8 client will reboot when the initial scan is run. I am running
>>>>>> this in a testbed environment prior to deployment on our production
>>>>>> system because I want to get a complete handle on it before I commit to
>>>>>> the deployment. I have 2 separate lustre file systems that I am running
>>>>>> against: One is a 408TB lustre 2.8 file system with ~16M inodes, the
>>>>>> other is a 204TB lustre 2.5.5 file system with ~3M inodes.
>>>>>>
>>>>>> The curious thing is that I had successfully scanned both file systems
>>>>>> independently on the system with everything working (including web-gui)
>>>>>> and then basically blew away the databases to get a datapoint on how the
>>>>>> system performed and the time it took if I ran a scan on both file
>>>>>> systems simultaneously. It appears that it is only impacting the 2.8
>>>>>> file system database. I just ran a fresh scan against the 2.5.5 file
>>>>>> system without a problem. I then started a new scan against the 2.8 file
>>>>>> system and once again it rebooted.
>>>>>>
>>>>>> Like the other support ticket above, when I ran the scan only on the 2.8
>>>>>> file system in debug mode it also reported messages similar to
>>>>>> “2017/07/10 15:44:58 [15191/6] FS_Scan | openat failed on
>>>>>> <parent_fd=18>/libippch.so: Too many levels of symbolic links”. I checked
>>>>>> a large number of the files that were being reported, and for the most
>>>>>> part they were library files with only a couple of symlinks to the .so
>>>>>> file in the same directory.
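Those messages are probably not real symlink loops: when openat(2) is called with O_NOFOLLOW (as the strace output elsewhere in this thread shows the scanner doing) and the final path component is a symlink, even a single valid symlink fails with ELOOP, which strerror() renders as "Too many levels of symbolic links". A small demonstration, using hypothetical library file names:

```python
import errno
import os
import tempfile

d = tempfile.mkdtemp()
real = os.path.join(d, "libexample.so.1")  # hypothetical library file
link = os.path.join(d, "libexample.so")    # a single, valid symlink to it
open(real, "w").close()
os.symlink("libexample.so.1", link)

err = None
try:
    os.open(link, os.O_RDONLY | os.O_NOFOLLOW)
except OSError as e:
    # A lone, resolvable symlink still fails under O_NOFOLLOW.
    err = e.errno

print(os.strerror(err))  # on Linux: "Too many levels of symbolic links"
```

So the warnings on those .so symlinks are expected noise from the O_NOFOLLOW open attempt, not evidence of file system corruption.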
>>>>>>
>>>>>> The only other thing of note that I was able to capture is this from the
>>>>>> console output:
>>>>>>
>>>>>> [ 3301.937577] LustreError:
>>>>>> 15209:0:(linux-module.c:92:obd_ioctl_getdata()) Version mismatch kernel
>>>>>> (10004) vs application (0)
>>>>>> [ 3301.950059] LustreError:
>>>>>> 15209:0:(class_obd.c:230:class_handle_ioctl()) OBD ioctl: data error"
>>>>>>
>>>>>> There was no indication of a fault in any of the log files, and I was
>>>>>> running top and htop during the process; neither CPU nor memory was
>>>>>> exhausted. Nor did I see anything suspicious happening on the file
>>>>>> system itself.
>>>>>>
>>>>>> Any help or clues as to why this is failing would be greatly
>>>>>> appreciated. Thanks in advance.
>>>>>> ====
>>>>>>
>>>>>> Joe Mervini
>>>>>> Sandia National Laboratories
>>>>>> High Performance Computing
>>>>>> 505.844.6770
>>>>>> [email protected]
>>>>>>
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> Check out the vibrant tech community on one of the world's most
>>>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>>>> _______________________________________________
>>>>>> robinhood-support mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/robinhood-support
>>>>>
>>>>
>>>
>>
>