I added the following block to my config file to deal with the watchdog
problem:
FS_Scan {
    Ignore {
        path == "**/dev/watchdog"
        or path == "**/dev/watchdog0"
    }
}
====
Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
[email protected]
> On Aug 23, 2017, at 11:50 AM, Mervini, Joseph A <[email protected]> wrote:
>
> I am able to verify that the cause of the failures was in fact
> /dev/watchdog and /dev/watchdog0 existing on the lustre file system. The
> problem was easily duplicated by confining the scan to the dev directory:
> in each case, before the individual files were removed, the system would
> reboot. After removing the files the scan ran to completion.
>
> Thanks again to Justin Miller for the information and explanation.
> ====
>
> Joe Mervini
> Sandia National Laboratories
> High Performance Computing
> 505.844.6770
> [email protected]
>
>
>
>> On Aug 23, 2017, at 10:28 AM, Mervini, Joseph A <[email protected]> wrote:
>>
>> Justin
>>
>> Yes - as a matter of fact I have copied the entire OS to the lustre file
>> system to create a ton of real files. I am testing that hypothesis now.
>>
>> Thanks so much for the suggestion! I will post my results.
>> ====
>>
>> Joe Mervini
>> Sandia National Laboratories
>> High Performance Computing
>> 505.844.6770
>> [email protected]
>>
>>
>>
>>> On Aug 23, 2017, at 10:03 AM, Justin Miller <[email protected]> wrote:
>>>
>>> Does the data you’re scanning include backups of OS root directories?
>>>
>>> We have seen the case where a scan causes a system to reboot with almost no
>>> explanation because the data being scanned included a copy of /dev from an
>>> OS root filesystem backup, and inside that copy of /dev was a character
>>> device file for a watchdog device handler. The Lustre client running the
>>> scan also had the modules loaded to handle the watchdog.
>>>
>>> The watchdog timer starts during the scan when path2fid does an open. The
>>> watchdog timer isn’t terminated until a special flag is specified, and
>>> Lustre doesn’t know about the flag, so the watchdog times out and reboots
>>> the system.
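For context, that matches how the Linux watchdog interface works: open(2) on the device arms the timer, and on drivers that support "magic close" the only way to disarm it is to write the character 'V' immediately before close(2). A minimal sketch of the magic-close sequence, using a temporary regular file as a safe stand-in for /dev/watchdog (opening the real device on a live node would arm the hardware timer):

```python
import os
import tempfile

# Stand-in path; in real use this would be /dev/watchdog, where open()
# arms the hardware timer. A plain file keeps the sketch safe to run.
watchdog_path = os.path.join(tempfile.mkdtemp(), "watchdog")

fd = os.open(watchdog_path, os.O_WRONLY | os.O_CREAT, 0o600)
try:
    # "Magic close": writing 'V' right before close tells a supporting
    # driver to disable the timer. A scanner that merely opens and closes
    # the device never writes this, so the timer it armed keeps counting
    # down and eventually reboots the node.
    os.write(fd, b"V")
finally:
    os.close(fd)
```

This is why a metadata scan touching a watchdog device node can reboot a machine with no error in any log: the open itself is harmless-looking, and the reboot happens later when the unattended timer expires.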
>>>
>>> - Justin Miller
>>>
>>>> On Aug 23, 2017, at 9:55 AM, Mervini, Joseph A <[email protected]> wrote:
>>>>
>>>> I just re-ran the scan using strace -f and this is the tail of the output
>>>> up until the machine rebooted:
>>>>
>>>> 2017/08/23 08:39:44 [13249/7] ListMgr | SQL query: COMMIT
>>>> [pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily
>>>> unavailable)
>>>> [pid 13267] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13263] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13271] <... ioctl resumed> , 0x7f28351b9490) = 0
>>>> [pid 13257] <... write resumed> ) = 58
>>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13271] newfstatat(10, "ompi-f77.pc", <unfinished ...>
>>>> [pid 13257] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>> [pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily
>>>> unavailable)
>>>> [pid 13270] <... openat resumed> ) = 12
>>>> [pid 13257] <... futex resumed> ) = 1
>>>> [pid 13255] <... futex resumed> ) = 0
>>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>> [pid 13270] ioctl(12, _IOC(_IOC_READ, 0x66, 0xad, 0x08), 0x7f284190a680
>>>> <unfinished ...>
>>>> [pid 13257] poll([{fd=15, events=POLLIN|POLLPRI}], 1, 0 <unfinished ...>
>>>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>> [pid 13270] <... ioctl resumed> ) = 0
>>>> [pid 13257] <... poll resumed> ) = 0 (Timeout)
>>>> [pid 13255] <... futex resumed> ) = 1
>>>> [pid 13254] <... futex resumed> ) = 1
>>>> [pid 13253] <... futex resumed> ) = 0
>>>> [pid 13268] <... futex resumed> ) = 0
>>>> [pid 13270] close(12 <unfinished ...>
>>>> [pid 13257] write(15, "\7\0\0\0\3COMMIT", 11 <unfinished ...>
>>>> [pid 13255] write(2, "2017/08/23 08:39:44 [13249/6] Li"..., 108
>>>> <unfinished ...>
>>>> 2017/08/23 08:39:44 [13249/6] ListMgr | SQL query: SELECT id FROM ENTRIES
>>>> WHERE id='0x200000405:0xd5b0:0x0'
>>>> [pid 13268] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13253] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13257] <... write resumed> ) = 11
>>>> [pid 13255] <... write resumed> ) = 108
>>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13271] <... newfstatat resumed> {st_mode=S_IFREG|0640, st_size=665,
>>>> ...}, AT_SYMLINK_NOFOLLOW) = 0
>>>> [pid 13257] read(15, <unfinished ...>
>>>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>> [pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily
>>>> unavailable)
>>>> [pid 13271] openat(10, "ompi-f77.pc",
>>>> O_RDONLY|O_NONBLOCK|O_NOFOLLOW|O_NOATIME <unfinished ...>
>>>> [pid 13266] <... futex resumed> ) = 0
>>>> [pid 13255] <... futex resumed> ) = 1
>>>> [pid 13254] write(2, "2017/08/23 08:39:44 [13249/5] Li"..., 1082017/08/23
>>>> 08:39:44 [13249/5] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE
>>>> id='0x200000405:0xcc68:0x0'
>>>> <unfinished ...>
>>>> [pid 13266] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
>>>> ...>
>>>> [pid 13254] <... write resumed> ) = 108
>>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1) = 1
>>>> [pid 13270] <... close resumed> ) = 0
>>>> [pid 13259] <... futex resumed> ) = 0
>>>> Write failed: Broken pipe
>>>>
>>>> NOTE: I captured the entire output from strace in a screen log.
>>>>
>>>> ====
>>>>
>>>> Joe Mervini
>>>> Sandia National Laboratories
>>>> High Performance Computing
>>>> 505.844.6770
>>>> [email protected]
>>>>
>>>>
>>>>
>>>>> On Aug 22, 2017, at 9:17 AM, Mervini, Joseph A <[email protected]> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I have not received a response to this posting but have continued to try
>>>>> and figure out why this problem persists.
>>>>>
>>>>> Since I initially opened the request I have been able to duplicate it on
>>>>> three different machines. I have also tried multiple kernel versions and
>>>>> lustre 2.8 client versions. I have also completely rebuilt my lustre file
>>>>> system with a newer version of 2.8 (2.8.0.9). The problem only presents
>>>>> itself when running a scan against the 2.8 lustre file system. A 2.5
>>>>> lustre file system works fine.
>>>>>
>>>>> I decided to run simultaneous scans against both file systems under
>>>>> valgrind, and although (again) the scan of the 2.5 file system completed,
>>>>> the system rebooted before the 2.8 scan finished. However, in both scans
>>>>> valgrind’s output was similar, with entries like this:
>>>>>
>>>>> ==8883== Thread 7:
>>>>> ==8883== Conditional jump or move depends on uninitialised value(s)
>>>>> ==8883== at 0x44B8E8: batch_insert_stripe_info (listmgr_stripe.c:168)
>>>>> ==8883== by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
>>>>> ==8883== by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
>>>>> ==8883== by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
>>>>> ==8883== by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
>>>>> ==8883== by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
>>>>> ==8883== by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
>>>>> ==8883==
>>>>> ==8883== Conditional jump or move depends on uninitialised value(s)
>>>>> ==8883== at 0x44B9C5: batch_insert_stripe_info (listmgr_stripe.c:168)
>>>>> ==8883== by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
>>>>> ==8883== by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
>>>>> ==8883== by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
>>>>> ==8883== by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
>>>>> ==8883== by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
>>>>> ==8883== by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
>>>>> ==8883==
>>>>> ==8883==
>>>>> ==8883== More than 10000000 total errors detected. I'm not reporting any
>>>>> more.
>>>>> ==8883== Final error counts will be inaccurate. Go fix your program!
>>>>> ==8883== Rerun with --error-limit=no to disable this cutoff. Note
>>>>> ==8883== that errors may occur in your program without prior warning from
>>>>> ==8883== Valgrind, because errors are no longer being displayed.
>>>>> ==8883==
>>>>> ====
>>>>>
>>>>>
>>>>> Is this considered normal?
>>>>>
>>>>>
>>>>> Joe Mervini
>>>>> Sandia National Laboratories
>>>>> High Performance Computing
>>>>> 505.844.6770
>>>>> [email protected]
>>>>>
>>>>>
>>>>>
>>>>>> On Jul 11, 2017, at 10:17 AM, Mervini, Joseph A <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a problem similar to
>>>>>> https://sourceforge.net/p/robinhood/mailman/message/35883907/ in which
>>>>>> the robinhood server running mariadb-5.5.52-1.el7.x86_64 and lustre
>>>>>> 2.8.0.8 client will reboot when the initial scan is run. I am running
>>>>>> this in a testbed environment prior to deployment on our production
>>>>>> system because I want to get a complete handle on it before I commit to
>>>>>> the deployment. I have 2 separate lustre file systems that I am running
>>>>>> against: One is a 408TB lustre 2.8 file system with ~16M inodes, the
>>>>>> other is a 204TB lustre 2.5.5 file system with ~3M inodes.
>>>>>>
>>>>>> The curious thing is that I had successfully scanned both file systems
>>>>>> independently on the system with everything working (including web-gui)
>>>>>> and then basically blew away the databases to get a datapoint on how the
>>>>>> system performed and the time it took if I ran a scan on both file
>>>>>> systems simultaneously. It appears that it is only impacting the 2.8
>>>>>> file system database. I just ran a fresh scan against the 2.5.5 file
>>>>>> system without a problem. I then started a new scan against the 2.8 file
>>>>>> system and once again it rebooted.
>>>>>>
>>>>>> Like the other support ticket above, when I ran the scan only on the 2.8
>>>>>> file system in debug mode it also reported messages similar to
>>>>>> “2017/07/10 15:44:58 [15191/6] FS_Scan | openat failed on
>>>>>> <parent_fd=18>/libippch.so: Too many levels of symbolic links”. I checked
>>>>>> a large number of the files that were being reported, and for the most
>>>>>> part they were library files with only a couple of symlinks to the .so
>>>>>> file in the same directory.
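Those messages are probably not real symlink loops: when openat(2) is called with O_NOFOLLOW (as the strace output elsewhere in this thread shows the scanner doing) and the final path component is a symlink, even a single valid symlink fails with ELOOP, which strerror() renders as "Too many levels of symbolic links". A small demonstration, using hypothetical library file names:

```python
import errno
import os
import tempfile

d = tempfile.mkdtemp()
real = os.path.join(d, "libexample.so.1")  # hypothetical library file
link = os.path.join(d, "libexample.so")    # a single, valid symlink to it
open(real, "w").close()
os.symlink("libexample.so.1", link)

err = None
try:
    os.open(link, os.O_RDONLY | os.O_NOFOLLOW)
except OSError as e:
    # A lone, resolvable symlink still fails under O_NOFOLLOW.
    err = e.errno

print(os.strerror(err))  # on Linux: "Too many levels of symbolic links"
```

So the warnings on those .so symlinks are expected noise from the O_NOFOLLOW open attempt, not evidence of file system corruption.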
>>>>>>
>>>>>> The only other thing of note that I was able to capture is this from the
>>>>>> console output:
>>>>>>
>>>>>> [ 3301.937577] LustreError:
>>>>>> 15209:0:(linux-module.c:92:obd_ioctl_getdata()) Version mismatch kernel
>>>>>> (10004) vs application (0)
>>>>>> [ 3301.950059] LustreError:
>>>>>> 15209:0:(class_obd.c:230:class_handle_ioctl()) OBD ioctl: data error"
>>>>>>
>>>>>> There was no indication of a fault in any of the log files, and I was
>>>>>> running top and htop during the process; neither CPU nor memory was
>>>>>> exhausted. Nor did I see anything suspicious happening on the file
>>>>>> system itself.
>>>>>>
>>>>>> Any help or clues as to why this is failing would be greatly
>>>>>> appreciated. Thanks in advance.
>>>>>> ====
>>>>>>
>>>>>> Joe Mervini
>>>>>> Sandia National Laboratories
>>>>>> High Performance Computing
>>>>>> 505.844.6770
>>>>>> [email protected]
>>>>>>
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> Check out the vibrant tech community on one of the world's most
>>>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>>>> _______________________________________________
>>>>>> robinhood-support mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/robinhood-support
>>>>>
>>>>
>>>
>>
>