Justin,

Yes - as a matter of fact I have copied the entire OS to the lustre file system 
to create a ton of real files. I am testing that hypothesis now.

Thanks so much for the suggestion! I will post my results.
====

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
[email protected]



> On Aug 23, 2017, at 10:03 AM, Justin Miller <[email protected]> wrote:
> 
> Does the data you’re scanning include backups of OS root directories? 
> 
> We have seen the case where a scan causes a system to reboot with almost no 
> explanation because the data being scanned included a copy of /dev from an OS 
> root filesystem backup, and inside that copy of /dev was a character device 
> file for a watchdog device handler. The Lustre client running the scan also 
> had the modules loaded to handle the watchdog.
> 
> The watchdog timer starts during the scan when path2fid does an open. The 
> watchdog timer isn’t terminated until a special flag is specified, and Lustre 
> doesn’t know about the flag, so the watchdog times out and reboots the system.
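[Editor's note: the "special flag" Justin mentions is the Linux watchdog driver's "magic close" convention: opening the device node arms the timer, and unless the character 'V' is written immediately before close(), the driver treats the close as a crash and the timeout reboots the node. A scan that simply open()s and close()s a stray /dev/watchdog node never writes 'V'. A minimal sketch, using a scratch file as a stand-in for /dev/watchdog so it is safe to run:]

```python
# Sketch of the Linux watchdog "magic close" convention: writing 'V'
# immediately before close() tells the driver to disarm the timer.
# A scanner that merely open()s and close()s the device node (as the
# robinhood scan does when it stats/opens entries) never writes 'V',
# so the armed timer expires and the node reboots.
# A scratch file stands in for /dev/watchdog so this sketch is safe.
import os, tempfile

dev = os.path.join(tempfile.mkdtemp(), "fake_watchdog")  # stand-in path

fd = os.open(dev, os.O_WRONLY | os.O_CREAT)
# ... a real watchdog daemon would periodically os.write(fd, b"\0")
#     here to "pet" the timer and keep it from expiring ...
os.write(fd, b"V")   # magic close: disarm the timer before closing
os.close(fd)

with open(dev, "rb") as f:
    print(f.read())  # b'V'
```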
> 
> - Justin Miller
> 
>> On Aug 23, 2017, at 9:55 AM, Mervini, Joseph A <[email protected]> wrote:
>> 
>> I just re-ran the scan using strace -f and this is the tail of the output up 
>> until the machine rebooted:
>> 
>> 2017/08/23 08:39:44 [13249/7] ListMgr | SQL query: COMMIT
>> [pid 13254] <... futex resumed> )       = -1 EAGAIN (Resource temporarily 
>> unavailable)
>> [pid 13267] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>> ...>
>> [pid 13263] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>> ...>
>> [pid 13271] <... ioctl resumed> , 0x7f28351b9490) = 0
>> [pid 13257] <... write resumed> )       = 58
>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>> ...>
>> [pid 13271] newfstatat(10, "ompi-f77.pc",  <unfinished ...>
>> [pid 13257] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>> [pid 13254] <... futex resumed> )       = -1 EAGAIN (Resource temporarily 
>> unavailable)
>> [pid 13270] <... openat resumed> )      = 12
>> [pid 13257] <... futex resumed> )       = 1
>> [pid 13255] <... futex resumed> )       = 0
>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>> [pid 13270] ioctl(12, _IOC(_IOC_READ, 0x66, 0xad, 0x08), 0x7f284190a680 
>> <unfinished ...>
>> [pid 13257] poll([{fd=15, events=POLLIN|POLLPRI}], 1, 0 <unfinished ...>
>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>> [pid 13270] <... ioctl resumed> )       = 0
>> [pid 13257] <... poll resumed> )        = 0 (Timeout)
>> [pid 13255] <... futex resumed> )       = 1
>> [pid 13254] <... futex resumed> )       = 1
>> [pid 13253] <... futex resumed> )       = 0
>> [pid 13268] <... futex resumed> )       = 0
>> [pid 13270] close(12 <unfinished ...>
>> [pid 13257] write(15, "\7\0\0\0\3COMMIT", 11 <unfinished ...>
>> [pid 13255] write(2, "2017/08/23 08:39:44 [13249/6] Li"..., 108 <unfinished 
>> ...>
>> 2017/08/23 08:39:44 [13249/6] ListMgr | SQL query: SELECT id FROM ENTRIES 
>> WHERE id='0x200000405:0xd5b0:0x0'
>> [pid 13268] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>> ...>
>> [pid 13253] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>> ...>
>> [pid 13257] <... write resumed> )       = 11
>> [pid 13255] <... write resumed> )       = 108
>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>> ...>
>> [pid 13271] <... newfstatat resumed> {st_mode=S_IFREG|0640, st_size=665, 
>> ...}, AT_SYMLINK_NOFOLLOW) = 0
>> [pid 13257] read(15,  <unfinished ...>
>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>> [pid 13254] <... futex resumed> )       = -1 EAGAIN (Resource temporarily 
>> unavailable)
>> [pid 13271] openat(10, "ompi-f77.pc", 
>> O_RDONLY|O_NONBLOCK|O_NOFOLLOW|O_NOATIME <unfinished ...>
>> [pid 13266] <... futex resumed> )       = 0
>> [pid 13255] <... futex resumed> )       = 1
>> [pid 13254] write(2, "2017/08/23 08:39:44 [13249/5] Li"..., 1082017/08/23 
>> 08:39:44 [13249/5] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE 
>> id='0x200000405:0xcc68:0x0'
>> <unfinished ...>
>> [pid 13266] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>> ...>
>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>> ...>
>> [pid 13254] <... write resumed> )       = 108
>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1) = 1
>> [pid 13270] <... close resumed> )       = 0
>> [pid 13259] <... futex resumed> )       = 0
>> Write failed: Broken pipe
>> 
>> NOTE: I captured the entire output from strace in a screen log.
>> 
>> ====
>> 
>> Joe Mervini
>> Sandia National Laboratories
>> High Performance Computing
>> 505.844.6770
>> [email protected]
>> 
>> 
>> 
>>> On Aug 22, 2017, at 9:17 AM, Mervini, Joseph A <[email protected]> wrote:
>>> 
>>> Hello,
>>> 
>>> I have not received a response to this posting but have continued to try 
>>> and figure out why this problem persists. 
>>> 
>>> Since I initially opened the request I have been able to duplicate it on 
>>> three different machines. I have also tried multiple kernel versions and 
>>> lustre 2.8 client versions, and have completely rebuilt my lustre file 
>>> system with a newer version of 2.8 (2.8.0.9). The problem only presents 
>>> itself when running a scan against the 2.8 lustre file system; a 2.5 
>>> lustre file system works fine.
>>> 
>>> I decided to run simultaneous scans against both file systems under 
>>> valgrind, and although (again) the scan of the 2.5 file system completed, 
>>> the system rebooted before the 2.8 scan finished. However, in both scans 
>>> valgrind’s output was similar, with messages like this:
>>> 
>>> ==8883== Thread 7:
>>> ==8883== Conditional jump or move depends on uninitialised value(s)
>>> ==8883==    at 0x44B8E8: batch_insert_stripe_info (listmgr_stripe.c:168)
>>> ==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
>>> ==8883==    by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
>>> ==8883==    by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
>>> ==8883==    by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
>>> ==8883==    by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
>>> ==8883==    by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
>>> ==8883==
>>> ==8883== Conditional jump or move depends on uninitialised value(s)
>>> ==8883==    at 0x44B9C5: batch_insert_stripe_info (listmgr_stripe.c:168)
>>> ==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
>>> ==8883==    by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
>>> ==8883==    by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
>>> ==8883==    by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
>>> ==8883==    by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
>>> ==8883==    by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
>>> ==8883==
>>> ==8883==
>>> ==8883== More than 10000000 total errors detected.  I'm not reporting any 
>>> more.
>>> ==8883== Final error counts will be inaccurate.  Go fix your program!
>>> ==8883== Rerun with --error-limit=no to disable this cutoff.  Note
>>> ==8883== that errors may occur in your program without prior warning from
>>> ==8883== Valgrind, because errors are no longer being displayed.
>>> ==8883== 
>>> ====
>>> 
>>> 
>>> Is this considered normal?
>>> 
>>> 
>>> Joe Mervini
>>> Sandia National Laboratories
>>> High Performance Computing
>>> 505.844.6770
>>> [email protected]
>>> 
>>> 
>>> 
>>>> On Jul 11, 2017, at 10:17 AM, Mervini, Joseph A <[email protected]> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> I have a problem similar to 
>>>> https://sourceforge.net/p/robinhood/mailman/message/35883907/ in which the 
>>>> robinhood server running mariadb-5.5.52-1.el7.x86_64 and lustre 2.8.0.8 
>>>> client will reboot when the initial scan is run. I am running this in a 
>>>> testbed environment prior to deployment on our production system because I 
>>>> want to get a complete handle on it before I commit to the deployment. I 
>>>> have two separate lustre file systems that I am running against: one is a 
>>>> 408TB lustre 2.8 file system with ~16M inodes, the other is a 204TB lustre 
>>>> 2.5.5 file system with ~3M inodes. 
>>>> 
>>>> The curious thing is that I had successfully scanned both file systems 
>>>> independently on the system with everything working (including web-gui) 
>>>> and then basically blew away the databases to get a datapoint on how the 
>>>> system performed and the time it took if I ran a scan on both file systems 
>>>> simultaneously. It appears that only the 2.8 file system database is 
>>>> affected: I just ran a fresh scan against the 2.5.5 file system without a 
>>>> problem, then started a new scan against the 2.8 file system and once 
>>>> again the machine rebooted.
>>>> 
>>>> Like the other support ticket above, when I ran the scan only on the 2.8 
>>>> file system in debug mode it also reported messages similar to “2017/07/10 
>>>> 15:44:58 [15191/6] FS_Scan | openat failed on <parent_fd=18>/libippch.so: 
>>>> Too many levels of symbolic links”. I checked a large number of the files 
>>>> being reported, and for the most part they were library files with only a 
>>>> couple of symlinks to the .so file in the same directory.
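[Editor's note: that error text is misleading by design. As the later strace output in this thread shows, the scanner opens entries with O_NOFOLLOW, and open() with O_NOFOLLOW fails with ELOOP on any symlink, even a single-hop one; strerror(ELOOP) happens to read "Too many levels of symbolic links". A quick sketch with illustrative scratch paths:]

```python
# Why ordinary one-hop library symlinks produce "Too many levels of
# symbolic links": open() with O_NOFOLLOW fails with ELOOP on *any*
# symlink, and ELOOP's message text mentions "levels" even when there
# is only one. The paths below are scratch names for illustration.
import errno, os, tempfile

d = tempfile.mkdtemp()
target = os.path.join(d, "libfoo.so.1")
link = os.path.join(d, "libfoo.so")          # single-hop symlink
open(target, "w").close()
os.symlink("libfoo.so.1", link)

try:
    os.open(link, os.O_RDONLY | os.O_NOFOLLOW)
except OSError as e:
    assert e.errno == errno.ELOOP            # not actually a symlink loop
    print(os.strerror(errno.ELOOP))
```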
>>>> 
>>>> The only other thing of note that I was able to capture is this from the 
>>>> console output:
>>>> 
>>>> [ 3301.937577] LustreError: 
>>>> 15209:0:(linux-module.c:92:obd_ioctl_getdata()) Version mismatch kernel 
>>>> (10004) vs application (0) 
>>>> [ 3301.950059] LustreError: 15209:0:(class_obd.c:230:class_handle_ioctl()) 
>>>> OBD ioctl: data error"
>>>> 
>>>> There was no indication of a fault in any of the log files, and I was 
>>>> running top and htop during the process; neither CPU nor memory was 
>>>> exhausted. Nor did I see anything suspicious happening on the file system 
>>>> itself. 
>>>> 
>>>> Any help or clues as to why this is failing would be greatly appreciated. 
>>>> Thanks in advance.
>>>> ====
>>>> 
>>>> Joe Mervini
>>>> Sandia National Laboratories
>>>> High Performance Computing
>>>> 505.844.6770
>>>> [email protected]
>>>> 
>>>> 
>>>> 
>>>> ------------------------------------------------------------------------------
>>>> Check out the vibrant tech community on one of the world's most
>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>> _______________________________________________
>>>> robinhood-support mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/robinhood-support
>>> 
>> 
> 
