Justin,

Yes - as a matter of fact I have copied the entire OS to the lustre file system to create a ton of real files. I am testing that hypothesis now.
Thanks so much for the suggestion! I will post my results.

====

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
[email protected]

> On Aug 23, 2017, at 10:03 AM, Justin Miller <[email protected]> wrote:
>
> Does the data you’re scanning include backups of OS root directories?
>
> We have seen the case where a scan causes a system to reboot with almost no
> explanation because the data being scanned included a copy of /dev from an OS
> root filesystem backup, and inside that copy of /dev was a character device
> file for a watchdog device handler. The Lustre client running the scan also
> had the modules loaded to handle the watchdog.
>
> The watchdog timer starts during the scan when path2fid does an open. The
> watchdog timer isn’t terminated until a special flag is specified, and Lustre
> doesn’t know about the flag, so the watchdog times out and reboots the system.
>
> - Justin Miller
>
>> On Aug 23, 2017, at 9:55 AM, Mervini, Joseph A <[email protected]> wrote:
>>
>> I just re-ran the scan using strace -f and this is the tail of the output up
>> until the machine rebooted:
>>
>> 2017/08/23 08:39:44 [13249/7] ListMgr | SQL query: COMMIT
>> [pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
>> [pid 13267] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>> [pid 13263] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>> [pid 13271] <... ioctl resumed> , 0x7f28351b9490) = 0
>> [pid 13257] <... write resumed> ) = 58
>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>> [pid 13271] newfstatat(10, "ompi-f77.pc", <unfinished ...>
>> [pid 13257] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>> [pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
>> [pid 13270] <... openat resumed> ) = 12
>> [pid 13257] <... futex resumed> ) = 1
>> [pid 13255] <... futex resumed> ) = 0
>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>> [pid 13270] ioctl(12, _IOC(_IOC_READ, 0x66, 0xad, 0x08), 0x7f284190a680 <unfinished ...>
>> [pid 13257] poll([{fd=15, events=POLLIN|POLLPRI}], 1, 0 <unfinished ...>
>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>> [pid 13270] <... ioctl resumed> ) = 0
>> [pid 13257] <... poll resumed> ) = 0 (Timeout)
>> [pid 13255] <... futex resumed> ) = 1
>> [pid 13254] <... futex resumed> ) = 1
>> [pid 13253] <... futex resumed> ) = 0
>> [pid 13268] <... futex resumed> ) = 0
>> [pid 13270] close(12 <unfinished ...>
>> [pid 13257] write(15, "\7\0\0\0\3COMMIT", 11 <unfinished ...>
>> [pid 13255] write(2, "2017/08/23 08:39:44 [13249/6] Li"..., 108 <unfinished ...>
>> 2017/08/23 08:39:44 [13249/6] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE id='0x200000405:0xd5b0:0x0'
>> [pid 13268] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>> [pid 13253] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>> [pid 13257] <... write resumed> ) = 11
>> [pid 13255] <... write resumed> ) = 108
>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>> [pid 13271] <... newfstatat resumed> {st_mode=S_IFREG|0640, st_size=665, ...}, AT_SYMLINK_NOFOLLOW) = 0
>> [pid 13257] read(15, <unfinished ...>
>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>> [pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
>> [pid 13271] openat(10, "ompi-f77.pc", O_RDONLY|O_NONBLOCK|O_NOFOLLOW|O_NOATIME <unfinished ...>
>> [pid 13266] <... futex resumed> ) = 0
>> [pid 13255] <... futex resumed> ) = 1
>> [pid 13254] write(2, "2017/08/23 08:39:44 [13249/5] Li"..., 1082017/08/23 08:39:44 [13249/5] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE id='0x200000405:0xcc68:0x0'
>>  <unfinished ...>
>> [pid 13266] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>> [pid 13254] <... write resumed> ) = 108
>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1) = 1
>> [pid 13270] <... close resumed> ) = 0
>> [pid 13259] <... futex resumed> ) = 0
>> Write failed: Broken pipe
>>
>> NOTE: I captured the entire output from strace in a screen log.
>>
>> ====
>>
>> Joe Mervini
>> Sandia National Laboratories
>> High Performance Computing
>> 505.844.6770
>> [email protected]
>>
>>> On Aug 22, 2017, at 9:17 AM, Mervini, Joseph A <[email protected]> wrote:
>>>
>>> Hello,
>>>
>>> I have not received a response to this posting but have continued to try
>>> and figure out why this problem persists.
>>>
>>> Since I initially opened the request I have been able to duplicate it on
>>> three different machines. I have also tried multiple kernel versions and
>>> lustre 2.8 client versions. I have also completely rebuilt my lustre file
>>> system with a newer version of 2.8 (2.8.0.9). The problem only presents
>>> itself when running a scan against the 2.8 lustre file system. A 2.5 lustre
>>> file system works fine.
>>>
>>> I decided to run the simultaneous scan against both file systems using
>>> valgrind, and although (again) the scan of the 2.5 file system completed,
>>> the system rebooted before the scan of the 2.8 file system completed.
>>> However, in both scans valgrind’s output was similar, with output like this:
>>>
>>> ==8883== Thread 7:
>>> ==8883== Conditional jump or move depends on uninitialised value(s)
>>> ==8883==    at 0x44B8E8: batch_insert_stripe_info (listmgr_stripe.c:168)
>>> ==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
>>> ==8883==    by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
>>> ==8883==    by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
>>> ==8883==    by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
>>> ==8883==    by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
>>> ==8883==    by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
>>> ==8883==
>>> ==8883== Conditional jump or move depends on uninitialised value(s)
>>> ==8883==    at 0x44B9C5: batch_insert_stripe_info (listmgr_stripe.c:168)
>>> ==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
>>> ==8883==    by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
>>> ==8883==    by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
>>> ==8883==    by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
>>> ==8883==    by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
>>> ==8883==    by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
>>> ==8883==
>>> ==8883==
>>> ==8883== More than 10000000 total errors detected. I'm not reporting any more.
>>> ==8883== Final error counts will be inaccurate. Go fix your program!
>>> ==8883== Rerun with --error-limit=no to disable this cutoff. Note
>>> ==8883== that errors may occur in your program without prior warning from
>>> ==8883== Valgrind, because errors are no longer being displayed.
>>> ==8883==
>>> ====
>>>
>>> Is this considered normal?
>>>
>>> Joe Mervini
>>> Sandia National Laboratories
>>> High Performance Computing
>>> 505.844.6770
>>> [email protected]
>>>
>>>> On Jul 11, 2017, at 10:17 AM, Mervini, Joseph A <[email protected]> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I have a problem similar to
>>>> https://sourceforge.net/p/robinhood/mailman/message/35883907/ in which the
>>>> robinhood server running mariadb-5.5.52-1.el7.x86_64 and lustre 2.8.0.8
>>>> client will reboot when the initial scan is run. I am running this in a
>>>> testbed environment prior to deployment on our production system because I
>>>> want to get a complete handle on it before I commit to the deployment. I
>>>> have 2 separate lustre file systems that I am running against: One is a
>>>> 408TB lustre 2.8 file system with ~16M inodes, the other is a 204TB lustre
>>>> 2.5.5 file system with ~3M inodes.
>>>>
>>>> The curious thing is that I had successfully scanned both file systems
>>>> independently on the system with everything working (including web-gui)
>>>> and then basically blew away the databases to get a datapoint on how the
>>>> system performed and the time it took if I ran a scan on both file systems
>>>> simultaneously. It appears that it is only impacting the 2.8 file system
>>>> database. I just ran a fresh scan against the 2.5.5 file system without
>>>> problem. I then started a new scan against the 2.8 file system and once
>>>> again it rebooted.
>>>>
>>>> Like the other support ticket above, when I ran the scan only on the 2.8
>>>> file system in debug mode it also reported messages similar to “2017/07/10
>>>> 15:44:58 [15191/6] FS_Scan | openat failed on <parent_fd=18>/libippch.so:
>>>> Too many levels of symbolic links”. I checked a large number of the files
>>>> that were being reported and for the most part they were library files
>>>> with only a couple of symlinks to the .so file in the same directory.
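
A note on the "Too many levels of symbolic links" messages quoted above: the strace output elsewhere in this thread shows the scanner calling openat() with O_NOFOLLOW, and on Linux open(2) documents ELOOP as the error returned when the final path component is a symlink and O_NOFOLLOW is set. So these messages may simply mean that the reported .so entries are symlinks, not that a link loop exists. Whether that is what robinhood hit here is an assumption on my part. A minimal sketch that reproduces the error outside of Lustre (the file names are hypothetical):

```python
import errno
import os
import tempfile


def open_nofollow_errno(path):
    """Open path read-only with O_NOFOLLOW.

    Returns None on success, or the symbolic errno name (e.g. "ELOOP")
    if the open fails.
    """
    try:
        fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical names standing in for a library and its symlink.
        target = os.path.join(tmp, "libexample.so.1")
        link = os.path.join(tmp, "libexample.so")
        open(target, "w").close()
        os.symlink("libexample.so.1", link)
        print("regular file:", open_nofollow_errno(target))
        print("symlink:     ", open_nofollow_errno(link))
```

If this reading is right, the messages are noise from symlinked libraries and are unlikely to be related to the reboot.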
>>>>
>>>> The only other thing of note that I was able to capture is this from the
>>>> console output:
>>>>
>>>> [ 3301.937577] LustreError: 15209:0:(linux-module.c:92:obd_ioctl_getdata()) Version mismatch kernel (10004) vs application (0)
>>>> [ 3301.950059] LustreError: 15209:0:(class_obd.c:230:class_handle_ioctl()) OBD ioctl: data error
>>>>
>>>> There was no indication of a fault in any of the log files and I was
>>>> running top and htop during the process and neither CPU nor memory was
>>>> exhausted. Nor did I see anything suspicious happening on the file system
>>>> itself.
>>>>
>>>> Any help or clues as to why this is failing would be greatly appreciated.
>>>> Thanks in advance.
>>>> ====
>>>>
>>>> Joe Mervini
>>>> Sandia National Laboratories
>>>> High Performance Computing
>>>> 505.844.6770
>>>> [email protected]
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Check out the vibrant tech community on one of the world's most
>>>> engaging tech sites, Slashdot.org!
>>>> http://sdm.link/slashdot
>>>> _______________________________________________
>>>> robinhood-support mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/robinhood-support
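
For anyone who wants to check Justin's hypothesis before starting a scan: as I understand the Linux watchdog driver API, the timer is armed as soon as the device file is opened, and many drivers only disarm it when the magic character 'V' is written before close. A copied /dev tree on the file system being scanned can therefore be dangerous to anything that opens every entry. The sketch below walks a tree using lstat() only, so it never opens a device file, and lists any character or block device nodes it finds. The function names and the mount point in the usage line are hypothetical, and this is not part of robinhood.

```python
import os
import stat
import sys


def is_device_mode(mode):
    """Return True if an st_mode value describes a character or block device."""
    return stat.S_ISCHR(mode) or stat.S_ISBLK(mode)


def find_device_nodes(root):
    """Yield (path, major, minor) for every device node under root.

    Only lstat() is used, so no device file is ever opened and no
    symlink is followed.
    """
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if is_device_mode(st.st_mode):
                yield path, os.major(st.st_rdev), os.minor(st.st_rdev)


if __name__ == "__main__":
    # Usage (hypothetical mount point): python find_devs.py /mnt/lustre
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, major, minor in find_device_nodes(root):
        print(f"{path} (major {major}, minor {minor})")
```

On Linux, /dev/watchdog is conventionally a misc character device (major 10, minor 130), so an entry with those numbers inside a backup directory would be a strong candidate for the reboot described above. Excluding or removing such directories before the scan is one way to test the theory.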
