Please read the Kaboom chapter of the manual. It explains how to manually run the program under the debugger. I believe you left off the -s -f options when running it, so the traceback doesn't contain any useful information.
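A minimal sketch of such a run, assuming the binary and configuration paths from the session quoted below. The command file name sd-crash.gdb is made up for illustration, and the Kaboom chapter remains the authoritative procedure. As I understand the options, -f keeps the SD in the foreground and -s stops it from handling signals itself, so gdb still has a stack to show when the SIGSEGV arrives:

```shell
#!/bin/sh
# Paths as they appear in the quoted session; adjust for your install.
SD_BIN=/usr/sbin/bacula-sd
SD_CONF=/etc/bacula/bacula-sd.conf

# gdb command file (hypothetical name): start the SD, and once it stops
# on the fault, print a backtrace of every thread.
cat > sd-crash.gdb <<'EOF'
run
thread apply all bt
EOF

# Start the SD under gdb rather than attaching to a running daemon.
# -f: stay in the foreground, -s: no signal handling by the SD,
# -c: configuration file.
if command -v gdb >/dev/null 2>&1 && [ -x "$SD_BIN" ]; then
    gdb -batch -x sd-crash.gdb --args "$SD_BIN" -s -f -c "$SD_CONF"
else
    echo "gdb or $SD_BIN not found; nothing run" >&2
fi
```

Run the failing job from bconsole while this is active; the backtrace is printed when the SD stops on the segmentation fault.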
For valgrind, I don't know what options to use; please read their documentation.

For improving the output from smartalloc, find all the sm_check() calls in the SD and modify them to have true as the last argument. You can also add more sm_check() calls at various strategic places where you think the code breaks to get closer to the problem.

On Sunday 16 September 2007 02:36, Marc Schiffbauer wrote:
> * Kern Sibbald wrote on 15.09.07 at 22:28:
> > On Saturday 15 September 2007 21:09, Marc Schiffbauer wrote:
> > > What can I do/try/test now?
>
> I played with "Minimum Block Size" a bit. When set this seems to
> make the SD crash even on labeling tapes. However I now removed that
> setting again and labeling works fine so far.
>
> I now erased a tape (mt erase), then labeled it using bconsole.
>
> > - Run the debugger on it, and make it crash in some different way, maybe
> > that will tell us something.
>
> I have a little incremental job (about 700MB). It runs just fine, but
> at the very end of the job the SD crashed. I ran the SD with gdb
> attached.
>
> This is the output on the console:
>
> In "Terminated Jobs" the job is "OK"
>
> Terminated Jobs:
>  JobId  Level  Files    Bytes  Status  Finished         Name
> ======================================================================
> [...]
>    535  Incr     114  740.2 M  OK      16-Sep-07 01:31  lisa-ImportantData
>
> but the summary is not:
>
> 16-Sep 01:31 lisa-sd: Job write elapsed time = 00:02:36, Transfer rate = 4.745 M bytes/second
> 16-Sep 01:31 lisa-sd: Sending spooled attrs to the Director. Despooling 29,191 bytes ...
> 16-Sep 01:31 lisa-dir: lisa-ImportantData.2007-09-16_01.28.17 Error: Bacula lisa-dir 2.2.4 (14Sep07): 16-Sep-2007 01:31:31
>   Build OS:               i386-pc-linux-gnu debian 3.1
>   JobId:                  535
>   Job:                    lisa-ImportantData.2007-09-16_01.28.17
>   Backup Level:           Incremental, since=2007-09-14 03:06:16
>   Client:                 "lisa-fd" 2.2.4 (14Sep07) i386-pc-linux-gnu,debian,3.1
>   FileSet:                "lisa ImportantData FileSet" 2007-03-29 13:23:26
>   Pool:                   "Default" (From Job resource)
>   Storage:                "lisa-sd" (From command line)
>   Scheduled time:         16-Sep-2007 01:27:55
>   Start time:             16-Sep-2007 01:28:50
>   End time:               16-Sep-2007 01:31:31
>   Elapsed time:           2 mins 41 secs
>   Priority:               10
>   FD Files Written:       114
>   SD Files Written:       0
>   FD Bytes Written:       740,281,859 (740.2 MB)
>   SD Bytes Written:       0 (0 B)
>   Rate:                   4598.0 KB/s
>   Software Compression:   None
>   VSS:                    no
>   Encryption:             no
>   Volume name(s):         Tape_13
>   Volume Session Id:      2
>   Volume Session Time:    1189898149
>   Last Volume Bytes:      740,920,320 (740.9 MB)
>   Non-fatal FD errors:    0
>   SD Errors:              0
>   FD termination status:  OK
>   SD termination status:  Error
>   Termination:            *** Backup Error ***
>
>
> Maybe I did something wrong, but I got no backtrace... :
>
> [EMAIL PROTECTED]:~# gdb /usr/sbin/bacula-sd 7791
> GNU gdb 6.3-debian
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "i386-linux"...Using host libthread_db library "/lib/libthread_db.so.1".
>
> Attaching to program: /usr/sbin/bacula-sd, process 7791
> Reading symbols from /lib/libacl.so.1...done.
> Loaded symbols for /lib/libacl.so.1
> Reading symbols from /usr/lib/libz.so.1...done.
> Loaded symbols for /usr/lib/libz.so.1
> Reading symbols from /usr/lib/libpython2.3.so.1.0...done.
> Loaded symbols for /usr/lib/libpython2.3.so.1.0
> Reading symbols from /lib/libutil.so.1...done.
> Loaded symbols for /lib/libutil.so.1
> Reading symbols from /lib/librt.so.1...done.
> Loaded symbols for /lib/librt.so.1
> Reading symbols from /lib/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 16384 (LWP 7791)]
> [New Thread 32769 (LWP 7793)]
> [New Thread 16386 (LWP 7794)]
>
>
> [New Thread 32771 (LWP 7795)]
> Loaded symbols for /lib/libpthread.so.0
> Reading symbols from /lib/libdl.so.2...done.
> Loaded symbols for /lib/libdl.so.2
> Reading symbols from /lib/libwrap.so.0...done.
> Loaded symbols for /lib/libwrap.so.0
> Reading symbols from /usr/lib/i686/cmov/libssl.so.0.9.7...done.
> Loaded symbols for /usr/lib/i686/cmov/libssl.so.0.9.7
> Reading symbols from /usr/lib/i686/cmov/libcrypto.so.0.9.7...done.
> Loaded symbols for /usr/lib/i686/cmov/libcrypto.so.0.9.7
> Reading symbols from /usr/lib/libstdc++.so.5...done.
> Loaded symbols for /usr/lib/libstdc++.so.5
> Reading symbols from /lib/libm.so.6...done.
> Loaded symbols for /lib/libm.so.6
> Reading symbols from /lib/libgcc_s.so.1...done.
> Loaded symbols for /lib/libgcc_s.so.1
> Reading symbols from /lib/libc.so.6...done.
> Loaded symbols for /lib/libc.so.6
> Reading symbols from /lib/libattr.so.1...done.
> Loaded symbols for /lib/libattr.so.1
> Reading symbols from /lib/ld-linux.so.2...done.
> Loaded symbols for /lib/ld-linux.so.2
> Reading symbols from /lib/libnsl.so.1...done.
> Loaded symbols for /lib/libnsl.so.1
> Reading symbols from /lib/libnss_compat.so.2...done.
> Loaded symbols for /lib/libnss_compat.so.2
> Reading symbols from /lib/libnss_nis.so.2...done.
> Loaded symbols for /lib/libnss_nis.so.2
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> 0x404a0001 in select () from /lib/libc.so.6
> (gdb)
> (gdb)
> (gdb) run
> The program being debugged has been started already.
> Start it from the beginning? (y or n) n
> Program not restarted.
> (gdb) cont
> Continuing.
> [New Thread 49156 (LWP 7817)]
> [Thread 16386 (LWP 7794) exited]
> [Thread 49156 (LWP 7817) exited]
> [New Thread 65541 (LWP 7852)]
> [Thread 65541 (LWP 7852) exited]
> [New Thread 81926 (LWP 7855)]
> [New Thread 98311 (LWP 7862)]
> [Thread 98311 (LWP 7862) exited]
> [New Thread 114696 (LWP 7865)]
> [Thread 114696 (LWP 7865) exited]
> [New Thread 131081 (LWP 7867)]
> [Thread 81926 (LWP 7855) exited]
> [Thread 131081 (LWP 7867) exited]
> [New Thread 147466 (LWP 7871)]
> [Thread 147466 (LWP 7871) exited]
> [New Thread 163851 (LWP 7873)]
> [Thread 163851 (LWP 7873) exited]
> [New Thread 180236 (LWP 7874)]
> [Thread 180236 (LWP 7874) exited]
> [New Thread 196621 (LWP 7876)]
>
> [New Thread 213006 (LWP 7878)]
> [New Thread 229391 (LWP 7915)]
> [New Thread 245776 (LWP 7916)]
> [Thread 229391 (LWP 7915) exited]
> [Thread 213006 (LWP 7878) exited]
> [Thread 245776 (LWP 7916) exited]
> [Thread 196621 (LWP 7876) exited]
>
> [New Thread 262161 (LWP 7934)]
> [Thread 262161 (LWP 7934) exited]
> [New Thread 278546 (LWP 7942)]
> [New Thread 294931 (LWP 7950)]
> [Thread 294931 (LWP 7950) exited]
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 278546 (LWP 7942)]
> 0x080d2050 in ?? ()
> (gdb) Quit
> (gdb)
> Continuing.
> Kaboom! bacula-sd, lisa-sd got signal 11 - Segmentation violation.
> Attempting traceback.
> Kaboom! exepath=/usr/sbin/
> Calling: /usr/sbin/btraceback /usr/sbin/bacula-sd 7791
> Traceback complete, attempting cleanup ...
> [Thread 32771 (LWP 7795) exited]
> Orphaned buffer: lisa-sd 528 bytes buf=80c2f08 allocated at bnet.c:674
> Orphaned buffer: lisa-sd 272 bytes buf=80cc3f8 allocated at jcr.c:253
> Orphaned buffer: lisa-sd 528 bytes buf=80e1518 allocated at bnet.c:673
> Orphaned buffer: lisa-sd 528 bytes buf=80e1748 allocated at jcr.c:255
> Orphaned buffer: lisa-sd 528 bytes buf=80e1a90 allocated at bnet.c:673
> Orphaned buffer: lisa-sd 528 bytes buf=80e2618 allocated at bnet.c:674
> Orphaned buffer: lisa-sd 146 bytes buf=80e1f70 allocated at job.c:114
> Orphaned buffer: lisa-sd 146 bytes buf=80e2020 allocated at job.c:117
> Orphaned buffer: lisa-sd 146 bytes buf=80e20d0 allocated at job.c:120
> Orphaned buffer: lisa-sd 146 bytes buf=80e2180 allocated at job.c:128
> Orphaned buffer: lisa-sd 128 bytes buf=80d0938 allocated at bnet.c:667
> Orphaned buffer: lisa-sd 7 bytes buf=80d1f30 allocated at bnet.c:675
> Orphaned buffer: lisa-sd 12 bytes buf=80d09d8 allocated at bnet.c:676
> Orphaned buffer: lisa-sd 528 bytes buf=80d19d8 allocated at bnet.c:674
> Orphaned buffer: lisa-sd 128 bytes buf=80d2050 allocated at bnet.c:667
> Orphaned buffer: lisa-sd 7 bytes buf=80d20f0 allocated at bnet.c:675
> Orphaned buffer: lisa-sd 12 bytes buf=80d2118 allocated at bnet.c:676
> Orphaned buffer: lisa-sd 8 bytes buf=80d2148 allocated at workq.c:167
> Orphaned buffer: lisa-sd 16 bytes buf=80e1f40 allocated at jcr.c:247
> Orphaned buffer: lisa-sd 24 bytes buf=80d1698 allocated at dircmd.c:185
> Orphaned buffer: lisa-sd 40 bytes buf=80d0a08 allocated at job.c:140
> Orphaned buffer: lisa-sd 24 bytes buf=80d1f68 allocated at reserve.c:583
> Orphaned buffer: lisa-sd 40 bytes buf=80e25c0 allocated at alist.c:53
> Orphaned buffer: lisa-sd 24 bytes buf=80e2e20 allocated at reserve.c:606
> Orphaned buffer: lisa-sd 8 bytes buf=80e1ef0 allocated at reserve.c:621
> Orphaned buffer: lisa-sd 40 bytes buf=80cc528 allocated at alist.c:53
> Orphaned buffer: lisa-sd 128 bytes buf=80d2de0 allocated at bnet.c:667
> Orphaned buffer: lisa-sd 7 bytes buf=80d2d50 allocated at bnet.c:675
> Orphaned buffer: lisa-sd 12 bytes buf=80cc570 allocated at bnet.c:676
> Orphaned buffer: lisa-sd 65652 bytes buf=80f2dc8 allocated at bsock.c:583
>
> Program exited with code 01.
> (gdb) backtrace
> No stack.
> (gdb) print my_name
> $1 = '\0' <repeats 29 times>
> (gdb) bt
> No stack.
> (gdb) thread apply all bt
> (gdb) f 0
> No stack.
> (gdb) info locals
> No registers.
> (gdb) bt
> No stack.
> (gdb) f 1
> No stack.
> (gdb) info locals
> No registers.
> (gdb) f 2
> No stack.
> (gdb) info locals
> No registers.
> (gdb) f 3
> No stack.
> (gdb) info locals
> No registers.
> (gdb) f 4
> No stack.
> (gdb) info locals
> No registers.
> (gdb) f 5
> No stack.
> (gdb) info locals
> No registers.
> (gdb) f 6
> No stack.
> (gdb) info locals
> No registers.
> (gdb) f 7
> No stack.
> (gdb) info locals
> No registers.
> (gdb) detach
> (gdb) quit
> [EMAIL PROTECTED]:~#
>
> > - Make the smartalloc routine dump the *full* information it has on the
> > buffer that was overrun.
>
> How can I do this?
>
> > - Run the SD with valgrind, maybe it will point out what is overrunning
> > the buffer.
>
> Do I need some special options to make this work?
>
> [EMAIL PROTECTED]:~# valgrind /usr/sbin/bacula-sd -f -c /etc/bacula/bacula-sd.conf
> ==8314== Memcheck, a memory error detector for x86-linux.
> ==8314== Copyright (C) 2002-2005, and GNU GPL'd, by Julian Seward et al.
> ==8314== Using valgrind-2.4.0, a program supervision framework for x86-linux.
> ==8314== Copyright (C) 2000-2005, and GNU GPL'd, by Julian Seward et al.
> ==8314== For more details, rerun with: -v
> ==8314==
> ==8314== Signal 11 (SIGSEGV) appears to have lost its siginfo; I can't go on.
> ==8314== This may be because one of your programs has consumed your
> ==8314== ration of siginfo structures.
> ==8314== Signal 11 (SIGSEGV) appears to have lost its siginfo; I can't go on.
> ==8314== This may be because one of your programs has consumed your
> ==8314== ration of siginfo structures.
> ==8314==
> ==8314== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
> ==8314== malloc/free: in use at exit: 0 bytes in 0 blocks.
> ==8314== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
> ==8314== For counts of detected errors, rerun with: -v
> ==8314== No malloc'd blocks -- no leaks are possible.
> Segmentation fault
> [EMAIL PROTECTED]:~#
>
> > - Figure out why it is the only SD crashing (at least that is what I
> > deduce from "one of my SD ... crashes ..." (paraphrased).
>
> The other one is on debian etch amd64 and is only storing to disk.
> The crashing SD is on the same machine as the Dir.
>
> > - Back up to 2.0.3, rebuild it and see if it crashes too in the same way.
>
> I built a 2.0.3 SD and dir and did the same job again (still 2.2.4 FD):
>
> *Worked perfectly*, no crash. So I think we can eliminate the case
> of any hardware failure.
>
> 16-Sep 02:29 lisa-sd: Job write elapsed time = 00:02:52, Transfer rate = 4.397 M bytes/second
> 16-Sep 02:29 lisa-sd: Sending spooled attrs to the Director. Despooling 29,772 bytes ...
> 16-Sep 02:29 lisa-dir: Bacula 2.0.3 (06Mar07): 16-Sep-2007 02:29:25
>   JobId:                  536
>   Job:                    lisa-ImportantData.2007-09-16_02.24.18
>   Backup Level:           Incremental, since=2007-09-14 03:06:16
>   Client:                 "lisa-fd" 2.2.4 (14Sep07) i386-pc-linux-gnu,debian,3.1
>   FileSet:                "lisa ImportantData FileSet" 2007-03-29 13:23:26
>   Pool:                   "Default" (From Job resource)
>   Storage:                "lisa-sd" (From Job resource)
>   Scheduled time:         16-Sep-2007 02:24:12
>   Start time:             16-Sep-2007 02:25:36
>   End time:               16-Sep-2007 02:29:25
>   Elapsed time:           3 mins 49 secs
>   Priority:               500
>   FD Files Written:       116
>   SD Files Written:       116
>   FD Bytes Written:       756,371,227 (756.3 MB)
>   SD Bytes Written:       756,383,199 (756.3 MB)
>   Rate:                   3302.9 KB/s
>   Software Compression:   None
>   VSS:                    no
>   Encryption:             no
>   Volume name(s):         Tape_13
>   Volume Session Id:      1
>   Volume Session Time:    1189902192
>   Last Volume Bytes:      1,497,904,128 (1.497 GB)
>   Non-fatal FD errors:    0
>   SD Errors:              0
>   FD termination status:  OK
>   SD termination status:  OK
>   Termination:            Backup OK
>
> 16-Sep 02:29 lisa-dir: Begin pruning Jobs.
> 16-Sep 02:29 lisa-dir: No Jobs found to prune.
> 16-Sep 02:29 lisa-dir: Begin pruning Files.
> 16-Sep 02:29 lisa-dir: No Files found to prune.
> 16-Sep 02:29 lisa-dir: End auto prune.
>
> *status client=lisa-fd
> Connecting to Client lisa-fd at lisa:9102
>
> lisa-fd Version: 2.2.4 (14 September 2007) i386-pc-linux-gnu debian 3.1
> Daemon started 15-Sep-07 13:23, 6 Jobs run since started.
> Heap: heap=694,224 smbytes=215,620 max_bytes=317,828 bufs=80 max_bufs=200
> Sizeof: boffset_t=8 size_t=4 debug=1 trace=0
>
> Running Jobs:
> Director connected at: 16-Sep-07 02:29
> No Jobs running.
> ====
>
> Terminated Jobs:
>  JobId  Level  Files    Bytes  Status  Finished         Name
> ======================================================================
> [...]
>    535  Incr     114  740.2 M  OK      16-Sep-07 01:31  lisa-ImportantData
>    536  Incr     116  756.3 M  OK      16-Sep-07 02:29  lisa-ImportantData
> ====
> *
>
>
> Any more hints?
