On Wed, Jun 30, 2010 at 8:35 AM, Robert LeBlanc <rob...@leblancnet.us> wrote:

> On Wed, Jun 30, 2010 at 1:06 AM, Kern Sibbald <k...@sibbald.com> wrote:
>
>>
>> This seems to be a support issue.  The dump that you posted shows no
>> indication of a crash, which means that your understanding of a crash and
>> mine are different.
>>
>> This is possibly a deadlock, but I won't spend any more time on it until
>> the problem is a bit clearer.
>>
>> Best regards,
>>
>> Kern
>>
>> By the way, if this is a production system, you should be running on
>> Lenny, which is known to be stable, and we support it.
>>
>
> I'm not really sure what you need as a good backtrace, since I'm not a
> programmer. I always thought that a segfault leads to a program crashing. I
> just don't know enough about gdb to know when there is enough information.
> All I know is that when it crashes when running as a daemon, I get a
> traceback that is useless in my e-mail (says no ptrace). When I run it under
> gdb and get the segfault, when I type 'cont' it says that bacula-sd has
> exited, and when I run it again, it doesn't complain that a process is
> already running. In both cases, there is no process called bacula-sd running
> on the system.
>
> I updated/upgraded about 10 clients yesterday to using TLS, and I did not
> get a crash from the SD. I will keep running it under the debugger in case
> it crashes again, although, I'm not sure how useful it will be if I can not
> operate gdb correctly to get you anything helpful. I have a feeling it's
> some perfect storm of configuration that may be causing the issue. I've been
> running Bacula for 6 years and never have had a problem like this. I'm just
> trying to help the project be as robust as possible because we like it and
> it has treated us so well in the past.
>
> As a side note, I get a lot more connection timeouts and broken pipes when
> using TLS. Adding a heartbeat interval helps, but it is not a silver bullet.
> Most of the back-ups are succeeding, with only a few here and there having
> problems. Without TLS and without a heartbeat interval, the back-ups
> always succeed. I'll keep working through things and see if I can come up
> with anything.
>
> Thank you for the time and the great project.
>
>
> Robert LeBlanc
> Life Sciences & Undergraduate Education Computer Support
> Brigham Young University
>
> P.S. We are working on a support contract and will be talking with you in
> about 24 hours with many others from our group who are also interested in
> using Bacula.
>

I know you are probably getting tired of hearing from me, but I had another
crash today. I'm attaching the backtrace that I got this time. I typed
'cont' after the backtrace, and all it said was that all the threads had
exited (this is in the log this time). Here is the output that preceded the
backtrace:

[Thread 0x7fffebfff710 (LWP 25670) exited]
[New Thread 0x7fffebfff710 (LWP 25671)]
[Thread 0x7fffebfff710 (LWP 25671) exited]
[Thread 0x7ffff0e88710 (LWP 24428) exited]
[Thread 0x7ffff1e8a710 (LWP 25530) exited]
[Thread 0x7ffff2e8c710 (LWP 25663) exited]
[New Thread 0x7ffff2e8c710 (LWP 25785)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff2e8c710 (LWP 25785)]
0x00007ffff77c5b1c in ?? () from /usr/lib/libz.so.1
(gdb) set loggin file /home/rleblanc/bacula-sd-seg.log
(gdb) set logging on
Copying output to /home/rleblanc/bacula-sd-seg.log.
(gdb) thread apply all bt

Thread 219 (Thread 0x7ffff2e8c710 (LWP 25785)):
#0  0x00007ffff77c5b1c in ?? () from /usr/lib/libz.so.1
#1  0x00007ffff77c6ef7 in ?? () from /usr/lib/libz.so.1
#2  0x00007ffff77c40eb in ?? () from /usr/lib/libz.so.1
#3  0x00007ffff77c2251 in deflate () from /usr/lib/libz.so.1
#4  0x00007ffff5eea6f2 in ?? () from /usr/lib/libcrypto.so.0.9.8

My question is: am I missing debug symbols in other packages, such as
OpenSSL, that would help? I'm not a programmer, so backtraces are pretty
much a wall of text to me. I want to give helpful info so that others do
not run into the same problem in the future.
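
One rough way to answer that question from the backtrace itself is to count
the frames that gdb printed as "??" for each shared library, since those are
the frames it could not resolve to a function name. The following is only a
sketch: the function name libraries_missing_symbols and the regular
expression are my own, and which debug-symbol package matches each library
depends on the distribution, so that part is not covered here.

```python
import re
from collections import Counter

# Matches unresolved gdb frames such as:
#   #4  0x00007ffff5eea6f2 in ?? () from /usr/lib/libcrypto.so.0.9.8
# Frames with a resolved name (e.g. "in deflate ()") do not match.
UNRESOLVED = re.compile(r"^#\d+\s+0x[0-9a-fA-F]+ in \?\? \(\) from (\S+)")


def libraries_missing_symbols(backtrace: str) -> Counter:
    """Count frames that gdb printed as '??' for each shared library."""
    counts = Counter()
    for line in backtrace.splitlines():
        match = UNRESOLVED.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# The five frames quoted in this message.
sample = """\
#0  0x00007ffff77c5b1c in ?? () from /usr/lib/libz.so.1
#1  0x00007ffff77c6ef7 in ?? () from /usr/lib/libz.so.1
#2  0x00007ffff77c40eb in ?? () from /usr/lib/libz.so.1
#3  0x00007ffff77c2251 in deflate () from /usr/lib/libz.so.1
#4  0x00007ffff5eea6f2 in ?? () from /usr/lib/libcrypto.so.0.9.8
"""

for library, frames in libraries_missing_symbols(sample).most_common():
    print(library, frames)
```

For the frames above this points at libz and libcrypto as the libraries
whose frames gdb could not name.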

If this is not helpful, I'm not sure what else to do, so I'll give up and
just create a cron job that will restart bacula-sd if it crashes or modify
btraceback to restart bacula-sd.
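
For the cron job idea, something along these lines might be enough. This is
a sketch under assumptions, not a tested setup: the helper names is_running
and ensure_running are made up for illustration, the check relies on
`pgrep -x` being available, and the start command is whatever your system
uses to start the storage daemon (I pass it in rather than hard-coding it).

```python
import subprocess


def is_running(process_name: str) -> bool:
    """Return True if pgrep finds a process with exactly this name."""
    result = subprocess.run(
        ["pgrep", "-x", process_name],
        stdout=subprocess.DEVNULL,
    )
    # pgrep exits with status 0 when at least one process matched.
    return result.returncode == 0


def ensure_running(process_name, start_command,
                   check=is_running, run=subprocess.call):
    """Start the daemon if it is not running.

    Returns True if a start was attempted, False if the process was
    already there. `check` and `run` can be replaced for testing.
    """
    if check(process_name):
        return False
    run(start_command)
    return True
```

A small script run from cron every few minutes would then call
ensure_running("bacula-sd", start_command), with start_command set to the
init script or service command for your installation. Note that this only
hides the crash; it does nothing for jobs that were running when the SD
went down.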

Thanks,

Robert LeBlanc
Life Sciences & Undergraduate Education Computer Support
Brigham Young University
Thread 219 (Thread 0x7ffff2e8c710 (LWP 25785)):
#0  0x00007ffff77c5b1c in ?? () from /usr/lib/libz.so.1
#1  0x00007ffff77c6ef7 in ?? () from /usr/lib/libz.so.1
#2  0x00007ffff77c40eb in ?? () from /usr/lib/libz.so.1
#3  0x00007ffff77c2251 in deflate () from /usr/lib/libz.so.1
#4  0x00007ffff5eea6f2 in ?? () from /usr/lib/libcrypto.so.0.9.8
#5  0x00007ffff5ee9ab0 in COMP_compress_block () from /usr/lib/libcrypto.so.0.9.8
#6  0x00007ffff61897be in ssl3_do_compress () from /usr/lib/libssl.so.0.9.8
#7  0x00007ffff61898fe in ?? () from /usr/lib/libssl.so.0.9.8
#8  0x00007ffff6189e16 in ssl3_write_bytes () from /usr/lib/libssl.so.0.9.8
#9  0x00007ffff719308e in openssl_bsock_readwrite (bsock=0xf1b7a8, ptr=0x7dab3c "", nbytes=171) at tls.c:626
#10 tls_bsock_writen (bsock=0xf1b7a8, ptr=0x7dab3c "", nbytes=171) at tls.c:691
#11 0x00007ffff7172e03 in write_nbytes (bsock=0x7fffdc0d8530, ptr=0x7dab3c "", nbytes=171) at bnet.c:128
#12 0x00007ffff717605d in BSOCK::send (this=0xf1b7a8) at bsock.c:360
#13 0x00007ffff71765e3 in BSOCK::fsend (this=0xf1b7a8, fmt=0x7ffff71a2ef0 "Jmsg Job=%s type=%d level=%lld %s") at bsock.c:415
#14 0x00007ffff7186ab3 in dispatch_message (jcr=<value optimized out>, type=<value optimized out>, mtime=<value optimized out>, 
    msg=0x7ffff2e8a610 "lsbacsd0-sd JobId 104032: JobId=104032 Job=\"lsxserve1.2010-07-01_20.04.00_13\" marked to be canceled.\n") at message.c:834
#15 0x00007ffff7187404 in Jmsg (jcr=0x96aa18, type=6, mtime=0, fmt=0x443e80 "JobId=%d Job=\"%s\" marked to be canceled.\n") at message.c:1225
#16 0x000000000041f765 in cancel_cmd (cjcr=<value optimized out>) at dircmd.c:335
#17 0x000000000042171f in handle_connection_request (arg=<value optimized out>) at dircmd.c:233
#18 0x00007ffff719a859 in workq_server (arg=<value optimized out>) at workq.c:346
#19 0x00007ffff67cb8ba in start_thread () from /lib/libpthread.so.0
#20 0x00007ffff538401d in clone () from /lib/libc.so.6
#21 0x0000000000000000 in ?? ()

Thread 216 (Thread 0x7ffff1689710 (LWP 25669)):
#0  0x00007ffff537d8b3 in select () from /lib/libc.so.6
#1  0x00007ffff71923ff in openssl_bsock_readwrite (bsock=0x747888, ptr=0x7ffff16889fc "", nbytes=4) at tls.c:646
#2  tls_bsock_readn (bsock=0x747888, ptr=0x7ffff16889fc "", nbytes=4) at tls.c:697
#3  0x00007ffff7175be7 in BSOCK::recv (this=0x747888) at bsock.c:451
#4  0x00000000004241a0 in do_fd_commands (jcr=0x9521f8) at fd_cmds.c:149
#5  0x0000000000424bfa in run_job (jcr=0x9521f8) at fd_cmds.c:124
#6  0x000000000042549b in run_cmd (jcr=0x9521f8) at job.c:225
#7  0x000000000042171f in handle_connection_request (arg=<value optimized out>) at dircmd.c:233
#8  0x00007ffff719a859 in workq_server (arg=<value optimized out>) at workq.c:346
#9  0x00007ffff67cb8ba in start_thread () from /lib/libpthread.so.0
#10 0x00007ffff538401d in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 215 (Thread 0x7ffff368d710 (LWP 25668)):
#0  0x00007ffff537d8b3 in select () from /lib/libc.so.6
#1  0x00007ffff71923ff in openssl_bsock_readwrite (bsock=0xce54d8, ptr=0x7ffff368c9fc "", nbytes=4) at tls.c:646
#2  tls_bsock_readn (bsock=0xce54d8, ptr=0x7ffff368c9fc "", nbytes=4) at tls.c:697
#3  0x00007ffff7175be7 in BSOCK::recv (this=0xce54d8) at bsock.c:451
#4  0x00000000004241a0 in do_fd_commands (jcr=0x67f158) at fd_cmds.c:149
#5  0x0000000000424bfa in run_job (jcr=0x67f158) at fd_cmds.c:124
#6  0x000000000042549b in run_cmd (jcr=0x67f158) at job.c:225
#7  0x000000000042171f in handle_connection_request (arg=<value optimized out>) at dircmd.c:233
#8  0x00007ffff719a859 in workq_server (arg=<value optimized out>) at workq.c:346
#9  0x00007ffff67cb8ba in start_thread () from /lib/libpthread.so.0
#10 0x00007ffff538401d in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 212 (Thread 0x7ffff3e8e710 (LWP 25665)):
#0  0x00007ffff67d04d9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff719a91e in workq_server (arg=<value optimized out>) at workq.c:323
#2  0x00007ffff67cb8ba in start_thread () from /lib/libpthread.so.0
#3  0x00007ffff538401d in clone () from /lib/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 208 (Thread 0x7ffff268b710 (LWP 25532)):
#0  0x00007ffff67d30bd in read () from /lib/libpthread.so.0
#1  0x00007ffff5e93fd1 in ?? () from /usr/lib/libcrypto.so.0.9.8
#2  0x00007ffff5e92279 in BIO_read () from /usr/lib/libcrypto.so.0.9.8
#3  0x00007ffff6189ffd in ssl3_read_n () from /usr/lib/libssl.so.0.9.8
#4  0x00007ffff618a443 in ssl3_read_bytes () from /usr/lib/libssl.so.0.9.8
#5  0x00007ffff6186fbc in ssl3_shutdown () from /usr/lib/libssl.so.0.9.8
#6  0x00007ffff71924cc in tls_bsock_shutdown (bsock=0xce6888) at tls.c:582
#7  0x00007ffff717532f in BSOCK::close (this=0xce6888) at bsock.c:889
#8  0x00007ffff717fb0e in free_common_jcr (file=<value optimized out>, line=<value optimized out>, jcr=0xc13ec8) at jcr.c:445
#9  b_free_jcr (file=<value optimized out>, line=<value optimized out>, jcr=0xc13ec8) at jcr.c:571
#10 0x0000000000421638 in handle_connection_request (arg=<value optimized out>) at dircmd.c:252
#11 0x00007ffff719a859 in workq_server (arg=<value optimized out>) at workq.c:346
#12 0x00007ffff67cb8ba in start_thread () from /lib/libpthread.so.0
#13 0x00007ffff538401d in clone () from /lib/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 201 (Thread 0x7ffff509c710 (LWP 25010)):
#0  0x00007ffff67d305d in write () from /lib/libpthread.so.0
#1  0x0000000000437610 in write_spool_header (dcr=0xed2298) at spool.c:509
#2  0x0000000000437bc9 in write_block_to_spool_file (dcr=0xed2298) at spool.c:485
#3  0x0000000000410700 in do_append_data (jcr=0x692498) at append.c:199
#4  0x0000000000424a7b in append_data_cmd (jcr=0x692498) at fd_cmds.c:203
#5  0x000000000042423b in do_fd_commands (jcr=0x692498) at fd_cmds.c:162
#6  0x0000000000424bfa in run_job (jcr=0x692498) at fd_cmds.c:124
#7  0x000000000042549b in run_cmd (jcr=0x692498) at job.c:225
#8  0x000000000042171f in handle_connection_request (arg=<value optimized out>) at dircmd.c:233
#9  0x00007ffff719a859 in workq_server (arg=<value optimized out>) at workq.c:346
#10 0x00007ffff67cb8ba in start_thread () from /lib/libpthread.so.0
#11 0x00007ffff538401d in clone () from /lib/libc.so.6
#12 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7ffff489b710 (LWP 22093)):
#0  0x00007ffff67d04d9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff719a3c8 in watchdog_thread (arg=<value optimized out>) at watchdog.c:308
#2  0x00007ffff67cb8ba in start_thread () from /lib/libpthread.so.0
#3  0x00007ffff538401d in clone () from /lib/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7ffff7fe6720 (LWP 22088)):
#0  0x00007ffff537d8b3 in select () from /lib/libc.so.6
#1  0x00007ffff71739f1 in bnet_thread_server (addrs=<value optimized out>, max_clients=<value optimized out>, client_wq=<value optimized out>, handle_client_request=<value optimized out>)
    at bnet_server.c:161
#2  0x0000000000408b82 in main (argc=<value optimized out>, argv=<value optimized out>) at stored.c:312
Continuing.
[Thread 0x7ffff2e8c710 (LWP 25785) exited]
[Thread 0x7ffff368d710 (LWP 25668) exited]
[Thread 0x7ffff3e8e710 (LWP 25665) exited]
[Thread 0x7ffff268b710 (LWP 25532) exited]
[Thread 0x7ffff509c710 (LWP 25010) exited]
[Thread 0x7ffff489b710 (LWP 22093) exited]
[Thread 0x7ffff1689710 (LWP 25669) exited]

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users