On Fri, Jul 2, 2010 at 2:59 AM, Kern Sibbald <k...@sibbald.com> wrote:

> > The question that I have is: am I missing some debug symbols in other
> > packages like open-ssl that would help? I'm not a programmer so backtraces
> > are pretty much a wall of text to me. I want to give helpful info so that
> > others may not run into the same problem in the future.
> >
> > If this is not helpful, I'm not sure what else to do, so I'll give up and
> > just create a cron job that will restart bacula-sd if it crashes or modify
> > btraceback to restart bacula-sd.
> >
>
>
> The dump does not clearly show what is going on.  I suspect this is because
> you are not following the advice in the manual (e.g. you should not use
> "set loggin"...) as it seems to only partially show what is going on.
>
> However, if I am interpreting what you show above and what is in the log
> file as being all the same output, it looks like the problems arise because
> a cancel command has been sent to the SD, either by the operator or by a
> directive.
>
> In Bacula 5.0.2, cancelling jobs is known to occasionally crash the
> Director and the SD.  Perhaps it happens more frequently when TLS is
> running.  My best guess is that the libz routines have a signal bug, or
> perhaps there is a problem in the Bacula code -- I am not sure.
>
> I do know that we have a number of fixes for the cancel command in Bacula
> 5.0.3, which will probably be released near the end of the month.  Most if
> not all of the fixes are in the SourceForge bacula repo under Branch-5.0.
>
> In the meantime, you should try to find out why Bacula is attempting to
> cancel the job and make sure that does not happen.  Perhaps it is a max
> runtime or something that is set too short or a rogue operator :-)
>
> I believe that your bug is a duplicate of bug #1568, which is a bug in zlib
> that causes it to crash when a signal is received. You will notice that the
> tracebacks look very similar to yours.
>
> You might want to talk to Frank Sweetzer about how he is resolving the
> problem.  He is also at a University ...
>
>
I think this is helpful for me. Debian does run bacula under bacula.tape, so
I'll change it to run under root.root and see if that helps with the
automated backtrace. I do think there is some sort of error in the SSL layer,
and the problem may be compounded by the cancel bug. Here's why:

I was able to test this on a machine that had not been able to get a good
backup. When running a TLS job, the connection is established and the FD
starts transferring data to the SD. I watch the spool size increment, and
when it stops, the Send-Q column in netstat on the client for the connection
to the SD starts growing. 30 minutes later, I get "Connection timed out",
and then the job is canceled (not put in error state). (Disabling TLS
allowed the client to complete the backup on the first try.)
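
For anyone who wants to watch for the same symptom, here is a rough sketch of
how the Send-Q check could be scripted. It is only an illustration: the
function name `send_queues` is made up, it assumes the Linux `netstat -tn`
column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State),
and it assumes the SD listens on the default port 9103.

```python
def send_queues(netstat_text, sd_port=9103):
    """Return {(local, remote): send_q} for TCP connections to sd_port.

    netstat_text is the raw output of `netstat -tn` (assumed Linux layout).
    sd_port defaults to 9103, the usual Bacula SD port (an assumption here).
    """
    result = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        # Data rows: proto, recv-q, send-q, local addr, foreign addr, state.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skip the banner and the column header
        send_q, local, remote = fields[2], fields[3], fields[4]
        if not send_q.isdigit():
            continue
        # The port is whatever follows the last colon of the foreign address.
        if remote.rsplit(":", 1)[-1] == str(sd_port):
            result[(local, remote)] = int(send_q)
    return result
```

Feeding it the output of `netstat -tn` every minute or so on the client and
logging the values should show whether Send-Q keeps growing after the spool
size stops incrementing.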

When I get a "Broken pipe", Bacula puts the job in error state, but a job
that hits "Connection timed out" is always canceled. I think this may be
triggering the crash. I'll pull HEAD and see if it runs into the same
problem. I'm afraid that you might be right about the SSL bug, and it is
definitely out of your hands. I'll see what I can do to submit a bug to
OpenSSL about it.

Robert LeBlanc
Life Sciences & Undergraduate Education Computer Support
Brigham Young University
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
