Dustin:

We're using "amgtar" to dump, so I believe that amgtar is
also used for the validation. At some point amgtar calls
gtar.

I've run 'truss' on amcheckdump and will send a much-reduced
version of the output as a .zip attachment in a follow-up
email. The truss output covers the period between seeing
the following in the window where amcheckdump was started:

Validating image thor1:/root datestamp 20100803111814 level 2 part 1/1 on tape HBK1_17 file #2

and:

Image was not successfully validated: Error writing data to validation command: Error writing fd 9: Broken pipe.

It looks to me like:

  - process 10344 (amcheckdump) creates a pipe
    with file descriptors 8 and 9 at each end.
  - the process forks off a child (10350)
  - the parent closes fd8
  - the child closes fd9 and eventually dups
    fd8 onto fd0 (stdin)

So we end up with the parent and child at opposite ends of a pipe,
with fd9 on the parent side and fd0 on the child side.
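
For reference, here is a minimal sketch of that sequence as I understand
it from the truss output. It is Python rather than anything taken from
amcheckdump, and the function name run_pipe_demo is made up for
illustration; read_fd and write_fd stand in for fd8 and fd9.

```python
import os

def run_pipe_demo(payload: bytes) -> int:
    """Hypothetical helper: pipe, fork, close, dup onto stdin.

    Returns the child's exit status (0 if it read the whole payload).
    """
    read_fd, write_fd = os.pipe()        # stand-ins for fd8 and fd9
    pid = os.fork()
    if pid == 0:
        # Child: drop the write end, then move the read end onto stdin.
        os.close(write_fd)
        if read_fd != 0:
            os.dup2(read_fd, 0)
            os.close(read_fd)
        data = b""
        while True:
            chunk = os.read(0, 4096)
            if not chunk:                # EOF: the parent closed its end
                break
            data += chunk
        os._exit(0 if data == payload else 1)
    # Parent: drop the read end and write into the pipe.
    os.close(read_fd)
    os.write(write_fd, payload)
    os.close(write_fd)                   # signals EOF to the child
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

In this sketch the parent closes its end before the child exits, so the
child sees EOF and nothing breaks.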

The parent writes and the child reads successfully for a while.

Eventually (line 4461), the child seems to exit normally with
an exit status of 0. It looks to me like this is an orderly
exit with normal cleanup, closing of file descriptors, etc.

Unfortunately, as soon as the child exits, the next parent
write gets a SIGPIPE signal (lines 4462 and 5488).
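
The failure mode itself is easy to reproduce in isolation. The sketch
below is again illustrative Python, not amcheckdump code, and the name
write_after_child_exit is hypothetical. The child reads a little and
exits with status 0 while the parent still has data to send. Python
ignores SIGPIPE by default, so the failed write shows up as a
BrokenPipeError (EPIPE) rather than as a signal.

```python
import os

def write_after_child_exit() -> bool:
    """Hypothetical helper: True if the write after child exit hits EPIPE."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: read one small chunk, then exit cleanly with status 0.
        os.close(write_fd)
        os.read(read_fd, 16)
        os._exit(0)
    # Parent: keep only the write end.
    os.close(read_fd)
    os.write(write_fd, b"x" * 16)
    os.waitpid(pid, 0)                   # child is gone; no readers remain
    try:
        os.write(write_fd, b"more data") # nobody is left to read this
    except BrokenPipeError:
        return True
    finally:
        os.close(write_fd)
    return False
```

A clean exit status from the child therefore does not rule out a broken
pipe on the parent side; it only requires that the child stop reading
before the parent stops writing.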

This doesn't happen with every DLE checked - about 20% seem
to check out okay.

Is it possible that this is a race condition, with it being
random chance whether the parent finishes writing before the
child exits? Could there be a missing "flush" command somewhere?

Any insight you can provide would be appreciated!

Do you know if anyone else has successfully used amcheckdump
with the mod 3306 changes on Solaris (or on any other platform)?

Thanks,

Sean

>
>On Mon, Aug 2, 2010 at 12:56 AM, Sean Walmsley
><s...@fpp.nuclearsafetysolutions.com> wrote:
>> Since this error is coming from just a few lines after the line
>> that caused the hanging problem, and since amcheckdump should only
>> be writing to the /dev/null device that wasn't being opened properly
>> before, we thought that the problems might be related.
>
>I assume that the validation command is gtar?  Is it giving any kind
>of error message?  Can you use 'truss' to see if you can figure out
>why it's exiting while there's still data flowing?
>
>Dustin
>


=================================================================
Sean Walmsley        sean at fpp . nuclearsafetysolutions dot com
Nuclear Safety Solutions Ltd.  416-592-4608 (V)  416-592-5528 (F)
700 University Ave M/S H04 J19, Toronto, Ontario, M5G 1X6, CANADA
