Re: [patch] recognise when an exec()d process terminates due to unhandled exception

2008-03-23 Thread Brian Dessent
Christopher Faylor wrote:

 After poking at this a little, I think it would be better to issue a
 linux-like error message.
 
 In my sandbox, I now have this:
 
 bash-3.2$ ./libtest
 /cygdrive/s/test/libtest.exe: error while loading shared libraries: liba.dll: 
 cannot open shared object file: No such file or directory
 
 I haven't done the work to report on missing symbols yet but I think
 that's a much less common error condition.

Excellent.  The wording isn't really that important to me.   But I think
what is important is that we don't allow the situation where something
was unable to start and we are totally silent.  That leads to confusion
because people start to try to debug or blame the program being run when
in fact the program never saw the light of day in the first place.  It's
totally baffling when it happens and you're not aware to look for it. 
So even if we can't give the name of the symbol in the case of a missing
import, I think it's still important to say something.

Brian


Re: [patch] recognise when an exec()d process terminates due to unhandled exception

2008-03-23 Thread Christopher Faylor
On Sun, Mar 23, 2008 at 01:48:15PM -0700, Brian Dessent wrote:
Christopher Faylor wrote:
After poking at this a little, I think it would be better to issue a
linux-like error message.

In my sandbox, I now have this:

bash-3.2$ ./libtest /cygdrive/s/test/libtest.exe: error while loading
shared libraries: liba.dll: cannot open shared object file: No such
file or directory

I haven't done the work to report on missing symbols yet but I think
that's a much less common error condition.

Excellent.  The wording isn't really that important to me.  But I think
what is important is that we don't allow the situation where something
was unable to start and we are totally silent.  That leads to confusion
because people start to try to debug or blame the program being run
when in fact the program never saw the light of day in the first place.
It's totally baffling when it happens and you're not aware to look for
it.  So even if we can't give the name of the symbol in the case of a
missing import, I think it's still important to say something.

Yes.  I really have been meaning to fix this for a long time.  It's my
fault that cygwin has this bug.  I appreciate your point the way to
how this could be solved.

cgf


Re: [patch] recognise when an exec()d process terminates due to unhandled exception

2008-03-23 Thread Christopher Faylor
On Sun, Mar 23, 2008 at 08:09:06PM -0400, Christopher Faylor wrote:
On Sun, Mar 23, 2008 at 01:48:15PM -0700, Brian Dessent wrote:
Christopher Faylor wrote:
After poking at this a little, I think it would be better to issue a
linux-like error message.

In my sandbox, I now have this:

bash-3.2$ ./libtest /cygdrive/s/test/libtest.exe: error while loading
shared libraries: liba.dll: cannot open shared object file: No such
file or directory

I haven't done the work to report on missing symbols yet but I think
that's a much less common error condition.

Excellent.  The wording isn't really that important to me.  But I think
what is important is that we don't allow the situation where something
was unable to start and we are totally silent.  That leads to confusion
because people start to try to debug or blame the program being run
when in fact the program never saw the light of day in the first place.
It's totally baffling when it happens and you're not aware to look for
it.  So even if we can't give the name of the symbol in the case of a
missing import, I think it's still important to say something.

Yes.  I really have been meaning to fix this for a long time.  It's my
fault that cygwin has this bug.  I appreciate your point the way to
 ing
how this could be solved.

cgf


Re: [patch] recognise when an exec()d process terminates due to unhandled exception

2008-03-22 Thread Christopher Faylor
On Fri, Mar 14, 2008 at 11:11:43PM -0700, Brian Dessent wrote:
Christopher Faylor wrote:
That was going to be my first observation, actually.  I'm still trying
to digest the patch but it seems like it wouldn't work well with the
fork retry code.

The patch doesn't change any behavior though: in current Cygwin if the
thing we're exec()ing returns a Win32 exit code of C135 (or
whatever) then we retry creating it 5 times.  With the patch, we still
retry starting it 5 times but after the last one fails we recognise the
situation and print a message.

After poking at this a little, I think it would be better to issue a
linux-like error message.

In my sandbox, I now have this:

bash-3.2$ ./libtest
/cygdrive/s/test/libtest.exe: error while loading shared libraries: liba.dll: 
cannot open shared object file: No such file or directory

I haven't done the work to report on missing symbols yet but I think
that's a much less common error condition.

So thanks for the patch and the concept but I think I'd rather go with a
more linux-like solution.

cgf


Re: [patch] recognise when an exec()d process terminates due to unhandled exception

2008-03-22 Thread Christopher Faylor
On Fri, Mar 14, 2008 at 04:08:46AM -0700, Brian Dessent wrote:
Brian Dessent wrote:

 isn't present, etc.  I was really hoping to figure out a cool way to get
 that info, perhaps by poking around in the TEB or PEB somewhere, but I
 haven't gotten that far.  If anyone has any general ideas where to look
 for NTLDR's internal state, I'm all ears.  I have a hunch it would be
 possible to get if we were running the exec'd process in a debugger loop
 and pumping WaitForDebugEvent() messages, since those can have
 parameters attached to exception codes.  But that's a little too
 extreme.

For anyone curious, it's absolutely possible to do the above.  For the
C139 fault (missing procedure point entry), %ebx at the time of the
fault points right at the AsciiZ name of the missing import in the
.idata section, -8(%ebp) points to the import name in UNICODE, and
-10(%ebp) points to the DLL name in UNICODE.

For the C135 fault (the unable to locate component popup), %esi at
the time of the fault points right to the missing library name in
UNICODE.

For the C005 fault (the LDR hits an access violation trying to fixup
a reloc .rdata), %ebx points to an AsciiZ name of the symbol it was
relocating and 24(%ebp) points to an AsciiZ filename of the module which
that symbol is supposed to be pointing into.

Now I'm sure a lot of those above offsets are just coincidental, as I
haven't done much testing to see if it's reliably set as above.  However
it does mean that it would be relatively easy to use the debug API to
step a process through its initialization and find out exactly why it's
faulting.  I've been working on something along those lines for cygcheck
which will also give dynamic process tracing, i.e. runtime LoadLibrary
stuff.  Combined with enabling the LDR snaps debug output, there is a
tremendous amount of debug capability hidden here.

I took a more low-tech and undoubtedly less foolproof approach and just
stepped through the list of dlls for the failing executable, trying to
load them one at a time, reporting on the first dll to fail.  This is
not necessarily the same scenario as what the Windows dynamic linker
is doing but it should report an *a* missing dll if not necessarily the
dll that was causing the problem.  It also won't necessarily get the
search order right if the executable being started isn't in the same
directory as the parent executable.  That's fixable, though.

cgf


Re: [patch] recognise when an exec()d process terminates due to unhandled exception

2008-03-15 Thread Brian Dessent
Christopher Faylor wrote:

 That was going to be my first observation, actually.  I'm still trying
 to digest the patch but it seems like it wouldn't work well with the
 fork retry code.

The patch doesn't change any behavior though: in current Cygwin if the
thing we're exec()ing returns a Win32 exit code of C135 (or
whatever) then we retry creating it 5 times.  With the patch, we still
retry starting it 5 times but after the last one fails we recognise the
situation and print a message.

Brian


Re: [patch] recognise when an exec()d process terminates due to unhandled exception

2008-03-14 Thread Brian Dessent
Brian Dessent wrote:

 isn't present, etc.  I was really hoping to figure out a cool way to get
 that info, perhaps by poking around in the TEB or PEB somewhere, but I
 haven't gotten that far.  If anyone has any general ideas where to look
 for NTLDR's internal state, I'm all ears.  I have a hunch it would be
 possible to get if we were running the exec'd process in a debugger loop
 and pumping WaitForDebugEvent() messages, since those can have
 parameters attached to exception codes.  But that's a little too
 extreme.

For anyone curious, it's absolutely possible to do the above.  For the
C139 fault (missing procedure point entry), %ebx at the time of the
fault points right at the AsciiZ name of the missing import in the
.idata section, -8(%ebp) points to the import name in UNICODE, and
-10(%ebp) points to the DLL name in UNICODE.

For the C135 fault (the unable to locate component popup), %esi at
the time of the fault points right to the missing library name in
UNICODE.

For the C005 fault (the LDR hits an access violation trying to fixup
a reloc .rdata), %ebx points to an AsciiZ name of the symbol it was
relocating and 24(%ebp) points to an AsciiZ filename of the module which
that symbol is supposed to be pointing into.

Now I'm sure a lot of those above offsets are just coincidental, as I
haven't done much testing to see if it's reliably set as above.  However
it does mean that it would be relatively easy to use the debug API to
step a process through its initialization and find out exactly why it's
faulting.  I've been working on something along those lines for cygcheck
which will also give dynamic process tracing, i.e. runtime LoadLibrary
stuff.  Combined with enabling the LDR snaps debug output, there is a
tremendous amount of debug capability hidden here.

Brian


Re: [patch] recognise when an exec()d process terminates due to unhandled exception

2008-03-14 Thread Corinna Vinschen
On Mar 14 04:08, Brian Dessent wrote:
 Brian Dessent wrote:
 
  isn't present, etc.  I was really hoping to figure out a cool way to get
  that info, perhaps by poking around in the TEB or PEB somewhere, but I
  haven't gotten that far.  If anyone has any general ideas where to look
  for NTLDR's internal state, I'm all ears.  I have a hunch it would be
  possible to get if we were running the exec'd process in a debugger loop
  and pumping WaitForDebugEvent() messages, since those can have
  parameters attached to exception codes.  But that's a little too
  extreme.
 
 For anyone curious, it's absolutely possible to do the above.  For the
 C139 fault (missing procedure point entry), %ebx at the time of the
 fault points right at the AsciiZ name of the missing import in the
 .idata section, -8(%ebp) points to the import name in UNICODE, and
 -10(%ebp) points to the DLL name in UNICODE.
 
 For the C135 fault (the unable to locate component popup), %esi at
 the time of the fault points right to the missing library name in
 UNICODE.
 
 For the C005 fault (the LDR hits an access violation trying to fixup
 a reloc .rdata), %ebx points to an AsciiZ name of the symbol it was
 relocating and 24(%ebp) points to an AsciiZ filename of the module which
 that symbol is supposed to be pointing into.
 
 Now I'm sure a lot of those above offsets are just coincidental, as I
 haven't done much testing to see if it's reliably set as above.  However
 it does mean that it would be relatively easy to use the debug API to
 step a process through its initialization and find out exactly why it's
 faulting.  I've been working on something along those lines for cygcheck
 which will also give dynamic process tracing, i.e. runtime LoadLibrary
 stuff.  Combined with enabling the LDR snaps debug output, there is a
 tremendous amount of debug capability hidden here.

That's really cool.  Your patch looks good, but it's Chris' code so
he will have the final say.

What we also could do instead of adding this to the DLL is to add this
to cygcheck and/or strace only.  If somebody complains on the list that
a process just exits, we can point him to run it under cygcheck and it
will tell you what's wrong.  That would be already quite nice, imho.


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat


Re: [patch] recognise when an exec()d process terminates due to unhandled exception

2008-03-14 Thread Christopher Faylor
On Thu, Mar 13, 2008 at 07:46:37PM -0700, Brian Dessent wrote:
Brian Dessent wrote:

 As we all know, Cygwin calls SetErrorMode (SEM_FAILCRITICALERRORS) to
 suppress those pop up GUI messageboxes from the operating system when 

Oh, I forgot to mention:

In the course of testing this I came to realize that because of some
sort of retry if fork doesn't seem to be working code (not sure of the
details), every time that this situation comes up we are actually
launching five copies of the binary.

That was going to be my first observation, actually.  I'm still trying
to digest the patch but it seems like it wouldn't work well with the
fork retry code.

cgf


[patch] recognise when an exec()d process terminates due to unhandled exception

2008-03-13 Thread Brian Dessent

As we all know, Cygwin calls SetErrorMode (SEM_FAILCRITICALERRORS) to
suppress those pop up GUI messageboxes from the operating system when a
process encounters an unhandled exception.  This has the advantage of
making things more POSIX-like, and I'm sure people that run long
testsuites or unattended headless servers appreciate not coming in after
a long weeked to find that their server has been wedged for days waiting
for someone to click on OK.

But of course if you follow the user list you also know that this is a
double edged sword, in that currently if a required DLL is missing there
is zero indication other than an curiously arbitrary exit status code of
53 decimal, or 0x35 hex which is the low byte of 0xC135,
STATUS_DLL_NOT_FOUND.

Anyway, the attached patch fixes all that by adding logic to let the
actual NTSTATUS logic percolate up to the waiting parent, so that it can
recognise these kinds of common(ish) faults and print a friendly message
-- or at least something other than silently dieing with no output.

After printing the message, the NTSTATUS is discarded and the exit
status code is replaced with a synthetic exit status corresponding to
killed by SIGKILL as read by the wait()ing parent, which means the
shell will also append Killed to that message.  I tried 0x80 |
SIGSEGV, corresponding to segmentation fault and core dumped but
since we aren't actually generating a core file, it seemed a little
weird to see the shell say that there was one generated.  The point here
is that the exit status that the parent (in most cases the shell) sees
is totally arbitrary, so we can put whatever makes the most sense
there.  I just figured that the shell printing Killed most closely
corresponds to the actual situation of the OS terminating the process
due to an unhandled exception.

There are three specific cases that I had in mind to handle with a
graceful message:

1. the user is missing a DLL
2. the DLL that is found is missing symbols
3. relocs in the .rdata section

In addition to catching and hopefully explaining those, it also prints a
generic default case for any other exception code.

Also, I'm attaching a Makefile that will create a test executable for
each of the three cases above.  It's totally standalone, you can type
make check and it will build and run the checks.  This is what the
current output looks like:

$ ./dll_not_found
dll_not_found.exe: one or more DLLs that this program requires cannot be
located by the system.  Make sure the PATH is correct and re-run the
setup program to install any packages indicated as necessary to satisfy
library dependencies.
Killed

$ ./missing_import
missing_import.exe: an entry point for one of more symbols could not be
found during program initialization.  Usually this means an incorrect
or out of date version of one or more DLLs is being erroniously found
on the PATH.
Killed

$ ./rdata_relocs 
rdata_relocs.exe: the process encountered an unhandled access violation
fault.
If this happens immediately and consistently at process startup,
one likely cause is relocs in the .rdata section as a result of
the runtime pseudo-reloc feature being applied to data imports
in 'const' structures.  Relinking with a linker script that marks
the .rdata section writeable can solve this problem.
Killed

In all three of these cases, the current behavior is that you would get
a GUI popup box from csrss.exe if you ran them from strace (since strace
does not call SetErrorMode() and that setting is inherited) but you get
absolutely no indication of an error if you run them from a Cygwin
process... other than $? if you know to check it.

I'm not 100% convinced that the change to the sigproc/pinfo stuff is
totally correct and safe, as it's pretty involved code and I had to
scatch my head for a while to figure out how everything interacts.  So
please do kick the tires.

BTW, when you *do* get the GUI popup messageboxes, they are helpful in
that the identify the precise DLL that's missing or the function that
isn't present, etc.  I was really hoping to figure out a cool way to get
that info, perhaps by poking around in the TEB or PEB somewhere, but I
haven't gotten that far.  If anyone has any general ideas where to look
for NTLDR's internal state, I'm all ears.  I have a hunch it would be
possible to get if we were running the exec'd process in a debugger loop
and pumping WaitForDebugEvent() messages, since those can have
parameters attached to exception codes.  But that's a little too
extreme.

Brian2008-03-13  Brian Dessent  [EMAIL PROTECTED]

* ntdll.h: Add several missing NTSTATUS defines.
* pinfo.cc (pinfo::maybe_set_exit_code_from_windows): Recognise
and preserve the exit status of a child that terminated with
an unhandled exception.
(pinfo::exit): Make the whole NTSTATUS exit code available to
the wait()ing parent when an exec()d process fails due to
an unhandled exception.
* pinfo.h (class _pinfo): Fix 

Re: [patch] recognise when an exec()d process terminates due to unhandled exception

2008-03-13 Thread Eric Blake

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

According to Brian Dessent on 3/13/2008 7:45 PM:
| Anyway, the attached patch fixes all that by adding logic to let the
| actual NTSTATUS logic percolate up to the waiting parent, so that it can
| recognise these kinds of common(ish) faults and print a friendly message
| -- or at least something other than silently dieing with no output.

Cool!  However, I haven't looked at the patch itself, yet.

| $ ./dll_not_found
| dll_not_found.exe: one or more DLLs that this program requires cannot be
| located by the system.  Make sure the PATH is correct and re-run the
| setup program to install any packages indicated as necessary to satisfy
| library dependencies.
| Killed

Should we also mention 'cygcheck ./dll_not_found' to find out which ones
are missing?

|
| $ ./missing_import
| missing_import.exe: an entry point for one of more symbols could not be
| found during program initialization.  Usually this means an incorrect
| or out of date version of one or more DLLs is being erroniously found
| on the PATH.
| Killed

s/erroniously/erroneously/

- --
Don't work too hard, make some time for fun as well!

Eric Blake [EMAIL PROTECTED]
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.8 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkfZ21gACgkQ84KuGfSFAYCjQwCfdmAdES3oXUTF0rf9eMFCvDBJ
SbIAn1xfTEwKHZDUAloRo4VdvEt99xWJ
=W9DL
-END PGP SIGNATURE-


Re: [patch] recognise when an exec()d process terminates due to unhandled exception

2008-03-13 Thread Brian Dessent
Eric Blake wrote:

 Should we also mention 'cygcheck ./dll_not_found' to find out which ones
 are missing?

It might be a good idea.  On the other hand it's kind of long already. 
I'm totally not married to what I've got for the wording though,
consider it a very rough draft.

 | missing_import.exe: an entry point for one of more symbols could not be
 | found during program initialization.  Usually this means an incorrect
 | or out of date version of one or more DLLs is being erroniously found
 | on the PATH.
 | Killed
 
 s/erroniously/erroneously/

Drat and s/one of more/one or more/ as well.

Brian