Re: [patch] recognise when an exec()d process terminates due to unhandled exception
On Sun, Mar 23, 2008 at 08:09:06PM -0400, Christopher Faylor wrote: >On Sun, Mar 23, 2008 at 01:48:15PM -0700, Brian Dessent wrote: >>Christopher Faylor wrote: >>>After poking at this a little, I think it would be better to issue a >>>linux-like error message. >>> >>>In my sandbox, I now have this: >>> >>>bash-3.2$ ./libtest /cygdrive/s/test/libtest.exe: error while loading >>>shared libraries: liba.dll: cannot open shared object file: No such >>>file or directory >>> >>>I haven't done the work to report on missing symbols yet but I think >>>that's a much less common error condition. >> >>Excellent. The wording isn't really that important to me. But I think >>what is important is that we don't allow the situation where something >>was unable to start and we are totally silent. That leads to confusion >>because people start to try to debug or blame the program being run >>when in fact the program never saw the light of day in the first place. >>It's totally baffling when it happens and you're not aware to look for >>it. So even if we can't give the name of the symbol in the case of a >>missing import, I think it's still important to say something. > >Yes. I really have been meaning to fix this for a long time. It's my >fault that cygwin has this bug. I appreciate your point the way to ing >how this could be solved. cgf
Re: [patch] recognise when an exec()d process terminates due to unhandled exception
On Sun, Mar 23, 2008 at 01:48:15PM -0700, Brian Dessent wrote: >Christopher Faylor wrote: >>After poking at this a little, I think it would be better to issue a >>linux-like error message. >> >>In my sandbox, I now have this: >> >>bash-3.2$ ./libtest /cygdrive/s/test/libtest.exe: error while loading >>shared libraries: liba.dll: cannot open shared object file: No such >>file or directory >> >>I haven't done the work to report on missing symbols yet but I think >>that's a much less common error condition. > >Excellent. The wording isn't really that important to me. But I think >what is important is that we don't allow the situation where something >was unable to start and we are totally silent. That leads to confusion >because people start to try to debug or blame the program being run >when in fact the program never saw the light of day in the first place. >It's totally baffling when it happens and you're not aware to look for >it. So even if we can't give the name of the symbol in the case of a >missing import, I think it's still important to say something. Yes. I really have been meaning to fix this for a long time. It's my fault that cygwin has this bug. I appreciate your point the way to how this could be solved. cgf
Re: [patch] recognise when an exec()d process terminates due to unhandled exception
Christopher Faylor wrote: > After poking at this a little, I think it would be better to issue a > linux-like error message. > > In my sandbox, I now have this: > > bash-3.2$ ./libtest > /cygdrive/s/test/libtest.exe: error while loading shared libraries: liba.dll: > cannot open shared object file: No such file or directory > > I haven't done the work to report on missing symbols yet but I think > that's a much less common error condition. Excellent. The wording isn't really that important to me. But I think what is important is that we don't allow the situation where something was unable to start and we are totally silent. That leads to confusion because people start to try to debug or blame the program being run when in fact the program never saw the light of day in the first place. It's totally baffling when it happens and you're not aware to look for it. So even if we can't give the name of the symbol in the case of a missing import, I think it's still important to say something. Brian
Re: [patch] recognise when an exec()d process terminates due to unhandled exception
On Fri, Mar 14, 2008 at 04:08:46AM -0700, Brian Dessent wrote: >Brian Dessent wrote: > >> isn't present, etc. I was really hoping to figure out a cool way to get >> that info, perhaps by poking around in the TEB or PEB somewhere, but I >> haven't gotten that far. If anyone has any general ideas where to look >> for NTLDR's internal state, I'm all ears. I have a hunch it would be >> possible to get if we were running the exec'd process in a debugger loop >> and pumping WaitForDebugEvent() messages, since those can have >> parameters attached to exception codes. But that's a little too >> extreme. > >For anyone curious, it's absolutely possible to do the above. For the >C139 fault (missing procedure point entry), %ebx at the time of the >fault points right at the AsciiZ name of the missing import in the >.idata section, -8(%ebp) points to the import name in UNICODE, and >-10(%ebp) points to the DLL name in UNICODE. > >For the C135 fault (the "unable to locate component popup"), %esi at >the time of the fault points right to the missing library name in >UNICODE. > >For the C005 fault (the LDR hits an access violation trying to fixup >a reloc .rdata), %ebx points to an AsciiZ name of the symbol it was >relocating and 24(%ebp) points to an AsciiZ filename of the module which >that symbol is supposed to be pointing into. > >Now I'm sure a lot of those above offsets are just coincidental, as I >haven't done much testing to see if it's reliably set as above. However >it does mean that it would be relatively easy to use the debug API to >step a process through its initialization and find out exactly why it's >faulting. I've been working on something along those lines for cygcheck >which will also give dynamic process tracing, i.e. runtime LoadLibrary >stuff. Combined with enabling the LDR snaps debug output, there is a >tremendous amount of debug capability hidden here. I took a more low-tech and undoubtedly less foolproof approach and just stepped through the list of dlls for the failing executable, trying to load them one at a time, reporting on the first dll to fail. This is not necessarily the same scenario as what the Windows "dynamic linker" is doing but it should report an *a* missing dll if not necessarily the dll that was causing the problem. It also won't necessarily get the search order right if the executable being started isn't in the same directory as the "parent" executable. That's fixable, though. cgf
Re: [patch] recognise when an exec()d process terminates due to unhandled exception
On Fri, Mar 14, 2008 at 11:11:43PM -0700, Brian Dessent wrote: >Christopher Faylor wrote: >>That was going to be my first observation, actually. I'm still trying >>to digest the patch but it seems like it wouldn't work well with the >>fork retry code. > >The patch doesn't change any behavior though: in current Cygwin if the >thing we're exec()ing returns a Win32 exit code of C135 (or >whatever) then we retry creating it 5 times. With the patch, we still >retry starting it 5 times but after the last one fails we recognise the >situation and print a message. After poking at this a little, I think it would be better to issue a linux-like error message. In my sandbox, I now have this: bash-3.2$ ./libtest /cygdrive/s/test/libtest.exe: error while loading shared libraries: liba.dll: cannot open shared object file: No such file or directory I haven't done the work to report on missing symbols yet but I think that's a much less common error condition. So thanks for the patch and the concept but I think I'd rather go with a more linux-like solution. cgf
Re: [patch] recognise when an exec()d process terminates due to unhandled exception
Christopher Faylor wrote: > That was going to be my first observation, actually. I'm still trying > to digest the patch but it seems like it wouldn't work well with the > fork retry code. The patch doesn't change any behavior though: in current Cygwin if the thing we're exec()ing returns a Win32 exit code of C135 (or whatever) then we retry creating it 5 times. With the patch, we still retry starting it 5 times but after the last one fails we recognise the situation and print a message. Brian
Re: [patch] recognise when an exec()d process terminates due to unhandled exception
On Thu, Mar 13, 2008 at 07:46:37PM -0700, Brian Dessent wrote: >Brian Dessent wrote: > >> As we all know, Cygwin calls SetErrorMode (SEM_FAILCRITICALERRORS) to >> suppress those pop up GUI messageboxes from the operating system when > >Oh, I forgot to mention: > >In the course of testing this I came to realize that because of some >sort of "retry if fork doesn't seem to be working" code (not sure of the >details), every time that this situation comes up we are actually >launching five copies of the binary. That was going to be my first observation, actually. I'm still trying to digest the patch but it seems like it wouldn't work well with the fork retry code. cgf
Re: [patch] recognise when an exec()d process terminates due to unhandled exception
On Mar 14 04:08, Brian Dessent wrote: > Brian Dessent wrote: > > > isn't present, etc. I was really hoping to figure out a cool way to get > > that info, perhaps by poking around in the TEB or PEB somewhere, but I > > haven't gotten that far. If anyone has any general ideas where to look > > for NTLDR's internal state, I'm all ears. I have a hunch it would be > > possible to get if we were running the exec'd process in a debugger loop > > and pumping WaitForDebugEvent() messages, since those can have > > parameters attached to exception codes. But that's a little too > > extreme. > > For anyone curious, it's absolutely possible to do the above. For the > C139 fault (missing procedure point entry), %ebx at the time of the > fault points right at the AsciiZ name of the missing import in the > .idata section, -8(%ebp) points to the import name in UNICODE, and > -10(%ebp) points to the DLL name in UNICODE. > > For the C135 fault (the "unable to locate component popup"), %esi at > the time of the fault points right to the missing library name in > UNICODE. > > For the C005 fault (the LDR hits an access violation trying to fixup > a reloc .rdata), %ebx points to an AsciiZ name of the symbol it was > relocating and 24(%ebp) points to an AsciiZ filename of the module which > that symbol is supposed to be pointing into. > > Now I'm sure a lot of those above offsets are just coincidental, as I > haven't done much testing to see if it's reliably set as above. However > it does mean that it would be relatively easy to use the debug API to > step a process through its initialization and find out exactly why it's > faulting. I've been working on something along those lines for cygcheck > which will also give dynamic process tracing, i.e. runtime LoadLibrary > stuff. Combined with enabling the LDR snaps debug output, there is a > tremendous amount of debug capability hidden here. That's really cool. Your patch looks good, but it's Chris' code so he will have the final say. What we also could do instead of adding this to the DLL is to add this to cygcheck and/or strace only. If somebody complains on the list that a process just exits, we can point him to "run it under cygcheck and it will tell you what's wrong". That would be already quite nice, imho. Corinna -- Corinna Vinschen Please, send mails regarding Cygwin to Cygwin Project Co-Leader cygwin AT cygwin DOT com Red Hat
Re: [patch] recognise when an exec()d process terminates due to unhandled exception
Brian Dessent wrote: > isn't present, etc. I was really hoping to figure out a cool way to get > that info, perhaps by poking around in the TEB or PEB somewhere, but I > haven't gotten that far. If anyone has any general ideas where to look > for NTLDR's internal state, I'm all ears. I have a hunch it would be > possible to get if we were running the exec'd process in a debugger loop > and pumping WaitForDebugEvent() messages, since those can have > parameters attached to exception codes. But that's a little too > extreme. For anyone curious, it's absolutely possible to do the above. For the C139 fault (missing procedure point entry), %ebx at the time of the fault points right at the AsciiZ name of the missing import in the .idata section, -8(%ebp) points to the import name in UNICODE, and -10(%ebp) points to the DLL name in UNICODE. For the C135 fault (the "unable to locate component popup"), %esi at the time of the fault points right to the missing library name in UNICODE. For the C005 fault (the LDR hits an access violation trying to fixup a reloc .rdata), %ebx points to an AsciiZ name of the symbol it was relocating and 24(%ebp) points to an AsciiZ filename of the module which that symbol is supposed to be pointing into. Now I'm sure a lot of those above offsets are just coincidental, as I haven't done much testing to see if it's reliably set as above. However it does mean that it would be relatively easy to use the debug API to step a process through its initialization and find out exactly why it's faulting. I've been working on something along those lines for cygcheck which will also give dynamic process tracing, i.e. runtime LoadLibrary stuff. Combined with enabling the LDR snaps debug output, there is a tremendous amount of debug capability hidden here. Brian
Re: [patch] recognise when an exec()d process terminates due to unhandled exception
Brian Dessent wrote: > As we all know, Cygwin calls SetErrorMode (SEM_FAILCRITICALERRORS) to > suppress those pop up GUI messageboxes from the operating system when Oh, I forgot to mention: In the course of testing this I came to realize that because of some sort of "retry if fork doesn't seem to be working" code (not sure of the details), every time that this situation comes up we are actually launching five copies of the binary. I had at one point a testcase so pathological that it somehow managed to invoke a MS runtime popup error, even though it was a pure Cygwin binary. And you'd get 5 copies of the thing before it gave up and stopped trying. You can see it in strace a lot easier though since strace is running the debugger loop you see the exceptions. Brian
Re: [patch] recognise when an exec()d process terminates due to unhandled exception
Eric Blake wrote: > Should we also mention 'cygcheck ./dll_not_found' to find out which ones > are missing? It might be a good idea. On the other hand it's kind of long already. I'm totally not married to what I've got for the wording though, consider it a very rough draft. > | missing_import.exe: an entry point for one of more symbols could not be > | found during program initialization. Usually this means an incorrect > | or out of date version of one or more DLLs is being erroniously found > | on the PATH. > | Killed > > s/erroniously/erroneously/ Drat and s/one of more/one or more/ as well. Brian
Re: [patch] recognise when an exec()d process terminates due to unhandled exception
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 According to Brian Dessent on 3/13/2008 7:45 PM: | Anyway, the attached patch fixes all that by adding logic to let the | actual NTSTATUS logic percolate up to the waiting parent, so that it can | recognise these kinds of common(ish) faults and print a friendly message | -- or at least something other than silently dieing with no output. Cool! However, I haven't looked at the patch itself, yet. | $ ./dll_not_found | dll_not_found.exe: one or more DLLs that this program requires cannot be | located by the system. Make sure the PATH is correct and re-run the | setup program to install any packages indicated as necessary to satisfy | library dependencies. | Killed Should we also mention 'cygcheck ./dll_not_found' to find out which ones are missing? | | $ ./missing_import | missing_import.exe: an entry point for one of more symbols could not be | found during program initialization. Usually this means an incorrect | or out of date version of one or more DLLs is being erroniously found | on the PATH. | Killed s/erroniously/erroneously/ - -- Don't work too hard, make some time for fun as well! Eric Blake [EMAIL PROTECTED] -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.8 (Cygwin) Comment: Public key at home.comcast.net/~ericblake/eblake.gpg Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkfZ21gACgkQ84KuGfSFAYCjQwCfdmAdES3oXUTF0rf9eMFCvDBJ SbIAn1xfTEwKHZDUAloRo4VdvEt99xWJ =W9DL -END PGP SIGNATURE-
[patch] recognise when an exec()d process terminates due to unhandled exception
As we all know, Cygwin calls SetErrorMode (SEM_FAILCRITICALERRORS) to suppress those pop up GUI messageboxes from the operating system when a process encounters an unhandled exception. This has the advantage of making things more POSIX-like, and I'm sure people that run long testsuites or unattended headless servers appreciate not coming in after a long weeked to find that their server has been wedged for days waiting for someone to click on "OK". But of course if you follow the user list you also know that this is a double edged sword, in that currently if a required DLL is missing there is zero indication other than an curiously arbitrary exit status code of 53 decimal, or 0x35 hex which is the low byte of 0xC135, STATUS_DLL_NOT_FOUND. Anyway, the attached patch fixes all that by adding logic to let the actual NTSTATUS logic percolate up to the waiting parent, so that it can recognise these kinds of common(ish) faults and print a friendly message -- or at least something other than silently dieing with no output. After printing the message, the NTSTATUS is discarded and the exit status code is replaced with a synthetic exit status corresponding to "killed by SIGKILL" as read by the wait()ing parent, which means the shell will also append "Killed" to that message. I tried "0x80 | SIGSEGV", corresponding to "segmentation fault and core dumped" but since we aren't actually generating a core file, it seemed a little weird to see the shell say that there was one generated. The point here is that the exit status that the parent (in most cases the shell) sees is totally arbitrary, so we can put whatever makes the most sense there. I just figured that the shell printing "Killed" most closely corresponds to the actual situation of the OS terminating the process due to an unhandled exception. There are three specific cases that I had in mind to handle with a graceful message: 1. the user is missing a DLL 2. the DLL that is found is missing symbols 3. relocs in the .rdata section In addition to catching and hopefully explaining those, it also prints a generic default case for any other exception code. Also, I'm attaching a Makefile that will create a test executable for each of the three cases above. It's totally standalone, you can type "make check" and it will build and run the checks. This is what the current output looks like: $ ./dll_not_found dll_not_found.exe: one or more DLLs that this program requires cannot be located by the system. Make sure the PATH is correct and re-run the setup program to install any packages indicated as necessary to satisfy library dependencies. Killed $ ./missing_import missing_import.exe: an entry point for one of more symbols could not be found during program initialization. Usually this means an incorrect or out of date version of one or more DLLs is being erroniously found on the PATH. Killed $ ./rdata_relocs rdata_relocs.exe: the process encountered an unhandled access violation fault. If this happens immediately and consistently at process startup, one likely cause is relocs in the .rdata section as a result of the runtime pseudo-reloc feature being applied to data imports in 'const' structures. Relinking with a linker script that marks the .rdata section writeable can solve this problem. Killed In all three of these cases, the current behavior is that you would get a GUI popup box from csrss.exe if you ran them from strace (since strace does not call SetErrorMode() and that setting is inherited) but you get absolutely no indication of an error if you run them from a Cygwin process... other than $? if you know to check it. I'm not 100% convinced that the change to the sigproc/pinfo stuff is totally correct and safe, as it's pretty involved code and I had to scatch my head for a while to figure out how everything interacts. So please do kick the tires. BTW, when you *do* get the GUI popup messageboxes, they are helpful in that the identify the precise DLL that's missing or the function that isn't present, etc. I was really hoping to figure out a cool way to get that info, perhaps by poking around in the TEB or PEB somewhere, but I haven't gotten that far. If anyone has any general ideas where to look for NTLDR's internal state, I'm all ears. I have a hunch it would be possible to get if we were running the exec'd process in a debugger loop and pumping WaitForDebugEvent() messages, since those can have parameters attached to exception codes. But that's a little too extreme. Brian2008-03-13 Brian Dessent <[EMAIL PROTECTED]> * ntdll.h: Add several missing NTSTATUS defines. * pinfo.cc (pinfo::maybe_set_exit_code_from_windows): Recognise and preserve the exit status of a child that terminated with an unhandled exception. (pinfo::exit): Make the whole NTSTATUS exit code available to the wait()ing parent when an exec()d process fails due to an unhandled exception. * pinfo.h (clas