Eliot, It would be nice to have the #name of each process in the dump. I often end up with many similar/identical processes, and the name is very useful in sorting out which among them has/have strayed.
Re guess (a - unsignaled semaphore), does that perhaps suggest a missing #ensure: block? Another possibility (just asking) is that we are using a semaphore for mutual exclusion where a mutex would be a better choice. When I started using threads, I had a robust mutex class "from the start", and the differences between a mutual exclusion semaphore and a mutex were striking.

Re guess (b - lockup in #clearExternalObjects), Norbert mentioned saving the image in connection with this. Saving the image is very much a "main thread" activity, and as such, there might be a need to queue a deferred action rather than invoking the code from a background thread.

I was going to add something about my reservations on our weak collections, which (IMHO) must be thread safe and self-cleaning, and are not in Squeak/Pharo. Even in Dolphin's earliest docs, weak collections and finalization were one topic, for good reason. Toward that end, I looked at ExternalSemaphoreTable, expecting to find it subclassed from or using a weak collection of some type. What I found is a #forMutualExclusion semaphore in a situation where I would use a Mutex.

This looks like a matter of evolution and timing. Squeak dates back to an era before structured exception handling and improvements like Mutex. Dolphin got started after Squeak, with either a two or fifteen year head start, depending on how one wants to call it. Dolphin was built from the ground up to have weak collections, finalization, and a full set of process synchronization tools, making Mutex a part of the toolkit when its sockets were written.

We have Mutex now, and probably should be using it more widely than we do.

Bill

________________________________
From: [email protected] [[email protected]] on behalf of Eliot Miranda [[email protected]]
Sent: Thursday, December 08, 2011 1:08 PM
To: [email protected]
Subject: Re: [Pharo-project] How to resurrect an unrepsonsive image?
Hi Norbert,

On Thu, Dec 8, 2011 at 2:04 AM, Norbert Hartl <[email protected]> wrote:

Eliot, can you take a look at the attached crash.dmp file? My knowledge of the internals is limited, so I'm not a good candidate to get a feeling for what could have gone wrong. If I had to guess, I would think there is a deadlock in ExternalSemaphoreTable. It looks to me as if the outgoing network connect tries to register an external object at the same time that snapshot:andQuit: tries to clear the external objects. But I know nothing about how this works.

OK, as a favour to you. Next time you do the leg work. But all the info you need is in the dump. First thing, the active process is the idle process:

Process 0x8f1d2a8 priority 10
0xbff60000 M ProcessorScheduler class>idleProcess 64373036: a(n) ProcessorScheduler class
0x8f929b0 s [] in ProcessorScheduler class>startUp
0x8f1d248 s [] in BlockClosure>newProcess

(you can see from the C stack trace that the VM is looping doing primitiveRelinquishProcessor)

The next process is the finalization process, which has nothing to do. Then Process 0x8d73268 priority 20 is trying to do a connect but is blocked in a critical section trying to register one or other of the Socket's semaphores. Then Process 0x8f23b98 priority 20 is trying to do a snapshot and is blocked in a critical section doing SmalltalkImage>clearExternalObjects.

So my guesses are either that a) something terminated a process that was in the critical section for registering external objects, and the semaphore protecting it is missing a signal, or b) there is a bug in the code, and if a process is in the critical section registering an external object then SmalltalkImage>clearExternalObjects may lock up.

But you can read Smalltalk stack traces as well as I can. Just look at them. There are only 11 of them.
HTH
Eliot

Norbert

On 07.12.2011 at 22:49, Eliot Miranda wrote:

On Wed, Dec 7, 2011 at 10:55 AM, Schwab,Wilhelm K <[email protected]> wrote:

That assumes there is an error. Another (even more frustrating) failure arises when an image does nothing. It is very helpful to be able to get call stacks in that scenario.

On Mac and Linux the Cog VM responds to SIGUSR1 by dumping all stacks to crash.dmp.

________________________________
From: [email protected] [[email protected]] on behalf of Javier Pimás [[email protected]]
Sent: Wednesday, December 07, 2011 1:52 PM
To: [email protected]
Subject: Re: [Pharo-project] How to resurrect an unrepsonsive image?

Probably you have a crash.dmp or a PharoDebug.log which tells you what is happening. Look at the backtrace to see what causes the error, and then maybe there's some way to help you fix it.

Cheers,
Javier.

On Wed, Dec 7, 2011 at 10:47 AM, Norbert Hartl <[email protected]> wrote:

I have a headless image that was running for a couple of days. Now it is not responding anymore. The image was still running, but the sockets for http and vnc were closed, and restarting the image just brings it up without any sockets opened. Injecting something via script does not work either. I copied the image to my desktop and tried a few things, but with no success. What would be a good way to get a glimpse of what is causing the problems? I had a Scheduler running that saved the image every hour. And the image itself issues http requests to the outside world every 10 minutes.

thanks,
Norbert

--
Lic. Javier Pimás
Ciudad de Buenos Aires

--
best,
Eliot
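
Note on the semaphore vs. mutex question raised at the top of the thread: the two failure modes discussed there (a re-entered critical section, and a process that dies inside one without signalling) can be sketched outside Smalltalk. The following is only an analogy in Python, not Pharo code. It assumes that threading.Semaphore(1) stands in for a #forMutualExclusion semaphore, threading.RLock for an owner-tracking Mutex, and try/finally for #ensure:. The helper names (reenter, unguarded, guarded, still_usable_after) are hypothetical and exist only for this illustration.

```python
import threading


def reenter(lock, timeout=0.2):
    """Acquire `lock`, then try to acquire it again from the same thread.

    Returns True if the nested acquire succeeded and False if it timed out.
    Without the timeout, the False case would be a permanent hang.
    """
    with lock:
        nested = lock.acquire(timeout=timeout)
        if nested:
            lock.release()
        return nested


def failing_work():
    """Stand-in for code that fails inside the critical section."""
    raise RuntimeError("simulated failure inside the critical section")


def unguarded(sem):
    """Critical section with no cleanup: the release is skipped on failure."""
    sem.acquire()
    failing_work()
    sem.release()  # never reached


def guarded(sem):
    """Critical section with try/finally (the analogue of #ensure:)."""
    sem.acquire()
    try:
        failing_work()
    finally:
        sem.release()


def still_usable_after(section):
    """Run `section` in a worker that dies, then test the semaphore.

    Returns True if another thread can still enter the critical section.
    """
    sem = threading.Semaphore(1)

    def worker():
        try:
            section(sem)
        except RuntimeError:
            pass  # the worker is gone; only the state of `sem` matters now

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    available = sem.acquire(timeout=0.2)
    if available:
        sem.release()
    return available


if __name__ == "__main__":
    print("semaphore re-entry succeeds:", reenter(threading.Semaphore(1)))
    print("re-entrant lock re-entry succeeds:", reenter(threading.RLock()))
    print("usable after unguarded failure:", still_usable_after(unguarded))
    print("usable after guarded failure:", still_usable_after(guarded))
```

In this sketch the plain semaphore fails both checks that the re-entrant lock and the guarded section pass. Whether either pattern is what actually happened in ExternalSemaphoreTable cannot be told from the sketch; only the stack traces in crash.dmp can answer that.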
