Re: [Pharo-project] How to resurrect an unresponsive image?

Schwab,Wilhelm K Thu, 08 Dec 2011 12:04:51 -0800

Eliot,

It would be nice to have the #name of each process in the dump.  I often end up 
with many similar/identical processes, and the name is very useful in sorting 
out which among them has/have strayed.


Re guess (a - unsignaled semaphore), does that perhaps suggest a missing 
#ensure: block?  Another possibility (just asking) is that we are using a 
semaphore for mutual exclusion when a mutex would be a better choice?  When I 
started using threads, I had a robust mutex class "from the start" and the 
differences between a mutual exclusion semaphore and a mutex were striking.
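
To make one of those differences concrete, here is a minimal sketch in Python 
rather than Smalltalk, so that it runs anywhere. A mutual exclusion semaphore 
has no owner, so a process that re-enters its own critical section blocks on 
itself, while a reentrant mutex lets the owner through. threading.Semaphore(1) 
and threading.RLock are only stand-ins for Semaphore forMutualExclusion and 
Mutex; the helper name can_reenter and the timeout are assumptions made for the 
demo, and the timeout exists only so the self-deadlock is reported instead of 
hanging.

```python
import threading


def can_reenter(lock, timeout=0.2):
    """Return True if the current holder of `lock` can acquire it again."""
    with lock:                                  # outer critical section
        got_it = lock.acquire(timeout=timeout)  # nested entry by the same thread
        if got_it:
            lock.release()
        return got_it


# Stand-in for `Semaphore forMutualExclusion`: one permit, no notion of owner.
semaphore_as_lock = threading.Semaphore(1)
# Stand-in for `Mutex`: remembers its owner and lets it re-enter.
mutex = threading.RLock()

print("semaphore re-entry:", can_reenter(semaphore_as_lock))
print("mutex re-entry:    ", can_reenter(mutex))
```

The semaphore case times out on the nested acquire; the mutex case does not. 
Code that calls back into itself while holding the lock is safe only with the 
second kind.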

Re guess (b - lockup in #clearExternalObjects), Norbert mentioned saving the 
image in connection with this.  Saving the image is a very "main thread" 
activity, and as such, there might be a need to queue a deferred action vs. 
invoking the code from a background thread.
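
A rough sketch of what I mean by a deferred action, again in Python as a 
stand-in and not taken from any Pharo code: the background thread only queues 
the request, and the main thread runs it at a point where it is safe to do so. 
The names deferred_actions, save_image, hourly_scheduler and 
drain_deferred_actions are all hypothetical.

```python
import queue
import threading

deferred_actions = queue.Queue()   # hypothetical queue drained by the main thread
ran_on = []                        # records which thread performed each save


def save_image():
    # Placeholder for the real snapshot; it only notes the executing thread.
    ran_on.append(threading.current_thread().name)


def hourly_scheduler():
    # Background thread: request the save instead of performing it directly.
    deferred_actions.put(save_image)


def drain_deferred_actions():
    # Called by the main thread at a safe point in its own loop.
    while True:
        try:
            action = deferred_actions.get_nowait()
        except queue.Empty:
            return
        action()


worker = threading.Thread(target=hourly_scheduler, name="scheduler")
worker.start()
worker.join()
drain_deferred_actions()
print(ran_on)
```

The save is requested by the "scheduler" thread but executed by whichever 
thread drains the queue, so it never races with code that assumes it runs on 
the main thread.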

I was going to add something about my reservations on our weak collections, 
which (IMHO) must be thread safe and self-cleaning, and are not in 
Squeak/Pharo.  Even in Dolphin's earliest docs, weak collections and 
finalization were one topic, for good reason.  Toward that end, I looked at 
ExternalSemaphoreTable, expecting to find it subclassed from or using a weak 
collection of some type.  What I found is a #forMutualExclusion semaphore in a 
situation where I would use a Mutex.

This looks like a matter of evolution and timing.  Squeak dates back to an era 
before structured exception handling and improvements like Mutex.  Dolphin got 
started after Squeak, which had either a two or a fifteen year head start, 
depending on how one wants to count it.  Dolphin was built from the ground up 
to have weak collections, finalization, and a full set of process 
synchronization tools, making Mutex a part of the toolkit when its sockets were 
written.

We have Mutex now, and probably should be using it more widely than we do.

Bill



________________________________
From: [email protected] on behalf of Eliot Miranda [[email protected]]
Sent: Thursday, December 08, 2011 1:08 PM
To: [email protected]
Subject: Re: [Pharo-project] How to resurrect an unresponsive image?

Hi Norbert,

On Thu, Dec 8, 2011 at 2:04 AM, Norbert Hartl <[email protected]> wrote:
Eliot,

can you take a look at the attached crash.dmp file? My knowledge about the 
internals is limited so I'm not a good candidate to get a feeling for what 
could have gone wrong.
If I had to guess, I would think there is a deadlock in 
ExternalSemaphoreTable. It looks to me as if the outgoing network connect tries 
to register an external object at the same time as snapshot:andQuit: tries to 
clear the external objects. But I know nothing about how this works.

OK, as a favour to you.  Next time you do the leg work. But all the info you 
need is in the dump.  First thing, the active process is the idle process:

Process  0x8f1d2a8 priority 10
0xbff60000 M ProcessorScheduler class>idleProcess 64373036: a(n) ProcessorScheduler class
 0x8f929b0 s [] in ProcessorScheduler class>startUp
 0x8f1d248 s [] in BlockClosure>newProcess

(you can see from the C stack trace that the VM is looping doing 
primitiveRelinquishProcessor)

The next process is the finalization process, which has nothing to do.

Then Process  0x8d73268 priority 20 is trying to do a connect but is blocked in 
a critical section trying to register one or other of the Socket's semaphores.

Then Process  0x8f23b98 priority 20 is trying to do a snapshot and is blocked 
in a critical section doing SmalltalkImage>clearExternalObjects.

So my guesses are either that
    a) something terminated a process that was in the critical section for 
registering external objects and the semaphore protecting it is missing a 
signal, or
    b) there is a bug in the code and that if a process is in the critical 
section registering an external object then SmalltalkImage>clearExternalObjects 
may lock up.
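
Guess (a) can be reproduced in miniature. The sketch below uses Python threads 
as a stand-in for Smalltalk processes, with an exception playing the role of 
the termination; the function names are hypothetical and nothing here is taken 
from the ExternalSemaphoreTable code. If the critical section is left without 
the signal being sent, every later waiter blocks; wrapping the body in 
try/finally (the counterpart of an #ensure: block) restores the signal.

```python
import threading


def register_unprotected(sem, fail):
    sem.acquire()                      # wait
    if fail:
        raise RuntimeError("process died inside the critical section")
    sem.release()                      # signal: skipped if we raised above


def register_protected(sem, fail):
    sem.acquire()                      # wait
    try:
        if fail:
            raise RuntimeError("process died inside the critical section")
    finally:
        sem.release()                  # signal: always sent, like #ensure:


def still_usable(sem, timeout=0.2):
    """Return True if another caller can still enter the critical section."""
    if sem.acquire(timeout=timeout):
        sem.release()
        return True
    return False


def run_and_die(register):
    """Let a worker die inside `register`, then probe the semaphore."""
    sem = threading.Semaphore(1)

    def body():
        try:
            register(sem, fail=True)
        except RuntimeError:
            pass                       # the worker is gone; only sem's state remains

    worker = threading.Thread(target=body)
    worker.start()
    worker.join()
    return still_usable(sem)


print("without finally:", run_and_die(register_unprotected))
print("with finally:   ", run_and_die(register_protected))
```

In the unprotected case the probe times out, which is the same picture as two 
processes parked in a critical section with nobody left to signal.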

But you can read Smalltalk stack traces as well as I.  Just look at them. There 
are only 11 of them.

HTH
Eliot


Norbert



Am 07.12.2011 um 22:49 schrieb Eliot Miranda:



On Wed, Dec 7, 2011 at 10:55 AM, Schwab,Wilhelm K <[email protected]> wrote:
That assumes there is an error.  Another (even more frustrating) failure arises 
when an image does nothing. It is very helpful to be able to get callstacks in 
that scenario.

On Mac and Linux the Cog VM responds to SIGUSR1 by dumping all stacks to 
crash.dmp.





________________________________
From: [email protected] on behalf of Javier Pimás [[email protected]]
Sent: Wednesday, December 07, 2011 1:52 PM
To: [email protected]
Subject: Re: [Pharo-project] How to resurrect an unresponsive image?

Probably you have a crash.dmp or a PharoDebug.log which tells you what is 
happening. Look at the backtrace to see what causes the error and then maybe 
there's some way to help you fix it.

Cheers,
Javier.

On Wed, Dec 7, 2011 at 10:47 AM, Norbert Hartl <[email protected]> wrote:
I have a headless image that was running for a couple of days. Now it is not 
responding anymore. The image was still running but the sockets for http and 
vnc were closed and restarting the image just brings it up without any sockets 
opened.
Injecting something via script does not work either. I copied the image to my 
desktop and tried a few things but had no success.
What would be a good way to get a glimpse of what is causing problems?
I had a Scheduler running that saved the image every hour. And the image itself 
issues http requests to the outside world every 10 minutes.

thanks,

Norbert



--
Lic. Javier Pimás
Ciudad de Buenos Aires



--
best,
Eliot
