Hello there,

there seems to be one pretty rare, ugly and hard-to-find lock in my 
application (I shall get back to it at the end, in the hope it might ring a 
bell), but what's weirdest: it seems that when it happens, it is _wotaskd_ 
that primarily goes down?!?

Alas, the information is sparse: this happens at the deployment site, to which 
the programming team has no access (and so far we have not been able to 
reproduce the problem at the test site, whatever we try), but according to the 
site admin and the logs, it looks like this:

(a) first, one of the worker threads hangs somehow, so far inexplicably (an EC 
locking problem is possible but improbable, as explained below)
(b) for some time, the other threads run without a glitch, new requests are 
served, new R/R loop worker threads are spawned and logged (I log all R/R 
loops)
(c) shortly (within minutes), though, the adaptor begins to redirect requests 
to the “Redirection URL”
(d) now the site admin is alerted; he runs JavaMonitor, **which reports “Failed 
to contact 127.0.0.1-1085”**!
(e) he finds which process belongs to *the application instance* (*not* 
wotaskd!) and kills it from Terminal
(f) which causes wotaskd to magically cure itself: JavaMonitor starts working 
again, stops showing the 1085 failure, allows him to re-launch the instance, 
and all is well and swell.

Does this perhaps ring a bell? To me this behaviour does not make any sense :/

As for the hang itself, it's rather weird too. There is a loop which goes 
through a list of EOs and logs each of them. Something like this:

===
        for (DBTimeChunk tch in session().currentMarket.orderedTimeChunks()) {
            log.info("" + tch)
            if (tch.someTimestamp > fixedTimestamp) continue // happens to be true in our case
            // ... therefore some irrelevant code here (it would log if it happened; it does not) ...
        }
===

The problem is that

- it goes through some of the TimeChunks, and _then_ it hangs -- not at the 
start of the R/R loop, where EC locking problems could be expected
- in the same session, with the same EC, even in the same thread (for the 
method which contains the loop happens to be used twice in the page template), 
the loop had already run through all the TimeChunks, tested their 
someTimestamp, and finished without a glitch (so no fault is fired when it 
hangs)

So far it has happened about three times, each time on a different TimeChunk.

About the only thing I can guess _might_ cause the thread to hang is the 
logging of tch. TimeChunk's toString() is comparatively complex; among more 
mundane things, it might also call
- this.changesFromCommittedSnapshot()
- this.attributeKeys()
- this.primaryKey() (inherited from ERXGenericRecord)

Might one of them hang the thread if another thread does the same, or 
something else, at the wrong moment? (Presuming all of them had already been 
called for the same EO in the same thread, without problems, shortly before.)
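
For concreteness, the relevant part is more or less along these lines (a 
heavily simplified sketch, not the actual code; only the three calls listed 
above are certain):

===
    // inside DBTimeChunk extends ERXGenericRecord -- simplified illustration only,
    // the real toString() does more; the three calls are the ones listed above
    @Override
    public String toString() {
        return getClass().getSimpleName()
            + " pk=" + primaryKey()                          // from ERXGenericRecord
            + " attrs=" + attributeKeys()                    // EOEnterpriseObject API
            + " changes=" + changesFromCommittedSnapshot();  // compares against the committed snapshot
    }
===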

If it happens again, it would help if the site admin could, before killing the 
application, somehow force it to log the stack traces of all its threads. Is 
there some trick for that?
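
(If there is no ready-made trick -- perhaps plain "jstack <pid>" or 
"kill -QUIT <pid>" from Terminal already does this? -- I guess I could wire 
something like the sketch below into the application myself, e.g. behind a 
direct action the admin can hit from a browser; the ThreadDump name is of 
course made up:)

===
// Sketch only (not in the app yet): collect the stack traces of all live
// threads so they can be logged, e.g., from a direct action.
import java.util.Map;

public class ThreadDump {
    public static String allStackTraces() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            sb.append(e.getKey()).append('\n');      // thread name, priority, group
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
===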

And of course, I'll be extremely grateful for any other advice on how to hunt 
for this bloody kind of bug.

Thanks a lot,
OC

