[Emc-developers] tool change abort hang

Michael Haberler Mon, 27 Dec 2010 23:33:54 -0800

While I haven't fixed it yet, this is my current theory on how to resolve 
it. I'd appreciate comments.


Symptom:
- emc starts a toolchange
- toolchange is aborted, eg. by Escape key in Axis
- emc freezes for up to five seconds, then a log message like this appears:

emc/task/iotaskintf.cc 155: Command to IO level (EMC_TOOL_ABORT:+1103,+12,    +0,) timed out waiting for last command done.
emc/task/iotaskintf.cc 158: emcIoStatus->echo_serial_number=10, emcIoCommandSerialNumber=10, emcIoStatus->status=2
emc/task/iotaskintf.cc 163: Last command sent to IO level was (EMC_TOOL_LOAD:+1105,+12,   +10,)


What I think *should* be happening:

- emc starts a toolchange, sends an EMC_TOOL_LOAD to iocontrol and waits for 
it to be acknowledged before proceeding
- iocontrol receives that and starts its tool-change/tool-changed pin protocol 
with the external toolchanger script
- iocontrol should also listen on the toolCmd channel from emc for any other 
messages
- when the toolchange is aborted, e.g. by Escape in Axis or otherwise, emc 
should immediately queue an EMC_TOOL_ABORT to iocontrol
- iocontrol should periodically peek into the queue even if a toolchange is 
pending. If it sees an EMC_TOOL_ABORT it should clean up (e.g. deassert the 
tool-change pin) and probably acknowledge both messages so emc continues.
- note that this assumes that the queue between emc and iocontrol is a 
bona-fide queue, i.e. can hold more than one message (the pending TOOL_LOAD and 
the TOOL_ABORT).

What I *think* is happening:

- emc starts a toolchange and sends an EMC_TOOL_LOAD to iocontrol
- iocontrol receives that and starts its tool-change/tool-changed pin protocol 
with the external toolchanger script
- when the toolchange is aborted, e.g. by Escape in Axis or otherwise, emc 
should queue an EMC_TOOL_ABORT to iocontrol. However, emc first waits for the 
serial number of the last EMC_TOOL_LOAD command to be acknowledged before it 
goes on to send the EMC_TOOL_ABORT
- that never happens so it's a classic deadlock, which is "resolved" by a 
timeout, resulting in the above message.
- it looks like in this state, iocontrol never really gets the EMC_TOOL_ABORT.
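
As a rough model of this wait (a sketch only, not the actual iotaskintf.cc 
code): the names IoStatus and waitForLastCommandDone are hypothetical, and the 
status values are an assumption that status=2 in the log above means "still 
executing". The sender polls until the last serial number is echoed back with 
a "done" status, and gives up after a bounded number of polls.

```cpp
#include <cassert>

// Assumed status values; status=2 in the log is taken to mean "executing".
enum class Status { Done = 1, Exec = 2, Error = 3 };

// Hypothetical snapshot of what the IO level reports back.
struct IoStatus {
    int echoSerial;  // serial number of the command the IO level is working on
    Status status;   // progress of that command
};

// Polls the IO status up to maxPolls times. Returns true once the command
// with serial number sentSerial is reported done, false on timeout.
template <typename Poll>
bool waitForLastCommandDone(int sentSerial, int maxPolls, Poll pollStatus) {
    assert(maxPolls > 0);
    for (int i = 0; i < maxPolls; ++i) {
        IoStatus s = pollStatus();
        if (s.echoSerial == sentSerial && s.status == Status::Done) {
            return true;
        }
    }
    return false;  // timed out: the previous command never completed
}
```

With a status that stays at {10, Exec}, as in the log, this wait can only end 
by timing out, and the abort is never sent while the toolchange is pending.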

What's likely to be wrong:

For emc to be able to queue an EMC_TOOL_ABORT while the EMC_TOOL_LOAD still sits 
in the queue waiting to be acknowledged by iocontrol, the queue size must be > 1.
However, it seems to me the 'queue' between emc and iocontrol has queue size 1 
(that is, just a shared memory buffer with mutex protection). See the line in 
emc.nml describing the toolCmd buffer:

# These are for the IO controller, EMCIO
B toolCmd               SHMEM   localhost       1024    0       0       4       16 1004 TCP=5005 xdr

http://www.isd.mel.nist.gov/projects/rcslib/NMLcfg.html states: "...To enable 
queuing of messages in the buffer, add the word "queue" to the buffer line. The 
size of the buffer determines how many messages can be simultaneously queued."

So to have a real queue, this should probably read:
B toolCmd               SHMEM   localhost       1024    0       0       4       16 1004 TCP=5005 xdr queue <--- enable queueing

I should probably check whether 1024 is large enough to hold at least two 
messages of worst-case size.

Second, iotaskintf.cc:sendCommand() needs to be changed as follows:
if the new command is an EMC_TOOL_ABORT and the previous unacked command was an 
EMC_TOOL_LOAD or EMC_TOOL_PREPARE, do not wait for the old command to be 
acknowledged but immediately queue the EMC_TOOL_ABORT.
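
The rule above can be sketched as a small predicate. This is only an 
illustration under assumptions: IoCmd and mayBypassWait are hypothetical names 
standing in for the real NML message types and for whatever check sendCommand() 
would perform.

```cpp
#include <cassert>

// Hypothetical stand-ins for the NML command types involved.
enum class IoCmd { None, ToolPrepare, ToolLoad, ToolAbort, Other };

// Returns true when `next` may be written to the toolCmd buffer right away,
// i.e. without first waiting for `pending` to be acknowledged.
bool mayBypassWait(IoCmd pending, bool pendingAcked, IoCmd next) {
    assert(next != IoCmd::None);  // a new command must be a real command

    // Nothing outstanding: there is nothing to wait for.
    if (pending == IoCmd::None || pendingAcked) {
        return true;
    }
    // Only an abort is allowed to jump ahead of an unacknowledged command...
    if (next != IoCmd::ToolAbort) {
        return false;
    }
    // ...and only when that command is a tool load or tool prepare.
    return pending == IoCmd::ToolLoad || pending == IoCmd::ToolPrepare;
}
```

All other command pairs keep the existing behaviour of waiting for the previous 
command, so the change stays limited to the abort path.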

Third, iocontrol needs to be changed as follows:
It needs to peek() into the queue while there's a pending toolchange, checking 
whether an EMC_TOOL_ABORT msg is sitting there. If so, clean up, acknowledge 
both messages and revert to idle.
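
A toy model of that loop, assuming the command buffer really behaves as a 
queue: IoControlModel, Msg and the member names are hypothetical, a std::deque 
stands in for the NML buffer, and the two booleans stand in for the 
tool-change and tool-changed pins. It is a sketch of the intended behaviour, 
not the iocontrol implementation.

```cpp
#include <cassert>
#include <deque>

// Hypothetical message kinds; the real code uses NML message classes.
enum class Msg { ToolLoad, ToolAbort };

struct IoControlModel {
    bool toolChangePin = false;  // models the tool-change pin
    bool changePending = false;  // a tool load is currently in progress
    int acked = 0;               // number of messages acknowledged so far

    // One pass of the main loop; `toolChanged` models the tool-changed pin.
    void poll(std::deque<Msg>& queue, bool toolChanged) {
        assert(toolChangePin == changePending);  // pin is set only during a change
        if (changePending) {
            // Peek at the queue even though a toolchange is pending.
            if (!queue.empty() && queue.front() == Msg::ToolAbort) {
                queue.pop_front();
                finish(2);  // acknowledge both the load and the abort
            } else if (toolChanged) {
                finish(1);  // normal completion of the load
            }
            return;
        }
        if (queue.empty()) return;
        Msg m = queue.front();
        queue.pop_front();
        if (m == Msg::ToolLoad) {
            toolChangePin = true;  // ask the external toolchanger to start
            changePending = true;
        } else {
            ++acked;  // abort with nothing pending: just acknowledge it
        }
    }

    // Clean up and revert to idle, acknowledging `acks` messages.
    void finish(int acks) {
        toolChangePin = false;
        changePending = false;
        acked += acks;
    }
};
```

Feeding it a ToolLoad followed by a ToolAbort leaves the pin deasserted, the 
queue empty and both messages acknowledged after two calls to poll().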

Thanks to Alex Joni for coaching me so far.

-Michael

