On Mon, Oct 20, 2014 at 07:53:24PM +1100, Frank Tkalcevic wrote:
> I run axis on the local machine, then remote to it using keystick.

It also *typically* works when using e.g., axis + halui or axis +
linuxcncrsh.

However, sometimes it doesn't.  This leads to reports like bug 328 (axis +
halui: while halui is in use, axis is sometimes unresponsive for a few
seconds) and bug 395 (axis + linuxcncrsh: linuxcncrsh becomes totally
unresponsive).

The problem crops up when the UI wants to do one of two things: wait to
be certain its command was received by task; or wait to be certain its
command was fully acted on by task.

All the current UIs have an implementation similar to this one (from shcom):
    int emcCommandWaitReceived(int serial_number)
    {
        double end = 0.0;

        while (emcTimeout <= 0.0 || end < emcTimeout) {
            updateStatus();

            if (emcStatus->echo_serial_number == serial_number) {
                return 0;
            }

            esleep(EMC_COMMAND_DELAY);
            end += EMC_COMMAND_DELAY;
        }

        return -1;
    }
In this implementation, the UI waits for up to emcTimeout seconds (or forever,
if emcTimeout <= 0) for the stat buffer to hold a certain serial number in
echo_serial_number.  (Problems also arise in emcCommandWaitDone, which in
shcom calls emcCommandWaitReceived as its first step.)

Here's one sequence of operations which causes this algorithm to go wrong:
        UI 1            UI 2            Task
        send SN 1
                                        receive SN 1
                                        echo SN 1
                        send SN 1001
                                        receive SN 1001
                                        echo SN 1001
        poll status buffer
        until echo SN = 1
        (never finishes)
            
... and this sort of situation is easy to trigger.  In bug 395, it is
easy to trigger because while linuxcncrsh is waiting on "SET MODE
MANUAL", AXIS automatically sends another command as soon as it reads
the stat buffer and sees that the mode has changed to manual.  (UI 1 =
linuxcncrsh, UI 2 = axis.)
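
The sequence in the table above can be replayed in a few lines of C. This is
only a sketch: struct fake_stat, task_receive, wait_received_eq and
replay_race are hypothetical stand-ins, not LinuxCNC or NML code, and the
esleep()/emcTimeout logic is replaced by a bounded poll count so the example
terminates.

```c
#include <assert.h>

/* Hypothetical stand-in for the shared stat buffer: it holds only the
 * serial number of the most recent command that task received. */
struct fake_stat {
    int echo_serial_number;
};

/* Task receiving a command overwrites whatever echo was there before. */
static void task_receive(struct fake_stat *st, int serial_number)
{
    st->echo_serial_number = serial_number;
}

/* Same equality test as the shcom loop, with a bounded number of polls
 * in place of esleep()/emcTimeout.  Returns 0 if the serial number was
 * seen, -1 if the polls ran out. */
static int wait_received_eq(const struct fake_stat *st, int serial_number,
                            int max_polls)
{
    assert(max_polls > 0);
    for (int i = 0; i < max_polls; i++) {
        if (st->echo_serial_number == serial_number) {
            return 0;
        }
    }
    return -1;
}

/* Replays the table: both commands are echoed before UI 1 starts polling. */
static int replay_race(void)
{
    struct fake_stat st = { 0 };

    task_receive(&st, 1);     /* UI 1 sends SN 1, task echoes it */
    task_receive(&st, 1001);  /* UI 2 sends SN 1001, echo is overwritten */

    return wait_received_eq(&st, 1, 100);  /* UI 1 polls for SN 1 */
}
```

Here replay_race() returns -1: by the time UI 1 polls, the echo already reads
1001 and will never read 1 again.  With a finite timeout that is a stall;
with an infinite one it is a hang.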

I am aware that there must be some differences in behavior when using
the different client-name arguments to RCS_CMD_BUFFER / RCS_STAT_CHANNEL
etc., but they don't seem to affect the way echo_serial_number behaves.
To confirm this belief, I ran keystick (uses client name string
"keystick" as you point out) and linuxcnctop (uses client name string
"xemc", I assume).  As I issued commands in linuxcnctop, I saw changing
echo_serial_number values in linuxcnctop.

This is why I said in my original message "the serial number method ...
simply does not work".  In my analysis, this bad behavior with multiple
UIs is in no way a bug in libnml.  It's a bug in the way "wait for
command to be received / completed" was implemented on top of NML.

The combo keystick + axis probably works better than many because
keystick and axis both have finite timeouts, while linuxcncrsh
apparently defaults to an infinite timeout so it readily exhibits very
bad behavior when it triggers this bug.

I'm certainly open to solving this bug properly while retaining NML as the
IPC method of LinuxCNC, because even if *this* project gets done on the
fastest likely schedule (new API in 2.8, new backend in 2.9), *and* we
try to adopt twice-a-year releases, it's still ~18 months to 2.9 and a
fix for this class of bug.

Perhaps it is worth returning to the solution suggested in bug 328, and
ignoring the derail that happened right away (amusingly enough, by
somebody else who wanted to replace NML).  That patch uses an NML queue,
which makes message reception reliable; and implements a globally
increasing serial number.  This makes it possible to wait for
echo_serial_number >= serial_number (instead of ==), so it's OK if
another UI sends a command around the same time you do.
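
A minimal sketch of the ">=" wait, under stated assumptions: struct
fake_task, send_command and wait_received_ge are hypothetical names, a
single shared counter stands in for the globally increasing serial number,
and the queue is modeled by task receiving each command as soon as it is
sent.  How the patch in bug 328 actually allocates serial numbers is not
shown here.

```c
#include <assert.h>

/* Hypothetical model: one counter hands out serial numbers to every UI,
 * and the stat buffer echoes the newest serial number task has received. */
struct fake_task {
    int last_serial;         /* last serial number handed out */
    int echo_serial_number;  /* newest serial number received by task */
};

/* Any UI sending a command gets the next serial number; in this model
 * task receives the command immediately and echoes it. */
static int send_command(struct fake_task *t)
{
    t->last_serial += 1;
    t->echo_serial_number = t->last_serial;
    return t->last_serial;
}

/* Wait until the echo has reached or passed our serial number.  Returns 0
 * on success, -1 if the bounded number of polls ran out. */
static int wait_received_ge(const struct fake_task *t, int serial_number,
                            int max_polls)
{
    assert(max_polls > 0);
    for (int i = 0; i < max_polls; i++) {
        if (t->echo_serial_number >= serial_number) {
            return 0;
        }
    }
    return -1;
}
```

If UI 1 calls send_command() and UI 2 calls it right afterwards, UI 1's wait
still succeeds, because the echo only ever moves forward past its serial
number.  A real implementation would also have to consider what happens when
the counter wraps around, which this sketch ignores.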

(However, this means you can't reliably determine whether your command
succeeded [RCS_DONE] or failed [RCS_ERROR], because you're likely to see
the status of some command issued after your own.  Most UIs don't
actually report this success/failure result anyway; they rely on an
operator message being shown when there's an error.)
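
The caveat can be illustrated with the same kind of toy model.  Everything
here is hypothetical (enum fake_status with FAKE_DONE / FAKE_ERROR stands in
for RCS_DONE / RCS_ERROR, and struct fake_stat for the stat buffer); the
only assumption that matters is that the buffer carries a single status
field belonging to the newest command.

```c
#include <assert.h>

/* Hypothetical stand-ins for RCS_DONE / RCS_ERROR. */
enum fake_status { FAKE_DONE, FAKE_ERROR };

/* One echo and one status field, both belonging to the newest command. */
struct fake_stat {
    int echo_serial_number;
    enum fake_status status;
};

/* Task finishes a command and publishes its serial number and result,
 * replacing whatever the previous command left behind. */
static void task_finish(struct fake_stat *st, int serial_number,
                        enum fake_status result)
{
    assert(serial_number > st->echo_serial_number);
    st->echo_serial_number = serial_number;
    st->status = result;
}

/* ">=" wait: returns 0 and reports the status currently in the buffer
 * once the echo has reached our serial number, -1 otherwise. */
static int check_done_ge(const struct fake_stat *st, int serial_number,
                         enum fake_status *observed)
{
    if (st->echo_serial_number < serial_number) {
        return -1;
    }
    *observed = st->status;
    return 0;
}

/* UI 1's command (SN 1) fails, then UI 2's command (SN 2) succeeds
 * before UI 1 looks at the buffer.  Returns what UI 1 observes. */
static enum fake_status replay_overwrite(void)
{
    struct fake_stat st = { 0, FAKE_DONE };
    enum fake_status seen = FAKE_ERROR;

    task_finish(&st, 1, FAKE_ERROR);
    task_finish(&st, 2, FAKE_DONE);
    (void)check_done_ge(&st, 1, &seen);
    return seen;
}
```

In this model replay_overwrite() returns FAKE_DONE even though UI 1's own
command failed, which is the ambiguity described above.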

Jeff
