> I did make some other intentional changes from the patch: > - I changed the "protocol" parameter of ec_slave_datagram_to_buffer and > ec_slave_buffer_to_datagram to uint8_t from uint16_t (since that seems more > consistent).
OK. (I suppose you mean ec_slave_datagram_from_buffer instead of ec_slave_buffer_to_datagram.) > - I left out the cycles_poll > cycles_sent and jiffies_poll > jiffies_sent > checks in the timeout checking, since as I noted before these would not be > safe against wraparound. > > If I put the jiffies_poll > jiffies_sent check back in, I do not get these > 17171869us timeouts; but I'm unconvinced it's safe to leave this check in. > > Incidentally, reversing the calculation: > time_us = (unsigned int)((jiffies_poll - jiffies_sent) * 1000000 / HZ); > where time_us = 17171869 and HZ = 250, suggests that (poll-sent) is -2 > jiffies (17171869 * 250 / 1000000 ~= -2), but then plugging that forwards > through the formula I'm not sure why time_us != 4294959296 (aka -8000) > instead. (I did notice that (uint32_t)-2 * (uint64_t)(1000000 / 250) = > 17179869176000, which at least has the right digits but they're at the wrong > magnitude.) I suspect I'm missing something fundamental (and obvious). - Actually it's not the same digits, there's a slight (but important) difference. - Switching from wrap-around to real arithmetic makes things clearer (calculating with wrap-around is very confusing; writing them out allows for normal mathematical reasoning): Your first observation says: (17171869 * 250 - 2^32) / 1000000 ~= -2 i.e. (2^32 - 2000000) / 250 ~= 17171869 Then you say: (2^32 - 2) * (1000000 / 250) = 17179869176000 i.e. (2^32 * 1000000 - 2000000) / 250 = 17179869176000 So you see where the factor 1000000 comes from (2^32, i.e. the wrap-around, being the dominant term here). The different digit is actually the result of the factor not being applied to the second summand. OTOH, (2^32 - 2 * 1000000) / 250 ~= 17171869 i.e. in C: (uint32_t)-2 * (uint32_t)1000000 / (uint32_t)250 == 17171869 - In your previous mail you said the timeout was 790269982us most of the time?! You must have done something different there, because with the above assumptions (using jiffies, HZ=250, 32 bit) the maximum possible value of that calculation is (2^32 - 1) / 250 = 17179869. - Side note: The (original) code might be improved by casting to uint64_t before the multiplication by 1000000. Without it, with 32 bit long, the range of poll-sent is limited to ~4000 jiffies (which is still quite a bit, but if exceeded, makes things confusing as we see). With the cast, the range is at least 4000 seconds (though it still doesn't help much in case of overflow). - Instead of second-guessing I think it would be much easier to just debug-print the interesting variables to see what values they actually have, including get_cycles()/jiffies. ec_device_poll is called shortly before the timeout message which sets jiffies_poll, so I wonder how jiffies_sent can be newer (larger), so the difference would be negative. So I suppose jiffies_poll < jiffies_sent <= jiffies which means jiffies_sent was set between the call to ec_device_poll and the timeout message. But I'm not sure how this can happen: Either due to a race condition e.g. with ecrt_master_send called by ec_cdev_ioctl_send or ec_master_send_ext etc. (which may have been the case when I made this particular change; I don't remember if I fixed the locking bugs before or after it; but you have applied my locking patches and done proper locking in your application, haven't you?) or indirectly through device->poll() (that's some complicated code paths, can't see it right now). To check the latter, you could debug-print jiffies again after device->poll() in ec_device_poll. I vaguely remember it might be something like this, so I was probably wrong when I said the other day the additional checks were about wrap-around. But indeed, as you pointed out, they're not wrap-around-safe. To fix this: If it's actually device->poll(), it might suffice to update jiffies_poll/cycles_poll after, rather than before, it. If it's a race condition, this needs to be fixed obviously. If it's something else (which I don't see ATM), you could make it wrap-around-safe by changing jiffies_poll > jiffies_sent to jiffies_poll - jiffies_sent > timeout_jiffies and likewise with cycles. (But I wouldn't recommend this without finding the root cause of the problem.) > Spoken slightly too soon. Looks like the strange-timeout datagrams still > occur, but now only if "deferring concurrent CoE access" occurred just > prior. > > Which is odd, as those datagrams are marked as EC_DATAGRAM_INVALID, and > those don't get queued, and even if one somehow got onto the queue it should > be ignored by the timeout loop as that explicitly checks the state. Again, it would help if you just debug-print the state to check if that's actually what happens. > On a hunch, I noticed that one of the other changes in patch 11 was to make > the timeout loop consider QUEUED datagrams in addition to SENT ones; I > removed that change and it appeared to stop the spurious timeouts. What > prompted this change, and is it actually needed? One reason would be patch #16 which means datagrams can be queued and not sent for a while (instead of corrputing other frame data). If so, the change would be in the wrong patch, of course. I'm not 100% sure at the moment if something in #11 doesn't also require it, but I think #16 is reason enough, so I don't have to check #11 in detail, do I? (Needless to say I didn't develop the patches precisely in order; I tried to separate them when I had a working version, might have mis-assigned a few things; like with the problem of variables used before being introduced as you reported.) Regards, Frank -- Dipl.-Math. Frank Heckenbach <f.heckenb...@fh-soft.de> Stubenlohstr. 6, 91052 Erlangen, Germany, +49-9131-21359 Systems Programming, Software Development, IT Consulting _______________________________________________ etherlab-dev mailing list etherlab-dev@etherlab.org http://lists.etherlab.org/mailman/listinfo/etherlab-dev