Gavin Lambert wrote:

> (Also there's a few cases where variables introduced in one patch
> aren't used until a later patch, and the reverse, so some of the
> intermediate states won't compile.)
The former shouldn't stop compilation (except for some warnings); the
latter (used before introduction) would be a mistake on my side. If
they cause problems, let me know which ones, and I'll try to rearrange
my patches.

> > Indeed, it doesn't, but the development code as obtained by the "hg
> > clone" command as listed on http://etherlab.org/de/ethercat/index.php
> > does. Though it uses master->io_mutex instead of master->io_sem now.
> > From what I can tell (seeing that 1.5.0 doesn't have io_mutex at
> > all), an intermediate version introduced the code with io_sem (which
> > I then took) and was later changed to io_mutex along with other
> > changes. (This all might not be very relevant except to explain
> > where I got the code from; otherwise you can treat it like it was
> > one of my patches, I guess.)

> I'm not really sure what's going on with the default branch, but as
> best I can tell it's outdated and should be ignored. All the new
> changes are on the stable-1.5 branch. (There hasn't been anything
> committed to "default" since 2011.)

Uhm, did I get that right? The code obtained by the command described
on the website as:

: The following command can be used to clone the repository in order
: to get the latest revisions:
: hg clone http://hg.code.sf.net/p/etherlabmaster/code ethercat-hg

is not the "latest revisions"; in fact, it's even older code than the
last few releases? Well, if that's so, that's quite the bummer! (And it
won't help with integration of my patches, since I've sometimes looked
at this code for the general direction to take, in the two cases
discussed here and a few others. Now if that's exactly the wrong
direction, well ...)

Alright, I've long considered whether I should post the following, but
I think you (and other interested parties) ought to know about it, not
as a kind of dirty laundry, but just so you know what you might be in
for when you deal with my patches:

As I mentioned previously, I did report the first few of the problems
and sent patches back in 2011
(http://lists.etherlab.org/pipermail/etherlab-users/2011/001271.html).
Unfortunately, all of my patches, even back then, were completely
ignored[1], including one for a serious bug in the e1000 driver which
rendered it basically unusable[2]. There was a single reply
(http://lists.etherlab.org/pipermail/etherlab-users/2011/001276.html)
in which Florian basically told me that he was very busy and that he
would "check and include" my patches. (3 years later, I think it's fair
to say that only one of those statements turned out true.) My other bug
reports, for some of the problems addressed in my later patches
(http://lists.etherlab.org/pipermail/etherlab-users/2011/001272.html),
went completely unanswered. We even tried to contact IgH by phone, but
were basically told that their developers were too busy with other
stuff, and that they weren't very interested in our application anyway,
so this went nowhere either. Meanwhile we had a project to complete, so
without any support (and no interest from others (like you) on the
lists back then), I just did what I needed to do on my own to finish
the project (late, but not too late ;) and then left it at that. I
think several of the issues I fixed are serious bugs in the code, and
when I occasionally peeked at the lists and saw people like Jun Yuan
who apparently had similar problems, I decided to send my patches now,
hopefully for your benefit.
However, seeing the current direction the code is going (which I now
know is 1.5.2, not the hg version), it doesn't seem very interesting to
me, so I guess my platform will remain 1.5.0 plus my patches.[3] As I
mentioned, I will probably do some work on our EtherCAT application
this year, but this will (probably) be comparably easy stuff with
(hopefully) no new bugs found and (quite certainly) no need for a new
EtherCAT version. So you might not hear from me about that at all on
the lists, and in fact, when that's done, I'll probably unsubscribe
from the lists.

So this leaves my patches somewhere in limbo. I won't maintain or port
them to new versions[4], and if IgH still doesn't care about them now,
you might be stuck with them as an unofficial fork too (like I am).
Just thought you should know that so you can decide what to do.

[1] Except for a trivial change relating to newer kernels
(ethercat-1.5-header.patch), which they may have made without my patch
because it was obvious.

[2] We were basically told: e1000 is broken, don't use it, use r8169.
Well, my experience was different. We tried r8169, but also had some
problems with it (I didn't debug them further), whereas e1000, after
fixing this bug (patch #02), worked and still works very reliably for
us.

[3] I might just need drivers for newer kernels, but it now seems
easier to "backport" them, or perhaps just copy them from 1.5.2 etc.
into my 1.5.0 tree.

[4] More precisely, my current client will not pay me to do so (for
good reason: there would be no direct benefit to him, and actually some
risk; you know, never change a running system etc.).

> - there's a very large number of "overriding mailbox check = 0"
> "buffering mailbox response" "overriding mailbox check = 1" "fetching
> mailbox response" sequences. Does this just happen for every mailbox
> exchange or is it significant (eg. showing an out-of-order response)?

Yes, that's normal. It shows that my patch is working.

> - I'm seeing a higher number of errors logged while fetching the SDO
> dictionary than I recall happening beforehand ("invalid entry
> description response" mostly). Although if I run "ethercat sdos" then
> I do see the correct information (presumably it's retrying?).

I don't think this should happen. Can you check what the incorrect data
are (printed in the debug output; refer to the condition before that
error message to see what's wrong)? The retrying from my patches should
happen at a lower level, before it gets there. However, it does
high-level retries here too (which is probably why you get the correct
information in the end). I'm not sure why it does so; maybe there are
other reasons for invalid responses here which were handled before and
are independent of my changes, though I don't know what those reasons
would be, and I don't see why my changes should make them more likely.

> - there's some very suspicious timeout warnings "timed out datagram
> xxx, index 00 waited 790269982 us." (the time does seem to be the same
> most of the time)

Looks like an unset datagram got there. This would explain index 00,
and (if it had cycles_sent or jiffies_sent == 0) might explain the
strange timing. Either an unset datagram is queued in
ec_master_queue_datagram, or a datagram already queued is overwritten
somewhere (which might be a more serious problem). If you want to debug
it, you can start there.
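If it helps, something like the following (an untested debug sketch,
not part of my patches; the field names and EC_DATAGRAM_NONE are from
the 1.5.0 sources) could be called at the top of
ec_master_queue_datagram to find out who queues such a datagram:

#include <linux/kernel.h>   /* printk, dump_stack */
#include "datagram.h"       /* ec_datagram_t */

/* Untested debug sketch: warn and dump a stack trace whenever an
 * apparently unset datagram is about to be queued, to identify the
 * caller. */
static void ec_check_datagram_before_queuing(const ec_datagram_t *dg)
{
    /* An unset datagram typically still has type NONE, index 0, and
     * zero timestamps; the latter would also explain the bogus
     * "waited" times in the warning. */
    if (dg->type == EC_DATAGRAM_NONE && dg->index == 0x00
            && !dg->jiffies_sent) {
        printk(KERN_WARNING "EtherCAT DEBUG: queuing apparently unset"
               " datagram %p.\n", dg);
        dump_stack(); /* shows who called ec_master_queue_datagram() */
    }
}

A properly prepared datagram should have had its type set (by
ec_datagram_aprd() etc.) before being queued, so a hit here most likely
points at the offender.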
> > I admit I'm not very proficient with hg, so I probably mixed up the
> > commits. I'd have to read up on it and dig deeper, maybe you're
> > faster at it. In any case, the cloned code (see above) does contain
> > lock_cb and unlock_cb in place of send_cb and receive_cb as in 1.5.0
> > and 1.5.2. So I figure Florian made this change, but hasn't pushed
> > it into a release yet.

> I haven't traced the history, but I suspect that (as this was on the
> default branch), this was the code before send_cb and receive_cb were
> introduced in the first place. So changing it back would presumably
> be a regression.

Well, from my point of view, of course, the 1.5.0/1.5.2 (and, as it now
seems, current) code is a regression from 1.4.0, for two reasons:

- Code structure: The "configurable" thing about the callbacks is not
the send/receive part (this part is exactly prescribed, see the comment
in ecrt.h), but the "rest", which means locking or similar. So it
"feels strange" that some calls that are logically integral to the
caller are delegated to callbacks (via these comments), instead of
letting the callbacks deal with what they should really be concerned
about.

- Scalability: The version in 1.4.0 (and my patches) needs two
callbacks, lock and unlock. The 1.5.x version needs N callbacks, one
for each operation. They thought N == 2 (send and receive), so it
wouldn't matter, but as I explained in my initial mail, many more
operations need to be locked. I think it's obvious that having to
provide a callback for each of them is not viable. (See the sketch
below for a side-by-side comparison of the two styles.)

Of course, a completely different locking regime may be conceivable
which doesn't require that, but I haven't seen one yet. What I've seen
in 1.5.x is simply broken (especially with RTAI, but not only, as I
explained), so it's no alternative to me. (As usual, locking bugs will
manifest themselves rarely and unreproducibly; that's probably why the
problem wasn't noticed more, but if we take that as a reason not to do
locking properly, we can just remove all locks completely, right?)

I still don't know the motivation for this change; perhaps it's to let
the callbacks skip execution if they want to (i.e., if no lock can be
obtained). But that's still no reason for this unscalable approach. An
alternative would be a kind of try-lock callback in the style of
pthread_mutex_trylock (refer to its manpage for semantics in case of
doubt). And I'm not even sure of the usefulness of that -- in my
experience, from low-level to high-level code, I almost never use
try-locking, because if I did, the upper level would usually just have
to wait and retry. So it's easier to combine this with locking, i.e.
pthread_mutex_lock, which automatically waits, or, as here in my patch
#22, in request_lock_callback, do the waiting close to the locking. (In
ec_master, maybe the FSMs could keep running when they can't
send/receive datagrams, but what would they do in that time? Just cycle
around waiting for the datagrams to be sent and received. So it seems
easier to wait directly.)

That said, I'm not opposed to a try-lock approach, if this is favoured
for some reason. But I am opposed to a buggy approach. ;)
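To make the comparison concrete, here is a rough, untested sketch of
the two styles. The 1.5.x names (ecrt_master_send_ext,
ecrt_master_receive, the send_cb/receive_cb pair) are from ecrt.h; the
request/release pair follows the 1.4.0 style from memory, and
master_sem is just an example name:

#include <linux/semaphore.h>
#include "ecrt.h"

static struct semaphore master_sem;  /* sema_init(&master_sem, 1)
                                        during startup */

/*
 * 1.4.0 style (and my patches): the callbacks do nothing but the
 * locking; the master itself performs whatever operation needed the
 * lock. Two callbacks cover any number of operations.
 */
static int request_cb(void *cb_data)
{
    down(&master_sem);  /* wait until the master may be accessed */
    return 0;           /* 0 = lock obtained */
}

static void release_cb(void *cb_data)
{
    up(&master_sem);
}

/*
 * 1.5.x style: one callback per operation; each one repeats the
 * lock/unlock boilerplate around the actual call, and every further
 * operation that needs locking would require yet another callback.
 */
static void send_cb(void *cb_data)
{
    ec_master_t *master = cb_data;

    down(&master_sem);
    ecrt_master_send_ext(master);
    up(&master_sem);
}

static void receive_cb(void *cb_data)
{
    ec_master_t *master = cb_data;

    down(&master_sem);
    ecrt_master_receive(master);
    up(&master_sem);
}

In 1.5.x, the second pair would be registered with
ecrt_master_callbacks(master, send_cb, receive_cb, master); under RTAI
you'd use rt_sem_wait()/rt_sem_signal() instead of down()/up(), but the
structural point (two callbacks versus one per operation) is the same.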
> > I don't know about your application, but in my case, I only want to
> > run with the correct (configured) number of slaves anyway.
> > Therefore, in my cyclic code I always call ecrt_master_state and
> > check that in the result link_up is set and slaves_responding is
> > equal to the number I expect; otherwise I abort immediately because
> > something's fishy.

> Mine is a bit more flexible; I'm trying to get it to support
> hotplugging as cleanly as possible if I can, along with a bit of
> autodetection.

OK, to me the world of realtime industrial communication is about the
polar opposite of hotplugging systems. ;) But I guess it depends on the
application. Anyway, communication will not survive a rescan, i.e.
pending SDO transfers will be aborted, PDOs might be remapped, and as
for EoE, not only the connections but the virtual EoE devices
themselves will be reset (and possibly come back with a different
name), etc. That's the case with and without my patches.

> > ec_fsm_coe_exec could return 0 in 1.5.0 already. The return value is
> > only used by those callers that go on to do other things after
> > calling it (to return early if 0); those that call it at their end
> > don't (need to) check the result, and I think that's ok.
> >
> > One of my changes was that ec_fsm_master_exec doesn't always return
> > 1 after executing a state (it still does in 1.5.2). It's a
> > consequence of my previous changes: If CoE is reserved by the slave
> > FSM, the master FSM must wait. In order to do that, I set
> > datagram->state to EC_DATAGRAM_INVALID and prevent it from being
> > queued. Returning 0 from ec_fsm_master_exec is an easy way to
> > achieve this (for both callers), since the 0 return (the first
> > branch in ec_fsm_master_exec) was already there in case the FSM is
> > still waiting for a reply and doesn't execute a state at all (and
> > therefore obviously doesn't send a new datagram either).

> But in 1.5.2 ec_fsm_coe_exec has "datagram_used" and could return 0
> when the datagram was in INIT, QUEUED, or SENT, which I assumed would
> occur in some of the middle states while it was waiting for something
> to happen, which didn't sound that different from what you were trying
> to do.

I don't know. The point of my patch #27 was to sequentialize CoE access
by the "master" and "slave" FSMs. In 1.5.0 this wasn't done at all, and
datagrams wouldn't be in one of those states for this reason; they'd
just be sent out and get conflicting answers. Maybe this problem is
fixed in 1.5.2, and the change in ec_fsm_coe_exec is a part of that
(it's certainly not the main thing; that must be due to other changes,
if so).

Regards,
Frank

--
Dipl.-Math. Frank Heckenbach <f.heckenb...@fh-soft.de>
Stubenlohstr. 6, 91052 Erlangen, Germany, +49-9131-21359
Systems Programming, Software Development, IT Consulting