Hi, a long time ago I reported some bugs (http://lists.etherlab.org/pipermail/etherlab-users/2011/001271.html and http://lists.etherlab.org/pipermail/etherlab-users/2011/001272.html). Meanwhile I've fixed all the problems I've encountered during my project.
Unfortunately, this mail is quite late. This is because I did the respective work as an external developer for a company that uses the EtherCAT master, and since the project was already running late (not only because of these problems described here ;), I didn't have time at the end to write it all up properly etc. Now, while preparing to get back to this project with some updates, I can finally finish these patches too. (On the positive side, the project has now been running for 2-3 years without finding new EtherCAT related errors.) I attach my complete set of patches, including the patches I've sent in previous mails (02-* to 08-*, slightly adjusted; 01-ethercat-1.5-header.patch was applied in your code already). My patches are against 1.5.0 (which was current at the time I did the project), to be applied in file name order. I have not tried 1.5.1 or 1.5.2 yet. From the ChangeLog it looks like the changes between those versions should only have small overlap with mine, but of course, there may be some conflicts in the changes or their context, since some of my patches are quite substantial, so applying them to a newer version might not be trivial. If anything is unclear, feel free to ask me. The patches in detail (unless described already in the mails linked above): - As I described in my previous mails, the main problems I had were about several simultaneous mailbox users, e.g. using SDOs while EoE is active. Received mailbox datagrams are not properly dispatched to the correct handler (CoE, EoE, ...) which inevitably leads to problems when several of them run at the same time. A proper dispatcher or mailbox state-machine seems quite difficult to fit into the current code, so I solved it (or worked around it, you may say) at the lower level. Here's what I did: - Mailbox datagram structures are tagged with some additional fields in ec_datagram_t, so the low-level routines, in particular ec_master_send_datagrams() and ec_master_receive_datagrams(), can recognize them and handle them specially, see below. These new fields contain the expected mailbox protocol, the kind of datagram (check, fetch or send) as well as a pointer to the responsible slave. - When a fetch reply is received and the actual mailbox protocol (as read from the datagram contents) doesn't match the expected protocol, the datagram is put into an internal buffer, and another datagram from the buffer which matches the expected protocol is put in its place if there is one -- otherwise an error datagram is returned[1]. (Even if the protocol matches, it may need to be swapped with a buffered one, so datagrams are returned in the correct order.) - If a check reply is received, its answer is modified according to whether there is something in the buffer: If the check says "yes", but there's nothing in the buffer for the expected protocol, the answer is changed to "no" to avoid case "[1]" because we cannot know at this point if the data in the slave's mailbox are for the expected protocol. Furthermore, a fetch datagram is sent directly to actually fetch the mailbox, and its reply will be stored in the buffer, so if the client asks the next time (with another check datagram), it will get the data (if it was the correct protocol). So to the client it just looks like the slave took a bit longer to fill its mailbox. If the check says "no", but we have something buffered, the answer is changed to "yes". Since the slave will now send a fetch datagram, ec_master_send_datagrams() has to catch it and mark it as received with data from the buffer without ever sending it out. - The details are a little more complicated than that (e.g. it turned out that a single buffer per protocol isn't always enough, so I made a ring buffer of (currently) 0x10 datagrams per protocol. Also, it is necessary to time-out queued and not yet sent datagrams; and some book-keeping is required for the additional data structures), but that can all be seen in the patches. (09-ethercat-1.5-mailbox-tag.patch and 10-ethercat-1.5-mailbox-allocate-buffer.patch contain the boring parts (preparations, new data structures) that shouldn't change the behaviour. The main change is in 11-ethercat-1.5-mailbox-buffer.patch.) - Another problem was, when a fetch datagram is followed by a check datagram (from another user) in the same frame, the check will still get a "yes" answer even though the mailbox was emptied by the fetch. This might depend on the slave devices -- I didn't find a definitive statement about this case in the standard, but with our devices this is the observed behaviour. To avoid it, I now make sure that a new frame is started for a check datagram after a fetch datagram, even if the frame size would not require it. Except for a few more bytes on the line due to the new frame header, this should be harmless. (12-ethercat-1.5-fetch-check.patch) - Now about sending: When several sources (again e.g. SDO and EoE) try to send to the mailbox of the same slave simultaneously (or shortly after each other), only one of them will succeed. The other datagrams are not processed by the slave which can be seen by a working_counter which is still 0. Normally, I suppose each user should retry the sending until it succeeds. Some users do so (e.g. EoE in ec_eoe_state_tx_sent()), but there are many places that send to the mailbox that don't retry, so instead of fixing them all, I again did it at the lower level and implemented a retry centrally in ec_master_receive_datagrams(). (13-ethercat-1.5-send-retry.patch) - By the time a lost datagram times out, if the interface is busy, the 8-bit datagram index may already have wrapped around and another datagram with the same index been sent, causing confusion in ec_master_receive_datagrams() when the latter one is received if their type and size happens to match (which is not uncommon). Therefore, I added a new check to avoid reusing an index until the datagram is received or timed out. (14-ethercat-1.5-index-reuse.patch) - ec_eoe_run() seems to assume that at this point, the EoE datagram cannot be in state EC_DATAGRAM_QUEUED. This assumption is wrong. Even though my changes above make it more likely to happen, it could happen before. In fact, it could even be in state EC_DATAGRAM_INIT, e.g. when the master lock was denied in the send attempt. This leads to an invalid access to datagram_queue and a crash. But we cannot check for EC_DATAGRAM_INIT here because it is also set at the very beginning, so EoE processing would never start if the function just returned in this state. So I introduced a new state EC_DATAGRAM_PREQUEUED, set it in ec_eoe_queue() and check for it in ec_eoe_run(). The "sth_to_send" check also tests for this state, otherwise a pending datagram whose sending was once denied would never be sent. (15-ethercat-1.5-eoe-prequeue.patch) - Another major problem I had was frame corruption. As master/device.h says: * This memory ring is used to transmit frames. It is necessary to use * different memory regions, because otherwise the network device DMA could * send the same data twice, if it is called twice. Indeed, that's what I saw happening, causing various errors in any of the EtherCAT protocols. I found several questionable assumptions in the code: - EC_BYTE_TRANSMISSION_TIME_NS is set to 80, which is exactly the best case time. If anything takes a little longer than best case, on a busy EtherCAT interface it's only a matter of time until the buffers overrun. I've added a little reserve in ec_master_idle_thread() (just like ec_master_set_send_interval() did already). - The time calculation also didn't consider inter-frame gap and frame preambles. I added just the minimum (20 bytes). - ec_master_idle_thread() only considered the last frame sent to calculate the waiting time. However, ec_master_send_datagrams() may send several frames. Therefore, I now have ec_master_send_datagrams() compute and return the total number of bytes sent (including gaps). - If ec_master_send_datagrams() sends more than EC_TX_RING_SIZE frames, all waiting time is pointless since it will overwrite its own data before there is a chance to sleep. BTW, for debugging this problem, I used another PC with 2 NICs bridged (using brctl). By running Wireshark on the bridge interface, I could see damaged packets coming from the EtherCAT master which actually contained copies of (the initial part of) the frame after next (easy to identify by the datagram index, but also the rest of the data matched) which to me clearly confirms that the buffer was overwritten when 2 more frames were queued before this one was sent out (with EC_TX_RING_SIZE == 2). Therefore, I limit the number of frames it will send at once to EC_TX_RING_SIZE. - Even with all those changes, I still got corrupted frames (though less often than before). So I just increased EC_TX_RING_SIZE to 0x10. This is still heuristic, of course, but at least I haven't seen any frame corruption since then. (16-ethercat-1.5-frame-corruption.patch) - EoE: The TX frame was not properly cleaned up when the send datagram was not received or got no response (working_counter != 1). This caused unregister_netdev() to hang with the following syslog message repeating forever, and with rtnl_mutex held, so the whole networking subsystem remained locked when trying to unload the EC module and the system became mostly unusable till a reboot. unregister_netdevice: waiting for eoe0s1 to become free. Usage count = 4 (17-ethercat-1.5-eoe-tx-cleanup.patch) - As mentioned in previous mails, some other serious problems I had were about locking: - I wonder what is meant to protect access to datagram_queue. The comments are not quite clear, but according to master.h, io_sem is the "Semaphore used in IDLE phase", and looking at the code I figure it is meant to protect datagram_queue in idle phase, whereas application-specific locking should do it during operation phase. However, ec_master_queue_external_datagram() uses io_sem and can be called during operation phase, e.g.: ec_master_operation_thread() ->ec_fsm_slave_exec() ->ec_fsm_slave_state_ready() ->ec_fsm_slave_action_process_sdo() ->ec_master_queue_external_datagram() After I backported code from your repository to add locking in ec_master_clear_slaves() (18-ethercat-1.5-locking-fix-backport.patch), also the following sequence became possible: ec_master_operation_thread() ->ec_fsm_master_exec() ->ec_fsm_master_state_broadcast() ->ec_master_clear_slaves() Also, several places in cdev.c (lines 1840, 1859, 1924, 1943, 1962, 1983) use io_sem, and cdev can be used during operation phase. OTOH, ec_cdev_ioctl_domain_queue() ->ecrt_domain_queue() ->ec_master_queue_datagram() accesses datagram_queue without acquiring io_sem. It uses master_sem instead which seems to be wrong in any case. Using io_sem instead at least puts it on the same level as the other cdev functions mentioned before. (19-ethercat-1.5-io-sem.patch) I see that in newer versions (e.g. commit 53b5128e1313), you apparently reverted the callback mechanism from send/receive callbacks back to lock/unlock callbacks as it was in 1.4. I also prefer the latter since they can be used more generally. Therefore I made the respective changes in my 1.5 copy too, but a little differently. In particular, I use the callbacks also in the cdev routines and just anywhere io_sem was used. (io_sem is now only used in the default callbacks themselves.) (20-ethercat-1.5-callbacks.patch) - In examples/rtai you removed the t_critical check completely in newer versions (the value is still computed, but never used), so a non-RT access that happens at a bad time can now delay the execution of the cyclic task. Is this a good idea? What I did instead is to have the callbacks check t_critical (as before) and sleep (schedule()) when too close. This way they will always succeed (as in your version, no return code needed), but cannot block the cyclic task (provided timings are computed correctly). A fine point is that I now need a flag to tell when the cyclic task was stopped. Otherwise, if cleanup took too long, it could happen that e.g. stopping EoE would hang forever as it tried to get the lock because the "critical" time was already reached and the cyclic task would never run again and update the time. Also, I think t_last_cycle must be volatile. (21-ethercat-1.5-t-critical.patch) - However, I still think (as discussed previously) that using RTAI semaphores in non-RT tasks (i.e. the callbacks) is wrong. According to the RTAI developer, it is necessary that the current task is "RT hardened" in order to be able to use RTAI semaphores. But since I use the callbacks also from the cdev functions now, any Linux process can use them and there is no way to ensure that the caller is RT hardened. OTOH, we can't use normal kernel semaphores in RTAI code. So I now use an atomic flag as a hand-made non-blocking semaphore, and each user can wait for it in its own way (rt_sleep() for the RTAI task, schedule() in the callbacks). (22-ethercat-1.5-rtai-lock.patch) - A minor point: ext_queue_sem is meant to protect ext_datagram_queue. But it's not used in ecrt_master_send_ext() where ext_datagram_queue is accessed. Instead it's used around the external send callbacks, but it misses the call from ec_master_internal_send_cb(). In the end, it currently doesn't matter since both ec_eoe_queue() and send_cb() are only called from ec_master_eoe_thread() and are therefore automatically serialized, so the semaphore is actually pointless ATM. But if it ever gets important, it should be acquired in ecrt_master_send_ext() instead. (23-ethercat-1.5-ext-queue-sem.patch) - Though I don't use FoE, I happened to notice the use of a wrong wait queue there. (24-ethercat-1.5-foe-queue.patch) - The idle thread doesn't call ec_master_output_stats() regularly, so it's only called after a relevant problem, but only outputs information once a second, so the remaining statistics are swallowed. (25-ethercat-1.5-output-stats.patch) - When a slave's mailbox contains some old data when the master is restarted (this happens almost reproducibly when restarting the master while it's reading the SDO dictionary), the first mailbox response is misinterpreted which (in my case) typically results in an error like this: EtherCAT ERROR 0-0: Received unknown response while uploading SDO 0x1C12:00. EtherCAT ERROR 0-0: Failed to read number of assigned PDOs for SM2. Followed by: EtherCAT ERROR 0-0: Received upload response for wrong SDO (0x1C12:00, requested: 0x1C13:00). To avoid this, I fetch the mailbox once before using it for the first time, ignoring any result, whether empty or not. (26-ethercat-1.5-clear-mailbox.patch) - Even after the changes above, several simultaneous CoE (in particular SDO) requests can still get mixed up, since they have the same protocol number, so my mailbox "dispatcher" doesn't help. I must say that I don't really understand the separation between master and slave state machines, both of which have their own CoE state machines, and the corresponding separation of "internal" and "external" datagrams. For what I can tell, both are treated completely differently most of the way, but in the end they do exactly the same. (Of course, what else? All actual EtherCAT communication is between a master and a slave, there is no distinct "master CoE" and "slave CoE" protocol.) Problems occur when both master and slave state machine try to do CoE operations at the same time, because the mailbox responses get mixed up. Since I didn't want to make larger changes to the code structure now, such as merging those state machines, I changed the code so whichever state machine starts a CoE operation has exclusive access to CoE for this slave until the operation is finished or timed out. (Basically like a semaphore, except the state machines run in the same thread, so an actual semaphore would deadlock.) The next problem then is that some code (e.g. ec_fsm_master_exec()) just assumes that the FSM has a datagram to send out in every state, so it always returns 1 unless it's waiting for a reply. With my previous change, this isn't the case anymore, and it cannot be -- unless I'd block the FSM completely while another CoE operation is in progress. (I thought about it, but it might degrade performance, if e.g. a longer-running SDO transfer in the slave FSM could block unrelated operations in the master FSM.) For this reason I introduced a new state EC_DATAGRAM_INVALID, set it when the FSM is blocked and make ec_fsm_master_exec() return 0 if so (to take care of the master FSM), and ec_master_queue_external_datagrams() ignore it in this case (to take care of the slave FSM). (No, I don't particularly like this solution, but I don't see another way without large-scale changes.) (27-ethercat-1.5-coe-lock.patch) - There is no way AFAICS to find out when reading the slaves' SDO dictionaries is finished. This not only affects reliable operation when one wants to use the dictionary, but also performance, since reading is a CoE operation that runs in the master state machine and so, even with my previous patch, blocks other CoE (in particular, SDO) operations, since ec_fsm_master_action_process_sdo() is never reached while the dictionary is being read, so it's not reasonable to start an application task which uses SDOs in this situation. Therefore I added the sdo_dictionary_fetched flag to the state returned by ecrt_slave_config_state(), and don't set the flag until the reading is finished -- rather than until started, as before, which for the other purpose of this flag makes no difference. It's still up to the application to request this state and react to it. I also added this flag to ec_ioctl_slave_t and let "ethercat sdo" report if it's not set. This avoids outputting an incomplete list of SDOs, with (usually) a bogus error message EtherCAT: ERROR 0-0: SDO entry 0xXXXX:YY does not exist! for the entry currently being fetched, and it provides a reliable way for start scripts to wait until fetching is completed. (It's better for me to do it in a start script than in the application module since the master enters operation phase as soon as it is requested. Therefore I wait in my script using the output of "ethercat sdo", and use the new flag in ecrt_slave_config_state() only to let my application module verify that fetching was completed.) (28-ethercat-1.5-dictionary-fetched.patch) - The init script needs an "exit" in the "restart" case to avoid hitting the error exit at the end. (29-ethercat-1.5-init-restart.patch) - I see you mostly took my code for SDO up-/downloads I suggested previously (http://lists.etherlab.org/pipermail/etherlab-users/2011/001271.html), though you take a slave_position instead of an ec_slave_config_t parameter. Apparently you did this so you could have the same interface in the kernel and user space (though I've always wondered why you don't use alias and position in user-space too). This raises the question how to get the absolute position, as Graeme Foot asked in http://lists.etherlab.org/pipermail/etherlab-users/2011/001412.html (and got no answer). Well, I'm now using the same work-around Graeme described in this mail. Since I only need the ecrt_master_get_slave() calls during initialization, the overhead doesn't matter much to me, though I don't think it's a really nice interface, when used together with the other kernel functions. However, I still had to make a few changes: - In case of EINTR and also at the end of ecrt_master_sdo_download(), I think you forget to clear the request (and thus free the allocated memory). - The "data" parameter to ecrt_master_sdo_download() should be const. (While doing related changes, I noticed a duplicate block of code in CommandDownload.cpp; AFAICT the latter copy is spurious and my patch removes it.) (updated 07-ethercat-1.5-sdo-up-download.patch) Back then I asked which kinds of operations can be done concurrently on an EtherCAT master. Now I can finally answer my own question. With my patches, I can do all of the following at the same time: - Operations done by the master (bus scan, dictionary fetching, etc.) - Access through the cdev (e.g. "ethercat upload") - EoE access - SDO transfer by a non-realtime kernel module - SDO transfer by the cyclic task - PDO transfer in the cyclic task Regards, Frank -- Dipl.-Math. Frank Heckenbach <f.heckenb...@fh-soft.de> Stubenlohstr. 6, 91052 Erlangen, Germany, +49-9131-21359 Systems Programming, Software Development, IT Consulting
ethercat-1.5.0-patches.tar.bz2
Description: Binary data
_______________________________________________ etherlab-dev mailing list etherlab-dev@etherlab.org http://lists.etherlab.org/mailman/listinfo/etherlab-dev