Ira W. Snyder wrote:
> On Wed, Feb 24, 2010 at 12:52:06PM -0800, Ira W. Snyder wrote:
>> On Wed, Feb 24, 2010 at 08:41:06PM +0100, Marc Kleine-Budde wrote:
>>> Ira W. Snyder wrote:
>>>> On Wed, Feb 24, 2010 at 08:08:15PM +0100, Marc Kleine-Budde wrote:
>>>>> Ira W. Snyder wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>>>> But we want so solve the TX problem. You only have to stop the
>>>>>>>>> netif_queue if you TX ring is full. Then you have to deal with calling
>>>>>>>>> netif_wake_queue() eventually. But we obviously don't want to poll.
>>>>>>>>>
>>>>>>>>> The solution might be to look at the TX queue in the interrupt handler
>>>>>>>>> and/or the NAPI loop. And now we get back to my first question. If the
>>>>>>>>> controller does a loopback in hardware, i.e. you receive a frame for
>>>>>>>>> each frame send, you don't need to poll the TX queue length.
>>>>>>>>>
>>>>>>>>> If you receive a CAN frame it might be on of your's, which means the
>>>>>>>>> TX
>>>>>>>>> queue might have at least room for one frame. In pseudo code it might
>>>>>>>>> look like this:
>>>>>>>>>
>>>>>>>>> xmit()
>>>>>>>>> {
>>>>>>>>> send_frame_to_hardware();
>>>>>>>>> if (tx_ring_full) {
>>>>>>>>> netif_stop_queue();
>>>>>>>>> priv->flags |= TX_RING_FULL;
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> irq() and netif_poll()
>>>>>>>>> {
>>>>>>>>> if (priv->flags & TX_RING_FULL) {
>>>>>>>>> if (room_in_ring()) {
>>>>>>>>> priv->flags &= ~TX_RING_FULL;
>>>>>>>>> netif_wake_queue();
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Should be implemented race-free, i.e. you might have to use atomic_*
>>>>>>>>> operations and/or don't use a shared flag variable.
>>>>>>>> Cool, that should work. netif_queue_stopped() could be used instead of
>>>>>>>> the TX_RING_FULL flag.
>>>>>>> right...good point
>>>>>>>
>>>>>>> I should have looked at my own driver :)
>>>>>>>
>>>>>>> If room_in_ring() is cheap, it boils down to:
>>>>>>>
>>>>>>>>> irq() and netif_poll()
>>>>>>>>> {
>>>>>>>>> if (room_in_ring()) {
>>>>>>>>> netif_wake_queue();
>>>>>>>>> }
>>>>>>> Otherwise:
>>>>>>>
>>>>>>>>> irq() and netif_poll()
>>>>>>>>> {
>>>>>>>>> if (netif_queue_stopped() && room_in_ring()) {
>>>>>>>>> netif_wake_queue();
>>>>>>>>> }
>>>>>> Hmm, this *almost* works. Running "cangen -g 0 can0" quickly gets the
>>>>>> queue into a hung state, though.
>>>>> what is a "hung state"?
>>>>>
>>>>>> I added printk() to the sites where I stop/wake the queue. It is always
>>>>>> woken up correctly, but cangen always exits with "write: No buffer space
>>>>>> available" after a few stop/wake cycles. Immediately running cangen
>>>>>> again works for a few more stop/wake cycles.
>>>>> let's look the the cangen source if there's propper error handling...
>>>>> http://svn.berlios.de/viewvc/socketcan/trunk/can-utils/cangen.c?revision=787&view=markup
>>>>> nope, there isn't
>>>>>
>>>>> There's the option "-i" to ignore ENOBUFS, which is....we don't want to
>>>>> do that....
>>>>>
>>>>> Better improbe the source: (from the pengutronix' canutils...)
>>>>>
>>>>> again:
>>>>> len = write(s, &frame, sizeof(frame));
>>>>> if (len == -1) {
>>>>> switch (errno) {
>>>>> case ENOBUFS: {
>>>>> int err;
>>>>> struct pollfd fds[] = {
>>>>> {
>>>>> .fd = s,
>>>>> .events = POLLOUT,
>>>>> },
>>>>> };
>>>>>
>>>>> err = poll(fds, 1, 1000);
>>>>> if (err == -1 && errno != -EINTR) {
>>>>> perror("poll()");
>>>>> exit(EXIT_FAILURE);
>>>>> } else if (err == 0) {
>>>>> printf("a timeout occured\n");
>>>>> exit(EXIT_FAILURE);
>>>>> }
>>>>> }
>>>>> case EINTR: /* fallthrough */
>>>>> goto again;
>>>>> default:
>>>>> perror("write");
>>>>> exit(EXIT_FAILURE);
>>>>> }
>>>>> }
>>>>>
>>>>> If you hit a timeout, which means you cannot send packages for 1000 ms,
>>>>> your send queue probably didn't wake up...
>>>>>
>>>> Is there any chance you can send me the pengutronix version of
>>>> cangen/canutils? I can't find it on their website.
>>>
>>> http://www.pengutronix.de/software/socket-can/download/canutils/v4.0/
>>> http://www.pengutronix.de/software/libsocketcan/download/
>>>
>>> you need the libsocketcan, too
>>>
>>>>>> Here is the output of ifconfig:
>>>>>> can0 Link encap:UNSPEC HWaddr
>>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>>>>>> UP RUNNING NOARP MTU:16 Metric:1
>>>>>> RX packets:1902 errors:0 dropped:0 overruns:0 frame:0
>>>>>> TX packets:1902 errors:0 dropped:0 overruns:0 carrier:0
>>>>>> collisions:0 txqueuelen:10
>>>>>> RX bytes:10767 (10.5 KiB) TX bytes:10767 (10.5 KiB)
>>>>>> Interrupt:22
>>>>>>
>>>>>> can1 Link encap:UNSPEC HWaddr
>>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>>>>>> UP RUNNING NOARP MTU:16 Metric:1
>>>>>> RX packets:1902 errors:0 dropped:0 overruns:0 frame:0
>>>>>> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>>>> collisions:0 txqueuelen:10
>>>>>> RX bytes:10767 (10.5 KiB) TX bytes:0 (0.0 b)
>>>>>> Interrupt:22
>>>>>>
>>>>>> I'm stopping the queue correctly: the xmit() routine never complains
>>>>>> that it
>>>>>> tried to transmit but had no buffer space. You'll see that we didn't
>>>>>> drop any
>>>>>> packets in this run. In another run, we did drop packets. It is like the
>>>>>> firmware didn't echo the packets back to me. Also note that the TX of
>>>>>> can0 always matches the RX of can1. They're cabled together, so this is
>>>>>> expected.
>>>>> Hmm...it looks like this.
>>>>>
>>>>> Double check your RX path, so that you don't loose any frames in
>>>>> software. Is it possible that the RX and TX collide in your driver
>>>>> and/or in hardware?
>>>>>
>>>> I'm pretty sure that I do not lose RX frames in my driver. When I do,
>>>> stats->rx_dropped is always incremented. I do get the following warning
>>> I mean loosing RX frames due to an programming error...
>>>
>
> I've done some more investigation.
>
> 1) using a tasklet instead of the shared workqueue seems to help. We get
> many less rx packets lostWhy don't you use NAPI for rx? > 2) the hangs seem to be due to the driver losing lots of interrupts. It > spends all of it's time in the xmit() routine, and never gets a chance > to run the recv() routine. You wake the tx_queue quite early in the work_queue. Try waking TX after you have RX'ed all messages, not earlier. > This is a problem, even using tasklet_hi_schedule(). The cansequence > program is blasting the transmit queue so hard that we never get a > chance to service the recv() queue. Therefore we start losing packets, > and generating "rx overflow" error frames. Hmmm....if the tx_queue of your hardware is full, the xmit functions shouldn't be called, because you halted it. > Should I just go ahead and service the recv() routine right from > interrupt context? I know that it is usually frowned upon to do that, > but I've seen it in other network drivers before... cheers, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions | Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917-5555 | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
signature.asc
Description: OpenPGP digital signature
_______________________________________________ Socketcan-core mailing list [email protected] https://lists.berlios.de/mailman/listinfo/socketcan-core
