On Wed, Feb 24, 2010 at 11:18:22PM +0100, Marc Kleine-Budde wrote:
> Ira W. Snyder wrote:
> > On Wed, Feb 24, 2010 at 12:52:06PM -0800, Ira W. Snyder wrote:
> >> On Wed, Feb 24, 2010 at 08:41:06PM +0100, Marc Kleine-Budde wrote:
> >>> Ira W. Snyder wrote:
> >>>> On Wed, Feb 24, 2010 at 08:08:15PM +0100, Marc Kleine-Budde wrote:
> >>>>> Ira W. Snyder wrote:
> >>>>>
> >>>>> [...]
> >>>>>
> >>>>>>>>> But we want so solve the TX problem. You only have to stop the
> >>>>>>>>> netif_queue if you TX ring is full. Then you have to deal with 
> >>>>>>>>> calling
> >>>>>>>>> netif_wake_queue() eventually. But we obviously don't want to poll.
> >>>>>>>>>
> >>>>>>>>> The solution might be to look at the TX queue in the interrupt 
> >>>>>>>>> handler
> >>>>>>>>> and/or the NAPI loop. And now we get back to my first question. If 
> >>>>>>>>> the
> >>>>>>>>> controller does a loopback in hardware, i.e. you receive a frame for
> >>>>>>>>> each frame send, you don't need to poll the TX queue length.
> >>>>>>>>>
> >>>>>>>>> If you receive a CAN frame it might be on of your's, which means 
> >>>>>>>>> the TX
> >>>>>>>>> queue might have at least room for one frame. In pseudo code it 
> >>>>>>>>> might
> >>>>>>>>> look like this:
> >>>>>>>>>
> >>>>>>>>> xmit()
> >>>>>>>>> {
> >>>>>>>>>     send_frame_to_hardware();
> >>>>>>>>>     if (tx_ring_full) {
> >>>>>>>>>             netif_stop_queue();
> >>>>>>>>>             priv->flags |= TX_RING_FULL;
> >>>>>>>>>     }
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> irq() and netif_poll()
> >>>>>>>>> {
> >>>>>>>>>     if (priv->flags & TX_RING_FULL) {
> >>>>>>>>>             if (room_in_ring()) {
> >>>>>>>>>                     priv->flags &= ~TX_RING_FULL;
> >>>>>>>>>                     netif_wake_queue();
> >>>>>>>>>             }
> >>>>>>>>>     }
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> Should be implemented race-free, i.e. you might have to use atomic_*
> >>>>>>>>> operations and/or don't use a shared flag variable.
> >>>>>>>> Cool, that should work. netif_queue_stopped() could be used instead 
> >>>>>>>> of
> >>>>>>>> the TX_RING_FULL flag.
> >>>>>>> right...good point
> >>>>>>>
> >>>>>>> I should have looked at my own driver :)
> >>>>>>>
> >>>>>>> If room_in_ring() is cheap, it boils down to:
> >>>>>>>
> >>>>>>>>> irq() and netif_poll()
> >>>>>>>>> {
> >>>>>>>>>     if (room_in_ring()) {
> >>>>>>>>>             netif_wake_queue();
> >>>>>>>>> }
> >>>>>>> Otherwise:
> >>>>>>>
> >>>>>>>>> irq() and netif_poll()
> >>>>>>>>> {
> >>>>>>>>>     if (netif_queue_stopped() && room_in_ring()) {
> >>>>>>>>>             netif_wake_queue();
> >>>>>>>>> }
> >>>>>> Hmm, this *almost* works. Running "cangen -g 0 can0" quickly gets the
> >>>>>> queue into a hung state, though.
> >>>>> what is a "hung state"?
> >>>>>
> >>>>>> I added printk() to the sites where I stop/wake the queue. It is always
> >>>>>> woken up correctly, but cangen always exits with "write: No buffer 
> >>>>>> space
> >>>>>> available" after a few stop/wake cycles. Immediately running cangen
> >>>>>> again works for a few more stop/wake cycles.
> >>>>> let's look the the cangen source if there's propper error handling...
> >>>>> http://svn.berlios.de/viewvc/socketcan/trunk/can-utils/cangen.c?revision=787&view=markup
> >>>>> nope, there isn't
> >>>>>
> >>>>> There's the option "-i" to ignore ENOBUFS, which is....we don't want to
> >>>>> do that....
> >>>>>
> >>>>> Better improbe the source: (from the pengutronix' canutils...)
> >>>>>
> >>>>> again:
> >>>>>         len = write(s, &frame, sizeof(frame));
> >>>>>         if (len == -1) {
> >>>>>                 switch (errno) {
> >>>>>                 case ENOBUFS: {
> >>>>>                         int err;
> >>>>>                         struct pollfd fds[] = {
> >>>>>                                 {
> >>>>>                                         .fd     = s,
> >>>>>                                         .events = POLLOUT,
> >>>>>                                 },
> >>>>>                         };
> >>>>>
> >>>>>                         err = poll(fds, 1, 1000);
> >>>>>                         if (err == -1 && errno != -EINTR) {
> >>>>>                                 perror("poll()");
> >>>>>                                 exit(EXIT_FAILURE);
> >>>>>                         } else if (err == 0) {
> >>>>>                                 printf("a timeout occured\n");
> >>>>>                                 exit(EXIT_FAILURE);
> >>>>>                         }
> >>>>>                 }
> >>>>>                 case EINTR:     /* fallthrough */
> >>>>>                         goto again;
> >>>>>                 default:
> >>>>>                         perror("write");
> >>>>>                         exit(EXIT_FAILURE);
> >>>>>                 }
> >>>>>         }
> >>>>>
> >>>>> If you hit a timeout, which means you cannot send packages for 1000 ms,
> >>>>> your send queue probably didn't wake up...
> >>>>>
> >>>> Is there any chance you can send me the pengutronix version of
> >>>> cangen/canutils? I can't find it on their website.
> >>>
> >>> http://www.pengutronix.de/software/socket-can/download/canutils/v4.0/
> >>> http://www.pengutronix.de/software/libsocketcan/download/
> >>>
> >>> you need the libsocketcan, too
> >>>
> >>>>>> Here is the output of ifconfig:
> >>>>>> can0      Link encap:UNSPEC  HWaddr 
> >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
> >>>>>>           UP RUNNING NOARP  MTU:16  Metric:1
> >>>>>>           RX packets:1902 errors:0 dropped:0 overruns:0 frame:0
> >>>>>>           TX packets:1902 errors:0 dropped:0 overruns:0 carrier:0
> >>>>>>           collisions:0 txqueuelen:10 
> >>>>>>           RX bytes:10767 (10.5 KiB)  TX bytes:10767 (10.5 KiB)
> >>>>>>           Interrupt:22 
> >>>>>>
> >>>>>> can1      Link encap:UNSPEC  HWaddr 
> >>>>>> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
> >>>>>>           UP RUNNING NOARP  MTU:16  Metric:1
> >>>>>>           RX packets:1902 errors:0 dropped:0 overruns:0 frame:0
> >>>>>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> >>>>>>           collisions:0 txqueuelen:10 
> >>>>>>           RX bytes:10767 (10.5 KiB)  TX bytes:0 (0.0 b)
> >>>>>>           Interrupt:22 
> >>>>>>
> >>>>>> I'm stopping the queue correctly: the xmit() routine never complains 
> >>>>>> that it
> >>>>>> tried to transmit but had no buffer space. You'll see that we didn't 
> >>>>>> drop any
> >>>>>> packets in this run. In another run, we did drop packets. It is like 
> >>>>>> the
> >>>>>> firmware didn't echo the packets back to me. Also note that the TX of
> >>>>>> can0 always matches the RX of can1. They're cabled together, so this is
> >>>>>> expected.
> >>>>> Hmm...it looks like this.
> >>>>>
> >>>>> Double check your RX path, so that you don't loose any frames in
> >>>>> software. Is it possible that the RX and TX collide in your driver
> >>>>> and/or in hardware?
> >>>>>
> >>>> I'm pretty sure that I do not lose RX frames in my driver. When I do,
> >>>> stats->rx_dropped is always incremented. I do get the following warning
> >>> I mean loosing RX frames due to an programming error...
> >>>
> > 
> > I've done some more investigation.
> > 
> > 1) using a tasklet instead of the shared workqueue seems to help. We get
> > many less rx packets lost
> 
> Why don't you use NAPI for rx?
> 

There are two kinds of messages from this firmware:
1) firmware control/communication messages (setup messages, error messages, 
etc.)
2) CAN messages

In my first posting of the driver, I used a workqueue for #1, and NAPI
for #2. Wolfgang suggested that I use just a workqueue. Is it
legal/possible to trigger a NAPI poll() without having the netdevice
open? Alternatively, I do all of the device setup during the netdevice
open() instead of handling most of it at device probe time. I believe
that NAPI can be triggered concurrently with netdevice open(), after
napi_enable() is called.

> > 2) the hangs seem to be due to the driver losing lots of interrupts. It
> > spends all of it's time in the xmit() routine, and never gets a chance
> > to run the recv() routine.
> 
> You wake the tx_queue quite early in the work_queue. Try waking TX after you
> have RX'ed all messages, not earlier.
> 

Hmm, doesn't seem to help. I removed the netif_wake_queue() check from
both ican3_recv_skb() and ican3_handle_interrupt() and moved it to the
end of ican3_tasklet(). The TX queue can only be woken up after all
packets have been recv'd now.

I also tried checking the DESC_IVALID bit in addition to the DESC_VALID
bit. You'll want to read the datasheet for yourself, but it basically
says that you need to clear the IVALID bit before clearing the
interrupt. Now process the buffer as you see fit. Once processing is
finished, toggle the VALID bit. Now the controller can re-use the
buffer.

The datasheet says that this is so you do not lose interrupts, but I
don't see how it can possibly work. You should always keep running
around the ring until you're out of buffers, rather than count on not
losing interrupts. Which is what my driver does.

> > This is a problem, even using tasklet_hi_schedule(). The cansequence
> > program is blasting the transmit queue so hard that we never get a
> > chance to service the recv() queue. Therefore we start losing packets,
> > and generating "rx overflow" error frames.
> 
> Hmmm....if the tx_queue of your hardware is full,
> the xmit functions shouldn't be called, because you halted it.
> 

Right. The ican3_xmit() is not called again after netif_stop_queue() is
called. That is working correctly. I think this is what is happening:

Assume the RX/TX rings are 8 buffers long, instead of the 256 used in
the driver. This will make the example easier.

TX used: 7, almost full
RX used: 8, full

Ok, this is a bad state. We haven't been processing RX messages fast
enough, for whatever reason.

Now we TX a packet:

TX used: 8, full, queue stopped
RX used: 8, full

We still haven't gotten around to processing the RX queue yet, but the
firmware goes ahead and transmits the message. It works. Now it tries to
do the hardware ECHO back, but the RX queue is full. So it drops the
packet (bad). Then it sends us a control message through the control
message queue, and we generate an "rx overflow" error frame.

Does that make sense?


I also tried tripling the length of the RX queue, without changing the
length of the TX queue. No change in behavior. Given the above
situation, that seems correct. :(

Thanks again,
Ira
_______________________________________________
Socketcan-core mailing list
[email protected]
https://lists.berlios.de/mailman/listinfo/socketcan-core

Reply via email to