Re: Using netconsole for debugging suspend/resume
On Mon, Jun 12, 2006 at 10:03:46PM -0700, David Miller wrote:
> From: Andi Kleen [EMAIL PROTECTED]
> Date: Tue, 13 Jun 2006 06:54:14 +0200
>
> > I guess if you use 1394 with remote DMA for other protocols (like video etc.) there must be some way for the subsystem to map the memory even on IOMMU systems. I admit I haven't dived that deeply into the 1394 subsystem so I don't know how that works.
>
> Video-1394 has its own driver, which does a consistent DMA allocation, and then maps that into userspace using remap_pfn_range(). Entirely portable.

That's actually not portable to certain arm platforms, but that's a different story.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Using netconsole for debugging suspend/resume
From: Christoph Hellwig [EMAIL PROTECTED]
Date: Tue, 13 Jun 2006 08:18:19 +0100

> That's actually not portable to certain arm platforms, but that's a different story.

Yes, cache issues :-/
Re: Using netconsole for debugging suspend/resume
Andi Kleen wrote:
> On Friday 09 June 2006 17:24, Mark Lord wrote:
> > Andi Kleen wrote:
> > > If your laptop has firewire you can also use firescope.
> > > (ftp://ftp.suse.com/pub/people/ak/firescope/)
> > ..
> > > FW keeps running as long as nobody resets the ieee1394 chip.
> >
> > This looks interesting. But how does one set it up for use on the *other* end of that firewire cable? The Quickstart and manpage don't seem to describe this fully.
>
> It's in the manpage:
>
> .SH NOTES
> The target must have the ohci1394 driver loaded. This implies that firescope cannot be used in early boot.
>
> That's it.

Okay, so I'm daft. But.. *what* is it ??

We have two machines: target (being debugged), and host (anything). Sure, the target has to have ohci1394 loaded, and firescope running.

But what about the *other* end of the connection? What commands?

Thanks
Re: Using netconsole for debugging suspend/resume
On Monday 12 June 2006 17:38, Mark Lord wrote:
> Andi Kleen wrote:
> > On Friday 09 June 2006 17:24, Mark Lord wrote:
> > > Andi Kleen wrote:
> > > > If your laptop has firewire you can also use firescope.
> > > > (ftp://ftp.suse.com/pub/people/ak/firescope/)
> > > ..
> > > > FW keeps running as long as nobody resets the ieee1394 chip.
> > >
> > > This looks interesting. But how does one set it up for use on the *other* end of that firewire cable? The Quickstart and manpage don't seem to describe this fully.
> >
> > It's in the manpage:
> >
> > .SH NOTES
> > The target must have the ohci1394 driver loaded. This implies that firescope cannot be used in early boot.
> >
> > That's it.
>
> Okay, so I'm daft. But.. *what* is it ??
>
> We have two machines: target (being debugged), and host (anything). Sure, the target has to have ohci1394 loaded, and firescope running.
>
> But what about the *other* end of the connection? What commands?

From the same manpage:

    The raw1394 module must be loaded and its device node be writable (this normally requires root)

Ok it doesn't say you need ohci1394 too and doesn't say that's the target. If I do a new revision I'll perhaps expand the docs a bit.

So load ohci1394/raw1394 and run firescope as root. Your distribution will hopefully take care of the device nodes.

Usually you want something like

    firescope -Au System.map

-Andi
Re: Using netconsole for debugging suspend/resume
Andi Kleen wrote:
> On Monday 12 June 2006 17:38, Mark Lord wrote:
> > Okay, so I'm daft. But.. *what* is it ??
> >
> > We have two machines: target (being debugged), and host (anything). Sure, the target has to have ohci1394 loaded, and firescope running.
> >
> > But what about the *other* end of the connection? What commands?
>
> From the same manpage:
>
>     The raw1394 module must be loaded and its device node be writable (this normally requires root)
>
> Ok it doesn't say you need ohci1394 too and doesn't say that's the target. If I do a new revision I'll perhaps expand the docs a bit.
>
> So load ohci1394/raw1394 and run firescope as root. Your distribution will hopefully take care of the device nodes.
>
> Usually you want something like
>
>     firescope -Au System.map

I think the confusion here is that the target doesn't need to be running anything; you can DMA chunks of memory with the OHCI controller with no need for any software support. The debugger host is what's running firescope. Unless I'm confused too, which is likely. Andi, I think your docs should be more explicit about what runs where.

Also, the tricky bit for me is debugging resume; firescope still requires the OHCI device to come up to be useful, but that's no different from using netconsole.

Neat stuff; I need to get my two firewire-enabled machines close enough to each other to try it out.

J
Re: Using netconsole for debugging suspend/resume
On Monday 12 June 2006 23:25, Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > On Monday 12 June 2006 17:38, Mark Lord wrote:
> > > Okay, so I'm daft. But.. *what* is it ??
> > >
> > > We have two machines: target (being debugged), and host (anything). Sure, the target has to have ohci1394 loaded, and firescope running.
> > >
> > > But what about the *other* end of the connection? What commands?
> >
> > From the same manpage:
> >
> >     The raw1394 module must be loaded and its device node be writable (this normally requires root)
> >
> > Ok it doesn't say you need ohci1394 too and doesn't say that's the target. If I do a new revision I'll perhaps expand the docs a bit.
> >
> > So load ohci1394/raw1394 and run firescope as root. Your distribution will hopefully take care of the device nodes.
> >
> > Usually you want something like
> >
> >     firescope -Au System.map
>
> I think the confusion here is that the target doesn't need to be running anything; you can DMA chunks of memory with the OHCI controller with no need for any software support.

You need ohci1394 loaded at least once. That is why it only works in relatively late boot.

I've been playing with the idea of writing early1394 that just turns the DMA controller on as early as possible similar to earlyprintk on the target. Then it would be possible to use it for early debugging too. But so far it's not done yet.

I'll try to write better docs next time.

BTW Bernd did a gdbstub based on the firescope so you can even examine all kernel variables symbolically. It can even write variables, but not change the flow of the CPU. Standard firescope can just hexdump read/write symbols. With gdb it's also possible to do a core file of the kernel.

-Andi
Re: Using netconsole for debugging suspend/resume
From: Andi Kleen [EMAIL PROTECTED]
Date: Tue, 13 Jun 2006 05:47:49 +0200

> I've been playing with the idea of writing early1394 that just turns the DMA controller on as early as possible similar to earlyprintk on the target. Then it would be possible to use it for early debugging too. But so far it's not done yet.

Does this raw1394 thing with firescope just assume DMA address == physical address?

How would it work to access all of physical memory properly on IOMMU platforms?
Re: Using netconsole for debugging suspend/resume
On Tuesday 13 June 2006 06:49, David Miller wrote:
> From: Andi Kleen [EMAIL PROTECTED]
> Date: Tue, 13 Jun 2006 05:47:49 +0200
>
> > I've been playing with the idea of writing early1394 that just turns the DMA controller on as early as possible similar to earlyprintk on the target. Then it would be possible to use it for early debugging too. But so far it's not done yet.
>
> Does this raw1394 thing with firescope just assume DMA address == physical address?

Yes.

> How would it work to access all of physical memory properly on IOMMU platforms?

It assumes you don't have an IOMMU - relies on all memory being accessible by ohci1394. On x86-64 it can't access memory above 4GB either, but that's normally ok because the kernel log buffer is below that.

I guess if you use 1394 with remote DMA for other protocols (like video etc.) there must be some way for the subsystem to map the memory even on IOMMU systems. I admit I haven't dived that deeply into the 1394 subsystem so I don't know how that works.

-Andi
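To make the addressing assumption above concrete: if the DMA address is taken to be the physical address and the controller cannot reach above 4GB, then a memory range is only readable when it lies entirely below that boundary. The following is a minimal user-space sketch of that check, not code from firescope or the kernel; the names `dma_reachable` and `DMA32_LIMIT` are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant: first physical address a 32-bit DMA engine cannot reach. */
#define DMA32_LIMIT ((uint64_t)1 << 32)

/*
 * Return true if the physical range [phys, phys + len) lies entirely
 * below 4GB, i.e. could be read by a controller limited to 32-bit
 * addresses when DMA address == physical address (no IOMMU).
 */
static bool dma_reachable(uint64_t phys, uint64_t len)
{
    if (len == 0)
        return false;                 /* nothing to read */
    if (phys >= DMA32_LIMIT)
        return false;                 /* range starts above the limit */
    return len <= DMA32_LIMIT - phys; /* end must not cross the limit */
}
```

Under these assumptions a log buffer placed low in memory passes the check, while any range that starts at or crosses the 4GB mark does not. The subtraction form avoids overflow when `phys + len` would wrap.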
Re: Using netconsole for debugging suspend/resume
From: Andi Kleen [EMAIL PROTECTED]
Date: Tue, 13 Jun 2006 06:54:14 +0200

> I guess if you use 1394 with remote DMA for other protocols (like video etc.) there must be some way for the subsystem to map the memory even on IOMMU systems. I admit I haven't dived that deeply into the 1394 subsystem so I don't know how that works.

Video-1394 has its own driver, which does a consistent DMA allocation, and then maps that into userspace using remap_pfn_range(). Entirely portable.

Strangely I don't even see any bus_to_virt() etc. calls in the raw1394 driver, just these ptr2int() things...
Re: Using netconsole for debugging suspend/resume
On Friday 09 June 2006 03:56, Jeremy Fitzhardinge wrote:
> Rafael J. Wysocki wrote:
> > Please try doing echo 8 > /proc/sys/kernel/printk before suspend.
>
> Um, why? That would increase the amount of log output, but I don't see how it would help with netconsole preventing suspend, or not being able to see console messages on a blank screen after resume.

Ah, that's after resume. Sorry for the noise. :-)

Rafael
Re: Using netconsole for debugging suspend/resume
On Fri, Jun 09, 2006 at 07:50:25AM +0200, Andi Kleen wrote:
> On Friday 09 June 2006 07:23, David Miller wrote:
> > From: Auke Kok [EMAIL PROTECTED]
> > Date: Thu, 08 Jun 2006 22:13:48 -0700
> >
> > > netconsole should retry. There is no timeout programmed here since that might lose important information, and you rather want netconsole to survive an odd unplugged cable than to lose vital debugging information when the system is busy for instance. (losing link will cause the interface to be down and thus the queue to be stopped)
> >
> > I completely disagree that netpoll should loop when the ethernet cable is plugged out.
>
> Currently it is a bit dumb and doesn't distinguish the various cases well. I submitted a patch to loop to be a bit more clever at some point. It can be still found in the netdev archives.

Agreed that timeouts should happen. IIRC, the trouble with your patch was that it a) timed out on far too short a timescale and b) locked up on my box. Unfortunately, so did my own patch, which made timeouts approximately 1ms.

--
Mathematics is the supreme nostalgia of our time.
Re: Using netconsole for debugging suspend/resume
Andi Kleen wrote:
> If your laptop has firewire you can also use firescope.
> (ftp://ftp.suse.com/pub/people/ak/firescope/)
..
> FW keeps running as long as nobody resets the ieee1394 chip.

This looks interesting. But how does one set it up for use on the *other* end of that firewire cable? The Quickstart and manpage don't seem to describe this fully.

Thanks
Using netconsole for debugging suspend/resume
I've been trying to get suspend/resume working well on my new laptop. In general, netconsole has been pretty useful for extracting oopses and other messages, but it is of more limited help in debugging the actual suspend/resume cycle. The problem looks like the e1000 driver won't suspend while netconsole is using it, so I have to rmmod/modprobe netconsole around the actual suspend/resume.

This is a big problem during resume because the screen is also blank, so I get no useful clue as to what went wrong when things go wrong. I'm wondering if there's some way to keep netconsole alive to the last possible moment during suspend, and re-woken as soon as possible during resume. It would be nice to have a clean solution, but I'm willing to use a bletcherous hack if that's what it takes.

Any ideas?

Thanks,
J
Re: Using netconsole for debugging suspend/resume
On Thursday 08 June 2006 19:50, Jeremy Fitzhardinge wrote:
> I've been trying to get suspend/resume working well on my new laptop. In general, netconsole has been pretty useful for extracting oopses and other messages, but it is of more limited help in debugging the actual suspend/resume cycle. The problem looks like the e1000 driver won't suspend while netconsole is using it, so I have to rmmod/modprobe netconsole around the actual suspend/resume.
>
> This is a big problem during resume because the screen is also blank, so I get no useful clue as to what went wrong when things go wrong. I'm wondering if there's some way to keep netconsole alive to the last possible moment during suspend, and re-woken as soon as possible during resume. It would be nice to have a clean solution, but I'm willing to use a bletcherous hack if that's what it takes.
>
> Any ideas?

Please try doing echo 8 > /proc/sys/kernel/printk before suspend.

Greetings,
Rafael
Re: Using netconsole for debugging suspend/resume
On Thursday 08 June 2006 19:50, Jeremy Fitzhardinge wrote:
> I've been trying to get suspend/resume working well on my new laptop. In general, netconsole has been pretty useful for extracting oopses and other messages, but it is of more limited help in debugging the actual suspend/resume cycle. The problem looks like the e1000 driver won't suspend while netconsole is using it, so I have to rmmod/modprobe netconsole around the actual suspend/resume.

If your laptop has firewire you can also use firescope. (ftp://ftp.suse.com/pub/people/ak/firescope/)

> This is a big problem during resume because the screen is also blank, so I get no useful clue as to what went wrong when things go wrong. I'm wondering if there's some way to keep netconsole alive to the last possible moment during suspend, and re-woken as soon as possible during resume. It would be nice to have a clean solution, but I'm willing to use a bletcherous hack if that's what it takes.

FW keeps running as long as nobody resets the ieee1394 chip.

Networking is much more complex and will likely never work well for such low level debug situations. Netconsole is mostly useful to catch the odd oops during runtime.

-Andi
Re: Using netconsole for debugging suspend/resume
Matt Mackall wrote:
> That's odd. Netpoll holds a reference to the device, of course, but so does a normal up interface. So that shouldn't be the problem. Another possibility is that outgoing packets from printks in the driver are causing difficulty. Not sure what can be done about that.

I only tried once; maybe I misunderstood what was going on. I'll try again tonight.

Oh, I think I see what's happening. The e1000 suspend routine does this:

    if (netif_running(netdev))
        e1000_down(adapter);

This leaves the interface up, but it stops the queue. Then netpoll_send_skb() has this loop:

    do {
        npinfo->tries--;
        spin_lock(&np->dev->xmit_lock);
        np->dev->xmit_lock_owner = smp_processor_id();

        /*
         * network drivers do not expect to be called if the queue is
         * stopped.
         */
        if (netif_queue_stopped(np->dev)) {
            np->dev->xmit_lock_owner = -1;
            spin_unlock(&np->dev->xmit_lock);
            netpoll_poll(np);
            udelay(50);
            continue;
        }

        /* ... */
    again: ;    /* proposed */
    } while (npinfo->tries > 0);

so this will end up in an infinite loop, since netif_queue_stopped() will always return true, and it never looks at npinfo->tries. Should the "continue" be "goto again"?

Also, e1000_down does a netif_poll_disable(), but I'm not sure what that actually does... Should it prevent netpoll from even trying to send?

> It's generally going to suck, because unlike a polled serial port, the device needs to be put to sleep. But if you're doing suspend to RAM,

I'm interested in suspend-to-ram. I presume that with suspend-to-disk, booting with built-in netconsole will tell me useful stuff; that'll be the next experiment.

> you might be able to do something like this:
>
> - unhook net device from suspend machinery (possibly just return success)
> - bounce out of suspend before the final call to ACPI is made
>
> Net effect is you do OS-level suspend and resume of everything but the NIC without actually powering down the core. Which should let you debug just about everything.

Well, the machine has to really suspend so that I can see (and debug) a mostly normal resume. In particular, I need the hardware to be zapped so I can see if it is being restarted properly.

What might work is to change the e1000 suspend routine to save enough state for resume to work, but keep the interface up so that netconsole can keep transmitting all the way up to the point that the final acpi call powers off the machine. Then the e1000 would resume normally, including restarting the xmit queue so that netconsole can start again immediately; any netconsole output before the e1000 resume would be lost, of course (I guess it could be buffered). That would suit me for now.

J
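The retry behaviour under discussion can be modelled outside the kernel. What follows is a user-space sketch only, not the netpoll code: `fake_dev`, `fake_poll`, `try_send` and `MAX_TRIES` are invented for illustration, and the delay between attempts is left out. It shows a send loop that gives up after a fixed budget of attempts when the queue never restarts, which is the property a suspended interface would need.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a network device with a stoppable TX queue. */
struct fake_dev {
    bool queue_stopped; /* true while the driver has stopped the queue */
    int  wake_after;    /* queue restarts after this many polls; <0 = never */
    int  polls;         /* how many times the device has been polled */
    int  sent;          /* packets actually handed to the device */
};

#define MAX_TRIES 5 /* illustrative budget, far smaller than a real one */

/* Poll the device once; in this model a poll may restart the queue. */
static void fake_poll(struct fake_dev *dev)
{
    dev->polls++;
    if (dev->wake_after >= 0 && dev->polls >= dev->wake_after)
        dev->queue_stopped = false;
}

/*
 * Best-effort send: retry while the queue is stopped, but never more
 * than MAX_TRIES times. Returns true if the packet went out, false if
 * it was dropped.
 */
static bool try_send(struct fake_dev *dev)
{
    int tries = MAX_TRIES;

    while (tries-- > 0) {
        if (dev->queue_stopped) {
            fake_poll(dev); /* a real loop would also delay here */
            continue;       /* the budget was already charged above */
        }
        dev->sent++;
        return true;
    }
    return false; /* budget exhausted: drop instead of spinning */
}
```

With `wake_after` set to a negative value the call returns false after exactly `MAX_TRIES` polls, so the caller (here, a suspending system) is never blocked indefinitely. With a queue that restarts after a couple of polls the packet still goes out.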
Re: Using netconsole for debugging suspend/resume
Rafael J. Wysocki wrote:
> Please try doing echo 8 > /proc/sys/kernel/printk before suspend.

Um, why? That would increase the amount of log output, but I don't see how it would help with netconsole preventing suspend, or not being able to see console messages on a blank screen after resume.

J
Re: Using netconsole for debugging suspend/resume
From: Auke Kok [EMAIL PROTECTED]
Date: Thu, 08 Jun 2006 22:13:48 -0700

> netconsole should retry. There is no timeout programmed here since that might lose important information, and you rather want netconsole to survive an odd unplugged cable than to lose vital debugging information when the system is busy for instance. (losing link will cause the interface to be down and thus the queue to be stopped)

I completely disagree that netpoll should loop when the ethernet cable is plugged out.

This stops the entire system. What if this is one of my main web servers and I have other links on the machine for redundancy and load balancing? Just because some careless sysop knocks one of the cables out, my system just freezes up and stops?

What if I'm on a remote serial console, how long should I scratch my head wondering why the whole machine is frozen up before I figure out that the ethernet cable being out has made my system unusable because netpoll is just looping on the thing forever?

That's an extremely poor quality of implementation if you ask me.

Netpoll is _BEST_ _EFFORT_, end of story. It by definition can only offer that level of service because it does locking in circumstances where such locking might be illegal or even impossible. So it has to try, but if it can't get the resources it needs, it must stop trying and abort the logging.
Re: Using netconsole for debugging suspend/resume
Auke Kok wrote:
> netconsole should retry. There is no timeout programmed here since that might lose important information, and you rather want netconsole to survive an odd unplugged cable than to lose vital debugging information when the system is busy for instance. (losing link will cause the interface to be down and thus the queue to be stopped)

Well, the trouble is that it ends up spinning forever in the suspend case. The driver's suspend routine has XOFFed the queue, and it's never going to come back if netconsole clogs everything up over it.

Perhaps the correct fix isn't at the netpoll level, but at the netconsole level, but the behaviour of "suspend ethernet, netconsole drops bits into the bucket until the ether comes back" seems to be the best we can hope for. The present behaviour is definitely bad, since it will prevent any system from suspending while using netconsole, so you'd need to make it modular and rmmod/modprobe it around the suspend event - definitely losing more information.

Also it means that if you kick a cable, the machine will eventually lock up, which doesn't seem like the best behaviour...

Even so, it will wait for 1 second per skb sent (2 x 50uS) to wait for the queue to be started, so it will be pretty slow, and will recover from little hiccups without losing much.

> polling is for receives. We're basically telling the stack not to poll our interface anymore.

OK, I see.

> e1000_suspend saves the entire configuration of the device and puts it in Wake-on-Lan mode, allowing it to be woken up by your 'zap' in the proper way.

Not sure that's terribly useful. It would be nice to be able to zap the ethernet to get a console dump from early stages, but talking to the device depends on all the intermediate PCI stuff being set up first, so netconsole could cause even more of a mess.

> > Then the e1000 would resume normally, including restarting the xmit queue so that netconsole can start again immediately; any netconsole output before the e1000 resume would be lost, of course (I guess it could be buffered). That would suit me for now.
>
> after coming out of suspend, e1000_resume is called which basically reinitializes the entire device. In the entire sequence it is unlikely that you'll actually be able to maintain netconsole in the first boot stage - the network device will not be initialized by the kernel yet, and obviously will be useless until e1000_resume is called!

Yes, but I think that's OK for what I'm looking at. The problems I'm seeing happen later, and as I said in the first mail, I'm willing to accept a bletcherous hack if necessary (though obviously something clean and mergeable would be preferable).

At the netpoll level, assuming that netpoll_send_skb doesn't busywait forever while the queue is XOFFed, it will toss things until the moment the ethernet device queue is up, and then it will resume as normal.

> I'm not sure that tweaking e1000 to survive longer is the answer here, and you might be better off trying to have netconsole graciously wait (msleep_interruptible instead of udelay?)

Pretty sure netpoll can't sleep there...

> In any case, I see the biggest problem in the early boot stage when all nics are basically uninitialized until resume starts. You just can't assign it an IP address for instance that easy, and even resume causes the device to reset and thus link renegotiation, adding crucial seconds to the time that the link is down, in which time you're stacking up netconsole messages, or worse, fail to initialize netconsole

netconsole has already been initialized. It doesn't need reinit on resume.

> I hope this helps - I can't help but thinking that netconsole definitely wasn't designed with this in mind.

Perhaps not, but it isn't far from being a useful tool in this case. It's much better than the alternative of having no information at all about the whole process.

J
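The "it could be buffered" idea mentioned above can be illustrated with a small fixed-size ring that holds messages while the interface is down and hands them back in order once it is up again. This is a user-space sketch under stated assumptions, not netconsole code: `log_ring`, `ring_put`, `ring_get` and the sizes are hypothetical, and overwriting the oldest entry when full is just one possible policy.

```c
#include <stddef.h>
#include <string.h>

#define LOG_SLOTS   4  /* illustrative capacity */
#define LOG_MSG_LEN 64 /* illustrative maximum message length */

/* Hypothetical holding area for messages produced while the NIC is down. */
struct log_ring {
    char   msg[LOG_SLOTS][LOG_MSG_LEN];
    size_t head;    /* index of the oldest stored message */
    size_t count;   /* number of stored messages */
    size_t dropped; /* messages overwritten because the ring was full */
};

/* Store a message; when full, overwrite the oldest one and count the loss. */
static void ring_put(struct log_ring *r, const char *text)
{
    size_t slot = (r->head + r->count) % LOG_SLOTS;

    if (r->count == LOG_SLOTS) {
        /* slot equals head here, so the oldest entry is replaced */
        r->head = (r->head + 1) % LOG_SLOTS;
        r->dropped++;
    } else {
        r->count++;
    }
    strncpy(r->msg[slot], text, LOG_MSG_LEN - 1);
    r->msg[slot][LOG_MSG_LEN - 1] = '\0';
}

/*
 * Copy the oldest message into out (at least LOG_MSG_LEN bytes) and
 * remove it. Returns 1 if a message was copied, 0 if the ring is empty.
 */
static int ring_get(struct log_ring *r, char *out)
{
    if (r->count == 0)
        return 0;
    strcpy(out, r->msg[r->head]);
    r->head = (r->head + 1) % LOG_SLOTS;
    r->count--;
    return 1;
}
```

In this model the sender would call `ring_put` while the queue is stopped and drain with `ring_get` after resume. Keeping the newest messages favours seeing what happened just before the link came back; a debugger more interested in the start of the suspend sequence might prefer to drop new messages instead.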
Re: Using netconsole for debugging suspend/resume
On Friday 09 June 2006 07:23, David Miller wrote:
> From: Auke Kok [EMAIL PROTECTED]
> Date: Thu, 08 Jun 2006 22:13:48 -0700
>
> > netconsole should retry. There is no timeout programmed here since that might lose important information, and you rather want netconsole to survive an odd unplugged cable than to lose vital debugging information when the system is busy for instance. (losing link will cause the interface to be down and thus the queue to be stopped)
>
> I completely disagree that netpoll should loop when the ethernet cable is plugged out.

Currently it is a bit dumb and doesn't distinguish the various cases well. I submitted a patch to loop to be a bit more clever at some point. It can be still found in the netdev archives.

-Andi