Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier
Dave Hansen <[EMAIL PROTECTED]> wrote on 14.02.2008 18:12:43:
> On Thu, 2008-02-14 at 09:46 +0100, Christoph Raisch wrote:
> > Dave Hansen <[EMAIL PROTECTED]> wrote on 13.02.2008 18:05:00:
> > > On Wed, 2008-02-13 at 16:17 +0100, Jan-Bernd Themann wrote:
> > > > Constraints imposed by HW / FW:
> > > > - eHEA has its own MMU
> > > > - eHEA Memory Regions (MRs) are used by the eHEA MMU to translate
> > > >   virtual addresses to absolute addresses (like DMA mapped memory
> > > >   on a PCI bus)
> > > > - The number of MRs is limited (not enough to have one MR per packet)
> > >
> > > Are there enough to have one per 16MB section?
> >
> > Unfortunately this won't work. This was one of our first ideas we
> > tossed out, but the number of MRs will not be sufficient.
>
> Can you give a ballpark of how many there are to work with?  10? 100?
> 1000?

It depends on HMC configuration, but in the worst case the upper limit is
in the two-digit range.

> > > But, I'm really not convinced that you can actually keep this map
> > > yourselves.  It's not as simple as you think.  What happens if you get
> > > on an LPAR with two sections, one [EMAIL PROTECTED] and another
> > > [EMAIL PROTECTED]  That's quite possible.  I think your vmalloc'd
> > > array will eat all of memory.
> >
> > I'm glad you mention this part. There are many algorithms out there to
> > handle this problem, hashes/trees/... all of these trade speed for a
> > smaller memory footprint.
> > We based the table decision on the existing implementations of the
> > architecture.
> > Do you see such a case coming along for the next generation POWER
> > systems?
>
> Dude.  It exists *TODAY*.  Go take a machine, add tens of gigabytes of
> memory to it.  Then, remove all of the sections of memory in the middle.
> You'll be left with a very sparse memory configuration that we *DO*
> handle today in the core VM.  We handle it quite well, actually.
>
> The hypervisor does not shrink memory from the top down.  It pulls
> things out of the middle and shuffles things around.  In fact, a NUMA
> node's memory isn't even contiguous.
>
> Your code will OOM the machine in this case.  I consider the ehea driver
> buggy in this regard.

Your comment indicates that the upper limit for memory set on the HMC does
not influence the upper limit of the partition's physical address space.
So the base assumption we discussed internally is wrong here
(conclusion: see below).

> > I would guess these drastic changes would also require changes in the
> > base kernel.
>
> No, we actually solved those a couple years ago.
>
> > Will you provide a generic mapping system with a contiguous virtual
> > address space like the ehea_bmap we can query? This would need to be a
> > "stable" part of the implementation, including translation functions
> > from kernel to nextgen_ehea_generic_bmap like virt_to_abs.
>
> Yes, that's a real possibility, especially if some other users for it
> come forward.  We could definitely add something like that to the
> generic code.  But, you'll have to be convincing that what we have now
> is insufficient.
>
> Does this requirement:
> "- MRs cover a contiguous virtual memory block (no holes)"
> come from the hardware?

yes

> Is that *EACH* MR? OR all MRs?

each

> Where does EHEA_BUSMAP_START come from?  Is that defined in the
> hardware?  Have you checked to ensure that no other users might want a
> chunk of memory in that area?

EHEA_BUSMAP_START is a value which has to match between the wqe virtual
addresses and the MR used in them.
Fortunately there's a simple answer to that one: each MR has its own
address space, so there's no need to check.
A HEA MR actually has exactly the same attributes as an InfiniBand MR with
this hardware; send/receive processing is pretty much comparable to an
InfiniBand UD queue.

> Can you query the existing MRs?

no

> Not change them in place, but can you
> query their contents?

no

> > > That's why we have SPARSEMEM_EXTREME and SPARSEMEM_VMEMMAP implemented
> > > in the core, so that we can deal with these kinds of problems, once
> > > and *NOT* in every single little driver out there.
> > >
> > > > Functions to use while building ehea_bmap + MRs:
> > > > - Use either the functions that are used by the memory hotplug
> > > >   system as well, that means using the section defines + functions
> > > >   (section_nr_to_pfn, pfn_valid)
> > >
> > > Basica
ing them
> properly.  You should assume that you can export and use
> walk_memory_resource().

So this seems to come down to a basic question:
new hardware seems to have a tendency to get "private MMUs", which need
private mappings from the kernel address space into a "HW defined address
space with potentially unique characteristics".
RDMA in OpenFabrics with a global MR is the most prominent example heading
there.

> Do you know what other operating systems do with this hardware?

We're not aware of another open source operating system trying to address
this topic.

> In the future, please make an effort to get review from knowledgeable
> people about these kinds of things before using them in your driver.
> Your company has many, many resources available, and all you need to do
> is ask.  All that you have to do is look to the tops of the files of the
> functions you are calling.

So we're glad we finally found the right person who takes responsibility
for this topic!

> -- Dave

Gruss / Regards
Christoph Raisch + Jan-Bernd Themann
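For reference, a rough sketch of the direction hinted at above, i.e.
building the driver's mapping from walk_memory_resource() and keeping it
up to date through a memory hotplug notifier. The helper names
(demo_busmap_add and friends) are made up for illustration, the interfaces
shown are the ones exported around 2.6.25, and this is not the actual ehea
implementation:

#include <linux/memory.h>          /* register_memory_notifier() */
#include <linux/memory_hotplug.h>  /* walk_memory_resource() */
#include <linux/notifier.h>
#include <asm/page.h>
#include <asm/sparsemem.h>         /* MAX_PHYSMEM_BITS (powerpc) */

/* Add [pfn, pfn + nr_pages) to the driver's translation map (stub). */
static int demo_busmap_add(unsigned long pfn, unsigned long nr_pages,
                           void *arg)
{
        /* build up the busmap / MR covering this range here */
        return 0;
}

static int demo_mem_notifier(struct notifier_block *nb,
                             unsigned long action, void *data)
{
        struct memory_notify *arg = data;

        switch (action) {
        case MEM_ONLINE:
                demo_busmap_add(arg->start_pfn, arg->nr_pages, NULL);
                break;
        case MEM_OFFLINE:
                /* remove the range and re-register the MR here */
                break;
        }
        return NOTIFY_OK;
}

static struct notifier_block demo_mem_nb = {
        .notifier_call = demo_mem_notifier,
};

static int demo_init_busmap(void)
{
        int ret;

        /* cover all system RAM present at driver load */
        ret = walk_memory_resource(0,
                        1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT),
                        NULL, demo_busmap_add);
        if (ret)
                return ret;

        /* follow memory add/remove from now on */
        return register_memory_notifier(&demo_mem_nb);
}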
Re: [PATCH] ehea: Add kdump support
Michael Ellerman wrote on 26.11.2007 09:16:28:
> Solutions that might be better:
>
> a) if there are a finite number of handles and we can predict their
> values, just delete them all in the kdump kernel before the driver
> loads.

Guessing the values does not work because of the handle structure defined
by the hypervisor.

> b) if there are a small & finite number of handles, save their values
> in a device tree property and have the kdump kernel read them and
> delete them before the driver loads.

5*16*nr_ports + 1 + 1 >= 82. An ML16 has 4 adapters with up to 16 ports,
so the number is not small anymore.
The device tree functions are currently not exported.
If you crashdump to a new kernel, will it get the device tree
representation of the crashed kernel, or the initial one from Open
Firmware?

> c) if neither of those work, provide a minimal routine that _only_
> deletes the handles in the crashed kernel.

I would hope this has the highest chance to actually work.
For this we would have to add a proper notifier chain. Do you agree?

> d) Firmware change?

But that's not something you will get very soon.

Christoph R.
Re: [PATCH] ehea: add kexec support
Michael Neuling <[EMAIL PROTECTED]> wrote on 03.11.2007 07:06:31:
> > DD allocates HEA resources and gets firmware_handles for these
> > resources.
> > To free the resources DD needs to use exactly these handles.
> > There's no generic firmware call "clean out all resources".
> > Allocating the same resources twice does not work.
>
> Can we get a new firmware call to do this?

Well, there's no simple answer to this. I'm not working on firmware.
I'm trying to get an answer... but don't expect anything "real soon".

> > So a new kernel can't free the resources allocated by an old kernel,
> > because the numeric values of the handles aren't known anymore.
>
> How many possible handles are there?

Depends on system configuration, between 4 and 64 per port.

> If the handles are lost, is the only way to clear out the HEA resources
> to reset the partition?

Yes, that's exactly the problem.

> > Potential solution:
> > The HEA driver cleanup function hooks into ppc_md.machine_crash_shutdown
> > and frees all firmware resources at shutdown time of the crashed kernel.
>
> This means the crashed kernel now has to be trusted to shut down and
> free up the resources. Isn't trusting the crashing kernel in this way
> against the whole kdump idea?

I would hope that if the cleanup routine only does hcalls and does not
change any kernel memory areas, the risk of damaging anything else in the
kernel should be pretty small.
This should allow us to catch most cases, but as always you can imagine
situations where the kernel memory is broken beyond hope of even
restarting the kdump kernel.

> > crash_kexec continues and loads the new kernel.
> > The new kernel restarts the HEA driver within the kdump kernel, which
> > will work because the resources have been freed before.
> >
> > Michael, would this work?

Is ppc_md.machine_crash_shutdown the right hook?

Gruss/Regards
Christoph R
Re: [PATCH] ehea: add kexec support
Michael Ellerman <[EMAIL PROTECTED]> wrote on 02.11.2007 07:30:08:
> On Wed, 2007-10-31 at 20:48 +0100, Christoph Raisch wrote:
> > Michael Ellerman <[EMAIL PROTECTED]> wrote on 30.10.2007 23:50:36:
>
> If that's really the way it works then eHEA is more or less broken for
> kdump I'm afraid.

We think we have a way to work around this, but let me first try to
explain the base problem.

DD allocates HEA resources and gets firmware_handles for these resources.
To free the resources DD needs to use exactly these handles.
There's no generic firmware call "clean out all resources".
Allocating the same resources twice does not work.

So a new kernel can't free the resources allocated by an old kernel,
because the numeric values of the handles aren't known anymore.

Potential solution:
The HEA driver cleanup function hooks into ppc_md.machine_crash_shutdown
and frees all firmware resources at shutdown time of the crashed kernel.
crash_kexec continues and loads the new kernel.
The new kernel restarts the HEA driver within the kdump kernel, which will
work because the resources have been freed before.

Michael, would this work?

Gruss / Regards
Christoph R.
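To make the idea a bit more concrete, a rough sketch of such a hook. All
names are made up, only hcalls would be issued from the cleanup path, and
whether ppc_md.machine_crash_shutdown is the right place is exactly the
open question:

#include <asm/machdep.h>   /* ppc_md */
#include <asm/ptrace.h>    /* struct pt_regs */

static void (*demo_old_crash_shutdown)(struct pt_regs *regs);

/* Free QPs, CQs, EQs and MRs via hcalls only; do not touch any other
 * kernel state, to keep the risk in a crashed kernel small. */
static void demo_free_fw_handles(void)
{
        /* hcalls to free the firmware handles go here */
}

static void demo_crash_shutdown(struct pt_regs *regs)
{
        demo_free_fw_handles();
        if (demo_old_crash_shutdown)
                demo_old_crash_shutdown(regs);
}

static void demo_install_crash_hook(void)
{
        demo_old_crash_shutdown = ppc_md.machine_crash_shutdown;
        ppc_md.machine_crash_shutdown = demo_crash_shutdown;
}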
Re: [PATCH] ehea: add kexec support
Michael Ellerman <[EMAIL PROTECTED]> wrote on 30.10.2007 23:50:36:
>
> On Tue, 2007-10-30 at 09:39 +0100, Christoph Raisch wrote:
> >
> > Michael Ellerman <[EMAIL PROTECTED]> wrote on 28.10.2007 23:32:17:
> > Hope I didn't miss anything here...
>
> Perhaps. When we kdump the kernel does not call the reboot notifiers, so
> the code Jan-Bernd just added won't get called. So the eHEA resources
> won't be freed. When the kdump kernel tries to load the eHEA driver what
> will happen?

Good point.
If the device driver tries to allocate resources again (in the kdump
kernel) which have been allocated before (in the crashed kernel), the
hcalls will fail because from the hypervisor's view the resources are
still in use.
Currently there's no method to find out the resource handles for the HEA
resources allocated by the crashed kernel within the hypervisor...
So we have to trigger an explicit deregister in the hypervisor before the
driver is started again.

How do you recommend we should trigger this in the kdump process?
Would placing a hook into ppc_md.machine_kexec be an option?

Gruss / Regards
Christoph R.
Re: [PATCH] ehea: add kexec support
Michael Ellerman <[EMAIL PROTECTED]> wrote on 28.10.2007 23:32:17:
>
> How do you plan to support kdump?

When kexec is fully supported, kdump should work out of the box as for any
other Ethernet card (if you load the right eth driver).
There's nothing specific to kdump you have to handle in Ethernet device
drivers.
Hope I didn't miss anything here...

Gruss / Regards
Christoph R
Re: new NAPI interface broken
Benjamin Herrenschmidt <[EMAIL PROTECTED]> wrote on 16.10.2007 11:01:49:
>
> Christoph, have any of you tried it on powerpc ?

No, we didn't try this (yet).
This approach makes a lot of sense.
Why is this not installed by default by both large distros on PPC?
How mature is this for larger SMPs on PPC?

> Cheers,
> Ben.

Gruss / Regards
Christoph R.
Re: new NAPI interface broken for POWER architecture?
David Miller <[EMAIL PROTECTED]> wrote on 12.09.2007 14:50:04:
> From: Jan-Bernd Themann <[EMAIL PROTECTED]>
> Date: Fri, 7 Sep 2007 11:37:02 +0200
>
> > 2) On SMP systems: after netif_rx_complete has been called on CPU1
> >    (+ interrupts enabled), netif_rx_schedule could be called on CPU2
> >    (irq handler) before net_rx_action on CPU1 has checked
> >    NAPI_STATE_SCHED.
> >    In that case the device would be added to the poll lists of CPU1 and
> >    CPU2, as net_rx_action would see NAPI_STATE_SCHED set.
> >    This must not happen. It will be caught when netif_rx_complete is
> >    called the second time (BUG() called)
> >
> > This would mean we have a problem on all SMP machines right now.
>
> This is not a correct statement.
>
> Only on your platform do network device interrupts get moved
> around, no other platform does this.
>
> Sparc64 doesn't, all interrupts stay in one location after
> the cpu is initially chosen.
>
> x86 and x86_64 specifically do not move around network
> device interrupts, even though other device types do
> get dynamic IRQ cpu distribution.
>
> That's why you are the only person seeing this problem.
>
> I agree that it should be fixed, but we should also fix the IRQ
> distribution scheme used on powerpc platforms which is totally
> broken in these cases.

This is definitely not something we can change in the HEA device driver
alone. It could also affect any other networking card on POWER
(e1000, s2io, ...).
Paul, Michael, Arnd, what is your opinion here?

Gruss / Regards
Christoph Raisch
Re: [PATCH 2/2] ehea: Receive SKB Aggregation
Christoph Hellwig wrote on 31.05.2007 15:41:18:
> I'm still very unhappy with having all this in various drivers. There's
> a lot of code that can be turned into generic library functions, and even
> more code that could be made generic with some amount of refactoring.

Yes, we'd also prefer to use a generic function, but we first want to get
some "real world" experience of how our driver behaves with LRO, to be
able to define requirements for such a generic function.
A lot of this is tied into path lengths and caching, and into the question
why that helps compared to a different TCP receive-side processing.
In a perfect world we shouldn't see a difference whether this is enabled
or not, but measurements indicate something completely different at
10 Gbit.

Gruss / Regards
Christoph Raisch
Re: [PATCH 2/2] ehea: Receive SKB Aggregation
Stephen Hemminger wrote on 31.05.2007 18:37:03:
> >
> > +static int try_get_ip_tcp_hdr(struct ehea_cqe *cqe, struct sk_buff *skb,
> > +                             struct iphdr **iph, struct tcphdr **tcph)
> > +{
> > +   int ip_len;
> > +
> > +   /* non tcp/udp packets */
> > +   if (!cqe->header_length)
> > +      return -1;
> > +
> > +   /* non tcp packet */
> > +   *iph = (struct iphdr *)(skb->data);
>
> Why the indirection, copying of headers..

This interacts with the header split function in the hardware.

> > +   if ((*iph)->protocol != IPPROTO_TCP)
> > +      return -1;
> > +
> > +   ip_len = (u8)((*iph)->ihl);
> > +   ip_len <<= 2;
> > +   *tcph = (struct tcphdr *)(((u64)*iph) + ip_len);
> > +
> > +   return 0;
> > +}
> > +
>
> This code seems to be duplicating a lot (but not all) of the TCP/IP
> input path validation checks. This is a security problem if nothing
> else...

We should only do aggregation in the driver if this really is a TCP
header, otherwise things will get worse.
You're right, we should at least check that tcph is within the received
frame.

> Also, how do you prevent DoS attacks from hostile TCP senders that send
> huge number of back to back frames?

Actually a huge number of back-to-back frames is what we would want to
receive at 10 Gbit ;-)
How is it possible to figure out if this is what you want or just DoS?
It doesn't change anything compared to a non-LRO driver: we process a
certain maximum number of frames before waiting for the next interrupt,
and the packet filters/DoS handling (which sit above the driver) still see
all traffic.
Any suggestions how to handle this better/differently?

> --
> Stephen Hemminger

Gruss / Regards
Christoph Raisch
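As a rough idea of the kind of bounds check meant here (the function name
and surrounding details are made up, this is not the actual ehea patch):

#include <linux/ip.h>
#include <linux/skbuff.h>
#include <linux/tcp.h>

/* Return 1 if the IP/TCP headers found by the header split really lie
 * within the linear part of the received frame, 0 otherwise. */
static int demo_tcp_hdr_in_frame(struct sk_buff *skb,
                                 struct iphdr *iph, struct tcphdr *tcph)
{
        unsigned int ip_hlen = iph->ihl * 4;
        unsigned int tcp_off = (unsigned char *)tcph - skb->data;

        if (ip_hlen < sizeof(struct iphdr))
                return 0;       /* malformed IHL */
        if (tcp_off + sizeof(struct tcphdr) > skb_headlen(skb))
                return 0;       /* TCP header not inside the frame */
        if (tcp_off + tcph->doff * 4 > skb_headlen(skb))
                return 0;       /* TCP options run past the frame */
        return 1;
}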
Re: [PATCH 2.6.19-rc3 1/2] ehea: kzalloc GFP_ATOMIC fix
Andrew Morton <[EMAIL PROTECTED]> wrote on 27.10.2006 05:13:13:
> On Wed, 25 Oct 2006 13:11:42 +0200
> Jan-Bernd Themann <[EMAIL PROTECTED]> wrote:
>
> > This patch fixes kzalloc parameters (GFP_ATOMIC instead of GFP_KERNEL)
>
> why?

These few kzallocs run in atomic context in some situations, therefore
GFP_KERNEL is not a good idea.

Christoph R.
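For illustration only (made-up names, not the ehea code): an allocation
made while a spinlock is held runs in atomic context and must not sleep,
so GFP_KERNEL is not allowed there while GFP_ATOMIC is:

#include <linux/errno.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct demo_entry {
        struct list_head list;
};

struct demo_port {
        spinlock_t lock;
        struct list_head queue;
};

static int demo_queue_entry(struct demo_port *port)
{
        struct demo_entry *entry;
        unsigned long flags;

        spin_lock_irqsave(&port->lock, flags);
        /* GFP_KERNEL could sleep here and would be a bug */
        entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
        if (entry)
                list_add_tail(&entry->list, &port->queue);
        spin_unlock_irqrestore(&port->lock, flags);

        return entry ? 0 : -ENOMEM;
}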
Re: [2.6.19 PATCH 2/7] ehea: pHYP interface
> > Hi,
>
> > I asked SO to recount arguments and we've come to a conclusion that
> > there're in fact 19 args not 18 as the name suggests. 19 args is
> > I-N-S-A-N-E.
>
> It will be partially cleaned up by:
>
> http://ozlabs.org/pipermail/linuxppc-dev/2006-July/024556.html
>
> However it doesn't fix the fact someone has architected such a crazy
> interface :(
>
> Anton

Well, just as background info: this is the wrapper around a single
assembly instruction which calls system firmware and takes 9 CPU registers
for input and 9 CPU registers for output parameters.
This definition by the platform architecture won't change in the near
future, but the good news is that with Anton's change the wrapper will
look much nicer.

Gruss / Regards . . . Christoph R
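For illustration, a rough sketch of what such a wrapper looks like on top
of the plpar_hcall9() interface added by the cleanup referenced above.
The opcode H_DEMO_QUERY and the output names are made up; a real caller
packs 9 input registers and unpacks up to 9 output registers:

#include <asm/hvcall.h>    /* plpar_hcall9(), PLPAR_HCALL9_BUFSIZE */

#define H_DEMO_QUERY    0x3FC   /* made-up opcode, for illustration only */

static long demo_h_query(unsigned long adapter_handle,
                         unsigned long resource_handle,
                         unsigned long *out_handle,
                         unsigned long *out_token)
{
        unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
        long rc;

        rc = plpar_hcall9(H_DEMO_QUERY, retbuf,
                          adapter_handle, resource_handle,
                          0, 0, 0, 0, 0, 0, 0);  /* unused input registers */

        *out_handle = retbuf[0];
        *out_token  = retbuf[1];
        return rc;
}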
Re: [2.6.19 PATCH 3/7] ehea: queue management
> You should really do some measurements to see what the minimal
> queue sizes are that can get you optimal throughput.
>
> Arnd <><

We did. And as always in performance tuning... one size fits all is
unfortunately not the correct answer.
Therefore we'll leave that open to the user, as most other new Ethernet
drivers do as well.

Regards . . . Christoph R
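A minimal sketch of what "leaving it open to the user" can look like via
module parameters; the names and defaults are illustrative only, not the
actual ehea parameters:

#include <linux/module.h>
#include <linux/moduleparam.h>

static int sq_entries = 4096;   /* send queue entries */
static int rq_entries = 4096;   /* receive queue entries */

module_param(sq_entries, int, 0444);
MODULE_PARM_DESC(sq_entries, "Number of send queue entries");
module_param(rq_entries, int, 0444);
MODULE_PARM_DESC(rq_entries, "Number of receive queue entries");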
RE: [PATCH 4/6] ehea: header files
"Jenkins, Clive" wrote on 15.08.2006 12:53:05: > > > You mean the eHEA has its own concept of page size? Separate from > the > > > page size used by the MMU? > > > > > > > yes, the eHEA currently supports only 4K pages for queues > > In that case, I suggest use the kernel's page size, but add a > compile-time > check, and quit with an error message if driver does not support it. eHEA does support other page sizes than 4k, but the HW interface expects to see 4k pages The adaption is done in the device driver, therefore we have a seperate 4k define. Regards . . . Christoph R. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: EHEA: Is there a size limit below 80K for patches?
Well, now I'm confused... two people, two opinions.
Here's a URL for a complete tarball, "sharing" the download location with
our other driver:
http://prdownloads.sourceforge.net/ibmehcad/ehea_EHEA_0002.tgz
We have been waiting for a sourceforge project for 9 days now to put out a
tgz, and it looks like many other people are getting upset by their
response time right now.
So what's the "right"/preferred way to proceed?

Christoph R.

> We're currently developing a new Ethernet device driver for a 10G IBM
> chip for System p. (ppc64)
>
> A later version of the driver should end up in the mainline kernel.
> How should we proceed to get first comments by the community?
> Either post this code as a patch to netdev or

yes

> put a full tarball on for example sourceforge?

nope.  Please read and observe: Documentation/SubmittingPatches
and Section 3 of it, References, for other sources of
expectations/requirements.
The -mm tree also contains Documentation/SubmitChecklist that you may find
useful.

---
~Randy

Jeff Garzik <[EMAIL PROTECTED]> wrote on 08.06.2006 13:34:36:
> Jan-Bernd Themann wrote:
> > Hello,
> >
> > we tried two times to send a patch set. In both cases the second
> > (largest) patch got lost. The first one was a bit above 100k, the
> > second one we tried was like 75K.
> >
> > Any idea what might be the problem?
>
> It might be size, or tripping a spam filter.
>
> For a new driver, do the sane thing and just post a URL.
>
> Jeff
new driver for IBM ethernet chip
We're currently developing a new Ethernet device driver for a 10G IBM chip
for System p. (ppc64)

A later version of the driver should end up in the mainline kernel.
How should we proceed to get first comments by the community?
Either post this code as a patch to netdev, or
put a full tarball on, for example, sourceforge?

Gruss / Regards . . . Christoph Raisch

christoph raisch, HCAD teamlead, IODF2 (d/3627), ibm boeblingen lab,
phone: (+49/0)7031-16 4584, fax: -16 2042, loc: 71032-05-003,
internet: [EMAIL PROTECTED]