Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
Eric W. Biederman writes:

> If you are doing a real time task you don't want to be very close
> to your performance envelope. If you are hitting the performance
> envelope any small hiccup will cause you to miss your deadline,
> and close to your performance envelope hiccups are virtually certain.
>
> Pushing the machine just 5% slower should get everything going
> with multiple pages, and you wouldn't be pushing the performance
> envelope, so your machine can compensate for the occasional hiccup.
>
>> The data stream is fat and relentless.
>
> So you add another node if your current nodes can't handle the load
> without using giant physical areas of memory, rather than attempting
> to redesign the operating system. Much more cost effective.

Nodes can be wicked expensive. :-) Pushing the performance envelope is
important when you want to sell lots of systems.

Radar is a similar computational task, with the added need to reduce
space and weight requirements. It's not OK to be 5% more expensive,
bulky, and heavy. Also the Airplane Principle: more nodes means more
big failures.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
Jes Sorensen writes:

> Albert D Cahalan <[EMAIL PROTECTED]> writes:
>
> [about using huge physical allocations for number crunching]
>
>> 2. Programming a DMA controller with multiple addresses isn't
>>    as fast as programming it with one.
>
> LOL
>
> Consider that allocating the larger block of memory is going
> to take a lot longer than it will take for the DMA engine to
> read the scatter/gather table entries and fetch a new address
> word now and then.

Say it takes a whole minute to allocate the memory. It wouldn't of
course, because you'd allocate memory at boot, but anyway... Then the
app runs, using that memory, for a multi-hour surgery. The allocation
happens once; the inter-node DMA transfers occur dozens or hundreds of
times per second.
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
> "Albert" == Albert D Cahalan <[EMAIL PROTECTED]> writes: >> bigmem is 'last resort' stuff. I'd much rather it is as now a >> seperate allocator so you actually have to sit and think and decide >> to give up on kmalloc/vmalloc/better algorithms and only use it >> when the hardware sucks Albert> It isn't just for sucky hardware. It is for performance too. Albert> 1. Linux isn't known for cache coloring ability. Even if it Albert> was, users want to take advantage of large pages or BAT Albert> registers to reduce TLB miss costs. (that is, mapping such Albert> areas into a process is needed... never mind security for now) Albert> 2. Programming a DMA controller with multiple addresses isn't Albert> as fast as programming it with one. LOL Consider that allocating the larger block of memory is going to take a lot longer than it will take for the DMA engine to read the scatter/gather table entries and fetch a new address word now and then. Jes - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
"Albert D. Cahalan" <[EMAIL PROTECTED]> writes: > > bigmem is 'last resort' stuff. I'd much rather it is as now a > > seperate allocator so you actually have to sit and think and > > decide to give up on kmalloc/vmalloc/better algorithms and > > only use it when the hardware sucks > > It isn't just for sucky hardware. It is for performance too. > 1. Linux isn't known for cache coloring ability. Most hardware doesn't need it. It might help a little but not much. >Even if it was, >users want to take advantage of large pages or BAT registers >to reduce TLB miss costs. (that is, mapping such areas into >a process is needed... never mind security for now) I think the minor cost incurred by uniform size is well made up for by reliable memory management, and avoidance of swapping, and needing less total ram. Besides the fact I don't see large physical areas of memory being more than a marginal performance gain. > 2. Programming a DMA controller with multiple addresses isn't >as fast as programming it with one. Garbage collecting is theoretically more efficient than explicit memory management too. But seriously I doubt that several pages have significantly more overhead than a giant burst, per transfer. > Consider what happens when you have the ability to make one > compute node DMA directly into the physical memory of another. > With a large block of physical memory, you only need to have > the destination node give the writer a single physical memory > address to send the data to. With loose pages, the destination > has to transmit a great big list. That might be 30 thousand! Hmm, queuing up enough data for a second at a time seems a little excessive. And with a 128M chunk... your system can't do good memory management at all. > The point of all this is to crunch data as fast as possible, > with Linux mostly getting out of the way. Perhaps you want > to generate real-time high-resolution video of a human heart > as it beats inside somebody. 
You process raw data (audio, X-ray, > magnetic resonance, or whatever) on one group of processors, > then hand off the data to another group of processors for the > rendering task. Actually there might be many stages. Playing > games with individual pages will cut into your performance. If you are doing a real time task you don't want to very close to your performance envelope. If you are hitting the performance envelope any small hiccup will cause you to miss your deadline, and close to your performance envelope hiccups are virtually certain. Pushing the machine just 5% slower should get everything going with multiple pages, and you wouldn't be pushing the performance envelope so your machine can compensate for the occasional hiccup. > The data stream is fat and relentless. So you add another node if your current nodes can't handle the load without using giant physical areas of memory. Attempt to redesign the operating system. Much more cost effective. Eric - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
> bigmem is 'last resort' stuff. I'd much rather it is as now a
> separate allocator so you actually have to sit and think and
> decide to give up on kmalloc/vmalloc/better algorithms and
> only use it when the hardware sucks

It isn't just for sucky hardware. It is for performance too.

1. Linux isn't known for cache coloring ability. Even if it was,
   users want to take advantage of large pages or BAT registers
   to reduce TLB miss costs. (that is, mapping such areas into
   a process is needed... never mind security for now)

2. Programming a DMA controller with multiple addresses isn't
   as fast as programming it with one.

Consider what happens when you have the ability to make one compute
node DMA directly into the physical memory of another. With a large
block of physical memory, you only need to have the destination node
give the writer a single physical memory address to send the data to.
With loose pages, the destination has to transmit a great big list.
That might be 30 thousand!

The point of all this is to crunch data as fast as possible, with
Linux mostly getting out of the way. Perhaps you want to generate
real-time high-resolution video of a human heart as it beats inside
somebody. You process raw data (audio, X-ray, magnetic resonance, or
whatever) on one group of processors, then hand off the data to
another group of processors for the rendering task. Actually there
might be many stages. Playing games with individual pages will cut
into your performance. The data stream is fat and relentless.
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
On Fri, Dec 22, 2000 at 10:39:43PM +0100, Erik Mouw wrote:
> On Fri, Dec 22, 2000 at 02:54:50PM -0700, Jeff V. Merkey wrote:
> > Having a 1 Gigabyte per second fat pipe that runs over a parallel bus
> > fabric with a standard PCI card that costs @ $500 and can run LVS
> > and TUX at high speeds would be for the common good, particularly since
> > NT and W2K both have implementations of Dolphin SCI that allow them
> > to exploit this hardware.
>
> I'm just wondering how you are going to do 1 Gbyte per second when you
> still have to get the data through a PCI bus to that card. In theory,
> standard PCI can do 133 Mbyte/s, but only when you're very lucky to be
> able to burst large chunks of data. OK, 64 bit PCI at 66 MHz should
> quadruple the throughput, but that's still not enough for 1 Gbyte/s.

The fabric supports this data rate. PCI cards are limited to about
130 MB/s, but multiple nodes all running at the same time could
generate this much traffic.

Jeff

> Erik
>
> --
> J.A.K. (Erik) Mouw, Information and Communication Theory Group, Department
> of Electrical Engineering, Faculty of Information Technology and Systems,
> Delft University of Technology, PO BOX 5031, 2600 GA Delft, The Netherlands
> Phone: +31-15-2783635  Fax: +31-15-2781843  Email: [EMAIL PROTECTED]
> WWW: http://www-ict.its.tudelft.nl/~erik/
Re: NUMA and SCI [was Re: bigphysarea support in 2.2.19 and 2.4.0 kernels]
On Fri, Dec 22, 2000 at 11:37:29AM -0800, Tim Wright wrote:

I have been working with SCI since 1994. The people who own Dolphin
and the SCI chipsets also own TRG. We dropped work on the P6 ccNUMA
cards several years back because Intel was convinced that
shared-nothing was the way to go (and it is). However, SCI's ability
to create explicit sharing makes it the fastest shared-nothing
interface around for message passing (go figure).

I think we do need some better APIs. Grab the source at my FTP server,
and I'd love any input you could provide.

Thanks, :-)

Jeff

> Hi Jeff,
>
> On Fri, Dec 22, 2000 at 11:11:05AM -0700, Jeff V. Merkey wrote:
> [...]
> > SCI allows machines to create windows of shared memory across a cluster
> > of nodes, and at 1 Gigabyte-per-second (Gigabyte, not gigabit). I am
> > putting a sockets interface into the drivers so Apache, LVS, and
> > Piranha can use these very high speed adapters for a clustered web
> > server. Our M2FS clustered file system is also being architected
> > to use these cards.
>
> You're probably aware of this, but SCI allows a lot more than the creation
> of windows of shared memory. The IBM NUMA-Q machines (what was Sequent) use
> the SCI interconnect to build a single-system-image machine with all memory
> visible from all "nodes". In fact, all the commercial NUMA machines of which
> I am aware have this property (all nodes see and can address all memory). The
> non-uniform part of NUMA comes from the potentially differing latency and
> speed of different parts of memory (local vs remote in this case).
> AFAIK, the work that Kanoj Sarcar has been doing is to enable such machines.
>
> It sounds like you have a different requirement of very high-speed shared
> memory between different nodes that can be mapped and unmapped as required.
> Do I understand this correctly? That would make your requirements somewhat
> orthogonal to the requirements those of us with NUMA architectures have.
>
> > I will post the source code for the SCI cards at vger.timpanogas.org
> > and if you have time, please download this code and take a look at
> > how we are using the bigphysarea APIs to create these windows across
> > machines. The current NUMA support in Linux is somewhat slim, and
> > I would like to use established APIs to do this if possible.
>
> See above. It may be that you need different APIs anyway.
>
> Regards,
>
> Tim
>
> --
> Tim Wright - [EMAIL PROTECTED] or [EMAIL PROTECTED] or [EMAIL PROTECTED]
> IBM Linux Technology Center, Beaverton, Oregon
> "Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
NUMA and SCI [was Re: bigphysarea support in 2.2.19 and 2.4.0 kernels]
Hi Jeff,

On Fri, Dec 22, 2000 at 11:11:05AM -0700, Jeff V. Merkey wrote:
[...]
> SCI allows machines to create windows of shared memory across a cluster
> of nodes, and at 1 Gigabyte-per-second (Gigabyte, not gigabit). I am
> putting a sockets interface into the drivers so Apache, LVS, and
> Piranha can use these very high speed adapters for a clustered web
> server. Our M2FS clustered file system is also being architected
> to use these cards.

You're probably aware of this, but SCI allows a lot more than the
creation of windows of shared memory. The IBM NUMA-Q machines (what
was Sequent) use the SCI interconnect to build a single-system-image
machine with all memory visible from all "nodes". In fact, all the
commercial NUMA machines of which I am aware have this property (all
nodes see and can address all memory). The non-uniform part of NUMA
comes from the potentially differing latency and speed of different
parts of memory (local vs remote in this case). AFAIK, the work that
Kanoj Sarcar has been doing is to enable such machines.

It sounds like you have a different requirement of very high-speed
shared memory between different nodes that can be mapped and unmapped
as required. Do I understand this correctly? That would make your
requirements somewhat orthogonal to the requirements those of us with
NUMA architectures have.

> I will post the source code for the SCI cards at vger.timpanogas.org
> and if you have time, please download this code and take a look at
> how we are using the bigphysarea APIs to create these windows across
> machines. The current NUMA support in Linux is somewhat slim, and
> I would like to use established APIs to do this if possible.

See above. It may be that you need different APIs anyway.

Regards,

Tim

--
Tim Wright - [EMAIL PROTECTED] or [EMAIL PROTECTED] or [EMAIL PROTECTED]
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
On Fri, Dec 22, 2000 at 08:21:37PM +0100, Andi Kleen wrote:
> On Fri, Dec 22, 2000 at 11:35:30AM -0700, Jeff V. Merkey wrote:
> > The real question is how to guarantee that these pages will be contiguous
> > in memory. The slab allocator may also work, but I think there are size
> > constraints on how much I can get in one pass.
>
> You cannot guarantee it after the system has left the bootup stage.
> That's the whole reason why bigphysarea exists.
>
> -Andi

I am wondering why the drivers need such a big contiguous chunk of
memory. For message passing operations, they should not. Some of the
user space libraries appear to need this support. I am going through
this code today attempting to determine if there's a way to reduce
this requirement or map the memory differently.

I am not using these cards for a ccNUMA implementation, although there
are versions of these adapters that can provide this capability, but
for message passing with small windows of coherence between machines,
with push/pull DMA-style behavior for high speed data transfers. 99.9%
of the clustering stuff on Linux uses this model, so this requirement
perhaps can be restructured to be a better fit for Linux.

Just having the patch in the kernel for bigphysarea support would
solve this issue if it could be structured into a form Alan finds
acceptable. Absent this, we need a workaround that's more tailored to
the requirements of Linux apps.

Jeff
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
On Fri, Dec 22, 2000 at 11:35:30AM -0700, Jeff V. Merkey wrote:
> The real question is how to guarantee that these pages will be contiguous
> in memory. The slab allocator may also work, but I think there are size
> constraints on how much I can get in one pass.

You cannot guarantee it after the system has left the bootup stage.
That's the whole reason why bigphysarea exists.

-Andi
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
On Fri, Dec 22, 2000 at 11:11:05AM -0700, Jeff V. Merkey wrote:
> On Fri, Dec 22, 2000 at 09:39:28AM +0100, Pauline Middelink wrote:
> > On Thu, 21 Dec 2000 around 15:53:39 -0700, Jeff V. Merkey wrote:

Pauline/Alan,

I have been studying the SCI code and I think I may have a workaround
that won't need the patch, but it will require pinning large chunks of
memory with the existing __get_free_pages() functions. I will need to
make the changes and test them. This change will require significant
testing. I will ping you guys if I have questions.

If we can reach a compromise on the bigphysarea patch, it would be
great, but absent this, I will be looking at this alternate solution.
The real question is how to guarantee that these pages will be
contiguous in memory. The slab allocator may also work, but I think
there are size constraints on how much I can get in one pass.

:-)

Jeff
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
On Fri, Dec 22, 2000 at 09:39:28AM +0100, Pauline Middelink wrote:
> On Thu, 21 Dec 2000 around 15:53:39 -0700, Jeff V. Merkey wrote:
> >
> > Alan,
> >
> > I am looking over the 2.4 bigphysarea patch, and I think I agree
> > there needs to be a better approach. It's a messy hack -- I agree.
>
> Please explain further.
> Just leaving it at that is not nice. What is messy?
> The implementation? The API?
>
> If you have a better solution for allocating big chunks of
> physically contiguous memory at different stages during the
> runtime of the kernel, I would be very interested.
>
> (Alan: bootmem allocation just won't do. I need that memory
> in modules which get potentially loaded/unloaded, hence a
> wrapper interface for allowing access to a bootmem allocated
> piece of memory)
>
> And the API? That API was set a long time ago, luckily not by me :)
> Though I don't see the real problem. It allows allocation and
> freeing of chunks of memory. Period. That's all it's supposed to do.
> Or do you want it rolled into kmalloc? So GFP_DMA with size > 128K
> would take memory from this? That would mean a much more intrusive
> patch in very sensitive and rapidly changing parts of the kernel
> (2.2->2.4 speaking)...
>
> With kind regards,
> Pauline Middelink

Pauline,

Can we put together a patch that meets Alan's requirements and get it
into the kernel proper? We have taken on a project from Dolphin to
merge the high speed Dolphin SCI interconnect drivers into the kernel
proper, and obviously, it's not possible to do so if the drivers are
dependent on this patch. I can send you the driver sources for the SCI
cards, at least the portions that depend on this patch, and would
appreciate any guidance you could provide on a better way to allocate
memory.

SCI allows machines to create windows of shared memory across a
cluster of nodes, and at 1 Gigabyte-per-second (Gigabyte, not
gigabit). I am putting a sockets interface into the drivers so Apache,
LVS, and Piranha can use these very high speed adapters for a
clustered web server. Our M2FS clustered file system is also being
architected to use these cards.

I will post the source code for the SCI cards at vger.timpanogas.org
and if you have time, please download this code and take a look at how
we are using the bigphysarea APIs to create these windows across
machines. The current NUMA support in Linux is somewhat slim, and I
would like to use established APIs to do this if possible.

:-)

Jeff

> --
> GPG Key fingerprint = 2D5B 87A7 DDA6 0378 5DEA BD3B 9A50 B416 E2D0 C3C2
> For more details look at my website http://www.polyware.nl/~middelink
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
> (Alan: bootmem allocation just won't do. I need that memory
> in modules which get potentially loaded/unloaded, hence a
> wrapper interface for allowing access to a bootmem allocated
> piece of memory)

Yes, I pointed him at you for 2.4test because you had the code sitting
on top of bootmem, which is the right way to do it.

> Or do you want it rolled into kmalloc? So GFP_DMA with size > 128K
> would take memory from this? That would mean a much more intrusive
> patch in very sensitive and rapidly changing parts of the kernel
> (2.2->2.4 speaking)...

bigmem is 'last resort' stuff. I'd much rather it stay, as now, a
separate allocator so you actually have to sit and think and decide
to give up on kmalloc/vmalloc/better algorithms, and only use it
when the hardware sucks.
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
On Thu, 21 Dec 2000 around 15:53:39 -0700, Jeff V. Merkey wrote:
>
> Alan,
>
> I am looking over the 2.4 bigphysarea patch, and I think I agree
> there needs to be a better approach. It's a messy hack -- I agree.

Please explain further.
Just leaving it at that is not nice. What is messy?
The implementation? The API?

If you have a better solution for allocating big chunks of physically
contiguous memory at different stages during the runtime of the
kernel, I would be very interested.

(Alan: bootmem allocation just won't do. I need that memory in modules
which get potentially loaded/unloaded, hence a wrapper interface for
allowing access to a bootmem allocated piece of memory)

And the API? That API was set a long time ago, luckily not by me :)
Though I don't see the real problem. It allows allocation and freeing
of chunks of memory. Period. That's all it's supposed to do. Or do you
want it rolled into kmalloc? So GFP_DMA with size > 128K would take
memory from this? That would mean a much more intrusive patch in very
sensitive and rapidly changing parts of the kernel (2.2->2.4
speaking)...

With kind regards,
Pauline Middelink
--
GPG Key fingerprint = 2D5B 87A7 DDA6 0378 5DEA BD3B 9A50 B416 E2D0 C3C2
For more details look at my website http://www.polyware.nl/~middelink
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
On Thu, 21 Dec 2000 around 15:53:39 -0700, Jeff V. Merkey wrote: Alan, I am looking over the 2.4 bigphysarea patch, and I think I agree there needs to be a better approach. It's a messy hack -- I agree. Please explain further. Just leaving it at that is not nice. What is messy? The implementation? The API? If you have a better solutions for allocating big chunks of physical continious memory at different stages during the runtime of the kernel, i would be very interesseted. (Alan: bootmem allocation just won't do. I need that memory in modules which get potentially loaded/unloaded, hence a wrapper interface for allowing access to a bootmem allocated piece of memory) And the API? That API was set a long time ago, luckely not by me :) Though I dont see the real problem. It allows allocation and freeing of chunks of memory. Period. Its all its suppose to do. Or do you want it rolled in kmalloc? So GFP_DMA with size128K would take memory from this? That would mean a much more intrusive patch in very sensitive and rapidly changing parts of the kernel (2.2-2.4 speaking)... Met vriendelijke groet, Pauline Middelink -- GPG Key fingerprint = 2D5B 87A7 DDA6 0378 5DEA BD3B 9A50 B416 E2D0 C3C2 For more details look at my website http://www.polyware.nl/~middelink - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
(Alan: bootmem allocation just won't do. I need that memory in modules which get potentially loaded/unloaded, hence a wrapper interface for allowing access to a bootmem allocated piece of memory) Yes, I pointed him at you for 2.4test because you had the code sitting on top of bootmem which is the right way to do it. Or do you want it rolled in kmalloc? So GFP_DMA with size128K would take memory from this? That would mean a much more intrusive patch in very sensitive and rapidly changing parts of the kernel (2.2-2.4 speaking)... bigmem is 'last resort' stuff. I'd much rather it is as now a seperate allocator so you actually have to sit and think and decide to give up on kmalloc/vmalloc/better algorithms and only use it when the hardware sucks - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
On Fri, Dec 22, 2000 at 09:39:28AM +0100, Pauline Middelink wrote: On Thu, 21 Dec 2000 around 15:53:39 -0700, Jeff V. Merkey wrote: Alan, I am looking over the 2.4 bigphysarea patch, and I think I agree there needs to be a better approach. It's a messy hack -- I agree. Please explain further. Just leaving it at that is not nice. What is messy? The implementation? The API? If you have a better solutions for allocating big chunks of physical continious memory at different stages during the runtime of the kernel, i would be very interesseted. (Alan: bootmem allocation just won't do. I need that memory in modules which get potentially loaded/unloaded, hence a wrapper interface for allowing access to a bootmem allocated piece of memory) And the API? That API was set a long time ago, luckely not by me :) Though I dont see the real problem. It allows allocation and freeing of chunks of memory. Period. Its all its suppose to do. Or do you want it rolled in kmalloc? So GFP_DMA with size128K would take memory from this? That would mean a much more intrusive patch in very sensitive and rapidly changing parts of the kernel (2.2-2.4 speaking)... Met vriendelijke groet, Pauline Middelink Pauline, Can we put together a patch that meets Alan's requirements and get it into the kernel proper. We have taken on a project from Dolphin to merge the high speed Dolphin SCI interconnect drivers into the kernel proper, and obviously, it's not possible to do so if the drivers are dependent on this patch. I can send you the driver sources for the SCI cards, at least the portions that depend on this patch, and would appreciate any guidance you could provide on a better way to allocate memory. SCI allows machines to create windows of shared memory across a cluster of nodes, and at 1 Gigabyte-per-second (Gigabyte not gigabit). I am putting a sockets interface into the drivers so Apache, LVS, and Pirahna can use these very high speed adapters for a clustered web server. 
Our M2FS clustered file system also is being architected to use these cards. I will post the source code for the SCI cards at vger.timpanogas.org and if you have time, please download this code and take a look at how we are using the bigphysarea APIs to create these windows accros machines. The current NUMA support in Linux is somewhat slim, and I would like to use established APIs to do this if possible. :-) Jeff -- GPG Key fingerprint = 2D5B 87A7 DDA6 0378 5DEA BD3B 9A50 B416 E2D0 C3C2 For more details look at my website http://www.polyware.nl/~middelink - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
On Fri, Dec 22, 2000 at 11:11:05AM -0700, Jeff V. Merkey wrote:
> On Fri, Dec 22, 2000 at 09:39:28AM +0100, Pauline Middelink wrote:
> > On Thu, 21 Dec 2000 around 15:53:39 -0700, Jeff V. Merkey wrote:

Pauline/Alan,

I have been studying the SCI code and I think I may have a workaround
that won't need the patch, but it will require pinning large chunks of
memory with the existing __get_free_pages() functions. I will need to
make the changes and test them. This change will require significant
testing. I will ping you guys if I have questions. If we can reach a
compromise on the bigphysarea patch, it would be great, but absent this,
I will be looking at this alternate solution. The real question is how
to guarantee that these pages will be contiguous in memory. The slab
allocator may also work, but I think there are size constraints on how
much I can get in one pass.

:-)

Jeff

> --
> GPG Key fingerprint = 2D5B 87A7 DDA6 0378 5DEA BD3B 9A50 B416 E2D0 C3C2
> For more details look at my website http://www.polyware.nl/~middelink
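A note on the __get_free_pages() route: it takes a buddy-allocator order,
not a byte count, so a large request has to be rounded up to a power-of-two
number of pages, and the allocator caps the order it will attempt (around
2 MB worth of 4 KB pages in kernels of this era, if memory serves). A sketch
of the rounding, mirroring what the kernel's get_order() helper computes
(user-space model, 4 KB pages assumed):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Round a byte count up to a buddy-allocator order: the smallest n
 * such that (PAGE_SIZE << n) >= size.  This is what has to be passed
 * to __get_free_pages(), and it is why a 5 MB pinned buffer cannot be
 * had in one call when the order is capped. */
unsigned int order_for(unsigned long size)
{
    unsigned int order = 0;
    while ((PAGE_SIZE << order) < size)
        order++;
    return order;
}
```

So a 128 KB request becomes an order-5 (32-page) allocation, and anything
between power-of-two sizes wastes up to nearly half the block.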
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
On Fri, Dec 22, 2000 at 11:35:30AM -0700, Jeff V. Merkey wrote:
> The real question is how to guarantee that these pages will be
> contiguous in memory. The slab allocator may also work, but I think
> there are size constraints on how much I can get in one pass.

You cannot guarantee it after the system has left the bootup stage.
That's the whole reason why bigphysarea exists.

-Andi
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
On Fri, Dec 22, 2000 at 08:21:37PM +0100, Andi Kleen wrote:
> On Fri, Dec 22, 2000 at 11:35:30AM -0700, Jeff V. Merkey wrote:
> > The real question is how to guarantee that these pages will be
> > contiguous in memory. The slab allocator may also work, but I think
> > there are size constraints on how much I can get in one pass.
>
> You cannot guarantee it after the system has left the bootup stage.
> That's the whole reason why bigphysarea exists.
>
> -Andi
>
> I am wondering why the drivers need such a big contiguous chunk of
> memory. For message passing operations, they should not.

Some of the user space libraries appear to need this support. I am going
through this code today attempting to determine if there's a way to
reduce this requirement or map the memory differently. I am not using
these cards for a ccNUMA implementation (although there are versions of
these adapters that can provide this capability), but for message
passing with small windows of coherence between machines, with push/pull
DMA-style behavior for high speed data transfers. 99.9% of the
clustering stuff on Linux uses this model, so this requirement can
perhaps be restructured to be a better fit for Linux. Just having the
patch in the kernel for bigphysarea support would solve this issue if it
could be structured into a form Alan finds acceptable. Absent this, we
need a workaround that's more tailored to the requirements of Linux
apps.

Jeff
NUMA and SCI [was Re: bigphysarea support in 2.2.19 and 2.4.0 kernels]
Hi Jeff,

On Fri, Dec 22, 2000 at 11:11:05AM -0700, Jeff V. Merkey wrote:
[...]
> SCI allows machines to create windows of shared memory across a
> cluster of nodes at 1 gigabyte per second (gigabyte, not gigabit). I
> am putting a sockets interface into the drivers so Apache, LVS, and
> Piranha can use these very high speed adapters for a clustered web
> server. Our M2FS clustered file system is also being architected to
> use these cards.

You're probably aware of this, but SCI allows a lot more than the
creation of windows of shared memory. The IBM NUMA-Q machines (what was
Sequent) use the SCI interconnect to build a single-system-image machine
with all memory visible from all "nodes". In fact, all the commercial
NUMA machines of which I am aware have this property (all nodes see and
can address all memory). The non-uniform part of NUMA comes from the
potentially differing latency and speed of different parts of memory
(local vs. remote in this case). AFAIK, the work that Kanoj Sarcar has
been doing is to enable such machines.

It sounds like you have a different requirement of very high-speed
shared memory between different nodes that can be mapped and unmapped
as required. Do I understand this correctly? That would make your
requirements somewhat orthogonal to the requirements those of us with
NUMA architectures have.

> I will post the source code for the SCI cards at vger.timpanogas.org
> and if you have time, please download this code and take a look at how
> we are using the bigphysarea APIs to create these windows across
> machines. The current NUMA support in Linux is somewhat slim, and I
> would like to use established APIs to do this if possible.

See above. It may be that you need different APIs anyway.
Regards,

Tim

--
Tim Wright - [EMAIL PROTECTED] or [EMAIL PROTECTED] or [EMAIL PROTECTED]
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
Re: NUMA and SCI [was Re: bigphysarea support in 2.2.19 and 2.4.0 kernels]
On Fri, Dec 22, 2000 at 11:37:29AM -0800, Tim Wright wrote:

I have been working with SCI since 1994. The people who own Dolphin and
the SCI chipsets also own TRG. We dropped work on the P6 ccNUMA cards
several years back because Intel was convinced that shared-nothing was
the way to go (and it is). However, SCI's ability to create explicit
sharing makes it the fastest shared-nothing interface around for message
passing (go figure). I think we do need some better APIs. Grab the
source at my FTP server, and I'd love any input you could provide.

Thanks,

:-)

Jeff

> Hi Jeff,
>
> On Fri, Dec 22, 2000 at 11:11:05AM -0700, Jeff V. Merkey wrote:
> [...]
> > SCI allows machines to create windows of shared memory across a
> > cluster of nodes at 1 gigabyte per second (gigabyte, not gigabit). I
> > am putting a sockets interface into the drivers so Apache, LVS, and
> > Piranha can use these very high speed adapters for a clustered web
> > server. Our M2FS clustered file system is also being architected to
> > use these cards.
>
> You're probably aware of this, but SCI allows a lot more than the
> creation of windows of shared memory. The IBM NUMA-Q machines (what
> was Sequent) use the SCI interconnect to build a single-system-image
> machine with all memory visible from all "nodes". In fact, all the
> commercial NUMA machines of which I am aware have this property (all
> nodes see and can address all memory). The non-uniform part of NUMA
> comes from the potentially differing latency and speed of different
> parts of memory (local vs. remote in this case). AFAIK, the work that
> Kanoj Sarcar has been doing is to enable such machines.
>
> It sounds like you have a different requirement of very high-speed
> shared memory between different nodes that can be mapped and unmapped
> as required. Do I understand this correctly? That would make your
> requirements somewhat orthogonal to the requirements those of us with
> NUMA architectures have.
> > I will post the source code for the SCI cards at vger.timpanogas.org
> > and if you have time, please download this code and take a look at
> > how we are using the bigphysarea APIs to create these windows across
> > machines. The current NUMA support in Linux is somewhat slim, and I
> > would like to use established APIs to do this if possible.
>
> See above. It may be that you need different APIs anyway.
>
> Regards,
>
> Tim
>
> --
> Tim Wright - [EMAIL PROTECTED] or [EMAIL PROTECTED] or [EMAIL PROTECTED]
> IBM Linux Technology Center, Beaverton, Oregon
> "Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
On Fri, Dec 22, 2000 at 10:39:43PM +0100, Erik Mouw wrote:
> On Fri, Dec 22, 2000 at 02:54:50PM -0700, Jeff V. Merkey wrote:
> > Having a 1 gigabyte per second fat pipe that runs over a parallel bus
> > fabric with a standard PCI card that costs about $500 and can run LVS
> > and TUX at high speeds would be for the common good, particularly
> > since NT and W2K both have implementations of Dolphin SCI that allow
> > them to exploit this hardware.
>
> I'm just wondering how you are going to do 1 Gbyte per second when you
> still have to get the data through a PCI bus to that card. In theory,
> standard PCI can do 133 Mbyte/s, but only when you're very lucky to be
> able to burst large chunks of data. OK, 64 bit PCI at 66 MHz should
> quadruple the throughput, but that's still not enough for 1 Gbyte/s.

The fabric supports this data rate. PCI cards are limited to about
130 MB/s, but multiple nodes all running at the same time could generate
this much traffic.

Jeff

> Erik
>
> --
> J.A.K. (Erik) Mouw, Information and Communication Theory Group,
> Department of Electrical Engineering, Faculty of Information Technology
> and Systems, Delft University of Technology, PO BOX 5031, 2600 GA
> Delft, The Netherlands
> Phone: +31-15-2783635  Fax: +31-15-2781843
> Email: [EMAIL PROTECTED]  WWW: http://www-ict.its.tudelft.nl/~erik/
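Erik's figures follow directly from bus width times clock rate. A small
worked sketch (using the nominal 33 MHz clock, which gives 132 rather than
the usual 133 Mbyte/s quoted for the actual 33.33 MHz bus; theoretical
peaks only, before protocol overhead):

```c
/* Theoretical peak PCI throughput in MB/s: bus width in bytes times
 * clock in MHz.  Real transfers see considerably less, which is why
 * ~130 MB/s is the practical ceiling quoted for 32-bit/33 MHz PCI. */
unsigned long pci_peak_mbs(unsigned int bytes_wide, unsigned int mhz)
{
    return (unsigned long)bytes_wide * mhz;
}
```

Even the 64-bit/66 MHz case (8 bytes x 66 MHz = 528 MB/s peak) falls well
short of 1 GB/s per card, which is why the 1 GB/s number can only describe
the aggregate fabric, not any single PCI endpoint.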
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
> bigmem is 'last resort' stuff. I'd much rather it is as now a separate
> allocator so you actually have to sit and think and decide to give up
> on kmalloc/vmalloc/better algorithms and only use it when the hardware
> sucks

It isn't just for sucky hardware. It is for performance too.

1. Linux isn't known for cache coloring ability. Even if it were, users
want to take advantage of large pages or BAT registers to reduce TLB
miss costs. (That is, mapping such areas into a process is needed...
never mind security for now.)

2. Programming a DMA controller with multiple addresses isn't as fast as
programming it with one. Consider what happens when you have the ability
to make one compute node DMA directly into the physical memory of
another. With a large block of physical memory, you only need to have
the destination node give the writer a single physical memory address to
send the data to. With loose pages, the destination has to transmit a
great big list. That might be 30 thousand entries!

The point of all this is to crunch data as fast as possible, with Linux
mostly getting out of the way. Perhaps you want to generate real-time
high-resolution video of a human heart as it beats inside somebody. You
process raw data (audio, X-ray, magnetic resonance, or whatever) on one
group of processors, then hand off the data to another group of
processors for the rendering task. Actually there might be many stages.
Playing games with individual pages will cut into your performance. The
data stream is fat and relentless.
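The "30 thousand" figure is just the buffer size divided by the page size:
with 4 KB pages, a buffer in the 120 MB range needs on the order of 30,000
scatter/gather entries, versus a single base address for a physically
contiguous block. A sketch of the arithmetic (the 120 MB buffer size is an
illustrative assumption, as is the 4 KB page size):

```c
#define PAGE_SZ 4096UL

/* Number of scatter/gather entries needed to describe a buffer built
 * from loose pages: one entry per page, rounded up.  A physically
 * contiguous block needs exactly one (base address + length). */
unsigned long sg_entries(unsigned long buffer_bytes)
{
    return (buffer_bytes + PAGE_SZ - 1) / PAGE_SZ;
}
```

The receiving node also has to build, transmit, and keep coherent that
whole table before any transfer starts, which is where the per-page cost
bites in a pipeline that restarts transfers many times per second.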
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
Alan,

I am looking over the 2.4 bigphysarea patch, and I think I agree there
needs to be a better approach. It's a messy hack -- I agree.

:-)

Jeff
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
On Thu, Dec 21, 2000 at 09:32:46PM +0000, Alan Cox wrote:
> > A question related to bigphysarea support in the native Linux
> > 2.2.19 and 2.4.0 kernels.
> >
> > I know there are patches for this support, but is it planned for
> > rolling into the kernel by default to support Dolphin SCI and
> > some of the NUMA Clustering adapters? I see it there for some
> > of the video adapters.
>
> bigphysarea is the wrong model for 2.4. The bootmem allocator means
> that drivers could do early claims via the bootmem interface during
> boot up. That would avoid all the cruft.
>
> For 2.2 bigphysarea is a hack, but a necessary add-on patch and not
> one you can redo cleanly as we don't have bootmem.
>
> I believe Pauline Middelink had a patch implementing bigphysarea in
> terms of bootmem.
>
> Alan

Alan,

Thanks for the prompt response. I am merging the Dolphin SCI high speed
interconnect drivers into 2.2.18 and 2.4.0 for our M2FS project, and I
am reviewing the big ugly nasty patch they have, current as of 2.2.13
(really old). I will be looking over the 2.4 tree for a cleaner way to
do what they want. What's in the patch alters the /proc filesystem and
the VM code. I will submit a patch against 2.2.19 and 2.4.0 for this
support for their SCI adapters after I get a handle on it.

:-)

Jeff
Re: bigphysarea support in 2.2.19 and 2.4.0 kernels
> A question related to bigphysarea support in the native Linux
> 2.2.19 and 2.4.0 kernels.
>
> I know there are patches for this support, but is it planned for
> rolling into the kernel by default to support Dolphin SCI and
> some of the NUMA Clustering adapters? I see it there for some
> of the video adapters.

bigphysarea is the wrong model for 2.4. The bootmem allocator means that
drivers could do early claims via the bootmem interface during boot up.
That would avoid all the cruft.

For 2.2 bigphysarea is a hack, but a necessary add-on patch and not one
you can redo cleanly as we don't have bootmem.

I believe Pauline Middelink had a patch implementing bigphysarea in
terms of bootmem.

Alan
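The bootmem approach Alan describes claims memory once, early in boot,
before the buddy allocator takes over, so contiguity is trivially
guaranteed (in the 2.4 kernel this is alloc_bootmem() and friends). A
user-space model of the "reserve early, then refuse late claims" idea,
with all names and sizes hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a boot-time bump allocator: while "boot" lasts, every claim
 * is trivially contiguous because nothing has fragmented the pool yet.
 * After boot_end(), no further claims are allowed -- which is exactly
 * the limitation raised elsewhere in this thread for modules that load
 * and unload after boot. */
#define BOOT_POOL (1024 * 1024)

static uint8_t pool[BOOT_POOL];
static size_t  cursor;
static int     boot_done;

void *boot_alloc(size_t size)
{
    if (boot_done || cursor + size > BOOT_POOL)
        return NULL;              /* too late, or pool exhausted */
    void *p = &pool[cursor];
    cursor += size;               /* bump; no free list, no fragmentation */
    return p;
}

void boot_end(void) { boot_done = 1; }
```

This is why bigphysarea-style code survives as a wrapper: the claim still
happens at boot, and the wrapper only parcels out the claimed region later.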
bigphysarea support in 2.2.19 and 2.4.0 kernels
A question related to bigphysarea support in the native Linux 2.2.19 and
2.4.0 kernels.

I know there are patches for this support, but is it planned for rolling
into the kernel by default to support Dolphin SCI and some of the NUMA
Clustering adapters? I see it there for some of the video adapters.

Is this planned for the kernel proper, or will it remain a patch? At the
rate the VM and mm subsystems tend to get updated, I am wondering if
there's a current version out for this.

Jeff