Re: [Boston.pm] transposing rows and columns in a CSV file
On Mon, 15 Nov 2004 18:46:15 -0500 (EST), Dan Sugalski <[EMAIL PROTECTED]> wrote: > On Mon, 15 Nov 2004, Ben Tilly wrote: > > > On Mon, 15 Nov 2004 15:58:11 -0500, Aaron Sherman > > <[EMAIL PROTECTED]> wrote: > > > On Sat, 13 Nov 2004 11:40:25 -0800, Ben Tilly <[EMAIL PROTECTED]> wrote: > > > > On Fri, 12 Nov 2004 23:04:46 -0500, Aaron Sherman <[EMAIL PROTECTED]> > > > > wrote: > > > > > On Fri, 2004-11-12 at 13:22 -0800, Ben Tilly wrote: > [Massive snippage of two ships passing in the night] Heh. :-) > > > > In Perl I'd expect it to be possible but fragile. If Parrot could make > > > > it possible and not fragile, that would be great. > > > > > > In parrot it's quite robust. Parrot supports "buffers" as core PMC > > > types. A buffer can refer to any part of memory with any read-only or > > > copy-on-write semantics you like. > > > > That would be nice. > > > > Incidentally will Parrot also support efficiently building strings > > incrementally? I like the fact that in Perl 5 it is O($n) to do > > something like: > > > > $string .= "hello" for 1..$n; > > > > In most other languages that is quadratic, and I'm wondering > > what to expect in Perl 6. > > That's not O(n) in Perl 5, it's just smaller than O(n^2). The same's true > for Parrot -- we've got mutable strings and generally over-allocate, so > it's not going to be quadratic time. Neither, though, is it going to be > linear. Expect somewhere in between. I don't have source-code in front of me, but my memory says that when you have to reassign a string in Perl 5, the amount of extra length that you give is proportional to the length of the string. (If that is not how Perl does it, then Perl darned well should!) In that case it really is O(n). What happens is that the total recopying work can be bounded above by a geometric series that converges to something O(n). Everything else is O(n). So the result is O(n). I gave you more details at http://www.perlmonks.org/?node_id=276051. 
(Hey, my math background has to be good for something...) Perl uses the same strategy for other data structures, including growing a hash and on various array operations. Of course the main reason that I'm familiar with this trick is that my 2 minuscule core contributions to Perl were performance enhancements from using this trick in places where it had not been previously used. (i.e. unshift and map.) Unlike you, I don't have any other performance knowledge to confuse me... Cheers, Ben ___ Boston-pm mailing list [EMAIL PROTECTED] http://mail.pm.org/mailman/listinfo/boston-pm
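The geometric-series argument above can be checked with a short simulation. This is an illustrative sketch in Python (not Perl's actual allocator); the 1.25 growth factor and 5-byte chunk size are arbitrary choices, and the point is only that the total recopying work stays proportional to the final length:

```python
# Simulate appending n chunks to a buffer that, on overflow, reallocates
# to ~1.25x the needed size. Total bytes copied stays O(n), matching the
# geometric-series argument: each reallocation copies the current length,
# and those lengths form a geometric series bounded by a constant times n.

def total_copy_cost(n, chunk=5, growth=1.25):
    length = 0      # bytes currently stored
    capacity = 0    # bytes allocated
    copied = 0      # bytes moved by reallocations
    for _ in range(n):
        length += chunk
        if length > capacity:
            copied += length - chunk            # recopy existing contents
            capacity = int(length * growth) + 1  # over-allocate proportionally
    return copied

for n in (1_000, 10_000, 100_000):
    cost = total_copy_cost(n)
    print(n, cost, cost / (chunk_total := 5 * n))  # ratio stays bounded -> O(n)
```

With a constant over-allocation (say, always 16 spare bytes) instead of a proportional one, the same loop becomes quadratic, which is the behavior Ben attributes to most other languages.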
Re: [Boston.pm] transposing rows and columns in a CSV file
On Mon, 15 Nov 2004, Ben Tilly wrote: > On Mon, 15 Nov 2004 15:58:11 -0500, Aaron Sherman > <[EMAIL PROTECTED]> wrote: > > On Sat, 13 Nov 2004 11:40:25 -0800, Ben Tilly <[EMAIL PROTECTED]> wrote: > > > On Fri, 12 Nov 2004 23:04:46 -0500, Aaron Sherman <[EMAIL PROTECTED]> > > > wrote: > > > > On Fri, 2004-11-12 at 13:22 -0800, Ben Tilly wrote: [Massive snippage of two ships passing in the night] > > > In Perl I'd expect it to be possible but fragile. If Parrot could make > > > it possible and not fragile, that would be great. > > > > In parrot it's quite robust. Parrot supports "buffers" as core PMC > > types. A buffer can refer to any part of memory with any read-only or > > copy-on-write semantics you like. > > That would be nice. > > Incidentally will Parrot also support efficiently building strings > incrementally? I like the fact that in Perl 5 it is O($n) to do > something like: > > $string .= "hello" for 1..$n; > > In most other languages that is quadratic, and I'm wondering > what to expect in Perl 6. That's not O(n) in Perl 5, it's just smaller than O(n^2). The same's true for Parrot -- we've got mutable strings and generally over-allocate, so it's not going to be quadratic time. Neither, though, is it going to be linear. Expect somewhere in between. Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk ___ Boston-pm mailing list [EMAIL PROTECTED] http://mail.pm.org/mailman/listinfo/boston-pm
Re: [Boston.pm] transposing rows and columns in a CSV file
On Mon, 15 Nov 2004 15:58:11 -0500, Aaron Sherman <[EMAIL PROTECTED]> wrote: > On Sat, 13 Nov 2004 11:40:25 -0800, Ben Tilly <[EMAIL PROTECTED]> wrote: > > On Fri, 12 Nov 2004 23:04:46 -0500, Aaron Sherman <[EMAIL PROTECTED]> wrote: > > > On Fri, 2004-11-12 at 13:22 -0800, Ben Tilly wrote: > > [...] > > > > Um, mmap does not (well should not - Windows may vary) use any > > > > RAM > > > > You are confusing two issues. "using RAM" is not the same as "allocating > > > process address space". > > > How was I confusing issues? > > Let me demonstrate: > > > What I meant is that calling mmap does > > not use significant amounts of RAM. > > Calling mmap uses NO RAM. It doesn't interact with RAM at all. But > does allocate (potentially huge) amounts of process address space, and > reserves it in such a way that your process can no longer allocate it > for uses like libc's memory allocator (which you access through > functions like malloc). I feel like we are all talking past each other. Let's go back to basics. When I say RAM I mean the physical RAM on the computer. Whether or not that RAM is currently allocated to your process. So if you do something and that makes something else get paged out, then you've used RAM in my view. Whether that RAM is in pages that are attached to your process, or was used by the kernel I still see that as you using RAM. > If you mmap a 3GB file (actually less than 3GB, but I'll use that > number as an example for now) on an x86 linux box and then call > "malloc", you get back a NULL pointer because malloc will fail. This > is actually not quite true. That malloc will likely work because it > will be allocated from some existing page of address space that > malloc's internal page allocator reserved before you called mmap, but > that won't work for long. I'm aware of this and wasn't disputing it. > > (The OS needs some to track > > that the mapping exists, but that should be it.) > > Actually, no. 
The place that mmap is tracked is a) in the file > descriptor table, which is outside of your 3GB process space in > kernel-space and b) in the system page table, which is not in your > address space at all, but in hardware. Where it is tracked doesn't concern me. That it is tracked, does. However I realize that I don't know enough about how the memory management is handled. I would think that this would be dynamic in some way - on creating a process the kernel should need to write very little data, but will then write more later. But I don't know enough to verify that one way or the other. > > Once you actually use > > the data that you mmapped in, file contents will be swapped in, and > > RAM will be taken, but not until then. > > "RAM will be taken" is a meaningless term here. Ignore RAM for > purposes of this conversation. On the one hand you are saying that I'm confused about what I meant by a comment about using RAM, and on the other you are telling me that I am to ignore RAM for the purposes of this conversation, it is meaningless. There is a contradiction there. For discussing what *you* want to talk about it may be meaningless, but for discussing what *I* had been talking about it isn't. And for deciding whether or not I was confused it most definitely isn't. (Perhaps you're confused about what I was talking about?) It appears that you want to discuss what the world looks like to a process. For that I wholeheartedly agree, talking about what is in RAM is generally counterproductive, if the abstraction of virtual memory works, then you should never know or care about what is or is not in RAM. But I was talking about what things lead to resource consumption that could adversely affect a machine which is carrying out a particular computation. For that it matters a great deal whether particular operations are going to cause pages of RAM to be discarded and allocated for something else. 
Because when it comes to actual performance, the abstraction of virtual memory leaks badly. (And what I was saying about resource consumption is that mmap doesn't. Consume in meaningful amounts, that is.) > > As for a 3 GB limit, now that you mention it, I heard something > > about that. But I didn't pay attention since I don't need it right now. > > Suffice to say that your process cannot be larger than 3GB under x86 > Linux. There are extensions, options and hacks if you want to go > larger, but after 3GB it gets very dicey. Are you saying that Linux does not give a user-level API to Intel's addressing extensions? Or that it does but you recommend against using it? > > I've also heard about Intel's large addressing extensions (keep 2GB > > in normal address space, page around the top 2 GB, you get 64 GB > > of addressable memory). I'm curious about how (or if) the two can > > cooperate. > > The ability to re-map memory like this is quite common, and the *OS* > can take advantage of it, but as long as you're on an x86 and using > 32-bit pointers, your one process ca
Re: [Boston.pm] transposing rows and columns in a CSV file
> > BT> I've also heard about Intel's large addressing extensions (keep 2GB > > BT> in normal address space, page around the top 2 GB, you get 64 GB The extension referred to above concerns the 4 gig limit inherent in a 32 bit address space with byte addressable memory. The 4 gig limit is on the amount of physical memory that the CPU can directly address. The virtual address space on x86 also happens to be 32 bits in size. Historically on many x86 Unix-like OSes the upper half of the virtual address space (basically any memory address whose most significant bit was 1) was used by the operating system and was not accessible to application code. This meant that a single process could only usefully address 2 gigabytes of virtual memory at a time. For some time now it has been possible to build Linux kernels where the amount of virtual memory available to the apps is 3 gigabytes, with the kernel making do with 1 gig of virtual memory. This was purely a software change. The CPU page addressing extensions allow the CPU to use more than 4 gigs of physical memory. However there is no way around the virtual memory address size limit, so not even the operating system can address it all simultaneously. However it can allocate all of that memory to different processes, so if you have a single process that needs to access 30 gigs of memory you'll want a 64 bit machine, but if you have 12 processes that each need 2.5 gigs of RAM you're all set. > > eww, that sounds like manual management of extended memory. like the > > days of overlays or even pdp-11 extended memory (which i used to access > > 22 bit address (yes, 22 bit PHYSICAL space) from a 16 bit cpu.). not Fortunately only the operating system has to deal with the addressing hacks required. > Personally I think it would be great to have a system call to say, > "Page this in, because I'll want to be there soon." The > multi-process solution is the next best thing. It is called madvise().
-- Jason
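The madvise() hint Jason names is exposed in several languages' mmap wrappers. A hedged sketch in Python (3.8+, Linux; the constants are OS-specific, so the call is guarded):

```python
import mmap
import tempfile

# Map a 16-page file read-only, then use madvise(MADV_WILLNEED) as the
# "page this in, because I'll want to be there soon" hint from the thread.
# The kernel may start prefetching; pages are still faulted in lazily on
# first access either way.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"x" * mmap.PAGESIZE * 16)
    f.flush()
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    if hasattr(mmap, "MADV_WILLNEED"):       # not available on every OS
        mm.madvise(mmap.MADV_WILLNEED)       # advisory only: a hint, not a command
    data = mm[:mmap.PAGESIZE]                # first access faults the page in
    mm.close()
```

Since madvise is purely advisory, correctness never depends on it; it only changes when the disk reads happen.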
Re: [Boston.pm] transposing rows and columns in a CSV file
At 12:12 PM -0800 11/15/04, Ben Tilly wrote: On Sat, 13 Nov 2004 17:43:37 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote: > > "BT" == Ben Tilly <[EMAIL PROTECTED]> writes: > BT> In Perl I'd expect it to be possible but fragile. If Parrot could make BT> it possible and not fragile, that would be great. parrot will have mmap in bytecode which will be nice and fast and shared. i dunno about mmaped data as that is always an issue and with > garbage collection it can be tricky. Nah, mapping mmapped data to internal strings won't be a big deal, and GC doesn't care. The memory gets marked as external so parrot won't try to move it or free it, and if you wanted it could be COW as well so the mmap segment could be read-only. Or not, either way. I see, so you compile Perl to bytecode, and then load it through mmap. Sure, it might be faster to compile Perl than to read the file. But if the bytecode has been used recently... More than that, it means not loading in unused code, and deferring loading of code until you actually get to it to use. Not as big a deal for one-liners (which wouldn't get compiled to disk, I expect) but if you had something like, say, Tk or CGI compiled to bytecode you could see a significant win, since 90%+ of those modules never get used by any particular program. (And the commonly used library modules are potentially already loaded in system memory by another process) -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk ___ Boston-pm mailing list [EMAIL PROTECTED] http://mail.pm.org/mailman/listinfo/boston-pm
Re: [Boston.pm] transposing rows and columns in a CSV file
On Sat, 13 Nov 2004 17:43:37 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote: > > "BT" == Ben Tilly <[EMAIL PROTECTED]> writes: > > BT> How was I confusing issues? What I meant is that calling mmap does > BT> not use significant amounts of RAM. (The OS needs some to track > BT> that the mapping exists, but that should be it.) Once you actually use > BT> the data that you mmapped in, file contents will be swapped in, and > BT> RAM will be taken, but not until then. > > mmap uses virtual ram which means physical ram and swap space. so mmap can > suck up as much physical ram as you have if you allocate it. We are both right, and we are both wrong. The reality is that any such behaviour on the part of mmap is OS and implementation dependent. And an intelligent OS will make much of it configurable. Let me refer to my local Linux manpage. For this purpose I'd specify MAP_SHARED. In that case edits made to memory are made to the file, and there is no need to reserve RAM or swap. (Not that Linux would reserve RAM anyways, Linux allows overcommitting.) Something close to your described behaviour may happen if you use MAP_PRIVATE and don't specify MAP_NORESERVE. If you specify MAP_NORESERVE you won't use swap. (You might get a SIGSEGV if you make a write when RAM is not available.) Which reinforces the point that specific details about the side-effects of mmap (or any system call) are implementation dependent and should never be assumed. > BT> As for a 3 GB limit, now that you mention it, I heard something > BT> about that. But I didn't pay attention since I don't need it right now. > BT> I've also heard about Intel's large addressing extensions (keep 2GB > BT> in normal address space, page around the top 2 GB, you get 64 GB > BT> of addressible memory). I'm curious about how (or if) the two can > BT> cooperate. > > eww, that sounds like manual management of extended memory. 
like the > days of overlays or even pdp-11 extended memory (which i used to access > 22 bit address (yes, 22 bit PHYSICAL space) from a 16 bit cpu.). not > something you want to deal with unless you have to. You got it. Worse yet, from the way that I see it, in the next 5 years our industry must decide whether the bad old days are going to return, and I don't know which way it will jump. The problem is that consumer computers will soon use over 4 GB of RAM. The obvious clean solution is to switch to 64 bit CPUs. But as soon as you go to 64-bit pointers, a lot of programs and data structures grow (worst case they double) so there is a big jump in memory needs. (Rather than being a bit over your needs, you are a lot over.) And when data bloats, moving that data around the computer slows down as well. Programmer pain is invisible. That size and performance hit is not. A couple of years ago Intel quietly added an addressing extension to allow for up to 64 GB of RAM, and then pursued a non-consumer 64-bit strategy (which flopped). The way that I read that is that Intel expects the consumer industry to do as it did for a decade after the 16 to 32 bit conversion should have happened - stick with smaller pointers and swallow the addressing difficulty. AMD's strategy was more public. They came out with a 64-bit CPU that solved a bunch of problems with the x86 architecture, meaning that if you switch to 64-bit mode there is a good chance that your code will speed up. Their CPU has been successful enough that Intel has been compelled to issue a copycat CPU. But which way the industry will jump is still unclear to me. To me the first good test is what happens when high-end games start having memory requirements that are too big for straightforward 32-bit access. (A limit which conventionally is placed at 2 GB, but can be stretched to somewhere between 2 GB and 4 GB.) Will they manually manage some big chunks of data, or will they require an AMD Athlon-compatible computer?
>as for the > original > problem, i keep saying that mmap will give little help if the input > matrix won't fit into physical ram. once you start swapping (manually or > virtually) all bets are off on speed of any transposition algorithm. you > have to start counting disk accesses and once you do, who cares how it > was done (mmap or whatever)? mmap with MAP_SHARED may reduce your RAM requirements, and improves your access speed. I agree that the difference is pretty marginal. As for your "counting disk accesses", I've already pointed out that disk accesses are not created equal. If you really care about the performance of the application, you need to benchmark, not make simplistic estimates. Because unless you know a lot of detail about how the disk drive works, you can't easily predict what the actual performance will be. [...] > BT> However the over-committed allocation comment confuses me. > BT> Why would a single mmap result in over committing memo
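The MAP_SHARED point Ben makes, that edits through the mapping go to the file itself rather than to swap-backed private pages, can be demonstrated with Python's mmap wrapper, where ACCESS_WRITE gives MAP_SHARED-style semantics. A hedged sketch:

```python
import mmap
import os
import tempfile

# With a shared writable mapping, the file is the backing store: a store
# through the mapping shows up in the file, and the kernel needs no swap
# reservation for these pages.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")
os.close(fd)

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)
    mm[0:5] = b"HELLO"   # edit memory; the write lands in the file
    mm.flush()
    mm.close()

with open(path, "rb") as f:
    result = f.read()
print(result)            # b'HELLO world'
os.unlink(path)
```

A MAP_PRIVATE (ACCESS_COPY in Python) mapping of the same file would leave the file untouched, which is exactly the case where swap may need to be reserved for the dirtied pages.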
Re: [Boston.pm] transposing rows and columns in a CSV file
> "BT" == Ben Tilly <[EMAIL PROTECTED]> writes: BT> How was I confusing issues? What I meant is that calling mmap does BT> not use significant amounts of RAM. (The OS needs some to track BT> that the mapping exists, but that should be it.) Once you actually use BT> the data that you mmapped in, file contents will be swapped in, and BT> RAM will be taken, but not until then. mmap uses virtual ram which means physical ram and swap space. so mmap can suck up as much physical ram as you have if you allocate it. BT> As for a 3 GB limit, now that you mention it, I heard something BT> about that. But I didn't pay attention since I don't need it right now. BT> I've also heard about Intel's large addressing extensions (keep 2GB BT> in normal address space, page around the top 2 GB, you get 64 GB BT> of addressable memory). I'm curious about how (or if) the two can BT> cooperate. eww, that sounds like manual management of extended memory. like the days of overlays or even pdp-11 extended memory (which i used to access 22 bit address (yes, 22 bit PHYSICAL space) from a 16 bit cpu.). not something you want to deal with unless you have to. as for the original problem, i keep saying that mmap will give little help if the input matrix won't fit into physical ram. once you start swapping (manually or virtually) all bets are off on speed of any transposition algorithm. you have to start counting disk accesses and once you do, who cares how it was done (mmap or whatever)? >> To be clear, though, if you had 10MB of RAM, you could still mmap a 3GB >> file, assuming you allowed for over-committed allocation in the kernel >> (assuming Linux... filthy habit, I know). BT> Exactly what I was referring to. BT> However the over-committed allocation comment confuses me. BT> Why would a single mmap result in over committing memory? you can allocate all the virtual space allowed when you don't need it. then mmap gains you little. mmap is best used as a window into a large file.
note that page faults when using mmap will block a process until that page is swapped in. so some systems use small processes to touch shared (mmaped) virtual ram so that block and when they wake up the main process can use the real ram. BT> I was thinking that you'd use something like Sys::Mmap's mmap BT> call directly so that there is a Perl variable that Perl thinks is a BT> regular variable but which at a C level has its data at an mmapped BT> location. Fragile, I know (because Perl doesn't know that it cannot BT> reallocate the variable), but as long as you are careful to not cause BT> it to be reallocated or copied, there should be no limitations on BT> what you can do. but those are bad limitations. i wouldn't ever design in such a beast. i would use other methods of IPC than mmap in perl. BT> In Perl I'd expect it to be possible but fragile. If Parrot could make BT> it possible and not fragile, that would be great. parrot will have mmap in bytecode which will be nice and fast and shared. i dunno about mmaped data as that is always an issue and with garbage collection it can be tricky. uri -- Uri Guttman -- [EMAIL PROTECTED] http://www.stemsystems.com --Perl Consulting, Stem Development, Systems Architecture, Design and Coding- Search or Offer Perl Jobs http://jobs.perl.org ___ Boston-pm mailing list [EMAIL PROTECTED] http://mail.pm.org/mailman/listinfo/boston-pm
Re: [Boston.pm] transposing rows and columns in a CSV file
On Fri, 12 Nov 2004 23:04:46 -0500, Aaron Sherman <[EMAIL PROTECTED]> wrote: > On Fri, 2004-11-12 at 13:22 -0800, Ben Tilly wrote: [...] > > Um, mmap does not (well should not - Windows may vary) use any > > RAM > > You are confusing two issues. "using RAM" is not the same as "allocating > process address space". Allocating process address space is, of course, > required for mmap (same way you allocate address space when you load a > shared library, which is also mmap-based under Unix and Unix-like > systems). All systems have to limit address space at some point. Linux > does this at 3GB up to 2.6.x where it becomes more configurable and can > be as large as 3.5, I think. How was I confusing issues? What I meant is that calling mmap does not use significant amounts of RAM. (The OS needs some to track that the mapping exists, but that should be it.) Once you actually use the data that you mmapped in, file contents will be swapped in, and RAM will be taken, but not until then. As for a 3 GB limit, now that you mention it, I heard something about that. But I didn't pay attention since I don't need it right now. I've also heard about Intel's large addressing extensions (keep 2GB in normal address space, page around the top 2 GB, you get 64 GB of addressable memory). I'm curious about how (or if) the two can cooperate. > To be clear, though, if you had 10MB of RAM, you could still mmap a 3GB > file, assuming you allowed for over-committed allocation in the kernel > (assuming Linux... filthy habit, I know). Exactly what I was referring to. However the over-committed allocation comment confuses me. Why would a single mmap result in over committing memory? > > mmap should not cause any more or less disk accesses than > > reading from the file in the same pattern should have. It just lets > > you do things like use Perl's RE engine directly on the file > > contents. > > Actually, no it doesn't as far as I know (unless the copy-on-write code > got MUCH better recently).
Where does a write happen? I was thinking in terms of using the RE engine (with pos) as a tokenizer. I was thinking that you'd use something like Sys::Mmap's mmap call directly so that there is a Perl variable that Perl thinks is a regular variable but which at a C level has its data at an mmapped location. Fragile, I know (because Perl doesn't know that it cannot reallocate the variable), but as long as you are careful to not cause it to be reallocated or copied, there should be no limitations on what you can do. > Like I said, you probably won't get the win out of mmap in Perl that you > would expect. In Parrot you would, but that's another story. In Perl I'd expect it to be possible but fragile. If Parrot could make it possible and not fragile, that would be great. Cheers, Ben
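The "run the regex engine directly on the file contents" idea is fragile in Perl 5, as Ben says, but other runtimes expose it safely. A hedged sketch in Python, where the re module accepts any bytes-like object, including an mmap, so matching scans the mapped pages without slurping the file into a string:

```python
import mmap
import re
import tempfile

# Tokenize a file by running the regex engine directly over the mapping.
# Pages are faulted in on demand as the scan advances; no copy of the
# whole file is ever built in process memory.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"alpha,beta\ngamma,delta\n")
    f.flush()
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    tokens = [m.group(0) for m in re.finditer(rb"[a-z]+", mm)]
    mm.close()

print(tokens)   # [b'alpha', b'beta', b'gamma', b'delta']
```

This is the safe analogue of the Sys::Mmap trick: the mapping is a first-class buffer, so nothing can accidentally reallocate it out from under the regex engine.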
Re: [Boston.pm] transposing rows and columns in a CSV file
On Fri, 2004-11-12 at 13:22 -0800, Ben Tilly wrote: > On Fri, 12 Nov 2004 10:05:27 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote: > > > "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes: > [...] > > this talk about mmap makes little sense to me. it may save some i/o and > > even some buffering but you still need the ram and mmap still causes > > disk accesses. Just to throw in my own two cents before I critique the reply: "some disk buffering" can mean a factor of 10-1000 performance improvement in real world applications. This is my personal experience with real-world programs. Of course, if all you want is linear access to a file once, then mmap doesn't help. But, if you want random access to a file, nothing beats mmap because people spend their LIVES tuning paging strategies, and such code has knowledge of the hardware that you cannot otherwise take advantage of in a general purpose IO layer. > Um, mmap does not (well should not - Windows may vary) use any > RAM You are confusing two issues. "using RAM" is not the same as "allocating process address space". Allocating process address space is, of course, required for mmap (same way you allocate address space when you load a shared library, which is also mmap-based under Unix and Unix-like systems). All systems have to limit address space at some point. Linux does this at 3GB up to 2.6.x where it becomes more configurable and can be as large as 3.5, I think. To be clear, though, if you had 10MB of RAM, you could still mmap a 3GB file, assuming you allowed for over-committed allocation in the kernel (assuming Linux... filthy habit, I know). > mmap should not cause any more or less disk accesses than > reading from the file in the same pattern should have. It just lets > you do things like use Perl's RE engine directly on the file > contents. Actually, no it doesn't as far as I know (unless the copy-on-write code got MUCH better recently).
Like I said, you probably won't get the win out of mmap in Perl that you would expect. In Parrot you would, but that's another story.
Re: [Boston.pm] transposing rows and columns in a CSV file
On Fri, 2004-11-12 at 21:47 -0500, William Ricker wrote: > > This is at best 2/3 correct. > > > First you're right that mmap has a 2 GB limit because it maps > > things into your address space, and so the size of your pointers > > limit what you can address. > (unless you have 64bit pointers of course) No, even without 64 bit pointers, you can have a 4GB address space (not signed). The trick is that under Linux you're usually limited to 3GB because the rest is reserved and other OSes impose other similar limitations. I have worked with an application that allocates about 2.5GB of RAM on startup, so I have occasion to know this ;-)
Re: [Boston.pm] transposing rows and columns in a CSV file
> Well mmap is a Unix concept. IIRC, it's a Multics concept, that Brian and Dennis presumably took south to NJ with them; if it was original with Multics or is even older I'd have to check with the retired Multicians list or an archeobibliography. > To the best of my knowledge > it is not natively supported in Windows. Right. ActiveState module repository does not include mmap.pm builds for Windows, only several *nix platforms. MKS rocks, if you need commercial grade support for *nix-on-winDos. Cygwin is fine if you don't need 800# support and don't want to manage the WINDOS ENV from KSH for launching those closed source programs from a nice shell. > Like him, I have no idea why pagefile.sys would enter into the > picture. It certainly doesn't on Linux. It oughtn't, but a lame enough emulation *of the interface* on a lame "os" might have to copy the whole thing into the swapfile instead of doing a memory-map file operation - in which case, kiss the efficiency goodbye. Bill --- William Ricker [EMAIL PROTECTED]
Re: [Boston.pm] transposing rows and columns in a CSV file
> This is at best 2/3 correct. > First you're right that mmap has a 2 GB limit because it maps > things into your address space, and so the size of your pointers > limit what you can address. (unless you have 64bit pointers of course) Does the holder of the original problem (sorry, I've lost track of whose question it was now) 1. have *nix or windos? 32/64 if *nix? w/cygwin or MKS if win? 2. have a file >2GB or <2GB ? 3. have to do this once or many times? Unfortunately, I don't immediately see a Debian pkg for Mmap.pm for Alpha, which is the only platform I have perl64 on. Maybe next rev's perl 5.8 will bring one. bill --- William Ricker [EMAIL PROTECTED]
Re: [Boston.pm] transposing rows and columns in a CSV file
On Fri, 12 Nov 2004 10:05:27 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote: > > "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes: [...] > this talk about mmap makes little sense to me. it may save some i/o and > even some buffering but you still need the ram and mmap still causes > disk accesses. Um, mmap does not (well should not - Windows may vary) use any RAM (other than what the memory manager needs to keep track of the fact that the mapping has happened). Using mmap does not imply any particular algorithm, it is an optimization saying that you're going to leave the ugly details of paging the file to/from your process to the OS. mmap should not cause any more or less disk accesses than reading from the file in the same pattern should have. It just lets you do things like use Perl's RE engine directly on the file contents. > if the original file is too big for ram then the > algorithm chosen must be one to minimize disk accesses and mmap doesn't > save those. this is why disk/tape sorts were invented, to minimize the > slow disk and tape accesses. so you would still need my algorithm or > something similar regardless of how you actually get the data from disk > to ram. and yes i have used mmap on many projects. I'm not sure what you mean by "something similar", but yes, you'll need SOME algorithm to solve the problem. Which statement is so general as to be meaningless. I'm sure that there are some possible algorithms that you'd never have thought of. (Mostly because they're bad.) Disk/tape sorts were invented because back in the day there was not enough RAM to do anything useful and so everything had to go to disk. Of course once you're forced to go to disk, why not optimize it...? Of course this problem said to guarantee being able to do the sort, not necessarily to do it most efficiently. Therefore no single criteria - including disk accesses - necessarily MUST dominate your choice. Furthermore disk accesses are not created equal. 
There are multiple levels of cache between you and disk. Accessing data in a way that is friendly to cache will improve performance greatly. In particular managing to access data sequentially is orders of magnitude faster than jumping around. The key is not how often you "access disk", it is how often your hard drive has to do a seek. When it needs to seek it reads far more data than it is asked for and puts that in cache. When you read sequentially, most of your accesses come from cache, not disk. That is why databases use merge-sort so much, it accesses data in exactly the way that hard drives are designed to be accessed most efficiently. A quick sort has fewer disk accesses, but far more of them cause an unwanted seek. > when analyzing algorithm efficiency you must work out which is the > slowest operation that has the steepest growth curve and work on > minimizing it. since disk access is so much slower than ram access it > becomes the key element rather than the classic comparison in sorts. in You must, must, must. What is this preoccupation with must? As I just pointed out, disk accesses are not all equal. Secondly in many applications you will *parallelize* the slowest step, not minimize it. For instance good databases not only like to use mergesort internally, they often distribute the job to several processes or threads that all work at once, that way if one process is waiting on a disk read, others may be going at the same time. Thirdly, and most importantly, it is more important to make code work than to make it efficient. If a stupid solution will work and a smart one should be faster, code the stupid solution first.
> that is 2 (large) disk accesses per chunk (or 1
> per disk block). then you do a merge (assuming you can access all the
> section files at one time) which is another disk access per section (or
> block). and one more to write out the final matrix (in row order). so
> that is O((2 + 2) * section_count) disk accesses which isn't too bad.

You said that you want to assume that we can access all section files at once. Well, suppose that I take a CSV file which is 100 columns by 10 million rows, transpose it, then try to transpose it again. Your assumption just broke. Maybe it would work for the person with the original problem, maybe not.

Here is the outline of a solution that avoids all such assumptions.

1. Run through the CSV file and output a file of lines of the format:

   $column,$row:$field

   You'll need to encode embedded newlines some way, for instance s/\\/\\\\/g; s/\n/\\n/g; - you may also want to pre-pad the columns and rows with some number of 0's so that an ASCII-betical sort does the right thing.

2. Sort that file. An external sort handles files far larger than RAM.

3. Run through the sorted file, undo the encoding, and write out the transposed CSV.
Re: [Boston.pm] transposing rows and columns in a CSV file
On Fri, 12 Nov 2004 09:01:32 -0500, Tolkin, Steve <[EMAIL PROTECTED]> wrote:

> I think there may be a more restrictive limit, at least on Windows.
> The OS must be able to find a contiguous block
> of virtual memory, i.e. in pagefile.sys.
> The paging file may not be able to grow, (depending on
> how it is configured) and there may not be a large enough block.

Given how mmap is supposed to work, I'd doubt this diagnosis. I'm not saying that you're wrong, I'm just saying that this would be a strange limitation. What mmap is supposed to do is cause a section of your address space to transparently be a file on disk. It should not actually read that data in; that is done for you on demand when you access the data. (I guess an implementation could do that, but the Unix ones certainly don't.) However, it does need to find a contiguous block of address space in your process memory space to map that data into.

> I would like to learn more about the exact situation of memory mapping
> files
> on Windows -- the above is just based on an hour of Googling
> and the info below.

Well, mmap is a Unix concept. To the best of my knowledge it is not natively supported in Windows. If you have it, it will be implemented by whatever Unix emulation you're using. For instance if you're using MKS then see http://mkssoftware.com/docs/man3/mmap.3.asp; if you're using cygwin, see your local cygwin documentation. It may be that the emulations have additional limits that I don't know about. (I don't use Windows.)

> Non perl related info follows:
> I hit this limitation in the otherwise excellent disk indexing program
> Wilbur at http://wilbur.redtree.com which is free (as in beer) and open
> source too.
> (But for Windows only.)
>
> It uses memory mapped files and when one of the indexes exceeds
> about 500 MB it says something like "unable to map view of a file"
> even though my pagefile.sys is 1536 MB.
> > The developer said: > This is a system message that occurs when Wilbur is unable to memory map > one > of its index files and due to the way memory mapping works on Windows, I > think > this is normally a symptom of insufficient virtual memory space. > Possible > solutions might be increasing the size of your paging file (dig through > the > performance options on the system control panel to find this), > defragmenting > the disk and of course adding more real memory. Googling for that, I see him saying the same thing at http://wilbur.redtree.com/cgi-bin/wilburtalk.pl?noframes;read=1770 Like him, I have no idea why pagefile.sys would enter into the picture. It certainly doesn't on Linux. Cheers, Ben ___ Boston-pm mailing list [EMAIL PROTECTED] http://mail.pm.org/mailman/listinfo/boston-pm
Re: [Boston.pm] transposing rows and columns in a CSV file
On Fri, 12 Nov 2004 07:38:57 -0500, Gyepi SAM <[EMAIL PROTECTED]> wrote:

> On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote: [...]
> I think mmap would be just as ideal in Perl and a lot less work too.
> Rather than indexing and parsing a *large* file, you just mmap
> and parse it. In fact, the CSV code, which was left as an exercise in your
> pseudo-code, would be the only code required.

It depends on your definition of ideal. A Perl string is far more complex than a C string, and translating between the two adds complexity. It requires an external module and adds platform dependencies.

> I should point out though that mmap has a 2GB limit on systems
> without 64bit support. Such systems can't store files larger than
> that anyhow.

This is at best 2/3 correct. First, you're right that mmap has a 2 GB limit, because it maps things into your address space, and so the size of your pointers limits what you can address. It is also correct that there are complications in handling large files on 32 bit systems. Most operating systems didn't handle that case. However, today most 32 bit operating systems have support for large files, and Perl added the necessary hooks to take advantage of it several versions ago. So if you have a relatively up to date system, odds are very good that you don't have a 2 GB limit. Certainly not on Windows or Linux.

Cheers, Ben
Re: [Boston.pm] transposing rows and columns in a CSV file
> "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:

GS> On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
>> Seriously, while mmap is ideal in C, in Perl I would just build an array
>> of tell()s for each line in the file and then walk through the lines,
>> storing the offset of the last delimiter that I'd seen.

GS> I think mmap would be just as ideal in Perl and a lot less work too.
GS> Rather than indexing and parsing a *large* file, you just mmap
GS> and parse it. In fact, the CSV code, which was left as an exercise in your
GS> pseudo-code, would be the only code required.

GS> I should point out though that mmap has a 2GB limit on systems
GS> without 64bit support. Such systems can't store files larger than
GS> that anyhow.

>> Let the kernel file buffer do your heavy lifting for you.

GS> Exactly, if s/kernel file/mmap/

this talk about mmap makes little sense to me. it may save some i/o and even some buffering but you still need the ram and mmap still causes disk accesses. if the original file is too big for ram then the algorithm chosen must be one to minimize disk accesses and mmap doesn't save those. this is why disk/tape sorts were invented, to minimize the slow disk and tape accesses. so you would still need my algorithm or something similar regardless of how you actually get the data from disk to ram. and yes i have used mmap on many projects.

when analyzing algorithm efficiency you must work out which is the slowest operation that has the steepest growth curve and work on minimizing it. since disk access is so much slower than ram access it becomes the key element rather than the classic comparison in sorts. in a matrix transposition in ram, i would count the matrix accesses and/or copies of elements. with a larger matrix, then ram accesses would be key. my solution would load as much matrix into ram as possible (maybe using mmap but that is not critical anymore) and transpose it. then write the section out.
that is 2 (large) disk accesses per chunk (or 1 per disk block). then you do a merge (assuming you can access all the section files at one time) which is another disk access per section (or block). and one more to write out the final matrix (in row order). so that is O((2 + 2) * section_count) disk accesses which isn't too bad.

uri

--
Uri Guttman -- [EMAIL PROTECTED] http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs http://jobs.perl.org
RE: [Boston.pm] transposing rows and columns in a CSV file
I think there may be a more restrictive limit, at least on Windows. The OS must be able to find a contiguous block of virtual memory, i.e. in pagefile.sys. The paging file may not be able to grow (depending on how it is configured) and there may not be a large enough block.

I would like to learn more about the exact situation of memory mapping files on Windows -- the above is just based on an hour of Googling and the info below.

Non perl related info follows: I hit this limitation in the otherwise excellent disk indexing program Wilbur at http://wilbur.redtree.com which is free (as in beer) and open source too. (But for Windows only.)

It uses memory mapped files and when one of the indexes exceeds about 500 MB it says something like "unable to map view of a file" even though my pagefile.sys is 1536 MB.

The developer said: This is a system message that occurs when Wilbur is unable to memory map one of its index files and due to the way memory mapping works on Windows, I think this is normally a symptom of insufficient virtual memory space. Possible solutions might be increasing the size of your paging file (dig through the performance options on the system control panel to find this), defragmenting the disk and of course adding more real memory.

But I already have a paging file of 1536 MB, which is the recommended size in Windows XP (3 times physical memory of 512 MB). I also do not think that defragmenting the disk helps, except possibly if done at boot time to defrag pagefile.sys. (However I did do that and it still failed.)

I was able to work around the problem by putting fewer words in the index, e.g. kept the default of min length = 3 and no numbers. I also have lots more files than most people, so I think very few people will hit this limitation.

Hopefully helpfully yours,
Steve

--
Steve Tolkin    Steve . Tolkin at FMR dot COM    617-563-0516
Fidelity Investments    82 Devonshire St. V4D    Boston MA 02109
There is nothing so practical as a good theory.
Comments are by me, not Fidelity Investments, its subsidiaries or affiliates.

-Original Message-
From: Gyepi SAM [mailto:[EMAIL PROTECTED]
Sent: Friday, November 12, 2004 7:39 AM
To: [EMAIL PROTECTED]
Subject: Re: [Boston.pm] transposing rows and columns in a CSV file

On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:

> Seriously, while mmap is ideal in C, in Perl I would just build an array
> of tell()s for each line in the file and then walk through the lines,
> storing the offset of the last delimiter that I'd seen.

I think mmap would be just as ideal in Perl and a lot less work too. Rather than indexing and parsing a *large* file, you just mmap and parse it. In fact, the CSV code, which was left as an exercise in your pseudo-code, would be the only code required.

I should point out though that mmap has a 2GB limit on systems without 64bit support. Such systems can't store files larger than that anyhow.

> Let the kernel file buffer do your heavy lifting for you.

Exactly, if s/kernel file/mmap/

-Gyepi

--
The convenient method is insecure and the secure method is inconvenient. --me
Re: [Boston.pm] transposing rows and columns in a CSV file
On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:

> Seriously, while mmap is ideal in C, in Perl I would just build an array
> of tell()s for each line in the file and then walk through the lines,
> storing the offset of the last delimiter that I'd seen.

I think mmap would be just as ideal in Perl and a lot less work too. Rather than indexing and parsing a *large* file, you just mmap and parse it. In fact, the CSV code, which was left as an exercise in your pseudo-code, would be the only code required.

I should point out though that mmap has a 2GB limit on systems without 64bit support. Such systems can't store files larger than that anyhow.

> Let the kernel file buffer do your heavy lifting for you.

Exactly, if s/kernel file/mmap/

-Gyepi

--
The convenient method is insecure and the secure method is inconvenient. --me
Re: [Boston.pm] transposing rows and columns in a CSV file
On Wed, 2004-11-10 at 23:13 -0500, Uri Guttman wrote:

> sorry i missed the meeting but i have a nasty cold i am fighting off.

I just got over that one :-(

As for transposing a matrix that won't fit in ram... that's easy. Mail it to someone who has more ram. Aaron "Gordian Knot" Sherman, at your service ;-)

Seriously, while mmap is ideal in C, in Perl I would just build an array of tell()s for each line in the file and then walk through the lines, storing the offset of the last delimiter that I'd seen. That way, every read looks like this:

    # XXX WARNING, untested pseudo-code
    next if $offsets[$thisline] == -1;
    seek(INPUT, $tells[$thisline] + $offsets[$thisline], 0);
    $bigbuf = '';
    for ($i = 0; ; $i++) {          # loop until we hit a delimiter
        sysread(INPUT, $buf, $blocksz);
        # Proper CSV parsing left as an exercise...
        if ($buf =~ /[,\n]/) {
            $pos = length($`) + 1;
            $sep = $&;
            $bigbuf .= $`;          # keep this block's text up to the delimiter
            if ($sep eq "\n") {
                $offsets[$thisline] = -1;   # Done here
            } else {
                $offsets[$thisline] += $i * $blocksz + $pos;
            }
            last;
        } else {
            $bigbuf .= $buf;
        }
    }
    handle_one_field($bigbuf);

Let the kernel file buffer do your heavy lifting for you.
Re: [Boston.pm] transposing rows and columns in a CSV file
> "WR" == William Ricker <[EMAIL PROTECTED]> writes:

WR> This problem of transposing a matrix bigger than you want to
WR> consider slurping feels familiar. The techniques used to sort
WR> "big" files in the bad old days, disk or tape based sort-merge,
WR> might be adaptable. The trick is to do it in 2 (or small * log N)
WR> passes, by creating a simpler suitable intermediate form.

interestingly, when doing tape merges with multiple tapes, the most efficient way is based on the fibonacci sequence. but in that case you can see the first records of each tape in a set to do the merge. in the large matrix you can't see all the needed elements so that may not be a viable direction to pursue.

but here is another thought based on the old disk sort/merges and such. you can transpose sections of the input matrix and then join/merge those. so if you can only fit one part of the matrix into ram (including output buffers), you transpose it and output it to a file. do the same for the other sections and output them. then it is a fairly easy merge where you can read the first records of each of the transposed sections and merge into a single record and output it to a final file.

see, i am not all about stem :)

sorry i missed the meeting but i have a nasty cold i am fighting off.

uri

--
Uri Guttman -- [EMAIL PROTECTED] http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs http://jobs.perl.org
Re: [Boston.pm] transposing rows and columns in a CSV file
This problem of transposing a matrix bigger than you want to consider slurping feels familiar. The techniques used to sort "big" files in the bad old days, disk or tape based sort-merge, might be adaptable. The trick is to do it in 2 (or small * log N) passes, by creating a simpler suitable intermediate form.

Bill

--- William Ricker [EMAIL PROTECTED]
Re: [Boston.pm] transposing rows and columns in a CSV file
On Wed, Nov 10, 2004 at 04:38:26PM -0500, Tom Metro wrote:

> At last night's meeting someone asked about ways to transpose the rows and
> columns in a CSV file that was too large to fit into memory and
> potentially had thousands of columns and hundreds of thousands of rows.

All of the suggested solutions are great, but ... If a file is too large to slurp into memory, use mmap. It's perfect for the task and allows for the most natural solution. There are at least two mmap modules on CPAN.

Of course, since I was not at the meeting, this may well have been discussed and I could be a day late! ;)

-Gyepi
[Boston.pm] transposing rows and columns in a CSV file
At last night's meeting someone asked about ways to transpose the rows and columns in a CSV file that was too large to fit into memory and potentially had thousands of columns and hundreds of thousands of rows.

If your data fits the criteria of having reasonably uniform cell widths, say for example no cells holding more than 50 bytes, you could use a sparse file with a fixed width record structure. You would make a single pass through your source data file, parsing the CSV only once, and make random (as in random access) seeks into the target file, following a simple formula to determine the cell's offset. (If a sparse file isn't doable, but you have lots of disk space, you can instead create the target file by padding it with enough nulls to hold the number of cells you are expecting. Using dd to copy blocks from /dev/zero would do the trick.)

Conceptually this isn't all that different from using a database, as someone suggested, but should have much less overhead. Of course if the desired end result is to have another CSV file, you'll have to write another chunk of code to convert the fixed record file back to CSV. Whether millions of random seeks and a post conversion step really provide enough of a performance gain over making multiple passes through the source file is debatable. If this is a one-time conversion effort, the multi-pass approach might be the way to go.

A variation on this idea, which would eliminate copying the data to an intermediary file, and would be able to tolerate widely varying cell widths, is to make one pass through the source file and generate an index for it - say the byte offset of each row, and then the relative offset and length of each cell - which, if you're lucky, might fit into memory; otherwise you'd again use a sparse file. Then you'd walk through the index, make random seeks into the source file, and write the cells to your target CSV file in the desired order.
Not quite a single pass on the source file, but at least you'd have the overhead of performing the CSV parsing only once.

-Tom