Re: [Boston.pm] emergency social meeting
On Fri, 2004-11-12 at 10:26 -0500, Uri Guttman wrote:
> > "bdf" == brian d foy <[EMAIL PROTECTED]> writes:
>
> bdf> I'm in Boston from Nov 15-18.
>
> for those of you who don't know, bdf teaches for stonehenge (randal's biz) and is the founder of perl mongers (which you are a member of!) so let's give him his due props with a nice emergency social meeting.

Yeah, yeah... whatever. How cool is his camera?!

___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm
Re: [Boston.pm] transposing rows and columns in a CSV file
On Fri, 2004-11-12 at 13:22 -0800, Ben Tilly wrote:
> On Fri, 12 Nov 2004 10:05:27 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote:
> > > "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:
> [...]
> > this talk about mmap makes little sense to me. it may save some i/o and even some buffering but you still need the ram and mmap still causes disk accesses.

Just to throw in my own two cents before I critique the reply: "some disk buffering" can mean a factor of 10-1000 performance improvement in real world applications. This is my personal experience with real-world programs. Of course, if all you want is linear access to a file once, then mmap doesn't help. But, if you want random access to a file, nothing beats mmap because people spend their LIVES tuning paging strategies, and such code has knowledge of the hardware that you cannot otherwise take advantage of in a general purpose IO layer.

> Um, mmap does not (well should not - Windows may vary) use any RAM

You are confusing two issues. "using RAM" is not the same as "allocating process address space". Allocating process address space is, of course, required for mmap (same way you allocate address space when you load a shared library, which is also mmap-based under Unix and Unix-like systems).

All systems have to limit address space at some point. Linux does this at 3GB up to 2.6.x where it becomes more configurable and can be as large as 3.5GB, I think.

To be clear, though, if you had 10MB of RAM, you could still mmap a 3GB file, assuming you allowed for over-committed allocation in the kernel (assuming Linux... filthy habit, I know).

> mmap should not cause any more or less disk accesses than reading from the file in the same pattern should have. It just lets you do things like use Perl's RE engine directly on the file contents.

Actually, no it doesn't as far as I know (unless the copy-on-write code got MUCH better recently).

Like I said, you probably won't get the win out of mmap in Perl that you would expect. In Parrot you would, but that's another story.
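For readers who want to see the two uses argued over above (random access through a mapping, and running a regex engine directly on file contents), here is a minimal sketch. It is written in Python rather than Perl and uses Python's standard mmap module; the function names find_in_mapped_file and byte_at are made up for illustration and appear nowhere in the thread. It says nothing about how Perl's own mmap bindings behave, which is the open question in the message above.

```python
import mmap
import re


def find_in_mapped_file(path, pattern):
    """Return (offset, matched bytes) for the first match of a bytes regex, or None.

    Hypothetical helper: the regex engine scans the mapping itself, so the
    file is never read into a Python string first.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file. This reserves address space; pages
        # are only brought in when they are actually touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            m = re.search(pattern, mm)
            return (m.start(), bytes(m.group(0))) if m else None


def byte_at(path, offset):
    """Random access: fetch a single byte at an arbitrary offset via the mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + 1]
```

Note that mapping an empty file with length 0 raises an error in Python, so a real caller would want to check the file size first.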
Re: [Boston.pm] transposing rows and columns in a CSV file
On Fri, 2004-11-12 at 21:47 -0500, William Ricker wrote:
> > This is at best 2/3 correct.
> >
> > First you're right that mmap has a 2 GB limit because it maps things into your address space, and so the size of your pointers limits what you can address.
>
> (unless you have 64bit pointers of course)

No, even without 64 bit pointers, you can have a 4GB address space (not signed). The trick is that under Linux you're usually limited to 3GB because the rest is reserved and other OSes impose other similar limitations.

I have worked with an application that allocates about 2.5GB of RAM on startup, so I have occasion to know this ;-)
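The 2GB versus 4GB figures in this exchange are just pointer-width arithmetic, which a few lines of Python can spell out. This is only a worked calculation; addressable_bytes is a made-up name, and the 1GB reserved figure is simply the difference between the 4GB and 3GB numbers quoted above, not a statement about any particular kernel configuration.

```python
GIB = 2 ** 30


def addressable_bytes(pointer_bits, signed=False):
    """How many distinct byte addresses a pointer of this width can name.

    Treating the pointer (or file offset) as signed costs one bit.
    """
    return 2 ** (pointer_bits - 1 if signed else pointer_bits)


unsigned_32 = addressable_bytes(32) // GIB             # 4 (GB)
signed_32 = addressable_bytes(32, signed=True) // GIB  # 2 (GB)
reserved = 1                                           # GB kept back, per the 3GB figure above
usable_32 = unsigned_32 - reserved                     # 3 (GB)
```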
Re: [Boston.pm] transposing rows and columns in a CSV file
> Well mmap is a Unix concept.

IIRC, it's a Multics concept, that Brian and Dennis presumably took south to NJ with them; if it was original with Multics or is even older I'd have to check with the retired Multicians list or an archeobibliography.

> To the best of my knowledge it is not natively supported in Windows.

Right. ActiveState module repository does not include mmap.pm builds for Windows, only several *nix platforms.

MKS rocks, if you need commercial grade support for *nix-on-winDos. Cygwin is fine if you don't need 800# support and don't want to manage the WINDOS ENV from KSH for launching those closed source programs from a nice shell.

> Like him, I have no idea why pagefile.sys would enter into the picture. It certainly doesn't on Linux.

It oughtn't, but a lame enough emulation *of the interface* on a lame "os" might have to copy the whole thing into the swapfile instead of doing a memory-map file operation - in which case, kiss the efficiency goodbye.

Bill
---
William Ricker [EMAIL PROTECTED]
Re: [Boston.pm] emergency social meeting
> i don't know how mobile or boston road aware brian is, so

So the key questions for brian d foy are, where are you staying and would you like to get further afield or stay close to base? If bdf's non-mobile but would like to explore, perhaps a mobile monger can share him a camel ride.

Do we get a better turnout for a social when we're subway accessible, or do enough suburbanites show up out Rt9 that no one misses us urban trolley dodgers? If so, it's good to do some out there too.

(Given a free choice, Cambridge Brew and Boston Beer Works (either site) would get my vote, but there's plenty of good drinking around.)

bill
---
William Ricker [EMAIL PROTECTED]
[Boston.pm] new edition of book available
Hey all.

For those of you who were at the monger meeting, I plugged my sci-fi book "Hunger Pangs" and said a new edition was coming out this week. Well, it's out as of last night. Check out the preview (part of which is licensed CreativeCommons-NonCommercial) and buy the book if you like hard sf or military sf.

"Hunger Pangs"
http://www.greglondon.com/hunger/
Re: [Boston.pm] transposing rows and columns in a CSV file
On Fri, 12 Nov 2004 10:05:27 -0500, Uri Guttman <[EMAIL PROTECTED]> wrote:
> > "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:
[...]
> this talk about mmap makes little sense to me. it may save some i/o and even some buffering but you still need the ram and mmap still causes disk accesses.

Um, mmap does not (well should not - Windows may vary) use any RAM (other than what the memory manager needs to keep track of the fact that the mapping has happened). Using mmap does not imply any particular algorithm, it is an optimization saying that you're going to leave the ugly details of paging the file to/from your process to the OS.

mmap should not cause any more or less disk accesses than reading from the file in the same pattern should have. It just lets you do things like use Perl's RE engine directly on the file contents.

> if the original file is too big for ram then the algorithm chosen must be one to minimize disk accesses and mmap doesn't save those. this is why disk/tape sorts were invented, to minimize the slow disk and tape accesses. so you would still need my algorithm or something similar regardless of how you actually get the data from disk to ram. and yes i have used mmap on many projects.

I'm not sure what you mean by "something similar", but yes, you'll need SOME algorithm to solve the problem. Which statement is so general as to be meaningless. I'm sure that there are some possible algorithms that you'd never have thought of. (Mostly because they're bad.)

Disk/tape sorts were invented because back in the day there was not enough RAM to do anything useful and so everything had to go to disk. Of course once you're forced to go to disk, why not optimize it...?

Of course this problem said to guarantee being able to do the sort, not necessarily to do it most efficiently. Therefore no single criterion - including disk accesses - necessarily MUST dominate your choice.

Furthermore disk accesses are not created equal. There are multiple levels of cache between you and disk. Accessing data in a way that is friendly to cache will improve performance greatly. In particular managing to access data sequentially is orders of magnitude faster than jumping around. The key is not how often you "access disk", it is how often your hard drive has to do a seek. When it needs to seek it reads far more data than it is asked for and puts that in cache. When you read sequentially, most of your accesses come from cache, not disk.

That is why databases use merge-sort so much, it accesses data in exactly the way that hard drives are designed to be accessed most efficiently. A quick sort has fewer disk accesses, but far more of them cause an unwanted seek.

> when analyzing algorithm efficiency you must work out which is the slowest operation that has the steepest growth curve and work on minimizing it. since disk access is so much slower than ram access it becomes the key element rather than the classic comparison in sorts. in

You must, must, must. What is this preoccupation with must?

As I just pointed out, disk accesses are not all equal.

Secondly in many applications you will *parallelize* the slowest step, not minimize it. For instance good databases not only like to use mergesort internally, they often distribute the job to several processes or threads that all work at once, that way if one process is waiting on a disk read, others may be going at the same time.

Thirdly, and most importantly, it is more important to make code work than to make it efficient. If a stupid solution will work and a smart one should be faster, code the stupid solution first.

> a matrix transposition in ram, i would count the matrix accesses and/or copies of elements. with a larger matrix, then ram accesses would be key. my solution would load as much matrix into ram as possible (maybe using mmap but that is not critical anymore) and transpose it. then write the section out. that is 2 (large) disk accesses per chunk (or 1 per disk block). then you do a merge (assuming you can access all the section files at one time) which is another disk access per section (or block). and one more to write out the final matrix (in row order). so that is O((2 + 2) * section_count) disk accesses which isn't too bad.

You said that you want to assume that we can access all section files at once. Well suppose that I take a CSV file which is 100 columns by 10 million rows, transpose it, then try to transpose it again. Your assumption just broke. Maybe it would work for the person with the original problem, maybe not.

Here is the outline of a solution that avoids all such assumptions.

1. Run through the CSV file and output a file with lines of the format:

   $column,$row:$field

   You'll need to encode embedded newlines some way, for instance s/\\/\\\\/g; s/\n/\\n/g; - you may also want to pre-pad the columns and rows with some number of 0's so that an ASCII-betical sort does The
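The archived message is cut off after step 1, so what follows is only a guess at where the outline was heading: sort the keyed lines with a merge sort, then regroup them by column. It is a sketch in Python, not the author's code. Every function name (escape, unescape, keyed_lines, sorted_runs, transpose) is made up, the padding width of 10 digits is an arbitrary choice, and the table is assumed to be rectangular. Sorted runs are spilled to temporary files and merged sequentially, in the spirit of the merge-sort remarks above.

```python
import heapq
import itertools
import tempfile


def escape(field):
    # Double backslashes first so that the newline escape stays reversible.
    return field.replace("\\", "\\\\").replace("\n", "\\n")


def unescape(text):
    out, chars = [], iter(text)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars)
            out.append("\n" if nxt == "n" else nxt)
        else:
            out.append(ch)
    return "".join(out)


def keyed_lines(rows, width=10):
    """Step 1: one 'column,row:field' line per cell, keys zero-padded."""
    for r, row in enumerate(rows):
        for c, field in enumerate(row):
            yield f"{c:0{width}d},{r:0{width}d}:{escape(field)}\n"


def sorted_runs(lines, run_size):
    """Sort at most run_size lines in RAM at a time; spill each run to a temp file."""
    lines = iter(lines)
    while True:
        run = sorted(itertools.islice(lines, run_size))
        if not run:
            return
        tmp = tempfile.TemporaryFile("w+", encoding="utf-8", newline="\n")
        tmp.writelines(run)
        tmp.seek(0)
        yield tmp


def transpose(rows, run_size=100_000):
    """Yield the rows of the transposed table (assumed steps 2 and 3)."""
    runs = list(sorted_runs(keyed_lines(rows), run_size))
    merged = heapq.merge(*runs)  # each run file is read sequentially
    for _, group in itertools.groupby(merged, key=lambda line: line.split(",", 1)[0]):
        yield [unescape(line.rstrip("\n").split(":", 1)[1]) for line in group]
    for tmp in runs:
        tmp.close()
```

With very many runs this still opens one file per run, so a fuller version would merge in several passes; that refinement is left out to keep the sketch short.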
Re: [Boston.pm] transposing rows and columns in a CSV file
On Fri, 12 Nov 2004 09:01:32 -0500, Tolkin, Steve <[EMAIL PROTECTED]> wrote:
> I think there may be a more restrictive limit, at least on Windows. The OS must be able to find a contiguous block of virtual memory, i.e. in pagefile.sys. The paging file may not be able to grow, (depending on how it is configured) and there may not be a large enough block.

Given how mmap is supposed to work, I'd doubt this diagnosis. I'm not saying that you're wrong, I'm just saying that this would be a strange limitation.

What mmap is supposed to do is cause a section of your address space to transparently be a file on disk. It should not actually read that data in, that is done for you on demand when you access data. (I guess an implementation could do that, but the Unix ones certainly don't.) However it does need to find a contiguous block of address space in your process memory space to map that data into.

> I would like to learn more about the exact situation of memory mapping files on Windows -- the above is just based on an hour of Googling and the info below.

Well mmap is a Unix concept. To the best of my knowledge it is not natively supported in Windows. If you have it, it will be implemented by whatever Unix emulation you're using. For instance if you're using MKS then see http://mkssoftware.com/docs/man3/mmap.3.asp, if you're using cygwin see your local cygwin documentation. It may be that the emulations have additional limits that I don't know about. (I don't use Windows.)

> Non perl related info follows:
> I hit this limitation in the otherwise excellent disk indexing program Wilbur at http://wilbur.redtree.com which is free (as in beer) and open source too. (But for Windows only.)
>
> It uses memory mapped files and when one of the indexes exceeds about 500 MB it says something like "unable to map view of a file" even though my pagefile.sys is 1536 MB.
>
> The developer said:
> This is a system message that occurs when Wilbur is unable to memory map one of its index files and due to the way memory mapping works on Windows, I think this is normally a symptom of insufficient virtual memory space. Possible solutions might be increasing the size of your paging file (dig through the performance options on the system control panel to find this), defragmenting the disk and of course adding more real memory.

Googling for that, I see him saying the same thing at http://wilbur.redtree.com/cgi-bin/wilburtalk.pl?noframes;read=1770

Like him, I have no idea why pagefile.sys would enter into the picture. It certainly doesn't on Linux.

Cheers,
Ben
Re: [Boston.pm] transposing rows and columns in a CSV file
On Fri, 12 Nov 2004 07:38:57 -0500, Gyepi SAM <[EMAIL PROTECTED]> wrote:
> On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
[...]
> I think mmap would be just as ideal in Perl and a lot less work too. Rather than indexing and parsing a *large* file, you must mmap and parse it. In fact, the CSV code, which was left as an exercise in your pseudo-code, would be the only code required.

It depends on your definition of ideal. A Perl string is far more complex than a C string, and translating between the two adds complexity. It requires an external module and adds platform dependencies.

> I should point out though that mmap has a 2GB limit on systems without 64bit support. Such systems can't store files larger than that anyhow.

This is at best 2/3 correct.

First you're right that mmap has a 2 GB limit because it maps things into your address space, and so the size of your pointers limits what you can address.

It is also correct that there are complications in handling large files on 32 bit systems. Most operating systems didn't handle that case.

However today most 32 bit operating systems have support for large files, and Perl added the necessary hooks to take advantage of it several versions ago. So if you have a relatively up to date system, odds are very good that you don't have a 2 GB limit on file size. Certainly not on Windows or Linux.

Cheers,
Ben
[Boston.pm] emergency social meeting
> "bdf" == brian d foy <[EMAIL PROTECTED]> writes:

bdf> I'm in Boston from Nov 15-18.

brian d foy wants to hang with boston.pm some night next week. i mentioned this a little while ago but nothing happened so it is time to get it scheduled.

how is next tuesday, nov 16 at 7pm? we could meet at the steakhouse BEHIND legal's on rte 9 (i was told it was decent with no waiting). i don't know how mobile or boston road aware brian is, so we could also meet in town if that works out.

for those of you who don't know, bdf teaches for stonehenge (randal's biz) and is the founder of perl mongers (which you are a member of!) so let's give him his due props with a nice emergency social meeting.

uri

--
Uri Guttman -- [EMAIL PROTECTED] http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs http://jobs.perl.org
Re: [Boston.pm] transposing rows and columns in a CSV file
> "GS" == Gyepi SAM <[EMAIL PROTECTED]> writes:

GS> On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
>> Seriously, while mmap is ideal in C, in Perl I would just build an array of tell()s for each line in the file and then walk through the lines, storing the offset of the last delimiter that I'd seen.

GS> I think mmap would be just as ideal in Perl and a lot less work too.
GS> Rather than indexing and parsing a *large* file, you must mmap and parse it. In fact, the CSV code, which was left as an exercise in your pseudo-code, would be the only code required.

GS> I should point out though that mmap has a 2GB limit on systems without 64bit support. Such systems can't store files larger than that anyhow.

>> Let the kernel file buffer do your heavy lifting for you.

GS> Exactly, if s/kernel file/mmap/

this talk about mmap makes little sense to me. it may save some i/o and even some buffering but you still need the ram and mmap still causes disk accesses. if the original file is too big for ram then the algorithm chosen must be one to minimize disk accesses and mmap doesn't save those. this is why disk/tape sorts were invented, to minimize the slow disk and tape accesses. so you would still need my algorithm or something similar regardless of how you actually get the data from disk to ram. and yes i have used mmap on many projects.

when analyzing algorithm efficiency you must work out which is the slowest operation that has the steepest growth curve and work on minimizing it. since disk access is so much slower than ram access it becomes the key element rather than the classic comparison in sorts. in a matrix transposition in ram, i would count the matrix accesses and/or copies of elements. with a larger matrix, then ram accesses would be key. my solution would load as much matrix into ram as possible (maybe using mmap but that is not critical anymore) and transpose it. then write the section out. that is 2 (large) disk accesses per chunk (or 1 per disk block). then you do a merge (assuming you can access all the section files at one time) which is another disk access per section (or block). and one more to write out the final matrix (in row order). so that is O((2 + 2) * section_count) disk accesses which isn't too bad.

uri

--
Uri Guttman -- [EMAIL PROTECTED] http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs http://jobs.perl.org
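The chunk-then-merge scheme described in this message can be sketched roughly as below. This is a Python illustration under stated assumptions, not the poster's implementation: the names write_sections and merge_sections are invented, chunk_rows stands in for "as much matrix as fits in ram", rows are assumed to all have the same number of fields, and the merge assumes every section file can be held open at once (the assumption the message itself flags).

```python
import csv
import itertools
import tempfile


def write_sections(rows, chunk_rows):
    """Transpose chunk_rows input rows at a time in RAM; one temp file per chunk."""
    rows = iter(rows)
    sections = []
    while True:
        chunk = list(itertools.islice(rows, chunk_rows))
        if not chunk:
            return sections
        tmp = tempfile.TemporaryFile("w+", encoding="utf-8", newline="")
        # zip(*chunk) is the in-RAM transpose of this chunk: one output
        # line per original column.
        csv.writer(tmp).writerows(zip(*chunk))
        tmp.seek(0)
        sections.append(tmp)


def merge_sections(sections, out):
    """Read one line from every section in lockstep and join them into one row."""
    readers = [csv.reader(s) for s in sections]
    writer = csv.writer(out)
    for parts in zip(*readers):
        writer.writerow(list(itertools.chain.from_iterable(parts)))
    for s in sections:
        s.close()
```

A call such as merge_sections(write_sections(csv.reader(infile), 50_000), outfile) would wire the two halves together; the chunk size there is an arbitrary example value.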
RE: [Boston.pm] transposing rows and columns in a CSV file
I think there may be a more restrictive limit, at least on Windows. The OS must be able to find a contiguous block of virtual memory, i.e. in pagefile.sys. The paging file may not be able to grow, (depending on how it is configured) and there may not be a large enough block.

I would like to learn more about the exact situation of memory mapping files on Windows -- the above is just based on an hour of Googling and the info below.

Non perl related info follows:

I hit this limitation in the otherwise excellent disk indexing program Wilbur at http://wilbur.redtree.com which is free (as in beer) and open source too. (But for Windows only.)

It uses memory mapped files and when one of the indexes exceeds about 500 MB it says something like "unable to map view of a file" even though my pagefile.sys is 1536 MB.

The developer said:
This is a system message that occurs when Wilbur is unable to memory map one of its index files and due to the way memory mapping works on Windows, I think this is normally a symptom of insufficient virtual memory space. Possible solutions might be increasing the size of your paging file (dig through the performance options on the system control panel to find this), defragmenting the disk and of course adding more real memory.

But I already have a paging file of 1536 MB, which is the recommended size in windows XP (3 times physical memory of 512 MB). I also do not think that defragmenting the disk helps, except possibly if done at boot time to defrag pagefile.sys (However I did do that and it still failed.)

I was able to work around the problem by putting fewer words in the index, e.g. kept the default of min length = 3 and no numbers. I also have lots more files than most people, so I think very few people will hit this limitation.

Hopefully helpfully yours,
Steve
--
Steve Tolkin    Steve . Tolkin at FMR dot COM    617-563-0516
Fidelity Investments    82 Devonshire St. V4D    Boston MA 02109
There is nothing so practical as a good theory. Comments are by me, not Fidelity Investments, its subsidiaries or affiliates.

-----Original Message-----
From: Gyepi SAM [mailto:[EMAIL PROTECTED]
Sent: Friday, November 12, 2004 7:39 AM
To: [EMAIL PROTECTED]
Subject: Re: [Boston.pm] transposing rows and columns in a CSV file

On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
> Seriously, while mmap is ideal in C, in Perl I would just build an array of tell()s for each line in the file and then walk through the lines, storing the offset of the last delimiter that I'd seen.

I think mmap would be just as ideal in Perl and a lot less work too. Rather than indexing and parsing a *large* file, you must mmap and parse it. In fact, the CSV code, which was left as an exercise in your pseudo-code, would be the only code required.

I should point out though that mmap has a 2GB limit on systems without 64bit support. Such systems can't store files larger than that anyhow.

> Let the kernel file buffer do your heavy lifting for you.

Exactly, if s/kernel file/mmap/

-Gyepi

--
The convenient method is insecure and the secure method is inconvenient.
--me
Re: [Boston.pm] transposing rows and columns in a CSV file
On Fri, Nov 12, 2004 at 02:11:37AM -0500, Aaron Sherman wrote:
> Seriously, while mmap is ideal in C, in Perl I would just build an array of tell()s for each line in the file and then walk through the lines, storing the offset of the last delimiter that I'd seen.

I think mmap would be just as ideal in Perl and a lot less work too. Rather than indexing and parsing a *large* file, you must mmap and parse it. In fact, the CSV code, which was left as an exercise in your pseudo-code, would be the only code required.

I should point out though that mmap has a 2GB limit on systems without 64bit support. Such systems can't store files larger than that anyhow.

> Let the kernel file buffer do your heavy lifting for you.

Exactly, if s/kernel file/mmap/

-Gyepi

--
The convenient method is insecure and the secure method is inconvenient.
--me
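The array-of-tell()s idea quoted at the top of this message might look something like the following. It is a Python sketch of one reading of that suggestion, not the original pseudo-code (which is not in this archive). The name transpose_by_offsets is made up, and the "CSV code left as an exercise" is reduced here to splitting on a bare delimiter: no quoted fields, no embedded newlines, "\n" line endings, no blank lines, and the same number of fields on every line are all assumed.

```python
def transpose_by_offsets(path, out, delimiter=b","):
    """Transpose a simple delimited file using one stored file offset per line."""
    with open(path, "rb") as f:
        # Pass 1: remember where each line starts (the "array of tell()s").
        cursors = []
        while True:
            pos = f.tell()
            if not f.readline():
                break
            cursors.append(pos)

        def next_field(i):
            """Read the field at cursors[i], then move the cursor past its delimiter."""
            f.seek(cursors[i])
            field = bytearray()
            while True:
                ch = f.read(1)  # small reads are served from the file buffer
                if ch in (b"", b"\n"):
                    cursors[i] = None  # this line has no fields left
                    break
                if ch == delimiter:
                    cursors[i] = f.tell()
                    break
                field += ch
            return bytes(field)

        # Pass 2..n: each sweep over the lines produces one output row.
        while cursors and cursors[0] is not None:
            row = [next_field(i) for i in range(len(cursors))]
            out.write(delimiter.join(row) + b"\n")
```

Memory use is one offset per input line, so the file itself never has to fit in RAM; the cost is a seek per field, which is exactly the access pattern the rest of the thread argues about.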