Re: Sorting Large File (Code/Performance)
In article [EMAIL PROTECTED], [EMAIL PROTECTED] wrote:

Thanks to all who replied. It's very appreciated. Yes, I had to double-check line counts and the number of lines is ~16 million (instead of the stated 1.6B).

Also: What is a Unicode text file? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this: The file is UTF-8

Do the first two characters always belong to the ASCII subset? Yes, the first two always belong to the ASCII subset.

What are you going to do with it after it's sorted? I need to isolate all lines that start with two characters (zz to be particular)

Like in?

    grep '^zz' longfile

aapje

You will have a hard time beating that with Python, in every respect.

SNIP

Cheers, Ira

Groetjes Albert
--
Albert van der Horst, UTRECHT, THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
[EMAIL PROTECTED]arc.xs4all.nl http://home.hccnet.nl/a.w.m.van.der.horst
--
http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
In article [EMAIL PROTECTED], John Nagle [EMAIL PROTECTED] wrote:

[EMAIL PROTECTED] wrote: Thanks to all who replied. It's very appreciated. Yes, I had to double-check line counts and the number of lines is ~16 million (instead of the stated 1.6B).

OK, that's not bad at all. You have a few options:

- Get enough memory to do the sort with an in-memory sort, like UNIX sort or Python's sort function.

There is no Unix sort besides the executable binary /usr/bin/sort. There is qsort() in the standard C library, but you would have to write a C program around it.

- Thrash; in-memory sorts do very badly with virtual memory, but eventually they finish. Might take many hours.

For a long time I thought that too, but of course it doesn't apply to mergesorts. Then I heard reputable sources say that quicksort has good locality. Thinking about it, I find it more than plausible. Now I wonder whether even thrashing to disk (page faults) could tank a quicksort-using application. Why would it be worse than a batch sort?

- Get a serious disk-to-disk sort program. (See http://www.ordinal.com/. There's a free 30-day trial. It can probably sort your data in about a minute.)

This is the way, but you don't need/want to go with trial software.

- Load the data into a database like MySQL and let it do the work. This is slow if done wrong, but OK if done right.

- Write a distribution sort yourself. Fan out the incoming file into one file for each first letter, sort each subfile, merge the results.

With DRAM at $64 for 4GB, I'd suggest just getting more memory and using a standard in-memory sort.

You give the mistaken impression that normal sort programs can only sort what can be held in memory. This is only true for MSDOS' sort.exe, which was an abomination even in 1983. All sort programs on Unixes, way back in the 1980s, could handle files whose size was limited only by disk space.
With respect to Microsoft OSes: despite its age and using only 1 Mbyte of internal memory, my ssort.exe can handle the OP's problem with ease. It sorted a 12 Mbyte file in minutes even in 1993. http://home.hccnet.nl/a.w.m.van.der.horst/ssort.html This is a plain executable, so no hassles and no trial. It is GPL, so you can even hack the source, if you fancy C++.

There are a lot of Unix-compatible toolkits available for Windows nowadays. Install one; you can spare a dozen megabytes, can't you?

John Nagle

Groetjes Albert
--
Albert van der Horst, UTRECHT, THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
[EMAIL PROTECTED]arc.xs4all.nl http://home.hccnet.nl/a.w.m.van.der.horst
--
http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
Gabriel Genellina wrote: use the Windows sort command. It has been there since MS-DOS ages, there is no need to download and install other packages, and the documentation at http://technet.microsoft.com/en-us/library/bb491004.aspx says: Limits on file size: The sort command has no limit on file size. Sure, since no-one can ever try it with more than 640k, it's easy to state that there is no limit. :) Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
On 2008-01-27, Stefan Behnel [EMAIL PROTECTED] wrote: Gabriel Genellina wrote: use the Windows sort command. It has been there since MS-DOS ages, there is no need to download and install other packages, and the documentation at http://technet.microsoft.com/en-us/library/bb491004.aspx says: Limits on file size: The sort command has no limit on file size. Sure, since no-one can ever try it with more than 640k, it's easy to state that there is no limit. :) Huh? I used DOS sort to sort files much bigger than 640K. And I used the Unix sort command on a machine with 64K bytes of data space to sort files even larger than the ones under DOS. -- Grant -- http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
On Sun, 27 Jan 2008 10:00:45 +, Grant Edwards wrote: On 2008-01-27, Stefan Behnel [EMAIL PROTECTED] wrote: Gabriel Genellina wrote: use the Windows sort command. It has been there since MS-DOS ages, there is no need to download and install other packages, and the documentation at http://technet.microsoft.com/en-us/library/bb491004.aspx says: Limits on file size: The sort command has no limit on file size. Sure, since no-one can ever try it with more than 640k, it's easy to state that there is no limit. :) Huh? I used DOS sort to sort files much bigger than 640K. That was an allusion to a quote misattributed to Bill Gates about DOS: 640K ought to be enough for anybody. http://en.wikiquote.org/wiki/Bill_Gates#Misattributed Ciao, Marc 'BlackJack' Rintsch -- http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
En Fri, 25 Jan 2008 17:50:17 -0200, Paul Rubin http://phr.cx@NOSPAM.invalid escribió:

Nicko [EMAIL PROTECTED] writes: # The next line is order O(n) in the number of chunks (line, fileindex) = min(mergechunks)

You should use the heapq module to make this operation O(log n) instead.

Or forget about Python and use the Windows sort command. It has been there since MS-DOS ages, there is no need to download and install other packages, and the documentation at http://technet.microsoft.com/en-us/library/bb491004.aspx says: Limits on file size: The sort command has no limit on file size.

Better, since the OP only intends to extract lines starting with zz, use the findstr command:

    findstr /l /b zz filename.exe

would do the job. Why do things more complicated than that?

--
Gabriel Genellina
--
http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
On Jan 24, 9:26 pm, [EMAIL PROTECTED] wrote: If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath. I am limited with resources. Unfortunately.

As long as you have at least as much disc space spare as you need to hold a copy of the file, then this is not too hard. Split the file into chunks that are small enough to fit in memory, sort each chunk, write it to a file, and then interleave the chunks. Below is a cheap and cheesy outline of code to do this, from which you can start. For files which are hugely larger than your available memory you can do this recursively, but for files which are 10 to 100 times too big the single-pass code below will probably work just fine. The complexity is technically O(n.(log(c)+(n/c))) where n is the size of the input and c is the chunk size; once n/c (the number of chunks) exceeds log(c), the cost of merging the chunks will start to dominate, though a recursive version would be slowed by needing a lot more disc access.

    #!/usr/bin/env python
    from itertools import islice
    from tempfile import TemporaryFile
    import sys

    # Tweak this number to fill your memory
    lines_per_chunk = 10

    chunkfiles = []
    mergechunks = []

    while True:
        chunk = list(islice(sys.stdin, lines_per_chunk))
        if not chunk:
            break
        chunk.sort()
        f = TemporaryFile()
        f.writelines(chunk)
        f.seek(0)
        mergechunks.append((chunk[0], len(chunkfiles)))
        chunkfiles.append(f)

    while mergechunks:
        # The next line is order O(n) in the number of chunks
        (line, fileindex) = min(mergechunks)
        mergechunks.remove((line, fileindex))
        sys.stdout.write(line)
        nextline = chunkfiles[fileindex].readline()
        if nextline == "":
            chunkfiles[fileindex].close()
        else:
            mergechunks.append((nextline, fileindex))

--
http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
On Jan 24, 4:26 pm, [EMAIL PROTECTED] wrote:

Thanks to all who replied. It's very appreciated. Yes, I had to double-check line counts and the number of lines is ~16 million (instead of the stated 1.6B). Also: What is a Unicode text file? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this: The file is UTF-8 Do the first two characters always belong to the ASCII subset? Yes, first two always belong to ASCII subset What are you going to do with it after it's sorted? I need to isolate all lines that start with two characters (zz to be particular) Here's a start: http://docs.python.org/lib/typesseq-mutable.html Google GnuWin32 and see if their sort does what you want. Will do, thanks for the tip. If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath. I am limited with resources. Unfortunately.

Since the OP has stated that they are running Windows XP, and more than one poster has suggested installing more RAM in the box, I thought people should know that WinXP has certain limitations on the amount of memory that may be used: http://msdn2.microsoft.com/en-us/library/aa366778.aspx

Firstly, the maximum amount of physical memory that may be installed is 4GB. Secondly, with the 4-gigabyte tuning and IMAGE_FILE_LARGE_ADDRESS_AWARE patches, the maximum amount of virtual memory (physical memory + swapfile size) that may be assigned to user processes is 2GB. Hence, even if you made a 100GB swap file with 4GB RAM installed, by default only a maximum of 2GB would ever be assigned to a user process. With the two flags enabled, the maximum becomes 3GB.

If the OP finds performance to be limited and thinks more RAM would help, trying a later version of Windows would be a start, but better would be to try Linux or Mac OS X.

Cheers, Asim

Cheers, Ira
--
http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
On Jan 25, 9:23 am, Asim [EMAIL PROTECTED] wrote:

[snip]

Sorry, just to clarify my response.
Any 32-bit OS will only be able to assign 4GB of virtual memory to a single process, the argument being that since processes can only issue 32-bit addresses, a process can address a maximum of 2^32 bytes (assuming the architecture is using byte-addressed memory). Another link that's easier to grok: http://www.codinghorror.com/blog/archives/000811.html

However, a 32-bit OS may support more than 4GB of physical memory (using Physical Address Extension, or PAE) and split it more intelligently between processes than Windows XP or Vista does: http://www.ibm.com/developerworks/linux/library/l-memmod/

So allocating more than 4GB of memory to your sort application could be achieved by splitting your task into more than one process on an appropriate OS. AFAIK, such memory limitations are dependent on the particular Linux distro you're using, and I'm not sure about Mac OS X's limitations. This applies doubly for 64-bit architectures and OSes. Please correct me, with references, if my conclusions are wrong.

Cheers, Asim
--
http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
Nicko [EMAIL PROTECTED] writes: # The next line is order O(n) in the number of chunks (line, fileindex) = min(mergechunks) You should use the heapq module to make this operation O(log n) instead. -- http://mail.python.org/mailman/listinfo/python-list
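[Editor's note: a minimal sketch of the heapq-based merge Paul Rubin suggests, applied to Nicko's chunk-merging loop. The `chunkfiles` list here uses in-memory StringIO stand-ins with made-up data rather than real temporary files; the point is only the heap discipline: seed the heap once, then each extraction is a pop plus a push, O(log n) in the number of chunks instead of `min()` + `remove()`.]

```python
import heapq
from io import StringIO

# Hypothetical stand-ins for the sorted temporary chunk files.
chunkfiles = [StringIO("apple\ncherry\n"), StringIO("banana\ndate\n")]

# Seed the heap with the first line of each chunk: done once, O(n).
heap = [(f.readline(), i) for i, f in enumerate(chunkfiles)]
heapq.heapify(heap)

merged = []
while heap:
    # O(log n) pop, versus O(n) for min(mergechunks) + remove()
    line, i = heapq.heappop(heap)
    merged.append(line)
    nextline = chunkfiles[i].readline()
    if nextline:
        heapq.heappush(heap, (nextline, i))  # O(log n) push

print(merged)  # lines emerge in globally sorted order
```

(Modern Python also ships `heapq.merge()`, which wraps exactly this pattern around any set of sorted iterables.)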
Sorting Large File (Code/Performance)
Hello all, I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like to sort based on the first two characters. I'd greatly appreciate it if someone can post sample code that can help me do this. Also, any ideas on approximately how long the sort process is going to take (XP, Dual Core 2.0GHz w/2GB RAM)?

Cheers, Ira
--
http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
[EMAIL PROTECTED] writes: I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like to sort based on first two characters. I'd greatly appreciate if someone can post sample code that can help me do this. Use the unix sort command: sort inputfile -o outputfile I think there is a cygwin port. Also, any ideas on approximately how long is the sort process going to take (XP, Dual Core 2.0GHz w/2GB RAM). Eh, unix sort would probably take a while, somewhere between 15 minutes and an hour. If you only have to do it once it's not worth writing special purpose code. If you have to do it a lot, get some more ram for that box, suck the file into memory and do a radix sort. -- http://mail.python.org/mailman/listinfo/python-list
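[Editor's note: since the OP's key is only the two leading ASCII characters, the in-memory radix sort mentioned above collapses to a single bucketing pass. A sketch, assuming the key is exactly the two-byte prefix; the sample lines are made up, standing in for the real ~16 million:]

```python
from collections import defaultdict

# Hypothetical input; in practice, the lines of the 2GB file.
lines = ["zz last\n", "aa first\n", "ab second\n", "zz also last\n"]

# One pass: bucket each line by its two-character ASCII prefix...
buckets = defaultdict(list)
for line in lines:
    buckets[line[:2]].append(line)

# ...then emit the buckets in key order. Lines within a bucket keep
# their original order, so the '^zz' group comes out as one
# contiguous run, which is all the OP actually needs.
result = [line for key in sorted(buckets) for line in buckets[key]]
print(result)
```

This is O(n) plus a sort over at most 128*128 bucket keys, so the data itself is never compared line-against-line.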
Re: Sorting Large File (Code/Performance)
[EMAIL PROTECTED] wrote: Hello all, I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like to sort based on first two characters.

Given those numbers, the average number of characters per line is less than 2. Please check.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
On Jan 25, 6:18 am, [EMAIL PROTECTED] wrote: Hello all, I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like to sort based on first two characters.

If you mean 1.6 American billion, i.e. 1.6 * 1000 ** 3 lines, and 2 * 1024 ** 3 bytes of data, that's 1.34 bytes per line. If you mean other definitions of billion and/or GB, the result is even fewer bytes per line.

What is a Unicode text file? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:

    print repr(open('the_file', 'rb').read(100))

and show us the results.

What does "based on [the] first two characters" mean? Do you mean raw order based on the ordinal of each character, i.e. no fancy language-specific collating sequence? Do the first two characters always belong to the ASCII subset?

You'd like to sort a large file? Why? Sorting a file is just a means to an end, and often another means is more appropriate. What are you going to do with it after it's sorted?

I'd greatly appreciate if someone can post sample code that can help me do this.

I'm sure you would. However it would benefit you even more if, instead of sitting on the beach next to the big arrow pointing to the drop zone, you were to read the manual and work out how to do it yourself. Here's a start: http://docs.python.org/lib/typesseq-mutable.html

Also, any ideas on approximately how long is the sort process going to take (XP, Dual Core 2.0GHz w/2GB RAM).

If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath. Instead of writing Python code, you are probably better off doing an external sort. You might consider looking for a Windows port of a Unicode-capable Unix sort utility. Google GnuWin32 and see if their sort does what you want.
--
http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
Thanks to all who replied. It's very appreciated. Yes, I had to double-check line counts and the number of lines is ~16 million (instead of the stated 1.6B).

Also: What is a Unicode text file? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this: The file is UTF-8

Do the first two characters always belong to the ASCII subset? Yes, the first two always belong to the ASCII subset.

What are you going to do with it after it's sorted? I need to isolate all lines that start with two characters (zz to be particular)

Here's a start: http://docs.python.org/lib/typesseq-mutable.html Google GnuWin32 and see if their sort does what you want. Will do, thanks for the tip.

If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath. I am limited with resources. Unfortunately.

Cheers, Ira
--
http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
[EMAIL PROTECTED] wrote: What are you going to do with it after it's sorted? I need to isolate all lines that start with two characters (zz to be particular) Isolate as in extract? Remove the rest? Then why don't you extract the lines first, without sorting the file? (or sort it afterwards if you still need to). That would heavily cut down your memory footprint. Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
Stefan Behnel wrote: [EMAIL PROTECTED] wrote: What are you going to do with it after it's sorted? I need to isolate all lines that start with two characters (zz to be particular) Isolate as in extract? Remove the rest? Then why don't you extract the lines first, without sorting the file? (or sort it afterwards if you still need to). That would heavily cut down your memory footprint.

Just for fun, this is what I meant:

    for utf8_line in open(filename, 'rb'):
        if utf8_line.startswith('zz'):
            print utf8_line

Stefan
--
http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
On Jan 25, 8:26 am, [EMAIL PROTECTED] wrote: I need to isolate all lines that start with two characters (zz to be particular) What does isolate mean to you? What does this have to do with sorting? What do you actually want to do with (a) the lines starting with zz (b) the other lines? What percentage of the lines start with zz? When looking at the GnuWin32 collection: (a) Unicode is not relevant to your sort problem. (b) grab yourself a wc and a grep while you're there -- they will help with how many lines and what percentage of lines questions. -- http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
On Thursday 24 January 2008 20:56 John Nagle wrote:

[EMAIL PROTECTED] wrote: Hello all, I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like to sort based on first two characters.

Given those numbers, the average number of characters per line is less than 2. Please check.

which would be true if 1.599.999.999 lines had 2 chars and the rest of the lines just one :) (but yes, that would be an interesting question: how do you sort a 1-character line based on the first 2 characters of that line?)

martin
--
http://noneisyours.marcher.name http://feeds.feedburner.com/NoneIsYours
You are not free to read this message, by doing so, you have violated my licence and are required to urinate publicly. Thank you.
--
http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
John Nagle [EMAIL PROTECTED] writes: - Get enough memory to do the sort with an in-memory sort, like UNIX sort or Python's sort function. Unix sort does external sorting when needed. -- http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
[EMAIL PROTECTED] wrote: Thanks to all who replied. It's very appreciated. Yes, I had to double check line counts and the number of lines is ~16 million (instead of stated 1.6B).

OK, that's not bad at all. You have a few options:

- Get enough memory to do the sort with an in-memory sort, like UNIX sort or Python's sort function.
- Thrash; in-memory sorts do very badly with virtual memory, but eventually they finish. Might take many hours.
- Get a serious disk-to-disk sort program. (See http://www.ordinal.com/. There's a free 30-day trial. It can probably sort your data in about a minute.)
- Load the data into a database like MySQL and let it do the work. This is slow if done wrong, but OK if done right.
- Write a distribution sort yourself. Fan out the incoming file into one file for each first letter, sort each subfile, merge the results.

With DRAM at $64 for 4GB, I'd suggest just getting more memory and using a standard in-memory sort.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
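[Editor's note: the distribution-sort option above can be sketched in a few lines: fan out by first character into one temporary file each, sort each now-small subfile in memory, and concatenate in key order. A toy illustration with made-up data, not a tuned disk-backed implementation; because the fan-out key is a prefix of the sort key, plain concatenation replaces the k-way merge:]

```python
import os
import tempfile

# Toy input standing in for the real 2GB file (hypothetical data).
lines = ["banana\n", "apple\n", "blueberry\n", "avocado\n"]

tmpdir = tempfile.mkdtemp()

# Pass 1: fan out into one file per first letter.
for line in lines:
    with open(os.path.join(tmpdir, line[0]), "a") as f:
        f.write(line)

# Pass 2: sort each small subfile in memory; concatenating the
# subfiles in key order yields a fully sorted output, since every
# line in file 'a' precedes every line in file 'b', and so on.
output = []
for name in sorted(os.listdir(tmpdir)):
    with open(os.path.join(tmpdir, name)) as f:
        output.extend(sorted(f))

print(output)
```

A real version would keep the per-letter files open across the first pass and might fan out on two characters if single-letter buckets were still too big for memory.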
Re: Sorting Large File (Code/Performance)
Paul Rubin wrote: John Nagle [EMAIL PROTECTED] writes: - Get enough memory to do the sort with an in-memory sort, like UNIX sort or Python's sort function. Unix sort does external sorting when needed. Ah, someone finally put that in. Good. I hadn't looked at sort's manual page in many years. John Nagle -- http://mail.python.org/mailman/listinfo/python-list
Re: Sorting Large File (Code/Performance)
John Nagle [EMAIL PROTECTED] writes: Unix sort does external sorting when needed. Ah, someone finally put that in. Good. I hadn't looked at sort's manual page in many years. Huh? It has been like that from the beginning. It HAD to be. Unix was originally written on a PDP-11. The GNU version has also always been external. -- http://mail.python.org/mailman/listinfo/python-list