Sorting Large File (Code/Performance)
Hello all, I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like to sort based on the first two characters of each line. I'd greatly appreciate it if someone could post sample code that can help me do this. Also, any idea of approximately how long the sort will take (XP, Dual Core 2.0GHz w/2GB RAM)? Cheers, Ira -- http://mail.python.org/mailman/listinfo/python-list
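For reference, a minimal sketch of the sort being asked about, assuming Python 3, a UTF-8 encoded file, and enough free memory to hold every line at once (which a ~2GB file on a 2GB machine may well not have). The function name `sort_by_prefix` and its parameters are made up for illustration.

```python
def sort_by_prefix(src_path, dst_path, prefix_len=2, encoding="utf-8"):
    """Sort the lines of src_path by their first prefix_len characters.

    Assumption: the whole file fits in memory.
    """
    with open(src_path, "r", encoding=encoding) as src:
        lines = src.readlines()  # loads every line into memory

    # Make sure the last line ends with a newline so it cannot
    # merge with another line once the order changes.
    lines = [ln if ln.endswith("\n") else ln + "\n" for ln in lines]

    # list.sort is stable: lines with the same prefix keep their
    # original relative order.
    lines.sort(key=lambda ln: ln[:prefix_len])

    with open(dst_path, "w", encoding=encoding) as dst:
        dst.writelines(lines)
    return len(lines)
```

Sorting only on the two-character key, rather than on the whole line, keeps the comparisons cheap; the dominant cost is likely to be reading and writing the file.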
Re: Sorting Large File (Code/Performance)
Thanks to all who replied. It's much appreciated.

Yes, I had to double-check the line count, and the number of lines is ~16 million (instead of the 1.6 billion I stated).

Also:

> What is a Unicode text file? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:

The file is UTF-8.

> Do the first two characters always belong to the ASCII subset?

Yes, the first two always belong to the ASCII subset.

> What are you going to do with it after it's sorted?

I need to isolate all lines that start with two particular characters (zz, to be specific).

> Here's a start: http://docs.python.org/lib/typesseq-mutable.html
> Google GnuWin32 and see if their sort does what you want.

Will do, thanks for the tip.

> If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.

Unfortunately, I am limited in resources.

Cheers, Ira
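Since the stated goal is only to isolate the lines that start with zz, a full sort may not be needed. Below is a minimal sketch, assuming Python 3 and the UTF-8 encoding mentioned above; the function name `extract_lines_with_prefix` and the file paths are hypothetical.

```python
def extract_lines_with_prefix(src_path, dst_path, prefix="zz", encoding="utf-8"):
    """Copy every line of src_path that starts with prefix to dst_path.

    The file is read one line at a time, so memory use stays small
    regardless of the file size.
    """
    count = 0
    with open(src_path, "r", encoding=encoding) as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        for line in src:  # iterating a file object streams it line by line
            if line.startswith(prefix):
                dst.write(line)
                count += 1
    return count
```

A call such as `extract_lines_with_prefix("big.txt", "zz.txt")` makes a single pass over the input, so the running time should be close to the time needed to read the file once from disk.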
Filtering content of a text file
Hello All,

I'd greatly appreciate it if you could take a look at the task I need help with. It'd be outstanding if someone could provide some sample Python code.

Thanks a lot,
Ira

--- Problem ---

I am working with 30K+ record datasets in flat file format (.txt) that look like this:

//-+alibaba sinage
//-+amra damian//_9
//-+anix anire//_
//-+borom
//-+bokima sun drane
//-+ciren
//-+cop calestieon eded
//-+ciciban
//-+drago kimano sole

Every record starts with the same string (//-+ in the example), which is followed by another string of characters that changes from record to record.

I am working on one file at a time, and for each file I need to be able to do the following:

a) By looping through the file, the program should isolate all records that have the letter a following the //-+
b) The isolated dataset will contain only records that start with //-+a
c) Save the isolated dataset as a flat text file named a.txt
d) Repeat a), b) and c) for all letters of the English alphabet (a through z) and numerical values (0 through 9)
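Steps a) through d) can be sketched roughly as follows. This is only one possible approach: it groups the records in a single pass over the file instead of looping once per character. It assumes Python 3, a UTF-8 encoded input file, and that only lowercase letters and digits count as keys; the function name `split_by_first_char` and the output directory argument are made up for illustration.

```python
import os
import string

PREFIX = "//-+"  # common record prefix from the example data
KEYS = string.ascii_lowercase + string.digits  # a-z and 0-9

def split_by_first_char(src_path, out_dir, prefix=PREFIX):
    """Write records to <out_dir>/<char>.txt, grouped by the character after prefix.

    Returns a dict mapping each character to the number of records written.
    Assumption: input is UTF-8 and matching is case-sensitive.
    """
    buckets = {}
    with open(src_path, "r", encoding="utf-8") as src:
        for line in src:
            # Skip lines without the prefix or with nothing after it.
            if not line.startswith(prefix) or len(line) <= len(prefix):
                continue
            key = line[len(prefix)]
            if key in KEYS:
                if not line.endswith("\n"):
                    line += "\n"  # last line of the file may lack a newline
                buckets.setdefault(key, []).append(line)

    # One output file per character that actually occurred, e.g. a.txt.
    for key, records in buckets.items():
        with open(os.path.join(out_dir, key + ".txt"), "w", encoding="utf-8") as out:
            out.writelines(records)
    return {key: len(records) for key, records in buckets.items()}
```

With 30K+ records per file, holding the grouped lines in memory should not be a problem. No file is created for a character that never occurs after the prefix.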
Re: Filtering content of a text file
Thanks, all, for the input. This is going to be a great basis for getting started. And, yeah, I wish it were homework. Best, Ira