Re: Sorting Large File (Code/Performance)

2008-02-02 Thread Albert van der Horst
In article <[EMAIL PROTECTED]>, John Nagle <[EMAIL PROTECTED]> wrote: >[EMAIL PROTECTED] wrote: >> Thanks to all who replied. It's very appreciated. >> >> Yes, I had to double check line counts and the number of lines is ~16 >> million (instead of stated 1.6B). > >OK, that's not bad at all. >

Re: Sorting Large File (Code/Performance)

2008-02-02 Thread Albert van der Horst
In article <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]> wrote: >Thanks to all who replied. It's very appreciated. > >Yes, I had to double check line counts and the number of lines is ~16 >million (instead of stated 1.6B). > >Also: > >>What is a "Unicode text file"? How is it encoded: utf8, utf16, utf1

Re: Sorting Large File (Code/Performance)

2008-01-27 Thread Marc 'BlackJack' Rintsch
On Sun, 27 Jan 2008 10:00:45 +, Grant Edwards wrote: > On 2008-01-27, Stefan Behnel <[EMAIL PROTECTED]> wrote: >> Gabriel Genellina wrote: >>> use the Windows sort command. It has been >>> there since MS-DOS ages, there is no need to download and install other >>> packages, and the documentati

Re: Sorting Large File (Code/Performance)

2008-01-27 Thread Grant Edwards
On 2008-01-27, Stefan Behnel <[EMAIL PROTECTED]> wrote: > Gabriel Genellina wrote: >> use the Windows sort command. It has been >> there since MS-DOS ages, there is no need to download and install other >> packages, and the documentation at >> http://technet.microsoft.com/en-us/library/bb491004.asp

Re: Sorting Large File (Code/Performance)

2008-01-27 Thread Stefan Behnel
Gabriel Genellina wrote: > use the Windows sort command. It has been > there since MS-DOS ages, there is no need to download and install other > packages, and the documentation at > http://technet.microsoft.com/en-us/library/bb491004.aspx says: > > Limits on file size: > The sort command has no

Re: Sorting Large File (Code/Performance)

2008-01-26 Thread Gabriel Genellina
On Fri, 25 Jan 2008 17:50:17 -0200, Paul Rubin <"http://phr.cx"@NOSPAM.invalid> wrote: > Nicko <[EMAIL PROTECTED]> writes: >> # The next line is order O(n) in the number of chunks >> (line, fileindex) = min(mergechunks) > > You should use the heapq module to make this operation O(log

Re: Sorting Large File (Code/Performance)

2008-01-25 Thread Paul Rubin
Nicko <[EMAIL PROTECTED]> writes: > # The next line is order O(n) in the number of chunks > (line, fileindex) = min(mergechunks) You should use the heapq module to make this operation O(log n) instead. -- http://mail.python.org/mailman/listinfo/python-list
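A minimal sketch of the heapq suggestion, not Nicko's actual code (only two lines of it are quoted here). Keeping the current head line of every chunk in a heap makes picking the smallest one O(log n) in the number of chunks, instead of the O(n) scan that min() performs. The function name merge_sorted_chunks and the use of plain iterables in place of chunk files are assumptions made for illustration.

```python
import heapq


def merge_sorted_chunks(chunks):
    """Yield lines from several already-sorted iterables in sorted order."""
    iterators = [iter(chunk) for chunk in chunks]

    # Seed the heap with the first line of every non-empty chunk.
    heap = []
    for fileindex, it in enumerate(iterators):
        line = next(it, None)
        if line is not None:
            heap.append((line, fileindex))
    heapq.heapify(heap)

    while heap:
        # heap[0] is the smallest (line, fileindex) pair.
        line, fileindex = heap[0]
        yield line
        following = next(iterators[fileindex], None)
        if following is None:
            heapq.heappop(heap)  # this chunk is exhausted
        else:
            # Swap the head for the next line of the same chunk: O(log n).
            heapq.heapreplace(heap, (following, fileindex))
```

The standard library's heapq.merge (Python 2.6 and later) packages the same idea, so in practice the loop above rarely needs to be written by hand.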

Re: Sorting Large File (Code/Performance)

2008-01-25 Thread Nicko
On Jan 24, 9:26 pm, [EMAIL PROTECTED] wrote: > > If you really have a 2GB file and only 2GB of RAM, I suggest that you don't > > hold your breath. > > I am limited with resources. Unfortunately. As long as you have at least as much disc space spare as you need to hold a copy of the file then this
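The message above is cut off, so what follows is only a guess at the kind of approach it leads into: a chunked external sort, which needs roughly one extra copy of the file on disc but very little RAM. The names external_sort, sort_key and lines_per_chunk are hypothetical, the key (first two characters) is taken from the original question, and the key argument of heapq.merge needs Python 3.5 or later.

```python
import heapq
import itertools
import os
import tempfile


def sort_key(line):
    # The original question asks for a sort on the first two characters.
    return line[:2]


def external_sort(in_path, out_path, lines_per_chunk=1_000_000):
    """Sort a UTF-8 text file using sorted temporary chunk files."""
    chunk_paths = []
    try:
        # Pass 1: read a bounded number of lines, sort them in memory,
        # and write each sorted chunk to its own temporary file.
        with open(in_path, encoding="utf-8") as src:
            while True:
                chunk = list(itertools.islice(src, lines_per_chunk))
                if not chunk:
                    break
                # Make sure a final line without a newline cannot fuse
                # with its neighbour after reordering.
                chunk = [l if l.endswith("\n") else l + "\n" for l in chunk]
                chunk.sort(key=sort_key)
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)

        # Pass 2: merge all sorted chunks lazily into the output file.
        files = [open(p, encoding="utf-8") for p in chunk_paths]
        try:
            with open(out_path, "w", encoding="utf-8") as dst:
                dst.writelines(heapq.merge(*files, key=sort_key))
        finally:
            for f in files:
                f.close()
    finally:
        for p in chunk_paths:
            os.remove(p)
```

The chunk files go to the system's default temporary directory, so that is where the spare disc space has to be. With ~16 million lines, a lines_per_chunk of one million gives about 16 chunk files.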

Re: Sorting Large File (Code/Performance)

2008-01-25 Thread Asim
On Jan 25, 9:23 am, Asim <[EMAIL PROTECTED]> wrote: > On Jan 24, 4:26 pm, [EMAIL PROTECTED] wrote: > > > > > Thanks to all who replied. It's very appreciated. > > > Yes, I had to double check line counts and the number of lines is ~16 > > million (instead of stated 1.6B). > > > Also: > > > >What is

Re: Sorting Large File (Code/Performance)

2008-01-25 Thread Asim
On Jan 24, 4:26 pm, [EMAIL PROTECTED] wrote: > Thanks to all who replied. It's very appreciated. > > Yes, I had to double check line counts and the number of lines is ~16 > million (instead of stated 1.6B). > > Also: > > >What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, > >u

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Paul Rubin
John Nagle <[EMAIL PROTECTED]> writes: > > Unix sort does external sorting when needed. > >Ah, someone finally put that in. Good. I hadn't looked at > "sort"'s manual page in many years. Huh? It has been like that from the beginning. It HAD to be. Unix was originally written on a PDP-11.

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Nagle
Paul Rubin wrote: > John Nagle <[EMAIL PROTECTED]> writes: >> - Get enough memory to do the sort with an in-memory sort, like >> UNIX "sort" or Python's "sort" function. > > Unix sort does external sorting when needed. Ah, someone finally put that in. Good. I hadn't looked at "sort"

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Paul Rubin
John Nagle <[EMAIL PROTECTED]> writes: > - Get enough memory to do the sort with an in-memory sort, like > UNIX "sort" or Python's "sort" function. Unix sort does external sorting when needed.

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Nagle
[EMAIL PROTECTED] wrote: > Thanks to all who replied. It's very appreciated. > > Yes, I had to double check line counts and the number of lines is ~16 > million (instead of stated 1.6B). OK, that's not bad at all. You have a few options: - Get enough memory to do the sort with an in

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Martin Marcher
On Thursday 24 January 2008 20:56 John Nagle wrote: > [EMAIL PROTECTED] wrote: >> Hello all, >> >> I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like >> to sort based on the first two characters. > > Given those numbers, the average number of characters per line is > less tha

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Machin
On Jan 25, 8:26 am, [EMAIL PROTECTED] wrote: > I need to isolate all lines that start with two characters (zz to be > particular) What does "isolate" mean to you? What does this have to do with sorting? What do you actually want to do with (a) the lines starting with "zz" (b) the other lines? Wh

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Stefan Behnel
Stefan Behnel wrote: > [EMAIL PROTECTED] wrote: >>> What are you going to do with it after it's sorted? >> I need to isolate all lines that start with two characters (zz to be >> particular) > > "Isolate" as in "extract"? Remove the rest? > > Then why don't you extract the lines first, without so

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Stefan Behnel
[EMAIL PROTECTED] wrote: >> What are you going to do with it after it's sorted? > I need to isolate all lines that start with two characters (zz to be > particular) "Isolate" as in "extract"? Remove the rest? Then why don't you extract the lines first, without sorting the file? (or sort it afterw
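A minimal sketch of the extract-first idea: one streaming pass that copies only the lines with the wanted prefix, so nothing has to be sorted or held in memory. The function name extract_lines and the file paths are hypothetical; UTF-8 is assumed because the poster states elsewhere in the thread that the file is UTF-8.

```python
def extract_lines(in_path, out_path, prefix="zz"):
    """Copy lines starting with `prefix` to out_path; return how many matched."""
    count = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:  # reads one line at a time, never the whole file
            if line.startswith(prefix):
                dst.write(line)
                count += 1
    return count
```

If the extracted lines still need ordering, the result is likely small enough to sort in memory afterwards.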

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Ira . Kovac
Thanks to all who replied. It's very appreciated. Yes, I had to double check line counts and the number of lines is ~16 million (instead of stated 1.6B). Also: >What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, >utf16be, ??? If you don't know, do this: The file is UTF-8 >

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Machin
On Jan 25, 6:18 am, [EMAIL PROTECTED] wrote: > Hello all, > > I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like > to sort based on the first two characters. If you mean 1.6 American billion i.e. 1.6 * 1000 ** 3 lines, and 2 * 1024 ** 3 bytes of data, that's 1.34 bytes per line. If
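The sanity check above can be reproduced in a few lines. The second figure uses the corrected count of about 16 million lines that the poster reports later in the thread; the variable names are only illustrative.

```python
size_bytes = 2 * 1024 ** 3            # ~2GB of data
stated_lines = 1_600_000_000          # 1.6 * 1000 ** 3, as first posted
corrected_lines = 16_000_000          # ~16 million, as later corrected

bytes_per_line = size_bytes / stated_lines
corrected_bytes_per_line = size_bytes / corrected_lines

print(round(bytes_per_line, 2))            # 1.34
print(round(corrected_bytes_per_line, 2))  # 134.22
```

At 1.34 bytes per line there would be no room for two key characters plus a line terminator, which is why the original figure looked wrong.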

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Nagle
[EMAIL PROTECTED] wrote: > Hello all, > > I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like > to sort based on the first two characters. Given those numbers, the average number of characters per line is less than 2. Please check. John

Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Paul Rubin
[EMAIL PROTECTED] writes: > I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like > to sort based on the first two characters. > > I'd greatly appreciate it if someone can post sample code that can help > me do this. Use the unix sort command: sort inputfile -o outputfile I think
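For anyone who wants to drive that command from Python, here is a minimal sketch, assuming a Unix-style sort is on the PATH (the Windows sort command discussed elsewhere in the thread takes different options). The wrapper name sort_with_unix_sort is hypothetical. Setting LC_ALL=C is an added assumption, not part of the quoted advice: it makes sort compare raw bytes, which for UTF-8 text is the same as code point order. The -o option is placed before the input file here, the most portable position.

```python
import os
import subprocess


def sort_with_unix_sort(inputfile, outputfile):
    """Sort inputfile into outputfile using the external Unix sort command."""
    # LC_ALL=C: plain byte-wise comparison, independent of the user's locale.
    env = dict(os.environ, LC_ALL="C")
    subprocess.run(["sort", "-o", outputfile, inputfile], check=True, env=env)
```

This sorts on whole lines; since the first two characters are the leading part of each line, the output is also ordered by those two characters.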

Sorting Large File (Code/Performance)

2008-01-24 Thread Ira . Kovac
Hello all, I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like to sort based on the first two characters. I'd greatly appreciate it if someone can post sample code that can help me do this. Also, any ideas on approximately how long the sort process is going to take (XP, Dual Core 2.0