Re: Sorting Large File (Code/Performance)

2008-02-02 Thread Albert van der Horst
In article [EMAIL PROTECTED],
 [EMAIL PROTECTED] wrote:
> Thanks to all who replied. It's very appreciated.
>
> Yes, I had to double check line counts and the number of lines is ~16
> million (instead of stated 1.6B).
>
> Also:
>
>> What is a Unicode text file? How is it encoded: utf8, utf16, utf16le,
>> utf16be, ??? If you don't know, do this:
>
> The file is UTF-8
>
>> Do the first two characters always belong to the ASCII subset?
>
> Yes, first two always belong to ASCII subset
>
>> What are you going to do with it after it's sorted?
>
> I need to isolate all lines that start with two characters (zz, to be
> specific)

Like in:
   grep '^zz' longfile > aapje

You will have a hard time beating that with Python, in every respect.

SNIP


Cheers,

Ira

Groetjes Albert

-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
[EMAIL PROTECTED]arc.xs4all.nl =n http://home.hccnet.nl/a.w.m.van.der.horst
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-02-02 Thread Albert van der Horst
In article [EMAIL PROTECTED],
John Nagle  [EMAIL PROTECTED] wrote:
[EMAIL PROTECTED] wrote:
 Thanks to all who replied. It's very appreciated.

 Yes, I had to double check line counts and the number of lines is ~16
 million (instead of stated 1.6B).

OK, that's not bad at all.

You have a few options:

- Get enough memory to do the sort with an in-memory sort, like UNIX sort
   or Python's sort function.

There is no Unix sort besides the executable binary /usr/bin/sort.
There is qsort() in the standard C library,
but you would have to write a C program around it.

- Thrash; in-memory sorts do very badly with virtual memory, but eventually
   they finish.  Might take many hours.

For a long time I thought that too, but of course it doesn't apply to merge
sorts. Then I heard reputable sources say that quicksort has good locality.
Thinking about it, I find it more than plausible.
Now I wonder whether even thrashing to disk (page faults) could tank a
qsort-using application. Why would it be worse than a batch sort?

- Get a serious disk-to-disk sort program. (See "http://www.ordinal.com/".
   There's a free 30-day trial.  It can probably sort your data
   in about a minute.)

This is the way, but you don't need/want to go with trial software.

- Load the data into a database like MySQL and let it do the work.
   This is slow if done wrong, but OK if done right.
- Write a distribution sort yourself.  Fan out the incoming file into
   one file for each first letter, sort each subfile, merge the
   results.



With DRAM at $64 for 4GB, I'd suggest just getting more memory and using
a standard in-memory sort.

You give the mistaken impression that normal
sort programs can only sort what can be held in memory. This is
only true for MS-DOS's sort.exe, which was an abomination even in 1983.

All sort programs on Unixes, way back in the 1980s, could handle
files limited only by disk size.

With respect to Microsoft OSes:
despite its age, and using only 1 Mbyte of internal memory,
my ssort.exe can handle the OP's problem with ease.

It sorted a 12 Mbyte file in minutes even in 1993.

http://home.hccnet.nl/a.w.m.van.der.horst/ssort.html

This is a plain executable, so no hassles and no trial.
It is GPL, so you can even hack the source, if you fancy C++.

There are a lot of Unix-compatible toolkits available for
Windows nowadays. Install one; you can spare a dozen Mbytes,
can't you?


   John Nagle

Groetjes Albert

-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
[EMAIL PROTECTED]arc.xs4all.nl =n http://home.hccnet.nl/a.w.m.van.der.horst

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-27 Thread Stefan Behnel
Gabriel Genellina wrote:
 use the Windows sort command. It has been
 there since MS-DOS ages, there is no need to download and install other
 packages, and the documentation at
 http://technet.microsoft.com/en-us/library/bb491004.aspx says:
 
 Limits on file size:
   The sort command has no limit on file size.

Sure, since no-one can ever try it with more than 640k, it's easy to state
that there is no limit. :)

Stefan

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-27 Thread Grant Edwards
On 2008-01-27, Stefan Behnel [EMAIL PROTECTED] wrote:
 Gabriel Genellina wrote:
 use the Windows sort command. It has been
 there since MS-DOS ages, there is no need to download and install other
 packages, and the documentation at
 http://technet.microsoft.com/en-us/library/bb491004.aspx says:
 
 Limits on file size:
   The sort command has no limit on file size.

 Sure, since no-one can ever try it with more than 640k, it's
 easy to state that there is no limit. :)

Huh?  I used DOS sort to sort files much bigger than 640K.  And
I used the Unix sort command on a machine with 64K bytes of
data space to sort files even larger than the ones under DOS.

-- 
Grant

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-27 Thread Marc 'BlackJack' Rintsch
On Sun, 27 Jan 2008 10:00:45 +, Grant Edwards wrote:

 On 2008-01-27, Stefan Behnel [EMAIL PROTECTED] wrote:
 Gabriel Genellina wrote:
 use the Windows sort command. It has been
 there since MS-DOS ages, there is no need to download and install other
 packages, and the documentation at
 http://technet.microsoft.com/en-us/library/bb491004.aspx says:
 
 Limits on file size:
   The sort command has no limit on file size.

 Sure, since no-one can ever try it with more than 640k, it's
 easy to state that there is no limit. :)
 
 Huh?  I used DOS sort to sort files much bigger than 640K.

That was an allusion to a quote misattributed to Bill Gates about DOS:

  "640K ought to be enough for anybody."

http://en.wikiquote.org/wiki/Bill_Gates#Misattributed

Ciao,
Marc 'BlackJack' Rintsch
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-26 Thread Gabriel Genellina
En Fri, 25 Jan 2008 17:50:17 -0200, Paul Rubin
http://phr.cx@NOSPAM.invalid escribió:

 Nicko [EMAIL PROTECTED] writes:
 # The next line is order O(n) in the number of chunks
 (line, fileindex) = min(mergechunks)

 You should use the heapq module to make this operation O(log n) instead.

Or forget about Python and use the Windows sort command. It has been there  
since MS-DOS ages, there is no need to download and install other  
packages, and the documentation at  
http://technet.microsoft.com/en-us/library/bb491004.aspx says:

Limits on file size:
   The sort command has no limit on file size.

Better, since the OP only intends to extract lines starting with zz, use
the findstr command:
findstr /l /b zz filename.exe
would do the job.

Why do things more complicated than that?

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Sorting Large File (Code/Performance)

2008-01-25 Thread Nicko
On Jan 24, 9:26 pm, [EMAIL PROTECTED] wrote:
  If you really have a 2GB file and only 2GB of RAM, I suggest that you don't 
  hold your breath.

 I am limited with resources. Unfortunately.

As long as you have at least as much disc space spare as you need to
hold a copy of the file then this is not too hard.  Split the file
into chunks that are small enough to fit in memory, sort each chunk
and write it to a file and then interleave the chunks.  Below is a
cheap and cheesy outline of code to do this, from which you can start.

For files which are hugely larger than your available memory you can
do this recursively but for files which are 10 to 100 times too big
the single-pass code below will probably work just fine.  The
complexity is technically O(n.(log(c)+(n/c))) where n is the size of
input and c is the chunk size; once n/c (the number of chunks) exceeds
log(c) the cost of merging the chunks will start to dominate, though a
recursive version would be slowed by needing a lot more disc access.

#!/usr/bin/env python
from itertools import islice
from tempfile import TemporaryFile
import sys

# Tweak this number to fill your memory
lines_per_chunk = 10

chunkfiles = []
mergechunks = []

while True:
    chunk = list(islice(sys.stdin, lines_per_chunk))
    if not chunk:
        break
    chunk.sort()
    f = TemporaryFile()
    f.writelines(chunk)
    f.seek(0)
    mergechunks.append((chunk[0], len(chunkfiles)))
    chunkfiles.append(f)

while mergechunks:
    # The next line is order O(n) in the number of chunks
    (line, fileindex) = min(mergechunks)
    mergechunks.remove((line, fileindex))
    sys.stdout.write(line)
    nextline = chunkfiles[fileindex].readline()
    if nextline == '':
        chunkfiles[fileindex].close()
    else:
        mergechunks.append((nextline, fileindex))
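
For reference, assuming the outline above is saved as sortchunks.py (a
placeholder name), it reads standard input and writes the merged result
to standard output:

   python sortchunks.py < input.txt > sorted.txt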

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-25 Thread Asim
On Jan 24, 4:26 pm, [EMAIL PROTECTED] wrote:
 Thanks to all who replied. It's very appreciated.

 Yes, I had to double check line counts and the number of lines is ~16
 million (instead of stated 1.6B).

 Also:

 What is a Unicode text file? How is it encoded: utf8, utf16, utf16le, 
 utf16be, ??? If you don't know, do this:

 The file is UTF-8

  Do the first two characters always belong to the ASCII subset?

 Yes, first two always belong to ASCII subset

  What are you going to do with it after it's sorted?

 I need to isolate all lines that start with two characters (zz, to be
 specific)

  Here's a start: http://docs.python.org/lib/typesseq-mutable.html
  Google GnuWin32 and see if their sort does what you want.

 Will do, thanks for the tip.

  If you really have a 2GB file and only 2GB of RAM, I suggest that you don't 
  hold your breath.

 I am limited with resources. Unfortunately.


Since the OP has stated that they are running Windows XP, and more
than one poster has suggested installing more RAM in the box, I
thought people should know that WinXP has certain limitations on the
amount of memory that may be used:

http://msdn2.microsoft.com/en-us/library/aa366778.aspx

Firstly, the maximum amount of physical memory that may be installed
is 4GB.  Secondly, without the "4 gigabyte tuning" and
IMAGE_FILE_LARGE_ADDRESS_AWARE patches, the maximum amount of
virtual memory (physical memory + swapfile size) that may be assigned
to user processes is 2GB.

Hence, even if you made a 100GB swap file with 4GB RAM installed, by
default only a maximum of 2GB would ever be assigned to a user
process.  With the two flags enabled, the maximum becomes 3GB.

If the OP finds performance to be limited and thinks more RAM would
help, trying a later version of Windows would be a start, but better
would be to try Linux or Mac OS X.

Cheers,
Asim


 Cheers,

 Ira

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-25 Thread Asim
On Jan 25, 9:23 am, Asim [EMAIL PROTECTED] wrote:
 On Jan 24, 4:26 pm, [EMAIL PROTECTED] wrote:



  Thanks to all who replied. It's very appreciated.

  Yes, I had to double check line counts and the number of lines is ~16
  million (instead of stated 1.6B).

  Also:

  What is a Unicode text file? How is it encoded: utf8, utf16, utf16le, 
  utf16be, ??? If you don't know, do this:

  The file is UTF-8

   Do the first two characters always belong to the ASCII subset?

  Yes, first two always belong to ASCII subset

   What are you going to do with it after it's sorted?

  I need to isolate all lines that start with two characters (zz, to be
  specific)

   Here's a start: http://docs.python.org/lib/typesseq-mutable.html
   Google GnuWin32 and see if their sort does what you want.

  Will do, thanks for the tip.

   If you really have a 2GB file and only 2GB of RAM, I suggest that you 
   don't hold your breath.

  I am limited with resources. Unfortunately.

 Since the OP has stated that they are running Windows XP, and more
 than one poster has suggested installing more RAM in the box, I
 thought people should know that WinXP has certain limitations on the
 amount of memory that may be used:

 http://msdn2.microsoft.com/en-us/library/aa366778.aspx

 Firstly, the maximum amount of physical memory that may be installed
 is 4GB.  Secondly, without the "4 gigabyte tuning" and
 IMAGE_FILE_LARGE_ADDRESS_AWARE patches, the maximum amount of
 virtual memory (physical memory + swapfile size) that may be assigned
 to user processes is 2GB.

 Hence, even if you made a 100GB swap file with 4GB RAM installed, by
 default only a maximum of 2GB would ever be assigned to a user
 process.  With the two flags enabled, the maximum becomes 3GB.

 If the OP finds performance to be limited and thinks more RAM would
 help, trying a later version of Windows would be a start, but better
 would be to try Linux or Mac OS X.

 Cheers,
 Asim

  Cheers,

  Ira

Sorry, just to clarify my response.  Any 32-bit OS will only be able
to assign 4GB of virtual memory to a single process, the argument
being that since a process can only issue 32-bit addresses, it can
only address a maximum of 2^32 bytes
(assuming the architecture uses byte-addressed memory).

Another link that's easier to grok:

http://www.codinghorror.com/blog/archives/000811.html

However, a 32-bit OS may support more than 4GB of virtual memory
(using Physical Address Extension, or PAE) and split it more
intelligently between processes than Windows XP or Vista does:

http://www.ibm.com/developerworks/linux/library/l-memmod/

So allocating more than 4GB of virtual memory to your sort application
could be achieved through splitting your task into more than one
process on an appropriate OS.  AFAIK, such memory limitations are
dependent on the particular Linux distro you're using, and I'm not
sure about Mac OS X's limitations.  This applies doubly for 64-bit
architectures and OSes.

Please correct me, with references, if my conclusions are wrong.

Cheers,
Asim
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-25 Thread Paul Rubin
Nicko [EMAIL PROTECTED] writes:
 # The next line is order O(n) in the number of chunks
 (line, fileindex) = min(mergechunks)

You should use the heapq module to make this operation O(log n) instead.
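
A minimal sketch of that substitution, reusing the mergechunks and
chunkfiles lists built by the first loop of Nicko's outline (so this
replaces only his merge loop); heapify arranges the list once, and each
pop/push is then O(log n) in the number of chunks:

import heapq
import sys

heapq.heapify(mergechunks)      # one-time O(n) pass over the chunk heads
while mergechunks:
    # Pop the smallest (line, fileindex) pair in O(log n)
    line, fileindex = heapq.heappop(mergechunks)
    sys.stdout.write(line)
    nextline = chunkfiles[fileindex].readline()
    if nextline == '':
        chunkfiles[fileindex].close()
    else:
        heapq.heappush(mergechunks, (nextline, fileindex))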
-- 
http://mail.python.org/mailman/listinfo/python-list


Sorting Large File (Code/Performance)

2008-01-24 Thread Ira . Kovac
Hello all,

I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
to sort based on first two characters.

I'd greatly appreciate if someone can post sample code that can help
me do this.

Also, any ideas on approximately how long the sort process is going to
take (XP, Dual Core 2.0GHz w/2GB RAM).

Cheers,

Ira

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Paul Rubin
[EMAIL PROTECTED] writes:
 I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
 to sort based on first two characters.
 
 I'd greatly appreciate if someone can post sample code that can help
 me do this.

Use the unix sort command:

   sort inputfile -o outputfile 

I think there is a cygwin port.

 Also, any ideas on approximately how long the sort process is going to
 take (XP, Dual Core 2.0GHz w/2GB RAM).

Eh, unix sort would probably take a while, somewhere between 15
minutes and an hour.  If you only have to do it once it's not worth
writing special purpose code.  If you have to do it a lot, get some
more RAM for that box, suck the file into memory and do a radix sort.
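
As a rough sketch of that idea -- assuming the whole file fits in
memory, that only the first two (ASCII) bytes matter for the key, and
placeholder file names input.txt and sorted.txt -- one bucketing pass
replaces a general comparison sort:

from collections import defaultdict

buckets = defaultdict(list)
for line in open('input.txt', 'rb'):
    # The first two bytes are the whole sort key; lines keep their
    # original relative order within each bucket, so this is stable.
    buckets[line[:2]].append(line)

out = open('sorted.txt', 'wb')
for key in sorted(buckets):     # few distinct 2-byte keys, cheap to sort
    out.writelines(buckets[key])
out.close()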
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Nagle
[EMAIL PROTECTED] wrote:
 Hello all,
 
 I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
 to sort based on first two characters.

Given those numbers, the average number of characters per line is
less than 2.  Please check.

John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Machin
On Jan 25, 6:18 am, [EMAIL PROTECTED] wrote:
 Hello all,

 I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
 to sort based on first two characters.

If you mean 1.6 American billion, i.e. 1.6 * 1000 ** 3 lines, and 2 *
1024 ** 3 bytes of data, that's 1.34 bytes per line. If you mean other
definitions of "billion" and/or "GB", the result is even fewer bytes
per line.

What is a Unicode text file? How is it encoded: utf8, utf16,
utf16le, utf16be, ??? If you don't know, do this:

print repr(open('the_file', 'rb').read(100))

and show us the results.

What does "based on [the] first two characters" mean? Do you mean raw
order based on the ordinal of each character, i.e. no fancy language-
specific collating sequence? Do the first two characters always belong
to the ASCII subset?

You'd like to sort a large file? Why? Sorting a file is just a means
to an end, and often another means is more appropriate. What are you
going to do with it after it's sorted?

 I'd greatly appreciate if someone can post sample code that can help
 me do this.

I'm sure you would. However it would benefit you even more if instead
of sitting on the beach next to the big arrow pointing to the drop
zone, you were to read the manual and work out how to do it yourself.
Here's a start: http://docs.python.org/lib/typesseq-mutable.html

 Also, any ideas on approximately how long the sort process is going to
 take (XP, Dual Core 2.0GHz w/2GB RAM).

If you really have a 2GB file and only 2GB of RAM, I suggest that you
don't hold your breath.

Instead of writing Python code, you are probably better off doing an
external sort. You might consider looking for a Windows port of a
Unicode-capable Unix sort utility. Google GnuWin32 and see if their
sort does what you want.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Ira . Kovac
Thanks to all who replied. It's very appreciated.

Yes, I had to double check line counts and the number of lines is ~16
million (instead of stated 1.6B).

Also:

> What is a Unicode text file? How is it encoded: utf8, utf16, utf16le,
> utf16be, ??? If you don't know, do this:

The file is UTF-8

> Do the first two characters always belong to the ASCII subset?

Yes, first two always belong to ASCII subset

> What are you going to do with it after it's sorted?

I need to isolate all lines that start with two characters (zz, to be
specific)

> Here's a start: http://docs.python.org/lib/typesseq-mutable.html
> Google GnuWin32 and see if their sort does what you want.

Will do, thanks for the tip.

> If you really have a 2GB file and only 2GB of RAM, I suggest that you don't
> hold your breath.

I am limited with resources. Unfortunately.

Cheers,

Ira
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Stefan Behnel
[EMAIL PROTECTED] wrote:
 What are you going to do with it after it's sorted?
 I need to isolate all lines that start with two characters (zz, to be
 specific)

"Isolate" as in "extract"? Remove the rest?

Then why don't you extract the lines first, without sorting the file? (or sort
it afterwards if you still need to). That would heavily cut down your memory
footprint.

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Stefan Behnel
Stefan Behnel wrote:
 [EMAIL PROTECTED] wrote:
 What are you going to do with it after it's sorted?
 I need to isolate all lines that start with two characters (zz, to be
 specific)
 
 "Isolate" as in "extract"? Remove the rest?
 
 Then why don't you extract the lines first, without sorting the file? (or sort
 it afterwards if you still need to). That would heavily cut down your memory
 footprint.

Just for fun, this is what I meant:

for utf8_line in open(filename, 'rb'):
    if utf8_line.startswith('zz'):
        print utf8_line
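
For ~16 million lines it may be more practical to send the matches to a
file rather than the console; a minimal variant of the above, with
zz_lines.txt as a placeholder output name:

out = open('zz_lines.txt', 'wb')
for utf8_line in open(filename, 'rb'):
    if utf8_line.startswith('zz'):
        out.write(utf8_line)        # keeps the original line ending
out.close()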

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Machin
On Jan 25, 8:26 am, [EMAIL PROTECTED] wrote:

 I need to isolate all lines that start with two characters (zz, to be
 specific)

What does "isolate" mean to you? What does this have to do with
sorting?  What do you actually want to do with (a) the lines starting
with zz, (b) the other lines? What percentage of the lines start with
zz?

When looking at the GnuWin32 collection:
(a) Unicode is not relevant to your sort problem.
(b) grab yourself a wc and a grep while you're there -- they will help
with the "how many lines" and "what percentage of lines" questions.





-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Martin Marcher
On Thursday 24 January 2008 20:56 John Nagle wrote:

 [EMAIL PROTECTED] wrote:
 Hello all,
 
 I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
 to sort based on first two characters.
 
 Given those numbers, the average number of characters per line is
 less than 2.  Please check.

which would be true if 1,599,999,999 lines had 2 chars and the rest of the
lines just one :)

(but yes, it would be an interesting question how to sort a 1-character
line based on the first 2 characters of that line)

martin





-- 
http://noneisyours.marcher.name
http://feeds.feedburner.com/NoneIsYours

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Paul Rubin
John Nagle [EMAIL PROTECTED] writes:
 - Get enough memory to do the sort with an in-memory sort, like
   UNIX sort or Python's sort function.

Unix sort does external sorting when needed.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Nagle
[EMAIL PROTECTED] wrote:
 Thanks to all who replied. It's very appreciated.
 
 Yes, I had to double check line counts and the number of lines is ~16
 million (instead of stated 1.6B).

OK, that's not bad at all.

You have a few options:

- Get enough memory to do the sort with an in-memory sort, like UNIX sort
   or Python's sort function.
- Thrash; in-memory sorts do very badly with virtual memory, but eventually
   they finish.  Might take many hours.
- Get a serious disk-to-disk sort program. (See "http://www.ordinal.com/".
   There's a free 30-day trial.  It can probably sort your data
   in about a minute.)
- Load the data into a database like MySQL and let it do the work.
   This is slow if done wrong, but OK if done right.
- Write a distribution sort yourself.  Fan out the incoming file into
   one file for each first letter, sort each subfile, merge the
   results.
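
A rough sketch of that last option, assuming ASCII first characters,
that each per-letter bucket fits in memory, and placeholder file names;
because the buckets partition the key space in order, the merge step
reduces to concatenating the sorted buckets:

import os

# Fan out: one temporary bucket file per distinct first character
outfiles = {}
for line in open('input.txt', 'rb'):
    first = line[:1]
    if first not in outfiles:
        outfiles[first] = open('bucket_%d.tmp' % ord(first), 'w+b')
    outfiles[first].write(line)

# Sort each bucket in memory and append it to the result in key order
result = open('sorted.txt', 'wb')
for first in sorted(outfiles):
    f = outfiles[first]
    f.seek(0)
    chunk = f.readlines()
    chunk.sort()
    result.writelines(chunk)
    f.close()
    os.remove('bucket_%d.tmp' % ord(first))
result.close()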

With DRAM at $64 for 4GB, I'd suggest just getting more memory and using
a standard in-memory sort.

John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-24 Thread John Nagle
Paul Rubin wrote:
 John Nagle [EMAIL PROTECTED] writes:
 - Get enough memory to do the sort with an in-memory sort, like
   UNIX sort or Python's sort function.
 
 Unix sort does external sorting when needed.

   Ah, someone finally put that in.  Good. I hadn't looked at sort's manual
page in many years.

John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Sorting Large File (Code/Performance)

2008-01-24 Thread Paul Rubin
John Nagle [EMAIL PROTECTED] writes:
  Unix sort does external sorting when needed.
 
Ah, someone finally put that in.  Good. I hadn't looked at
 sort's manual page in many years.

Huh?  It has been like that from the beginning.  It HAD to be.  Unix
was originally written on a PDP-11.  The GNU version has also always
been external.
-- 
http://mail.python.org/mailman/listinfo/python-list