Re: shuffle the lines of a large file

2005-03-12 Thread paul koelle
Joerg Schuster wrote: Thanks to all. This thread shows again that Python's best feature is comp.lang.python. from comp.lang import python ;) Paul -- http://mail.python.org/mailman/listinfo/python-list

Re: shuffle the lines of a large file

2005-03-11 Thread Simon Brunning
On Fri, 11 Mar 2005 06:59:33 +0100, Heiko Wundram wrote: On Tuesday 08 March 2005 15:55, Simon Brunning wrote: Ah, but that's the clever bit; it *doesn't* store the whole list - only the selected lines. But that means that it'll only read several lines from the file,

Re: shuffle the lines of a large file

2005-03-11 Thread Peter Otten
Simon Brunning wrote: I couldn't resist. ;-) Me neither...
    import random
    def randomLines(filename, lines=1):
        selected_lines = list(None for line_no in xrange(lines))
        for line_index, line in enumerate(open(filename)):
            for selected_line_index in xrange(lines):
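The loop quoted above is cut off in this archive, so what follows is only a guess at how such a one-pass selection could be finished, written for Python 3. The replacement rule (slot replaced by line number i, counting from 0, with probability 1/(i+1)), the name random_lines, and the rng parameter are assumptions, not the poster's code. Each slot is filled independently, so the same line can be picked more than once.

```python
import random


def random_lines(lines, count=1, rng=random):
    """Pick `count` lines from an iterable in a single pass.

    Only the current picks are held in memory, never the whole input.
    """
    selected = [None] * count
    for index, line in enumerate(lines):
        for slot in range(count):
            # Each slot is an independent one-item reservoir: the line at
            # position `index` replaces it with probability 1/(index + 1).
            if rng.random() * (index + 1) < 1:
                selected[slot] = line
    return selected


if __name__ == "__main__":
    sample = ["first\n", "second\n", "third\n", "fourth\n"]
    print(random_lines(sample, count=2))
```

Passing an open file object as `lines` reads the file once from start to end, which is the point of the approach for a file that does not fit in RAM.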

Re: shuffle the lines of a large file

2005-03-10 Thread Stefan Behnel
Simon Brunning wrote: On Tue, 8 Mar 2005 14:13:01 +, Simon Brunning wrote:
    selected_lines = list(None for line_no in xrange(lines))
Just a short note on this line. If lines is really large, it's much faster to use
    from itertools import repeat
    selected_lines = list(repeat(None, lines))

Re: shuffle the lines of a large file

2005-03-10 Thread Heiko Wundram
On Tuesday 08 March 2005 15:55, Simon Brunning wrote: Ah, but that's the clever bit; it *doesn't* store the whole list - only the selected lines. But that means that it'll only read several lines from the file, never do a shuffle of the whole file content... When you'd want to shuffle the file

Re: shuffle the lines of a large file

2005-03-08 Thread Nick Craig-Wood
Raymond Hettinger wrote:
    from random import random
    out = open('corpus.decorated', 'w')
    for line in open('corpus.uniq'):
        print >> out, '%.14f %s' % (random(), line),
    out.close()
    sort corpus.decorated | cut -c 18- > corpus.randomized
Very good solution! Sort

Re: shuffle the lines of a large file

2005-03-08 Thread Simon Brunning
On 7 Mar 2005 06:38:49 -0800, gry@ll.mit.edu gry@ll.mit.edu wrote: As far as I can tell, what you ultimately want is to be able to extract a random (representative?) subset of sentences. If this is what's wanted, then perhaps some variation on this cookbook recipe might do the trick:

Re: shuffle the lines of a large file

2005-03-08 Thread Simon Brunning
On Tue, 8 Mar 2005 14:13:01 +, Simon Brunning wrote: On 7 Mar 2005 06:38:49 -0800, gry@ll.mit.edu gry@ll.mit.edu wrote: As far as I can tell, what you ultimately want is to be able to extract a random (representative?) subset of sentences. If this is what's wanted,

Re: shuffle the lines of a large file

2005-03-08 Thread Heiko Wundram
On Tuesday 08 March 2005 15:28, Simon Brunning wrote: This has the advantage that every line had the same chance of being picked regardless of its length. There is the chance that it'll pick the same line more than once, though. Problem being: if the file the OP is talking about really is 80GB

Re: shuffle the lines of a large file

2005-03-08 Thread Simon Brunning
On Tue, 8 Mar 2005 15:49:35 +0100, Heiko Wundram wrote: Problem being: if the file the OP is talking about really is 80GB in size, and you consider a sentence to have 80 bytes on average (it's likely to have less than that), that makes 10^9 sentences in the file. Now, multiply

shuffle the lines of a large file

2005-03-07 Thread Joerg Schuster
Hello, I am looking for a method to shuffle the lines of a large file. I have a corpus of sorted and uniqed English sentences that has been produced with (1): (1) sort corpus | uniq > corpus.uniq corpus.uniq is 80G large. The fact that every sentence appears only once in corpus.uniq plays

Re: shuffle the lines of a large file

2005-03-07 Thread Kent Johnson
Joerg Schuster wrote: Hello, I am looking for a method to shuffle the lines of a large file. I have a corpus of sorted and uniqed English sentences that has been produced with (1): (1) sort corpus | uniq > corpus.uniq corpus.uniq is 80G large. The fact that every sentence appears only once

Re: shuffle the lines of a large file

2005-03-07 Thread Heiko Wundram
On Monday 07 March 2005 14:36, Joerg Schuster wrote: Any ideas? The following program should do the trick (filenames are hardcoded, look at top of file):
    ### shuffle.py
    import random
    import shelve
    # Open external files needed for data storage.
    lines = open("test.dat","r")
    lineindex =

Re: shuffle the lines of a large file

2005-03-07 Thread Eddie Corns
Joerg Schuster writes: Hello, I am looking for a method to shuffle the lines of a large file. I have a corpus of sorted and uniqed English sentences that has been produced with (1): (1) sort corpus | uniq > corpus.uniq corpus.uniq is 80G large. The fact that every sentence

RE: shuffle the lines of a large file

2005-03-07 Thread Alex Stapleton
-----Original Message----- From: Alex Stapleton Sent: 07 March 2005 14:17 To: Joerg Schuster; python-list@python.org Subject: RE: shuffle the lines of a large file Not tested this, run it (or some derivation thereof) over the output to get increasing randomness

Re: shuffle the lines of a large file

2005-03-07 Thread Heiko Wundram
Replying to oneself is bad, but although the program works, I never intended to use a shelve to store the data. Better to use anydbm. So, just replace: import shelve by import anydbm and lineindex = shelve.open("test.idx") by lineindex = anydbm.open("test.idx","c") Keep the rest as is.

Re: shuffle the lines of a large file

2005-03-07 Thread Richard Brodie
Joerg Schuster wrote: I am looking for a method to shuffle the lines of a large file. Off the top of my head: decorate, randomize, undecorate. Prepend a suitably large random number or hash to each line and then use sort. You could prepend new
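A minimal Python 3 sketch of the decorate, sort, undecorate idea, assuming the lines fit in memory. For an 80G file the sorting step would have to be an external sort such as the Unix sort command; the in-memory list sort here only stands in for it. The name shuffle_by_decoration and the rng parameter are made up for illustration.

```python
import random


def shuffle_by_decoration(lines, rng=random):
    """Shuffle lines by sorting them on a random key.

    Decorate: pair each line with a random float.
    Sort:     order the pairs by that float only.
    Undecorate: drop the float and keep the line.
    """
    decorated = [(rng.random(), line) for line in lines]
    decorated.sort(key=lambda pair: pair[0])
    return [line for _key, line in decorated]


if __name__ == "__main__":
    corpus = ["alpha\n", "bravo\n", "charlie\n", "delta\n"]
    print(shuffle_by_decoration(corpus))
```

Sorting on the key alone means two lines with equal keys keep their input order, which is harmless here because collisions between random floats are very unlikely.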

Re: shuffle the lines of a large file

2005-03-07 Thread gry
As far as I can tell, what you ultimately want is to be able to extract a random (representative?) subset of sentences. Given the huge size of data, I would suggest not randomizing the file, but randomizing accesses to the file. E.g. (sorry for off-the-cuff pseudo python): [adjust 8192 == 2**13
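The pseudo-code in this post is cut off, so the following Python 3 sketch is not a reconstruction of it, just one way "randomizing accesses" could look: seek to a random byte, throw away the partial line, and return the next complete one. The name random_line_by_seek, the wrap-around to the first line, and the rng parameter are assumptions. Note that this is biased: a line is more likely to be returned when the line before it is long.

```python
import os
import random


def random_line_by_seek(path, rng=random):
    """Return one line reached by seeking to a random byte offset.

    The file is never read in full; only two readline() calls are made.
    Returns None for an empty file.
    """
    size = os.path.getsize(path)
    if size == 0:
        return None
    with open(path, "rb") as handle:
        handle.seek(rng.randrange(size))
        handle.readline()           # discard the (probably partial) line
        line = handle.readline()    # the next complete line
        if not line:                # landed in the last line: wrap around
            handle.seek(0)
            line = handle.readline()
        return line
```

Lines come back as bytes because the file is opened in binary mode, which keeps seek offsets exact.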

Re: shuffle the lines of a large file

2005-03-07 Thread Warren Postma
Joerg Schuster wrote: Unfortunately, none of the machines that I may use has 80G RAM. So, using a dictionary will not help. Any ideas? Why don't you index the file? I would store the byte-offsets of the beginning of each line into an index file. Then you can generate a random number from 1 to
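A rough Python 3 sketch of the indexing idea, under the assumption that the offset index itself fits in memory. With around 10^9 lines that may not hold, and the index would then have to live in a file of its own, as the post suggests. The function name, the rng parameter, and the choice to shuffle the whole index rather than draw one random number at a time are illustrative assumptions.

```python
import random


def shuffle_with_offset_index(src_path, dst_path, rng=random):
    """Write the lines of src_path to dst_path in random order.

    Only the byte offset of each line start is kept in memory,
    not the lines themselves. Assumes every line, including the
    last one, ends with a newline.
    """
    offsets = []
    with open(src_path, "rb") as src:
        # Pass 1: record where each line begins.
        while True:
            position = src.tell()
            if not src.readline():
                break
            offsets.append(position)

        # Shuffle the small index instead of the large data.
        rng.shuffle(offsets)

        # Pass 2: seek to each offset and copy that line out.
        with open(dst_path, "wb") as dst:
            for position in offsets:
                src.seek(position)
                dst.write(src.readline())
```

The second pass does one seek per line, so on a spinning disk it is likely to be much slower than a sequential approach; that is the trade-off for the small memory footprint.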

Re: shuffle the lines of a large file

2005-03-07 Thread Joerg Schuster
Thanks to all. This thread shows again that Python's best feature is comp.lang.python. Jörg

Re: shuffle the lines of a large file

2005-03-07 Thread Steven Bethard
Joerg Schuster wrote: Thanks to all. This thread shows again that Python's best feature is comp.lang.python. +1 QOTW STeVe

RE: shuffle the lines of a large file

2005-03-07 Thread Batista, Facundo
[Joerg Schuster] Thanks to all. This thread shows again that Python's best feature is comp.lang.python. QOTW! QOTW! Facundo Bitácora De Vuelo: http://www.taniquetil.com.ar/plog PyAr - Python Argentina: http://pyar.decode.com.ar

Re: shuffle the lines of a large file

2005-03-07 Thread François Pinard
[Joerg Schuster] I am looking for a method to shuffle the lines of a large file. If speed and space are not a concern, I would be tempted to presume that this can be organised without too much difficulty. However, looking for speed handling a big file, while keeping equiprobability of all

Re: shuffle the lines of a large file

2005-03-07 Thread François Pinard
[Heiko Wundram] Replying to oneself is bad, [...] Not necessarily. :-) -- François Pinard http://pinard.progiciels-bpi.ca

Re: shuffle the lines of a large file

2005-03-07 Thread Raymond Hettinger
[Joerg Schuster] I am looking for a method to shuffle the lines of a large file. I have a corpus of sorted and uniqed English sentences that has been produced with (1): (1) sort corpus | uniq > corpus.uniq corpus.uniq is 80G large. Since the corpus is huge, the python portion should