Joerg Schuster wrote:
Thanks to all. This thread shows again that Python's best feature is
comp.lang.python.
from comp.lang import python ;)
Paul
--
http://mail.python.org/mailman/listinfo/python-list
On Fri, 11 Mar 2005 06:59:33 +0100, Heiko Wundram [EMAIL PROTECTED] wrote:
On Tuesday 08 March 2005 15:55, Simon Brunning wrote:
Ah, but that's the clever bit; it *doesn't* store the whole list -
only the selected lines.
But that means that it'll only read several lines from the file,
Simon Brunning wrote:
I couldn't resist. ;-)
Me neither...
import random
def randomLines(filename, lines=1):
    selected_lines = list(None for line_no in xrange(lines))
    for line_index, line in enumerate(open(filename)):
        for selected_line_index in xrange(lines):
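The snippet above is cut off, so here is a minimal sketch of the one-pass idea it describes, written for Python 3. It is an illustration under stated assumptions, not the poster's code: the name random_lines, the count parameter and the rng parameter are all hypothetical. Each slot independently keeps the current line with probability 1/(line_index + 1), so only the selected lines are held in memory, and the same line can land in more than one slot.

```python
import random

def random_lines(filename, count=1, rng=random):
    """Pick `count` lines uniformly at random, with replacement, in one pass."""
    selected = [None] * count
    with open(filename) as handle:
        for line_index, line in enumerate(handle):
            for slot in range(count):
                # Replace this slot with probability 1 / (line_index + 1).
                # After n lines, each line sits in the slot with probability 1/n.
                if rng.randrange(line_index + 1) == 0:
                    selected[slot] = line
    return selected
```

Note that the inner loop costs count random draws per line, which matters for a file with around 10^9 lines.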
Simon Brunning wrote:
On Tue, 8 Mar 2005 14:13:01 +, Simon Brunning wrote:
selected_lines = list(None for line_no in xrange(lines))
Just a short note on this line. If lines is really large, it's much faster
to use
from itertools import repeat
selected_lines = list(repeat(None, lines))
On Tuesday 08 March 2005 15:55, Simon Brunning wrote:
Ah, but that's the clever bit; it *doesn't* store the whole list -
only the selected lines.
But that means that it'll only read several lines from the file, never do a
shuffle of the whole file content... When you'd want to shuffle the file
Raymond Hettinger [EMAIL PROTECTED] wrote:
from random import random
out = open('corpus.decorated', 'w')
for line in open('corpus.uniq'):
    print >> out, '%.14f %s' % (random(), line),
out.close()
sort corpus.decorated | cut -c 18- > corpus.randomized
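For readers on Python 3, the decorate and undecorate steps might look roughly like the sketch below. The function names and the rng and key_width parameters are made up for illustration. The sorting itself is still meant to be done by an external sort, which copes with files larger than RAM. The key is always 17 characters (16 for the number plus one space), which is why the shell version cuts from column 18.

```python
import random

def decorate(src_path, dst_path, rng=random):
    """Prefix every line with a fixed-width random key."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # '%.14f ' always yields 17 characters: digit, dot, 14 digits, space.
            dst.write("%.14f %s" % (rng.random(), line))

def undecorate(src_path, dst_path, key_width=17):
    """Strip the key again, the equivalent of `cut -c 18-`."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(line[key_width:])
```

A possible workflow is decorate('corpus.uniq', 'corpus.decorated'), then sort on the decorated file, then undecorate on the sorted result. The sketch assumes every line, including the last one, ends with a newline.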
Very good solution!
Sort
On 7 Mar 2005 06:38:49 -0800, gry@ll.mit.edu wrote:
As far as I can tell, what you ultimately want is to be able to extract
a random (representative?) subset of sentences.
If this is what's wanted, then perhaps some variation on this cookbook
recipe might do the trick:
On Tue, 8 Mar 2005 14:13:01 +, Simon Brunning
[EMAIL PROTECTED] wrote:
On 7 Mar 2005 06:38:49 -0800, gry@ll.mit.edu wrote:
As far as I can tell, what you ultimately want is to be able to extract
a random (representative?) subset of sentences.
If this is what's wanted,
On Tuesday 08 March 2005 15:28, Simon Brunning wrote:
This has the advantage that every line had the same chance of being
picked regardless of its length. There is the chance that it'll pick
the same line more than once, though.
Problem being: if the file the OP is talking about really is 80GB
On Tue, 8 Mar 2005 15:49:35 +0100, Heiko Wundram [EMAIL PROTECTED] wrote:
Problem being: if the file the OP is talking about really is 80GB in size, and
you consider a sentence to have 80 bytes on average (it's likely to have less
than that), that makes 10^9 sentences in the file. Now, multiply
Hello,
I am looking for a method to shuffle the lines of a large file.
I have a corpus of sorted and uniqed English sentences that has been
produced with (1):
(1) sort corpus | uniq > corpus.uniq
corpus.uniq is 80G large. The fact that every sentence appears only
once in corpus.uniq plays
Joerg Schuster wrote:
Hello,
I am looking for a method to shuffle the lines of a large file.
I have a corpus of sorted and uniqed English sentences that has been
produced with (1):
(1) sort corpus | uniq > corpus.uniq
corpus.uniq is 80G large. The fact that every sentence appears only
once
On Monday 07 March 2005 14:36, Joerg Schuster wrote:
Any ideas?
The following program should do the trick (filenames are hardcoded, look at
top of file):
### shuffle.py
import random
import shelve
# Open external files needed for data storage.
lines = open("test.dat","r")
lineindex =
Joerg Schuster [EMAIL PROTECTED] writes:
Hello,
I am looking for a method to shuffle the lines of a large file.
I have a corpus of sorted and uniqed English sentences that has been
produced with (1):
(1) sort corpus | uniq > corpus.uniq
corpus.uniq is 80G large. The fact that every sentence
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Alex Stapleton
Sent: 07 March 2005 14:17
To: Joerg Schuster; python-list@python.org
Subject: RE: shuffle the lines of a large file
Not tested this, run it (or some derivation thereof) over the output to get
increasing randomness
Replying to oneself is bad, but although the program works, I never intended
to use a shelve to store the data. Better to use anydbm.
So, just replace:
import shelve
by
import anydbm
and
lineindex = shelve.open("test.idx")
by
lineindex = anydbm.open("test.idx","c")
Keep the rest as is.
--
Joerg Schuster [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
I am looking for a method to shuffle the lines of a large file.
Off the top of my head: decorate, randomize, undecorate.
Prepend a suitable large random number or hash to each
line and then use sort. You could prepend new
As far as I can tell, what you ultimately want is to be able to extract
a random (representative?) subset of sentences. Given the huge size
of data, I would suggest not randomizing the file, but randomizing
accesses to the file. E.g. (sorry for off-the-cuff pseudo python):
[adjust 8192 == 2**13
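The pseudo-Python is cut off above, so the following is only a rough Python 3 sketch of what "randomizing accesses to the file" could mean. It is not the poster's code, and random_line_by_seek and its rng parameter are hypothetical names. It seeks to a random byte offset, throws away the partial line it lands in, and returns the next complete line. The catch is that this is not uniform: a line is picked with probability proportional to the length of the line before it.

```python
import os
import random

def random_line_by_seek(path, rng=random):
    """Return one line (as bytes) found by seeking to a random byte offset.

    Assumes a non-empty file with newline-terminated lines.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        handle.seek(rng.randrange(size))
        handle.readline()         # discard the (probably partial) current line
        line = handle.readline()  # first complete line after the offset
        if not line:              # landed in the last line: wrap to the first
            handle.seek(0)
            line = handle.readline()
    return line
```

Each call costs one seek and two short reads, so drawing a sample never requires reading the whole 80G file.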
Joerg Schuster wrote:
Unfortunately, none of the machines that I may use has 80G RAM.
So, using a dictionary will not help.
Any ideas?
Why don't you index the file? I would store the byte-offsets of the
beginning of each line into an index file. Then you can generate a
random number from 1 to
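A minimal Python 3 sketch of such an index file follows, with the caveat that the layout is an assumption of this sketch: one little-endian unsigned 64-bit offset per line. The names build_index, read_line and random_line are hypothetical, and line numbers here are 0-based. With fixed-width entries, the offset of line n sits at byte n * 8 of the index, so fetching any line takes two seeks.

```python
import random
import struct

OFFSET = struct.Struct("<Q")  # one unsigned 64-bit byte offset per line

def build_index(data_path, index_path):
    """Write the start offset of every line to index_path; return the line count."""
    count = 0
    offset = 0
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        for line in data:
            index.write(OFFSET.pack(offset))
            offset += len(line)
            count += 1
    return count

def read_line(data, index, line_no):
    """Fetch line number line_no (0-based) from the open data and index files."""
    index.seek(line_no * OFFSET.size)
    (offset,) = OFFSET.unpack(index.read(OFFSET.size))
    data.seek(offset)
    return data.readline()

def random_line(data, index, count, rng=random):
    """Fetch one uniformly chosen line; count is the value from build_index."""
    return read_line(data, index, rng.randrange(count))
```

Unlike seeking to a random byte, this gives every line the same chance regardless of length. For roughly 10^9 lines the index itself would be about 8 GB, so it lives on disk rather than in RAM.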
Thanks to all. This thread shows again that Python's best feature is
comp.lang.python.
Jörg
Joerg Schuster wrote:
Thanks to all. This thread shows again that Python's best feature is
comp.lang.python.
+1 QOTW
STeVe
Title: RE: shuffle the lines of a large file
[Joerg Schuster]
#- Thanks to all. This thread shows again that Python's best feature is
#- comp.lang.python.
QOTW! QOTW!
. Facundo
Bitácora De Vuelo: http://www.taniquetil.com.ar/plog
PyAr - Python Argentina: http://pyar.decode.com.ar
[Joerg Schuster]
I am looking for a method to shuffle the lines of a large file.
If speed and space are not a concern, I would be tempted to presume that
this can be organised without too much difficulty. However, looking for
speed handling a big file, while keeping equiprobability of all
[Heiko Wundram]
Replying to oneself is bad, [...]
Not necessarily. :-)
--
François Pinard http://pinard.progiciels-bpi.ca
[Joerg Schuster]
I am looking for a method to shuffle the lines of a large file.
I have a corpus of sorted and uniqed English sentences that has been
produced with (1):
(1) sort corpus | uniq > corpus.uniq
corpus.uniq is 80G large.
Since the corpus is huge, the python portion should