subject:"shuffle the lines of a large file"

Re: shuffle the lines of a large file

2005-03-12 Thread paul koelle

Joerg Schuster wrote: Thanks to all. This thread shows again that Python's best feature is comp.lang.python. from comp.lang import python ;) Paul -- http://mail.python.org/mailman/listinfo/python-list

Re: shuffle the lines of a large file

2005-03-11 Thread Simon Brunning

On Fri, 11 Mar 2005 06:59:33 +0100, Heiko Wundram [EMAIL PROTECTED] wrote: On Tuesday 08 March 2005 15:55, Simon Brunning wrote: Ah, but that's the clever bit; it *doesn't* store the whole list - only the selected lines. But that means that it'll only read several lines from the file,

Re: shuffle the lines of a large file

2005-03-11 Thread Peter Otten

Simon Brunning wrote: I couldn't resist. ;-) Me neither... import random def randomLines(filename, lines=1): selected_lines = list(None for line_no in xrange(lines)) for line_index, line in enumerate(open(filename)): for selected_line_index in xrange(lines):
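The snippet above is cut off, so the rest of that function is not recoverable here. As a rough illustration of the general idea discussed in this sub-thread (keep only the selected lines in memory while reading the file once), here is a minimal sketch of standard reservoir sampling in Python 3. It is not a reconstruction of the posted code; the name `random_lines` and its parameters are made up for this example.

```python
import random


def random_lines(lines, k=1, rng=random):
    """Return up to k distinct lines chosen uniformly from an iterable.

    Only the k currently selected lines are held in memory, so the
    input can be a file object for a file far larger than RAM.
    """
    selected = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k lines.
            selected.append(line)
        else:
            # Keep the new line with probability k / (index + 1),
            # evicting a uniformly chosen line already in the reservoir.
            slot = rng.randrange(index + 1)
            if slot < k:
                selected[slot] = line
    return selected
```

Typical use would be `with open(filename) as f: sample = random_lines(f, k=10)`. As Heiko Wundram points out elsewhere in the thread, this yields a random subset, not a shuffle of the whole file.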

Re: shuffle the lines of a large file

2005-03-10 Thread Stefan Behnel

Simon Brunning wrote: On Tue, 8 Mar 2005 14:13:01 +, Simon Brunning wrote: selected_lines = list(None for line_no in xrange(lines)) Just a short note on this line. If lines is really large, it's much faster to use from itertools import repeat selected_lines = list(repeat(None, lines))

Re: shuffle the lines of a large file

2005-03-10 Thread Heiko Wundram

On Tuesday 08 March 2005 15:55, Simon Brunning wrote: Ah, but that's the clever bit; it *doesn't* store the whole list - only the selected lines. But that means that it'll only read several lines from the file, never do a shuffle of the whole file content... When you'd want to shuffle the file

Re: shuffle the lines of a large file

2005-03-08 Thread Nick Craig-Wood

Raymond Hettinger [EMAIL PROTECTED] wrote: from random import random out = open('corpus.decorated', 'w') for line in open('corpus.uniq'): print >> out, '%.14f %s' % (random(), line), out.close() sort corpus.decorated | cut -c 18- > corpus.randomized Very good solution! Sort

Re: shuffle the lines of a large file

2005-03-08 Thread Simon Brunning

On 7 Mar 2005 06:38:49 -0800, gry@ll.mit.edu gry@ll.mit.edu wrote: As far as I can tell, what you ultimately want is to be able to extract a random (representative?) subset of sentences. If this is what's wanted, then perhaps some variation on this cookbook recipe might do the trick:

Re: shuffle the lines of a large file

2005-03-08 Thread Simon Brunning

On Tue, 8 Mar 2005 14:13:01 +, Simon Brunning [EMAIL PROTECTED] wrote: On 7 Mar 2005 06:38:49 -0800, gry@ll.mit.edu gry@ll.mit.edu wrote: As far as I can tell, what you ultimately want is to be able to extract a random (representative?) subset of sentences. If this is what's wanted,

Re: shuffle the lines of a large file

2005-03-08 Thread Heiko Wundram

On Tuesday 08 March 2005 15:28, Simon Brunning wrote: This has the advantage that every line had the same chance of being picked regardless of its length. There is the chance that it'll pick the same line more than once, though. Problem being: if the file the OP is talking about really is 80GB

Re: shuffle the lines of a large file

2005-03-08 Thread Simon Brunning

On Tue, 8 Mar 2005 15:49:35 +0100, Heiko Wundram [EMAIL PROTECTED] wrote: Problem being: if the file the OP is talking about really is 80GB in size, and you consider a sentence to have 80 bytes on average (it's likely to have less than that), that makes 10^9 sentences in the file. Now, multiply

shuffle the lines of a large file

2005-03-07 Thread Joerg Schuster

Hello, I am looking for a method to shuffle the lines of a large file. I have a corpus of sorted and uniqed English sentences that has been produced with (1): (1) sort corpus | uniq > corpus.uniq corpus.uniq is 80G large. The fact that every sentence appears only once in corpus.uniq plays

Re: shuffle the lines of a large file

2005-03-07 Thread Kent Johnson

Joerg Schuster wrote: Hello, I am looking for a method to shuffle the lines of a large file. I have a corpus of sorted and uniqed English sentences that has been produced with (1): (1) sort corpus | uniq > corpus.uniq corpus.uniq is 80G large. The fact that every sentence appears only once

Re: shuffle the lines of a large file

2005-03-07 Thread Heiko Wundram

On Monday 07 March 2005 14:36, Joerg Schuster wrote: Any ideas? The following program should do the trick (filenames are hardcoded, look at top of file): ### shuffle.py import random import shelve # Open external files needed for data storage. lines = open("test.dat","r") lineindex =

Re: shuffle the lines of a large file

2005-03-07 Thread Eddie Corns

Joerg Schuster [EMAIL PROTECTED] writes: Hello, I am looking for a method to shuffle the lines of a large file. I have a corpus of sorted and uniqed English sentences that has been produced with (1): (1) sort corpus | uniq > corpus.uniq corpus.uniq is 80G large. The fact that every sentence

RE: shuffle the lines of a large file

2005-03-07 Thread Alex Stapleton

-----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Alex Stapleton Sent: 07 March 2005 14:17 To: Joerg Schuster; python-list@python.org Subject: RE: shuffle the lines of a large file Not tested this, run it (or some derivation thereof) over the output to get increasing randomness

Re: shuffle the lines of a large file

2005-03-07 Thread Heiko Wundram

Replying to oneself is bad, but although the program works, I never intended to use a shelve to store the data. Better to use anydbm. So, just replace: import shelve by import anydbm and lineindex = shelve.open("test.idx") by lineindex = anydbm.open("test.idx","c") Keep the rest as is. --

Re: shuffle the lines of a large file

2005-03-07 Thread Richard Brodie

Joerg Schuster [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] I am looking for a method to shuffle the lines of a large file. Off the top of my head: decorate, randomize, undecorate. Prepend a suitable large random number or hash to each line and then use sort. You could prepend new
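The decorate, randomize, undecorate idea can be sketched in a few lines of Python 3. This is only an illustration under stated assumptions: the helper names `decorate` and `undecorate` are invented here, the fixed-width key format follows the `'%.14f %s'` pattern quoted elsewhere in this thread, and the in-memory `sorted()` call stands in for the external `sort` program that would be needed for an 80G file.

```python
import random


def decorate(lines, rng=random):
    """Yield each line prefixed with a fixed-width random key and a space."""
    for line in lines:
        # '%.14f' of a float in [0, 1) is always 16 characters wide,
        # so key plus separator occupy exactly 17 characters.
        yield "%.14f %s" % (rng.random(), line)


def undecorate(decorated_lines, key_width=17):
    """Strip the random key and the separating space from each line."""
    for line in decorated_lines:
        yield line[key_width:]


# Small in-memory demonstration; for a huge corpus the decorated lines
# would be written to disk and ordered with an external sort instead.
corpus = ["first sentence\n", "second sentence\n", "third sentence\n"]
shuffled = list(undecorate(sorted(decorate(corpus))))
```

Because every key has the same width, plain lexicographic sorting orders the lines by their random numbers, and dropping the first 17 characters corresponds to `cut -c 18-`.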

Re: shuffle the lines of a large file

2005-03-07 Thread gry

As far as I can tell, what you ultimately want is to be able to extract a random (representative?) subset of sentences. Given the huge size of data, I would suggest not randomizing the file, but randomizing accesses to the file. E.g. (sorry for off-the-cuff pseudo python): [adjust 8192 == 2**13
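The pseudo-code in this message is truncated, so the following is not a reconstruction of it. It is a minimal Python 3 sketch of one way to randomize accesses instead of the file: seek to a random byte offset and return the next complete line. The function name `random_line_by_seek` is hypothetical, and the file is assumed to be non-empty.

```python
import os
import random


def random_line_by_seek(path, rng=random):
    """Return one line found by seeking to a random byte offset.

    Caveat: lines that follow long lines are more likely to be picked,
    so the selection is not uniform over lines.
    """
    size = os.path.getsize(path)  # assumed to be greater than zero
    with open(path, "rb") as f:
        f.seek(rng.randrange(size))
        f.readline()        # discard the line we landed in (probably partial)
        line = f.readline()
        if not line:        # we landed in the last line: wrap to the start
            f.seek(0)
            line = f.readline()
    return line
```

The length bias noted in the docstring is the trade-off against the single-pass selection discussed elsewhere in the thread, where every line has the same chance of being picked regardless of its length.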

Re: shuffle the lines of a large file

2005-03-07 Thread Warren Postma

Joerg Schuster wrote: Unfortunately, none of the machines that I may use has 80G RAM. So, using a dictionary will not help. Any ideas? Why don't you index the file? I would store the byte-offsets of the beginning of each line into an index file. Then you can generate a random number from 1 to
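A minimal Python 3 sketch of the indexing idea: record the byte offset of each line, shuffle the offsets, then copy the lines out in that order. The function names are invented for this example, and for brevity the index is held in an in-memory list; the post suggests an index file, which would matter at the scale of this corpus. The sketch also assumes every line, including the last, ends with a newline.

```python
import random


def build_line_index(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def write_shuffled(path, out_path, rng=random):
    """Write the lines of path to out_path in a random order."""
    offsets = build_line_index(path)
    rng.shuffle(offsets)  # shuffle the small offsets, not the lines
    with open(path, "rb") as src, open(out_path, "wb") as dst:
        for offset in offsets:
            src.seek(offset)
            dst.write(src.readline())
```

Each output line costs one seek in the source file, so on a file much larger than the disk cache this approach is likely to be dominated by random-access time.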

Re: shuffle the lines of a large file

2005-03-07 Thread Joerg Schuster

Thanks to all. This thread shows again that Python's best feature is comp.lang.python. Jörg -- http://mail.python.org/mailman/listinfo/python-list

Re: shuffle the lines of a large file

2005-03-07 Thread Steven Bethard

Joerg Schuster wrote: Thanks to all. This thread shows again that Python's best feature is comp.lang.python. +1 QOTW STeVe -- http://mail.python.org/mailman/listinfo/python-list

RE: shuffle the lines of a large file

2005-03-07 Thread Batista, Facundo

[Joerg Schuster] #- Thanks to all. This thread shows again that Python's best feature is #- comp.lang.python. QOTW! QOTW! . Facundo Bitácora De Vuelo: http://www.taniquetil.com.ar/plog PyAr - Python Argentina: http://pyar.decode.com.ar

Re: shuffle the lines of a large file

2005-03-07 Thread François Pinard

[Joerg Schuster] I am looking for a method to shuffle the lines of a large file. If speed and space are not a concern, I would be tempted to presume that this can be organised without too much difficulty. However, looking for speed handling a big file, while keeping equiprobability of all

Re: shuffle the lines of a large file

2005-03-07 Thread François Pinard

[Heiko Wundram] Replying to oneself is bad, [...] Not necessarily. :-) -- François Pinard http://pinard.progiciels-bpi.ca -- http://mail.python.org/mailman/listinfo/python-list

Re: shuffle the lines of a large file

2005-03-07 Thread Raymond Hettinger

[Joerg Schuster] I am looking for a method to shuffle the lines of a large file. I have a corpus of sorted and uniqed English sentences that has been produced with (1): (1) sort corpus | uniq > corpus.uniq corpus.uniq is 80G large. Since the corpus is huge, the python portion should


25 matches
