Re: [Tutor] need to get unique elements out of a 2.5Gb file

2006-02-02 Thread Alan Gauld
Hi,

  I have a file which is 2.5 GB.
 
 There are many duplicate lines.  I wanted to get rid
 of the duplicates.

First, can you use uniq which is a standard Unix/Linux OS command?

 I chose to parse it to get unique elements.
 
 f1 = open('mfile','r')
 da = f1.read().split('\n')

This reads 2.5G of data into memory. Do you have 2.5G of 
available memory?

It then splits it into lines, so why not read the file line by line 
instead?

for da in open('myfile'):
    # stuff here

 dat = da[:-1]

This creates a copy of the file contents - another 2.5GB!
If you used da = da[:-1] you would only have one copy.

However, if you read the file one line at a time you can add each 
line straight into the Set, which means you never need the full 
2.5GB in memory.

 f2 = open('res','w')
 dset = Set(dat)
 for i in dset:
     f2.write(i)
     f2.write('\n')

f2.write(i+'\n')

should be slightly faster, and with a data set this size that 
is probably a visible difference!
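
Putting those pieces together - read line by line, add each line to 
the Set as you go, and write with a single call - a minimal sketch 
(sticking with the sets.Set class your code already uses) might look 
like:

from sets import Set    # Python 2.3's Set class; 2.4 also has the built-in set()

unique = Set()
for line in open('mfile'):          # one line at a time - never the whole 2.5GB
    unique.add(line.rstrip('\n'))   # strip the newline so duplicates match

f2 = open('res', 'w')
for i in unique:
    f2.write(i + '\n')              # single write per line
f2.close()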

 Problem: Python says it cannot handle such a large
 file. 

That's probably not a Python issue but an available RAM issue.
But your code doesn't need the entire file in RAM, so just read 
one line at a time and avoid building the list.

If it's still too big you can try batching the operation: 
process, say, only half the lines in the file at a time, then 
merge the resulting reduced files (see the sketch below). The 
key point is that, without resorting to much more sophisticated 
algorithms, you must at some point hold the final data set in 
RAM; if it is too big the program will fail.
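
As a very rough illustration of that batching idea (the half-way 
line count and the scratch file names below are made up, and the 
final merge still has to hold every unique line at once):

from sets import Set

def dedupe_slice(infile, outfile, start, stop):
    # Write the unique lines numbered start..stop-1 of infile to outfile.
    seen = Set()
    for lineno, line in enumerate(open(infile)):
        if lineno >= stop:
            break
        if lineno >= start:
            seen.add(line)
    out = open(outfile, 'w')
    out.writelines(seen)
    out.close()

# First pass: reduce each half of the file to its own, smaller, file.
HALF = 20000000                                      # made-up line count
dedupe_slice('mfile', 'res.part1', 0, HALF)
dedupe_slice('mfile', 'res.part2', HALF, 10 ** 12)   # i.e. to the end of the file

# Second pass: merge the reduced files.  Only here does the full set
# of unique lines have to fit in memory at once.
merged = Set()
for part in ('res.part1', 'res.part2'):
    for line in open(part):
        merged.add(line)
open('res', 'w').writelines(merged)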

A final strategy is to sort the file (which can be 
done - slowly! - in batches) and remove duplicate lines 
afterwards, or even as part of the sort. But if you need 
to go that far come back for more details.

HTH,

Alan G.


[Tutor] need to get unique elements out of a 2.5Gb file

2006-02-01 Thread Srinivas Iyyer
Hi Group,

I have a file which is 2.5 GB.
 
TRIM54  NM_187841.1 GO:0004984
TRIM54  NM_187841.1 GO:0001584
TRIM54  NM_187841.1 GO:0003674
TRIM54  NM_187841.1 GO:0004985
TRIM54  NM_187841.1 GO:0001584
TRIM54  NM_187841.1 GO:0001653
TRIM54  NM_187841.1 GO:0004984

There are many duplicate lines.  I wanted to get rid
of the duplicates.

I chose to parse it to get unique elements.

from sets import Set

f1 = open('mfile','r')
da = f1.read().split('\n')
dat = da[:-1]
f2 = open('res','w')
dset = Set(dat)
for i in dset:
    f2.write(i)
    f2.write('\n')
f2.close()

Problem: Python says it cannot handle such a large
file. 
Any ideas? Please help me.

cheers
srini



Re: [Tutor] need to get unique elements out of a 2.5Gb file

2006-02-01 Thread Danny Yoo


On Wed, 1 Feb 2006, Srinivas Iyyer wrote:

 I have a file which is 2.5 GB.

[data cut]

 There are many duplicate lines.  I wanted to get rid of the duplicates.


Hi Srinivas,

When we deal with files this large, we do have to be careful and aware of
issues like memory consumption.


 I chose to parse it to get unique elements.

 f1 = open('mfile','r')
 da = f1.read().split('\n')
      ^^^^^^^^^^^^^^^^^^^^^

This line is particularly problematic.  Your file is 2.5GB, so you must
have at least that much memory.  That's already a problem for most
typical desktops.  But you also need roughly another 2.5GB as you're
building the list of line elements from the whole string we've read from
f1.read().

And that means you've just broken the limits of most 32-bit machines,
which can't address more than 2**32 bytes of memory at once!

##
>>> 2**32
4294967296L
>>> 2 * (2.5 * 10**9)   ## rough estimate of the memory necessary to do
...                     ## what your program needs at that point
5000000000.0
##

That's the hard limit you're facing here.


You must read the file progressively: trying to process it all at once is
not going to scale at all.

Simpler is something like this:

##
from sets import Set

uniqueElements = Set()
for line in open('mfile'):
    uniqueElements.add(line.rstrip())
##

which tries to accumulate only unique elements, reading the file line by
line.

However, this approach too has limits.  If the number of unique elements
exceeds the amount of system memory, this too won't work.  (An approach
that does work involves using a mergesort along with auxiliary scratch
files.)
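
For what it's worth, here is a rough sketch of that sort-and-merge idea
(the chunk size and scratch file names are invented, and there are
certainly other ways to arrange it):

##
import heapq

# Pass 1: cut the big file into sorted scratch files that each fit in memory.
CHUNK = 1000000                          # made-up number of lines per chunk
paths, chunk = [], []
for line in open('mfile'):
    chunk.append(line)
    if len(chunk) == CHUNK:
        chunk.sort()
        paths.append('scratch.%d' % len(paths))
        open(paths[-1], 'w').writelines(chunk)
        chunk = []
if chunk:
    chunk.sort()
    paths.append('scratch.%d' % len(paths))
    open(paths[-1], 'w').writelines(chunk)

# Pass 2: k-way merge of the sorted scratch files with a heap.  Once the
# data is sorted, duplicates come out adjacent, so a single comparison
# against the previous line is enough to drop them.
files = [open(p) for p in paths]
heap = []
for i, f in enumerate(files):
    first = f.readline()
    if first:
        heapq.heappush(heap, (first, i))

out = open('res', 'w')
previous = None
while heap:
    line, i = heapq.heappop(heap)
    if line != previous:
        out.write(line)
        previous = line
    nxt = files[i].readline()
    if nxt:
        heapq.heappush(heap, (nxt, i))
out.close()
##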



If you really need to get this job done fast, have you considered just
using the Unix 'sort' utility?

It has a uniqueness flag that you can enable, and it's always a good
approach to use tools that already exist rather than write your own.

That is, your problem may be solved by the simple shell command:

sort -u [somefile]
(Alternatively:   sort [somefile] | uniq)

So I guess my question is: why did you first approach this unique-line
problem with Python?
