On Wed, 1 Feb 2006, Srinivas Iyyer wrote:
> I have a file which is 2.5 Gb., > [data cut] > > There are many duplicate lines. I wanted to get rid of the duplicates. Hi Srinivas, When we deal with such large files, we do have to be careful and aware of issues like the concept of memory. > I chose to parse to get uniqe element. > > f1 = open('mfile','r') > da = f1.read().split('\n') ^^^^^^^^^^^^^^^^^^^^^ This line is particularly problematic. Your file is 2.5GB, so you must have at least have that much memory. That's already a problem for most typical desktops. But you also need to store roughly 2.5GB as you're building the list of line elements from the whole string we've read from f1.read(). And that just means you've just broken the limits of most 32-bit machines that can't address more than 2**32 MB of memory at once! ###### >>> 2**32 4294967296L >>> 2 * (2.5 * 10**9) ## Rough estimate of memory necessary to do ## what your program needs at that point 5000000000.0 ###### That's the hard limit you're facing here. You must read the file progressively: trying to process it all at once is not going to scale at all. Simpler is something like this: ###### uniqueElements = Set() for line in open('mfile'): uniqueElements.add(line.rstrip()) ###### which tries to accumulate only unique elements, reading the file line by line. However, this approach too has limits. If the number of unique elements exceeds the amount of system memory, this too won't work. (An approach that does work involves using a mergesort along with auxiliary scratch files.) If you really need to get this job done fast, have you considered just using the Unix 'sort' utility? It has a uniqueness flag that you can enable, and it's always a good approach to use tools that already exist rather than write your own. That is, your problem may be solved by the simple shell command: sort -u [somefile] (Alternatively: sort [somefile] | uniq) So I guess my question is: why did you first approached this "unique line" problem with Python? _______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor