On Tue, 25 Jan 2005, Scott Melnyk wrote:
> I have an file in the form shown at the end (please forgive any
> wrapparounds due to the width of the screen here- the lines starting
> with ENS end with the e-12 or what have you on same line.)
>
> What I would like is to generate an output file of any other
> ENSE000...e-4 (or whathaveyou) lines that appear in more than one
> place and for each of those the queries they appear related to.

Hi Scott,

One way to do this might be to make two passes across the file.

The first pass through the file can identify records that appear more
than once. The second pass can take that knowledge and then display
those records.

In pseudocode, this will look something like:

###
hints = identifyDuplicateRecords(filename)
displayDuplicateRecords(filename, hints)
###

> My data set the below is taken from is over 2.4 gb so speed and memory
> considerations come into play.
>
> Are sets more effective than lists for this?

Sets and dictionaries make "lookup" of a key fairly cheap: testing
whether a key is present takes roughly constant time on average,
whereas a list has to be scanned from the front. In the two-pass
approach, the first pass can use a dictionary to accumulate the number
of times a certain record's key has occurred.

Note that, because your file is so large, the dictionary probably
shouldn't accumulate the whole mass of information that we've seen so
far: instead, it's sufficient to record only the information we need to
recognize a duplicate, such as the record's key.

If you have more questions, please feel free to ask!
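
P.S. In case it helps, here is a rough sketch of how the two passes
might look in Python. It rests on guesses about your file, so please
treat it as a starting point rather than a finished program. I am
assuming that hit lines start with "ENSE", that the first
whitespace-separated token on such a line is the record's key, and that
any other non-blank line is a query header that applies to the hits
below it. The function names and the is_hit() helper are made up for
illustration; adjust them to match the real format.

```python
def is_hit(line):
    # Assumption: a hit line starts with "ENSE". Change this test to
    # match the real file format.
    return line.startswith("ENSE")


def identify_duplicate_records(filename):
    """First pass: return the set of keys seen on more than one hit line."""
    seen = set()
    duplicates = set()
    with open(filename) as f:
        for line in f:  # reads one line at a time, not the whole file
            line = line.strip()
            if not is_hit(line):
                continue
            key = line.split()[0]  # assumption: the key is the first token
            if key in seen:
                duplicates.add(key)
            else:
                seen.add(key)
    return duplicates


def collect_duplicate_queries(filename, duplicates):
    """Second pass: map each duplicated key to the queries it appears under."""
    queries = {}
    current_query = None
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if is_hit(line):
                key = line.split()[0]
                if key in duplicates:
                    found = queries.setdefault(key, [])
                    if current_query not in found:
                        found.append(current_query)
            else:
                # Assumption: any other non-blank line is a query header.
                current_query = line
    return queries
```

You could then write the report with something like:

    hints = identify_duplicate_records("hits.txt")
    for key, found in collect_duplicate_queries("hits.txt", hints).items():
        print(key, found)

where "hits.txt" stands in for your real file name. The first pass
stores only one short string per distinct key, never the lines
themselves, which is what keeps memory use manageable on a file of
that size.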