On Tue, 25 Jan 2005, Scott Melnyk wrote:


> I have a file in the form shown at the end (please forgive any
> wraparounds due to the width of the screen here; the lines starting
> with ENS end with the e-12 or what have you on the same line.)
>
> What I would like is to generate an output file of  any other
> ENSE000...e-4 (or whathaveyou) lines that appear in more than one
> place and for each of those the queries they appear related to.

Hi Scott,

One way to do this might be to do it in two passes across the file.

The first pass through the file can identify records that appear more than
once.  The second pass can take that knowledge, and then display those
records.

In pseudocode, this will look something like:

###
hints = identifyDuplicateRecords(filename)
displayDuplicateRecords(filename, hints)
###
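To make the pseudocode above a bit more concrete, here is a minimal sketch
of the two passes.  I haven't assumed much about your file format: the
record_key() helper is hypothetical, and simply treats the first
whitespace-separated field of any line starting with "ENS" as the record's
key.  You'd need to adjust it, and the part that reports the related
queries, to match your real data.

```python
import sys


def record_key(line):
    """Hypothetical helper: return the ID of an 'ENS...' line, else None.

    Assumes the ID is the first whitespace-separated field on the line.
    """
    fields = line.split()
    if fields and fields[0].startswith("ENS"):
        return fields[0]
    return None


def identify_duplicate_records(filename):
    """First pass: return the set of record keys seen more than once."""
    seen = set()
    duplicates = set()
    with open(filename) as f:
        for line in f:              # reads one line at a time
            key = record_key(line)
            if key is None:
                continue
            if key in seen:
                duplicates.add(key)
            else:
                seen.add(key)
    return duplicates


def display_duplicate_records(filename, hints, out=sys.stdout):
    """Second pass: write every record line whose key is in hints."""
    with open(filename) as f:
        for line in f:
            if record_key(line) in hints:
                out.write(line)
```

Both passes iterate over the file object directly, so only one line of the
file is held in memory at a time; what grows is the set of keys.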



> My data set the below is taken from is over 2.4 gb so speed and memory
> considerations come into play.
>
> Are sets more effective than lists for this?

Yes.  Sets and dictionaries make looking up a key cheap (roughly constant
time on average), whereas checking membership in a list means scanning it
element by element.  In the two-pass approach, the first pass can use a
dictionary to accumulate the number of times each record's key has occurred.

Note that, because your file is so large, the dictionary probably
shouldn't accumulate the whole mass of information that we've seen so
far: instead, it's sufficient to record just the information we need to
recognize a duplicate.
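
As a rough illustration of that idea, the sketch below keeps only each
record's key and an integer count, never the full line.  Again,
record_key() is a hypothetical helper that assumes the key is the first
field of a line starting with "ENS"; the other function names are made up
for this example too.

```python
def record_key(line):
    """Hypothetical helper: return the ID of an 'ENS...' line, else None."""
    fields = line.split()
    if fields and fields[0].startswith("ENS"):
        return fields[0]
    return None


def count_record_keys(lines):
    """Map each record key to the number of times it occurs in lines.

    Only the key and a count are stored, not the lines themselves.
    """
    counts = {}
    for line in lines:
        key = record_key(line)
        if key is not None:
            counts[key] = counts.get(key, 0) + 1
    return counts


def duplicate_keys(counts):
    """Return the set of keys that occurred more than once."""
    return {key for key, n in counts.items() if n > 1}
```

You can pass an open file object as 'lines', so the file is still read one
line at a time.  How much memory the dictionary needs then depends on the
number of distinct keys rather than on the size of the file.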


If you have more questions, please feel free to ask!

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
