[Tutor] first python program to find citeulike duplicates

Suresh Krishna Thu, 20 Nov 2008 03:55:38 -0800


hi everybody,

i wrote this to solve the problem of exact duplicate entries in my citeulike library, that i wanted to remove. so i exported my entries in RIS format, and then parsed the entries to find exact duplicates based on matching fields. the exact duplicates came about because i uploaded the same RIS file twice to my citeulike library, as a result of the upload being interrupted the first time.

it works (i think), but since this is my very first python program, i would really appreciate feedback on how the program could be improved.


thanks much !!!!

suresh

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~éééé

InFileName= "original_library.ris"
INBIBFILE=open(InFileName,'r')

OutFileName= "C:/users/skrishna/desktop/library_without_duplicates.ris"
OUTBIBFILE=open(OutFileName,'w')

OutDupFileName= "C:/users/skrishna/desktop/library_of_duplicates.ris"
OUTDUPBIBFILE=open(OutDupFileName,'w')

current_entry=[]
current_keyval=[]
current_keys=[]

numduplicates=0

for line in INBIBFILE: #large file, so prefer not to use readlines()

    if not current_entry and line.isspace():
        continue  #dont write out successive blanks or initial blanks

    elif current_entry and line.isspace(): #reached a blank that demarcates end of current entry

        keyvalue=''.join(current_keyval) #generated a key based on certain fields

        if keyvalue not in current_keys: #is a unique entry

            current_keys.append(keyvalue) #append current key to list of keys
            current_entry.append(line) #add the blank line to current entry
            OUTBIBFILE.writelines(current_entry) #write out to new bib file without duplicates

            current_entry=[] #clear current entry for next one
            current_keyval=[] #clear current key
        else:

            numduplicates=numduplicates+1 #increment the number of duplicates
            current_entry.append(line) #add the blank line at end of entry
            OUTDUPBIBFILE.writelines(current_entry) #write out to list of duplicates file

            current_entry=[] #clear current entry for next one
            current_keyval=[] #clear current key
    elif len(line)>2: #not a blank, so more stuff in current entry
        current_entry.append(line)

        if line[0:2] in ('TY','JF','EP','TI','SP','KW','AU','PY','UR'): #only if line starts with these fields
            current_keyval.append(line) #append to current key

INBIBFILE.close()
OUTBIBFILE.close()
OUTDUPBIBFILE.close()

print(numduplicates) #number of duplicate entries found
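
for comparison, here is a rough sketch of the same idea packed into a function. it is only an illustration under a few assumptions, not a replacement for the program above: the names KEY_FIELDS and split_duplicates are made up for this example, it takes any iterable of lines instead of opening the files itself, it keeps the seen keys in a set so the membership test does not slow down as the library grows, and it also files the last entry when the input does not end with a blank line. unlike the program above it does not copy the separating blank lines into the entries and does not skip very short non-blank lines.

```python
# Field tags used to build the comparison key (same tags as in the program above).
KEY_FIELDS = ('TY', 'JF', 'EP', 'TI', 'SP', 'KW', 'AU', 'PY', 'UR')

def split_duplicates(lines):
    """Split RIS lines into (unique_entries, duplicate_entries).

    Each entry is returned as a list of its lines; entries are assumed
    to be separated by one or more blank lines.
    """
    seen = set()                  # keys of entries already kept
    unique, duplicates = [], []
    entry, key = [], []           # lines and key lines of the entry being read

    def flush():
        # File the finished entry under unique or duplicates, then reset.
        if not entry:
            return                # nothing collected (repeated or leading blanks)
        keyvalue = ''.join(key)
        if keyvalue in seen:
            duplicates.append(list(entry))
        else:
            seen.add(keyvalue)
            unique.append(list(entry))
        del entry[:]
        del key[:]

    for line in lines:
        if not line or line.isspace():
            flush()               # blank line marks the end of an entry
        else:
            entry.append(line)
            if line[0:2] in KEY_FIELDS:
                key.append(line)
    flush()                       # last entry may not be followed by a blank line
    return unique, duplicates
```

with something like this, the file handling stays outside the function: pass the open input file as `lines`, then write each entry in the two returned lists to its output file followed by a blank line.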

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
