hi everybody,

i wrote this to remove exact duplicate entries from my citeulike library. i exported my entries in RIS format and then parsed them to find exact duplicates based on matching fields. the duplicates came about because i uploaded the same RIS file to my citeulike library twice, after the first upload was interrupted.

it works (i think), but since this is my very first python program, i would really appreciate feedback on how it could be improved.

thanks much !!!!

suresh

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~éééé

InFileName = "original_library.ris"
INBIBFILE = open(InFileName, 'r')

OutFileName = "C:/users/skrishna/desktop/library_without_duplicates.ris"
OUTBIBFILE = open(OutFileName, 'w')

OutDupFileName = "C:/users/skrishna/desktop/library_of_duplicates.ris"
OUTDUPBIBFILE = open(OutDupFileName, 'w')

current_entry=[]
current_keyval=[]
current_keys=[]

numduplicates=0

for line in INBIBFILE: #large file, so prefer not to use readlines()

    if not current_entry and line.isspace():
        continue  #don't write out successive blanks or initial blanks
    elif current_entry and line.isspace(): #reached a blank that demarcates end of current entry

        keyvalue=''.join(current_keyval) #generated a key based on certain fields
        if keyvalue not in current_keys: #is a unique entry
            current_keys.append(keyvalue) #append current key to list of keys
            current_entry.append(line) #add the blank line to current entry
            OUTBIBFILE.writelines(current_entry) #write out to new bib file without duplicates
            current_entry=[] #clear current entry for next one
            current_keyval=[] #clear current key
        else:
            numduplicates=numduplicates+1 #increment the number of duplicates
            current_entry.append(line) #add the blank line at end of entry
            OUTDUPBIBFILE.writelines(current_entry) #write out to list of duplicates file
            current_entry=[] #clear current entry for next one
            current_keyval=[] #clear current key
    elif len(line)>2: #not a blank, so more stuff in current entry
        current_entry.append(line)
        if line[0:2] in ('TY','JF','EP','TI','SP','KW','AU','PY','UR'): #only if line starts with these fields
            current_keyval.append(line) #append to current key

INBIBFILE.close()
OUTBIBFILE.close()
OUTDUPBIBFILE.close()

print(numduplicates)
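
for comparison, here is a minimal sketch of the same idea pulled into a function, so it can be tried on a few lines in memory before touching the real files. it is only one possible variation, not a tested replacement. the names split_duplicates and KEY_TAGS and the sample entries are made up for illustration. it differs from the script above in three ways: the keys go into a set instead of a list (membership tests stay fast as the library grows), the last entry is still handled when the file does not end with a blank line, and every non-blank line is kept rather than only lines longer than 2 characters.

```python
# Tags whose lines make up the duplicate-detection key (same list as the script above).
KEY_TAGS = ('TY', 'JF', 'EP', 'TI', 'SP', 'KW', 'AU', 'PY', 'UR')


def split_duplicates(lines):
    """Split RIS lines into (unique_entries, duplicate_entries).

    Each entry is a list of its non-blank lines. Entries are separated by
    blank lines; two entries are duplicates when their key lines match exactly.
    """
    seen = set()                # keys of entries already emitted as unique
    unique, duplicates = [], []
    entry, key_lines = [], []   # lines and key lines of the entry being read

    def flush():
        # Close the current entry (if any) and file it as unique or duplicate.
        if not entry:
            return
        key = ''.join(key_lines)
        if key in seen:
            duplicates.append(list(entry))
        else:
            seen.add(key)
            unique.append(list(entry))
        del entry[:]
        del key_lines[:]

    for line in lines:
        if not line or line.isspace():
            flush()             # a blank line ends the current entry
        else:
            entry.append(line)
            if line[:2] in KEY_TAGS:
                key_lines.append(line)
    flush()                     # final entry when there is no trailing blank line
    return unique, duplicates


# Hypothetical sample: the first two entries are identical, the third is not.
sample = [
    "TY  - JOUR\n", "TI  - A title\n", "ER  - \n", "\n",
    "TY  - JOUR\n", "TI  - A title\n", "ER  - \n", "\n",
    "TY  - JOUR\n", "TI  - Another title\n", "ER  - \n",
]
unique, duplicates = split_duplicates(sample)
print("%d unique, %d duplicates" % (len(unique), len(duplicates)))
```

since the function accepts any iterable of lines, an open file object could be passed in directly, which still avoids readlines(). the returned entries do not include the separating blank line, so one would need to be written after each entry when saving them back to a RIS file.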

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor