Hi everybody,
I wrote this to remove exact duplicate entries from my CiteULike
library. I exported my entries in RIS format and then parsed them to
find exact duplicates based on matching fields. The duplicates came
about because I uploaded the same RIS file to my CiteULike library
twice, after the first upload was interrupted.
It works (I think), but since this is my very first Python program, I
would really appreciate feedback on how the program could be improved.
Thanks much!
Suresh
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~éééé
InFileName = "original_library.ris"
INBIBFILE = open(InFileName, 'r')
OutFileName = "C:/users/skrishna/desktop/library_without_duplicates.ris"
OUTBIBFILE = open(OutFileName, 'w')
OutDupFileName = "C:/users/skrishna/desktop/library_of_duplicates.ris"
OUTDUPBIBFILE = open(OutDupFileName, 'w')
current_entry = []
current_keyval = []
current_keys = []
numduplicates = 0
for line in INBIBFILE:  # large file, so prefer not to use readlines()
    if not current_entry and line.isspace():
        continue  # don't write out successive blanks or initial blanks
    elif current_entry and line.isspace():
        # reached a blank that demarcates end of current entry
        keyvalue = ''.join(current_keyval)  # key based on certain fields
        if keyvalue not in current_keys:  # is a unique entry
            current_keys.append(keyvalue)  # append current key to list of keys
            current_entry.append(line)  # add the blank line to current entry
            # write out to new bib file without duplicates
            OUTBIBFILE.writelines(current_entry)
            current_entry = []  # clear current entry for next one
            current_keyval = []  # clear current key
        else:
            numduplicates = numduplicates + 1  # increment the number of duplicates
            current_entry.append(line)  # add the blank line at end of entry
            # write out to list of duplicates file
            OUTDUPBIBFILE.writelines(current_entry)
            current_entry = []  # clear current entry for next one
            current_keyval = []  # clear current key
    elif len(line) > 2:  # not a blank, so more stuff in current entry
        current_entry.append(line)
        # only if line starts with these fields
        if line[0:2] in ('TY', 'JF', 'EP', 'TI', 'SP', 'KW', 'AU', 'PY', 'UR'):
            current_keyval.append(line)  # append to current key
INBIBFILE.close()
OUTBIBFILE.close()
OUTDUPBIBFILE.close()
print(numduplicates)
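
One possible variation on the loop above, offered as a minimal sketch
rather than a definitive rewrite. It assumes Python 3 and that entries
are separated by one or more blank lines. The names split_duplicates,
KEY_TAGS and flush are hypothetical and not part of the original
script. Two things differ from the loop above: seen keys are kept in a
set, so the membership test does not scan a growing list, and the final
entry is still emitted when no blank line follows it (in the loop above
an entry is only written once a blank line is read after it). Unlike
the original, this sketch also keeps non-blank lines of two characters
or fewer.

```python
# Tags whose lines make up the comparison key (same list as the script).
KEY_TAGS = ('TY', 'JF', 'EP', 'TI', 'SP', 'KW', 'AU', 'PY', 'UR')


def split_duplicates(lines):
    """Split RIS lines into (unique, duplicates), each a list of entries.

    Hypothetical helper: an entry is a list of lines, and entries are
    assumed to be separated by one or more blank lines.
    """
    seen = set()                 # keys of entries already encountered
    unique, duplicates = [], []
    entry, key_parts = [], []

    def flush():
        # Classify the entry collected so far, then reset the buffers.
        if not entry:
            return
        key = tuple(key_parts)
        if key in seen:
            duplicates.append(list(entry))
        else:
            seen.add(key)
            unique.append(list(entry))
        del entry[:]
        del key_parts[:]

    for line in lines:
        if not line or line.isspace():
            flush()              # blank line ends the current entry
        else:
            entry.append(line)
            if line[0:2] in KEY_TAGS:
                # Strip the line ending so a last line without a
                # trailing newline still compares equal.
                key_parts.append(line.rstrip('\r\n'))
    flush()                      # last entry may lack a trailing blank
    return unique, duplicates


sample = [
    "TY  - JOUR\n", "TI  - A title\n", "ER  - \n", "\n",
    "TY  - JOUR\n", "TI  - A title\n", "ER  - \n", "\n",
    "TY  - JOUR\n", "TI  - Another title\n", "ER  - ",
]
unique, duplicates = split_duplicates(sample)
print(len(unique), len(duplicates))
```

Because the function only iterates over its argument, an open file
object can be passed in place of the sample list, which still reads the
file one line at a time instead of loading it with readlines(). Each
returned entry can then be written out with writelines(), followed by a
blank line to separate it from the next entry.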
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor