[Tutor] first python program to find citeulike duplicates

2008-11-20 Thread Suresh Krishna


hi everybody,

i wrote this to solve the problem of exact duplicate entries in my  
citeulike library, that i wanted to remove. so i exported my entries in  
ris format, and then parsed the entries to find exact duplicates based on  
matching fields. the exact duplicates came about because i uploaded the  
same RIS file twice to my citeulike library, as a result of the upload  
being interrupted the first time.


it works (i think), but since this is my very first python program, i  
would really appreciate feedback on how the program could be improved..


thanks much 

suresh

~~

InFileName= original_library.ris;
INBIBFILE=open(InFileName,'r')

OutFileName= C:/users/skrishna/desktop/library_without_duplicates.ris;
OUTBIBFILE=open(OutFileName,'w')

OutDupFileName= C:/users/skrishna/desktop/library_of_duplicates.ris;
OUTDUPBIBFILE=open(OutDupFileName,'w')

current_entry=[]
current_keyval=[]
current_keys=[]

numduplicates=0

for line in INBIBFILE: #large file, so prefer not to use readlines()

if not current_entry and line.isspace():
continue  #dont write out successive blanks or initial blanks
elif current_entry and line.isspace(): #reached a blank that  
demarcates end of current entry


keyvalue=''.join(current_keyval) #generated a key based on certain  
fields

if keyvalue not in current_keys: #is a unique entry
current_keys.append(keyvalue) #append current key to list of  
keys

current_entry.append(line) #add the blank line to current entry
OUTBIBFILE.writelines(current_entry) #write out to new bib  
file without duplicates

current_entry=[] #clear current entry for next one
current_keyval=[] #clear current key
else:
numduplicates=numduplicates+1 #increment the number of  
duplicates

current_entry.append(line) #add the blank line at end of entry
OUTDUPBIBFILE.writelines(current_entry) #write out to list of  
duplicates file

current_entry=[] #clear current entry for next one
current_keyval=[] #clear current key
elif len(line)2: #not a blank, so more stuff in currrent entry
current_entry.append(line)
if line[0:2] in ('TY','JF','EP','TI','SP','KW','AU','PY','UR'):  
#only if line starts with these fields

current_keyval.append(line) #append to current key

INBIBFILE.close()
OUTBIBFILE.close()
OUTDUPBIBFILE.close()

print numduplicates

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] first python program to find citeulike duplicates

2008-11-20 Thread A.T.Hofkamp

Suresh Krishna wrote:
it works (i think), but since this is my very first python program, i 
would really appreciate feedback on how the program could be improved..


First of all, welcome, new Python programmer,


My main trouble with the program is lack of structure, some lines deal with 
individual lines of the input file, while other deal with complete 
citation-records. I think the program would improve if this is more clearly 
expressed in the code.


One of the reasons that you got the current structure may be due to the use of 
the 'for line in bibfile' loop, which gives you a new line at only one point 
in the program.
Such a for-loop is good for simple line processing (for each line do ...), but 
your program has grown beyond that imho. For this reason I'd like to propose 
to throw out the for-loop, and use a while-loop with three parts instead.



For simplicity, I left out the handling of the records (which we can discuss 
later perhaps)



inbibfile = open(InFileName,'r')

finished = False
while not finished:
# Read  skip blank lines
while True:
line = inbibfile.readline()
if len(line) == 0:
finished = True
break  # end of file detected
if len(line.rstrip())  0:  # Line contains non-white-space
break
# else it was a blank line, skip, read next line

if finished:
break

# line contains the first non-whitespace line of a new record
# read the entire record until the next empty line or EOF
current_entry = [line]
while True:
line = inbibfile.readline()
if len(line) == 0:
finished = True
break  # Found EOF
if len(line.rstrip())  0:
current_entry.append(line)
else: # Found an empty line, finished with current record
break

# finished reading (empty line or EOF encountered)
## Do something with current_entry

# if not finished, do the next record

inbibfile.close()



You can go further by introducing a few functions:

inbibfile = open(InFileName,'r')

finished = False
while not finished:
finished, line = read_blank_lines(inbibfile)
if finished:
break

# line contains the first non-whitespace line of a new record
current_entry = make_new_record(line)
finished, current_entry = read_record_lines(current_entry, inbibfile)

# finished reading (empty line or EOF encountered)
## Do something with current_entry

inbibfile.close()

and the functions then contain the details of each step.


A few comments about your current program:



InFileName= original_library.ris;


In Python, you don't end a line with a ';'.
Also, 'InFileName' is a constant, which is normally given an all-uppercase name.


OUTDUPBIBFILE=open(OutDupFileName,'w')
current_entry=[]
numduplicates=0


One of the most difficult things to do in programming is achieving 
consistency. Above you have 3 global variables, and you use several different 
way of writing their names. You should try to write them the same.
The big advantage of consistent names is that just by looking at a name you 
know what kind of thing it is, which reduces the chance of making errors in 
your program.



(Don't feel bad about it, this seems a very simple and obvious problem but in 
fact it is very hard to achieve, especially for larger programs and projects 
with several people).


if keyvalue not in current_keys: #is a unique entry
current_keys.append(keyvalue) #append current key to list of 
keys

current_entry.append(line) #add the blank line to current entry
OUTBIBFILE.writelines(current_entry) #write out to new bib 
file without duplicates

current_entry=[] #clear current entry for next one
current_keyval=[] #clear current key


Look at the code of each line above, then look at the comment at the same 
line. What does the comment tell you that the code didn't?


continue  #dont write out successive blanks or initial blanks
elif current_entry and line.isspace(): #reached a blank that demarcates 
end of current entry
keyvalue=''.join(current_keyval) #generated a key based on certain 
fields

 elif len(line)2: #not a blank, so more stuff in currrent entry

Now do the same here.


You see the difference?

In the first lines you repeat your code in words. In the second lines, you 
tell what you aim to achieve at that line, ie WHY do you do what you are doing 
rather than just WHAT does the line do (since that is already defined in the 
code).
In general, there is no need to comment each line. You may also assume that 
the reader knows Python (otherwise there is little value for him reading the 
code).


Try to make 'blocks' of code with one or more empty lines in between (see my 
example code), and write a comment what that block as a whole aims to do.



Hope I didn't critize you too much. For a first Python program it is quite nice.

Sincerely,