Re: [Tutor] first python program to find citeulike duplicates

A.T.Hofkamp Thu, 20 Nov 2008 05:44:35 -0800

Suresh Krishna wrote:

> it works (i think), but since this is my very first python program, i would really appreciate feedback on how the program could be improved..


First of all, welcome, new Python programmer,

My main trouble with the program is its lack of structure: some lines deal with individual lines of the input file, while others deal with complete citation-records. I think the program would improve if this distinction were expressed more clearly in the code.

One of the reasons that you got the current structure may be the use of the 'for line in bibfile' loop, which gives you a new line at only one point in the program. Such a for-loop is good for simple line processing (for each line do ...), but your program has grown beyond that imho. For this reason I'd like to propose to throw out the for-loop, and use a while-loop with three parts instead (skip blank lines, read one record, process that record).

For simplicity, I left out the handling of the records (which we can discuss later perhaps).



inbibfile = open(InFileName,'r')

finished = False
while not finished:
    # Read & skip blank lines
    while True:
        line = inbibfile.readline()
        if len(line) == 0:
            finished = True
            break  # end of file detected
        if len(line.rstrip()) > 0:  # Line contains non-white-space
            break
        # else it was a blank line, skip, read next line

    if finished:
        break

    # line contains the first non-whitespace line of a new record
    # read the entire record until the next empty line or EOF
    current_entry = [line]
    while True:
        line = inbibfile.readline()
        if len(line) == 0:
            finished = True
            break  # Found EOF
        if len(line.rstrip()) > 0:
            current_entry.append(line)
        else: # Found an empty line, finished with current record
            break

    # finished reading (empty line or EOF encountered)
    ## Do something with current_entry

    # if not finished, do the next record

inbibfile.close()



You can go further by introducing a few functions:

inbibfile = open(InFileName,'r')

finished = False
while not finished:
    finished, line = read_blank_lines(inbibfile)
    if finished:
        break

    # line contains the first non-whitespace line of a new record
    current_entry = make_new_record(line)
    finished, current_entry = read_record_lines(current_entry, inbibfile)

    # finished reading (empty line or EOF encountered)
    ## Do something with current_entry

inbibfile.close()

and the functions then contain the details of each step.
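
A minimal sketch of what those three functions might look like, assuming a record is simply a list of lines. The function names and return values follow the calls above; the bodies are lifted from the first version of the loop, and the collect_records helper is a hypothetical addition that only exists to show the loop running end to end.

```python
def read_blank_lines(infile):
    """Skip blank lines. Return (finished, line).

    finished is True when end of file was reached; otherwise line holds
    the first non-blank line of the next record.
    """
    while True:
        line = infile.readline()
        if len(line) == 0:
            return True, line  # end of file detected
        if len(line.rstrip()) > 0:  # line contains non-white-space
            return False, line


def make_new_record(line):
    """Start a new record (assumption: a record is a list of lines)."""
    return [line]


def read_record_lines(current_entry, infile):
    """Add lines to current_entry until an empty line or EOF.

    Return (finished, current_entry).
    """
    while True:
        line = infile.readline()
        if len(line) == 0:
            return True, current_entry  # found EOF
        if len(line.rstrip()) > 0:
            current_entry.append(line)
        else:
            return False, current_entry  # empty line ends the record


def collect_records(infile):
    """Hypothetical helper: run the main loop and return all records."""
    records = []
    finished = False
    while not finished:
        finished, line = read_blank_lines(infile)
        if finished:
            break

        current_entry = make_new_record(line)
        finished, current_entry = read_record_lines(current_entry, infile)
        records.append(current_entry)
    return records
```

Because the functions only call readline(), they work on any file-like object, which makes them easy to try out on an in-memory string before pointing them at the real .ris file.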


A few comments about your current program:



InFileName= "original_library.ris";


In Python, you don't end a line with a ';'.
Also, 'InFileName' is a constant, which is normally given an all-uppercase name.

OUTDUPBIBFILE=open(OutDupFileName,'w')
current_entry=[]
numduplicates=0

One of the most difficult things to do in programming is achieving consistency. Above you have 3 global variables, and you use several different ways of writing their names. You should try to write them the same way. The big advantage of consistent names is that just by looking at a name you know what kind of thing it is, which reduces the chance of making errors in your program.

(Don't feel bad about it, this seems a very simple and obvious problem but in fact it is very hard to achieve, especially for larger programs and projects with several people).
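
As an illustration only, here is one way the names could be made consistent, assuming the common Python convention (the one PEP 8 describes) of UPPER_CASE_WITH_UNDERSCORES for constants and lower_case_with_underscores for ordinary variables. The output file name is a made-up value, since the original one is not shown here.

```python
# Constants: all-uppercase, words separated by underscores.
IN_FILE_NAME = "original_library.ris"
OUT_DUP_FILE_NAME = "duplicates.ris"  # hypothetical value, not from the original program

# Ordinary variables: all-lowercase, words separated by underscores.
current_entry = []
num_duplicates = 0

# A file object opened from OUT_DUP_FILE_NAME would follow the same rule
# for variables, for example: out_dup_bib_file = open(OUT_DUP_FILE_NAME, 'w')
```

Which convention you pick matters less than applying it everywhere.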


>        if keyvalue not in current_keys: #is a unique entry
>            current_keys.append(keyvalue) #append current key to list of keys
>            current_entry.append(line) #add the blank line to current entry
>            OUTBIBFILE.writelines(current_entry) #write out to new bibfile without duplicates
>            current_entry=[] #clear current entry for next one
>            current_keyval=[] #clear current key

Look at the code of each line above, then look at the comment at the same line. What does the comment tell you that the code didn't?


>        continue  #dont write out successive blanks or initial blanks

>     elif current_entry and line.isspace(): #reached a blank that demarcates end of current entry
>        keyvalue=''.join(current_keyval) #generated a key based on certain fields

>     elif len(line)>2: #not a blank, so more stuff in currrent entry

Now do the same here.


You see the difference?

In the first lines you repeat your code in words. In the second lines, you tell what you aim to achieve at that line, i.e. WHY you do what you are doing rather than just WHAT the line does (since that is already defined in the code).

In general, there is no need to comment each line. You may also assume that the reader knows Python (otherwise there is little value for him in reading the code).

Try to make 'blocks' of code with one or more empty lines in between (see my example code), and write a comment saying what that block as a whole aims to do.
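
To make the idea concrete, here is a small sketch of the 'unique entry' step written as blocks with one comment each that explains the aim. The function name, its parameters, and the use of a set for the keys are my assumptions for the sake of the example, not your original code.

```python
def keep_if_unique(entry_lines, key, seen_keys, output_lines):
    """Add entry_lines to output_lines unless key was seen before.

    Return True if the entry was kept, False if it was a duplicate.
    """
    # Duplicates are recognized by key alone, so two records with the
    # same key count as duplicates even if other fields differ.
    if key in seen_keys:
        return False

    # First time we meet this key: remember it so that later copies are
    # dropped, and keep the record for the duplicate-free output.
    seen_keys.add(key)
    output_lines.extend(entry_lines)
    return True
```

Note that neither comment restates what a line does; each one says why the block is there.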



Hope I didn't criticize you too much. For a first Python program it is quite nice.

Sincerely,
Albert
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
