Re: [Tutor] Problem When Iterating Over Large Test Files

Wayne Werner Thu, 19 Jul 2012 04:20:09 -0700

Just a few notes...

On Wed, 18 Jul 2012, Ryan Waples wrote:
<snip>


import glob

my_in_files = glob.glob ('E:/PINK/Paired_End/raw/gzip/*.fastq')

for each in my_in_files:
        #print(each)
        out = each.replace('/gzip', '/rem_clusters2' )
        #print (out)
        INFILE = open (each, 'r')
        OUTFILE = open (out , 'w')


It's slightly confusing to see your comments left-aligned instead of with the
code they refer to. At first glance it looked as though your block ended here,
when it does, in fact, continue.

# Tracking Variables
        Reads = 0
        Writes = 0
        Check_For_End_Of_File = 0

#Updates
        print ("Reading File: " + each)
        print ("Writing File: " + out)

# Read FASTQ File by group of four lines
        while Check_For_End_Of_File == 0:


This is Python, not C - checking for EOF is probably silly (unless you're
really checking for end of data) - you can just do:

for line in INFILE:
    ID_Line_1 = line
    Seq_Line = next(INFILE) # Replace with INFILE.next() for Python2
    ID_Line_2 = next(INFILE)
    Quality_Line = next(INFILE)


                # Read the next four lines from the FASTQ file
                ID_Line_1               = INFILE.readline()
                Seq_Line                = INFILE.readline()
                ID_Line_2               = INFILE.readline()
                Quality_Line    = INFILE.readline()

                # Strip off leading and trailing whitespace characters
                ID_Line_1               = ID_Line_1.strip()
                Seq_Line                = Seq_Line.strip()
                ID_Line_2               = ID_Line_2.strip()
                Quality_Line    = Quality_Line.strip()


Also, it's just extra clutter to call strip like this when you can just tack it
on to your original statement:

for line in INFILE:
    ID_Line_1 = line.strip()
    Seq_Line = next(INFILE).strip() # Replace with INFILE.next() for Python2
    ID_Line_2 = next(INFILE).strip()
    Quality_Line = next(INFILE).strip()
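
To make that concrete, here is a small self-contained sketch of the
four-lines-at-a-time pattern. It runs against an in-memory io.StringIO rather
than a real file. The helper name read_fastq_records and the sample data are
made up for illustration, and passing "" as a default to next() is my own
addition so that a truncated final record gives empty strings instead of
raising StopIteration:

```python
import io

def read_fastq_records(handle):
    """Yield (id_line_1, seq_line, id_line_2, quality_line) tuples.

    Hypothetical helper for illustration, not part of the original script.
    """
    for line in handle:
        id_line_1 = line.strip()
        # next() with a default returns "" rather than raising StopIteration
        # if the input ends in the middle of a record.
        seq_line = next(handle, "").strip()
        id_line_2 = next(handle, "").strip()
        quality_line = next(handle, "").strip()
        yield id_line_1, seq_line, id_line_2, quality_line

# Made-up sample data: two complete four-line records.
sample = io.StringIO(
    "@read1 1:N:0:\nACGT\n+\nIIII\n"
    "@read2 1:Y:0:\nTTGA\n+\nHHHH\n"
)
records = list(read_fastq_records(sample))
print(len(records))
```

The same function should work unchanged on a file object returned by open(),
since both are iterators over lines.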

                Reads = Reads + 1

                #Check that I have not reached the end of file
                if Quality_Line == "":
                        #End of file reached, print update
                        print ("Saw " + str(Reads) + " reads")
                        print ("Wrote " + str(Writes) + " reads")
                        Check_For_End_Of_File = 1
                        break


This break makes the Check_For_End_Of_File flag superfluous - break takes you
straight out of the while loop, so no further lines of code will be evaluated,
including the original `while` comparison. You can also just test Quality_Line
for truthiness directly, since an empty string evaluates to false. I would
actually just say:

if Quality_Line:
    #Do the rest of your stuff here
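
A quick illustration of the truthiness point, with made-up values. One thing
to keep in mind: readline() returns "" at end of file but "\n" for a blank
line, and after strip() both become "", so a stray blank line would look the
same as end of file to this test:

```python
# An empty string is falsy; any non-empty string is truthy.
quality_lines = ["IIII", "", "\n".strip()]
results = [bool(q) for q in quality_lines]
print(results)
```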


                #Check that ID_Line_1 starts with @
                if not ID_Line_1.startswith('@'):
                        print ("**ERROR**")
                        print (each)
                        print ("Read Number " + str(Reads))
                        print (ID_Line_1 + ' does not start with @')
                        break #ends the while loop

                # Select Reads that I want to keep
                ID = ID_Line_1.partition(' ')
                if (ID[2] == "1:N:0:" or ID[2] == "2:N:0:"):
                        # Write to file, maintaining group of 4
                        OUTFILE.write(ID_Line_1 + "\n")
                        OUTFILE.write(Seq_Line + "\n")
                        OUTFILE.write(ID_Line_2 + "\n")
                        OUTFILE.write(Quality_Line + "\n")
                        Writes = Writes +1


        INFILE.close()
        OUTFILE.close()


You could (as long as you're on 2.7 or greater; 2.6 has `with`, but not the
form that opens two files in one statement) just use the `with` block for
reading the files, then you don't need to worry about closing - the block takes
care of that, even on errors:

for each in my_in_files:
    out = each.replace('/gzip', '/rem_clusters2' )
    with open (each, 'r') as INFILE, open (out, 'w') as OUTFILE:
        for line in INFILE:
            # Do your work here...
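
Here is a runnable sketch of that, assuming nothing about your real data: it
writes a tiny made-up FASTQ record to a temporary directory, copies it with
the two-file `with` form, and then checks that both files were closed. The
paths and the names in_path/out_path are hypothetical stand-ins for your own:

```python
import os
import tempfile

# Hypothetical temporary files stand in for the real FASTQ paths.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "in.fastq")
out_path = os.path.join(tmp_dir, "out.fastq")

# Create a one-record input file to read back.
with open(in_path, "w") as handle:
    handle.write("@read1 1:N:0:\nACGT\n+\nIIII\n")

# Open input and output together; both are closed when the block exits.
with open(in_path, "r") as infile, open(out_path, "w") as outfile:
    for line in infile:
        outfile.write(line)

# No explicit close() calls were needed.
print(infile.closed, outfile.closed)
```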


A few stylistic points:
ALL_CAPS are usually reserved for constants - infile and outfile are perfectly
legitimate names.

Caps_In_Variable_Names are usually discouraged. Class names should be CamelCase
(e.g. SimpleHTTPServer), while variable names should be lowercase with
underscores if needed, so id_line_1 instead of ID_Line_1.

If you're using Python3 or from __future__ import print_function, rather than
doing OUTFILE.write(value + '\n') you can do:

    print(value, file=OUTFILE)

Then you get the \n for free. You could also just do:

    print(val1, val2, val3, sep='\n', end='\n', file=OUTFILE)

The end parameter is there for example only, since the default value for end is
'\n'.
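
If you want to see what that produces without touching a real file, here is a
small Python 3 sketch that prints into an in-memory io.StringIO buffer; the
record values are made up:

```python
import io

buf = io.StringIO()

# sep puts a newline between the four values; the default end adds the last one.
print("@read1 1:N:0:", "ACGT", "+", "IIII", sep="\n", file=buf)

written = buf.getvalue()
print(repr(written))
```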


HTH,
Wayne
