Just a few notes...

On Wed, 18 Jul 2012, Ryan Waples wrote:
<snip>
> import glob
>
> my_in_files = glob.glob ('E:/PINK/Paired_End/raw/gzip/*.fastq')
>
> for each in my_in_files:
> #print(each)
>     out = each.replace('/gzip', '/rem_clusters2' )
> #print (out)
>     INFILE = open (each, 'r')
>     OUTFILE = open (out , 'w')
It's slightly confusing to see your comments left-aligned instead of indented with the code they refer to. At first glance it looked as though your block ended here, when it does, in fact, continue.
> # Tracking Variables
>     Reads = 0
>     Writes = 0
>     Check_For_End_Of_File = 0
>
> #Updates
>     print ("Reading File: " + each)
>     print ("Writing File: " + out)
>
> # Read FASTQ File by group of four lines
>     while Check_For_End_Of_File == 0:
This is Python, not C - checking for EOF is probably silly (unless you're really checking for end of data) - you can just do:

    for line in INFILE:
        ID_Line_1 = line
        Seq_Line = next(INFILE)  # On Python older than 2.6, use INFILE.next()
        ID_Line_2 = next(INFILE)
        Quality_Line = next(INFILE)
> # Read the next four lines from the FASTQ file
>         ID_Line_1 = INFILE.readline()
>         Seq_Line = INFILE.readline()
>         ID_Line_2 = INFILE.readline()
>         Quality_Line = INFILE.readline()
>
> # Strip off leading and trailing whitespace characters
>         ID_Line_1 = ID_Line_1.strip()
>         Seq_Line = Seq_Line.strip()
>         ID_Line_2 = ID_Line_2.strip()
>         Quality_Line = Quality_Line.strip()
Also, it's just extra clutter to call strip like this when you can just tack it on to your original statement:

    for line in INFILE:
        ID_Line_1 = line.strip()
        Seq_Line = next(INFILE).strip()  # On Python older than 2.6, use INFILE.next()
        ID_Line_2 = next(INFILE).strip()
        Quality_Line = next(INFILE).strip()
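To make that concrete, here is a minimal sketch of the four-lines-at-a-time loop wrapped in a generator. The name read_fastq_records and the sample reads are mine, not from your script, and it assumes Python 3. I've also used next(handle, '') rather than plain next(handle), so a file that ends partway through a record gives you empty strings instead of a StopIteration error; treat that as one possible choice, not the only one.

```python
import io


def read_fastq_records(handle):
    """Yield (id_line_1, seq_line, id_line_2, quality_line) tuples.

    read_fastq_records is a hypothetical helper name for illustration.
    """
    for line in handle:
        id_line_1 = line.strip()
        # next(handle, '') returns '' instead of raising StopIteration
        # if the input ends in the middle of a record.
        seq_line = next(handle, '').strip()
        id_line_2 = next(handle, '').strip()
        quality_line = next(handle, '').strip()
        yield id_line_1, seq_line, id_line_2, quality_line


# Made-up sample data standing in for a real FASTQ file.
sample = io.StringIO(
    "@read1 1:N:0:\nACGT\n+\nIIII\n"
    "@read2 1:Y:0:\nTTGA\n+\nHHHH\n"
)
records = list(read_fastq_records(sample))
```

Each item in records is one complete group of four stripped lines, so the rest of your loop body can work on whole reads.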
>         Reads = Reads + 1
>
> #Check that I have not reached the end of file
>         if Quality_Line == "":
> #End of file reached, print update
>             print ("Saw " + str(Reads) + " reads")
>             print ("Wrote " + str(Writes) + " reads")
>             Check_For_End_Of_File = 1
>             break
This break is superfluous next to the flag (or the flag is superfluous next to the break) - break takes you out of the while loop immediately, so no further lines of code will be evaluated, including the original `while` comparison. You can also just test Quality_Line for truthiness directly, since empty strings evaluate to false. I would actually just say:

    if Quality_Line:
        #Do the rest of your stuff here
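A quick illustration of the truthiness point, assuming Python 3 and using io.StringIO as a stand-in for your open file (the one-line sample is made up). readline() returns an empty string at end of file, and an empty string is false in an `if`. Note that a blank line in the middle of a file also strips down to '', so this test can't tell the two cases apart.

```python
import io

# Stand-in for an open file containing a single quality line.
handle = io.StringIO("IIII\n")

first = handle.readline().strip()   # the quality line itself
second = handle.readline().strip()  # readline() gives '' at end of file

seen = []
for quality_line in (first, second):
    if quality_line:  # empty string is false, so the second one is skipped
        seen.append(quality_line)
```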
> #Check that ID_Line_1 starts with @
>         if not ID_Line_1.startswith('@'):
>             print ("**ERROR**")
>             print (each)
>             print ("Read Number " + str(Reads))
>             print ID_Line_1 + ' does not start with @'
>             break #ends the while loop
>
> # Select Reads that I want to keep
>         ID = ID_Line_1.partition(' ')
>         if (ID[2] == "1:N:0:" or ID[2] == "2:N:0:"):
> # Write to file, maintaining group of 4
>             OUTFILE.write(ID_Line_1 + "\n")
>             OUTFILE.write(Seq_Line + "\n")
>             OUTFILE.write(ID_Line_2 + "\n")
>             OUTFILE.write(Quality_Line + "\n")
>             Writes = Writes +1
>
>     INFILE.close()
>     OUTFILE.close()
You could (as long as you're on 2.6 or greater) just use the `with` block for reading the files, then you don't need to worry about closing - the block takes care of that, even on errors. (Putting both open() calls in a single `with` statement, as I do here, needs 2.7 or 3.1 and up; on 2.6 you'd nest two `with` blocks instead.)

    for each in my_in_files:
        out = each.replace('/gzip', '/rem_clusters2' )
        with open (each, 'r') as INFILE, open (out, 'w') as OUTFILE:
            for line in INFILE:
                # Do your work here...

A few stylistic points:

ALL_CAPS are usually reserved for constants - infile and outfile are perfectly legitimate names.

Caps_In_Variable_Names are usually discouraged. Class names should be CamelCase (e.g. SimpleHTTPServer), while variable names should be lowercase with underscores if needed, so id_line_1 instead of ID_Line_1.

If you're using Python3 or from __future__ import print_function, rather than doing OUTFILE.write(value + '\n') you can do:

    print(value, file=OUTFILE)

Then you get the \n for free. You could also just do:

    print(val1, val2, val3, sep='\n', end='\n', file=OUTFILE)

The end parameter is there for example only, since the default value for end is '\n'.

HTH,
Wayne
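P.S. Pulling the suggestions above together, here is a rough sketch of how one file could be processed, assuming Python 3. The names filter_fastq and KEEP_SUFFIXES are mine; the two suffixes come from your script. I've left out your '@' check to keep it short, and the temporary files at the bottom are only there so the example runs on its own.

```python
import os
import tempfile

# ID suffixes to keep, taken from the original script's comparison.
KEEP_SUFFIXES = ("1:N:0:", "2:N:0:")


def filter_fastq(in_path, out_path):
    """Copy wanted reads from in_path to out_path; return (reads, writes).

    filter_fastq is a hypothetical helper name for illustration.
    """
    reads = writes = 0
    with open(in_path, 'r') as infile, open(out_path, 'w') as outfile:
        for line in infile:
            id_line_1 = line.strip()
            seq_line = next(infile, '').strip()
            id_line_2 = next(infile, '').strip()
            quality_line = next(infile, '').strip()
            if not quality_line:
                break  # incomplete record, most likely end of data
            reads += 1
            # partition(' ')[2] is everything after the first space.
            if id_line_1.partition(' ')[2] in KEEP_SUFFIXES:
                # sep='\n' keeps the group of four; print adds the final '\n'.
                print(id_line_1, seq_line, id_line_2, quality_line,
                      sep='\n', file=outfile)
                writes += 1
    return reads, writes


# Self-contained demo with made-up reads in a temporary directory.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'in.fastq')
out_path = os.path.join(tmp_dir, 'out.fastq')
with open(in_path, 'w') as f:
    f.write("@r1 1:N:0:\nACGT\n+\nIIII\n@r2 1:Y:0:\nTTGA\n+\nHHHH\n")

counts = filter_fastq(in_path, out_path)
with open(out_path) as f:
    kept = f.read()
```

In your script you would call this once per file inside the `for each in my_in_files:` loop and print the two counts afterwards.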