I'm seeing some unexpected output when I use a script (included at end) to iterate over large text files. I am unsure of the source of the unexpected output and any help would be much appreciated.
Background Python v 2.7.1 Windows 7 32bit Reading and writing to an external USB hard drive Data files are ~4GB text (.fastq) file, it has been uncompressed (gzip). This file has no errors or formatting problems, it seems to have uncompressed just fine. 64M lines, each 'entry' is split across 4 consecutive lines, 16M entries. My python script iterates over data files 4 lines at a time, selects and writes groups of four lines to the output file. I will end up selecting roughly 85% of the entries. In my output I am seeing lines that don't occur in the original file, and that don't match any lines in the original file. The incidences of badly formatted lines don't seem to match up with any patterns in the data file, and occur across multiple different data files. I've included 20 consecutive lines of input and output. Each of these 5 'records' should have been selected and printed to the output file. But there is a problem with the 4th and 5th entries in the output, and it no longer matches the input as expected. For example the line: TTCTGTGAGTGATTTCCTGCAAGACAGGAATGTCAGT never occurs in the original data. Sorry for the large block of text below. Other pertinent info, I've tried a related perl script, and ran into similar issues, but not in the same places. Any help or insight would be appreciated. Thanks __EXAMPLE RAW DATA FILE REGION__ @HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:N:0: CGCGTGTGCAGGTTTATAGAACAAAACAGCTGCAGATTAGTAGCAGCGCACGGAGAGGTGTGTCTGTTTATTGTCCTCAGCAGGCAGACATGTTTGTGGTC + @@@DDADDHHHHHB9+2A<??:?G9+C)???G@DB@@DGFB<0*?FF?0F:@/54'-;;?B;>;6>>>>(5@CDAC(5(5:5,(8?88?BC@######### @HWI-ST0747:167:B02DEACXX:8:1101:3134:167090 1:N:0: TTCTAGTGCAGGGCGACAGCGTTGCGGAGCCGGTCCGAGTCTGCTGGGTCAGTCATGGCTAGTTGGTACTATAACGACACAGGGCGAGACCCAGATGCAAA + @CCFFFDFHHHHHIIIIJJIJHHIIIJHGHIJI@GFFDDDFDDCEEEDCCBDCCCDDDDCCB>>@C(4@ADCA>>?BBBDDABB055<>-?A<B1:@ACC: @HWI-ST0747:167:B02DEACXX:8:1101:3002:167092 1:N:0: CTTTGCTGCAGGCTCATCCTGACATGACCCTCCAGCATGACAATGCCACCAGCCATACTGCTCGTTCTGTGTGTGATTTCCAGCACCCCAGTAAATATGTA + CCCFFFFFHHHHHIJIEHIH@AHFAGHIGIIGGEIJGIJIIIGIIIGEHGEHIIJIEHH@FHGH@=ACEHHFBFFCE@AACC<ACDB;;B?C3>A>AD>BA @HWI-ST0747:167:B02DEACXX:8:1101:3022:167094 1:N:0: ATTCCGTGCAGGCCAACTCCCGACGGACATCCTTGCTCAGACTGCAGCGATAGTGGTCGATCAGGGCCCTGTTGTTCCATCCCACTCCGGCGACCAGGTTC + CCCFFFFFHHHHHIDHJIIHIIIJIJIIJJJJGGIIFHJIIGGGGIIEIFHFF>CBAECBDDDC:??B=AAACD?8@:>C@?8CBDDD@D99B@>3884>A @HWI-ST0747:167:B02DEACXX:8:1101:3095:167100 1:N:0: CGTGATTGCAGGGACGTTACAGAGACGTTACAGGGATGTTACAGGGACGTTACAGAGACGTTAAAGAGATGTTACAGGGATGTTACAGACAGAGACGTTAC + __EXAMPLE PROBLEMATIC OUTPUT FILE REGION__ @HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:N:0: CGCGTGTGCAGGTTTATAGAACAAAACAGCTGCAGATTAGTAGCAGCGCACGGAGAGGTGTGTCTGTTTATTGTCCTCAGCAGGCAGACATGTTTGTGGTC + @@@DDADDHHHHHB9+2A<??:?G9+C)???G@DB@@DGFB<0*?FF?0F:@/54'-;;?B;>;6>>>>(5@CDAC(5(5:5,(8?88?BC@######### @HWI-ST0747:167:B02DEACXX:8:1101:3134:167090 1:N:0: TTCTAGTGCAGGGCGACAGCGTTGCGGAGCCGGTCCGAGTCTGCTGGGTCAGTCATGGCTAGTTGGTACTATAACGACACAGGGCGAGACCCAGATGCAAA + @CCFFFDFHHHHHIIIIJJIJHHIIIJHGHIJI@GFFDDDFDDCEEEDCCBDCCCDDDDCCB>>@C(4@ADCA>>?BBBDDABB055<>-?A<B1:@ACC: @HWI-ST0747:167:B02DEACXX:8:1101:3002:167092 1:N:0: CTTTGCTGCAGGCTCATCCTGACATGACCCTCCAGCATGACAATGCCACCAGCCATACTGCTCGTTCTGTGTGTGATTTCCAGCACCCCAGTAAATATGTA + CCCFFFFFHHHHHIJIEHIH@AHFAGHIGIIGGEIJGIJIIIGIIIGEHGEHIIJIEHH@FHGH@=ACEHHFBFFCE@AACC<ACDB;;B?C3>A>AD>BA TTCTGTGAGTGATTTCCTGCAAGACAGGAATGTCAGT + BCCFFDFFHHHHFIJIJJHIFGIIIIGGGIGGIJIJIGIGIGIGHHIGIIJGJJJIIJIIEHIHHHFFFB@>CCE@BEDCDDAC?CC?ACC??<C>>ADDD @HWI-ST0747:167:B02DEACXX:8:1304:19473:44548 1:N:0: CTACAGTGCAGGCACCCGGCCCGCCACAATGAGTCGCTAGAGCGCAATGAGACAAGTAAAGCTCCCCGACCAAACCCTCCCCTAACCCGGACGATGCTGGG + BCCFFFFFHHHHHIJEHJJIIGIJIJJJJGIJIDHDGIGJJJJIGGFFFFEEEEED@CCDDC>C>BBD?BDBAABBBBBDDD@BCD@?@BDBDDDBDCCC2 __PYTHON CODE __ import glob my_in_files = glob.glob ('E:/PINK/Paired_End/raw/gzip/*.fastq') for each in my_in_files: #print(each) out = each.replace('/gzip', '/rem_clusters2' ) #print (out) INFILE = open (each, 'r') OUTFILE = open (out , 'w') # Tracking Variables Reads = 0 Writes = 0 Check_For_End_Of_File = 0 #Updates print ("Reading File: " + each) print ("Writing File: " + out) # Read FASTQ File by group of four lines while Check_For_End_Of_File == 0: # Read the next four lines from the FASTQ file ID_Line_1 = INFILE.readline() Seq_Line = INFILE.readline() ID_Line_2 = INFILE.readline() Quality_Line = INFILE.readline() # Strip off leading and trailing whitespace characters ID_Line_1 = ID_Line_1.strip() Seq_Line = Seq_Line.strip() ID_Line_2 = ID_Line_2.strip() Quality_Line = Quality_Line.strip() Reads = Reads + 1 #Check that I have not reached the end of file if Quality_Line == "": #End of file reached, print update print ("Saw " + str(Reads) + " reads") print ("Wrote " + str(Writes) + " reads") Check_For_End_Of_File = 1 break #Check that ID_Line_1 starts with @ if not ID_Line_1.startswith('@'): print ("**ERROR**") print (each) print ("Read Number " + str(Reads)) print ID_Line_1 + ' does not start with @' break #ends the while loop # Select Reads that I want to keep ID = ID_Line_1.partition(' ') if (ID[2] == "1:N:0:" or ID[2] == "2:N:0:"): # Write to file, maintaining group of 4 OUTFILE.write(ID_Line_1 + "\n") OUTFILE.write(Seq_Line + "\n") OUTFILE.write(ID_Line_2 + "\n") OUTFILE.write(Quality_Line + "\n") Writes = Writes +1 INFILE.close() OUTFILE.close() _______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor