> I count only 19 lines.

Yep, you are right. My bad, I think I missed line 20 when copying and pasting.

> The first group has only three lines. See below.

Not so, the first group is actually the first four lines listed below. Lines 1-4 serve as one group. For what it is worth, line 4 should have one character for each character in line 2 (the sequence). Line 1 is much shorter, contains a space, and for this file always ends in either "1:N:0:" (keep) or "1:Y:0:" (remove). The EXAMPLE data is formatted as it should be, except that I'm missing line 20.

> There is a blank line, which I take as NOT part of the input but just a
> spacer. Then:
>
> 1) Line starting with @
> 2) Line of bases CGCGT ...
> 3) Plus sign
> 4) Line starting with @@@
> 5) Line starting with @
> 6) Line of bases TTCTA ...
> 7) Plus sign
>
> and so on. There are TWO lines before the first +, and three before each
> of the others.

I think you are just reading it shifted by one frame. It's not a well-designed format, because the required start character "@" can appear in other places as well: the quality line (line 4 of a group) may also begin with "@".

>
>
>> __EXAMPLE RAW DATA FILE REGION__
>>
>> @HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:N:0:
>> CGCGTGTGCAGGTTTATAGAACAAAACAGCTGCAGATTAGTAGCAGCGCACGGAGAGGTGTGTCTGTTTATTGTCCTCAGCAGGCAGACATGTTTGTGGTC
>> +
>> @@@DDADDHHHHHB9+2A<??:?G9+C)???G@DB@@DGFB<0*?FF?0F:@/54'-;;?B;>;6>>>>(5@CDAC(5(5:5,(8?88?BC@#########
>> @HWI-ST0747:167:B02DEACXX:8:1101:3134:167090 1:N:0:
>> TTCTAGTGCAGGGCGACAGCGTTGCGGAGCCGGTCCGAGTCTGCTGGGTCAGTCATGGCTAGTTGGTACTATAACGACACAGGGCGAGACCCAGATGCAAA
>> +
>> @CCFFFDFHHHHHIIIIJJIJHHIIIJHGHIJI@GFFDDDFDDCEEEDCCBDCCCDDDDCCB>>@C(4@ADCA>>?BBBDDABB055<>-?A<B1:@ACC:
>> @HWI-ST0747:167:B02DEACXX:8:1101:3002:167092 1:N:0:
>> CTTTGCTGCAGGCTCATCCTGACATGACCCTCCAGCATGACAATGCCACCAGCCATACTGCTCGTTCTGTGTGTGATTTCCAGCACCCCAGTAAATATGTA
>> +
>> CCCFFFFFHHHHHIJIEHIH@AHFAGHIGIIGGEIJGIJIIIGIIIGEHGEHIIJIEHH@FHGH@=ACEHHFBFFCE@AACC<ACDB;;B?C3>A>AD>BA
>> @HWI-ST0747:167:B02DEACXX:8:1101:3022:167094 1:N:0:
>> ATTCCGTGCAGGCCAACTCCCGACGGACATCCTTGCTCAGACTGCAGCGATAGTGGTCGATCAGGGCCCTGTTGTTCCATCCCACTCCGGCGACCAGGTTC
>> +
>> CCCFFFFFHHHHHIDHJIIHIIIJIJIIJJJJGGIIFHJIIGGGGIIEIFHFF>CBAECBDDDC:??B=AAACD?8@:>C@?8CBDDD@D99B@>3884>A
>> @HWI-ST0747:167:B02DEACXX:8:1101:3095:167100 1:N:0:
>> CGTGATTGCAGGGACGTTACAGAGACGTTACAGGGATGTTACAGGGACGTTACAGAGACGTTAAAGAGATGTTACAGGGATGTTACAGACAGAGACGTTAC
>> +
>
> Your code says that the first line in each group should start with an @
> sign. That is clearly not the case for the last two groups.
>
> I suggest that your data files have been corrupted.

I'm pretty sure that my raw IN files are all good. It's hard to be sure with such a large file, but the very picky downstream analysis program takes every single raw file just fine (30 of them) and chokes on my filtered files, at regions that don't conform to the correct formatting.

>
>> __PYTHON CODE __
>
> I have re-written your code slightly, to be a little closer to "best
> practice", or at least modern practice. If there is anything you don't
> understand, please feel free to ask.
>
> I haven't tested this code, but it should run fine on Python 2.7.
>
> It will be interesting to see if you get different results with this.

--CODE REMOVED--

Thanks for the suggestions. I've never really felt super comfortable using objects at all, but it's what I want to learn next. This will be helpful and useful.

> for reads, lines in four_lines( INFILE ):
>     ID_Line_1, Seq_Line, ID_Line_2, Quality_Line = lines

Can you explain what is going on here, or point me in the right direction? I see that the parts of 'lines' get assigned, but I'm missing how the file gets iterated over and how reads gets incremented. Do you have a reason why this approach might give a 'better' output?

Thanks again.
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
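
A note on the four_lines question above: the original code was removed from this message, so what follows is only a guess at the general shape of such a generator, not the code that was actually posted. It is a minimal sketch, written for Python 3, that assumes four_lines() yields a running count together with each group of four lines. The helper names looks_valid() and keep() and the tiny sample records are made up for illustration. The checks follow the format described earlier: line 1 starts with "@", line 3 starts with "+", and line 4 has the same length as line 2.

```python
import io


def four_lines(handle):
    """Yield (count, lines) for each complete group of four lines.

    Hypothetical stand-in for the four_lines() in the removed code.
    A trailing group with fewer than four lines is silently dropped.
    """
    count = 0
    group = []
    for line in handle:  # looping over a file object reads it line by line
        group.append(line.rstrip("\n"))
        if len(group) == 4:
            count += 1  # this is where the read counter advances
            yield count, group
            group = []  # start a fresh list for the next group


def looks_valid(lines):
    """Check the group shape; do not rely on '@' alone, since the
    quality line may also start with '@'."""
    header, seq, plus, qual = lines
    return (header.startswith("@")
            and plus.startswith("+")
            and len(seq) == len(qual))


def keep(header):
    """Assumed filter rule for this file: keep headers ending in 1:N:0:."""
    return header.endswith("1:N:0:")


# Made-up sample data; the first quality line deliberately starts with '@'.
sample = io.StringIO(
    "@read1 1:N:0:\n"
    "ACGT\n"
    "+\n"
    "@@@D\n"
    "@read2 1:Y:0:\n"
    "TTCA\n"
    "+\n"
    "CCCF\n"
)

kept = [lines[0] for reads, lines in four_lines(sample)
        if looks_valid(lines) and keep(lines[0])]
print(kept)
```

Because the generator always consumes exactly four lines per group, a quality line that happens to begin with "@" cannot be mistaken for a header. That may be one reason this style of grouping behaves better than scanning for "@" line by line.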