On Jul 18, 2012, at 7:33 PM, Ryan Waples wrote:

> I'm seeing some unexpected output when I use a script (included at
> end) to iterate over large text files. I am unsure of the source of
> the unexpected output and any help would be much appreciated.
>
> Background
> Python v 2.7.1
> Windows 7 32bit
> Reading and writing to an external USB hard drive
>
> Data files are ~4GB text (.fastq) file, it has been uncompressed
> (gzip). This file has no errors or formatting problems, it seems to
> have uncompressed just fine. 64M lines, each 'entry' is split across
> 4 consecutive lines, 16M entries.
>
> My python script iterates over data files 4 lines at a time, selects
> and writes groups of four lines to the output file. I will end up
> selecting roughly 85% of the entries.
>
> In my output I am seeing lines that don't occur in the original file,
> and that don't match any lines in the original file. The incidences
> of badly formatted lines don't seem to match up with any patterns in
> the data file, and occur across multiple different data files.
>
> I've included 20 consecutive lines of input and output. Each of these
> 5 'records' should have been selected and printed to the output file.
> But there is a problem with the 4th and 5th entries in the output, and
> it no longer matches the input as expected. For example the line:
> TTCTGTGAGTGATTTCCTGCAAGACAGGAATGTCAGT
> never occurs in the original data.
>
> Sorry for the large block of text below.
> Other pertinent info, I've tried a related perl script, and ran into
> similar issues, but not in the same places.
>
> Any help or insight would be appreciated.
>
> Thanks
[Data and program snipped]

With apologies - I'm a Mac/UNIX user, not Windows, but those numbers (4GB and 64M lines) look suspiciously close to the file and record pointer limits of a 32-bit file system. Are you sure you aren't bumping into wraparound issues of some sort?

Just a thought…

-Bill
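
P.S. One quick way to test the wraparound hunch is to compare the file's size in bytes against the points where a 32-bit offset runs out. This is only a rough sketch, not a diagnosis: `boundaries_crossed` and `report` are names made up for illustration, and the path you pass in is whatever your .fastq file happens to be.

```python
import os

# Byte offsets at which 32-bit counters wrap around.
SIGNED_32 = 2 ** 31    # 2 GiB: limit of a signed 32-bit offset
UNSIGNED_32 = 2 ** 32  # 4 GiB: limit of an unsigned 32-bit offset


def boundaries_crossed(size_in_bytes):
    """Return the 32-bit boundaries that a file of this size reaches or exceeds."""
    crossed = []
    if size_in_bytes >= SIGNED_32:
        crossed.append("signed 32-bit (2 GiB)")
    if size_in_bytes >= UNSIGNED_32:
        crossed.append("unsigned 32-bit (4 GiB)")
    return crossed


def report(path):
    """Print the size of the file at `path` (hypothetical) and what it crosses."""
    size = os.path.getsize(path)
    crossed = boundaries_crossed(size)
    print("%s: %d bytes; crosses: %s" % (path, size, ", ".join(crossed) or "none"))
    return crossed
```

If the bad lines in the output first show up near one of those offsets, that would support the wraparound idea; if they are scattered well away from them, the cause is probably something else.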