On Jul 18, 2012, at 7:33 PM, Ryan Waples wrote:

> I'm seeing some unexpected output when I use a script (included at
> end) to iterate over large text files.  I am unsure of the source of
> the unexpected output and any help would be much appreciated.
> 
> Background
> Python v 2.7.1
> Windows 7 32bit
> Reading and writing to an external USB hard drive
> 
> Each data file is a ~4GB text (.fastq) file that has been uncompressed
> (gzip).  The file has no errors or formatting problems, and it seems to
> have uncompressed just fine.  It has 64M lines; each 'entry' is split
> across 4 consecutive lines, giving 16M entries.
> 
> My python script iterates over data files 4 lines at a time, selects
> and writes groups of four lines to the output file.  I will end up
> selecting roughly 85% of the entries.
> 
> In my output I am seeing lines that don't occur in the original file,
> and that don't match any lines in the original file.  The incidences
> of badly formatted lines don't seem to match up with any patterns in
> the data file, and occur across multiple different data files.
> 
> I've included 20 consecutive lines of input and output.  Each of these
> 5 'records' should have been selected and printed to the output file.
> But there is a problem with the 4th and 5th entries in the output, and
> it no longer matches the input as expected.  For example the line:
> TTCTGTGAGTGATTTCCTGCAAGACAGGAATGTCAGT
> never occurs in the original data.
> 
> Sorry for the large block of text below.
> Other pertinent info: I've tried a related Perl script and ran into
> similar issues, but not in the same places.
> 
> Any help or insight would be appreciated.
> 
> Thanks

[Data and program snipped]

With apologies - I'm a Mac/UNIX user, not Windows, but those numbers (4GB and
64M lines) look suspiciously close to the file and record pointer limits of a
32-bit file system.  Are you sure you aren't bumping into wraparound issues of
some sort?
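
To make the hunch concrete, here is a rough sketch of the arithmetic.  The
exact sizes are assumptions (the post only says "~4GB" and "64M lines"), and
wrapped_offset is a hypothetical helper written for illustration, not
something from the original script:

```python
# Limits of 32-bit offsets.
UINT32_LIMIT = 2 ** 32   # 4,294,967,296: first value an unsigned 32-bit offset cannot hold
INT32_LIMIT = 2 ** 31    # 2,147,483,648: first value a signed 32-bit offset cannot hold


def wrapped_offset(offset, bits=32):
    """Return the value an offset would have if truncated to `bits` bits."""
    return offset % (2 ** bits)


# Assumed figures: "~4GB" read as exactly 4 GiB, "64M lines" as 64 * 2**20.
approx_file_size = 4 * 1024 ** 3
approx_line_count = 64 * 1024 ** 2

# The byte size sits right at the unsigned 32-bit boundary.
print(approx_file_size >= UINT32_LIMIT)

# An offset 100 bytes past that boundary would wrap back to byte 100.
print(wrapped_offset(approx_file_size + 100))

# The line count is far below 2**32, so a line counter is not near the limit.
print(approx_line_count < INT32_LIMIT)
```

If anything is wrapping, then, it would more likely be a byte offset than a
line counter.  Comparing the file's real size (os.path.getsize) against
2**32 would show how close it actually is to the boundary.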

Just a thought…

-Bill