On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra wrote:
On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics wrote:

Sometimes FASTQ files are compressed to gz, bz2, or xz, as they are often huge.
Maybe we need to keep this in mind early in development and use std.zlib.

While working on making the parser compatible with multi-threading, I was able to separate the part that feeds data from the part that parses it.

Long story short, the parser now operates on an input range of ubyte[]: it is no longer responsible for acquiring data.

The range can be a simple (wrapped) File, a byChunk, an asynchronous file reader, a zip decompressor, or just stdin, I guess. The range can be transient.
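To give an idea of the shape this takes (illustrative signature only, not my actual parser's API), the parser can be templated on any input range of ubyte[] chunks:

import std.range : ElementType, isInputRange;

// Illustrative only: a parser templated on any input range of
// ubyte[] chunks, so it never touches the file system itself.
struct FastqParser(Range)
    if (isInputRange!Range && is(ElementType!Range : const(ubyte)[]))
{
    private Range input;

    this(Range input) { this.input = input; }

    // Parsing walks the chunks from `input`, copying what it needs
    // into local buffers, since the chunks may be transient.
}

// Convenience constructor with type inference.
auto fastqParser(Range)(Range input)
{
    return FastqParser!Range(input);
}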

However, now that you mention it, I'll make sure it is correctly supported.

I'll *try* to show you what I have so far tomorrow (in about 18 hours).

Yeah... I played around too much, and the file is dirtier than ever.

The good news is that I was able to test what I was telling you about: accepting any range works.

I plugged your ZFile range into my parser, and I can now parse zipped files directly.
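ZFile itself isn't reproduced here, but a minimal equivalent built on std.zlib.UnCompress could look like this (GzipChunks, the chunk size, and the details are my own sketch, not ZFile's actual API):

import std.range : isInputRange;
import std.stdio : File;
import std.zlib : HeaderFormat, UnCompress;

// Sketch: lazily decompress a gzip file into a range of ubyte[] chunks.
struct GzipChunks
{
    private File file;
    private UnCompress uncomp;
    private ubyte[] buffer;
    private ubyte[] current;
    private bool flushed;

    this(string path, size_t chunkSize = 64 * 1024)
    {
        file = File(path, "rb");
        uncomp = new UnCompress(HeaderFormat.gzip);
        buffer = new ubyte[chunkSize];
        popFront(); // prime the first decompressed chunk
    }

    @property bool empty() { return current.length == 0; }
    @property ubyte[] front() { return current; }

    void popFront()
    {
        current = null;
        while (current.length == 0 && !file.eof)
        {
            auto raw = file.rawRead(buffer);
            if (raw.length)
                current = cast(ubyte[]) uncomp.uncompress(raw.dup); // .dup: UnCompress may keep a slice
        }
        if (current.length == 0 && !flushed)
        {
            current = cast(ubyte[]) uncomp.flush(); // trailing buffered data
            flushed = true;
        }
    }
}

static assert(isInputRange!GzipChunks);

With something like that in place, parsing a zipped file is just a matter of feeding the range to the parser, e.g. fastqParser(GzipChunks("reads.fastq.gz")).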

The good news is that now I'm not bottlenecked by IO anymore! The bad news is that I'm now bottlenecked by the CPU doing the decompression. Since I'm using dmd, though, you may get better results with LDC or GDC.

In any case, I am now parsing the 6 GB file (packed down to 1.5 GB) in about 53 seconds (down from 61). I also tried a dual-threaded approach (one thread to unzip, one thread to parse), but again, the actual *parse* phase is so ridiculously fast that it changes *nothing* in the final result.
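The two-thread split I tried was roughly of this shape (a sketch with reconstructed names, not my actual test code): one thread unzips and sends immutable chunks, the owner thread parses.

import std.concurrency : ownerTid, receive, send, spawn;
import std.stdio : File;
import std.zlib : HeaderFormat, UnCompress;

// One thread decompresses, the owner parses (sketch only).
void unzipWorker(string path)
{
    auto file = File(path, "rb");
    auto uncomp = new UnCompress(HeaderFormat.gzip);
    auto buffer = new ubyte[64 * 1024];
    while (!file.eof)
    {
        auto raw = file.rawRead(buffer);
        if (raw.length)
        {
            // idup: message contents must not share mutable data
            auto chunk = (cast(const(ubyte)[]) uncomp.uncompress(raw.dup)).idup;
            if (chunk.length)
                ownerTid.send(chunk);
        }
    }
    ownerTid.send(true); // done sentinel
}

void main()
{
    spawn(&unzipWorker, "big.fastq.gz");
    for (bool done; !done; )
    {
        receive(
            (immutable(ubyte)[] chunk) { /* parse the chunk */ },
            (bool) { done = true; }
        );
    }
}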

Long story short: 99% of the time is spent acquiring data; the remaining 1% is just copying it into local buffers.

The final bit of good news is that a CPU bottleneck is always better than an IO bottleneck. If you have multiple cores, you should be able to run multiple *instances* (not threads) and process several files at once, multiplying your throughput.
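For example (a sketch; the fastq_parser binary name is assumed), a tiny D driver can spawn one process per file and wait for all of them:

import std.algorithm : endsWith;
import std.file : SpanMode, dirEntries;
import std.process : Pid, spawnProcess, wait;

// Launch one parser process per .fastq.gz file in the current directory.
void main()
{
    Pid[] pids;
    foreach (entry; dirEntries(".", SpanMode.shallow))
        if (entry.name.endsWith(".fastq.gz"))
            pids ~= spawnProcess(["./fastq_parser", entry.name]);
    foreach (pid; pids)
        wait(pid); // block until every instance finishes
}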
