Olaf writes:
> your program looks almost exactly how I'd write it, except for the
> foldl' Christian mentioned.
Nice to hear! It is very simple, as you say, so maybe that's also why
I'm not that far off.
> I also doubt that the Haskell program can really outperform a
> well-written C program on such a simple task.
I agree. But the C-program I am taking on, as it were, is not really
well-written. For one thing, it does malloc()/free() for every line.
(Oh, and it doesn't handle big numbers, it overflows without detecting
it :-))
So I am cheating, by pitting my program, with its probably quite
well-written runtime, against a more-or-less naïve C implementation.
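To illustrate the malloc()/free() point: a line-oriented reader can reuse
one fixed buffer for the whole file instead of allocating per line. This is
only a sketch, not the actual fastqstats source. The name fastq_stats and
its parameters are hypothetical, and it assumes strict four-line FASTQ
records (header, sequence, "+", quality) and 64-bit counters.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count records and sum sequence lengths.
   Assumes every record is exactly four lines and the second is the sequence.
   Returns 0 on success, -1 on a read error. */
int fastq_stats(FILE *in, uint64_t *records, uint64_t *total_len)
{
    char buf[65536];          /* one buffer, reused for every read */
    uint64_t lineno = 0;      /* zero-based index of the current line */
    int line_start = 1;       /* does the next chunk begin a new line? */

    *records = 0;
    *total_len = 0;
    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t len = strlen(buf);
        int ends_line = (len > 0 && buf[len - 1] == '\n');

        if (lineno % 4 == 1) {            /* sequence line of a record */
            if (line_start)
                (*records)++;             /* count each record once */
            /* lines longer than the buffer arrive in several chunks */
            *total_len += len - (size_t)ends_line;
        }
        line_start = ends_line;
        if (ends_line)
            lineno++;
    }
    return ferror(in) ? -1 : 0;
}
```

Calling it as fastq_stats(stdin, &records, &total) would be one way to use
it. The 64-bit counters are a deliberate choice so that totals in the
billions do not wrap.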
When the time is dominated by disk-access, the timings are very close (C
first, then Haskell):
$ for f in small_29M.fastq large_5G.fastq huge_33G.fastq; do time fastqstats $f; done
Count 199957
Total 199957 records 9997850 length 50 average
real 0m0.129s
user 0m0.098s
sys 0m0.000s
Count 10085674
Total 10085674 records -1893163715 length -187.708 average
real 0m19.975s
user 0m8.335s
sys 0m1.841s
Count 63074335
Total 63074335 records -143886218 length -2.28122 average
real 2m7.448s
user 0m56.549s
sys 0m10.825s
$ for f in small_29M.fastq large_5G.fastq huge_33G.fastq; do time hfastqstats $f; done
Count 199957
Total 199957 records 9997850 length 50.0 average
real 0m0.120s
user 0m0.048s
sys 0m0.015s
Count 10085674
Total 10085674 records 2401803581 length 238.1401 average
real 0m19.911s
user 0m4.276s
sys 0m2.120s
Count 63074335
Total 63074335 records 12741015670 length 202.0 average
real 2m11.627s
user 0m31.264s
sys 0m13.468s
$
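The negative totals in the C output are what a 32-bit signed counter would
show: 2401803581 - 2^32 = -1893163715 and 12741015670 - 3*2^32 = -143886218,
which line up exactly with the Haskell totals. Below is a small sketch of
that wraparound, plus a checked addition that would catch it. It assumes
(the source is not shown here) that fastqstats accumulates in a 32-bit int;
wrap_to_int32 and add_checked are hypothetical helper names.

```c
#include <stdint.h>

/* Hypothetical helper: the value a 64-bit total takes when only its low
   32 bits are kept and read as a two's-complement signed number. */
int32_t wrap_to_int32(int64_t total)
{
    uint32_t low = (uint32_t)total;   /* well-defined: reduces modulo 2^32 */

    /* Map the upper half of the unsigned range onto negative values. */
    if (low > (uint32_t)INT32_MAX)
        return (int32_t)((int64_t)low - 4294967296LL);
    return (int32_t)low;
}

/* Hypothetical helper: add len to *sum only if the result fits.
   Returns 1 on success, 0 (leaving *sum unchanged) on would-be overflow. */
int add_checked(int32_t *sum, int32_t len)
{
    if (len > 0 && *sum > INT32_MAX - len)
        return 0;
    if (len < 0 && *sum < INT32_MIN - len)
        return 0;
    *sum += len;
    return 1;
}
```

Switching the accumulator to int64_t would be the simpler fix; the checked
variant is only there to show how the overflow could be detected.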
So what happens when the disk-cache is hot?
I only have 16 GB RAM in my desktop, so I'll exclude the 33 GB file, and
run the two programs a number of times. After 10 runs of each, I get
these numbers (C first again, then Haskell):
11
fastqstats
Count 199957
Total 199957 records 9997850 length 50 average
real 0m0.097s
user 0m0.097s
sys 0m0.000s
Count 10085674
Total 10085674 records -1893163715 length -187.708 average
real 0m8.681s
user 0m7.979s
sys 0m0.696s
hfastqstats
Count 199957
Total 199957 records 9997850 length 50.0 average
real 0m0.066s
user 0m0.062s
sys 0m0.004s
Count 10085674
Total 10085674 records 2401803581 length 238.1401 average
real 0m3.904s
user 0m3.212s
sys 0m0.688s
$
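which is kind of fun: with a hot cache, the Haskell version is more than
twice as fast on the 5 GB file (3.9 s versus 8.7 s).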
> In my eyes, the strength of Haskell is hidden in the readIllumina
> function: Bioinformatics is 50% parsing and converting text formats.
That's also why I like BioPerl a lot - someone else did the parsing for
me :-)
Thanks for the comments.
Best regards,
Adam
--
"No more than that, but very powerful all the same;
simple things are good."
Adam Sjøgren, [email protected]