Felipe Lessa <felipe.le...@gmail.com> writes:

[-snip- I've already spent too much time on the other stuff :-]

> And what do you think about creating a real SeqData data type
> with two bases per byte?  In terms of processing speed I guess
> there will be a small penalty, but if you need to have large
> quantities of base pairs in memory this would double your
> capacity =).

Yes, this is interesting in some cases.  Obvious downsides would be a
separate data type for protein sequences (20 characters, plus some
wildcards), and more complicated string comparison (when a match is off
by one base, the two sequences no longer line up on byte boundaries).
Oh, and lower case is sometimes used to signify less "important"
regions, like repeats.
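
To make the two-bases-per-byte idea concrete, here is a minimal sketch of
what such a type could look like.  Everything in it (Seq4, pack4, index4,
unpack4, and the particular 4-bit code) is made up for illustration and is
not part of Bio; it only assumes the bytestring package.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (toUpper)
import Data.Word (Word8)

-- Hypothetical 4-bit code: A, C, G, T map to 0..3, anything else to N (4).
-- Case is folded away here, so lower-case masking is lost.
encode4 :: Char -> Word8
encode4 c = case toUpper c of
  'A' -> 0
  'C' -> 1
  'G' -> 2
  'T' -> 3
  _   -> 4

decode4 :: Word8 -> Char
decode4 w = case w of
  0 -> 'A'
  1 -> 'C'
  2 -> 'G'
  3 -> 'T'
  _ -> 'N'

-- Hypothetical packed sequence.  The length is stored explicitly because
-- an odd-length sequence leaves an unused low nibble in the last byte.
data Seq4 = Seq4 { seqLength :: Int, seqBytes :: B.ByteString }
  deriving (Eq, Show)

-- Pack two bases per byte: first base in the high nibble, second in the low.
pack4 :: String -> Seq4
pack4 s = Seq4 (length s) (B.pack (go (map encode4 s)))
  where
    go (a:b:rest) = ((a `shiftL` 4) .|. b) : go rest
    go [a]        = [a `shiftL` 4]
    go []         = []

-- Look up the base at position i (0-based, assumed to be in range).
index4 :: Seq4 -> Int -> Char
index4 (Seq4 _ bs) i =
  let byte = B.index bs (i `div` 2)
  in decode4 (if even i then byte `shiftR` 4 else byte .&. 0x0F)

unpack4 :: Seq4 -> String
unpack4 s = [ index4 s i | i <- [0 .. seqLength s - 1] ]
```

Note that a substring starting at an odd position does not begin on a byte
boundary, so comparing it against another packed sequence needs shifting
rather than a plain byte-wise comparison.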

Another choice is the 2bit format (used by BLAT, and supported in Bio
for input/output, but not internally), which stores the alphabet proper
directly in 2-bit quantities, and uses separate lists for gaps, lower
case masking, and Ns (and is obviously extensible to wildcards).  Too
much extending, and you're likely to lose any benefit, though.
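
As a rough in-memory sketch of that layout (not a reader for the actual
file format): the TwoBit record, its field names, and the helper functions
below are all hypothetical.  The code assignment T=0, C=1, A=2, G=3 with
the first base in the high bits is what I understand the 2bit format to
use, but treat it as an assumption.  Anything outside ACGT is recorded as
an N block here, so wildcards are not preserved.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (isLower, toLower, toUpper)
import Data.List (foldl')
import Data.Word (Word8)

-- A half-open interval [start, end) of sequence positions.
type Block = (Int, Int)

data TwoBit = TwoBit
  { tbLength     :: Int
  , tbBases      :: [Word8]  -- four bases per byte, first base in the high bits
  , tbNBlocks    :: [Block]  -- runs of N (here: anything that is not A, C, G, T)
  , tbMaskBlocks :: [Block]  -- runs of lower-case (masked) positions
  } deriving (Eq, Show)

-- Maximal runs of positions whose character satisfies the predicate.
runs :: (Char -> Bool) -> String -> [Block]
runs p = go 0
  where
    go _ [] = []
    go i xs@(x:rest)
      | p x       = let n = length (takeWhile p xs)
                    in (i, i + n) : go (i + n) (drop n xs)
      | otherwise = go (i + 1) rest

code2 :: Char -> Word8
code2 c = case toUpper c of
  'T' -> 0
  'C' -> 1
  'A' -> 2
  'G' -> 3
  _   -> 0  -- placeholder bits; the real identity lives in tbNBlocks

-- Pack four 2-bit codes per byte, padding the last byte with zeros.
packBytes :: [Word8] -> [Word8]
packBytes [] = []
packBytes ws =
  let (chunk, rest) = splitAt 4 ws
      padded        = take 4 (chunk ++ repeat 0)
  in foldl' (\acc w -> (acc `shiftL` 2) .|. w) 0 padded : packBytes rest

isN :: Char -> Bool
isN c = toUpper c `notElem` "ACGT"

toTwoBit :: String -> TwoBit
toTwoBit s = TwoBit
  { tbLength     = length s
  , tbBases      = packBytes (map code2 s)
  , tbNBlocks    = runs isN s
  , tbMaskBlocks = runs isLower s
  }

inBlocks :: [Block] -> Int -> Bool
inBlocks bs i = any (\(a, b) -> a <= i && i < b) bs

-- Rebuild the text: 2-bit code first, then N blocks, then case masking.
fromTwoBit :: TwoBit -> String
fromTwoBit tb = [ base i | i <- [0 .. tbLength tb - 1] ]
  where
    base i =
      let byte = tbBases tb !! (i `div` 4)
          bits = (byte `shiftR` (6 - 2 * (i `mod` 4))) .&. 3
          c | inBlocks (tbNBlocks tb) i = 'N'
            | otherwise                 = "TCAG" !! fromIntegral bits
      in if inBlocks (tbMaskBlocks tb) i then toLower c else c
```

Every extra convention adds another block list to consult on each lookup,
which is where the benefit starts to erode.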

Basically, it boils down to a set of different tradeoffs, and I think
ByteString is a fairly good choice in *most* cases, and it deals, if
not particularly elegantly, then at least fairly effectively, with
various conventions, like lower-casing or wildcards.
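
For comparison, a small sketch of how those conventions can be handled on
a plain ByteString.  The function names (sameBases, maskedFraction,
hardMask) are invented for this example and are not Bio functions; only
standard Data.ByteString.Char8 calls are used.

```haskell
import qualified Data.ByteString.Char8 as BC
import Data.Char (isLower, toUpper)

-- Compare two sequences while ignoring lower-case masking.
sameBases :: BC.ByteString -> BC.ByteString -> Bool
sameBases a b = BC.map toUpper a == BC.map toUpper b

-- Fraction of positions that are lower-cased (soft-masked).
maskedFraction :: BC.ByteString -> Double
maskedFraction s
  | BC.null s = 0
  | otherwise = fromIntegral (BC.length (BC.filter isLower s))
              / fromIntegral (BC.length s)

-- Turn soft masking into hard masking: lower-case bases become N.
hardMask :: BC.ByteString -> BC.ByteString
hardMask = BC.map (\c -> if isLower c then 'N' else c)
```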

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
