On Fri, Dec 5, 2014 at 8:48 PM, David Koslicki <dmkosli...@gmail.com> wrote:
> Thanks for the suggestions: > > On Friday, December 5, 2014 5:40:57 PM UTC-8, Jason Merrill wrote: >> >> This is the best you can do if >> >> 1. Every input in your space of possibilities is equally likely, and >> 2. You need to remember every input that you've seen >> 3. You need to know the order you saw them in >> >> If you just need to know "have I seen this input before," and you can >> accept some false positives, then bloom filters are a good choice, as you >> suggest, and John has an implementation of these in Julia: >> https://github.com/johnmyleswhite/BloomFilters.jl >> >> I've tried using this, but it isn't quite "prime time ready" ( > https://github.com/johnmyleswhite/BloomFilters.jl/issues/7) > Implementing your own Bloom filter really shouldn't be too hard. Alternatively, it might not be too hard to file some issues against John's package and get it into better working state. If you mention me in an issue, I can also take a look at things. Having the BloomFilters package working would probably be a good thing. > See also the BioJulia organization. Looks like they already have an >> implementation of the 2-bits per base pair idea: https://github.com/ >> BioJulia/BioSeq.jl >> > This is currently the approach I'm taking (with my own implementation). > One issue with this may be the overhead per sequence. I think this package is really designed for longer sequences since it uses array objects to store sequences – a perfectly reasonable thing to do. There are a few ways to avoid this kind of overhead. One would be to use a custom type something like this: immutable ShortNucleotideSeq bits::Uint128 end You can store an array of these with no overhead per item – it's just data. You may want to use the first byte to store the length or something like that. Another approach would be to use a single long BioSeq "string" and just implicitly use it to store lots of little sequences. You could just pick some maximum number of nucleotides per sequence and assume that each one is of that fixed length. Alternatively, you could store an array of offsets.