On Fri, Dec 5, 2014 at 8:48 PM, David Koslicki <dmkosli...@gmail.com> wrote:

> Thanks for the suggestions:
>
> On Friday, December 5, 2014 5:40:57 PM UTC-8, Jason Merrill wrote:
>>
>> This is the best you can do if
>>
>> 1. Every input in your space of possibilities is equally likely, and
>> 2. You need to remember every input that you've seen
>> 3. You need to know the order you saw them in
>>
>> If you just need to know "have I seen this input before," and you can
>> accept some false positives, then bloom filters are a good choice, as you
>> suggest, and John has an implementation of these in Julia:
>> https://github.com/johnmyleswhite/BloomFilters.jl
>>
>> I've tried using this, but it isn't quite "prime time ready" (
> https://github.com/johnmyleswhite/BloomFilters.jl/issues/7)
>

Implementing your own Bloom filter really shouldn't be too hard.
Alternatively, it might not be too hard to file some issues against John's
package and get it into better working state. If you mention me in an
issue, I can also take a look at things. Having the BloomFilters package
working would probably be a good thing.


> See also the BioJulia organization. Looks like they already have an
>> implementation of the 2-bits per base pair idea: https://github.com/
>> BioJulia/BioSeq.jl
>>
> This is currently the approach I'm taking (with my own implementation).
>

One issue with this may be the overhead per sequence. I think this package
is really designed for longer sequences since it uses array objects to
store sequences – a perfectly reasonable thing to do. There are a few ways
to avoid this kind of overhead. One would be to use a custom type something
like this:

immutable ShortNucleotideSeq
    bits::Uint128
end


You can store an array of these with no overhead per item – it's just data.
You may want to use the first byte to store the length or something like
that. Another approach would be to use a single long BioSeq "string" and
just implicitly use it to store lots of little sequences. You could just
pick some maximum number of nucleotides per sequence and assume that each
one is of that fixed length. Alternatively, you could store an array of
offsets.

Reply via email to