This is the best you can do if

1. Every input in your space of possibilities is equally likely, and
2. You need to remember every input that you've seen
3. You need to know the order you saw them in

If you just need to know "have I seen this input before," and you can 
accept some false positives, then bloom filters are a good choice, as you 
suggest, and John has an implementation of these in 
Julia: https://github.com/johnmyleswhite/BloomFilters.jl

If you just need to know "how many different inputs have I seen," then the 
HyperLogLog is another semi-miraculous data structure for doing that, and 
John also has an implementation of these in 
Julia: https://github.com/johnmyleswhite/HyperLogLog.jl

As you can see, John's been busy :-)

But probably there are other things you want to query about your data. 
There might be good answers to some of those questions too if you ask them. 
In general, you might want to read up on Streaming 
Algorithms: http://en.wikipedia.org/wiki/Streaming_algorithm

See also the BioJulia organization. Looks like they already have an 
implementation of the 2-bits per base pair 
idea: https://github.com/BioJulia/BioSeq.jl

On Friday, December 5, 2014 5:25:04 PM UTC-8, David Koslicki wrote:
>
> Good suggestion, but I've tried that already, and besides the fact that 
> the HDF5 package (https://github.com/timholy/HDF5.jl) doesn't yet support 
> Int128, this would result in file sizes upwards of 750Gb (too large for my 
> purposes).
>
> On Friday, December 5, 2014 5:19:00 PM UTC-8, Jason Merrill wrote:
>>
>> Here's one possibility:
>>
>> Interpret A, C, T, G as two bit integers, i.e. A=00, C=01, T=10, G=11. A 
>> string of up to 50 of these has 2*50=100 bits, so you could store any such 
>> string as a unique Int128.
>>
>

Reply via email to