This is the best you can do if 1. Every input in your space of possibilities is equally likely, and 2. You need to remember every input that you've seen 3. You need to know the order you saw them in
If you just need to know "have I seen this input before," and you can accept some false positives, then bloom filters are a good choice, as you suggest, and John has an implementation of these in Julia: https://github.com/johnmyleswhite/BloomFilters.jl If you just need to know "how many different inputs have I seen," then the HyperLogLog is another semi-miraculous data structure for doing that, and John also has an implementation of these in Julia: https://github.com/johnmyleswhite/HyperLogLog.jl As you can see, John's been busy :-) But probably there are other things you want to query about your data. There might be good answers to some of those questions too if you ask them. In general, you might want to read up on Streaming Algorithms: http://en.wikipedia.org/wiki/Streaming_algorithm See also the BioJulia organization. Looks like they already have an implementation of the 2-bits per base pair idea: https://github.com/BioJulia/BioSeq.jl On Friday, December 5, 2014 5:25:04 PM UTC-8, David Koslicki wrote: > > Good suggestion, but I've tried that already, and besides the fact that > the HDF5 package (https://github.com/timholy/HDF5.jl) doesn't yet support > Int128, this would result in file sizes upwards of 750Gb (too large for my > purposes). > > On Friday, December 5, 2014 5:19:00 PM UTC-8, Jason Merrill wrote: >> >> Here's one possibility: >> >> Interpret A, C, T, G as two bit integers, i.e. A=00, C=01, T=10, G=11. A >> string of up to 50 of these has 2*50=100 bits, so you could store any such >> string as a unique Int128. >> >