I varied the ms increment randomly between 1-20, then created 10 mil dates. The FST was then 58,481,582 bytes, eg, 57 MB. Guess it's not perfect! 19,739,994 bytes, eg, 18.8 MB for random 1-5 increments. I think that's still pretty good. I need to try varying the long value stored alongside to see how much that affects the total size.
How many keys are there for a typical region, and total per region server? One could try MMap'ing the FST however because access to it is completely random, there's a high risk of page faults. If a key set is sequentially incremented by 1, then that's clearly optimal. I'd guess that's a fairly common case though? > How would you search a 'key' in the FST? Searching on a key is via BytesRefFSTEnum.seekCeil method, which'll match anything greater than or equal to the given parameter. Clearly more benchmarking is required, however this would seem to be a highly promising avenue of research for HBase. On Fri, Jun 3, 2011 at 10:57 PM, Stack <st...@duboce.net> wrote: > That can't be true? (smile) How would you search a 'key' in the FST? > St.Ack > > On Fri, Jun 3, 2011 at 9:01 PM, Jason Rutherglen > <jason.rutherg...@gmail.com> wrote: >> Yeah it's truly super wild! Here's the code: http://pastebin.com/bnB53UQz >> >> You can see the line that's adding the string: >> >> fstBuilder.add(new BytesRef(date), new Long(x)); >> >> On Fri, Jun 3, 2011 at 8:56 PM, Matt Corgan <mcor...@hotpads.com> wrote: >>> Jason - are you feeding it that whole string for each date? Input data is >>> 17 bytes per record * 50mm records = 850MB, and that reduces to 984 bytes? >>> Is it possible to compress by that much? Maybe I'm missing something about >>> how the FST works. >>> >>> Matt >>> >>> >>> On Fri, Jun 3, 2011 at 8:51 PM, Jason Rutherglen <jason.rutherg...@gmail.com >>>> wrote: >>> >>>> Also the next thing to measure with the FST is the key lookup speed. >>>> I'm not sure what that'd look like, or how to compare with HBase right >>>> now? >>>> >>>> On Fri, Jun 3, 2011 at 8:42 PM, Jason Rutherglen >>>> <jason.rutherg...@gmail.com> wrote: >>>> > Here's a nice preliminary number with the FST, 50 million dates of the >>>> > form yyyyMMddHHmmssSSS, with each incremented by one millisecond. The >>>> > FST is 984 bytes, with an incrementing long to point to the presumably >>>> > MMap'd value data. This's a bit crazy. >>>> > >>>> > Perhaps we should try other increments as well? Given that HBase keys >>>> > especially are probably close increments of each other, I think the >>>> > FST can always be loaded into RAM with pointers out to the actual >>>> > values. >>>> > >>>> >>> >> >