Wow, 17 replies to my email overnight! This is clearly an interesting topic to folks.
Hi Dawid. Sadly, I won't be at Lucene Revolution next week. That's where all the cool kids will be; I'll be home and be square. I made it to O'Reilly Strata in February (a great conference) and I'll be presenting at Basis's "Open Source Search Conference" (government customer focused) mid-June. I've used up my conference budget for this fiscal year. Yes, the use-case here is a unique integer reference to a String that can be looked up fairly quickly, whereas the set of all strings are in a compressed data structure that won't change after its built. A bonus benefit would be that this integer is a sortable substitute for the String. Your observation of this integer being a perfect-hash is astute. I wonder if Lucene could store this FST on-disk for the bytes in a segment instead of what it's doing now? Read-time construction would be super-fast, though for multi-segment indexes, I suppose they'd need to be merged. I expect that this use-case would be particularly useful for cases when you know that the set of strings tends to have a great deal of prefixes in common, such as when EdgeNGramming (applications: query-complete, hierarchical faceting, prefix/tree based geospatial indexing). ~ David Smiley Dawid Weiss wrote: > > Hi David, > >> but with less memory. As I understand it, FSTs are a highly compressed >> representation of a set of Strings (among other possibilities). The > > Yep. Not only, but this is one of the use cases. Will you be at Lucene > Revolution next week? I'll be talking about it there. > >> representation of a set of Strings (among other possibilities). The >> fieldCache would need to point to an FST entry (an "arc"?) using >> something >> small, say an integer. Is there a way to point to an FST entry with an >> integer, and then somehow with relative efficiency construct the String >> from >> the arcs to get there? > > Correct me if my understanding is wrong: you'd like to assign a unique > integer to each String and then retrieve it by this integer (something > like a > Map<Integer, String>)? This would be something called perfect > hashing > and this can be done on top of an automaton (fairly easily). I assume > the data structure is immutable once constructed and does not change > too often, right? > > Dawid > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > ----- Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/FST-and-FieldCache-tp2960030p2961954.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org