Wow, 17 replies to my email overnight! This is clearly an interesting topic
to folks.

Hi Dawid.
Sadly, I won't be at Lucene Revolution next week. That's where all the cool
kids will be; I'll be home and be square. I made it to O'Reilly Strata in
February (a great conference) and I'll be presenting at Basis's "Open Source
Search Conference" (government customer focused) mid-June.  I've used up my
conference budget for this fiscal year.

Yes, the use-case here is a unique integer reference to a String that can be
looked up fairly quickly, while the set of all strings is stored in a
compressed data structure that won't change after it's built. A bonus benefit
would be that this integer is a sortable substitute for the String.  Your
observation that this integer amounts to a perfect hash is astute.
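
To make the idea concrete, here is a minimal sketch of what I have in mind.
It is NOT Lucene's FST API; it is a plain, unminimized trie, and the class and
method names (OrdinalTrie, ordinalOf, stringOf) are made up for illustration.
Each node stores how many strings end at or below it, which is enough to map a
String to its rank in sorted order and back again:

```java
import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: a trie that numbers its strings by sorted rank. */
final class OrdinalTrie {
    private static final class Node {
        // TreeMap keeps children in char order, which matches String.compareTo.
        final TreeMap<Character, Node> children = new TreeMap<>();
        boolean terminal; // true if a string ends exactly at this node
        int count;        // number of strings ending at or below this node
    }

    private final Node root = new Node();

    /** Builds the trie once; there is no way to modify it afterwards. */
    OrdinalTrie(Collection<String> strings) {
        for (String s : strings) {
            insert(s);
        }
    }

    private void insert(String s) {
        if (ordinalOf(s) >= 0) {
            return; // ignore duplicates so the counts stay exact
        }
        Node n = root;
        n.count++;
        for (char c : s.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
            n.count++;
        }
        n.terminal = true;
    }

    /** Returns the rank of s in sorted order, or -1 if s is not in the set. */
    int ordinalOf(String s) {
        Node n = root;
        int ord = 0;
        for (char c : s.toCharArray()) {
            if (n.terminal) {
                ord++; // a proper prefix of s is in the set and sorts first
            }
            Node next = n.children.get(c);
            if (next == null) {
                return -1;
            }
            // Every string under a smaller sibling sorts before s.
            for (Node sibling : n.children.headMap(c).values()) {
                ord += sibling.count;
            }
            n = next;
        }
        return n.terminal ? ord : -1;
    }

    /** Inverse of ordinalOf: rebuilds the String from its integer. */
    String stringOf(int ord) {
        if (ord < 0 || ord >= root.count) {
            throw new IndexOutOfBoundsException("ordinal: " + ord);
        }
        StringBuilder sb = new StringBuilder();
        Node n = root;
        while (true) {
            if (n.terminal) {
                if (ord == 0) {
                    return sb.toString();
                }
                ord--; // skip the string that ends at this node
            }
            for (Map.Entry<Character, Node> e : n.children.entrySet()) {
                Node child = e.getValue();
                if (ord < child.count) {
                    sb.append(e.getKey());
                    n = child;
                    break; // descend into this child
                }
                ord -= child.count; // skip this whole subtree
            }
        }
    }
}
```

Because the integer is the rank in sorted order, comparing two integers gives
the same result as comparing the two Strings. A real FST would also share
suffixes, which this sketch does not attempt.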

I wonder if Lucene could store this FST on-disk for the bytes in a segment
instead of what it's doing now? Read-time construction would be super-fast,
though for multi-segment indexes, I suppose the per-segment FSTs would need
to be merged.

I expect that this would be particularly useful when you know that the set
of strings shares many common prefixes, such as when EdgeNGramming
(applications: query completion, hierarchical faceting, prefix/tree-based
geospatial indexing).
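
As a rough illustration of why shared prefixes matter, the sketch below
generates the edge n-grams of a term and compares the total number of
characters across all grams with the number of nodes a plain trie needs to
hold them. The names (EdgeNGramSharing, edgeNGrams, measure) are hypothetical
and this has nothing to do with Lucene's own EdgeNGram filter code; it only
mimics what that kind of tokenization produces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: how much a trie shares among edge n-grams. */
final class EdgeNGramSharing {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    /** Returns every prefix of term whose length is in [minGram, term.length()]. */
    static List<String> edgeNGrams(String term, int minGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= term.length(); len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    /**
     * Inserts the strings into a trie and returns a two-element array:
     * [0] = total characters across all strings,
     * [1] = trie nodes created (not counting the root).
     */
    static int[] measure(List<String> strings) {
        Node root = new Node();
        int totalChars = 0;
        int nodes = 0;
        for (String s : strings) {
            totalChars += s.length();
            Node n = root;
            for (char c : s.toCharArray()) {
                Node next = n.children.get(c);
                if (next == null) {
                    next = new Node();
                    n.children.put(c, next);
                    nodes++; // only characters not already shared cost a node
                }
                n = next;
            }
        }
        return new int[] {totalChars, nodes};
    }
}
```

For the single term "lucene" with minGram 1, the six grams total 21 characters
but fit in 6 trie nodes, since every gram is a prefix of the next one.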

~ David Smiley


Dawid Weiss wrote:
> 
> Hi David,
> 
>> but with less memory.  As I understand it, FSTs are a highly compressed
>> representation of a set of Strings (among other possibilities).  The
> 
> Yep. Not only, but this is one of the use cases. Will you be at Lucene
> Revolution next week? I'll be talking about it there.
> 
>> representation of a set of Strings (among other possibilities).  The
>> fieldCache would need to point to an FST entry (an "arc"?) using something
>> small, say an integer.  Is there a way to point to an FST entry with an
>> integer, and then somehow with relative efficiency construct the String
>> from the arcs to get there?
> 
> Correct me if my understanding is wrong: you'd like to assign a unique
> integer to each String and then retrieve it by this integer (something
> like a Map<Integer, String>)? This would be something called perfect
> hashing and this can be done on top of an automaton (fairly easily). I
> assume the data structure is immutable once constructed and does not
> change too often, right?
> 
> Dawid
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 

-----
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book