[ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966510#action_12966510 ]
Robert Muir commented on LUCENE-2792: ------------------------------------- This sounds awesome! It would be neat to write intersect(FST, Automaton), and somehow optionally use it if available. Such a thing might be cleaner now that MTQ.getTermsEnum takes a Terms structure > Add a simple FST impl to Lucene > ------------------------------- > > Key: LUCENE-2792 > URL: https://issues.apache.org/jira/browse/LUCENE-2792 > Project: Lucene - Java > Issue Type: New Feature > Components: Index > Reporter: Michael McCandless > Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: FSTExample.png, LUCENE-2792.patch > > > I implemented the algo described at > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698 for > incrementally building a finite state transducer (FST) from sorted > inputs. > This is not a fully general FST impl -- it's only able to build up an > FST incrementally from input/output pairs that are pre-sorted. > Currently the inputs are BytesRefs, and the outputs are pluggable -- > NoOutputs gets you a simple FSA, PositiveIntOutputs maps to a long, > ByteSequenceOutput maps to a BytesRef. > The implementation has a low memory overhead, so that it can handle a > fairly large set of terms. For example, it can build the FSA for the > 9.8M terms from a 10M document wikipedia index in ~8 seconds (on > beast), using ~256 MB peak RAM, resulting in an FSA that's ~60 MB. > It packs the FST as-it-builds into a compact byte[], and then exposes > the API to read nodes/arcs directly from the byte[]. The FST can be > quickly saved/loaded to/from a Directory since it's just a big byte[]. > The format is similar to what Morfologik uses > (http://sourceforge.net/projects/morfologik/). > I think there are a number of possible places we can use this in > Lucene. For example, I think many apps could hold the entire terms > dict in RAM, either at the multi-reader level or maybe per-segment > (mapping to file offset or to something else custom to the app), which > may possibly be a good speedup for certain MTQs (though, because the > format is packed into a byte[], there is a decode cost when visiting > arcs). > The builder can also prune as it goes, so you get a prefix trie pruned > according to how many terms run through the nodes, which makes it > faster and even less memory consuming. This may be useful as a > replacement for our current binary search terms index since it can > achieve higher term density for the same RAM consumption of our > current index. > As an initial usage to make sure this is exercised, I cutover the > SimpleText codec, which currently fully loads all terms into a > TreeMap (and has caused intermittent OOME in some tests), to use an FST > instead. SimpleText uses a PairOutputs which is able to "pair up" any > two other outputs, since it needs to map each input term to an int > docFreq and long filePosition. > All tests pass w/ SimpleText forced codec, and I think this is > committable except I'd love to get some help w/ the generics > (confession to the policeman: I had to add > @SuppressWarnings({"unchecked"})) all over!! Ideally an FST is > parameterized by its output type (Integer, BytesRef, etc.). > I even added a new @nightly test that makes a largeish set of random > terms and tests the resulting FST on different outputs :) > I think it would also be easy to make a variant that uses char[] > instead of byte[] as its inputs, so we could eg use this during analysis > (Robert's idea). It's already be easy to have a CharSequence > output type since the outputs are pluggable. > Dawid Weiss (author of HPPC -- http://labs.carrotsearch.com/hppc.html -- and > Morfologik -- http://sourceforge.net/projects/morfologik/) > was very helpful iterating with me on this (thank you!). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org