>> I think, if we add ord as an output to the FST, then it builds >> everything we need? Ie no further data structures should be needed? >> Maybe I'm confused :) > > If you put the ord as an output the common part will be shifted towards the > front of the tree. This will work if you want to look up a given value > assigned to some string, but will not work if you need to look up the string > from its value. The latter case can be solved if you "know" which branch to > take while descending from root and the "shared prefix" alone won't give you > this information. At least I don't see how it could. > > I am familiar with the basic prefix hashing procedure suggested by Daciuk > (and other authors), but maybe some progress has been made there, I don't > know... the one I know is really conceptually simple -- since each arc > encodes the number of leaves (or input sequences) in the automaton, you know > which path must lead you to your string. For example if you have a node like > this and seek for the 12-th term: > > 0 -- 10 -- ... > +- 10 -- ... > +- 5 -- .. > you look at the first path, it'd give you terms 1..10, then the next one > contains terms 11..20 so you add 10 to an internal counter which is added to > further computations, descend and repeat the procedure until you find a leaf > node. > > Dawid
There's a possible speedup here. If, instead of storing the count of all downstream leaves, you store the sum of counts for all previous siblings, you can do a binary lookup instead of linear scan on each node. Taking your example: 0 -- 0 -- ... +- 10 -- ... We know that for 12-th term we should descend along this edge, as it has the biggest tag less than 12. +- 15 -- ... That's what I invented, and yes, it was invented by countless people before :) -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org