On Mon, Jun 21, 2010 at 3:16 PM, eks dev <eks...@yahoo.co.uk> wrote:

> ok, that explains it, but I didn't expect it, considering small size of the
> library.
>

Well, it's not that small. For example, the original brics jar is 170KB. Our
minimal use takes up significantly less space (I don't remember exactly, I
think 30-40KB-ish). I am not sure adding 100+KB of untested*, unused code to
the lucene core jar would go over very well :)

* see below

> i would even argue it makes sense to keep some (all?) of these methods,
> especially if intended use of the Automaton code gets expanded to Analyzer
> chains. This particular method has usage in our code for optimizing matching
> based on minimum possible length that can get accepted.
>

I tend to agree with you, but there is some complexity:

1. brics automaton doesn't have a unit testing package (this would really be a
   nice contribution to the brics package, by the way)
2. our automaton package is not simply a slimmed-down version; there are
   important differences... (two are below)

state machine representation:
* brics automaton uses a utf-16 transition representation (Automaton) and a
  utf-16 tableized matcher (RunAutomaton)
* lucene's automaton uses a utf-32 transition representation (Automaton) and
  both utf-8 (ByteRunAutomaton) and utf-32 (CharacterRunAutomaton) tableized
  matchers.

This gives us better unicode support, but also allows us to improve
performance even more in the future: for example, we could make better use of
shared byte[] prefixes to speed up the termsenum for faster queries. We don't
do this yet...

internal representation:
* lucene's automaton stores a set of numbered States in Automaton. In
  addition to this we have a completely revamped determinize() method, along
  with some other performance improvements. This is all different from brics
  automaton, where Automaton is basically only a pointer to an initial
  state... performance can suffer because it often has to iterate over all
  the states.

Because we have modified automaton, we have written a lot of unit tests, many
that work via actual queries, to ensure everything is fully functional.
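
To make the "numbered states" and "utf-32 transitions" points a bit more
concrete, here is a toy sketch in Java. It is not the lucene or brics API:
ToyAutomaton and all of its methods are hypothetical names made up for this
example, and it assumes a deterministic automaton whose initial state is
state 0. Transitions are labeled with inclusive code point ranges, and the
"minimum possible length that can get accepted" is found with a plain
breadth-first search, which is only one possible way to compute it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy DFA for illustration only; NOT the lucene or brics Automaton class. */
final class ToyAutomaton {
    /** Transition on the inclusive code point range [min, max] to state 'to'. */
    private static final class Transition {
        final int min, max, to;
        Transition(int min, int max, int to) {
            this.min = min;
            this.max = max;
            this.to = to;
        }
    }

    // States are plain numbers: index i holds the outgoing transitions and
    // the accept flag of state i. State 0 is assumed to be the initial state.
    private final List<List<Transition>> transitions = new ArrayList<>();
    private final List<Boolean> accept = new ArrayList<>();

    /** Adds a state and returns its number. */
    int addState(boolean isAccept) {
        transitions.add(new ArrayList<>());
        accept.add(isAccept);
        return accept.size() - 1;
    }

    void addTransition(int from, int min, int max, int to) {
        transitions.get(from).add(new Transition(min, max, to));
    }

    /** Matches over code points, so a surrogate pair counts as one step. */
    boolean run(String s) {
        if (accept.isEmpty()) return false;
        int state = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            state = step(state, cp);
            if (state < 0) return false; // no matching transition
        }
        return accept.get(state);
    }

    private int step(int state, int cp) {
        for (Transition t : transitions.get(state)) {
            if (t.min <= cp && cp <= t.max) return t.to;
        }
        return -1;
    }

    /**
     * Length (in code points) of the shortest accepted string, or -1 if no
     * string is accepted. BFS visits states in order of distance from state 0,
     * so the first accept state dequeued gives the minimum.
     */
    int shortestAcceptedLength() {
        if (accept.isEmpty()) return -1;
        int[] dist = new int[accept.size()];
        Arrays.fill(dist, -1);
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        dist[0] = 0;
        queue.add(0);
        while (!queue.isEmpty()) {
            int s = queue.poll();
            if (accept.get(s)) return dist[s];
            for (Transition t : transitions.get(s)) {
                if (dist[t.to] < 0) {
                    dist[t.to] = dist[s] + 1;
                    queue.add(t.to);
                }
            }
        }
        return -1;
    }
}
```

With numbered states, the search can use a flat int[] indexed by state number
instead of first walking the graph from the initial state to discover which
states exist. A method like shortestAcceptedLength() is also exactly the kind
of thing that would need its own unit tests before it could go in.
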
Adding additional methods means we have to add proper tests, too.

> i would really try to avoid having two, 99% identical tools in code, or to
> specialize Automaton & co classes to do what they did in the first place.
> Could get confusing.
>

See above, I don't think they are 99% identical. If you are trying to
interact with a lucene index via NFA/DFA, I think you want to use
org.apache.lucene.automaton, as it's geared towards that. But I don't think
it's the best for general-purpose use.

> Also, having full library (or at least imported classes) makes upgrades
> easier. 1.11.3 will come one day...
>

I don't think upgrades will be easy, due to many of the modifications above.
At the same time, we are in communication with the author, and are trying to
determine a strategy for pushing some of our modifications/improvements into
brics automaton itself. It's just a matter of time. I think it's difficult,
but the first step would be to try to add real unit tests to brics automaton.

-- 
Robert Muir
rcm...@gmail.com