[ https://issues.apache.org/jira/browse/LUCENE-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034513#comment-14034513 ]
Michael McCandless commented on LUCENE-5752:
--------------------------------------------

Thanks Rob.

bq. concatenate: as mentioned before, we rely on this today in quite a few places, and now the runtime has significantly changed (when the left side is a singleton)

Well, in RegExp we follow up that concatenate with a minimize. In WildcardQuery the incoming automata are small anyway ... and I fixed LevA to insert the prefix itself to avoid the full copy of the fuzzy suffix part.

bq. singleton: speaking of such, this optimization is removed, but are we sure about this? In practice this is probably extremely effective, maybe even outweighing any other optimizations we could do.

I really didn't like this duality / mutability (how you sometimes had to call expandSingleton for ops that cared) and I don't see where this opto would really make a difference in Lucene. We already have DaciukMihov to efficiently build a minimal union automaton ... I agree that for a general-purpose automaton library this might make sense ... but I don't think it really helps Lucene.

bq. regex/wildcard parsing: we should really test that this isn't totally crazy (read: blowing up) now.

I was worried about this too, but when I looked at RegExp it calls minimize after all of these ops! So I think the added cost of the copy is likely in the noise ...

bq. acceptStates: should this really be a hashset? is there a reason not to use a bitset?

Hmm, it could be a bitset. I thought that typically the number of accept states is small, but I agree that in the case when it's large it'd be nice to not use way too much RAM ... I'll change it to a bitset.
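To illustrate the acceptStates trade-off discussed above, here is a minimal Java sketch of the two representations. It is not code from the patch: the class and method names (`AcceptStatesSketch`, `asHashSet`, `asBitSet`) are hypothetical, and the only assumption is that states are dense non-negative ints. A `HashSet<Integer>` pays for a boxed object and a hash entry per accept state, while a `java.util.BitSet` costs roughly one bit per state id up to the highest accept state.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the LightAutomaton API.
final class AcceptStatesSketch {

    // HashSet-backed accept set: one boxed Integer plus a hash entry per accept state.
    static Set<Integer> asHashSet(int[] acceptStates) {
        Set<Integer> set = new HashSet<>();
        for (int state : acceptStates) {
            set.add(state);
        }
        return set;
    }

    // BitSet-backed accept set: one bit per state id, up to the highest accept state.
    static BitSet asBitSet(int[] acceptStates) {
        BitSet bits = new BitSet();
        for (int state : acceptStates) {
            bits.set(state);
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] accept = {3, 7, 42};
        Set<Integer> hashed = asHashSet(accept);
        BitSet bits = asBitSet(accept);
        // Both representations answer the same membership question.
        for (int state = 0; state < 50; state++) {
            if (hashed.contains(state) != bits.get(state)) {
                throw new AssertionError("mismatch at state " + state);
            }
        }
        System.out.println("accept states: " + bits);
    }
}
```

When only a handful of states accept, either choice is cheap; the bitset pays off once the accept set is a large fraction of all states.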
> Explore light weight Automaton replacement
> ------------------------------------------
>
>                 Key: LUCENE-5752
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5752
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0
>
>         Attachments: LUCENE-5752.patch
>
>
> This effort started with the patch on LUCENE-4556, to create a "light
> weight" replacement for the current object-heavy Automaton class
> (which creates separate State and Transition objects).
>
> I took that initial patch much further, and cutover most places in
> Lucene that use Automaton to LightAutomaton. Tests pass.
>
> The core idea of LightAutomaton is all states are ints, and you build
> up the automaton under the restriction that you add all outgoing
> transitions one state at a time. This worked well for most
> operations, but for some (e.g. UTF32ToUTF8!!) it was harder, so I also
> added a separate builder to add transitions in any order and then in
> the end they are sorted and added to the real automaton.
>
> If this is successful I think we should just replace the current
> Automaton with LightAutomaton; right now they both exist in my current
> patch...
>
> This is very much a work in progress, and I'm not sure the
> restrictions the API imposes are "reasonable" (some algos got uglier).
> But I think it's at least worth exploring/iterating... I'll make a branch and
> commit my current state.
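The issue description's core idea (int states, all outgoing transitions of a state added together, plus a builder that accepts transitions in any order and sorts them at the end) can be sketched roughly as follows. This is an illustration under stated assumptions, not the LightAutomaton code from the patch: the names `IntAutomatonSketch`, `Builder`, `step`, and `run`, the flat `(dest, min, max)` array layout, and the choice of state 0 as the initial state are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch; names and layout are assumptions, not the patch's API.
final class IntAutomatonSketch {
    private int[] transitions = new int[12];    // (dest, min, max) triples, packed flat
    private int transitionCount = 0;
    private int[] firstTransition = new int[4]; // per state: index of first transition, or -1
    private int[] numTransitions = new int[4];  // per state: number of outgoing transitions
    private int stateCount = 0;
    private int currentState = -1;              // state whose transitions are being added
    private final BitSet accept = new BitSet();

    int createState() {
        if (stateCount == firstTransition.length) {
            firstTransition = Arrays.copyOf(firstTransition, stateCount * 2);
            numTransitions = Arrays.copyOf(numTransitions, stateCount * 2);
        }
        firstTransition[stateCount] = -1;
        return stateCount++;
    }

    void setAccept(int state) { accept.set(state); }

    // Restriction: all transitions leaving one state must be added consecutively.
    void addTransition(int source, int dest, int min, int max) {
        if (source < 0 || source >= stateCount || dest < 0 || dest >= stateCount) {
            throw new IllegalArgumentException("unknown state");
        }
        if (source != currentState) {
            if (firstTransition[source] != -1) {
                throw new IllegalStateException("state " + source + " was already finished");
            }
            firstTransition[source] = transitionCount;
            currentState = source;
        }
        if (transitionCount * 3 == transitions.length) {
            transitions = Arrays.copyOf(transitions, transitions.length * 2);
        }
        int base = transitionCount * 3;
        transitions[base] = dest;
        transitions[base + 1] = min;
        transitions[base + 2] = max;
        transitionCount++;
        numTransitions[source]++;
    }

    // Returns the destination for (state, label), or -1 if no transition matches.
    int step(int state, int label) {
        for (int i = 0; i < numTransitions[state]; i++) {
            int base = (firstTransition[state] + i) * 3;
            if (transitions[base + 1] <= label && label <= transitions[base + 2]) {
                return transitions[base];
            }
        }
        return -1;
    }

    // Runs the automaton over the code points of s, starting from state 0.
    boolean run(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            state = step(state, cp);
            if (state == -1) return false;
            i += Character.charCount(cp);
        }
        return accept.get(state);
    }

    // Accepts transitions in any order, then sorts by source state before adding them.
    static final class Builder {
        private final List<int[]> pending = new ArrayList<>();
        private final BitSet accept = new BitSet();
        private int stateCount = 0;

        int createState() { return stateCount++; }
        void setAccept(int state) { accept.set(state); }
        void addTransition(int source, int dest, int min, int max) {
            pending.add(new int[] {source, dest, min, max});
        }

        IntAutomatonSketch finish() {
            pending.sort((a, b) -> Integer.compare(a[0], b[0]));
            IntAutomatonSketch automaton = new IntAutomatonSketch();
            for (int i = 0; i < stateCount; i++) automaton.createState();
            for (int[] t : pending) automaton.addTransition(t[0], t[1], t[2], t[3]);
            for (int s = accept.nextSetBit(0); s >= 0; s = accept.nextSetBit(s + 1)) {
                automaton.setAccept(s);
            }
            return automaton;
        }
    }
}
```

In this sketch the one-state-at-a-time restriction is what lets each state's transitions live in one contiguous slice of a single int array, with no per-state or per-transition objects; the builder trades a temporary list and a sort for freedom of insertion order.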