[ 
https://issues.apache.org/jira/browse/LUCENE-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-4556:
---------------------------------------

    Attachment: LUCENE-4556.patch

I'm attaching a possible alternate way to reduce objects ... it's
only just a start ...

I created a new LightAutomaton class (I'm not wed to that name!) which
places a severe "append only" restriction on how you are allowed to
build up the FSA: you must add all transitions for a given state
before adding another state's transitions.

It operates with only "int state", and stores all transitions in a
private int[].
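
To make the append-only idea concrete, here is a minimal sketch of such a
structure. It is not the class from the attached patch: the name
LightAutomatonSketch, the method names, and the array layout (two ints per
state, three ints per transition) are all assumptions chosen for
illustration.

```java
import java.util.Arrays;

/** Hypothetical illustration only; names and layout are assumptions, not the patch's API. */
final class LightAutomatonSketch {
  // Two ints per state: [offset of first transition, or -1 if none yet; transition count].
  private int[] states = new int[4];
  // Three ints per transition: [dest state, min label, max label].
  private int[] transitions = new int[12];
  private int stateCount;
  private int nextTransition;   // next free slot in `transitions`
  private int curState = -1;    // the state currently receiving transitions

  /** Creates a new state and returns its int id. */
  int createState() {
    if (2 * stateCount == states.length) {
      states = Arrays.copyOf(states, states.length * 2);
    }
    states[2 * stateCount] = -1;
    states[2 * stateCount + 1] = 0;
    return stateCount++;
  }

  /** Adds source --[min..max]--> dest; all transitions of one state must be added together. */
  void addTransition(int source, int dest, int min, int max) {
    if (source < 0 || source >= stateCount) {
      throw new IllegalArgumentException("unknown state " + source);
    }
    if (source != curState) {
      // Append-only restriction: a state whose transitions were already
      // written cannot be reopened once another state has been started.
      if (states[2 * source] != -1) {
        throw new IllegalStateException("state " + source + " is already finished");
      }
      states[2 * source] = nextTransition;
      curState = source;
    }
    if (nextTransition + 3 > transitions.length) {
      transitions = Arrays.copyOf(transitions, transitions.length * 2);
    }
    transitions[nextTransition++] = dest;
    transitions[nextTransition++] = min;
    transitions[nextTransition++] = max;
    states[2 * source + 1]++;
  }

  /** Returns the destination state for `label`, or -1 if no transition matches. */
  int step(int state, int label) {
    int offset = states[2 * state];
    int count = states[2 * state + 1];
    for (int i = 0; i < count; i++) {
      int base = offset + 3 * i;
      if (transitions[base + 1] <= label && label <= transitions[base + 2]) {
        return transitions[base];
      }
    }
    return -1;
  }
}
```

Because each state's transitions sit in one contiguous run of the int[], no
per-state or per-transition objects are allocated. The cost is that a state
cannot be revisited once the builder has moved on to another one.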

This is a big restriction, but I think a number of our FSA ops would
work fine with this.  I'm pretty sure building the LevA, and doing the
UTF32->UTF8 conversion would work fine append-only...

In the patch, I added Automaton.toLightAutomaton to convert from
"heavy" to LightAutomaton, and then fixed CompiledAutomaton (and its
consumers) to use that.  Tests pass.

I think it shouldn't be too hard to cut over the Lev building to this
too ... but wanted to get feedback first.

Simon, it'd be great if you could try this patch on your benchmark
since I can't reproduce the too-heavy GC in my benchmark ... I'm
particularly curious whether the 50% time spent in GC you see is due
to 1) creating too many objects vs 2) holding onto those objects for
too long (in CompiledAutomaton, while the query runs...).  So this
patch would test whether it's case 2).
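
For anyone wanting to compare GC share before and after the patch, a rough
sketch using the standard java.lang.management API follows. The class and
method names (GcTimeSketch, gcFraction) are made up for this example, and
the number it reports is only an approximation: it says how much time went
to GC, not which of the two cases above is responsible.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper; not part of Lucene or of the attached patch. */
final class GcTimeSketch {
  /** Sums the accumulated collection time (ms) reported by all collectors. */
  static long totalGcMillis() {
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      long t = gc.getCollectionTime(); // -1 if the collector does not report it
      if (t > 0) {
        total += t;
      }
    }
    return total;
  }

  /** Runs the workload and returns (GC time) / (wall-clock time), approximately. */
  static double gcFraction(Runnable workload) {
    long gcBefore = totalGcMillis();
    long start = System.nanoTime();
    workload.run();
    long elapsedMillis = Math.max(1, (System.nanoTime() - start) / 1_000_000);
    return (totalGcMillis() - gcBefore) / (double) elapsedMillis;
  }
}
```

Passing the benchmark's query loop as the Runnable, once with and once
without the patch, would give two numbers to compare.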

                
> FuzzyTermsEnum creates tons of objects
> --------------------------------------
>
>                 Key: LUCENE-4556
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4556
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search, modules/spellchecker
>    Affects Versions: 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>            Priority: Critical
>             Fix For: 4.1, 5.0
>
>         Attachments: LUCENE-4556.patch, LUCENE-4556.patch
>
>
> I ran into this problem in production using the DirectSpellChecker. The
> number of objects created by the spellchecker shoots through the roof very
> quickly. We ran about 130 queries and ended up with > 2M transitions /
> states. We spend 50% of the time in GC just because of transitions. Other
> parts of the system behave just fine here.
> I talked quickly to Robert and gave a POC a shot, providing a
> LevenshteinAutomaton#toRunAutomaton(prefix, n) method to optimize this case
> and build an array-based structure converted into UTF-8 directly instead of
> going through the object-based APIs. This involved quite a few changes, but
> they are all package private at this point. I have a patch that still has a
> fair set of nocommits, but it shows that it's possible and IMO worth the
> trouble to make this really usable in production. All tests pass with the
> patch - it's a start....

