[jira] [Updated] (LUCENE-4556) FuzzyTermsEnum creates tons of objects
[ https://issues.apache.org/jira/browse/LUCENE-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-4556:
    Assignee: Michael McCandless (was: Simon Willnauer)

> FuzzyTermsEnum creates tons of objects
> --------------------------------------
>
> Key: LUCENE-4556
> URL: https://issues.apache.org/jira/browse/LUCENE-4556
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search, modules/spellchecker
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Assignee: Michael McCandless
> Priority: Critical
> Fix For: 4.9, 5.0
> Attachments: LUCENE-4556.patch, LUCENE-4556.patch
>
> I ran into this problem in production using the DirectSpellchecker. The
> number of objects created by the spellchecker shoots through the roof very
> quickly. We ran about 130 queries and ended up with more than 2M
> transitions/states. We spend 50% of the time in GC just because of
> transitions. Other parts of the system behave just fine here.
>
> I talked briefly to Robert and gave a POC a shot, providing a
> LevenshteinAutomaton#toRunAutomaton(prefix, n) method to optimize this case
> and build an array-based structure converted into UTF-8 directly instead of
> going through the object-based APIs. This involved quite a few changes, but
> they are all package private at this point. I have a patch that still has a
> fair set of nocommits, but it shows that it's possible and IMO worth the
> trouble to make this really usable in production. All tests pass with the
> patch - it's a start

--
This message was sent by Atlassian JIRA (v6.2#6252)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4556) FuzzyTermsEnum creates tons of objects
[ https://issues.apache.org/jira/browse/LUCENE-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-4556:
    Fix Version/s: 4.8 (was: 4.7)
[jira] [Updated] (LUCENE-4556) FuzzyTermsEnum creates tons of objects
[ https://issues.apache.org/jira/browse/LUCENE-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-4556:
    Fix Version/s: 4.4 (was: 4.3)
[jira] [Updated] (LUCENE-4556) FuzzyTermsEnum creates tons of objects
[ https://issues.apache.org/jira/browse/LUCENE-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated LUCENE-4556:
    Fix Version/s: 4.2 (was: 4.1)
[jira] [Updated] (LUCENE-4556) FuzzyTermsEnum creates tons of objects
[ https://issues.apache.org/jira/browse/LUCENE-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-4556:
    Attachment: LUCENE-4556.patch

I'm attaching a possible alternate way to reduce objects ... it's only just a start ...

I created a new LightAutomaton class (I'm not wed to that name!) which places a severe "append only" restriction on how you are allowed to build up the FSA: you must add all transitions for a given state before adding another state's transitions. It operates with only "int state", and stores all transitions in a private int[].

This is a big restriction, but I think a number of our FSA ops would work fine with this. I'm pretty sure building the LevA, and doing the UTF32->UTF8 conversion, would work fine append-only...

In the patch, I added Automaton.toLightAutomaton to convert from "heavy" to LightAutomaton, and then fixed CompiledAutomaton (and its consumers) to use that. Tests pass.

I think it shouldn't be too hard to cut over the Lev building to this too ... but wanted to get feedback first.

Simon, it'd be great if you could try this patch on your benchmark since I can't reproduce the too-heavy GC in my benchmark ... I'm particularly curious whether the 50% time spent in GC you see is due to 1) creating too many objects vs 2) holding onto those objects for too long (in CompiledAutomaton, while the query runs...). So this patch would test whether it's case 2).
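For readers following the thread, here is a rough sketch of what an "append only" builder that works with int states and keeps every transition in one private int[] might look like. It is not the LightAutomaton code from the attached patch: the class name AppendOnlyAutomaton, its methods, and the three-ints-per-transition layout are all assumptions made for illustration.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch of an append-only automaton builder. This is NOT the
 * LightAutomaton class from the patch; names and storage layout are assumed.
 */
final class AppendOnlyAutomaton {
    // Flat transition storage: three ints per transition (dest, min label, max label).
    private int[] transitions = new int[12];
    private int transitionsUpto = 0;

    // Per-state bookkeeping, indexed by the int state id.
    private int[] firstTransition = new int[4]; // offset into transitions, or -1 if none yet
    private int[] transitionCount = new int[4];
    private boolean[] accept = new boolean[4];
    private int numStates = 0;

    // The only state that may currently receive transitions.
    private int openState = -1;

    /** Creates a state and returns its id; state 0 is treated as the initial state. */
    int createState() {
        if (numStates == firstTransition.length) {
            int newSize = numStates * 2;
            firstTransition = Arrays.copyOf(firstTransition, newSize);
            transitionCount = Arrays.copyOf(transitionCount, newSize);
            accept = Arrays.copyOf(accept, newSize);
        }
        firstTransition[numStates] = -1;
        return numStates++;
    }

    void setAccept(int state, boolean isAccept) {
        checkState(state);
        accept[state] = isAccept;
    }

    /** Adds source --[min..max]--> dest; all transitions of one source must be added together. */
    void addTransition(int source, int dest, int min, int max) {
        checkState(source);
        checkState(dest);
        if (source != openState) {
            if (firstTransition[source] != -1) {
                // This state was already filled in and another state was started since.
                throw new IllegalStateException("state " + source + " is closed (append-only)");
            }
            firstTransition[source] = transitionsUpto;
            openState = source;
        }
        if (transitionsUpto + 3 > transitions.length) {
            transitions = Arrays.copyOf(transitions, transitions.length * 2);
        }
        transitions[transitionsUpto++] = dest;
        transitions[transitionsUpto++] = min;
        transitions[transitionsUpto++] = max;
        transitionCount[source]++;
    }

    /** Returns the destination state for the given label, or -1 if no transition matches. */
    int step(int state, int label) {
        checkState(state);
        int offset = firstTransition[state];
        for (int i = 0; i < transitionCount[state]; i++, offset += 3) {
            if (label >= transitions[offset + 1] && label <= transitions[offset + 2]) {
                return transitions[offset];
            }
        }
        return -1;
    }

    /** Runs the input code point by code point, starting from state 0. */
    boolean accepts(String input) {
        if (numStates == 0) {
            return false;
        }
        int state = 0;
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            state = step(state, codePoint);
            if (state == -1) {
                return false;
            }
            i += Character.charCount(codePoint);
        }
        return accept[state];
    }

    private void checkState(int state) {
        if (state < 0 || state >= numStates) {
            throw new IllegalArgumentException("unknown state " + state);
        }
    }
}
```

Because transitions for a state are contiguous in the array, no per-state or per-transition objects are allocated; the price is that a state cannot be revisited once the builder has moved on to another one, which is the restriction the comment describes.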
[jira] [Updated] (LUCENE-4556) FuzzyTermsEnum creates tons of objects
[ https://issues.apache.org/jira/browse/LUCENE-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-4556:
    Attachment: LUCENE-4556.patch

here is a patch ...scary™