[ https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612160#comment-15612160 ]
Michael Braun commented on LUCENE-7525:
---------------------------------------

Was just profiling indexing yesterday on our side and noticed the exact same thing - would love to see the performance of this method improved!

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-7525
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7525
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 6.2.1
>            Reporter: Karl von Randow
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch statement and is too large for the HotSpot compiler to compile, which causes a performance problem.
> The method is about 13K when compiled, versus the 8KB HotSpot limit, so splitting the method in half works around the problem.
> In my tests, splitting the method in half resulted in a 5X performance increase.
> In the test code below you can see how slow the fold method is, even when it takes the shortcut for characters less than 0x80, compared to an inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input file as per SOLR-2013 would be a better replacement for this method in this class?
>
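For anyone following along, the split-method workaround described in the issue could look roughly like the sketch below. This is not the actual patch and not Lucene's code: the class name `SplitFoldSketch`, the helper names `foldLowerHalf` and `foldUpperHalf`, the cut point at U+00E0, and the tiny mapping table are all assumptions made for illustration. The idea is only that the caller keeps the ASCII shortcut and hands everything else to one of two smaller switch methods, so that each piece has a chance of staying under the compiler's size limit.

```java
/** Hypothetical sketch of splitting one huge switch into two smaller methods. */
final class SplitFoldSketch {

  /** Folds length chars from input into output and returns the new output position. */
  static int fold(char[] input, int inputPos, char[] output, int outputPos, int length) {
    final int end = inputPos + length;
    for (int pos = inputPos; pos < end; pos++) {
      final char c = input[pos];
      if (c < '\u0080') {
        // ASCII shortcut: copy the character through unchanged.
        output[outputPos++] = c;
      } else if (c < '\u00E0') {
        // Assumed cut point; a real split would balance the two halves by size.
        outputPos = foldLowerHalf(c, output, outputPos);
      } else {
        outputPos = foldUpperHalf(c, output, outputPos);
      }
    }
    return outputPos;
  }

  /** First half of the (illustrative, heavily truncated) mapping table. */
  private static int foldLowerHalf(char c, char[] output, int outputPos) {
    switch (c) {
      case '\u00C0': // À
      case '\u00C1': // Á
        output[outputPos++] = 'A';
        break;
      case '\u00DF': // ß expands to two characters
        output[outputPos++] = 's';
        output[outputPos++] = 's';
        break;
      default:
        // No mapping in this sketch: keep the character as-is.
        output[outputPos++] = c;
    }
    return outputPos;
  }

  /** Second half of the (illustrative, heavily truncated) mapping table. */
  private static int foldUpperHalf(char c, char[] output, int outputPos) {
    switch (c) {
      case '\u00E9': // é
        output[outputPos++] = 'e';
        break;
      case '\u00FA': // ú
      case '\u00FC': // ü
        output[outputPos++] = 'u';
        break;
      default:
        // No mapping in this sketch: keep the character as-is.
        output[outputPos++] = c;
    }
    return outputPos;
  }
}
```

Whether two halves are enough depends on how large each resulting method actually compiles to, so the real table might need a different cut point or more than two pieces.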
> {code:java}
> import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
> import org.junit.Assert;
> import org.junit.Test;
>
> public class ASCIIFoldingFilterPerformanceTest {
>
>   private static final int ITERATIONS = 1_000_000;
>
>   // Pure ASCII input: every character takes the < 0x80 shortcut inside foldToASCII.
>   @Test
>   public void testFoldShortString() {
>     char[] input = "testing".toCharArray();
>     char[] output = new char[input.length * 4];
>     for (int i = 0; i < ITERATIONS; i++) {
>       ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
>     }
>   }
>
>   // Accented input: every character goes through the large switch statement.
>   @Test
>   public void testFoldShortAccentedString() {
>     char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>     char[] output = new char[input.length * 4];
>     for (int i = 0; i < ITERATIONS; i++) {
>       ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
>     }
>   }
>
>   // Inline copy of the ASCII shortcut only, for comparison with foldToASCII.
>   @Test
>   public void testManualFoldTinyString() {
>     char[] input = "t".toCharArray();
>     char[] output = new char[input.length * 4];
>     for (int i = 0; i < ITERATIONS; i++) {
>       int k = 0;
>       for (int j = 0; j < 1; ++j) {
>         final char c = input[j];
>         if (c < '\u0080') {
>           output[k++] = c;
>         } else {
>           Assert.assertTrue(false);
>         }
>       }
>     }
>   }
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)