[ https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15842901#comment-15842901 ]
Steve Rowe commented on LUCENE-7525:
------------------------------------

{quote}
I think we can, for now, replace the large switch statement with a resource file. I have 2 ideas:
# A UTF-8 encoded file with 2 columns: the first column is a single char, the 2nd column is a series of replacements. I don't really like this approach, as it is very sensitive to corruption by editors and hard to commit correctly.
# A simple file like int => int,int,int // comment. This is easy to parse and convert, but the downside is that it's harder to read the codepoints (for that we have the comment).
{quote}

I wrote a Perl script to create {{mapping-FoldToASCII.txt}}, which is usable with {{MappingCharFilter}}, from the {{ASCIIFoldingFilter}} code. The script is actually embedded in that file, which is included in several of Solr's example configsets, e.g. under {{solr/server/solr/configsets/sample_techproducts_configs/conf/}}.

Maybe this file could be used directly? It's human friendly, so it would allow for easy user customization.

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-7525
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7525
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 6.2.1
>            Reporter: Karl von Randow
>         Attachments: ASCIIFoldingFilter.java, ASCIIFolding.java, LUCENE-7525.patch, TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch statement and is too large for the HotSpot compiler to compile, causing a performance problem.
> The method is about 13 KB when compiled, versus the 8 KB HotSpot limit, so splitting the method in half works around the problem.
> In my tests, splitting the method in half resulted in a 5X performance increase.
> In the test code below you can see how slow the fold method is, even when it takes the shortcut for characters below 0x80, compared to an inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input file as per SOLR-2013 would be a better replacement for this method in this class?
> {code:java}
> import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
> import org.junit.Assert;
> import org.junit.Test;
>
> public class ASCIIFoldingFilterPerformanceTest {
>     private static final int ITERATIONS = 1_000_000;
>
>     // Pure ASCII input: every character takes the (c < 0x80) shortcut inside foldToASCII.
>     @Test
>     public void testFoldShortString() {
>         char[] input = "testing".toCharArray();
>         char[] output = new char[input.length * 4];
>         for (int i = 0; i < ITERATIONS; i++) {
>             ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
>         }
>     }
>
>     // Accented input: every character goes through the large switch statement.
>     @Test
>     public void testFoldShortAccentedString() {
>         char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>         char[] output = new char[input.length * 4];
>         for (int i = 0; i < ITERATIONS; i++) {
>             ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
>         }
>     }
>
>     // Inline copy of the ASCII shortcut only, for comparison with testFoldShortString.
>     @Test
>     public void testManualFoldTinyString() {
>         char[] input = "t".toCharArray();
>         char[] output = new char[input.length * 4];
>         for (int i = 0; i < ITERATIONS; i++) {
>             int k = 0;
>             for (int j = 0; j < 1; ++j) {
>                 final char c = input[j];
>                 if (c < '\u0080') {
>                     output[k++] = c;
>                 } else {
>                     Assert.fail();
>                 }
>             }
>         }
>     }
> }
> {code}
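
For illustration only, here is a rough sketch of the second idea quoted in the comment (a resource file with lines of the form int => int,int,int // comment). It is not code from Lucene or from any attached patch. The class name {{FoldingTableSketch}}, the exact line syntax (hex or decimal code points, "=>" separator, "//" comments) and the use of a {{HashMap}} are all assumptions made for this example. The point is only that the lookup method stays tiny once the mappings live in data instead of a switch statement.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: folding driven by a table parsed from "int => int,int // comment" lines. */
public class FoldingTableSketch {
    // Source code point -> replacement chars. The map is an assumption of this sketch.
    private final Map<Integer, char[]> table = new HashMap<>();

    /** Parses lines such as "0xDF => 0x73,0x73 // sharp s"; blank and comment-only lines are skipped. */
    public FoldingTableSketch(List<String> lines) {
        for (String raw : lines) {
            int comment = raw.indexOf("//");
            String line = (comment >= 0 ? raw.substring(0, comment) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] sides = line.split("=>");
            if (sides.length != 2) {
                throw new IllegalArgumentException("Bad mapping line: " + raw);
            }
            int source = Integer.decode(sides[0].trim());
            StringBuilder target = new StringBuilder();
            for (String cp : sides[1].split(",")) {
                target.appendCodePoint(Integer.decode(cp.trim()));
            }
            table.put(source, target.toString().toCharArray());
        }
    }

    /** Folds the input: ASCII passes through untouched, mapped code points are replaced. */
    public String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            // Same shortcut as in the issue description: skip the lookup below 0x80.
            char[] replacement = cp < 0x80 ? null : table.get(cp);
            if (replacement == null) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
        }
        return out.toString();
    }
}
```

A caller would build the table once, for example from the lines of a classpath resource, and reuse it for every call to {{fold}}. Whether a map, a sorted array or the existing {{mapping-FoldToASCII.txt}} format is the better fit is left open here.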