[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

Steve Rowe (JIRA) Fri, 27 Jan 2017 06:07:41 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15842901#comment-15842901
 ]


Steve Rowe commented on LUCENE-7525:
------------------------------------

{quote}
I think we can, for now, replace the large switch statement with a resource 
file. I have two ideas:
# A UTF-8 encoded file with 2 columns: the first column is a single char, the 
2nd column is a series of replacements. I don't really like this approach, as 
it is very sensitive to corruption by editors and hard to commit correctly.
# A simple file like int => int,int,int // comment. This is easy to parse and 
convert, but the downside is that it's harder to read the codepoints (for that 
we have a comment).
{quote}
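
As a rough illustration of the second idea in the quote, here is a minimal sketch of a parser for lines shaped like {{int => int,int,int // comment}}. The class name {{FoldingTableParser}}, the use of {{Integer.decode}} (so that both decimal and 0x-prefixed hex are accepted), and the choice to skip blank lines are all assumptions made for the example; no such file format has been agreed on in this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: parses "int => int,int,int // comment" lines.
public class FoldingTableParser {

    /** Returns a map from source code point to its replacement code points. */
    public static Map<Integer, int[]> parse(Iterable<String> lines) {
        Map<Integer, int[]> table = new HashMap<>();
        for (String raw : lines) {
            // Drop the trailing "// comment" part, if any.
            int comment = raw.indexOf("//");
            String line = (comment >= 0 ? raw.substring(0, comment) : raw).trim();
            if (line.isEmpty()) {
                continue; // blank or comment-only line
            }
            String[] sides = line.split("=>");
            if (sides.length != 2) {
                throw new IllegalArgumentException("Malformed line: " + raw);
            }
            // Integer.decode accepts decimal ("233") and hex ("0xE9") notation.
            int source = Integer.decode(sides[0].trim());
            String[] parts = sides[1].trim().split(",");
            int[] target = new int[parts.length];
            for (int i = 0; i < parts.length; i++) {
                target[i] = Integer.decode(parts[i].trim());
            }
            table.put(source, target);
        }
        return table;
    }
}
```

A line such as {{0xDF => 0x73,0x73 // sharp s -> ss}} would then map code point 0xDF to the two code points for "ss".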

I wrote a Perl script to create {{mapping-FoldToASCII.txt}}, which is usable 
with {{MappingCharFilter}}, from the {{ASCIIFoldingFilter}} code - the script 
is actually embedded in that file, which is included in several of Solr's 
example configsets, e.g. under 
{{solr/server/solr/configsets/sample_techproducts_configs/conf/}}.  Maybe this 
file could be used directly?  It's human friendly, so would allow for easy user 
customization.
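
To give an idea of how little code reading such a file takes, here is a simplified, standard-library-only sketch. It assumes rules of the form {{"source" => "target"}}, {{#}} comment lines, and only {{\uXXXX}} escapes inside the quotes; the real {{MappingCharFilterFactory}} parser may accept more than this, and the class name {{MappingFileSketch}} is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified reader for mapping rules; not the Lucene/Solr implementation.
public class MappingFileSketch {

    // Assumed rule shape: "source" => "target"
    private static final Pattern RULE =
        Pattern.compile("^\"(.*)\"\\s*=>\\s*\"(.*)\"\\s*$");

    // A literal backslash, the letter u, then four hex digits.
    private static final Pattern UNICODE_ESCAPE =
        Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    /** Replaces four-digit hex escapes with the characters they denote. */
    static String unescape(String s) {
        Matcher m = UNICODE_ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Parses rule lines into an ordered source -> target map, skipping blanks and # comments. */
    public static Map<String, String> parse(Iterable<String> lines) {
        Map<String, String> rules = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            Matcher m = RULE.matcher(line);
            if (!m.matches()) {
                throw new IllegalArgumentException("Malformed rule: " + raw);
            }
            rules.put(unescape(m.group(1)), unescape(m.group(2)));
        }
        return rules;
    }
}
```

The resulting map could then be fed into whatever lookup structure the filter ends up using; that part is left open here.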

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-7525
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7525
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 6.2.1
>            Reporter: Karl von Randow
>         Attachments: ASCIIFoldingFilter.java, ASCIIFolding.java, 
> LUCENE-7525.patch, TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile, which causes a 
> performance problem.
> The method is about 13KB compiled, versus the 8KB HotSpot limit, so splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
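> import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
> import org.junit.Assert;
> import org.junit.Test;
>
> public class ASCIIFoldingFilterPerformanceTest {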
>       private static final int ITERATIONS = 1_000_000;
>       @Test
>       public void testFoldShortString() {
>               char[] input = "testing".toCharArray();
>               char[] output = new char[input.length * 4];
>               for (int i = 0; i < ITERATIONS; i++) {
>                       ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
>               }
>       }
>       @Test
>       public void testFoldShortAccentedString() {
>               char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>               char[] output = new char[input.length * 4];
>               for (int i = 0; i < ITERATIONS; i++) {
>                       ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
>               }
>       }
>       @Test
>       public void testManualFoldTinyString() {
>               char[] input = "t".toCharArray();
>               char[] output = new char[input.length * 4];
>               for (int i = 0; i < ITERATIONS; i++) {
>                       int k = 0;
>                       for (int j = 0; j < 1; ++j) {
>                               final char c = input[j];
>                               if (c < '\u0080') {
>                                       output[k++] = c;
>                               } else {
>                                       Assert.assertTrue(false);
>                               }
>                       }
>               }
>       }
> }
> {code}
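
For illustration, the method-splitting workaround described in the issue could look roughly like the sketch below: the ASCII shortcut stays in a small entry method, and the switch is divided by character range into two helpers so that each compiled method is smaller. The class name {{SplitFoldSketch}}, the split point at U+00E0, and the handful of cases shown are assumptions chosen for the example; this is not the attached patch.

```java
// Hypothetical sketch of splitting one huge switch into two smaller methods.
public class SplitFoldSketch {

    /** Folds one char into output at pos; returns the number of chars written. */
    public static int fold(char c, char[] output, int pos) {
        if (c < '\u0080') {
            output[pos] = c; // ASCII shortcut, no switch needed
            return 1;
        }
        // Dispatch by range so that neither helper grows too large.
        // The split point is arbitrary and only for illustration.
        return c < '\u00E0' ? foldLow(c, output, pos) : foldHigh(c, output, pos);
    }

    private static int foldLow(char c, char[] output, int pos) {
        switch (c) {
            case '\u00C9': // capital E with acute
                output[pos] = 'E';
                return 1;
            case '\u00DF': // sharp s expands to two chars
                output[pos] = 's';
                output[pos + 1] = 's';
                return 2;
            default:
                output[pos] = c; // no mapping: copy through
                return 1;
        }
    }

    private static int foldHigh(char c, char[] output, int pos) {
        switch (c) {
            case '\u00E4': output[pos] = 'a'; return 1; // a with diaeresis
            case '\u00E9': output[pos] = 'e'; return 1; // e with acute
            case '\u00F8': output[pos] = 'o'; return 1; // o with stroke
            case '\u00FA':                              // u with acute
            case '\u00FC': output[pos] = 'u'; return 1; // u with diaeresis
            default:       output[pos] = c;   return 1; // no mapping: copy through
        }
    }

    /** Convenience wrapper: folds a whole string using the split methods. */
    public static String foldAll(String input) {
        char[] output = new char[input.length() * 4];
        int k = 0;
        for (int i = 0; i < input.length(); i++) {
            k += fold(input.charAt(i), output, k);
        }
        return new String(output, 0, k);
    }
}
```

Whether two halves are enough for the real table depends on how large each half compiles to, so the sizes would need to be checked against the actual method.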


