[
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15617679#comment-15617679
]
Uwe Schindler commented on LUCENE-7525:
---------------------------------------
I'd suggest using the simple binary search approach, but without generated
code. Convert the large switch statement once into a simple text file and
load it as a resource in a static initializer.
This would also allow extending the folding filter further, so that people can
use their own mappings by pointing to an input stream!
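
A minimal sketch of that idea, not an actual patch. The class name
FoldingTable and the text format (one "HEX=replacement" entry per line, "#"
for comments) are assumptions made for illustration; only the shape of the
fold method mirrors foldToASCII. The mappings are read once into sorted
parallel arrays and looked up with Arrays.binarySearch, so the hot method
stays small:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Lucene: a folding table loaded from text. */
final class FoldingTable {
    private final char[] keys;     // source characters, sorted ascending
    private final char[][] values; // values[i] is the replacement for keys[i]

    private FoldingTable(char[] keys, char[][] values) {
        this.keys = keys;
        this.values = values;
    }

    /**
     * Reads mappings from a stream. Assumed format (an illustration only):
     * a hex code point, '=', then the replacement, e.g. "00E9=e".
     * Blank lines and lines starting with '#' are skipped.
     */
    static FoldingTable load(InputStream in) throws IOException {
        // TreeMap keeps the keys sorted, which binary search requires.
        TreeMap<Character, String> sorted = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                int eq = line.indexOf('=');
                if (eq < 0) {
                    throw new IOException("Malformed mapping line: " + line);
                }
                char key = (char) Integer.parseInt(line.substring(0, eq), 16);
                sorted.put(key, line.substring(eq + 1));
            }
        }
        char[] keys = new char[sorted.size()];
        char[][] values = new char[sorted.size()][];
        int i = 0;
        for (Map.Entry<Character, String> e : sorted.entrySet()) {
            keys[i] = e.getKey();
            values[i] = e.getValue().toCharArray();
            i++;
        }
        return new FoldingTable(keys, values);
    }

    /** Folds input into output and returns the new output position. */
    int fold(char[] input, int inputPos, char[] output, int outputPos, int length) {
        final int end = inputPos + length;
        for (int pos = inputPos; pos < end; pos++) {
            final char c = input[pos];
            if (c < 0x80) {
                output[outputPos++] = c; // ASCII shortcut, no lookup needed
                continue;
            }
            final int idx = Arrays.binarySearch(keys, c);
            if (idx < 0) {
                output[outputPos++] = c; // no mapping: copy unchanged
            } else {
                final char[] replacement = values[idx];
                System.arraycopy(replacement, 0, output, outputPos, replacement.length);
                outputPos += replacement.length;
            }
        }
        return outputPos;
    }
}
```

The default table could be created in a static initializer from a classpath
resource via Class.getResourceAsStream (the resource name would have to be
chosen), while users with their own mappings would pass any InputStream to
load. As in the test code below, the output array has to be sized for the
longest replacement.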
> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method
> size
> ----------------------------------------------------------------------------------
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 6.2.1
> Reporter: Karl von Randow
> Attachments: ASCIIFolding.java, ASCIIFoldingFilter.java,
> TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch
> statement and is too large for the HotSpot compiler to compile, causing a
> performance problem.
> The method is about 13 KB when compiled to bytecode, versus the 8 KB HotSpot
> limit. So splitting the method in half works around the problem.
> In my tests, splitting the method in half resulted in a 5x performance
> increase.
> In the test code below you can see how slow the fold method is, even when it
> is using the shortcut when the character is less than 0x80, compared to an
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input
> file as per SOLR-2013 would be a better replacement for this method in this
> class?
> {code:java}
> import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
> import org.junit.Assert;
> import org.junit.Test;
>
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>
>   // Pure ASCII input: every character takes the < 0x80 shortcut.
>   @Test
>   public void testFoldShortString() {
>     char[] input = "testing".toCharArray();
>     char[] output = new char[input.length * 4];
>     for (int i = 0; i < ITERATIONS; i++) {
>       ASCIIFoldingFilter.foldToASCII(input, 0, output, 0,
>           input.length);
>     }
>   }
>
>   // Accented input: every character goes through the switch statement.
>   @Test
>   public void testFoldShortAccentedString() {
>     char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>     char[] output = new char[input.length * 4];
>     for (int i = 0; i < ITERATIONS; i++) {
>       ASCIIFoldingFilter.foldToASCII(input, 0, output, 0,
>           input.length);
>     }
>   }
>
>   // Inline implementation of the same shortcut, for comparison.
>   @Test
>   public void testManualFoldTinyString() {
>     char[] input = "t".toCharArray();
>     char[] output = new char[input.length * 4];
>     for (int i = 0; i < ITERATIONS; i++) {
>       int k = 0;
>       for (int j = 0; j < 1; ++j) {
>         final char c = input[j];
>         if (c < '\u0080') {
>           output[k++] = c;
>         } else {
>           Assert.assertTrue(false);
>         }
>       }
>     }
>   }
> }
> {code}