[ https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613746#comment-15613746 ]
Karl von Randow commented on LUCENE-7525:
-----------------------------------------
I have created a version that uses static arrays and a binary search. It
performs similarly to splitting the existing code into two methods, with the
switch cases divided between them and the first method delegating to the
second from its {{default:}} block. That is not surprising: a switch of that
size (and with such a wide spread of case values) is itself executed as a
binary search, just a native one.
I'll try to attach code examples of the two approaches. Both are perhaps a
simpler change than moving to a new lookup structure, and I suspect they will
also perform better, since each is effectively a binary search over an array.
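For illustration only, here is a minimal sketch of what the static-array variant could look like. It is not the code attached to this issue: the class name {{FoldLookupSketch}}, the array names and the six-entry table are hypothetical, and a real table would need every mapping from the existing switch, sorted by code point.

```java
import java.util.Arrays;

// Hypothetical sketch: a sorted key array plus a parallel replacement array,
// searched with Arrays.binarySearch. Only six mappings are included here.
class FoldLookupSketch {
    // Keys MUST be sorted ascending for the binary search to be valid.
    private static final char[] KEYS = {
        '\u00DF', // sharp s
        '\u00E4', // a with diaeresis
        '\u00E9', // e with acute
        '\u00F8', // o with stroke
        '\u00FA', // u with acute
        '\u00FC', // u with diaeresis
    };
    // REPLACEMENTS[i] is the ASCII folding of KEYS[i].
    private static final String[] REPLACEMENTS = { "ss", "a", "e", "o", "u", "u" };

    // Same parameter order as foldToASCII; returns the new output position.
    static int fold(char[] input, int inputPos, char[] output, int outputPos, int length) {
        final int end = inputPos + length;
        for (int pos = inputPos; pos < end; pos++) {
            final char c = input[pos];
            if (c < '\u0080') {
                // Shortcut: plain ASCII is copied unchanged.
                output[outputPos++] = c;
                continue;
            }
            final int idx = Arrays.binarySearch(KEYS, c);
            if (idx < 0) {
                // No mapping known: keep the character as it is.
                output[outputPos++] = c;
            } else {
                final String replacement = REPLACEMENTS[idx];
                for (int i = 0; i < replacement.length(); i++) {
                    output[outputPos++] = replacement.charAt(i);
                }
            }
        }
        return outputPos;
    }
}
```

The output buffer is assumed to be large enough, as in the test code quoted below (four times the input length).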
The simplest change is the switch-split. It will also be the most attractive in
the Git history (as the method is literally split in two about half-way through
the case statements), and it is the approach for which it is easiest to show
that the behaviour is the same as before.
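A rough sketch of the switch-split shape, under stated assumptions: the class and method names ({{SplitSwitchSketch}}, {{foldFirstHalf}}, {{foldSecondHalf}}) are hypothetical, only a handful of cases are shown, and the real method has far more cases than this. The point is only that the first switch hands anything it does not recognise to the second from its {{default:}} block.

```java
// Hypothetical sketch of splitting one huge switch into two smaller methods,
// so that each method stays below the size HotSpot is willing to compile.
class SplitSwitchSketch {
    // Same parameter order as foldToASCII; returns the new output position.
    static int fold(char[] input, int inputPos, char[] output, int outputPos, int length) {
        final int end = inputPos + length;
        for (int pos = inputPos; pos < end; pos++) {
            final char c = input[pos];
            if (c < '\u0080') {
                // Shortcut: plain ASCII is copied unchanged.
                output[outputPos++] = c;
            } else {
                outputPos = foldFirstHalf(c, output, outputPos);
            }
        }
        return outputPos;
    }

    // First half of the case statements.
    private static int foldFirstHalf(char c, char[] output, int outputPos) {
        switch (c) {
            case '\u00E0': // a with grave
            case '\u00E4': // a with diaeresis
                output[outputPos++] = 'a';
                break;
            case '\u00E9': // e with acute
                output[outputPos++] = 'e';
                break;
            default:
                // Not handled here: delegate to the second half.
                outputPos = foldSecondHalf(c, output, outputPos);
                break;
        }
        return outputPos;
    }

    // Second half of the case statements, ending with the original default.
    private static int foldSecondHalf(char c, char[] output, int outputPos) {
        switch (c) {
            case '\u00DF': // sharp s
                output[outputPos++] = 's';
                output[outputPos++] = 's';
                break;
            case '\u00F8': // o with stroke
                output[outputPos++] = 'o';
                break;
            case '\u00FA': // u with acute
            case '\u00FC': // u with diaeresis
                output[outputPos++] = 'u';
                break;
            default:
                // No mapping known: keep the character as it is.
                output[outputPos++] = c;
                break;
        }
        return outputPos;
    }
}
```

Because the case bodies are moved rather than rewritten, a diff of such a change would mostly show one method boundary inserted part-way through the cases.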
> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method
> size
> ----------------------------------------------------------------------------------
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 6.2.1
> Reporter: Karl von Randow
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch
> statement and is too large for the HotSpot compiler to compile, causing a
> performance problem.
> The method is about 13KB compiled, versus the 8KB HotSpot limit. So splitting
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance
> increase.
> In the test code below you can see how slow the fold method is, even when it
> is using the shortcut when the character is less than 0x80, compared to an
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input
> file as per SOLR-2013 would be a better replacement for this method in this
> class?
> {code:java}
> import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
> import org.junit.Assert;
> import org.junit.Test;
>
> public class ASCIIFoldingFilterPerformanceTest {
>
>     private static final int ITERATIONS = 1_000_000;
>
>     // ASCII-only input: every character takes the c < 0x80 shortcut.
>     @Test
>     public void testFoldShortString() {
>         char[] input = "testing".toCharArray();
>         char[] output = new char[input.length * 4];
>         for (int i = 0; i < ITERATIONS; i++) {
>             ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
>         }
>     }
>
>     // Accented input: every character goes through the switch.
>     @Test
>     public void testFoldShortAccentedString() {
>         char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>         char[] output = new char[input.length * 4];
>         for (int i = 0; i < ITERATIONS; i++) {
>             ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, input.length);
>         }
>     }
>
>     // Inline copy of the ASCII shortcut, for comparison with the method call.
>     @Test
>     public void testManualFoldTinyString() {
>         char[] input = "t".toCharArray();
>         char[] output = new char[input.length * 4];
>         for (int i = 0; i < ITERATIONS; i++) {
>             int k = 0;
>             for (int j = 0; j < 1; ++j) {
>                 final char c = input[j];
>                 if (c < '\u0080') {
>                     output[k++] = c;
>                 } else {
>                     Assert.fail("unexpected non-ASCII character");
>                 }
>             }
>         }
>     }
> }
> {code}