On 06/04/16 08:47, Stefan Seelmann wrote:
> On 04/06/2016 01:05 AM, Emmanuel Lécharny wrote:
>> So for the record, after a couple of hours working on it tonight, I got
>> the DeepTrimToLowerNormalizer() working fine, with tests passing.
>>
>> I was also able to improve the performance of the beast: from 20
>> seconds to normalize 10,000,000 Strings like "xs crvtbynU
>> Jikl7897790", down to 4.3s. I just assumed that most of the time, we
>> will deal with chars between 0x00 and 0x7F, and wrote a specific
>> function for that. If we have chars above 0x7F, then an exception is
>> thrown and we fall back to the complex process, which will then take
>> 47s instead of 20s.
>>
>> So this is a balance:
>> - we have an implementation that covers all the chars, and takes 20s for
>> 10M Strings
>> - we have an implementation that tries to process the String if chars
>> are in [0x00, 0x7F] and takes 4.3s for 10M Strings, but takes 47
>> seconds if we have a char outside this range.
>>
>> Besides the obvious gain, there is another reason why I wanted to do that:
>> processing IA5String values will benefit from this separation, and
>> that covers numerous AttributeTypes (like mail, homeDirectory,
>> memberUid, krb5principalname, krb5Realmname, and a lot more).
>>
>> WDYT? Going for an average of 20s no matter what, or accepting a huge
>> penalty when the String does not contain ASCII chars?
> I'd go for the 2nd optimized way.
>
> Is the cause of the penalty only the exception-throw-catch?
It's part of it. Changing the code to throw a static Exception, instead of creating a new exception every time, saves roughly 20s. This is probably the way to go: we benefit from a huge improvement when the String is pure ASCII, and the penalty is just the time spent in this phase if this is not the case.

Here are the new numbers:

- pure ASCII String: 4s
- non-ASCII String: 24.8s
- catch-all solution (i.e., no ASCII optimisation): 20s

Way better than the previous solution, simply by adding:

    /** An exception used to get out of the map method quickly */
    private static final ArrayIndexOutOfBoundsException AIOOBE = new ArrayIndexOutOfBoundsException();

and throwing AIOOBE in the ascii method...

Otherwise, there are other parts that can be improved: we always process a String in the map(), normalize(), checkProhibited() and insignifiantSpacesString() methods. That means we get the char[] out of the String and create a new String in each of them. We could most certainly do it only once, at least for the last 2 methods, which are run consecutively (the normalize() method uses a Java method that expects a String). I'll check that tonight.

Thanks for the feedback!
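
For the record, the ASCII fast path with a preallocated exception could look roughly like the sketch below. This is only an illustration, not the actual DeepTrimToLowerNormalizer code: the class name, the ASCII_MAP table, normalizeAscii() and normalizeGeneral() are hypothetical names, the mapping rules are simplified (whitespace collapsed to one space, A-Z lower-cased), and the fallback is a plain stand-in for the full processing.

```java
import java.util.Locale;

final class AsciiFastPathSketch {
    /** Preallocated exception, used only as a signal to leave the fast path quickly. */
    private static final ArrayIndexOutOfBoundsException AIOOBE =
        new ArrayIndexOutOfBoundsException();

    /** Hypothetical lookup table: whitespace becomes ' ', everything else is lower-cased. */
    private static final char[] ASCII_MAP = new char[128];

    static {
        for (char c = 0; c < 128; c++) {
            ASCII_MAP[c] = Character.isWhitespace(c) ? ' ' : Character.toLowerCase(c);
        }
    }

    /** Fast path: one pass over the chars, throws AIOOBE on the first char above 0x7F. */
    static String normalizeAscii(String value) {
        char[] in = value.toCharArray();
        char[] out = new char[in.length];
        int pos = 0;
        boolean pendingSpace = false;

        for (char c : in) {
            if (c > 0x7F) {
                throw AIOOBE; // no allocation, no stack trace filled in here
            }
            char mapped = ASCII_MAP[c];
            if (mapped == ' ') {
                // Drop leading spaces and collapse runs of spaces
                pendingSpace = pos > 0;
            } else {
                if (pendingSpace) {
                    out[pos++] = ' ';
                    pendingSpace = false;
                }
                out[pos++] = mapped;
            }
        }
        // A trailing pending space is never written, which trims the end
        return new String(out, 0, pos);
    }

    /** Stand-in for the full (slow) processing that handles every char. */
    static String normalizeGeneral(String value) {
        return value.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /** Try the ASCII path first, fall back to the general one on AIOOBE. */
    static String normalize(String value) {
        try {
            return normalizeAscii(value);
        } catch (ArrayIndexOutOfBoundsException e) {
            return normalizeGeneral(value);
        }
    }
}
```

One caveat with this approach: the stack trace of a static exception is captured once, when the class is initialized, so it says nothing about where the exception was actually thrown. That is fine here because the exception is only used as a control-flow signal and is caught immediately.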