https://issues.apache.org/jira/browse/LANG-877
On Fri, Mar 8, 2013 at 3:21 PM, Henri Yandell <[email protected]> wrote: > Noting Lawrence's response. I'll get his email's contents added to JIRA :) > > ---------- Forwarded message ---------- > From: Lawrence Angrave <[email protected]> > Date: Fri, Mar 8, 2013 at 2:36 PM > Subject: Re: [lang] Fix for Apache2,Apache3 > StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii > char) & simpler unicode hex logic > To: Henri Yandell <[email protected]> > > > On 3/8/13 12:51 PM, Henri Yandell wrote: >> >> Thanks for the report Lawrence. >> >> With regards to Sebb's comment, I'm happy to put in a JIRA, but wanted >> to resolve the attribution comment first: >> >> "> Legal: I, Lawrence Angrave (email address, [email protected] >>> >>> <mailto:[email protected]>), am the author and copyright holder of the >>> new code provided in this email, and share and re-license this code under >>> the current Apache2 license." >> >> Our solution on attribution of contributions is to add them to the >> contributor list (inside the pom.xml in the source and published on >> this page http://commons.apache.org/proper/commons-lang/team-list.html). >> >> Can you confirm that's acceptable? > > Yes! Thanks! > > > > >> Typically we don't put the mail address in for contributors to reduce >> spam hitting them. > > Very sensible. Glad you found it helpful. > Best, > Lawrence. > >> Thanks again, >> >> Hen >> >> On Tue, Mar 5, 2013 at 2:28 PM, Lawrence Angrave <[email protected]> >> wrote: >>> >>> Hi, >>> >>> Some comments that are relevant to Apache3 UnicodeEscaper and Apache2's >>> StringEscapeUtils.java >>> Summary- >>> >>> * I noticed the current Apache code creates three String objects each >>> time it writes a unicode hexadecimal value. >>> * Apache3 can also create a char[] array per character translation >>> (but I do include a fix for that) >>> * This is a easy-to-fix performance bottleneck when writing many >>> non-ascii characters. >>> * The logic to test for unicode values of different magnitudes can >>> also be simplified. >>> * Benchmark and code fixes for Apache2 and Apache 3 are included. I do >>> not have time to become an Apache maintainer. use or ignore at your >>> choice. >>> * I'm not interested in being a developer for Commons Lang Use it or >>> not - that's a choice for Commons Lang developers. >>> >>> A simple fix more than doubles the string escape speed (40 ms v 100ms to >>> translate all unicode characters) for Apache3. >>> The older Apache2-style implementation can now translate all unicode >>> characters in 8ms. >>> >>> The existing Apache3/Apache2 write unicode hex values like this- >>> if (codepoint > 0xfff) { >>> out.write("\\u" + hex(codepoint)); >>> } else if (codepoint > 0xff) { >>> out.write("\\u0" + hex(codepoint)); >>> } else if (codepoint > 0xf) { >>> out.write("\\u00" + hex(codepoint)); >>> } else { >>> out.write("\\u000" + hex(codepoint)); >>> } >>> The hex() function, >>> //hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH); >>> also creates two string objects, so we have 3 objects per unicode hex value. >>> >>> FIX: >>> >>> The padding logic can be simplified and per-character object creation can be >>> eliminated by writing hex digits directly >>> out.write("\\u"); >>> out.write(HEX_DIGIT[(codepoint >> 12) & 15]); >>> out.write(HEX_DIGIT[(codepoint >> 8) & 15]); >>> out.write(HEX_DIGIT[(codepoint >> 4) & 15]); >>> out.write(HEX_DIGIT[(codepoint) & 15]); >>> >>> >>> where HEX_DIGIT is >>> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray(); >>> I believe this is safe for all Locales. >>> >>> >>> When benchmarked it was disconcerting that Apache3 is still five times >>> slower (40ms instead of 8ms) than my rewritten Apache2 version (included >>> below). >>> My guess is that there are other unnecessary per-character object creation >>> issues still lurking Here's one example - >>> CharSequenceTranslator.translate(CharSequence input, Writer out) : >>> char[] c = *Character.toChars*(Character.codePointAt(input, pos)) >>> >>> For better performance this should use toChars(int codePoint, char[] dst, >>> int dstIndex) , which can re-use the dst char array >>> >>> >>> >>> The benchmark, my version of a Apache2-style escapeJavaStyleString >>> implementation and the code fix for UnicodeEscaper.java are included below. >>> If you do find this useful, some source-code attribution would be >>> appreciated. >>> Legal: I, Lawrence Angrave (email address, [email protected] >>> <mailto:[email protected]>), am the author and copyright holder of the >>> new code provided in this email, and share and re-license this code under >>> the current Apache2 license. >>> >>> I hope this email does not go into a blackhole... Feel free to forward it to >>> the relevant maintainers. >>> >>> Regards, >>> Lawrence. >>> >>> public static final char[] HEX_DIGIT = >>> "0123456789ABCDEF".toCharArray(); >>> public static final char[] CONTROL_CHARS; // non-zero entries for >>> special case control characters >>> static { >>> CONTROL_CHARS = new char[32]; >>> CONTROL_CHARS['\b'] = 'b'; >>> CONTROL_CHARS['\n'] = 'n'; >>> CONTROL_CHARS['\t'] = 't'; >>> CONTROL_CHARS['\f'] = 'f'; >>> CONTROL_CHARS['\r'] = 'r'; >>> } >>> public static void escapeJavaStyleString(Writer out, String s, boolean >>> escapeSingleQuote) throws IOException { >>> // Apache2 makes the following checks, so we will too- >>> if(out==null) throw new IllegalArgumentException("The Writer must not >>> be >>> null"); >>> if(s == null) return; >>> >>> final int len = s.length(); >>> for(int i =0; i < len;i++) >>> escapeChar(out,s.charAt(i), escapeSingleQuote); >>> } >>> public static void escapeChar(Writer out, char c, boolean >>> escapeSingleQuote) >>> throws IOException { >>> // Most common case >>> if (c >= 32 && c < 127) { >>> if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote)) >>> out.write('\\'); >>> out.write(c); >>> return; >>> } >>> out.write('\\'); >>> if (c < 32 && CONTROL_CHARS[c] != 0) { >>> out.write(CONTROL_CHARS[c]); >>> return; >>> } >>> // Fast 4 digit uppercase hexadecimal without object creation >>> out.write('u'); >>> out.write(HEX_DIGIT[(c >> 12) & 15]); >>> out.write(HEX_DIGIT[(c >> 8) & 15]); >>> out.write(HEX_DIGIT[(c >> 4) & 15]); >>> out.write(HEX_DIGIT[(c) & 15]); >>> } >>> >>> >>> >>> >>> FYI The benchmark test just writes all possible unicode characters into a >>> null writer: >>> Writer nullWriter = new Writer() { >>> public void write(String s) { >>> }; >>> >>> public void write(int c) { >>> } >>> >>> public void close() throws IOException { >>> } >>> >>> public void flush() throws IOException { >>> } >>> >>> public void write(char[] cbuf, int off, int len) throws >>> IOException { >>> } >>> }; >>> StringBuilder sb = new StringBuilder(0x10000); >>> for (int i = 0; i <= 0xffff; i++) >>> sb.append((char) i); >>> String allChars = sb.toString(); >>> >>> long t1 = System.currentTimeMillis(); >>> StringEscaper.escapeJavaStyleString(nullWriter, allChars, true); >>> long t2 = System.currentTimeMillis(); >>> System.out.println(t2 - t1); >>> >>> long t3 = System.currentTimeMillis(); >>> CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA; >>> translator.translate(allChars, nullWriter); >>> long t4 = System.currentTimeMillis(); >>> System.out.println(t4 - t3); >>> >>> >>> >>> The modification to Apache3 UnicodeEscaper : >>> >>> if (codepoint > 0xffff) { >>> // TODO: Figure out what to do. Output as two Unicodes? >>> // Does this make this a Java-specific output class? >>> out.write("\\u" + hex(codepoint)); >>> } else if (1 == 0) { //*OLD SLOW CODE* (can be removed) >>> *if (codepoint > 0xfff) { >>> out.write("\\u" + hex(codepoint)); >>> } else if (codepoint > 0xff) { >>> out.write("\\u0" + hex(codepoint)); >>> } else if (codepoint > 0xf) { >>> out.write("\\u00" + hex(codepoint)); >>> } else { >>> out.write("\\u000" + hex(codepoint)); >>> }* >>> } else { // *NEW FAST CODE* >>> * out.write("\\u"); >>> out.write(HEX_DIGIT[(codepoint >> 12) & 15]); >>> out.write(HEX_DIGIT[(codepoint >> 8) & 15]); >>> out.write(HEX_DIGIT[(codepoint >> 4) & 15]); >>> out.write(HEX_DIGIT[(codepoint) & 15]);* >>> } >>> >>> *and add public static final char[] HEX_DIGIT = >>> "0123456789ABCDEF".toCharArray();** >>> * >>> >>> ps. >>> The home page for Commons lang >>> http://commons.apache.org/proper/commons-lang/) >>> has the following 404 link on the left >>> Contributing Patches >>> http://commons.apache.org/proper/patches.html >>> (I believe the 'proper' is unnecessary) >>> --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
