Re: [lang] Fix for Apache2,Apache3 StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii char) & simpler unicode hex logic

Henri Yandell Fri, 08 Mar 2013 15:25:30 -0800

https://issues.apache.org/jira/browse/LANG-877


On Fri, Mar 8, 2013 at 3:21 PM, Henri Yandell <flame...@gmail.com> wrote:
> Noting Lawrence's response. I'll get his email's contents added to JIRA :)
>
> ---------- Forwarded message ----------
> From: Lawrence Angrave <angr...@illinois.edu>
> Date: Fri, Mar 8, 2013 at 2:36 PM
> Subject: Re: [lang] Fix for Apache2,Apache3
> StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii
> char) & simpler unicode hex logic
> To: Henri Yandell <flame...@gmail.com>
>
>
> On 3/8/13 12:51 PM, Henri Yandell wrote:
>>
>> Thanks for the report Lawrence.
>>
>> With regards to Sebb's comment, I'm happy to put in a JIRA, but wanted
>> to resolve the attribution comment first:
>>
>> "> Legal:  I, Lawrence Angrave (email address, angr...@illinois.edu
>>>
>>> <mailto:angr...@illinois.edu>), am the author and copyright holder of the
>>> new code provided in this email, and share and re-license this code under
>>> the current Apache2 license."
>>
>> Our solution on attribution of contributions is to add them to the
>> contributor list (inside the pom.xml in the source and published on
>> this page http://commons.apache.org/proper/commons-lang/team-list.html).
>>
>> Can you confirm that's acceptable?
>
> Yes! Thanks!
>
>
>
>
>> Typically we don't put the mail address in for contributors to reduce
>> spam hitting them.
>
> Very sensible. Glad you found it helpful.
> Best,
> Lawrence.
>
>> Thanks again,
>>
>> Hen
>>
>> On Tue, Mar 5, 2013 at 2:28 PM, Lawrence Angrave <angr...@illinois.edu> 
>> wrote:
>>>
>>> Hi,
>>>
>>> Some comments that are relevant to  Apache3 UnicodeEscaper and Apache2's
>>> StringEscapeUtils.java
>>> Summary-
>>>
>>>   * I noticed the current Apache code creates three String objects each
>>>     time it writes a unicode hexadecimal  value.
>>>   * Apache3 can also create a char[] array per character translation
>>>     (but I do include a fix for that)
>>>   * This is a easy-to-fix performance bottleneck when writing many
>>>     non-ascii characters.
>>>   * The logic to test for unicode values of different magnitudes can
>>>     also be simplified.
>>>   * Benchmark and code fixes for Apache2 and Apache 3 are included. I do
>>>     not have time to become an Apache maintainer. use or ignore at your
>>>     choice.
>>>   * I'm not interested in being a developer for Commons Lang  Use it or
>>>     not  - that's a choice for Commons Lang developers.
>>>
>>> A simple fix more than doubles the string escape speed (40 ms v 100ms to
>>> translate all unicode characters) for Apache3.
>>> The older Apache2-style implementation can now translate all unicode
>>> characters in 8ms.
>>>
>>> The existing Apache3/Apache2 write unicode hex values like this-
>>>          if (codepoint > 0xfff) {
>>>                  out.write("\\u" + hex(codepoint));
>>>              } else if (codepoint > 0xff) {
>>>                  out.write("\\u0" + hex(codepoint));
>>>              } else if (codepoint > 0xf) {
>>>                  out.write("\\u00" + hex(codepoint));
>>>              } else {
>>>                  out.write("\\u000" + hex(codepoint));
>>>              }
>>> The hex() function,
>>> //hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH);
>>> also creates two string objects, so we have 3 objects per unicode hex value.
>>>
>>> FIX:
>>>
>>> The padding logic can be simplified and per-character object creation can be
>>> eliminated by writing hex digits directly
>>>              out.write("\\u");
>>>              out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>>>              out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>>>              out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>>>              out.write(HEX_DIGIT[(codepoint) & 15]);
>>>
>>>
>>> where  HEX_DIGIT  is
>>> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
>>> I believe this is safe for all Locales.
>>>
>>>
>>> When benchmarked it was disconcerting that Apache3 is still five times
>>> slower (40ms instead of 8ms) than my rewritten Apache2 version (included
>>> below).
>>> My guess is that there are other unnecessary per-character object creation
>>> issues still lurking Here's one example -
>>> CharSequenceTranslator.translate(CharSequence input, Writer out) :
>>>         char[] c = *Character.toChars*(Character.codePointAt(input, pos))
>>>
>>> For better performance this should use toChars(int codePoint,  char[] dst,
>>> int dstIndex) , which can re-use the dst char array
>>>
>>>
>>>
>>> The benchmark, my version of a  Apache2-style escapeJavaStyleString
>>> implementation and the code fix for UnicodeEscaper.java  are included below.
>>> If you do find this useful, some source-code attribution would be
>>> appreciated.
>>> Legal:  I, Lawrence Angrave (email address, angr...@illinois.edu
>>> <mailto:angr...@illinois.edu>), am the author and copyright holder of the
>>> new code provided in this email, and share and re-license this code under
>>> the current Apache2 license.
>>>
>>> I hope this email does not go into a blackhole... Feel free to forward it to
>>> the relevant maintainers.
>>>
>>> Regards,
>>> Lawrence.
>>>
>>>      public static final char[] HEX_DIGIT = 
>>> "0123456789ABCDEF".toCharArray();
>>>      public static final char[] CONTROL_CHARS; // non-zero entries for
>>> special case control characters
>>>      static {
>>>          CONTROL_CHARS = new char[32];
>>>          CONTROL_CHARS['\b'] = 'b';
>>>          CONTROL_CHARS['\n'] = 'n';
>>>          CONTROL_CHARS['\t'] = 't';
>>>          CONTROL_CHARS['\f'] = 'f';
>>>          CONTROL_CHARS['\r'] = 'r';
>>>      }
>>>      public static void  escapeJavaStyleString(Writer out, String s, boolean
>>> escapeSingleQuote) throws IOException {
>>> // Apache2 makes the following checks, so we will too-
>>>      if(out==null) throw new IllegalArgumentException("The Writer must not 
>>> be
>>> null");
>>>          if(s == null) return;
>>>
>>>          final int len = s.length();
>>>          for(int i =0; i < len;i++)
>>>              escapeChar(out,s.charAt(i), escapeSingleQuote);
>>>      }
>>>      public static void escapeChar(Writer out, char c, boolean
>>> escapeSingleQuote)
>>>              throws IOException {
>>>          // Most common case
>>>          if (c >= 32 && c < 127) {
>>>              if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote))
>>>                  out.write('\\');
>>>              out.write(c);
>>>              return;
>>>          }
>>>          out.write('\\');
>>>          if (c < 32 && CONTROL_CHARS[c] != 0) {
>>>              out.write(CONTROL_CHARS[c]);
>>>              return;
>>>          }
>>>          // Fast 4 digit uppercase hexadecimal without object creation
>>>          out.write('u');
>>>          out.write(HEX_DIGIT[(c >> 12) & 15]);
>>>          out.write(HEX_DIGIT[(c >> 8) & 15]);
>>>          out.write(HEX_DIGIT[(c >> 4) & 15]);
>>>          out.write(HEX_DIGIT[(c) & 15]);
>>>      }
>>>
>>>
>>>
>>>
>>> FYI The benchmark test just writes all possible unicode characters into a
>>> null writer:
>>>              Writer nullWriter = new Writer() {
>>>              public void write(String s) {
>>>              };
>>>
>>>              public void write(int c) {
>>>              }
>>>
>>>              public void close() throws IOException {
>>>              }
>>>
>>>              public void flush() throws IOException {
>>>              }
>>>
>>>              public void write(char[] cbuf, int off, int len) throws
>>> IOException {
>>>              }
>>>          };
>>>          StringBuilder sb = new StringBuilder(0x10000);
>>>          for (int i = 0; i <= 0xffff; i++)
>>>              sb.append((char) i);
>>>          String allChars = sb.toString();
>>>
>>>          long t1 = System.currentTimeMillis();
>>>          StringEscaper.escapeJavaStyleString(nullWriter, allChars, true);
>>>          long t2 = System.currentTimeMillis();
>>>          System.out.println(t2 - t1);
>>>
>>>          long t3 = System.currentTimeMillis();
>>>          CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA;
>>>          translator.translate(allChars, nullWriter);
>>>          long t4 = System.currentTimeMillis();
>>>          System.out.println(t4 - t3);
>>>
>>>
>>>
>>> The modification to Apache3 UnicodeEscaper :
>>>
>>>          if (codepoint > 0xffff) {
>>>              // TODO: Figure out what to do. Output as two Unicodes?
>>>              // Does this make this a Java-specific output class?
>>>              out.write("\\u" + hex(codepoint));
>>>          } else if (1 == 0) { //*OLD SLOW CODE* (can be removed)
>>> *if (codepoint > 0xfff) {
>>>                  out.write("\\u" + hex(codepoint));
>>>              } else if (codepoint > 0xff) {
>>>                  out.write("\\u0" + hex(codepoint));
>>>              } else if (codepoint > 0xf) {
>>>                  out.write("\\u00" + hex(codepoint));
>>>              } else {
>>>                  out.write("\\u000" + hex(codepoint));
>>>              }*
>>>          } else { // *NEW FAST CODE*
>>> *            out.write("\\u");
>>>              out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>>>              out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>>>              out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>>>              out.write(HEX_DIGIT[(codepoint) & 15]);*
>>>          }
>>>
>>> *and add    public static final char[] HEX_DIGIT =
>>> "0123456789ABCDEF".toCharArray();**
>>> *
>>>
>>> ps.
>>> The home page for Commons lang
>>> http://commons.apache.org/proper/commons-lang/)
>>>   has the following 404 link on the left
>>>      Contributing Patches
>>> http://commons.apache.org/proper/patches.html
>>> (I believe the 'proper' is unnecessary)
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [lang] Fix for Apache2,Apache3 StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii char) & simpler unicode hex logic

Reply via email to