[jira] Commented: (THRIFT-765) Improved string encoding and decoding performance

Robert Muir (JIRA) Fri, 03 Sep 2010 19:58:58 -0700

    [ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906212#action_12906212 ]


Robert Muir commented on THRIFT-765:
------------------------------------

bq. Do you think we should try some invalid strings as well? It should just 
lead to encoding exceptions, right?

The (commented-out) test generates purely random bytes for decode, and purely 
random chars for encode.

You can safely enable the random-chars encode test if you think there are no 
encode bugs in your JDK (Sun's is OK).
You cannot safely enable the random-bytes decode test, because of the Sun 
decode bug I mentioned (see testJDKCorrectness to test your JDK).
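
On the question of whether invalid input just leads to exceptions: a minimal sketch of strict decoding with the stock JDK decoder, for comparison with a custom decoder. The class and method names (StrictUtf8, decodeStrict, isWellFormed) are hypothetical and are not taken from the patch or from testJDKCorrectness. Which malformed sequences a given JDK actually rejects depends on that JDK, which is the point of the warning above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the THRIFT-765 patch.
final class StrictUtf8 {
  // Decodes bytes as UTF-8 and reports malformed input as an exception
  // instead of silently substituting U+FFFD (as new String(bytes, "UTF-8") does).
  static String decodeStrict(byte[] bytes) throws CharacterCodingException {
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    return decoder.decode(ByteBuffer.wrap(bytes)).toString();
  }

  // Convenience wrapper: true if the JDK decoder accepts the input.
  static boolean isWellFormed(byte[] bytes) {
    try {
      decodeStrict(bytes);
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }
}
```

A random-bytes test could compare isWellFormed(bytes) against the custom decoder's accept/throw behaviour, but only on a JDK whose own decoder is known to be correct.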

I reported the bug to Sun here: 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6982052

By the way, you can speed up decode further for the multibyte case:
{code}
      if (state != UTF8_ACCEPT) {
        // non-starter, or following illegal byte sequence
        if (state == UTF8_REJECT)               // remove this line
          throw new CharacterCodingException(); // remove this line also
{code}

This is because the DFA is designed such that once the decoder enters the 
reject state, it never leaves it.
So the check is redundant: the exception will be thrown at the end anyway 
(sorry for leaving the check in the code).
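
To illustrate why the per-byte reject check can go, here is a small sketch of a UTF-8 validating DFA with an absorbing reject state. It is not the code from the patch: it uses a switch instead of whatever representation the patch uses, it only validates (it does not produce chars), and every name other than UTF8_ACCEPT and UTF8_REJECT is hypothetical.

```java
// Hypothetical sketch, not the THRIFT-765 decoder.
final class Utf8DfaSketch {
  static final int UTF8_ACCEPT = 0;
  static final int UTF8_REJECT = 1;
  // Intermediate states: how many plain continuation bytes are still needed,
  // plus special states for lead bytes whose next byte has a restricted range.
  static final int NEED1 = 2, NEED2 = 3, NEED3 = 4;
  static final int AFTER_E0 = 5, AFTER_ED = 6, AFTER_F0 = 7, AFTER_F4 = 8;

  static boolean in(int b, int lo, int hi) {
    return b >= lo && b <= hi;
  }

  // One DFA step. b is an unsigned byte value (0..255).
  static int next(int state, int b) {
    switch (state) {
      case UTF8_ACCEPT:
        if (b <= 0x7F) return UTF8_ACCEPT;
        if (in(b, 0xC2, 0xDF)) return NEED1;
        if (b == 0xE0) return AFTER_E0;        // excludes overlong 3-byte forms
        if (b == 0xED) return AFTER_ED;        // excludes surrogates
        if (in(b, 0xE1, 0xEF)) return NEED2;
        if (b == 0xF0) return AFTER_F0;        // excludes overlong 4-byte forms
        if (in(b, 0xF1, 0xF3)) return NEED3;
        if (b == 0xF4) return AFTER_F4;        // excludes values above U+10FFFF
        return UTF8_REJECT;
      case NEED1:    return in(b, 0x80, 0xBF) ? UTF8_ACCEPT : UTF8_REJECT;
      case NEED2:    return in(b, 0x80, 0xBF) ? NEED1 : UTF8_REJECT;
      case NEED3:    return in(b, 0x80, 0xBF) ? NEED2 : UTF8_REJECT;
      case AFTER_E0: return in(b, 0xA0, 0xBF) ? NEED1 : UTF8_REJECT;
      case AFTER_ED: return in(b, 0x80, 0x9F) ? NEED1 : UTF8_REJECT;
      case AFTER_F0: return in(b, 0x90, 0xBF) ? NEED2 : UTF8_REJECT;
      case AFTER_F4: return in(b, 0x80, 0x8F) ? NEED2 : UTF8_REJECT;
      default:
        // UTF8_REJECT is absorbing: every input keeps the DFA in reject.
        return UTF8_REJECT;
    }
  }

  // No reject check inside the loop; a single comparison at the end is enough,
  // because a rejected input can never return to UTF8_ACCEPT.
  static boolean isValid(byte[] bytes) {
    int state = UTF8_ACCEPT;
    for (byte raw : bytes) {
      state = next(state, raw & 0xFF);
    }
    return state == UTF8_ACCEPT;
  }
}
```

The trade-off is that malformed input is detected only after the whole buffer has been scanned, which is fine when invalid strings are rare and the common path is what matters.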


> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Java - Library
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.4
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, 
> THRIFT-765.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and 
> deserialization is string encoding. For some inscrutable reason, 
> String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's 
> writeUTF method has a faster implementation of UTF-8 encoding and decoding. 
> We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
