[
https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906212#action_12906212
]
Robert Muir commented on THRIFT-765:
------------------------------------
bq. Do you think we should try some invalid strings as well? It should just
lead to encoding exceptions, right?
The (commented-out) test generates purely random bytes for decode and purely
random chars for encode.
You can safely enable the random-chars encode test if you think there are no
encode bugs in your JDK (Sun's is OK).
You cannot safely enable the random-bytes decode test, because of the Sun
decode bug I mentioned (see testJDKCorrectness to test your JDK).
I reported the bug to Sun here:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6982052
By the way, you can speed up decode further for the multibyte case:
{code}
if (state != UTF8_ACCEPT) {
  // non-starter, or following illegal byte sequence
  if (state == UTF8_REJECT)                 // remove this line
    throw new CharacterCodingException();   // remove this line also
{code}
This is because the DFA is designed so that once the decoder enters the reject
state, it never leaves it.
The check is therefore redundant: the exception is thrown at the end anyway
(sorry for leaving the check in the code).
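To illustrate why the in-loop check can be dropped, here is a minimal sketch of a validator with an absorbing reject state. It is not the decoder from the patch: the class name, state names, and transition function are hypothetical, and the DFA is deliberately simplified (it only checks lead and continuation byte counts, and does not reject overlong forms or surrogates). The point is only that checking for REJECT inside the loop and checking once at the end give the same answer.

```java
import java.nio.charset.CharacterCodingException;

/** Toy UTF-8 validator (hypothetical; not the patch's DFA or tables). */
final class Utf8RejectSketch {
    static final int ACCEPT = 0, REJECT = 1, NEED1 = 2, NEED2 = 3, NEED3 = 4;

    /** One DFA step. Simplified: overlong forms and surrogates are not rejected. */
    static int next(int state, byte raw) {
        int b = raw & 0xFF;
        boolean continuation = (b & 0xC0) == 0x80;
        switch (state) {
            case ACCEPT:
                if (b < 0x80) return ACCEPT;                 // ASCII
                if (b >= 0xC2 && b <= 0xDF) return NEED1;    // 2-byte lead
                if (b >= 0xE0 && b <= 0xEF) return NEED2;    // 3-byte lead
                if (b >= 0xF0 && b <= 0xF4) return NEED3;    // 4-byte lead
                return REJECT;                               // stray continuation or invalid lead
            case NEED1: return continuation ? ACCEPT : REJECT;
            case NEED2: return continuation ? NEED1 : REJECT;
            case NEED3: return continuation ? NEED2 : REJECT;
            default:    return REJECT;                       // REJECT is absorbing: no way out
        }
    }

    /** Validates the input; the in-loop REJECT check is optional by construction. */
    static void validate(byte[] in, boolean earlyRejectCheck) throws CharacterCodingException {
        int state = ACCEPT;
        for (byte b : in) {
            state = next(state, b);
            if (earlyRejectCheck && state == REJECT) {
                throw new CharacterCodingException();        // the "redundant" check
            }
        }
        if (state != ACCEPT) {
            throw new CharacterCodingException();            // catches REJECT and truncated input
        }
    }

    /** Convenience wrapper that turns the exception into a boolean. */
    static boolean accepts(byte[] in, boolean earlyRejectCheck) {
        try {
            validate(in, earlyRejectCheck);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

With this sketch, accepts(bytes, true) and accepts(bytes, false) agree for every input. The only observable difference is when the exception is raised: without the in-loop check, invalid input is scanned to the end before failing, which trades a later failure on bad input for one fewer branch per byte on good input.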
> Improved string encoding and decoding performance
> -------------------------------------------------
>
> Key: THRIFT-765
> URL: https://issues.apache.org/jira/browse/THRIFT-765
> Project: Thrift
> Issue Type: Improvement
> Components: Java - Library
> Affects Versions: 0.2
> Reporter: Bryan Duxbury
> Assignee: Bryan Duxbury
> Fix For: 0.4
>
> Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch,
> THRIFT-765.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and
> deserialization is string encoding. For some inscrutable reason,
> String.getBytes("UTF-8") is slow.
> However, it's recently been brought to my attention that DataOutputStream's
> writeUTF method has a faster implementation of UTF-8 encoding and decoding.
> We should use this style of encoding.