[ https://issues.apache.org/jira/browse/HBASE-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003260#comment-13003260 ]
ryan rawson commented on HBASE-3480:
------------------------------------

In my test, the cost of serialization was larger than the time saved transmitting the data on the wire. I think we are going to put a freeze on RPC changes soon; we need to be thinking next gen.

> Reduce the size of Result serialization
> ---------------------------------------
>
>                 Key: HBASE-3480
>                 URL: https://issues.apache.org/jira/browse/HBASE-3480
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.0
>            Reporter: ryan rawson
>         Attachments: HBASE-3480-lzf.txt, HBASE-3480.txt
>
>
> When faced with a gigabit ethernet network connection, things are pretty slow
> actually. For example, take a 2 MB reply: at a 120 MB/sec line rate, it takes
> about 16 ms to transfer that data across a GigE line.
> This is a pretty significant amount of time.
> So this JIRA is about reducing the size of the Result[] serialization. By
> exploiting family, qualifier and rowkey duplication, I created a simple
> encoding scheme that uses a dictionary instead of literal strings.
> In my testing, I am seeing some success with the sizes. Average serialized
> size is about 1/2 of previous, but time to serialize on the regionserver side
> is way up, by a factor of 10x. This might be due to the simplistic first
> implementation, however.
> Here is the post-change size:
> grep 'Serialized size' * | perl -ne '/Serialized size: (\d+?) in (\d+?) ns/ ; print $1, " ", $2, "\n" if $1 > 10000;' | cut -f1 -d' ' | perl -ne '$sum += $_; $count++; END {print $sum/$count, "\n"}'
> 377047.1125
> Here is the pre-change size:
> grep 'Serialized size' * | perl -ne '/Serialized size: (\d+?) in (\d+?) ns/ ; print $1, " ", $2, "\n" if $1 > 10000;' | cut -f1 -d' ' | perl -ne '$sum += $_; $count++; END {print $sum/$count, "\n"}'
> 601078.505882353
> That is about 63% of the pre-change size, i.e. a reduction of roughly 37%.
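To make the dictionary idea in the description concrete, here is a minimal Python sketch. It is not the format used in the attached patches: the wire layout, the 16-bit ids, and the names `encode_dictionary`, `encode_literal` and `decode_dictionary` are all assumptions made for illustration. It only shows how repeated rowkey, family and qualifier byte strings can be written once and then referenced by id.

```python
import struct

def encode_literal(cells):
    """Baseline: write row, family and qualifier literally for every cell."""
    out = bytearray(struct.pack(">I", len(cells)))
    for row, family, qualifier, value in cells:
        for part in (row, family, qualifier):
            out += struct.pack(">H", len(part)) + part
        out += struct.pack(">I", len(value)) + value
    return bytes(out)

def encode_dictionary(cells):
    """Hypothetical layout: a table of distinct byte strings, then per-cell ids.

    Assumes fewer than 65536 distinct strings, since ids are 16-bit.
    """
    ids = {}        # byte string -> id
    entries = []    # id -> byte string, in first-seen order
    body = bytearray()

    def ref(key):
        # Assign the next free id the first time a string is seen.
        if key not in ids:
            ids[key] = len(entries)
            entries.append(key)
        return ids[key]

    for row, family, qualifier, value in cells:
        body += struct.pack(">HHHI", ref(row), ref(family), ref(qualifier), len(value))
        body += value

    out = bytearray(struct.pack(">H", len(entries)))
    for entry in entries:
        out += struct.pack(">H", len(entry)) + entry
    out += struct.pack(">I", len(cells))
    return bytes(out + body)

def decode_dictionary(data):
    """Inverse of encode_dictionary; returns a list of 4-tuples of bytes."""
    pos = 0
    (n_entries,) = struct.unpack_from(">H", data, pos)
    pos += 2
    entries = []
    for _ in range(n_entries):
        (n,) = struct.unpack_from(">H", data, pos)
        pos += 2
        entries.append(data[pos:pos + n])
        pos += n
    (n_cells,) = struct.unpack_from(">I", data, pos)
    pos += 4
    cells = []
    for _ in range(n_cells):
        r, f, q, vlen = struct.unpack_from(">HHHI", data, pos)
        pos += 10  # three 2-byte ids plus a 4-byte value length
        cells.append((entries[r], entries[f], entries[q], data[pos:pos + vlen]))
        pos += vlen
    return cells
```

The saving depends entirely on how much duplication the result set contains; a scan over wide rows that share one family benefits most. The extra hash lookup per field is also one plausible source of the added serialization time, though the issue does not confirm where the time goes.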
> But times are not so good. Here are some samples from the old implementation,
> as (size in bytes) (time in ns):
> 3874599 10685836
> 5582725 11525888
> So that is about 11 ms to serialize 3.9-5.6 MB of data.
> In the new implementation:
> 1898788 118504672
> 1630058 91133003
> This is 91-118 ms for serialized sizes of 1.6-1.9 MB.

-- 
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
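The trade-off behind the comment (serialization cost versus wire time) can be checked with a rough model. The sketch below is illustrative only: it assumes the 120 MB/sec line rate from the description with 1 MB = 10^6 bytes, ignores latency and deserialization, and treats the four sample lines as representative, even though the old and new samples are not necessarily the same replies. The function names are hypothetical.

```python
LINE_RATE_BYTES_PER_SEC = 120e6  # 120 MB/sec, the line rate quoted in the issue

def transfer_ms(size_bytes, rate=LINE_RATE_BYTES_PER_SEC):
    """Time to push size_bytes over the wire at the given rate, in ms."""
    return size_bytes / rate * 1e3

def total_ms(size_bytes, serialize_ns):
    """Serialization time plus wire time, in ms (simplified model)."""
    return serialize_ns / 1e6 + transfer_ms(size_bytes)

# (size in bytes, serialization time in ns), copied from the samples in the issue
samples = {
    "old": [(3874599, 10685836), (5582725, 11525888)],
    "new": [(1898788, 118504672), (1630058, 91133003)],
}

for label, rows in samples.items():
    for size, ns in rows:
        print(f"{label}: {size} bytes, serialize {ns / 1e6:.1f} ms, "
              f"wire {transfer_ms(size):.1f} ms, total {total_ms(size, ns):.1f} ms")
```

Under these assumptions, the wire time saved by the smaller encoding is on the order of tens of milliseconds, while the added serialization time is larger, which is consistent with the conclusion in the comment.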