[ http://issues.apache.org/jira/browse/HADOOP-302?page=comments#action_12420152 ]
Milind Bhandarkar commented on HADOOP-302:
------------------------------------------

The recordio scheme also supports negative numbers, which are not needed here; dropping that support would let us save a few more bits.

> class Text (replacement for class UTF8) was: HADOOP-136
> -------------------------------------------------------
>
>          Key: HADOOP-302
>          URL: http://issues.apache.org/jira/browse/HADOOP-302
>      Project: Hadoop
>         Type: Improvement
>   Components: io
>     Reporter: Michel Tourn
>     Assignee: Hairong Kuang
>
> Just to verify, which length-encoding scheme are we using for class Text (aka LargeUTF8)
> a) The "UTF-8/Lucene" scheme? (highest bit of each byte is an extension bit,
>    which I think is what Doug is describing in his last comment) or
> b) the record-IO scheme in o.a.h.record.Utils.java:readInt
> Either way, note that:
> 1. UTF8.java and its successor Text.java need to read the length in two ways:
>    1a. consume 1+ bytes from a DataInput and
>    1b. parse the length within a byte array at a given offset
>    (1b is used for the "WritableComparator optimized for UTF8 keys").
>    o.a.h.record.Utils only supports the DataInput mode.
>    It is not clear to me what is the best way to extend this Utils code when you
>    need to support both reading modes
> 2. Methods like UTF8's WritableComparator are to be low overhead, in particular
>    there should be no Object allocation.
>    For the byte array case, the varlen-reader utility needs to be extended to
>    return both:
>    the decoded length and the length of the encoded length.
>    (so that the caller can do offset += encodedlength)
>
> 3. A String length does not need (small) negative integers.
> 4. One advantage of a) is that it is standard (or at least well-known and
>    natural) and there are no magic constants (like -120, -121, -124)
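
To make points 1 and 2 of the quoted description concrete, here is a minimal sketch of scheme a), where the highest bit of each byte is an extension bit. It is not the code in o.a.h.record.Utils or in Text.java. The class name VarLen, its method names, and the choice to pack both results into one long are assumptions made for illustration only. The byte-array reader returns the decoded length together with the number of bytes the encoded length occupied, without allocating any object.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** Hypothetical helper; names and layout are illustrative, not Hadoop's API. */
final class VarLen {
    private VarLen() {}

    /** Writes a non-negative length, 7 bits per byte, low-order group first. */
    static void writeLength(DataOutput out, int length) throws IOException {
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        while ((length & ~0x7F) != 0) {
            out.writeByte((length & 0x7F) | 0x80); // extension bit set: more bytes follow
            length >>>= 7;
        }
        out.writeByte(length); // last byte, extension bit clear
    }

    /** Mode 1a: consumes one or more bytes from a DataInput. */
    static int readLength(DataInput in) throws IOException {
        int b = in.readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /**
     * Mode 1b: parses the length inside a byte array at the given offset.
     * Returns a packed long (assumed convention): the high 32 bits hold the
     * number of bytes consumed, the low 32 bits hold the decoded length.
     * A primitive return value means no object is allocated.
     */
    static long readLength(byte[] buf, int offset) {
        int pos = offset;
        int b = buf[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            value |= (b & 0x7F) << shift;
        }
        return ((long) (pos - offset) << 32) | (value & 0xFFFFFFFFL);
    }

    /** Extracts the decoded length from a packed result. */
    static int value(long packed) {
        return (int) packed;
    }

    /** Extracts the size in bytes of the encoded length from a packed result. */
    static int encodedSize(long packed) {
        return (int) (packed >>> 32);
    }
}
```

A comparator working on raw bytes could then call long p = VarLen.readLength(buf, offset), advance with offset += VarLen.encodedSize(p), and compare the next VarLen.value(p) bytes. Since lengths are never negative, this sketch needs no sign handling and no marker constants.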
