FYI: this (open) JIRA might be interesting to you: <http://issues.apache.org/jira/browse/HADOOP-3788>
Alex

On Wed, Apr 8, 2009 at 7:18 PM, Todd Lipcon <t...@cloudera.com> wrote:

> On Wed, Apr 8, 2009 at 7:14 PM, bzheng <bing.zh...@gmail.com> wrote:
>
>> Thanks for the clarification. Though I still find it strange that the
>> get() method doesn't just return what's actually stored, regardless of
>> buffer size. Is there any reason why you'd want to use/examine what's
>> in the buffer?
>
> Because doing so requires an array copy. It's important for Hadoop
> performance to avoid needless copies of data. Most APIs that take
> byte[] arrays have a version that includes an offset and length.
>
> -Todd
>
>> Todd Lipcon-4 wrote:
>>
>>> Hi Bing,
>>>
>>> The issue here is that BytesWritable uses an internal buffer which is
>>> grown but not shrunk. The cause of this is that Writables in general
>>> are single instances that are shared across multiple input records.
>>> If you look at the internals of the input reader, you'll see that a
>>> single BytesWritable is instantiated, and then each time a record is
>>> read, it's read into that same instance. The purpose here is to avoid
>>> the allocation cost for each row.
>>>
>>> The end result is, as you've seen, that getBytes() returns an array
>>> which may be larger than the actual amount of data. In fact, the
>>> extra bytes (between .getSize() and .get().length) have undefined
>>> contents, not zero.
>>>
>>> Unfortunately, if the protobuffer API doesn't allow you to
>>> deserialize out of a smaller portion of a byte array, you're out of
>>> luck and will have to do the copy as you've mentioned. I imagine,
>>> though, that there's some way around this in the protobuffer API;
>>> perhaps you can use a ByteArrayInputStream here to your advantage.
>>> Hope that helps
>>> -Todd
>>>
>>> On Wed, Apr 8, 2009 at 4:59 PM, bzheng <bing.zh...@gmail.com> wrote:
>>>
>>>> I tried to store a protocol buffer as a BytesWritable in a sequence
>>>> file <Text, BytesWritable>. It's stored using SequenceFile.Writer(new
>>>> Text(key), new BytesWritable(protobuf.convertToBytes())). When
>>>> reading the values from the key/value pairs using value.get(), it
>>>> returns more than what's stored. However, value.getSize() returns the
>>>> correct number. This means that in order to convert the byte[] back
>>>> to a protocol buffer, I have to do Arrays.copyOf(value.get(),
>>>> value.getSize()). This happens on both version 0.17.2 and 0.18.3.
>>>> Does anyone know why this happens? Sample sizes for a few entries in
>>>> the sequence file are below. The extra bytes in value.get() all have
>>>> values of zero.
>>>>
>>>> value.getSize(): 7066   value.get().length: 10599
>>>> value.getSize(): 36456  value.get().length: 54684
>>>> value.getSize(): 32275  value.get().length: 54684
>>>> value.getSize(): 40561  value.get().length: 54684
>>>> value.getSize(): 16855  value.get().length: 54684
>>>> value.getSize(): 66304  value.get().length: 99456
>>>> value.getSize(): 26488  value.get().length: 99456
>>>> value.getSize(): 59327  value.get().length: 99456
>>>> value.getSize(): 36865  value.get().length: 99456
>>>>
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
>>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>> --
>> View this message in context:
>> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22963309.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
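The two workarounds discussed in the thread (copying the valid prefix with Arrays.copyOf, or streaming over just the valid range as Todd suggests with ByteArrayInputStream) can be sketched in plain Java with no Hadoop or protobuf dependency. The class and helper names (`TrimDemo`, `pad`, `trim`, `streamedLength`) are made up for illustration; `pad` just simulates the oversized backing array that getBytes() returns.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Arrays;

public class TrimDemo {
    // Simulates getBytes()/getSize(): the backing array is larger
    // than the valid data (Arrays.copyOf zero-fills the extra bytes).
    static byte[] pad(byte[] valid, int capacity) {
        return Arrays.copyOf(valid, capacity);
    }

    // Workaround 1: copy only the valid prefix, i.e. what
    // Arrays.copyOf(value.get(), value.getSize()) does in the thread.
    static byte[] trim(byte[] padded, int size) {
        return Arrays.copyOf(padded, size);
    }

    // Workaround 2: no copy at all; stream over the valid range only,
    // e.g. to feed a deserializer that accepts an InputStream.
    static int streamedLength(byte[] padded, int size) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(padded, 0, size);
        return in.readAllBytes().length;
    }

    public static void main(String[] args) throws IOException {
        byte[] valid = "hello".getBytes();
        byte[] padded = pad(valid, 8);                 // 3 trailing zero bytes
        System.out.println(trim(padded, 5).length);    // 5
        System.out.println(streamedLength(padded, 5)); // 5
    }
}
```

Note that the protobuf Java API does accept an offset/length pair (and an InputStream), so in practice either route avoids trusting the padded array length.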
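For what it's worth, the sizes reported above are consistent with a buffer that grows by a factor of 3/2 when a record doesn't fit and never shrinks (7066 * 3 / 2 = 10599, 36456 * 3 / 2 = 54684, 66304 * 3 / 2 = 99456). A minimal sketch of that reuse pattern, with `GrowOnlyBuffer` as a made-up stand-in rather than Hadoop's actual class:

```java
import java.util.Arrays;

public class GrowOnlyBuffer {
    private byte[] buf = new byte[0];
    private int size = 0;

    // Reads a record into the same instance; grows the backing array
    // by 3/2 when needed, but never shrinks it (mirroring the behavior
    // described for BytesWritable in the thread).
    public void set(byte[] record) {
        if (record.length > buf.length) {
            buf = new byte[record.length * 3 / 2];
        }
        System.arraycopy(record, 0, buf, 0, record.length);
        size = record.length;
    }

    public byte[] get() { return buf; }   // may be longer than size
    public int getSize() { return size; } // length of the valid prefix

    public static void main(String[] args) {
        GrowOnlyBuffer b = new GrowOnlyBuffer();
        b.set(new byte[7066]);
        System.out.println(b.get().length);  // 10599, matching the report
        b.set(new byte[5000]);               // smaller record: no shrink
        System.out.println(b.get().length);  // still 10599
        // Trim to the valid bytes before deserializing:
        byte[] exact = Arrays.copyOf(b.get(), b.getSize());
        System.out.println(exact.length);    // 5000
    }
}
```

This also explains why several records share one get().length in the samples: once the buffer has grown to 54684 or 99456 bytes, every subsequent smaller record reuses it as-is.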