Encoder with empty bytes deserializes with non-empty bytes

David Capwell Wed, 21 Feb 2018 21:56:16 -0800

I am trying to create a Encoder for protobuf data and noticed something
rather weird.  When we have a empty ByteString (not null, just empty), when
we deserialize we get back a empty array of length 8.  I took the generated
code and see something weird going on.


UnsafeRowWriter


   1.

   public void setOffsetAndSize(int ordinal, long currentCursor, long size) {

   2.

       final long relativeOffset = currentCursor - startingOffset;

   3.

       final long fieldOffset = getFieldOffset(ordinal);

   4.

       final long offsetAndSize = (relativeOffset << 32) | size;

   5.

   6.

       Platform.putLong(holder.buffer, fieldOffset, offsetAndSize);

   7.

     }



So this takes the size of the array and stores it... but its not the array
size, its how many bytes were added

rowWriter2.setOffsetAndSize(2, tmpCursor16, holder.cursor - tmpCursor16);



So since the data is empty the only method that moves the cursor forward is

arrayWriter1.initialize(holder, numElements1, 8);

which does the following

holder.cursor += (headerInBytes + fixedPartInBytes);

in a debugger I see that headerInBytes = 8 and fixedPartInBytes = 0.

Here is the header write


   1.

   Platform.putLong(holder.buffer, startingOffset, numElements);

   2.

       for (int i = 8; i < headerInBytes; i += 8) {

   3.

         Platform.putLong(holder.buffer, startingOffset + i, 0L);

   4.

       }




Ok so so far this makes sense, in order to deserialize you need to know
about the data, so all good. Now to look at the deserialize path


UnsafeRow.java

@Override
public byte[] getBinary(int ordinal) {
  if (isNullAt(ordinal)) {
    return null;
  } else {
    final long offsetAndSize = getLong(ordinal);
    final int offset = (int) (offsetAndSize >> 32);
    final int size = (int) offsetAndSize;
    final byte[] bytes = new byte[size];
    Platform.copyMemory(
      baseObject,
      baseOffset + offset,
      bytes,
      Platform.BYTE_ARRAY_OFFSET,
      size
    );
    return bytes;
  }
}



Since this doesn't read the header to return the user-bytes, it tries to
return header + user-data.



Is this expected? Am I supposed to filter out the header and force a
mem-copy to filter out for just the user-data? Since header appears to be
dynamic, how would I know the header length?

Thanks for your time reading this email.


Spark version: spark_2.11-2.2.1

Encoder with empty bytes deserializes with non-empty bytes

Reply via email to