I am trying to create a Encoder for protobuf data and noticed something rather weird. When we have a empty ByteString (not null, just empty), when we deserialize we get back a empty array of length 8. I took the generated code and see something weird going on.
UnsafeRowWriter 1. public void setOffsetAndSize(int ordinal, long currentCursor, long size) { 2. final long relativeOffset = currentCursor - startingOffset; 3. final long fieldOffset = getFieldOffset(ordinal); 4. final long offsetAndSize = (relativeOffset << 32) | size; 5. 6. Platform.putLong(holder.buffer, fieldOffset, offsetAndSize); 7. } So this takes the size of the array and stores it... but its not the array size, its how many bytes were added rowWriter2.setOffsetAndSize(2, tmpCursor16, holder.cursor - tmpCursor16); So since the data is empty the only method that moves the cursor forward is arrayWriter1.initialize(holder, numElements1, 8); which does the following holder.cursor += (headerInBytes + fixedPartInBytes); in a debugger I see that headerInBytes = 8 and fixedPartInBytes = 0. Here is the header write 1. Platform.putLong(holder.buffer, startingOffset, numElements); 2. for (int i = 8; i < headerInBytes; i += 8) { 3. Platform.putLong(holder.buffer, startingOffset + i, 0L); 4. } Ok so so far this makes sense, in order to deserialize you need to know about the data, so all good. Now to look at the deserialize path UnsafeRow.java @Override public byte[] getBinary(int ordinal) { if (isNullAt(ordinal)) { return null; } else { final long offsetAndSize = getLong(ordinal); final int offset = (int) (offsetAndSize >> 32); final int size = (int) offsetAndSize; final byte[] bytes = new byte[size]; Platform.copyMemory( baseObject, baseOffset + offset, bytes, Platform.BYTE_ARRAY_OFFSET, size ); return bytes; } } Since this doesn't read the header to return the user-bytes, it tries to return header + user-data. Is this expected? Am I supposed to filter out the header and force a mem-copy to filter out for just the user-data? Since header appears to be dynamic, how would I know the header length? Thanks for your time reading this email. Spark version: spark_2.11-2.2.1