Ok found my issue case c if c == classOf[ByteString] => StaticInvoke(classOf[Protobufs], ArrayType(ByteType), "fromByteString", parent :: Nil)
Should be case c if c == classOf[ByteString] => StaticInvoke(classOf[Protobufs], BinaryType, "fromByteString", parent :: Nil) This causes the java code to see a byte[] which uses a different code path than linked. Since I did ArrayType(ByteTyep) I had to wrap the data in a ArrayData class On Wed, Feb 21, 2018 at 9:55 PM, David Capwell <dcapw...@gmail.com> wrote: > I am trying to create a Encoder for protobuf data and noticed something > rather weird. When we have a empty ByteString (not null, just empty), when > we deserialize we get back a empty array of length 8. I took the generated > code and see something weird going on. > > UnsafeRowWriter > > > 1. > > public void setOffsetAndSize(int ordinal, long currentCursor, long size) { > > 2. > > final long relativeOffset = currentCursor - startingOffset; > > 3. > > final long fieldOffset = getFieldOffset(ordinal); > > 4. > > final long offsetAndSize = (relativeOffset << 32) | size; > > 5. > > 6. > > Platform.putLong(holder.buffer, fieldOffset, offsetAndSize); > > 7. > > } > > > > So this takes the size of the array and stores it... but its not the array > size, its how many bytes were added > > rowWriter2.setOffsetAndSize(2, tmpCursor16, holder.cursor - tmpCursor16); > > > > So since the data is empty the only method that moves the cursor forward is > > arrayWriter1.initialize(holder, numElements1, 8); > > which does the following > > holder.cursor += (headerInBytes + fixedPartInBytes); > > in a debugger I see that headerInBytes = 8 and fixedPartInBytes = 0. > > Here is the header write > > > 1. > > Platform.putLong(holder.buffer, startingOffset, numElements); > > 2. > > for (int i = 8; i < headerInBytes; i += 8) { > > 3. > > Platform.putLong(holder.buffer, startingOffset + i, 0L); > > 4. > > } > > > > > Ok so so far this makes sense, in order to deserialize you need to know > about the data, so all good. Now to look at the deserialize path > > > UnsafeRow.java > > @Override > public byte[] getBinary(int ordinal) { > if (isNullAt(ordinal)) { > return null; > } else { > final long offsetAndSize = getLong(ordinal); > final int offset = (int) (offsetAndSize >> 32); > final int size = (int) offsetAndSize; > final byte[] bytes = new byte[size]; > Platform.copyMemory( > baseObject, > baseOffset + offset, > bytes, > Platform.BYTE_ARRAY_OFFSET, > size > ); > return bytes; > } > } > > > > Since this doesn't read the header to return the user-bytes, it tries to > return header + user-data. > > > > Is this expected? Am I supposed to filter out the header and force a > mem-copy to filter out for just the user-data? Since header appears to be > dynamic, how would I know the header length? > > Thanks for your time reading this email. > > > Spark version: spark_2.11-2.2.1 >