Compact Strings and APIs for fast decoding of string data

Chris Vest Tue, 09 Feb 2016 07:22:26 -0800

Hi,

Aleksey Shipilev did a talk on his journey to implement compact strings and 
indified string concat at the JVM Tech Summit yesterday, and this reminded me 
that we (Neo4j) have a need for turning segments of DirectByteBuffers into 
Strings as fast as possible. If we already store the string data in Latin1, 
which is one of the two special encodings for compact strings, we’d ideally 
like to produce the String object with just the two necessary object 
allocations and a single, vectorised memory copy.

Our use case is that we are a database and we do our own file paging,
effectively having file data in a large set of DirectByteBuffers. We have
string data in our files in a number of different encodings, a popular one
being Latin1. Occasionally these String values span multiple buffers. We often
need to expose this data as String objects, in which case decoding the bytes
and turning them into a String is often very performance sensitive - to the
point of being one of our top bottlenecks for the given queries. Part of the
story is that in the case of Latin1, I’ll know up front exactly how many bytes
my string data takes up, though I might not know how many buffers are going to
be involved.

As far as I can tell, this is currently not possible using public APIs. With
private APIs it may be possible, but that would rely on the JIT to vectorise
the memory copying.

From an API standpoint, CharsetDecoder is close to home, but is not quite
there. It’s stateful and not thread-safe, so I either have to allocate new ones
every time or cache them in thread-locals. I’m also required to allocate the
receiving CharBuffer. Since I may need to decode from multiple buffers, I
realise that I might not be able to get away from allocating at least one extra
object to keep track of intermediate decoding state. The CharsetDecoder does
not have a method where I can specify the offset and length for the desired
part of the ByteBuffer I want to decode, which forces me to allocate views
instead.

A CharBuffer is allocated with a length up front, which is nice, but I
can’t restrict its encoding, so it has to allocate a char array instead of the
byte array that I really want. Even if it did allocate a byte array, the
CharBuffer is mutable, which would force String to do a defensive copy anyway.
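
To make the cost concrete, here is a minimal sketch of what the public APIs
allow today, assuming Latin1 data split over several buffers. The class name
DecoderToday and the bufs/offs/lens parameters are made up for illustration.
Note the thread-local decoder, the per-segment view, and the char-backed
CharBuffer that String still has to copy from.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class DecoderToday {
    // CharsetDecoder is stateful and not thread-safe, so keep one per thread.
    private static final ThreadLocal<CharsetDecoder> LATIN1 =
            ThreadLocal.withInitial(() -> StandardCharsets.ISO_8859_1.newDecoder());

    /** Decodes lens[i] bytes starting at offs[i] in bufs[i], concatenated in order. */
    static String decode(ByteBuffer[] bufs, int[] offs, int[] lens) {
        if (bufs.length == 0) {
            return "";
        }
        int total = 0;
        for (int len : lens) {
            total += len;
        }
        CharsetDecoder decoder = LATIN1.get().reset();
        // Latin1 maps one byte to one char, so the total length is known up front,
        // but the backing store is a char[] rather than a byte[].
        CharBuffer out = CharBuffer.allocate(total);
        for (int i = 0; i < bufs.length; i++) {
            // decode() has no offset/length overload, so allocate a view per segment.
            ByteBuffer view = bufs[i].duplicate();
            view.limit(offs[i] + lens[i]);
            view.position(offs[i]);
            boolean last = (i == bufs.length - 1);
            CoderResult result = decoder.decode(view, out, last);
            if (result.isError() || result.isOverflow()) {
                throw new IllegalArgumentException("decode failed: " + result);
            }
        }
        decoder.flush(out);
        out.flip();
        return out.toString(); // copies the chars once more into the String
    }
}
```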

One way I imagine this could be solved would be with a less dynamic kind of
decoder, where the target length is given upfront to the decoder. Buffers are
then consumed one by one, and a terminal method performs finishing sanity
checks (did we get all the bytes we were promised?) and returns the result.

StringDecoder decoder = Charset.forName("latin1")
    .newStringDecoder(lengthInCharactersOrBytesImNotSureWhichIsBest);
String result = decoder.decode(buf1, off1, len1)
    .decode(buf2, off2, len2)
    .done();

This will in principle allow the string decoding to be 2 small allocations, an
array allocation without zeroing, and a sequence of potentially vectorised
memcpys. I don’t see any potentially troubling interactions with fused Strings
either, since all the knowledge (except for the string data itself) needed to
allocate the String objects is available from the get-go.
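
The shape of such a decoder can be approximated on top of public APIs. The
following is only a sketch under assumptions: the class name
Latin1StringDecoder is hypothetical, the length is taken in bytes, and only
Latin1 is handled. It cannot hand its byte array over to the String, so
done() still pays for one extra copy (or inflation to chars on a JDK without
compact strings), which is exactly the cost a JDK-internal version could
avoid.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class Latin1StringDecoder {
    private final byte[] bytes; // target storage, sized once up front
    private int filled;         // number of bytes received so far

    Latin1StringDecoder(int lengthInBytes) {
        this.bytes = new byte[lengthInBytes];
    }

    /** Appends len bytes starting at absolute offset off in buf. */
    Latin1StringDecoder decode(ByteBuffer buf, int off, int len) {
        if (len < 0 || len > bytes.length - filled) {
            throw new IllegalArgumentException("segment exceeds promised length");
        }
        // duplicate() leaves the caller's position and limit untouched.
        ByteBuffer view = buf.duplicate();
        view.position(off);
        view.get(bytes, filled, len); // bulk copy out of the (direct) buffer
        filled += len;
        return this;
    }

    /** Terminal sanity check: did we get all the bytes we were promised? */
    String done() {
        if (filled != bytes.length) {
            throw new IllegalStateException(
                    "expected " + bytes.length + " bytes, got " + filled);
        }
        // The public constructor cannot take ownership of the array, so it copies.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Usage mirrors the snippet above: new Latin1StringDecoder(len1 + len2)
.decode(buf1, off1, len1).decode(buf2, off2, len2).done().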

What do you guys think?

Btw, Richard Warburton has already done some work in this area, and made a
patch that adds a constructor to String that takes a buffer, offset, length,
and charset. This work now at least needs rebasing:
http://cr.openjdk.java.net/~rwarburton/string-patch-webrev/
It doesn’t solve the case where multiple buffers are used to build the string,
but does remove the need for a separate intermediate state-holding object when
a single buffer is enough. It’d be a nice addition if possible, but I (for one)
can tolerate a small object allocation otherwise.
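
For comparison, the single-buffer case can be emulated today with a static
helper. This is not Richard's patch, just a sketch with made-up names
(BufferStrings, decodeSegment) built on Charset.decode, and it still
allocates a view and an intermediate CharBuffer:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

final class BufferStrings {
    /** Decodes len bytes starting at absolute offset off in buf using cs. */
    static String decodeSegment(ByteBuffer buf, int off, int len, Charset cs) {
        ByteBuffer view = buf.duplicate(); // keeps the caller's position and limit intact
        view.limit(off + len);
        view.position(off);
        // Charset.decode replaces malformed input rather than throwing.
        return cs.decode(view).toString();
    }
}
```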

Cheers,
Chris
