Hi,

Aleksey Shipilev gave a talk on his journey to implement compact strings and 
indified string concat at the JVM Tech Summit yesterday, and this reminded me 
that we (Neo4j) have a need for turning segments of DirectByteBuffers into 
Strings as fast as possible. If we already store the string data in Latin1, 
which is one of the two special encodings for compact strings, we’d ideally 
like to produce the String object with just the two necessary object 
allocations and a single, vectorised memory copy.

Our use case is that we are a database and we do our own file paging, 
effectively having file data in a large set of DirectByteBuffers. We have 
string data in our files in a number of different encodings, a popular one 
being Latin1. Occasionally these String values span multiple buffers. We often 
need to expose this data as String objects, in which case decoding the bytes 
and turning them into a String is often very performance sensitive - to the 
point of being one of our top bottlenecks for the given queries. Part of the 
story is that in the case of Latin1, I’ll know up front exactly how many bytes 
my string data takes up, though I might not know how many buffers are going to 
be involved.

As far as I can tell, this is currently not possible using public APIs. It may 
be possible using private APIs, but that would rely on the JIT to vectorise 
the memory copy.

From an API standpoint, CharsetDecoder is close to home, but is not quite 
there. It’s stateful and not thread-safe, so I either have to allocate new ones 
every time or cache them in thread-locals. I’m also required to allocate the 
receiving CharBuffer. Since I may need to decode from multiple buffers, I 
realise that I might not be able to get away from allocating at least one extra 
object to keep track of intermediate decoding state. The CharsetDecoder does 
not have a method where I can specify the offset and length for the desired 
part of the ByteBuffer I want to decode, which forces me to allocate views 
instead.
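
For reference, here is a minimal sketch of what the CharsetDecoder route looks 
like today for a Latin1 string that spans two buffers. The class name 
DecoderToday and the helper view() are made up for illustration; only the 
java.nio calls are real API. The comments mark the allocations described above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class DecoderToday {
    // Hypothetical helper: decodes len1 bytes at off1 in buf1,
    // followed by len2 bytes at off2 in buf2, as Latin1.
    static String decode(ByteBuffer buf1, int off1, int len1,
                         ByteBuffer buf2, int off2, int len2) {
        // Stateful and not thread-safe, so a fresh decoder per call (or a thread-local).
        CharsetDecoder decoder = StandardCharsets.ISO_8859_1.newDecoder();
        // Backed by a char[]; Latin1 yields exactly one char per input byte.
        CharBuffer out = CharBuffer.allocate(len1 + len2);
        decoder.decode(view(buf1, off1, len1), out, false);
        decoder.decode(view(buf2, off2, len2), out, true); // last chunk of input
        decoder.flush(out);
        out.flip();
        // The CharBuffer is mutable, so the String copies the chars.
        return out.toString();
    }

    // No offset/length overload on CharsetDecoder, so a view object per segment.
    static ByteBuffer view(ByteBuffer buf, int off, int len) {
        ByteBuffer v = buf.duplicate();
        v.limit(off + len);
        v.position(off);
        return v;
    }
}
```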

A CharBuffer is allocated with a length up front, which is nice, but I 
can’t restrict its encoding, so it has to allocate a char array instead of the 
byte array that I really want. Even if it did allocate a byte array, the 
CharBuffer is mutable, which would force String to do a defensive copy anyway.

One way I imagine this could be solved would be with a less dynamic kind of 
decoder, where the target length is given upfront to the decoder. Buffers are 
then consumed one by one, and a terminal method performs finishing sanity 
checks (did we get all the bytes we were promised?) and returns the result.

StringDecoder decoder =
    Charset.forName("latin1").newStringDecoder(lengthInCharactersOrBytesImNotSureWhichIsBest);
String result = decoder.decode(buf1, off1, len1).decode(buf2, off2, len2).done();

This will in principle allow the string decoding to be 2 small allocations, an 
array allocation without zeroing, and a sequence of potentially vectorised 
memcpys. I don’t see any potentially troubling interactions with fused Strings 
either, since all the knowledge (except for the string data itself) needed to 
allocate the String objects is available from the get-go.
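
To make the proposed shape concrete, here is a rough sketch of a Latin1-only 
decoder written against public APIs. Latin1StringDecoder is a hypothetical 
name, not an existing JDK class, and the sketch assumes Java 13 or later for 
the absolute bulk ByteBuffer.get(int, byte[], int, int). It cannot reach the 
ideal: the intermediate byte array is zeroed on allocation, and the String 
constructor copies it again. An in-JDK implementation could hand the array to 
the String directly.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical class, for illustration only.
public final class Latin1StringDecoder {
    private final byte[] bytes; // target array, sized up front
    private int filled;         // number of bytes consumed so far

    public Latin1StringDecoder(int lengthInBytes) {
        this.bytes = new byte[lengthInBytes];
    }

    // Consumes len bytes starting at absolute index off, without creating a view.
    public Latin1StringDecoder decode(ByteBuffer buf, int off, int len) {
        if (len < 0 || len > bytes.length - filled) {
            throw new IllegalArgumentException("more bytes than promised");
        }
        buf.get(off, bytes, filled, len); // absolute bulk get (Java 13+)
        filled += len;
        return this;
    }

    // Terminal sanity check: did we get all the bytes we were promised?
    public String done() {
        if (filled != bytes.length) {
            throw new IllegalStateException(
                "expected " + bytes.length + " bytes, got " + filled);
        }
        return new String(bytes, StandardCharsets.ISO_8859_1); // copies the array
    }
}
```

Usage then mirrors the proposal: 
new Latin1StringDecoder(len1 + len2).decode(buf1, off1, len1).decode(buf2, off2, len2).done();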

What do you guys think?

Btw, Richard Warburton has already done some work in this area, and made a 
patch that adds a constructor to String that takes a buffer, offset, length, 
and charset. This work now at least needs rebasing: 
http://cr.openjdk.java.net/~rwarburton/string-patch-webrev/ 
It doesn’t solve the case where multiple buffers are used to build the string, 
but does remove the need for a separate intermediate state-holding object when 
a single buffer is enough. It’d be a nice addition if possible, but I (for one) 
can tolerate a small object allocation otherwise.

Cheers,
Chris
