Hi Chris,
I think basically you are asking for a String constructor that takes a
ByteBuffer. StringCoding can then take advantage of the current
CompactString design to optimize the decoding operation into a single
byte[]/vectorized memory copy from the ByteBuffer to the String's
internal byte[], WHEN the charset is 8859-1.
String(ByteBuffer src, String charset);
Further, we will need a "buffer gathering" style constructor
String(ByteBuffer[] srcs, String charset);
(or more generally,
String(ByteBuffer[] srcs, int off, int len, String charset);)
to create a String object from a sequence of ByteBuffers, if it's really
desired.
And then I would assume it will also be desired to extend the current
CharsetDecoder/Encoder classes, adding a pair of "gathering" style
coding methods
CharBuffer CharsetDecoder.decode(ByteBuffer... ins);
ByteBuffer CharsetEncoder.encode(CharBuffer... ins);
Though the implementation might have to deal with the tricky "split
byte/char" issue, in which part of a byte/char sequence is in the
previous buffer and the continuing bytes/chars are in the next buffer ...
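
To make the split-sequence issue concrete, here is a minimal sketch of a
gathering decode written against today's public CharsetDecoder API. The
class GatheringDecode, its decode method, and the 16-byte carry buffer
are assumptions made for illustration, not an existing or proposed JDK
API. Bytes of a sequence left incomplete at the end of one buffer are
carried over and completed from the next one.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical helper: decodes several buffers as one logical byte sequence.
final class GatheringDecode {
    static String decode(CharsetDecoder decoder, ByteBuffer... ins) {
        long total = 0;
        for (ByteBuffer in : ins) total += in.remaining();
        CharBuffer out = CharBuffer.allocate(
                (int) Math.ceil(total * (double) decoder.maxCharsPerByte()));
        // Assumption: an incomplete trailing sequence never exceeds 16 bytes.
        ByteBuffer carry = ByteBuffer.allocate(16);
        decoder.reset();
        for (ByteBuffer in : ins) {
            ByteBuffer src = in.duplicate(); // leave the caller's position alone
            // Complete a sequence that started in an earlier buffer, one byte at a time.
            while (carry.position() > 0 && src.hasRemaining()) {
                carry.put(src.get());
                carry.flip();
                check(decoder.decode(carry, out, false));
                carry.compact(); // position becomes 0 once the sequence was consumed
            }
            check(decoder.decode(src, out, false));
            carry.put(src); // bytes of an incomplete trailing sequence, if any
        }
        carry.flip();
        check(decoder.decode(carry, out, true)); // leftover bytes here are malformed input
        check(decoder.flush(out));
        out.flip();
        return out.toString();
    }

    // Turns error and overflow results into unchecked exceptions.
    private static void check(CoderResult r) {
        if (r.isError() || r.isOverflow()) {
            try {
                r.throwException();
            } catch (CharacterCodingException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

Note that this still pays for a CharBuffer and a final copy into the
String, which is exactly the overhead a String-level API could avoid.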
-Sherman
On 2/9/16 7:20 AM, Chris Vest wrote:
Hi,
Aleksey Shipilev did a talk on his journey to implement compact strings and
indified string concat at the JVM Tech Summit yesterday, and this reminded me
that we (Neo4j) have a need for turning segments of DirectByteBuffers into
Strings as fast as possible. If we already store the string data in Latin1,
which is one of the two special encodings for compact strings, we’d ideally
like to produce the String object with just the two necessary object
allocations and a single, vectorised memory copy.
Our use case is that we are a database and we do our own file paging,
effectively having file data in a large set of DirectByteBuffers. We have
string data in our files in a number of different encodings, a popular one
being Latin1. Occasionally these String values span multiple buffers. We often
need to expose this data as String objects, in which case decoding the bytes
and turning them into a String is often very performance sensitive - to the
point of being one of our top bottlenecks for the given queries. Part of the
story is that in the case of Latin1, I’ll know up front exactly how many bytes
my string data takes up, though I might not know how many buffers are going to
be involved.
As far as I can tell, this is currently not possible using public APIs. Using
private APIs it may be possible, but that would rely on the JIT to vectorise
the memory copying.
From an API standpoint, CharsetDecoder is close to home, but is not quite
there. It’s stateful and not thread-safe, so I either have to allocate new ones
every time or cache them in thread-locals. I’m also required to allocate the
receiving CharBuffer. Since I may need to decode from multiple buffers, I
realise that I might not be able to get away from allocating at least one extra
object to keep track of intermediate decoding state. The CharsetDecoder does
not have a method where I can specify the offset and length for the desired
part of the ByteBuffer I want to decode, which forces me to allocate views
instead.
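
For reference, the workaround described above looks roughly like the
following sketch. The class and method names are made up for
illustration; only the java.nio and ThreadLocal calls are real API. It
shows the thread-local decoder cache and the view that has to be
allocated to express an offset and length.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the allocations in the current approach.
final class CachedLatin1Decoding {
    // CharsetDecoder is stateful and not thread-safe, so keep one per thread.
    private static final ThreadLocal<CharsetDecoder> DECODER =
            ThreadLocal.withInitial(StandardCharsets.ISO_8859_1::newDecoder);

    static String decode(ByteBuffer buf, int off, int len) {
        // No offset/length overload exists, so a view object is needed.
        ByteBuffer view = buf.duplicate();
        view.limit(off + len);
        view.position(off);
        try {
            // decode(ByteBuffer) resets the decoder and allocates a CharBuffer;
            // toString() then copies the chars into the new String.
            return DECODER.get().decode(view).toString();
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```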
The CharBuffers are allocated with a length up front, which is nice, but I
can’t restrict its encoding so it has to allocate a char array instead of the
byte array that I really want. Even if it did allocate a byte array, the
CharBuffer is mutable, which would force String to do a defensive copy anyway.
One way I imagine this could be solved would be with a less dynamic kind of
decoder, where the target length is given upfront to the decoder. Buffers are
then consumed one by one, and a terminal method performs finishing sanity
checks (did we get all the bytes we were promised?) and returns the result.
StringDecoder decoder =
    Charset.forName("latin1").newStringDecoder(lengthInCharactersOrBytesImNotSureWhichIsBest);
String result = decoder.decode(buf1, off1, len1).decode(buf2, off2, len2).done();
This will in principle allow the string decoding to be 2 small allocations, an
array allocation without zeroing, and a sequence of potentially vectorised
memcpys. I don’t see any potentially troubling interactions with fused Strings
either, since all the knowledge (except for the string data itself) needed to
allocate the String objects is available from the get-go.
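
To illustrate the shape of such a decoder, here is a rough Latin1-only
approximation built on public APIs. The class name Latin1StringDecoder
and the choice to take the length in bytes are assumptions for the
sketch. Unlike the proposal, it cannot skip zeroing the array, and
new String(byte[], Charset) copies the bytes once more, so it only shows
the calling pattern and the sanity checks, not the performance.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a fixed-length, multi-buffer Latin1 decoder.
final class Latin1StringDecoder {
    private final byte[] target;
    private int filled;

    Latin1StringDecoder(int lengthInBytes) {
        target = new byte[lengthInBytes];
    }

    // Copies len bytes starting at absolute offset off of buf.
    Latin1StringDecoder decode(ByteBuffer buf, int off, int len) {
        if (len < 0 || len > target.length - filled) {
            throw new IllegalArgumentException("more bytes than promised");
        }
        ByteBuffer view = buf.duplicate(); // does not disturb buf's position
        view.limit(off + len);
        view.position(off);
        view.get(target, filled, len);     // bulk copy out of the buffer
        filled += len;
        return this;
    }

    // Terminal method: checks that all promised bytes arrived.
    String done() {
        if (filled != target.length) {
            throw new IllegalStateException(
                    "expected " + target.length + " bytes, got " + filled);
        }
        return new String(target, StandardCharsets.ISO_8859_1);
    }
}
```

Usage then mirrors the snippet above:
new Latin1StringDecoder(len1 + len2).decode(buf1, off1, len1).decode(buf2, off2, len2).done();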
What do you guys think?
Btw, Richard Warburton has already done some work in this area, and made a patch that
adds a constructor to String that takes a buffer, offset, length, and charset. This
work now at least needs rebasing:
http://cr.openjdk.java.net/~rwarburton/string-patch-webrev/
It doesn’t solve the case where multiple buffers are used to build the string,
but does remove the need for a separate intermediate state-holding object when
a single buffer is enough. It’d be a nice addition if possible, but I (for one)
can tolerate a small object allocation otherwise.
Cheers,
Chris