Or, as Chris explains, a string that spans more than one buffer is a corner case for this software. For most strings, the constructor that takes a buffer is fine; for the corner case, a String constructor that takes a CharSequence seems easier to implement than creating a new kind of buffer that represents several buffers.
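For illustration, the CharSequence route could look roughly like the sketch below. It assumes the data is Latin-1 (one byte per char), and `Latin1BuffersView` is a hypothetical name, not an existing or proposed JDK class. Since String has no CharSequence constructor today, the sketch ends in toString(), which still goes through an intermediate byte[]; a String(CharSequence) constructor or factory method would be needed to avoid that copy.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical read-only Latin-1 view over the remaining bytes of several buffers. */
final class Latin1BuffersView implements CharSequence {
    private final ByteBuffer[] parts; // slices: index 0 is the source's position
    private final int[] starts;       // starts[i] = char index of parts[i]'s first byte
    private final int length;

    Latin1BuffersView(ByteBuffer... buffers) {
        parts = new ByteBuffer[buffers.length];
        starts = new int[buffers.length];
        int total = 0;
        for (int i = 0; i < buffers.length; i++) {
            parts[i] = buffers[i].slice(); // does not move the caller's position/limit
            starts[i] = total;
            total += parts[i].remaining();
        }
        length = total;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length);
        }
        // Linear scan for the last part that starts at or before index;
        // few buffers are expected, so no binary search here.
        int i = parts.length - 1;
        while (starts[i] > index) {
            i--;
        }
        // Latin-1: each byte maps directly to the char with the same value.
        return (char) (parts[i].get(index - starts[i]) & 0xFF);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return toString().substring(start, end);
    }

    @Override
    public String toString() {
        byte[] bytes = new byte[length];
        int off = 0;
        for (ByteBuffer part : parts) {
            int n = part.remaining();
            part.duplicate().get(bytes, off, n); // bulk copy, part itself is untouched
            off += n;
        }
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

For example, `new Latin1BuffersView(buf1, buf2).toString()` yields the text of the remaining bytes of buf1 followed by those of buf2.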
By the way, I would prefer static factory methods to more constructors in String; there are already too many constructors.

regards,
Rémi

----- Original Message -----
> From: "Paul Sandoz" <paul.san...@oracle.com>
> Cc: core-libs-dev@openjdk.java.net
> Sent: Wednesday, 10 February 2016 09:54:17
> Subject: Re: Compact Strings and APIs for fast decoding of string data
>
> Hi,
>
> A more functional approach would be to compose a sequence of buffers into one
> view, perhaps read-only. Then there would be no need to accept arrays of
> buffers. That should work well for bulk operations. That’s a non-trivial but
> not very difficult amount of work, and possibly simplified if restricted to
> read-only views.
>
> Thus I think we should focus Richard’s work on:
>
>     String(ByteBuffer src, String charset)
>
> and perhaps a sub-range variant, if perturbing the position/limit of an
> existing buffer and/or slicing is too problematic.
>
> --
>
> Zeroing memory and possibly avoiding it can be tricky. Any such optimisations
> have to be carefully performed, otherwise uninitialised regions might leak
> and be accessed, nefariously or otherwise. I imagine it’s easier to
> contain/control within a constructor than, say, a builder.
>
> Paul.
>
> > On 10 Feb 2016, at 05:38, Xueming Shen <xueming.s...@oracle.com> wrote:
> >
> > Hi Chris,
> >
> > I think basically you are asking for a String constructor that takes a
> > ByteBuffer. StringCoding can then take advantage of the current
> > CompactString design to optimize the decoding operation down to a single
> > byte[]/vectorized memory copy from the ByteBuffer to the String's
> > internal byte[], WHEN the charset is 8859-1.
> >
> >     String(ByteBuffer src, String charset);
> >
> > Further, we will need a "buffer gathering" style constructor
> >
> >     String(ByteBuffer[] srcs, String charset);
> >     (or more generally, String(ByteBuffer[] srcs, int off, int len, String charset))
> >
> > to create a String object from a sequence of ByteBuffers, if it's really
> > desired.
> >
> > And then I would assume it will also be desired to extend the current
> > CharsetDecoder/Encoder classes to add a pair of "gathering" style
> > coding methods
> >
> >     CharBuffer CharsetDecoder.decode(ByteBuffer... ins);
> >     ByteBuffer CharsetEncoder.encode(CharBuffer... ins);
> >
> > Though the implementation might have to deal with the tricky "splitting
> > byte/char" issue, in which part of a "byte/char sequence" is in the
> > previous buffer and the continuing bytes/chars are in the next
> > buffer ...
> >
> > -Sherman
> >
> > On 2/9/16 7:20 AM, Chris Vest wrote:
> >> Hi,
> >>
> >> Aleksey Shipilev did a talk on his journey to implement compact strings
> >> and indified string concat at the JVM Tech Summit yesterday, and this
> >> reminded me that we (Neo4j) have a need for turning segments of
> >> DirectByteBuffers into Strings as fast as possible. If we already store
> >> the string data in Latin1, which is one of the two special encodings for
> >> compact strings, we’d ideally like to produce the String object with just
> >> the two necessary object allocations and a single, vectorised memory
> >> copy.
> >>
> >> Our use case is that we are a database and we do our own file paging,
> >> effectively having file data in a large set of DirectByteBuffers. We have
> >> string data in our files in a number of different encodings, a popular
> >> one being Latin1. Occasionally these String values span multiple buffers.
> >> We often need to expose this data as String objects, in which case
> >> decoding the bytes and turning them into a String is often very
> >> performance sensitive - to the point of being one of our top bottlenecks
> >> for the given queries. Part of the story is that in the case of Latin1,
> >> I’ll know up front exactly how many bytes my string data takes up, though
> >> I might not know how many buffers are going to be involved.
> >>
> >> As far as I can tell, this is currently not possible using public APIs.
> >> Using private APIs it may be possible, but it would rely on the JIT for
> >> vectorising the memory copying.
> >>
> >> From an API standpoint, CharsetDecoder is close to home, but is not quite
> >> there. It’s stateful and not thread-safe, so I either have to allocate
> >> new ones every time or cache them in thread-locals. I’m also required to
> >> allocate the receiving CharBuffer. Since I may need to decode from
> >> multiple buffers, I realise that I might not be able to get away from
> >> allocating at least one extra object to keep track of intermediate
> >> decoding state. The CharsetDecoder does not have a method where I can
> >> specify the offset and length for the desired part of the ByteBuffer I
> >> want to decode, which forces me to allocate views instead.
> >>
> >> The CharBuffers are allocated with a length up front, which is nice, but I
> >> can’t restrict its encoding, so it has to allocate a char array instead of
> >> the byte array that I really want. Even if it did allocate a byte array,
> >> the CharBuffer is mutable, which would force String to do a defensive copy
> >> anyway.
> >>
> >> One way I imagine this could be solved would be with a less dynamic kind
> >> of decoder, where the target length is given up front to the decoder.
> >> Buffers are then consumed one by one, and a terminal method performs
> >> finishing sanity checks (did we get all the bytes we were promised?) and
> >> returns the result.
> >>
> >>     StringDecoder decoder =
> >>         Charset.forName("latin1").newStringDecoder(lengthInCharactersOrBytesImNotSureWhichIsBest);
> >>     String result = decoder.decode(buf1, off1, len1).decode(buf2, off2, len2).done();
> >>
> >> This will in principle allow the string decoding to be 2 small
> >> allocations, an array allocation without zeroing, and a sequence of
> >> potentially vectorised memcpys. I don’t see any potentially troubling
> >> interactions with fused Strings either, since all the knowledge (except
> >> for the string data itself) needed to allocate the String objects is
> >> available from the get-go.
> >>
> >> What do you guys think?
> >>
> >> Btw, Richard Warburton has already done some work in this area, and made a
> >> patch that adds a constructor to String that takes a buffer, offset,
> >> length, and charset. This work now at least needs rebasing:
> >> http://cr.openjdk.java.net/~rwarburton/string-patch-webrev/
> >> It doesn’t solve the case where multiple buffers are used to build the
> >> string, but does remove the need for a separate intermediate
> >> state-holding object when a single buffer is enough. It’d be a nice
> >> addition if possible, but I (for one) can tolerate a small object
> >> allocation otherwise.
> >>
> >> Cheers,
> >> Chris
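
For reference, the shape of the decoder Chris proposes can be approximated today with public APIs only. The sketch below is a guess at that shape, not the proposed API: `Latin1StringDecoder` is a hypothetical class, it handles Latin-1 only, and it takes the length in bytes. It also cannot deliver the savings the proposal is after, because `new byte[n]` zeroes the array and `new String(byte[], Charset)` copies it again.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the proposed StringDecoder, Latin-1 only. */
final class Latin1StringDecoder {
    private final byte[] bytes; // target storage, sized up front
    private int filled;         // number of bytes received so far

    Latin1StringDecoder(int lengthInBytes) {
        bytes = new byte[lengthInBytes];
    }

    /** Copies len bytes starting at absolute index off of buf. */
    Latin1StringDecoder decode(ByteBuffer buf, int off, int len) {
        if (len < 0 || len > bytes.length - filled) {
            throw new IllegalArgumentException("more bytes than promised");
        }
        ByteBuffer view = buf.duplicate(); // leave the caller's position/limit alone
        view.limit(off + len);             // limit first, so position(off) stays in range
        view.position(off);
        view.get(bytes, filled, len);      // bulk copy into the target array
        filled += len;
        return this;
    }

    /** Terminal step: checks that every promised byte arrived, then builds the String. */
    String done() {
        if (filled != bytes.length) {
            throw new IllegalStateException(
                "expected " + bytes.length + " bytes, got " + filled);
        }
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Usage follows the chained form from the original mail: `new Latin1StringDecoder(len1 + len2).decode(buf1, off1, len1).decode(buf2, off2, len2).done()`.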