[ https://issues.apache.org/jira/browse/SOLR-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713437#comment-14713437 ]
Shalin Shekhar Mangar commented on SOLR-7971:
---------------------------------------------

bq. couldn't frequent calls to allocateDirect() & clear() take too much time? in that case, isn't it worth reusing the direct buffer across writeStr() calls as a field of JavaBinCodec?

Yes, allocateDirect() can be slower, and we should reuse the buffer as much as possible. This was just an idea as a patch; I don't intend to commit it as it is.

bq. I gather that buffering is necessary only because we need to calculate the length of the encoded bytes for the start tag. Would it be a big problem to loop over ByteUtils.UTF16toUTF8() twice: the first time to calculate the length while discarding the content, and the second time to actually write the content?

Hmm, interesting idea. We could also have a method calcUTF16toUTF8Length which avoids all the bitwise operators and just returns the required length.

bq. just curious, how much effort would it take to extend the javabin format with HTTP-like chunks?

It should be possible. We'd need a new chunked type and an upgrade to the JavaBin version. Or we may be able to get away with modifying only the LogCodec in TransactionLog.

> Reduce memory allocated by JavaBinCodec to encode large strings
> ---------------------------------------------------------------
>
>                 Key: SOLR-7971
>                 URL: https://issues.apache.org/jira/browse/SOLR-7971
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Response Writers, SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>             Fix For: Trunk, 5.4
>
>         Attachments: SOLR-7971-directbuffer.patch, SOLR-7971.patch
>
>
> As discussed in SOLR-7927, we can reduce the buffer memory allocated by
> JavaBinCodec while writing large strings.
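A minimal sketch of the buffer-reuse idea discussed above (not the attached patch): instead of calling ByteBuffer.allocateDirect() on every writeStr() call, the codec keeps one direct buffer as a field and only reallocates when it is too small. Class and method names here are hypothetical, for illustration only.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: a reusable direct scratch buffer, as a field of the
// codec, so frequent writeStr() calls don't pay allocateDirect() each time.
public class ReusableDirectBuffer {
    private ByteBuffer scratch; // lazily allocated, grown on demand, then reused

    /** Returns a cleared direct buffer with capacity of at least minCapacity bytes. */
    public ByteBuffer scratch(int minCapacity) {
        if (scratch == null || scratch.capacity() < minCapacity) {
            scratch = ByteBuffer.allocateDirect(minCapacity);
        }
        scratch.clear(); // reset position/limit; caller overwrites the contents
        return scratch;
    }
}
```

The trade-off is the usual one for pooling: the field pins the largest buffer ever requested, so a real implementation might cap the retained size.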
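The calcUTF16toUTF8Length idea mentioned above could look roughly like the sketch below: walk the chars once and add the UTF-8 byte count per char without producing any output. The method name comes from the comment; the class name is hypothetical, and the handling of unpaired surrogates (counted as 3 bytes, i.e. a replacement character) may differ from what ByteUtils.UTF16toUTF8 actually does.

```java
// Hypothetical sketch: compute the UTF-8 encoded length of a Java string
// without allocating or writing any bytes.
public class Utf8LengthCalc {
    /** Returns the number of UTF-8 bytes needed to encode s. */
    public static int calcUTF16toUTF8Length(CharSequence s) {
        int len = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                len += 1;            // ASCII
            } else if (c < 0x800) {
                len += 2;            // two-byte sequence
            } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
                       && Character.isLowSurrogate(s.charAt(i + 1))) {
                len += 4;            // surrogate pair -> one 4-byte code point
                i++;                 // consume the low surrogate too
            } else {
                len += 3;            // BMP char (unpaired surrogates counted as 3)
            }
        }
        return len;
    }

    public static void main(String[] args) {
        System.out.println(calcUTF16toUTF8Length("hello"));        // 5
        System.out.println(calcUTF16toUTF8Length("\u00e9"));       // 2
        System.out.println(calcUTF16toUTF8Length("\uDBFF\uDFFF")); // 4
    }
}
```

With this, the two-pass scheme becomes: compute the length, write the tag header, then run the real encoder straight into the output.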
> https://issues.apache.org/jira/browse/SOLR-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700420#comment-14700420
> {quote}
> The maximum Unicode code point (as of Unicode 8, anyway) is U+10FFFF
> ([http://www.unicode.org/glossary/#code_point]). It is encoded in UTF-16 as
> the surrogate pair {{\uDBFF\uDFFF}}, which takes up two Java chars, and is
> represented in UTF-8 as the 4-byte sequence {{F4 8F BF BF}}. This is likely
> where the mistaken 4-bytes-per-Java-char formulation came from: the maximum
> number of UTF-8 bytes required to represent a Unicode *code point* is 4.
> The maximum Java char is {{\uFFFF}}, which is represented in UTF-8 as the
> 3-byte sequence {{EF BF BF}}.
> So I think it's safe to switch to using 3 bytes per Java char (the unit of
> measurement returned by {{String.length()}}), like
> {{CompressingStoredFieldsWriter.writeField()}} does.
> {quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
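The byte counts in the SOLR-7927 quote can be checked directly with the JDK's UTF-8 encoder; this small demo (illustrative, not part of any patch) shows why 3 bytes per Java char is the true worst case, since the only 4-byte sequences come from surrogate pairs that occupy two chars.

```java
import java.nio.charset.StandardCharsets;

// Demonstrates the UTF-8 byte counts quoted from SOLR-7927.
public class Utf8Bounds {
    public static void main(String[] args) {
        // U+10FFFF: two Java chars (a surrogate pair), four UTF-8 bytes,
        // i.e. only 2 bytes per char.
        String maxCodePoint = "\uDBFF\uDFFF";
        System.out.println(maxCodePoint.length());                                // 2
        System.out.println(maxCodePoint.getBytes(StandardCharsets.UTF_8).length); // 4

        // U+FFFF: the largest single Java char, three UTF-8 bytes (EF BF BF) --
        // the per-char worst case.
        String maxChar = "\uFFFF";
        System.out.println(maxChar.length());                                     // 1
        System.out.println(maxChar.getBytes(StandardCharsets.UTF_8).length);      // 3
    }
}
```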