[ https://issues.apache.org/jira/browse/SOLR-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713437#comment-14713437 ]

Shalin Shekhar Mangar commented on SOLR-7971:
---------------------------------------------

bq. couldn't it turn out that frequent calls to allocateDirect() & clear() take 
too much time? in that case, isn't it worth reusing the direct buffer across 
writeStr() calls as a field of JavaBinCodec?

Yes, allocateDirect() can be slower, and we should reuse the buffer as much as 
possible. The patch was just an idea; I don't intend to commit it as it is.
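
Roughly like this, just a sketch of the reuse with illustrative names (the 
attached patch does not do this yet):

{code:java}
import java.nio.ByteBuffer;

// Sketch only (not the attached patch): keep one direct buffer as a
// field and reuse it across writeStr() calls, growing it only when a
// larger string comes along.
class BufferHolder {
  private ByteBuffer directBuffer; // lazily allocated, reused across calls

  ByteBuffer getDirectBuffer(int capacity) {
    if (directBuffer == null || directBuffer.capacity() < capacity) {
      directBuffer = ByteBuffer.allocateDirect(capacity);
    }
    directBuffer.clear(); // reset position/limit before reuse
    return directBuffer;
  }
}
{code}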

bq. I gather that buffering is necessary only because we need to calculate the 
length of the encoded bytes for the start tag. Is it a big problem if we loop 
ByteUtils.UTF16toUTF8() twice: the first time to calculate the length while 
dropping the content entirely, and the second time to actually write the 
content?

Hmm, interesting idea. We could also have a method calcUTF16toUTF8Length which 
avoids all the bitwise operators and just returns the required length.
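Something like the following, just a sketch with an assumed signature. It 
mirrors the branching in ByteUtils.UTF16toUTF8(), assuming unpaired surrogates 
are replaced by the 3-byte replacement character as the encoder does:

{code:java}
// Sketch only: count the UTF-8 bytes without writing any of them.
static int calcUTF16toUTF8Length(CharSequence s, int offset, int len) {
  int end = offset + len;
  int bytes = 0;
  for (int i = offset; i < end; i++) {
    char c = s.charAt(i);
    if (c < 0x80) {
      bytes += 1; // ASCII: U+0000..U+007F
    } else if (c < 0x800) {
      bytes += 2; // U+0080..U+07FF
    } else if (Character.isHighSurrogate(c) && i + 1 < end
        && Character.isLowSurrogate(s.charAt(i + 1))) {
      bytes += 4; // valid surrogate pair -> one 4-byte sequence
      i++;        // the low surrogate is consumed as well
    } else {
      bytes += 3; // rest of the BMP, or a lone surrogate
    }
  }
  return bytes;
}
{code}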

bq. just curious, how much effort would it take to extend the javabin format 
with HTTP-like chunks?

It should be possible. We'll need a new chunked type and an upgrade to the 
JavaBin version. Or we may be able to get away with modifying only the LogCodec 
in TransactionLog.
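
Purely as an illustration of the shape it could take (not an actual JavaBin 
type; the names and framing below are made up): each chunk is length-prefixed 
and a zero-length chunk terminates the string, so no buffer proportional to 
the whole string is ever needed.

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only: write a large string as length-prefixed chunks so
// that the temporary buffer stays bounded regardless of string size.
class ChunkedStrWriter {
  static final int CHUNK_CHARS = 8192; // bounds the temporary buffer

  static void writeChunked(String s, OutputStream out) throws IOException {
    int i = 0;
    while (i < s.length()) {
      int end = Math.min(s.length(), i + CHUNK_CHARS);
      // don't split a surrogate pair across two chunks
      if (end < s.length() && Character.isHighSurrogate(s.charAt(end - 1))) {
        end--;
      }
      byte[] chunk = s.substring(i, end).getBytes(StandardCharsets.UTF_8);
      writeVInt(chunk.length, out);
      out.write(chunk);
      i = end;
    }
    writeVInt(0, out); // zero-length chunk terminates the string
  }

  // 7-bit variable-length int, like JavaBin's existing writeVInt.
  static void writeVInt(int v, OutputStream out) throws IOException {
    while ((v & ~0x7F) != 0) {
      out.write((v & 0x7F) | 0x80);
      v >>>= 7;
    }
    out.write(v);
  }
}
{code}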

> Reduce memory allocated by JavaBinCodec to encode large strings
> ---------------------------------------------------------------
>
>                 Key: SOLR-7971
>                 URL: https://issues.apache.org/jira/browse/SOLR-7971
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Response Writers, SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>             Fix For: Trunk, 5.4
>
>         Attachments: SOLR-7971-directbuffer.patch, SOLR-7971.patch
>
>
> As discussed in SOLR-7927, we can reduce the buffer memory allocated by 
> JavaBinCodec while writing large strings.
> https://issues.apache.org/jira/browse/SOLR-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700420#comment-14700420
> {quote}
> The maximum Unicode code point (as of Unicode 8 anyway) is U+10FFFF 
> ([http://www.unicode.org/glossary/#code_point]).  This is encoded in UTF-16 
> as surrogate pair {{\uDBFF\uDFFF}}, which takes up two Java chars, and is 
> represented in UTF-8 as the 4-byte sequence {{F4 8F BF BF}}.  This is likely 
> where the mistaken 4-bytes-per-Java-char formulation came from: the maximum 
> number of UTF-8 bytes required to represent a Unicode *code point* is 4.
> The maximum Java char is {{\uFFFF}}, which is represented in UTF-8 as the 
> 3-byte sequence {{EF BF BF}}.
> So I think it's safe to switch to using 3 bytes per Java char (the unit of 
> measurement returned by {{String.length()}}), like 
> {{CompressingStoredFieldsWriter.writeField()}} does.
> {quote}
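
For reference, the byte counts quoted above can be checked with plain Java:

{code:java}
import java.nio.charset.StandardCharsets;

// Verifies the byte counts from the quoted comment.
public class Utf8Lengths {
  public static void main(String[] args) {
    String maxJavaChar = "\uFFFF";        // largest single Java char
    String maxCodePoint = "\uDBFF\uDFFF"; // U+10FFFF as a surrogate pair
    // prints 3: EF BF BF
    System.out.println(maxJavaChar.getBytes(StandardCharsets.UTF_8).length);
    // prints 4: F4 8F BF BF
    System.out.println(maxCodePoint.getBytes(StandardCharsets.UTF_8).length);
  }
}
{code}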



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
