[ 
https://issues.apache.org/jira/browse/SOLR-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700420#comment-14700420
 ] 

Steve Rowe commented on SOLR-7927:
----------------------------------

The maximum Unicode code point (as of Unicode 8 anyway) is U+10FFFF 
([http://www.unicode.org/glossary/#code_point]).  It is encoded in UTF-16 as 
the surrogate pair {{\uDBFF\uDFFF}}, which takes up two Java chars, and in 
UTF-8 as the 4-byte sequence {{F4 8F BF BF}}.  This is likely where the 
mistaken 4-bytes-per-Java-char formulation came from: 4 is the maximum number 
of UTF-8 bytes required to represent a Unicode *code point*, not a Java char.
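
The encodings above can be checked with a small standalone program. This is 
only an illustrative sketch using the JDK; the class name 
{{MaxCodePointDemo}} and the helper {{utf8Hex}} are made up for this example 
and are not part of Solr or Lucene.

```java
import java.nio.charset.StandardCharsets;

public class MaxCodePointDemo {
    /** Hypothetical helper: UTF-8 bytes of one code point as space-separated hex. */
    public static String utf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Character.MAX_CODE_POINT is 0x10FFFF.
        String max = new String(Character.toChars(Character.MAX_CODE_POINT));
        // Number of Java chars (UTF-16 code units) used by the code point.
        System.out.println(max.length());
        // High and low surrogate, printed as hex.
        System.out.println(Integer.toHexString(max.charAt(0)));
        System.out.println(Integer.toHexString(max.charAt(1)));
        // UTF-8 byte sequence of the same code point.
        System.out.println(utf8Hex(Character.MAX_CODE_POINT));
    }
}
```

The point of the check is that the 4 UTF-8 bytes are spread over 2 Java 
chars, so this case costs 2 bytes per char, not 4.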

The maximum Java char is {{\uFFFF}}, which is represented in UTF-8 as the 
3-byte sequence {{EF BF BF}}.

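A code point outside the BMP costs 4 UTF-8 bytes but occupies two Java chars, 
i.e. only 2 bytes per char, so the worst case per char is the 3-byte case 
above.  So I think it's safe to switch to using 3 bytes per Java char (the 
unit of measurement returned by {{String.length()}}), like 
{{CompressingStoredFieldsWriter.writeField()}} does.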
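
A minimal sketch of the 3-bytes-per-char bound, assuming the goal is just to 
size a buffer from {{String.length()}}.  The class {{Utf8BoundDemo}} and its 
methods are hypothetical names for illustration; this is not the code in 
{{CompressingStoredFieldsWriter}} or in the transaction log.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BoundDemo {
    /** Hypothetical helper: worst-case UTF-8 size, 3 bytes per Java char. */
    public static long maxUtf8Bytes(String s) {
        return 3L * s.length();
    }

    /** Actual UTF-8 size, computed by encoding the whole string. */
    public static int actualUtf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "abc",            // ASCII: 1 byte per char
            "\u00e9\u00e9",   // Latin-1 supplement: 2 bytes per char
            "\uFFFF",         // largest single char: 3 bytes
            "\uDBFF\uDFFF"    // surrogate pair: 4 bytes over 2 chars
        };
        for (String s : samples) {
            int actual = actualUtf8Bytes(s);
            long bound = maxUtf8Bytes(s);
            System.out.println(s.length() + " chars, " + actual
                + " bytes, bound " + bound);
            if (actual > bound) {
                throw new IllegalStateException("bound violated");
            }
        }
    }
}
```

Compared with a 4-bytes-per-char estimate, this bound shrinks the worst-case 
allocation by a quarter while still covering every string.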

> Transaction log consumes lot of memory when indexing large documents
> --------------------------------------------------------------------
>
>                 Key: SOLR-7927
>                 URL: https://issues.apache.org/jira/browse/SOLR-7927
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 5.2.1
>            Reporter: Shalin Shekhar Mangar
>             Fix For: Trunk, 5.4
>
>
> Solr is started with 1280M heap.
> ./bin/solr start -m 1280m
> Indexing a 100MB JSON file (using curl) containing large JSON documents from 
> Project Gutenberg fails with OOM, but a 549M JSON file containing small 
> documents is indexed just fine.
> The same 100MB JSON file with the same heap size can be indexed just fine if 
> I disable the transaction log.



