Pierre Salagnac created SOLR-18237:
--------------------------------------
Summary: Collection state deserialization duplicates content for
no good reason
Key: SOLR-18237
URL: https://issues.apache.org/jira/browse/SOLR-18237
Project: Solr
Issue Type: Bug
Affects Versions: 10.0, 9.10
Reporter: Pierre Salagnac
When deserializing collection state from ZooKeeper, we create large char arrays
to convert the UTF-8 data into Java strings.
In Utils.java:
{code:java}
CharArr chars = new CharArr();
ByteUtils.UTF8toUTF16(utf8, offset, length, chars);
{code}
This consumes a lot of memory when several threads concurrently deserialize a big
collection. We hit OOM errors because of this. It is a corner case, since the
OOMs were raised while the heap was far from full (~50%). My understanding is that
big arrays are allocated directly in the _Old_ space, and we end up with a
fragmented heap where one more big array does not fit, even though the heap is not
full.
This could easily be replaced by a reader that does the UTF-8 decoding on the fly,
with a much smaller buffer.
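
A minimal sketch of the idea, using only JDK classes and not taken from the Solr code base: the class name StreamingUtf8Sketch, the helpers utf8Reader and countChars, and the 8192-char buffer size are all hypothetical. It wraps the byte slice in a Reader that decodes incrementally, so the only char storage is one small reusable buffer rather than an array sized to the whole document. Whether the downstream JSON parser can consume a Reader directly is an assumption here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamingUtf8Sketch {

  /** Wraps a slice of a UTF-8 byte array in a Reader that decodes it incrementally. */
  static Reader utf8Reader(byte[] utf8, int offset, int length) {
    // ByteArrayInputStream does not copy the byte array; it only reads from it.
    return new InputStreamReader(
        new ByteArrayInputStream(utf8, offset, length), StandardCharsets.UTF_8);
  }

  /**
   * Hypothetical consumer standing in for a parser: counts the decoded UTF-16
   * chars while holding at most one small char buffer at a time.
   */
  static long countChars(byte[] utf8, int offset, int length) {
    char[] buf = new char[8192]; // small, fixed-size buffer (size is an arbitrary choice)
    long total = 0;
    try (Reader reader = utf8Reader(utf8, offset, length)) {
      int n;
      while ((n = reader.read(buf)) != -1) {
        total += n;
      }
    } catch (IOException e) {
      // Not expected for an in-memory stream; rethrown unchecked for brevity.
      throw new UncheckedIOException(e);
    }
    return total;
  }

  public static void main(String[] args) {
    String json = "{\"collection\":\"d\u00e9mo\"}";
    byte[] data = json.getBytes(StandardCharsets.UTF_8);
    System.out.println(countChars(data, 0, data.length) == json.length());
  }
}
```

In this sketch the peak char allocation per thread is bounded by the buffer size plus whatever the reader holds internally, independent of the size of the collection state.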