[ https://issues.apache.org/jira/browse/LUCENE-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754885#comment-16754885 ]
Mark Miller commented on LUCENE-8662:
-------------------------------------
I’ve discussed this issue a little, and I’m not sure we know the performance
implications of ceil versus exact (I’m mostly going by the comment mentioning
that ceil is slow with some codecs). There is, however, an index and query for
which ceil appears to blow up the FST: ord gets ridiculously high and creates
a huge stack[], leading to OOM. Exact does not seem to have this issue. The
query appears to be against a field with many unique, long terms that share a
long common prefix.
> Override seekExact(BytesRef) in FilterLeafReader.FilterTermsEnum
> ----------------------------------------------------------------
>
> Key: LUCENE-8662
> URL: https://issues.apache.org/jira/browse/LUCENE-8662
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Affects Versions: 5.5.5, 6.6.5, 7.6, 8.0
> Reporter: jefferyyuan
> Priority: Major
> Labels: query
> Fix For: 8.0, 7.7
>
> Attachments: output of test program.txt
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Recently, in production, we found that Solr uses a lot of memory (more than
> 10 GB) during recovery or commit for a small index (3.5 GB).
> The stack trace is:
>
> {code:java}
> Thread 0x4d4b115c0
>   at org.apache.lucene.store.DataInput.readVInt()I (DataInput.java:125)
>   at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock()V (SegmentTermsEnumFrame.java:157)
>   at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.scanToTermNonLeaf(Lorg/apache/lucene/util/BytesRef;Z)Lorg/apache/lucene/index/TermsEnum$SeekStatus; (SegmentTermsEnumFrame.java:786)
>   at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.scanToTerm(Lorg/apache/lucene/util/BytesRef;Z)Lorg/apache/lucene/index/TermsEnum$SeekStatus; (SegmentTermsEnumFrame.java:538)
>   at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekCeil(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/index/TermsEnum$SeekStatus; (SegmentTermsEnum.java:757)
>   at org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.seekCeil(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/index/TermsEnum$SeekStatus; (FilterLeafReader.java:185)
>   at org.apache.lucene.index.TermsEnum.seekExact(Lorg/apache/lucene/util/BytesRef;)Z (TermsEnum.java:74)
>   at org.apache.solr.search.SolrIndexSearcher.lookupId(Lorg/apache/lucene/util/BytesRef;)J (SolrIndexSearcher.java:823)
>   at org.apache.solr.update.VersionInfo.getVersionFromIndex(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; (VersionInfo.java:204)
>   at org.apache.solr.update.UpdateLog.lookupVersion(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; (UpdateLog.java:786)
>   at org.apache.solr.update.VersionInfo.lookupVersion(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; (VersionInfo.java:194)
>   at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(Lorg/apache/solr/update/AddUpdateCommand;)Z (DistributedUpdateProcessor.java:1051)
> {code}
> We reproduced the problem locally with the following code, which uses only
> Lucene classes.
> {code:java}
> public static void main(String[] args) throws IOException {
>   FSDirectory index = FSDirectory.open(Paths.get("the-index"));
>   try (IndexReader reader = new ExitableDirectoryReader(
>       DirectoryReader.open(index), new QueryTimeoutImpl(1000 * 60 * 5))) {
>     String id = "the-id";
>     BytesRef text = new BytesRef(id);
>     for (LeafReaderContext lf : reader.leaves()) {
>       TermsEnum te = lf.reader().terms("id").iterator();
>       System.out.println(te.seekExact(text));
>     }
>   }
> }
> {code}
>
> I added System.out.println("ord: " + ord); in
> codecs.blocktree.SegmentTermsEnum.getFrame(int).
> Please check the attached output of test program.txt.
>
> We found the root cause:
> FilterLeafReader.FilterTermsEnum does not override seekExact(BytesRef), so
> it falls back to the base class implementation,
> TermsEnum.seekExact(BytesRef), which is very inefficient in this case.
> {code:java}
> public boolean seekExact(BytesRef text) throws IOException {
>   return seekCeil(text) == SeekStatus.FOUND;
> }
> {code}
> The fix is simple: override seekExact(BytesRef) in
> FilterLeafReader.FilterTermsEnum.
> {code:java}
> @Override
> public boolean seekExact(BytesRef text) throws IOException {
>   return in.seekExact(text);
> }
> {code}
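
To make the delegation problem concrete, here is a minimal toy model of it. This is not Lucene code: ToyTermsEnum, CodecEnum, FilterWithoutOverride and FilterWithOverride are hypothetical stand-ins, and the call counters exist only to show which path a lookup takes. The sketch assumes only what the issue describes, namely that the base class implements seekExact via seekCeil and that the wrapped enum has its own, cheaper seekExact.

```java
import java.util.Arrays;

/** Toy model (not Lucene code) of why a filter enum should forward seekExact. */
final class SeekDelegationDemo {

    enum SeekStatus { FOUND, NOT_FOUND, END }

    /** Stand-in for TermsEnum: seekExact defaults to seekCeil, as quoted in the issue. */
    abstract static class ToyTermsEnum {
        abstract SeekStatus seekCeil(String text);

        boolean seekExact(String text) {
            return seekCeil(text) == SeekStatus.FOUND;
        }
    }

    /** Stand-in for a codec enum that has its own exact lookup. */
    static final class CodecEnum extends ToyTermsEnum {
        private final String[] sortedTerms;
        int ceilCalls = 0;   // how often the ceil path was taken
        int exactCalls = 0;  // how often the exact path was taken

        CodecEnum(String... terms) {
            sortedTerms = terms.clone();
            Arrays.sort(sortedTerms);
        }

        @Override
        SeekStatus seekCeil(String text) {
            ceilCalls++;
            // Scan for the smallest term that is >= text.
            for (String term : sortedTerms) {
                if (term.compareTo(text) >= 0) {
                    return term.equals(text) ? SeekStatus.FOUND : SeekStatus.NOT_FOUND;
                }
            }
            return SeekStatus.END;
        }

        @Override
        boolean seekExact(String text) {
            exactCalls++;
            return Arrays.binarySearch(sortedTerms, text) >= 0;
        }
    }

    /** Forwards seekCeil only, so seekExact falls back to the base class. */
    static final class FilterWithoutOverride extends ToyTermsEnum {
        private final ToyTermsEnum in;

        FilterWithoutOverride(ToyTermsEnum in) { this.in = in; }

        @Override
        SeekStatus seekCeil(String text) { return in.seekCeil(text); }
    }

    /** Forwards seekExact as well, mirroring the proposed fix. */
    static final class FilterWithOverride extends ToyTermsEnum {
        private final ToyTermsEnum in;

        FilterWithOverride(ToyTermsEnum in) { this.in = in; }

        @Override
        SeekStatus seekCeil(String text) { return in.seekCeil(text); }

        @Override
        boolean seekExact(String text) { return in.seekExact(text); }
    }

    public static void main(String[] args) {
        CodecEnum a = new CodecEnum("id-1", "id-2", "id-3");
        boolean foundA = new FilterWithoutOverride(a).seekExact("id-2");
        System.out.println("without override: found=" + foundA
            + " ceilCalls=" + a.ceilCalls + " exactCalls=" + a.exactCalls);

        CodecEnum b = new CodecEnum("id-1", "id-2", "id-3");
        boolean foundB = new FilterWithOverride(b).seekExact("id-2");
        System.out.println("with override: found=" + foundB
            + " ceilCalls=" + b.ceilCalls + " exactCalls=" + b.exactCalls);
    }
}
```

Both wrappers return the same answer. The difference is only in which method of the wrapped enum does the work: without the override the lookup goes through seekCeil, and with it the lookup reaches the wrapped enum's seekExact.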