[ 
https://issues.apache.org/jira/browse/LUCENE-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754976#comment-16754976
 ] 

Adrien Grand commented on LUCENE-8662:
--------------------------------------

[[email protected]] I agree that this index seems to have pathological 
ids, something like URLs maybe? Regarding performance, seekExact is still 
expected to perform better with regular IDs since it may save disk seeks in 
case no prefix of the looked up ID can be found in the terms index (this is why 
having increasing ids like incremental or time-based ids work well: new ids are 
unlikely to share a prefix with older ids). On the other hand, seekCeil has to 
move to the next term, which forces a lookup in the terms dict file (.tim).

> Override seekExact(BytesRef) in FilterLeafReader.FilterTermsEnum
> ----------------------------------------------------------------
>
>                 Key: LUCENE-8662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8662
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 5.5.5, 6.6.5, 7.6, 8.0
>            Reporter: jefferyyuan
>            Priority: Major
>              Labels: query
>             Fix For: 8.0, 7.7
>
>         Attachments: output of test program.txt
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Recently in our production, we found that Sole uses a lot of memory(more than 
> 10g) during recovery or commit for a small index (3.5gb)
>  The stack trace is:
>  
> {code:java}
> Thread 0x4d4b115c0 
>   at org.apache.lucene.store.DataInput.readVInt()I (DataInput.java:125) 
>   at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock()V 
> (SegmentTermsEnumFrame.java:157) 
>   at 
> org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.scanToTermNonLeaf(Lorg/apache/lucene/util/BytesRef;Z)Lorg/apache/lucene/index/TermsEnum$SeekStatus;
>  (SegmentTermsEnumFrame.java:786) 
>   at 
> org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.scanToTerm(Lorg/apache/lucene/util/BytesRef;Z)Lorg/apache/lucene/index/TermsEnum$SeekStatus;
>  (SegmentTermsEnumFrame.java:538) 
>   at 
> org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekCeil(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/index/TermsEnum$SeekStatus;
>  (SegmentTermsEnum.java:757) 
>   at 
> org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.seekCeil(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/index/TermsEnum$SeekStatus;
>  (FilterLeafReader.java:185) 
>   at 
> org.apache.lucene.index.TermsEnum.seekExact(Lorg/apache/lucene/util/BytesRef;)Z
>  (TermsEnum.java:74) 
>   at 
> org.apache.solr.search.SolrIndexSearcher.lookupId(Lorg/apache/lucene/util/BytesRef;)J
>  (SolrIndexSearcher.java:823) 
>   at 
> org.apache.solr.update.VersionInfo.getVersionFromIndex(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long;
>  (VersionInfo.java:204) 
>   at 
> org.apache.solr.update.UpdateLog.lookupVersion(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long;
>  (UpdateLog.java:786) 
>   at 
> org.apache.solr.update.VersionInfo.lookupVersion(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long;
>  (VersionInfo.java:194) 
>   at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(Lorg/apache/solr/update/AddUpdateCommand;)Z
>  (DistributedUpdateProcessor.java:1051)  
> {code}
> We reproduced the problem locally with the following code using Lucene code.
> {code:java}
> public static void main(String[] args) throws IOException {
>   FSDirectory index = FSDirectory.open(Paths.get("the-index"));
>   try (IndexReader reader = new   
> ExitableDirectoryReader(DirectoryReader.open(index),
> new QueryTimeoutImpl(1000 * 60 * 5))) {
>     String id = "the-id";
>     BytesRef text = new BytesRef(id);
>     for (LeafReaderContext lf : reader.leaves()) {
>       TermsEnum te = lf.reader().terms("id").iterator();
>       System.out.println(te.seekExact(text));
>     }
>   }
> }
> {code}
>  
> I added System.out.println("ord: " + ord); in 
> codecs.blocktree.SegmentTermsEnum.getFrame(int).
> Please check the attached output of test program.txt. 
>  
> We found out the root cause:
> we didn't implement seekExact(BytesRef) method in 
> FilterLeafReader.FilterTerms, so it uses the base class 
> TermsEnum.seekExact(BytesRef) implementation which is very inefficient in 
> this case.
> {code:java}
> public boolean seekExact(BytesRef text) throws IOException {
>   return seekCeil(text) == SeekStatus.FOUND;
> }
> {code}
> The fix is simple, just override seekExact(BytesRef) method in 
> FilterLeafReader.FilterTerms
> {code:java}
> @Override
> public boolean seekExact(BytesRef text) throws IOException {
>   return in.seekExact(text);
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to