[ https://issues.apache.org/jira/browse/SOLR-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Smiley resolved SOLR-5855.
--------------------------------
    Resolution: Fixed

Thanks for finding the problem and the initial patch, [~ddebray].

It would be great if those who have benchmarked could try again with this patch (or by pulling from branch 5x since it's committed) -- just to be sure it's working well. The 5.2 release branch is going to be cut later today.

> re-use document term-vector Fields instance across fields in the DefaultSolrHighlighter
> ---------------------------------------------------------------------------------------
>
>                 Key: SOLR-5855
>                 URL: https://issues.apache.org/jira/browse/SOLR-5855
>             Project: Solr
>          Issue Type: Improvement
>          Components: highlighter
>    Affects Versions: Trunk
>            Reporter: Daniel Debray
>            Assignee: David Smiley
>             Fix For: 5.2
>
>         Attachments: SOLR-5855-without-cache.patch, SOLR-5855_with_FVH_support.patch, SOLR-5855_with_FVH_support.patch, highlight.patch
>
>
> Hi folks,
> while investigating possible performance bottlenecks in the highlight component I discovered two places where we can save some CPU cycles.
> Both are in the class org.apache.solr.highlight.DefaultSolrHighlighter.
> First, in the method doHighlighting (lines 411-417):
> In the loop we try to highlight every field that has been resolved from the params on each document. OK, but why not skip those fields that are not present on the current document?
> So I changed the code from:
>     for (String fieldName : fieldNames) {
>       fieldName = fieldName.trim();
>       if( useFastVectorHighlighter( params, schema, fieldName ) )
>         doHighlightingByFastVectorHighlighter( fvh, fieldQuery, req, docSummaries, docId, doc, fieldName );
>       else
>         doHighlightingByHighlighter( query, req, docSummaries, docId, doc, fieldName );
>     }
> to:
>     for (String fieldName : fieldNames) {
>       fieldName = fieldName.trim();
>       // Skip fields that are not present on the current document.
>       if (doc.get(fieldName) != null) {
>         if( useFastVectorHighlighter( params, schema, fieldName ) )
>           doHighlightingByFastVectorHighlighter( fvh, fieldQuery, req, docSummaries, docId, doc, fieldName );
>         else
>           doHighlightingByHighlighter( query, req, docSummaries, docId, doc, fieldName );
>       }
>     }
> The second place is where we try to retrieve the TokenStream from the document for a specific field.
> line 472:
>     TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(searcher.getIndexReader(), docId, fieldName);
> where:
>     public static TokenStream getTokenStreamWithOffsets(IndexReader reader, int docId, String field) throws IOException {
>       Fields vectors = reader.getTermVectors(docId);
>       if (vectors == null) {
>         return null;
>       }
>       Terms vector = vectors.terms(field);
>       if (vector == null) {
>         return null;
>       }
>       if (!vector.hasPositions() || !vector.hasOffsets()) {
>         return null;
>       }
>       return getTokenStream(vector);
>     }
> Keep in mind that we currently hit the IndexReader n times, where n = requested rows (documents) * number of requested highlight fields.
> In my use case reader.getTermVectors(docId) takes around 150,000 to 250,000 ns on a warm Solr and 1,100,000 ns on a cold Solr.
> If we store the returned Fields vectors in a cache, these lookups take only 25,000 ns.
> I would suggest something like the following code in the doHighlightingByHighlighter method in the DefaultSolrHighlighter class (line 472):
>     Fields vectors = null;
>     SolrCache termVectorCache = searcher.getCache("termVectorCache");
>     if (termVectorCache != null) {
>       // Optional cache configured: look up the term vectors by document id first.
>       vectors = (Fields) termVectorCache.get(Integer.valueOf(docId));
>       if (vectors == null) {
>         vectors = searcher.getIndexReader().getTermVectors(docId);
>         if (vectors != null) termVectorCache.put(Integer.valueOf(docId), vectors);
>       }
>     } else {
>       // No cache configured: fall back to reading from the index.
>       vectors = searcher.getIndexReader().getTermVectors(docId);
>     }
>     TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(vectors, fieldName);
> and in the TokenSources class:
>     public static TokenStream getTokenStreamWithOffsets(Fields vectors, String field) throws IOException {
>       if (vectors == null) {
>         return null;
>       }
>       Terms vector = vectors.terms(field);
>       if (vector == null) {
>         return null;
>       }
>       if (!vector.hasPositions() || !vector.hasOffsets()) {
>         return null;
>       }
>       return getTokenStream(vector);
>     }
> 4000ms on 1000 docs without cache
> 639ms on 1000 docs with cache
> 102ms on 30 docs without cache
> 22ms on 30 docs with cache
> on an index with 190,000 docs, a numFound of 32,000, and 80 different highlight fields.
> I think queries with only one field to highlight per document do not benefit that much from a cache like this; that's why I think an optional cache would be the best solution there.
> As far as I can see, the FastVectorHighlighter uses more or less the same approach and could also benefit from this cache.
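
The lookup count described above (n = rows * highlight fields) can be illustrated with a small, self-contained sketch. It is not the committed patch and does not use Lucene or Solr classes: `TermVectorReuseSketch`, `CountingLoader`, and the field names are hypothetical stand-ins, and the loader only imitates an expensive per-document call such as reader.getTermVectors(docId). The sketch compares one lookup per (document, field) pair against one lookup per document that is reused across fields, which is the idea named in the issue title.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermVectorReuseSketch {

    /** Hypothetical stand-in for an expensive per-document term-vector lookup; counts its calls. */
    public static final class CountingLoader {
        public int calls = 0;

        public Map<String, String> load(int docId) {
            calls++;
            // Assumption for illustration: every document has a "title" and a "body" vector.
            Map<String, String> vectors = new HashMap<>();
            vectors.put("title", "tv-title-" + docId);
            vectors.put("body", "tv-body-" + docId);
            return vectors;
        }
    }

    /** One lookup per (document, field) pair, as in the loop quoted in the issue. */
    public static int highlightWithoutReuse(int[] docIds, List<String> fields, CountingLoader loader) {
        int highlighted = 0;
        for (int docId : docIds) {
            for (String field : fields) {
                Map<String, String> vectors = loader.load(docId);
                if (vectors.get(field) != null) {
                    highlighted++;
                }
            }
        }
        return highlighted;
    }

    /** One lookup per document; the result is reused for every requested field. */
    public static int highlightWithReuse(int[] docIds, List<String> fields, CountingLoader loader) {
        int highlighted = 0;
        for (int docId : docIds) {
            Map<String, String> vectors = loader.load(docId);
            for (String field : fields) {
                if (vectors.get(field) != null) {
                    highlighted++;
                }
            }
        }
        return highlighted;
    }

    public static void main(String[] args) {
        int[] docIds = {1, 2, 3};
        List<String> fields = Arrays.asList("title", "body", "missing");

        CountingLoader perField = new CountingLoader();
        CountingLoader perDocument = new CountingLoader();

        int a = highlightWithoutReuse(docIds, fields, perField);
        int b = highlightWithReuse(docIds, fields, perDocument);

        System.out.println("without reuse: " + perField.calls + " lookups, " + a + " fields highlighted");
        System.out.println("with reuse:    " + perDocument.calls + " lookups, " + b + " fields highlighted");
    }
}
```

Both variants highlight the same fields; only the number of lookups differs (rows * fields versus rows). With the figures from the report, 1000 rows and 80 highlight fields would mean up to 80,000 lookups without reuse and 1000 with it.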