[jira] [Commented] (SOLR-5855) re-use document term-vector Fields instance across fields in the DefaultSolrHighlighter

2015-06-09 Thread Ere Maijala (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578437#comment-14578437
 ] 

Ere Maijala commented on SOLR-5855:
---

I was hoping, perhaps naively, that this would help with the highlighter 
performance problems we're having with Solr 5. Unfortunately it doesn't seem 
to have made a difference. Using hl.usePhraseHighlighter=false has a 
significant effect, but it has obvious downsides and is still much slower than 
4.10.2.

For what it's worth, here is some additional information:

Timing from Solr 4.10.2 (42.5 million records):

process: {
  time: 1711,
  query: { time: 0 },
  facet: { time: 66 },
  mlt: { time: 0 },
  highlight: { time: 708 },
  stats: { time: 0 },
  expand: { time: 0 },
  spellcheck: { time: 433 },
  debug: { time: 503 }
}

Timing from Solr 5.2.0 (38.8 million records):

process: {
  time: 10172,
  query: { time: 0 },
  facet: { time: 45 },
  facet_module: { time: 0 },
  mlt: { time: 0 },
  highlight: { time: 9310 },
  stats: { time: 0 },
  expand: { time: 0 },
  spellcheck: { time: 345 },
  debug: { time: 472 }
}

A couple of jstack outputs during the query execution are here: 
http://pastebin.com/8FJiq5R3. The schema and solrconfig are at 
https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf. 

 re-use document term-vector Fields instance across fields in the 
 DefaultSolrHighlighter
 ---

 Key: SOLR-5855
 URL: https://issues.apache.org/jira/browse/SOLR-5855
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: Trunk
Reporter: Daniel Debray
Assignee: David Smiley
 Fix For: 5.2

 Attachments: SOLR-5855-without-cache.patch, 
 SOLR-5855_with_FVH_support.patch, SOLR-5855_with_FVH_support.patch, 
 highlight.patch


 Hi folks,
 while investigating possible performance bottlenecks in the highlight 
 component I discovered two places where we can save some CPU cycles.
 Both are in the class org.apache.solr.highlight.DefaultSolrHighlighter.
 First, in method doHighlighting (lines 411-417):
 In the loop we try to highlight every field that has been resolved from the 
 params on each document. OK, but why not skip those fields that are not 
 present on the current document?
 So I changed the code from:

 for (String fieldName : fieldNames) {
   fieldName = fieldName.trim();
   if( useFastVectorHighlighter( params, schema, fieldName ) )
     doHighlightingByFastVectorHighlighter( fvh, fieldQuery, req, docSummaries, docId, doc, fieldName );
   else
     doHighlightingByHighlighter( query, req, docSummaries, docId, doc, fieldName );
 }

 to:

 for (String fieldName : fieldNames) {
   fieldName = fieldName.trim();
   if (doc.get(fieldName) != null) {
     if( useFastVectorHighlighter( params, schema, fieldName ) )
       doHighlightingByFastVectorHighlighter( fvh, fieldQuery, req, docSummaries, docId, doc, fieldName );
     else
       doHighlightingByHighlighter( query, req, docSummaries, docId, doc, fieldName );
   }
 }

 The second place is where we try to retrieve the TokenStream from the 
 document for a specific field (line 472):

 TokenStream tvStream =
     TokenSources.getTokenStreamWithOffsets(searcher.getIndexReader(), docId, fieldName);

 where

 public static TokenStream getTokenStreamWithOffsets(IndexReader reader, int docId, String field) throws IOException {
   Fields vectors = reader.getTermVectors(docId);
   if (vectors == null) {
     return null;
   }
   Terms vector = vectors.terms(field);
   if (vector == null) {
     return null;
   }
   if (!vector.hasPositions() || !vector.hasOffsets()) {
     return null;
   }
   return getTokenStream(vector);
 }

 Keep in mind that we currently hit the IndexReader n times, where n = 
 requested rows (documents) * requested number of highlight fields.
 In my use case reader.getTermVectors(docId) takes around 150,000-250,000 ns 
 on a warm Solr and 1,100,000 ns on a cold Solr.
 If we store the returned Fields vectors in a cache, these lookups only take 
 about 25,000 ns.
 I would suggest something like the following code in the 
 doHighlightingByHighlighter method in the DefaultSolrHighlighter class 
 (line 472):

 Fields vectors = null;
 SolrCache termVectorCache = searcher.getCache("termVectorCache");
 if (termVectorCache != null) {
   vectors = (Fields) termVectorCache.get(Integer.valueOf(docId));
   if (vectors == null) {
     vectors = searcher.getIndexReader().getTermVectors(docId);
     if (vectors != null) termVectorCache.put(Integer.valueOf(docId), vectors);
   }
 } else {
   vectors = searcher.getIndexReader().getTermVectors(docId);
 }
 TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(vectors, fieldName);

 and in the TokenSources class:

 public static TokenStream getTokenStreamWithOffsets(Fields vectors, String field) throws IOException {
   if (vectors == null) {
     return null;
   }
   Terms vector = vectors.terms(field);
   if (vector == null) {
     return null;
   }
   if (!vector.hasPositions() || !vector.hasOffsets()) {
     return null;
   }
   return getTokenStream(vector);
 }

 Some timings:
 4000 ms on 1000 docs without cache
 639 ms on 1000 docs with cache
 102 ms on 30 docs without cache
 22 ms on 30 docs with cache
 on an index with 190,000 docs, a numFound of 32,000 and 80 different 
 highlight fields.
 I think queries with only one field to highlight per document do not 
 benefit that much from a cache like this, which is why I think an optional 
 cache would be the best solution here.
 As far as I saw, the FastVectorHighlighter uses more or less the same 
 approach and could also benefit from this cache.
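The caching idea in the description above can be sketched independently of Lucene: fetch an expensive per-document value once, then reuse it across all the highlighted fields of that document. A minimal sketch in plain Java; `TermVectorFetcher` and the `String` stand-in for Lucene's `Fields` are illustrative names, not Solr API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Per-request memoization of an expensive per-document lookup, as the
// patch proposes for IndexReader.getTermVectors(docId): the first field
// highlighted on a document pays for the fetch, subsequent fields reuse it.
class TermVectorFetcher {
    private final Map<Integer, String> cache = new HashMap<>();
    private final IntFunction<String> reader; // stands in for IndexReader.getTermVectors
    int readerCalls = 0;                      // counts the expensive lookups

    TermVectorFetcher(IntFunction<String> reader) {
        this.reader = reader;
    }

    String getTermVectors(int docId) {
        // computeIfAbsent only invokes the reader on a cache miss
        return cache.computeIfAbsent(docId, id -> {
            readerCalls++;
            return reader.apply(id);
        });
    }
}
```

A real Solr cache would additionally need to be bounded and tied to the searcher's lifetime, which is presumably why the description suggests making it an optional, configurable cache.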

[jira] [Commented] (SOLR-5855) re-use document term-vector Fields instance across fields in the DefaultSolrHighlighter

2015-06-09 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579443#comment-14579443
 ] 

David Smiley commented on SOLR-5855:


I was initially skeptical that the stack traces would show anything of interest, 
but I am pleasantly mistaken.  Apparently, getting the FieldInfos from 
SlowCompositeReaderWrapper is a bottleneck.  We look this up to determine 
whether there are payloads or not, so that we can then tell MemoryIndex to 
capture them as well.  FYI, the call to get this was added recently in SOLR-6916 
(highlighting using payloads); it's not related to term vectors, the subject of 
this issue.

Can you please download the 5x branch, comment out the 
{{scorer.getUsePayloads(...}} line (or set it to true if you want), and see how 
it performs?
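The suggested experiment amounts to a one-line local edit. A hedged sketch of what that could look like; the exact line and variable name in DefaultSolrHighlighter are assumptions based on the comment above, not verified against the 5x source:

 // In DefaultSolrHighlighter on the 5x branch (hypothetical surrounding code);
 // the expensive SlowCompositeReaderWrapper FieldInfos lookup feeds this call.
 //
 // boolean usePayloads = scorer.getUsePayloads(...);  // original line, commented out
 boolean usePayloads = true;  // hard-coded for the test, bypassing the lookup

This only measures whether the FieldInfos lookup is the bottleneck; it is not a fix.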


[jira] [Commented] (SOLR-5855) re-use document term-vector Fields instance across fields in the DefaultSolrHighlighter

2015-06-09 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579480#comment-14579480
 ] 

David Smiley commented on SOLR-5855:


[~emaijala] I created an issue for this; please discuss further there: SOLR-7655

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional 

[jira] [Commented] (SOLR-5855) re-use document term-vector Fields instance across fields in the DefaultSolrHighlighter

2015-05-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554272#comment-14554272
 ] 

ASF subversion and git services commented on SOLR-5855:
---

Commit 1680871 from [~dsmiley] in branch 'dev/trunk'
[ https://svn.apache.org/r1680871 ]

SOLR-5855:  Re-use the document's term vectors in DefaultSolrHighlighter.
Also refactored DefaultSolrHighlighter's methods to be a little nicer.


[jira] [Commented] (SOLR-5855) re-use document term-vector Fields instance across fields in the DefaultSolrHighlighter

2015-05-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554282#comment-14554282
 ] 

ASF subversion and git services commented on SOLR-5855:
---

Commit 1680872 from [~dsmiley] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1680872 ]

SOLR-5855: Re-use the document's term vectors in DefaultSolrHighlighter. Also 
refactored DefaultSolrHighlighter's methods to be a little nicer.


[jira] [Commented] (SOLR-5855) re-use document term-vector Fields instance across fields in the DefaultSolrHighlighter

2015-04-03 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394454#comment-14394454
 ] 

David Smiley commented on SOLR-5855:


Another thing that should be done is to figure out how to avoid grabbing the 
term vector Fields altogether if none of the fields to highlight have term 
vectors in the first place.
