RE: Slow Highlighter Performance Even Using FastVectorHighlighter
Andy,

OK, I get what you're doing. As far as alternate paths go, you could index normally and use WildcardQuery, but that wouldn't get you the boost on exact word matches. That makes me wonder whether there's a way to use edismax to combine the results of a wildcard search and a non-wildcard search against the same field, boosting the latter. I haven't looked into it, but it seems possible.

I am perplexed at this point by the poor highlight performance you're seeing, but we do have your profiling data, which suggests that you have a very large number of matches to contend with, so that's interesting.

At this point, faced with your issue, I would step my way through the FastVectorHighlighter code. About the first thing it does for each field is walk the terms in the document and retain only those that match some term in the query. It may be interesting to see the set of terms it ends up with: is it excessively large for some reason?

-- Bryan

-----Original Message-----
From: Andy Brown [mailto:andy_br...@rhoworld.com]
Sent: Friday, June 14, 2013 1:52 PM
To: solr-user@lucene.apache.org
Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter

Bryan,

For specifics, I'll refer you back to my original email, where I specified all the fields, field types, and handlers I use. Here's a general overview.

I really only have 3 fields that I index and search against: name, description, and content. All of them are just general text (string) fields. I have a catch-all field called text that is only used for querying. It's indexed but not stored. The name, description, and content fields are copied into the text field.

For partial word matching, I have 4 more fields: name_par, description_par, content_par, and text_par. The text_par field has the same relationship to the *_par fields as text does to the others (only used for querying). Those partial word matching fields are of type text_general_partial, which I created. That field type is analyzed differently from the regular text field in that it goes through an EdgeNGramFilterFactory with minGramSize=2 and maxGramSize=7 at index time.

I query against both the text and text_par fields using edismax as the defType, with my qf set to text^2 text_par^1 to give full word matches a higher score. This part returns very fast, as previously stated. It's when I turn on highlighting that I take the huge performance hit.

Again, I'm using the FastVectorHighlighter. The hl.fl is set to name name_par description description_par content content_par so that it returns highlights for full and partial word matches. All of those fields have indexed, stored, termPositions, termVectors, and termOffsets set to true.

It all seems redundant just to allow for partial word matching/highlighting, but I didn't know of a better way. Does anything stand out to you that could be the culprit? Let me know if you need any more clarification.

Thanks!

- Andy
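One way the edismax idea in the message above might look as a request is sketched below. This is an untested assumption, not something confirmed in the thread: it ORs an exact term, boosted with ^2, with a trailing-wildcard form of the same term against a single field, using standard Lucene query syntax. The helper names, the field name text, and the boost value are illustrative only, and whether highlighting behaves any better with the wildcard clause would still need measuring.

```python
from urllib.parse import urlencode


def exact_plus_prefix_query(field, word, exact_boost=2):
    """Build 'field:(word^N OR word*)' (hypothetical helper).

    The exact term carries the boost; the trailing-wildcard clause
    supplies the partial (prefix) matches without an n-gram field.
    """
    if not word.isalnum():
        # Keep the sketch simple: no escaping of query-syntax characters.
        raise ValueError("sketch only handles plain alphanumeric words")
    return f"{field}:({word}^{exact_boost} OR {word}*)"


def build_params(word):
    """Request parameters for an edismax search using the combined clause."""
    return {
        "defType": "edismax",
        "q": exact_plus_prefix_query("text", word),
        "fl": "id",
    }


if __name__ == "__main__":
    params = build_params("report")
    print(params["q"])
    print(urlencode(params))
```

The point of the design is that only one normally analyzed field is needed, so the *_par fields and their term vectors could in principle be dropped if this approach gives acceptable ranking.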
RE: Slow Highlighter Performance Even Using FastVectorHighlighter
Also, in your position, I would be very curious what would happen to highlighting performance if I just took the EdgeNGramFilter out of the analysis chain and reindexed. That would immediately tell you whether the problem lives there.

-- Bryan

-----Original Message-----
From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
Sent: Tuesday, June 18, 2013 5:16 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter

Andy,

OK, I get what you're doing. As far as alternate paths go, you could index normally and use WildcardQuery, but that wouldn't get you the boost on exact word matches. That makes me wonder whether there's a way to use edismax to combine the results of a wildcard search and a non-wildcard search against the same field, boosting the latter. I haven't looked into it, but it seems possible.

I am perplexed at this point by the poor highlight performance you're seeing, but we do have your profiling data, which suggests that you have a very large number of matches to contend with, so that's interesting.

At this point, faced with your issue, I would step my way through the FastVectorHighlighter code. About the first thing it does for each field is walk the terms in the document and retain only those that match some term in the query. It may be interesting to see the set of terms it ends up with: is it excessively large for some reason?

-- Bryan
Re: Slow Highlighter Performance Even Using FastVectorHighlighter
Hi Michael,

How do I configure the PostingsHighlighter on my Solr 4.2 box? Please kindly point me in the right direction. Many thanks.

On 2013/6/15 at 10:48 PM, Michael McCandless luc...@mikemccandless.com wrote:

> You could also try the new[ish] PostingsHighlighter: http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
>
> Mike McCandless
> http://blog.mikemccandless.com
Re: Slow Highlighter Performance Even Using FastVectorHighlighter
If you have very large documents (many MB), that can lead to slow highlighting, even with FVH. See https://issues.apache.org/jira/browse/LUCENE-3234 and try setting phraseLimit=1 (or some bigger number, but not infinite, which is the default).

-Mike

On 6/14/13 4:52 PM, Andy Brown wrote:

> Bryan,
>
> For specifics, I'll refer you back to my original email, where I specified all the fields, field types, and handlers I use. Here's a general overview.
>
> I really only have 3 fields that I index and search against: name, description, and content. All of them are just general text (string) fields. I have a catch-all field called text that is only used for querying. It's indexed but not stored. The name, description, and content fields are copied into the text field.
>
> For partial word matching, I have 4 more fields: name_par, description_par, content_par, and text_par. The text_par field has the same relationship to the *_par fields as text does to the others (only used for querying). Those partial word matching fields are of type text_general_partial, which I created. That field type is analyzed differently from the regular text field in that it goes through an EdgeNGramFilterFactory with minGramSize=2 and maxGramSize=7 at index time.
>
> I query against both the text and text_par fields using edismax as the defType, with my qf set to text^2 text_par^1 to give full word matches a higher score. This part returns very fast, as previously stated. It's when I turn on highlighting that I take the huge performance hit.
>
> Again, I'm using the FastVectorHighlighter. The hl.fl is set to name name_par description description_par content content_par so that it returns highlights for full and partial word matches. All of those fields have indexed, stored, termPositions, termVectors, and termOffsets set to true.
>
> It all seems redundant just to allow for partial word matching/highlighting, but I didn't know of a better way. Does anything stand out to you that could be the culprit? Let me know if you need any more clarification.
>
> Thanks!
>
> - Andy
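A sketch of how the phraseLimit suggestion above might be passed on a Solr request, assuming the hl.phraseLimit and hl.useFastVectorHighlighter request parameters are available in the Solr version in use (worth checking against its highlighting documentation). The host, core name, and query term are placeholders; the qf and hl.fl values are the ones described in Andy's message.

```python
from urllib.parse import urlencode

# Placeholder endpoint: host, port, and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"

# Highlight fields as listed in the thread (whole-word and partial-word).
HIGHLIGHT_FIELDS = [
    "name", "name_par",
    "description", "description_par",
    "content", "content_par",
]


def highlight_params(query, phrase_limit=1):
    """Request parameters for an FVH highlight query with a capped phrase limit."""
    return {
        "q": query,
        "defType": "edismax",
        "qf": "text^2 text_par^1",
        "fl": "id",
        "hl": "true",
        "hl.useFastVectorHighlighter": "true",
        "hl.fl": " ".join(HIGHLIGHT_FIELDS),
        "hl.phraseLimit": str(phrase_limit),
    }


def request_url(query, phrase_limit=1):
    """Full select URL with the parameters URL-encoded."""
    return SOLR_SELECT_URL + "?" + urlencode(highlight_params(query, phrase_limit))


if __name__ == "__main__":
    print(request_url("report"))
```

Timing the same query with a few different phrase_limit values would show whether the limit is the bottleneck before changing anything in the schema.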
Re: Slow Highlighter Performance Even Using FastVectorHighlighter
You could also try the new[ish] PostingsHighlighter: http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html

Mike McCandless
http://blog.mikemccandless.com

On Sat, Jun 15, 2013 at 8:50 AM, Michael Sokolov msoko...@safaribooksonline.com wrote:

> If you have very large documents (many MB), that can lead to slow highlighting, even with FVH. See https://issues.apache.org/jira/browse/LUCENE-3234 and try setting phraseLimit=1 (or some bigger number, but not infinite, which is the default).
>
> -Mike
RE: Slow Highlighter Performance Even Using FastVectorHighlighter
Bryan,

For specifics, I'll refer you back to my original email, where I specified all the fields, field types, and handlers I use. Here's a general overview.

I really only have 3 fields that I index and search against: name, description, and content. All of them are just general text (string) fields. I have a catch-all field called text that is only used for querying. It's indexed but not stored. The name, description, and content fields are copied into the text field.

For partial word matching, I have 4 more fields: name_par, description_par, content_par, and text_par. The text_par field has the same relationship to the *_par fields as text does to the others (only used for querying). Those partial word matching fields are of type text_general_partial, which I created. That field type is analyzed differently from the regular text field in that it goes through an EdgeNGramFilterFactory with minGramSize=2 and maxGramSize=7 at index time.

I query against both the text and text_par fields using edismax as the defType, with my qf set to text^2 text_par^1 to give full word matches a higher score. This part returns very fast, as previously stated. It's when I turn on highlighting that I take the huge performance hit.

Again, I'm using the FastVectorHighlighter. The hl.fl is set to name name_par description description_par content content_par so that it returns highlights for full and partial word matches. All of those fields have indexed, stored, termPositions, termVectors, and termOffsets set to true.

It all seems redundant just to allow for partial word matching/highlighting, but I didn't know of a better way. Does anything stand out to you that could be the culprit? Let me know if you need any more clarification.

Thanks!

- Andy

-----Original Message-----
From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
Sent: Wednesday, May 29, 2013 5:44 PM
To: solr-user@lucene.apache.org
Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter

Andy,

> I don't understand why it's taking 7 secs to return highlights. The size of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to 1024 for this verification purpose and that should be more than enough. The processor is plenty powerful enough as well.
>
> Running VisualVM shows all my CPU time being taken by mainly these 3 methods:
>
> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()

That is a strange and interesting set of things to be spending most of your CPU time on. The implication, I think, is that the number of term matches in the document for terms in your query (or, at least, terms matching exact words or the beginning of phrases in your query) is extremely high. Perhaps that's coming from this partial word match you mention. How does that work?

-- Bryan

> My guess is that this has something to do with how I'm handling partial word matches/highlighting. I have set up another request handler that only searches the whole word fields and it returns in 850 ms with highlighting. Any ideas?
>
> - Andy
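A minimal sketch of what the index-time edge n-gram step described in Andy's message does to the number of indexed terms, and why a short query term can match far more positions in a *_par field than in its whole-word counterpart. This is a toy model, not Solr's analysis chain: whitespace tokenization and lowercasing stand in for the real analyzer, and the function names and sample sentence are made up for illustration. Only minGramSize=2 and maxGramSize=7 come from the thread.

```python
def edge_ngrams(word, min_gram=2, max_gram=7):
    """Leading-edge n-grams of one token, shortest first."""
    longest = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, longest + 1)]


def index_terms(text, partial):
    """(term, start_offset) pairs for a lowercased, whitespace-split text.

    With partial=True every token is expanded into its edge n-grams,
    all sharing the token's start offset.
    """
    lowered = text.lower()
    terms = []
    cursor = 0
    for token in lowered.split():
        start = lowered.index(token, cursor)
        cursor = start + len(token)
        grams = edge_ngrams(token) if partial else [token]
        terms.extend((gram, start) for gram in grams)
    return terms


def count_matches(terms, query_term):
    """Number of indexed term occurrences equal to the query term."""
    return sum(1 for term, _ in terms if term == query_term)


if __name__ == "__main__":
    sample = "the project process provides progress reports for the program"
    whole = index_terms(sample, partial=False)
    grams = index_terms(sample, partial=True)
    print("indexed terms (whole word):", len(whole))
    print("indexed terms (edge n-gram):", len(grams))
    print("matches for 'pro' (whole word):", count_matches(whole, "pro"))
    print("matches for 'pro' (edge n-gram):", count_matches(grams, "pro"))
```

In this model every word beginning with the query string counts as a match in the n-gram field, which is one plausible source of the very large match counts discussed in the thread.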
RE: Slow Highlighter Performance Even Using FastVectorHighlighter
Andy,

> I don't understand why it's taking 7 secs to return highlights. The size of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to 1024 for this verification purpose and that should be more than enough. The processor is plenty powerful enough as well.
>
> Running VisualVM shows all my CPU time being taken by mainly these 3 methods:
>
> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()

That is a strange and interesting set of things to be spending most of your CPU time on. The implication, I think, is that the number of term matches in the document for terms in your query (or, at least, terms matching exact words or the beginning of phrases in your query) is extremely high. Perhaps that's coming from this partial word match you mention. How does that work?

-- Bryan

> My guess is that this has something to do with how I'm handling partial word matches/highlighting. I have set up another request handler that only searches the whole word fields and it returns in 850 ms with highlighting. Any ideas?
>
> - Andy

-----Original Message-----
From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
Sent: Monday, May 20, 2013 1:39 PM
To: solr-user@lucene.apache.org
Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter

My guess is that the problem is those 200 MB documents. FastVectorHighlighter is fast at deciding whether a match, especially a phrase, appears in a document, but it still starts out by walking the entire list of term vectors, and ends by breaking the document into candidate-snippet fragments. Both processes are proportional to the length of the document.

It's hard to do much about the first, but for the second you could choose to expose FastVectorHighlighter's FieldPhraseList representation and return offsets to the caller rather than fragments, building up your own snippets from a separate store of indexed files. This would also permit you to set stored=false, improving your memory/core size ratio, which I'm guessing could use some improving. It would require some work, and it would require you to store a representation of what was indexed outside the Solr core, in some constant-bytes-to-character representation that you can use offsets with (e.g. UTF-16, or ASCII+entity references).

However, you may not need to do this. It may be that you just need more memory for your search machine. Not JVM memory, but memory that the O/S can use as a file cache. What do you have now? That is, how much memory do you have that is not used by the JVM or other apps, and how big is your Solr core?

One way to start getting a handle on where time is being spent is to set up VisualVM. Turn on CPU sampling, send in a bunch of the slow highlight queries, and look at where the time is being spent. If it's mostly in methods that are just reading from disk, buy more memory. If you're on Linux, look at what top is telling you. If the CPU usage is low and the wa number is above 1% more often than not, buy more memory (I don't know why that wa number makes sense, I just know that it has been a good rule of thumb for us).

-- Bryan

-----Original Message-----
From: Andy Brown [mailto:andy_br...@rhoworld.com]
Sent: Monday, May 20, 2013 9:53 AM
To: solr-user@lucene.apache.org
Subject: Slow Highlighter Performance Even Using FastVectorHighlighter

I'm providing a search feature in a web app that searches for documents that range in size from 1 KB to 200 MB, of varying MIME types (PDF, DOC, etc). Currently there are about 3000 documents and this will continue to grow. I'm providing full word search and partial word search. For each document, there are three source fields that I'm interested in searching and highlighting on: name, description, and content. Since I'm providing both full and partial word search, I've created additional fields that get tokenized differently: name_par, description_par, and content_par. Those are indexed and stored as well for querying and highlighting. As suggested in the Solr wiki, I've got two catch-all fields, text and text_par, for faster querying.

An average search results page displays 25 results and I provide paging. I'm just returning the doc ID in my Solr search results and response times have been quite good (1 to 10 ms). The problem in performance occurs when I turn on highlighting. I'm already using the FastVectorHighlighter and, depending on the query, it has taken as long as 15 seconds to get the highlight snippets. However, this isn't always the case. Certain query terms result in 1 sec or less response time. In any case, 15 seconds is way too long.

I'm fairly new to Solr but I've spent days coming up with what
RE: Slow Highlighter Performance Even Using FastVectorHighlighter
After taking your advice on profiling, I didn't see any memory issues. I wanted to verify this with a small set of data. So I created a new sandbox core with the exact same schema and config file settings. I indexed only 25 PDF documents with an average size of 2.8 MB, the largest is approx 5 MB (39 pages). I run the exact same query on that core and I'm seeing response times of 7 secs or more. Without highlighting the response is usually 1 ms.

I don't understand why it's taking 7 secs to return highlights. The size of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to 1024 for this verification purpose and that should be more than enough. The processor is plenty powerful enough as well. Running VisualVM shows all my CPU time being taken by mainly these 3 methods:

org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()

My guess is that this has something to do with how I'm handling partial word matches/highlighting. I have set up another request handler that only searches the whole word fields and it returns in 850 ms with highlighting. Any ideas?

- Andy

-Original Message-
From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
Sent: Monday, May 20, 2013 1:39 PM
To: solr-user@lucene.apache.org
Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter

My guess is that the problem is those 200 MB documents. FastVectorHighlighter is fast at deciding whether a match, especially a phrase, appears in a document, but it still starts out by walking the entire list of term vectors, and ends by breaking the document into candidate-snippet fragments, both processes that are proportional to the length of the document.
RE: Slow Highlighter Performance Even Using FastVectorHighlighter
My guess is that the problem is those 200 MB documents. FastVectorHighlighter is fast at deciding whether a match, especially a phrase, appears in a document, but it still starts out by walking the entire list of term vectors, and ends by breaking the document into candidate-snippet fragments, both processes that are proportional to the length of the document.

It's hard to do much about the first, but for the second you could choose to expose FastVectorHighlighter's FieldPhraseList representation, and return offsets to the caller rather than fragments, building up your own snippets from a separate store of indexed files. This would also permit you to set stored=false, improving your memory/core size ratio, which I'm guessing could use some improving. It would require some work, and it would require you to store a representation of what was indexed outside the Solr core, in some constant-bytes-to-character representation that you can use offsets with (e.g. UTF-16, or ASCII+entity references).

However, you may not need to do this -- it may be that you just need more memory for your search machine. Not JVM memory, but memory that the O/S can use as a file cache. What do you have now? That is, how much memory do you have that is not used by the JVM or other apps, and how big is your Solr core?

One way to start getting a handle on where time is being spent is to set up VisualVM. Turn on CPU sampling, send in a bunch of the slow highlight queries, and look at where the time is being spent. If it's mostly in methods that are just reading from disk, buy more memory. If you're on Linux, look at what top is telling you. If the CPU usage is low and the wa number is above 1% more often than not, buy more memory (I don't know why that wa number makes sense, I just know that it has been a good rule of thumb for us).
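The suggestion above, returning offsets and building snippets from a separately stored copy of the text, might look roughly like the sketch below. It is not taken from the thread: build_snippet and its parameters are hypothetical, and it assumes the caller already has character offsets that line up with the stored text. Python indexes strings by code point, so offsets produced on the Java side (which count UTF-16 code units) would need converting for text that contains characters outside the Basic Multilingual Plane.

```python
def build_snippet(text, start, end, context=40, pre="<em>", post="</em>"):
    """Wrap text[start:end] in markers, with up to `context` chars per side.

    Assumption: `start` and `end` are character offsets into `text`,
    the externally stored copy of what was indexed.
    """
    left = max(0, start - context)
    right = min(len(text), end + context)
    return text[left:start] + pre + text[start:end] + post + text[end:right]


stored_text = "FastVectorHighlighter walks the term vectors of each document."
start = stored_text.index("term")
end = start + len("term vectors")
print(build_snippet(stored_text, start, end, context=10))
# walks the <em>term vectors</em> of each d
```

Because the snippet is cut straight from the external copy, its cost depends on the snippet size rather than on the length of the document.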
-- Bryan

-Original Message-
From: Andy Brown [mailto:andy_br...@rhoworld.com]
Sent: Monday, May 20, 2013 9:53 AM
To: solr-user@lucene.apache.org
Subject: Slow Highlighter Performance Even Using FastVectorHighlighter

I'm providing a search feature in a web app that searches for documents that range in size from 1KB to 200MB of varying MIME types (PDF, DOC, etc). Currently there are about 3000 documents and this will continue to grow. I'm providing full word search and partial word search.

For each document, there are three source fields that I'm interested in searching and highlighting on: name, description, and content. Since I'm providing both full and partial word search, I've created additional fields that get tokenized differently: name_par, description_par, and content_par. Those are indexed and stored as well for querying and highlighting. As suggested in the Solr wiki, I've got two catch-all fields, text and text_par, for faster querying.

An average search results page displays 25 results and I provide paging. I'm just returning the doc ID in my Solr search results and response times have been quite good (1 to 10 ms). The problem in performance occurs when I turn on highlighting. I'm already using the FastVectorHighlighter and depending on the query, it has taken as long as 15 seconds to get the highlight snippets. However, this isn't always the case. Certain query terms result in 1 sec or less response time. In any case, 15 seconds is way too long.

I'm fairly new to Solr but I've spent days coming up with what I've got so far. Feel free to correct any misconceptions I have. Can anyone advise me on what I'm doing wrong or offer a better way to set up my core to improve highlighting performance?

A typical query would look like:

/select?q=foo&start=0&rows=25&fl=id&hl=true

I'm using Solr 4.1.
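For anyone reproducing the typical query, a minimal sketch of assembling the request path with standard URL encoding follows. The helper name highlight_query_path is hypothetical; the parameter names (q, start, rows, fl, hl) are the ones given in the message, and nothing else about the request handler is assumed.

```python
from urllib.parse import urlencode


def highlight_query_path(term, start=0, rows=25):
    """Build the /select path for a highlighted search that returns only ids."""
    params = {"q": term, "start": start, "rows": rows, "fl": "id", "hl": "true"}
    # urlencode joins the pairs with '&' and escapes reserved characters.
    return "/select?" + urlencode(params)


print(highlight_query_path("foo"))
# /select?q=foo&start=0&rows=25&fl=id&hl=true
```

Letting urlencode escape the term also keeps multi-word or punctuated queries from corrupting the parameter list.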
Below are the relevant core schema and config details:

<!-- Misc fields -->
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>

<!-- Fields for whole word matches -->
<field name="name" type="text_general" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>
<field name="description" type="text_general" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

<!-- Fields for partial word matches -->
<field name="name_par" type="text_general_partial" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>
<field name="description_par" type="text_general_partial" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>
<field name="content_par" type="text_general_partial" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>
<field name="text_par" type="text_general_partial" indexed="true"