Also, in your position, I would be very curious to see what happens to highlighting performance if you simply take the EdgeNGramFilter out of the analysis chain and reindex. That would tell you right away whether the problem lives there.
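
A minimal sketch of that experiment, assuming the text_general_partial field type from Andy's original schema (quoted further down the thread). Everything is copied from the posted schema except that the EdgeNGramFilterFactory line in the index-time analyzer is commented out. It is meant as a diagnostic configuration only, not a fix, since partial word matching will stop working while the filter is disabled.

```xml
<!-- Diagnostic variant of text_general_partial: EdgeNGramFilterFactory disabled -->
<fieldType name="text_general_partial" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Temporarily removed for the test:
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="7"/>
    -->
  </analyzer>
  <!-- Query analyzer unchanged from the posted schema -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After the schema change the documents have to be reindexed before re-timing the same highlight queries. If highlighting becomes fast again, the n-gram terms are the likely source of the large match count seen in the profile.
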
-- Bryan

> -----Original Message-----
> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> Sent: Tuesday, June 18, 2013 5:16 PM
> To: 'solr-user@lucene.apache.org'
> Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter
>
> Andy,
>
> OK, I get what you're doing. As far as alternate paths, you could index normally and use WildcardQuery, but that wouldn't get you the boost on exact word matches. That makes me wonder whether there's a way to use edismax to combine the results of a wildcard search and a non-wildcard search against the same field, boosting the latter. I haven't looked into it, but it seems possible that it might be done.
>
> I am perplexed at this point by the poor highlight performance you're seeing, but we do have your profiling data that suggests that you have a very large number of matches to contend with, so that's interesting.
>
> At this point, faced with your issue, I would step my way through the FastVectorHighlighter code. About the first thing it does for each field is walk the terms in the document, and retain only those that matched some terms in the query. It may be interesting to see this set of terms it ends up with -- is it excessively large for some reason?
>
> -- Bryan
>
> > -----Original Message-----
> > From: Andy Brown [mailto:andy_br...@rhoworld.com]
> > Sent: Friday, June 14, 2013 1:52 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter
> >
> > Bryan,
> >
> > For specifics, I'll refer you back to my original email where I specified all the fields/field types/handlers I use. Here's a general overview.
> >
> > I really only have 3 fields that I index and search against: "name", "description", and "content". All of which are just general text (string) fields. I have a catch-all field called "text" that is only used for querying. It's indexed but not stored.
> > The "name", "description", and "content" fields are copied into the "text" field.
> >
> > For partial word matching, I have 4 more fields: "name_par", "description_par", "content_par", and "text_par". The "text_par" field has the same relationship to the "*_par" fields as "text" does to the others (only used for querying). Those partial word matching fields are of type "text_general_partial", which I created. That field type is analyzed differently than the regular text field in that it goes through an EdgeNGramFilterFactory with minGramSize="2" and maxGramSize="7" at index time.
> >
> > I query against both the "text" and "text_par" fields using the edismax defType with my qf set to "text^2 text_par^1" to give full word matches a higher score. This part returns very fast, as previously stated. It's when I turn on highlighting that I take the huge performance hit.
> >
> > Again, I'm using the FastVectorHighlighter. The hl.fl is set to "name name_par description description_par content content_par" so that it returns highlights for full and partial word matches. All of those fields have indexed, stored, termPositions, termVectors, and termOffsets set to "true".
> >
> > It all seems redundant just to allow for partial word matching/highlighting, but I didn't know of a better way. Does anything stand out to you that could be the culprit? Let me know if you need any more clarification.
> >
> > Thanks!
> >
> > - Andy
> >
> > -----Original Message-----
> > From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> > Sent: Wednesday, May 29, 2013 5:44 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter
> >
> > Andy,
> >
> > > I don't understand why it's taking 7 secs to return highlights. The size of the index is only 20.93 MB.
> > > The JVM heap Xms and Xmx are both set to 1024 for this verification purpose and that should be more than enough. The processor is plenty powerful enough as well.
> > >
> > > Running VisualVM shows all my CPU time being taken by mainly these 3 methods:
> > >
> > > org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> > > org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> > > org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()
> >
> > That is a strange and interesting set of things to be spending most of your CPU time on. The implication, I think, is that the number of term matches in the document for terms in your query (or, at least, terms matching exact words or the beginning of phrases in your query) is extremely high. Perhaps that's coming from this "partial word match" you mention -- how does that work?
> >
> > -- Bryan
> >
> > > My guess is that this has something to do with how I'm handling partial word matches/highlighting. I have set up another request handler that only searches the whole word fields and it returns in 850 ms with highlighting.
> > >
> > > Any ideas?
> > >
> > > - Andy
> > >
> > > -----Original Message-----
> > > From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> > > Sent: Monday, May 20, 2013 1:39 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter
> > >
> > > My guess is that the problem is those 200MB documents.
> > > FastVectorHighlighter is fast at deciding whether a match, especially a phrase, appears in a document, but it still starts out by walking the entire list of term vectors, and ends by breaking the document into candidate-snippet fragments, both processes that are proportional to the length of the document.
> > >
> > > It's hard to do much about the first, but for the second you could choose to expose FastVectorHighlighter's FieldPhraseList representation, and return offsets to the caller rather than fragments, building up your own snippets from a separate store of indexed files. This would also permit you to set stored="false", improving your memory/core size ratio, which I'm guessing could use some improving. It would require some work, and it would require you to store a representation of what was indexed outside the Solr core, in some constant-bytes-to-character representation that you can use offsets with (e.g. UTF-16, or ASCII+entity references).
> > >
> > > However, you may not need to do this -- it may be that you just need more memory for your search machine. Not JVM memory, but memory that the O/S can use as a file cache. What do you have now? That is, how much memory do you have that is not used by the JVM or other apps, and how big is your Solr core?
> > >
> > > One way to start getting a handle on where time is being spent is to set up VisualVM. Turn on CPU sampling, send in a bunch of the slow highlight queries, and look at where the time is being spent. If it's mostly in methods that are just reading from disk, buy more memory. If you're on Linux, look at what top is telling you.
> > > If the CPU usage is low and the "wa" number is above 1% more often than not, buy more memory (I don't know why that wa number makes sense, I just know that it has been a good rule of thumb for us).
> > >
> > > -- Bryan
> > >
> > > > -----Original Message-----
> > > > From: Andy Brown [mailto:andy_br...@rhoworld.com]
> > > > Sent: Monday, May 20, 2013 9:53 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Slow Highlighter Performance Even Using FastVectorHighlighter
> > > >
> > > > I'm providing a search feature in a web app that searches for documents that range in size from 1KB to 200MB of varying MIME types (PDF, DOC, etc). Currently there are about 3000 documents and this will continue to grow. I'm providing full word search and partial word search. For each document, there are three source fields that I'm interested in searching and highlighting on: name, description, and content. Since I'm providing both full and partial word search, I've created additional fields that get tokenized differently: name_par, description_par, and content_par. Those are indexed and stored as well for querying and highlighting. As suggested in the Solr wiki, I've got two catch all fields text and text_par for faster querying.
> > > >
> > > > An average search results page displays 25 results and I provide paging. I'm just returning the doc ID in my Solr search results and response times have been quite good (1 to 10 ms). The problem in performance occurs when I turn on highlighting. I'm already using the FastVectorHighlighter and depending on the query, it has taken as long as 15 seconds to get the highlight snippets. However, this isn't always the case. Certain query terms result in 1 sec or less response time.
> > > > In any case, 15 seconds is way too long.
> > > >
> > > > I'm fairly new to Solr but I've spent days coming up with what I've got so far. Feel free to correct any misconceptions I have. Can anyone advise me on what I'm doing wrong or offer a better way to set up my core to improve highlighting performance?
> > > >
> > > > A typical query would look like:
> > > > /select?q=foo&start=0&rows=25&fl=id&hl=true
> > > >
> > > > I'm using Solr 4.1. Below are the relevant core schema and config details:
> > > >
> > > > <!-- Misc fields -->
> > > > <field name="_version_" type="long" indexed="true" stored="true"/>
> > > > <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
> > > >
> > > > <!-- Fields for whole word matches -->
> > > > <field name="name" type="text_general" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>
> > > > <field name="description" type="text_general" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>
> > > > <field name="content" type="text_general" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>
> > > > <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
> > > >
> > > > <!-- Fields for partial word matches -->
> > > > <field name="name_par" type="text_general_partial" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>
> > > > <field name="description_par" type="text_general_partial" indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>
> > > > <field name="content_par" type="text_general_partial"
> > > >     indexed="true" stored="true" multiValued="true" termPositions="true" termVectors="true" termOffsets="true"/>
> > > > <field name="text_par" type="text_general_partial" indexed="true" stored="false" multiValued="true"/>
> > > >
> > > > <!-- Copy source name, description, and content fields to name_par, description_par, and content_par for partial word searches -->
> > > > <copyField source="name" dest="name_par"/>
> > > > <copyField source="description" dest="description_par"/>
> > > > <copyField source="content" dest="content_par"/>
> > > >
> > > > <!-- Copy source name, description, and content fields to catch-all text field for faster querying. -->
> > > > <copyField source="name" dest="text"/>
> > > > <copyField source="description" dest="text"/>
> > > > <copyField source="content" dest="text"/>
> > > >
> > > > <!-- Copy source name, description, and content fields to catch-all text_par field for faster querying of partial word searches.
> > > >      -->
> > > > <copyField source="name" dest="text_par"/>
> > > > <copyField source="description" dest="text_par"/>
> > > > <copyField source="content" dest="text_par"/>
> > > >
> > > > <!-- A text field for whole word matches -->
> > > > <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
> > > >   <analyzer type="index">
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
> > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > >   </analyzer>
> > > >   <analyzer type="query">
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
> > > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > >   </analyzer>
> > > > </fieldType>
> > > >
> > > > <!-- A text field for partial matches -->
> > > > <fieldType name="text_general_partial" class="solr.TextField" positionIncrementGap="100">
> > > >   <analyzer type="index">
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
> > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="7"/>
> > > >   </analyzer>
> > > >   <analyzer type="query">
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
> > > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> > > >     <filter
> > > >       class="solr.LowerCaseFilterFactory"/>
> > > >   </analyzer>
> > > > </fieldType>
> > > >
> > > > <requestHandler name="/select" class="solr.SearchHandler">
> > > >   <!-- default values for query parameters can be specified, these will be overridden by parameters in the request. -->
> > > >   <lst name="defaults">
> > > >     <str name="echoParams">explicit</str>
> > > >     <int name="rows">10</int>
> > > >     <str name="df">text</str>
> > > >     <str name="defType">edismax</str>
> > > >     <str name="qf">text^2 text_par^1</str> <!-- Boost whole word matches more than partial matches in the scoring. -->
> > > >     <bool name="termVectors">true</bool>
> > > >     <bool name="termPositions">true</bool>
> > > >     <bool name="termOffsets">true</bool>
> > > >     <bool name="hl.useFastVectorHighlighter">true</bool>
> > > >     <str name="hl.boundaryScanner">breakIterator</str>
> > > >     <str name="hl.snippets">2</str>
> > > >     <str name="hl.fl">name name_par description description_par content content_par</str>
> > > >     <int name="hl.fragsize">162</int>
> > > >     <str name="hl.fragListBuilder">simple</str>
> > > >     <str name="hl.fragmentsBuilder">default</str>
> > > >     <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
> > > >     <str name="hl.simple.post"><![CDATA[</strong>]]></str>
> > > >     <str name="hl.tag.pre"><![CDATA[<strong>]]></str>
> > > >     <str name="hl.tag.post"><![CDATA[</strong>]]></str>
> > > >   </lst>
> > > > </requestHandler>
> > > >
> > > > Cheers!
> > > >
> > > > - Andy