RE: Slow Highlighter Performance Even Using FastVectorHighlighter

Bryan Loofbourrow Tue, 18 Jun 2013 17:17:14 -0700

Andy,

OK, I get what you're doing. As far as alternate paths go, you could index
normally and use WildcardQuery, but that wouldn't get you the boost on
exact word matches. That makes me wonder whether there's a way to use
edismax to combine the results of a wildcard search and a non-wildcard
search against the same field, boosting the latter. I haven't looked into
it, but it seems like it could be done.
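
To make the idea concrete, here is a rough, untested Python sketch of the
request parameters such a combined search might use. It assumes edismax
accepts the boost, OR, and trailing-wildcard syntax in q; build_params is a
hypothetical helper, the boost of 2 just mirrors your current qf, and the
term is assumed to be a single plain word with no characters that need
escaping.

```python
from urllib.parse import urlencode


def build_params(term, field="text", exact_boost=2):
    """Build request parameters that OR an exact clause with a prefix clause.

    Hypothetical helper: assumes `term` is one plain word.
    """
    # Exact word match, boosted, OR a prefix (wildcard) match on the same field.
    query = f"{term}^{exact_boost} OR {term}*"
    return {
        "q": query,
        "defType": "edismax",
        "qf": field,
    }


params = build_params("foo")
print("/select?" + urlencode(params))
```

Whether the highlighter would then mark up the prefix matches is a separate
question that I haven't checked.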


I am perplexed at this point by the poor highlight performance you're
seeing, but we do have your profiling data that suggests that you have a
very large number of matches to contend with, so that's interesting.

At this point, faced with your issue, I would step my way through the
FastVectorHighlighter code. About the first thing it does for each field
is walk the terms in the document, and retain only those that matched some
terms in the query. It may be interesting to see this set of terms it ends
up with -- is it excessively large for some reason?
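
To get a feel for how that retained set could grow with your partial-match
fields, here is a small Python sketch. It does not call Solr or Lucene;
edge_ngrams and matching_positions are hypothetical helpers that only imitate
the index-time EdgeNGramFilterFactory settings from your schema
(minGramSize=2, maxGramSize=7) on a toy token list.

```python
def edge_ngrams(token, min_size=2, max_size=7):
    """Return the leading-edge n-grams of a token, shortest first."""
    upper = min(len(token), max_size)
    return [token[:n] for n in range(min_size, upper + 1)]


def matching_positions(tokens, query_term, partial):
    """List the token positions whose indexed terms include query_term."""
    hits = []
    for pos, token in enumerate(tokens):
        # A partial-match field indexes every edge n-gram of the token;
        # a whole-word field indexes only the token itself.
        terms = edge_ngrams(token) if partial else [token]
        if query_term in terms:
            hits.append(pos)
    return hits


tokens = ["foo", "food", "at", "the", "football", "forum", "for", "fools"]
whole = matching_positions(tokens, "foo", partial=False)
par = matching_positions(tokens, "foo", partial=True)
print("whole-word hits:", whole)
print("partial hits:", par)
```

In this toy case the partial field matches at every token that begins with
the query term, not just at the exact word, so a short query term against a
very large document could leave the highlighter with a long list to process.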

-- Bryan

> -----Original Message-----
> From: Andy Brown [mailto:andy_br...@rhoworld.com]
> Sent: Friday, June 14, 2013 1:52 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter
>
> Bryan,
>
> For specifics, I'll refer you back to my original email where I
> specified all the fields/field types/handlers I use. Here's a general
> overview.
>
> I really only have 3 fields that I index and search against: "name",
> "description", and "content". All of which are just general text
> (string) fields. I have a catch-all field called "text" that is only
> used for querying. It's indexed but not stored. The "name",
> "description", and "content" fields are copied into the "text" field.
>
> For partial word matching, I have 4 more fields: "name_par",
> "description_par", "content_par", and "text_par". The "text_par" field
> has the same relationship to the "*_par" fields as "text" does to the
> others (only used for querying). Those partial word matching fields are
> of type "text_general_partial" which I created. That field type is
> analyzed differently from the regular text field in that it goes through
> an EdgeNGramFilterFactory with the minGramSize="2" and maxGramSize="7"
> at index time.
>
> I query against both "text" and "text_par" fields using edismax as the
> defType, with my qf set to "text^2 text_par^1" to give full word matches a
> higher score. This part returns very fast, as previously stated. It's when
> I turn on highlighting that I take the huge performance hit.
>
> Again, I'm using the FastVectorHighlighter. The hl.fl is set to "name
> name_par description description_par content content_par" so that it
> returns highlights for full and partial word matches. All of those
> fields have indexed, stored, termPositions, termVectors, and termOffsets
> set to "true".
>
> It all seems redundant just to allow for partial word
> matching/highlighting but I didn't know of a better way. Does anything
> stand out to you that could be the culprit? Let me know if you need any
> more clarification.
>
> Thanks!
>
> - Andy
>
> -----Original Message-----
> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> Sent: Wednesday, May 29, 2013 5:44 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Slow Highlighter Performance Even Using
> FastVectorHighlighter
>
> Andy,
>
> > I don't understand why it's taking 7 secs to return highlights. The size
> > of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
> > 1024 for this verification purpose and that should be more than enough.
> > The processor is plenty powerful enough as well.
> >
> > Running VisualVM shows all my CPU time being taken by mainly these 3
> > methods:
> >
> > org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> > org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> > org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()
>
> That is a strange and interesting set of things to be spending most of
> your CPU time on. The implication, I think, is that the number of term
> matches in the document for terms in your query (or, at least, terms
> matching exact words or the beginning of phrases in your query) is
> extremely high. Perhaps that's coming from this "partial word match" you
> mention -- how does that work?
>
> -- Bryan
>
> > My guess is that this has something to do with how I'm handling partial
> > word matches/highlighting. I have set up another request handler that
> > only searches the whole word fields and it returns in 850 ms with
> > highlighting.
> >
> > Any ideas?
> >
> > - Andy
> >
> >
> > -----Original Message-----
> > From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> > Sent: Monday, May 20, 2013 1:39 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Slow Highlighter Performance Even Using
> > FastVectorHighlighter
> >
> > My guess is that the problem is those 200MB documents.
> > FastVectorHighlighter is fast at deciding whether a match, especially a
> > phrase, appears in a document, but it still starts out by walking the
> > entire list of term vectors, and ends by breaking the document into
> > candidate-snippet fragments, both processes that are proportional to the
> > length of the document.
> >
> > It's hard to do much about the first, but for the second you could choose
> > to expose FastVectorHighlighter's FieldPhraseList representation, and
> > return offsets to the caller rather than fragments, building up your own
> > snippets from a separate store of indexed files. This would also permit
> > you to set stored="false", improving your memory/core size ratio, which
> > I'm guessing could use some improving. It would require some work, and it
> > would require you to store a representation of what was indexed outside
> > the Solr core, in some constant-bytes-to-character representation that you
> > can use offsets with (e.g. UTF-16, or ASCII+entity references).
> >
> > However, you may not need to do this -- it may be that you just need more
> > memory for your search machine. Not JVM memory, but memory that the O/S
> > can use as a file cache. What do you have now? That is, how much memory do
> > you have that is not used by the JVM or other apps, and how big is your
> > Solr core?
> >
> > One way to start getting a handle on where time is being spent is to set
> > up VisualVM. Turn on CPU sampling, send in a bunch of the slow highlight
> > queries, and look at where the time is being spent. If it's mostly in
> > methods that are just reading from disk, buy more memory. If you're on
> > Linux, look at what top is telling you. If the CPU usage is low and the
> > "wa" number is above 1% more often than not, buy more memory (I don't know
> > why that wa number makes sense, I just know that it has been a good rule
> > of thumb for us).
> >
> > -- Bryan
> >
> > > -----Original Message-----
> > > From: Andy Brown [mailto:andy_br...@rhoworld.com]
> > > Sent: Monday, May 20, 2013 9:53 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Slow Highlighter Performance Even Using FastVectorHighlighter
> > >
> > > I'm providing a search feature in a web app that searches for documents
> > > that range in size from 1KB to 200MB of varying MIME types (PDF, DOC,
> > > etc). Currently there are about 3000 documents and this will continue to
> > > grow. I'm providing full word search and partial word search. For each
> > > document, there are three source fields that I'm interested in searching
> > > and highlighting on: name, description, and content. Since I'm providing
> > > both full and partial word search, I've created additional fields that
> > > get tokenized differently: name_par, description_par, and content_par.
> > > Those are indexed and stored as well for querying and highlighting. As
> > > suggested in the Solr wiki, I've got two catch all fields text and
> > > text_par for faster querying.
> > >
> > > An average search results page displays 25 results and I provide paging.
> > > I'm just returning the doc ID in my Solr search results and response
> > > times have been quite good (1 to 10 ms). The problem in performance
> > > occurs when I turn on highlighting. I'm already using the
> > > FastVectorHighlighter and depending on the query, it has taken as long
> > > as 15 seconds to get the highlight snippets. However, this isn't always
> > > the case. Certain query terms result in 1 sec or less response time. In
> > > any case, 15 seconds is way too long.
> > >
> > > I'm fairly new to Solr but I've spent days coming up with what I've got
> > > so far. Feel free to correct any misconceptions I have. Can anyone
> > > advise me on what I'm doing wrong or offer a better way to set up my core
> > > to improve highlighting performance?
> > >
> > > A typical query would look like:
> > > /select?q=foo&start=0&rows=25&fl=id&hl=true
> > >
> > > I'm using Solr 4.1. Below are the relevant core schema and config
> > > details:
> > >
> > > <!-- Misc fields -->
> > > <field name="_version_" type="long" indexed="true" stored="true"/>
> > > <field name="id" type="string" indexed="true" stored="true"
> > > required="true" multiValued="false"/>
> > >
> > >
> > > <!-- Fields for whole word matches -->
> > > <field name="name" type="text_general" indexed="true" stored="true"
> > > multiValued="true" termPositions="true" termVectors="true"
> > > termOffsets="true"/>
> > > <field name="description" type="text_general" indexed="true"
> > > stored="true" multiValued="true" termPositions="true" termVectors="true"
> > > termOffsets="true"/>
> > > <field name="content" type="text_general" indexed="true" stored="true"
> > > multiValued="true" termPositions="true" termVectors="true"
> > > termOffsets="true"/>
> > > <field name="text" type="text_general" indexed="true" stored="false"
> > > multiValued="true"/>
> > >
> > > <!-- Fields for partial word matches -->
> > > <field name="name_par" type="text_general_partial" indexed="true"
> > > stored="true" multiValued="true" termPositions="true" termVectors="true"
> > > termOffsets="true"/>
> > > <field name="description_par" type="text_general_partial" indexed="true"
> > > stored="true" multiValued="true" termPositions="true" termVectors="true"
> > > termOffsets="true"/>
> > > <field name="content_par" type="text_general_partial" indexed="true"
> > > stored="true" multiValued="true" termPositions="true" termVectors="true"
> > > termOffsets="true"/>
> > > <field name="text_par" type="text_general_partial" indexed="true"
> > > stored="false" multiValued="true"/>
> > >
> > >
> > > <!-- Copy source name, description, and content fields to name_par,
> > > description_par, and content_par for partial word searches -->
> > > <copyField source="name" dest="name_par"/>
> > > <copyField source="description" dest="description_par"/>
> > > <copyField source="content" dest="content_par"/>
> > >
> > > <!-- Copy source name, description, and content fields to catch-all text
> > > field for faster querying. -->
> > > <copyField source="name" dest="text"/>
> > > <copyField source="description" dest="text"/>
> > > <copyField source="content" dest="text"/>
> > >
> > > <!-- Copy source name, description, and content fields to catch-all
> > > text_par field for faster querying of partial word searches. -->
> > > <copyField source="name" dest="text_par"/>
> > > <copyField source="description" dest="text_par"/>
> > > <copyField source="content" dest="text_par"/>
> > >
> > > <!-- A text field for whole word matches -->
> > > <fieldType name="text_general" class="solr.TextField"
> > > positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > words="stopwords.txt" enablePositionIncrements="true" />
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > words="stopwords.txt" enablePositionIncrements="true" />
> > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > > ignoreCase="true" expand="true"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >   </analyzer>
> > > </fieldType>
> > >
> > > <!-- A text field for partial matches -->
> > > <fieldType name="text_general_partial" class="solr.TextField"
> > > positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > words="stopwords.txt" enablePositionIncrements="true" />
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
> > > maxGramSize="7"/>
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > words="stopwords.txt" enablePositionIncrements="true" />
> > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > > ignoreCase="true" expand="true"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >   </analyzer>
> > > </fieldType>
> > >
> > >
> > >
> > > <requestHandler name="/select" class="solr.SearchHandler">
> > >   <!-- default values for query parameters can be specified, these
> > >        will be overridden by parameters in the request. -->
> > >   <lst name="defaults">
> > >     <str name="echoParams">explicit</str>
> > >     <int name="rows">10</int>
> > >     <str name="df">text</str>
> > >     <str name="defType">edismax</str>
> > >     <!-- Boost whole word matches more than partial matches in the
> > >          scoring. -->
> > >     <str name="qf">text^2 text_par^1</str>
> > >     <bool name="termVectors">true</bool>
> > >     <bool name="termPositions">true</bool>
> > >     <bool name="termOffsets">true</bool>
> > >     <bool name="hl.useFastVectorHighlighter">true</bool>
> > >     <str name="hl.boundaryScanner">breakIterator</str>
> > >     <str name="hl.snippets">2</str>
> > >     <str name="hl.fl">name name_par description description_par
> > >       content content_par</str>
> > >     <int name="hl.fragsize">162</int>
> > >     <str name="hl.fragListBuilder">simple</str>
> > >     <str name="hl.fragmentsBuilder">default</str>
> > >     <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
> > >     <str name="hl.simple.post"><![CDATA[</strong>]]></str>
> > >     <str name="hl.tag.pre"><![CDATA[<strong>]]></str>
> > >     <str name="hl.tag.post"><![CDATA[</strong>]]></str>
> > >   </lst>
> > > </requestHandler>
> > >
> > >
> > > Cheers!
> > >
> > > - Andy
