Also, in your position, I would be very curious what would happen to
highlighting performance if I just took the EdgeNGramFilter out of the
analysis chain and reindexed. That would immediately tell you whether the
problem lives there.
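
A minimal sketch of that experiment, assuming the text_general_partial
definition from your original message (quoted below) is still what you have
deployed. Only the index-time EdgeNGramFilterFactory line is commented out;
everything else is copied as you posted it:

```xml
<!-- Diagnostic variant of text_general_partial: same as the posted schema,
     except the index-time EdgeNGramFilterFactory is disabled. -->
<fieldType name="text_general_partial" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Disabled for the timing test:
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
            maxGramSize="7"/>
    -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Partial-word matches against the *_par fields will stop working while this is
in place, so treat it purely as a timing test: reindex, rerun one of the slow
highlight queries, and compare.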

-- Bryan

> -----Original Message-----
> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> Sent: Tuesday, June 18, 2013 5:16 PM
> To: 'solr-user@lucene.apache.org'
> Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter
>
> Andy,
>
> OK, I get what you're doing. As far as alternate paths, you could index
> normally and use WildcardQuery, but that wouldn't get you the boost on
> exact word matches. That makes me wonder whether there's a way to use
> edismax to combine the results of a wildcard search and a non-wildcard
> search against the same field, boosting the latter. I haven't looked into
> it, but it seems possible that it might be done.
>
> I am perplexed at this point by the poor highlight performance you're
> seeing, but we do have your profiling data that suggests that you have a
> very large number of matches to contend with, so that's interesting.
>
> At this point, faced with your issue, I would step my way through the
> FastVectorHighlighter code. About the first thing it does for each field
> is walk the terms in the document, and retain only those that matched some
> terms in the query. It may be interesting to see this set of terms it ends
> up with -- is it excessively large for some reason?
>
> -- Bryan
>
> > -----Original Message-----
> > From: Andy Brown [mailto:andy_br...@rhoworld.com]
> > Sent: Friday, June 14, 2013 1:52 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter
> >
> > Bryan,
> >
> > For specifics, I'll refer you back to my original email where I
> > specified all the fields/field types/handlers I use. Here's a general
> > overview.
> >
> > I really only have 3 fields that I index and search against: "name",
> > "description", and "content", all of which are just general text
> > (string) fields. I have a catch-all field called "text" that is only
> > used for querying. It's indexed but not stored. The "name",
> > "description", and "content" fields are copied into the "text" field.
> >
> > For partial word matching, I have 4 more fields: "name_par",
> > "description_par", "content_par", and "text_par". The "text_par" field
> > has the same relationship to the "*_par" fields as "text" does to the
> > others (only used for querying). Those partial word matching fields are
> > of type "text_general_partial", which I created. That field type is
> > analyzed differently than the regular text field in that it goes through
> > an EdgeNGramFilterFactory with minGramSize="2" and maxGramSize="7"
> > at index time.
> >
> > I query against both "text" and "text_par" fields using the edismax
> > defType with my qf set to "text^2 text_par^1" to give full word matches
> > a higher score. This part returns very fast, as previously stated. It's
> > when I turn on highlighting that I take the huge performance hit.
> >
> > Again, I'm using the FastVectorHighlighter. The hl.fl is set to "name
> > name_par description description_par content content_par" so that it
> > returns highlights for full and partial word matches. All of those
> > fields have indexed, stored, termPositions, termVectors, and
> > termOffsets set to "true".
> >
> > It all seems redundant just to allow for partial word
> > matching/highlighting, but I didn't know of a better way. Does anything
> > stand out to you that could be the culprit? Let me know if you need any
> > more clarification.
> >
> > Thanks!
> >
> > - Andy
> >
> > -----Original Message-----
> > From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> > Sent: Wednesday, May 29, 2013 5:44 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Slow Highlighter Performance Even Using
> > FastVectorHighlighter
> >
> > Andy,
> >
> > > I don't understand why it's taking 7 secs to return highlights. The size
> > > of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
> > > 1024 for this verification purpose and that should be more than enough.
> > > The processor is plenty powerful enough as well.
> > >
> > > Running VisualVM shows all my CPU time being taken by mainly these 3
> > > methods:
> > >
> > > org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> > > org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
> > > org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()
> >
> > That is a strange and interesting set of things to be spending most of
> > your CPU time on. The implication, I think, is that the number of term
> > matches in the document for terms in your query (or, at least, terms
> > matching exact words or the beginning of phrases in your query) is
> > extremely high. Perhaps that's coming from this "partial word match"
> > you mention -- how does that work?
> >
> > -- Bryan
> >
> > > My guess is that this has something to do with how I'm handling partial
> > > word matches/highlighting. I have set up another request handler that
> > > only searches the whole word fields and it returns in 850 ms with
> > > highlighting.
> > >
> > > Any ideas?
> > >
> > > - Andy
> > >
> > >
> > > -----Original Message-----
> > > From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> > > Sent: Monday, May 20, 2013 1:39 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Slow Highlighter Performance Even Using
> > > FastVectorHighlighter
> > >
> > > My guess is that the problem is those 200MB documents.
> > > FastVectorHighlighter is fast at deciding whether a match, especially a
> > > phrase, appears in a document, but it still starts out by walking the
> > > entire list of term vectors, and ends by breaking the document into
> > > candidate-snippet fragments, both processes that are proportional to the
> > > length of the document.
> > >
> > > It's hard to do much about the first, but for the second you could
> > > choose to expose FastVectorHighlighter's FieldPhraseList representation,
> > > and return offsets to the caller rather than fragments, building up your
> > > own snippets from a separate store of indexed files. This would also
> > > permit you to set stored="false", improving your memory/core size ratio,
> > > which I'm guessing could use some improving. It would require some work,
> > > and it would require you to store a representation of what was indexed
> > > outside the Solr core, in some constant-bytes-to-character representation
> > > that you can use offsets with (e.g. UTF-16, or ASCII+entity references).
> > >
> > > However, you may not need to do this -- it may be that you just need
> > > more memory for your search machine. Not JVM memory, but memory that the
> > > O/S can use as a file cache. What do you have now? That is, how much
> > > memory do you have that is not used by the JVM or other apps, and how big
> > > is your Solr core?
> > >
> > > One way to start getting a handle on where time is being spent is to set
> > > up VisualVM. Turn on CPU sampling, send in a bunch of the slow highlight
> > > queries, and look at where the time is being spent. If it's mostly in
> > > methods that are just reading from disk, buy more memory. If you're on
> > > Linux, look at what top is telling you. If the CPU usage is low and the
> > > "wa" number is above 1% more often than not, buy more memory (I don't
> > > know why that wa number makes sense, I just know that it has been a good
> > > rule of thumb for us).
> > >
> > > -- Bryan
> > >
> > > > -----Original Message-----
> > > > From: Andy Brown [mailto:andy_br...@rhoworld.com]
> > > > Sent: Monday, May 20, 2013 9:53 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Slow Highlighter Performance Even Using FastVectorHighlighter
> > > >
> > > > I'm providing a search feature in a web app that searches for documents
> > > > that range in size from 1KB to 200MB of varying MIME types (PDF, DOC,
> > > > etc). Currently there are about 3000 documents and this will continue to
> > > > grow. I'm providing full word search and partial word search. For each
> > > > document, there are three source fields that I'm interested in searching
> > > > and highlighting on: name, description, and content. Since I'm providing
> > > > both full and partial word search, I've created additional fields that
> > > > get tokenized differently: name_par, description_par, and content_par.
> > > > Those are indexed and stored as well for querying and highlighting. As
> > > > suggested in the Solr wiki, I've got two catch-all fields, text and
> > > > text_par, for faster querying.
> > > >
> > > > An average search results page displays 25 results and I provide paging.
> > > > I'm just returning the doc ID in my Solr search results and response
> > > > times have been quite good (1 to 10 ms). The performance problem
> > > > occurs when I turn on highlighting. I'm already using the
> > > > FastVectorHighlighter and, depending on the query, it has taken as long
> > > > as 15 seconds to get the highlight snippets. However, this isn't always
> > > > the case. Certain query terms result in 1 sec or less response time. In
> > > > any case, 15 seconds is way too long.
> > > >
> > > > I'm fairly new to Solr but I've spent days coming up with what I've got
> > > > so far. Feel free to correct any misconceptions I have. Can anyone
> > > > advise me on what I'm doing wrong or offer a better way to set up my core
> > > > to improve highlighting performance?
> > > >
> > > > A typical query would look like:
> > > > /select?q=foo&start=0&rows=25&fl=id&hl=true
> > > >
> > > > I'm using Solr 4.1. Below are the relevant core schema and config details:
> > > >
> > > > <!-- Misc fields -->
> > > > <field name="_version_" type="long" indexed="true" stored="true"/>
> > > > <field name="id" type="string" indexed="true" stored="true"
> > > >        required="true" multiValued="false"/>
> > > >
> > > > <!-- Fields for whole word matches -->
> > > > <field name="name" type="text_general" indexed="true" stored="true"
> > > >        multiValued="true" termPositions="true" termVectors="true"
> > > >        termOffsets="true"/>
> > > > <field name="description" type="text_general" indexed="true"
> > > >        stored="true" multiValued="true" termPositions="true"
> > > >        termVectors="true" termOffsets="true"/>
> > > > <field name="content" type="text_general" indexed="true"
> > > >        stored="true" multiValued="true" termPositions="true"
> > > >        termVectors="true" termOffsets="true"/>
> > > > <field name="text" type="text_general" indexed="true" stored="false"
> > > >        multiValued="true"/>
> > > >
> > > > <!-- Fields for partial word matches -->
> > > > <field name="name_par" type="text_general_partial" indexed="true"
> > > >        stored="true" multiValued="true" termPositions="true"
> > > >        termVectors="true" termOffsets="true"/>
> > > > <field name="description_par" type="text_general_partial"
> > > >        indexed="true" stored="true" multiValued="true"
> > > >        termPositions="true" termVectors="true" termOffsets="true"/>
> > > > <field name="content_par" type="text_general_partial" indexed="true"
> > > >        stored="true" multiValued="true" termPositions="true"
> > > >        termVectors="true" termOffsets="true"/>
> > > > <field name="text_par" type="text_general_partial" indexed="true"
> > > >        stored="false" multiValued="true"/>
> > > >
> > > >
> > > > <!-- Copy source name, description, and content fields to name_par,
> > > >      description_par, and content_par for partial word searches -->
> > > > <copyField source="name" dest="name_par"/>
> > > > <copyField source="description" dest="description_par"/>
> > > > <copyField source="content" dest="content_par"/>
> > > >
> > > > <!-- Copy source name, description, and content fields to catch-all
> > > >      text field for faster querying. -->
> > > > <copyField source="name" dest="text"/>
> > > > <copyField source="description" dest="text"/>
> > > > <copyField source="content" dest="text"/>
> > > >
> > > > <!-- Copy source name, description, and content fields to catch-all
> > > >      text_par field for faster querying of partial word searches. -->
> > > > <copyField source="name" dest="text_par"/>
> > > > <copyField source="description" dest="text_par"/>
> > > > <copyField source="content" dest="text_par"/>
> > > >
> > > > <!-- A text field for whole word matches -->
> > > > <fieldType name="text_general" class="solr.TextField"
> > > >            positionIncrementGap="100">
> > > >   <analyzer type="index">
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > >             words="stopwords.txt" enablePositionIncrements="true"/>
> > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > >   </analyzer>
> > > >   <analyzer type="query">
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > >             words="stopwords.txt" enablePositionIncrements="true"/>
> > > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > > >             ignoreCase="true" expand="true"/>
> > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > >   </analyzer>
> > > > </fieldType>
> > > >
> > > > <!-- A text field for partial matches -->
> > > > <fieldType name="text_general_partial" class="solr.TextField"
> > > >            positionIncrementGap="100">
> > > >   <analyzer type="index">
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > >             words="stopwords.txt" enablePositionIncrements="true"/>
> > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
> > > >             maxGramSize="7"/>
> > > >   </analyzer>
> > > >   <analyzer type="query">
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > >             words="stopwords.txt" enablePositionIncrements="true"/>
> > > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > > >             ignoreCase="true" expand="true"/>
> > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > >   </analyzer>
> > > > </fieldType>
> > > >
> > > >
> > > >
> > > > <requestHandler name="/select" class="solr.SearchHandler">
> > > >   <!-- default values for query parameters can be specified, these
> > > >        will be overridden by parameters in the request. -->
> > > >   <lst name="defaults">
> > > >     <str name="echoParams">explicit</str>
> > > >     <int name="rows">10</int>
> > > >     <str name="df">text</str>
> > > >     <str name="defType">edismax</str>
> > > >     <!-- Boost whole word matches more than partial matches in the
> > > >          scoring. -->
> > > >     <str name="qf">text^2 text_par^1</str>
> > > >     <bool name="termVectors">true</bool>
> > > >     <bool name="termPositions">true</bool>
> > > >     <bool name="termOffsets">true</bool>
> > > >     <bool name="hl.useFastVectorHighlighter">true</bool>
> > > >     <str name="hl.boundaryScanner">breakIterator</str>
> > > >     <str name="hl.snippets">2</str>
> > > >     <str name="hl.fl">name name_par description description_par
> > > > content content_par</str>
> > > >     <int name="hl.fragsize">162</int>
> > > >     <str name="hl.fragListBuilder">simple</str>
> > > >     <str name="hl.fragmentsBuilder">default</str>
> > > >     <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
> > > >     <str name="hl.simple.post"><![CDATA[</strong>]]></str>
> > > >     <str name="hl.tag.pre"><![CDATA[<strong>]]></str>
> > > >     <str name="hl.tag.post"><![CDATA[</strong>]]></str>
> > > >   </lst>
> > > > </requestHandler>
> > > >
> > > >
> > > > Cheers!
> > > >
> > > > - Andy
