RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-18 Thread Bryan Loofbourrow
Andy,

OK, I get what you're doing. As for alternate paths, you could index
normally and use WildcardQuery, but that wouldn't get you the boost on
exact word matches. That makes me wonder whether there's a way to use
edismax to combine the results of a wildcard search and a non-wildcard
search against the same field, boosting the latter. I haven't looked into
it, but it seems possible.

I am perplexed at this point by the poor highlight performance you're
seeing, but we do have your profiling data that suggests that you have a
very large number of matches to contend with, so that's interesting.

At this point, faced with your issue, I would step through the
FastVectorHighlighter code. One of the first things it does for each
field is walk the terms in the document and retain only those that match
some term in the query. It would be interesting to see the set of terms
it ends up with -- is it excessively large for some reason?
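To make that filtering step concrete, here is a minimal sketch in plain Java. It is not the FastVectorHighlighter code; the class name, method name, and sample terms are all made up for illustration. It only shows the idea of walking a document's term occurrences and keeping those that match a query term, so that the size of the retained set can be inspected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative model only; not Lucene's implementation.
public class MatchedTermsSketch {

    /** Keeps every document term occurrence that equals some query term. */
    static List<String> retainMatches(List<String> docTerms, Set<String> queryTerms) {
        List<String> kept = new ArrayList<>();
        for (String term : docTerms) {
            if (queryTerms.contains(term)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical term occurrences from one field of one document.
        List<String> docTerms = Arrays.asList("co", "con", "cont", "co", "con", "code", "solr");
        // Hypothetical analyzed query terms.
        Set<String> queryTerms = new HashSet<>(Arrays.asList("co", "con"));

        List<String> kept = retainMatches(docTerms, queryTerms);
        System.out.println(kept.size() + " of " + docTerms.size() + " term occurrences retained");
    }
}
```

If the retained list is close to the size of the whole document for ordinary queries, that would point at the analysis chain rather than at the highlighter itself.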

-- Bryan

RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-18 Thread Bryan Loofbourrow
Also, in your position, I would be very curious what would happen to
highlighting performance if I just took the EdgeNGramFilter out of the
analysis chain and reindexed. That would immediately tell you whether
the problem lives there.

-- Bryan

Re: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-17 Thread Floyd Wu
Hi Michael, how do I configure the PostingsHighlighter on my Solr 4.2 box?
Please kindly point me in the right direction. Many thanks.
On 2013/6/15 at 10:48 PM, Michael McCandless luc...@mikemccandless.com wrote:

 You could also try the new[ish] PostingsHighlighter:

 http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html

 Mike McCandless

 http://blog.mikemccandless.com


Re: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-15 Thread Michael Sokolov
If you have very large documents (many MB), that can lead to slow
highlighting, even with FVH.

See https://issues.apache.org/jira/browse/LUCENE-3234

and try setting phraseLimit=1 (or some bigger number, but not infinite,
which is the default).
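As a rough illustration of why a finite phrase limit can help, the sketch below models a phrase list that accepts a candidate only if it overlaps nothing accepted so far, using a linear scan, and stops once a cap is reached. This is a simplified model written for this thread, not Lucene's FieldPhraseList; the Span class, the collect method, and the numbers in main are assumptions. In Solr the cap is, as far as I know, passed as the hl.phraseLimit request parameter.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of an overlap-checked phrase list with a cap; not Lucene's code.
public class PhraseLimitSketch {

    /** Hypothetical stand-in for a matched phrase: a [start, end) character span. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
        boolean overlaps(Span other) { return start < other.end && other.start < end; }
    }

    long comparisons = 0;
    final List<Span> accepted = new ArrayList<>();

    /** Accepts the candidate only if it overlaps nothing accepted so far. */
    void addIfNoOverlap(Span candidate) {
        for (Span existing : accepted) {
            comparisons++;
            if (existing.overlaps(candidate)) {
                return;
            }
        }
        accepted.add(candidate);
    }

    /** Feeds candidates until they run out or `limit` spans have been accepted. */
    static PhraseLimitSketch collect(List<Span> candidates, int limit) {
        PhraseLimitSketch collector = new PhraseLimitSketch();
        for (Span candidate : candidates) {
            if (collector.accepted.size() >= limit) {
                break;
            }
            collector.addIfNoOverlap(candidate);
        }
        return collector;
    }

    public static void main(String[] args) {
        // 5000 hypothetical non-overlapping matches in one large field.
        List<Span> candidates = new ArrayList<>();
        for (int i = 0; i < 5000; i++) {
            candidates.add(new Span(i * 10, i * 10 + 5));
        }
        System.out.println("no limit:   " + collect(candidates, Integer.MAX_VALUE).comparisons + " comparisons");
        System.out.println("limit 1000: " + collect(candidates, 1000).comparisons + " comparisons");
    }
}
```

With n accepted, non-overlapping spans the linear scan performs n(n-1)/2 comparisons in this model, so capping n cuts the work quadratically. Whether the real highlighter behaves the same way on a given Lucene version is something to confirm in the source.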


-Mike


Re: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-15 Thread Michael McCandless
You could also try the new[ish] PostingsHighlighter:
http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html

Mike McCandless

http://blog.mikemccandless.com


RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-14 Thread Andy Brown
Bryan,

For specifics, I'll refer you back to my original email where I
specified all the fields/field types/handlers I use. Here's a general
overview. 
 
I really only have 3 fields that I index and search against: name,
description, and content, all of which are just general text
(string) fields. I have a catch-all field called text that is only
used for querying; it's indexed but not stored. The name,
description, and content fields are copied into the text field.
 
For partial word matching, I have 4 more fields: name_par,
description_par, content_par, and text_par. The text_par field
has the same relationship to the *_par fields as text does to the
others (only used for querying). Those partial word matching fields are
of type text_general_partial, which I created. That field type is
analyzed differently from the regular text field in that it goes through
an EdgeNGramFilterFactory with minGramSize=2 and maxGramSize=7
at index time.
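For readers unfamiliar with edge n-grams, the following plain-Java sketch shows what a front-edge n-gram expansion with minGramSize=2 and maxGramSize=7 produces for a single word. It is a stand-alone illustration, not the Lucene filter; the class and method names are made up, and it ignores positions and offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone illustration of front-edge n-grams; not the Lucene filter.
public class EdgeNGramSketch {

    /** Returns the leading prefixes of `word` with lengths minGram..maxGram. */
    static List<String> edgeNGrams(String word, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, word.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(word.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("highlight", 2, 7)); // [hi, hig, high, highl, highli, highlig]
        System.out.println(edgeNGrams("of", 2, 7));        // [of]
    }
}
```

Each word of two or more characters becomes up to six indexed terms in this model, so the *_par fields carry several times as many terms as their whole-word counterparts.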
 
I query against both the text and text_par fields using the edismax
defType, with my qf set to text^2 text_par^1 to give full word matches a
higher score. This part returns very fast, as previously stated. It's
when I turn on highlighting that I take the huge performance hit.
 
Again, I'm using the FastVectorHighlighter. The hl.fl is set to name
name_par description description_par content content_par so that it
returns highlights for full and partial word matches. All of those
fields have indexed, stored, termPositions, termVectors, and termOffsets
set to true.
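Put together, the request described above would look roughly like the query string built below. This is a sketch under assumptions: the /select path and the sample query word are placeholders (the actual request handler names are in the original email, not here), and hl.useFastVectorHighlighter=true is assumed to be how the FastVectorHighlighter is switched on. The qf and hl.fl values are taken from the description above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Builds an illustrative Solr query string; handler path and query text are placeholders.
public class HighlightRequestSketch {

    /** URL-encodes the parameters in insertion order and joins them with '&'. */
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 is always supported", ex);
        }
        return sb.toString();
    }

    /** Parameters as described in the message; `userQuery` is whatever the user typed. */
    static Map<String, String> params(String userQuery) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", userQuery);
        p.put("defType", "edismax");
        p.put("qf", "text^2 text_par^1"); // whole-word field boosted over the n-gram field
        p.put("hl", "true");
        p.put("hl.useFastVectorHighlighter", "true"); // assumed FVH switch
        p.put("hl.fl", "name name_par description description_par content content_par");
        return p;
    }

    public static void main(String[] args) {
        // "/select" and "solr" are placeholders, not taken from the thread.
        System.out.println("/select?" + toQueryString(params("solr")));
    }
}
```

Dropping the *_par names from hl.fl in such a request is a quick way to compare highlighting time with and without the partial-match fields.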
 
It all seems redundant just to allow for partial word
matching/highlighting but I didn't know of a better way. Does anything
stand out to you that could be the culprit? Let me know if you need any
more clarification. 
 
Thanks! 
 
- Andy 

RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-05-29 Thread Bryan Loofbourrow
Andy,

 I don't understand why it's taking 7 secs to return highlights. The size
 of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
 1024 for this verification purpose and that should be more than enough.
 The processor is plenty powerful enough as well.

 Running VisualVM shows all my CPU time being taken by mainly these 3
 methods:

 org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
 org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
 org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()

That is a strange and interesting set of things to be spending most of
your CPU time on. The implication, I think, is that the number of term
matches in the document for terms in your query (or, at least, terms
matching exact words or the beginning of phrases in your query) is
extremely high. Perhaps that's coming from this partial word match you
mention -- how does that work?

-- Bryan

 My guess is that this has something to do with how I'm handling partial
 word matches/highlighting. I have setup another request handler that
 only searches the whole word fields and it returns in 850 ms with
 highlighting.

 Any ideas?

 - Andy



RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-05-22 Thread Andy Brown
After taking your advice on profiling, I didn't see any memory issues. I
wanted to verify this with a small set of data. So I created a new
sandbox core with the exact same schema and config file settings. I
indexed only 25 PDF documents with an average size of 2.8 MB, the
largest is approx 5 MB (39 pages). I run the exact same query on that
core and I'm seeing response times of 7 secs or more. Without
highlighting the response is usually 1 ms. 
 
I don't understand why it's taking 7 secs to return highlights. The size
of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
1024 for this verification purpose and that should be more than enough.
The processor is plenty powerful enough as well. 
 
Running VisualVM shows all my CPU time being taken by mainly these 3
methods: 
 
org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()
 
My guess is that this has something to do with how I'm handling partial
word matches/highlighting. I have set up another request handler that
only searches the whole word fields and it returns in 850 ms with
highlighting. 
 
Any ideas? 

- Andy



RE: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-05-20 Thread Bryan Loofbourrow
My guess is that the problem is those 200MB documents.
FastVectorHighlighter is fast at deciding whether a match, especially a
phrase, appears in a document, but it still starts out by walking the
entire list of term vectors, and ends by breaking the document into
candidate-snippet fragments, both processes that are proportional to the
length of the document.

It's hard to do much about the first, but for the second you could choose
to expose FastVectorHighlighter's FieldPhraseList representation, and
return offsets to the caller rather than fragments, building up your own
snippets from a separate store of indexed files. This would also permit
you to set stored=false, improving your memory/core size ratio, which
I'm guessing could use some improving. It would require some work, and it
would require you to store a representation of what was indexed outside
the Solr core, in some constant-bytes-to-character representation that you
can use offsets with (e.g. UTF-16, or ASCII+entity references).
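
As a rough illustration of the "constant-bytes-to-character" idea, here is a small Python sketch. It assumes the indexed text is kept in a side file as UTF-16-LE with no BOM, and that the highlighter offsets are counted in UTF-16 code units; the file layout and the function names are hypothetical. With that layout, a snippet can be read by seeking straight to 2 * offset, without loading the whole document:

```python
import os
import tempfile

BYTES_PER_CHAR = 2  # one UTF-16 code unit; assumes offsets count UTF-16 units


def write_sidecar(path, text):
    """Store the indexed text as UTF-16-LE with no BOM, so byte = 2 * offset."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-16-le"))


def read_snippet(path, start, end, context=20):
    """Return the text around [start, end) by seeking, not by reading it all."""
    first = max(0, start - context)
    with open(path, "rb") as f:
        f.seek(first * BYTES_PER_CHAR)
        raw = f.read((end + context - first) * BYTES_PER_CHAR)
    # A window edge can split a surrogate pair; replace rather than fail.
    return raw.decode("utf-16-le", errors="replace")


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog"
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "doc.utf16")
        write_sidecar(path, text)
        start = text.index("fox")
        print(read_snippet(path, start, start + 3, context=6))
```

The cost of a read then depends on the snippet size rather than on the document size, which is the point of keeping the text outside the core.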

However, you may not need to do this -- it may be that you just need more
memory for your search machine. Not JVM memory, but memory that the O/S
can use as a file cache. What do you have now? That is, how much memory do
you have that is not used by the JVM or other apps, and how big is your
Solr core?

One way to start getting a handle on where time is being spent is to set
up VisualVM. Turn on CPU sampling, send in a bunch of the slow highlight
queries, and look at where the time is being spent. If it's mostly in
methods that are just reading from disk, buy more memory. If you're on
Linux, look at what top is telling you. If the CPU usage is low and the
wa (I/O wait) number is above 1% more often than not, buy more memory (I
don't know why that wa threshold makes sense, I just know that it has been
a good rule of thumb for us).
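
If you would rather log the wa figure than watch top, a minimal Python sketch along these lines should work on Linux. It assumes the usual layout of the aggregate "cpu" line in /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal, then guest counters); the function names are made up for the example:

```python
import time


def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line from /proc/stat into a list of ints."""
    fields = stat_line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]


def iowait_percent(before, after):
    """Percent of elapsed CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    # Sum the first eight counters only; guest time is already part of user/nice.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(5)  # sample while the slow highlight queries are running
    print("wa: %.1f%%" % iowait_percent(first, read_cpu_line()))
```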

-- Bryan

 -Original Message-
 From: Andy Brown [mailto:andy_br...@rhoworld.com]
 Sent: Monday, May 20, 2013 9:53 AM
 To: solr-user@lucene.apache.org
 Subject: Slow Highlighter Performance Even Using FastVectorHighlighter

 I'm providing a search feature in a web app that searches for documents
 that range in size from 1KB to 200MB of varying MIME types (PDF, DOC,
 etc). Currently there are about 3000 documents and this will continue to
 grow. I'm providing full word search and partial word search. For each
 document, there are three source fields that I'm interested in searching
 and highlighting on: name, description, and content. Since I'm providing
 both full and partial word search, I've created additional fields that
 get tokenized differently: name_par, description_par, and content_par.
 Those are indexed and stored as well for querying and highlighting. As
 suggested in the Solr wiki, I've got two catch-all fields, text and
 text_par, for faster querying.

 An average search results page displays 25 results and I provide paging.
 I'm just returning the doc ID in my Solr search results and response
 times have been quite good (1 to 10 ms). The problem in performance
 occurs when I turn on highlighting. I'm already using the
 FastVectorHighlighter and depending on the query, it has taken as long
 as 15 seconds to get the highlight snippets. However, this isn't always
 the case. Certain query terms result in 1 sec or less response time. In
 any case, 15 seconds is way too long.

 I'm fairly new to Solr but I've spent days coming up with what I've got
 so far. Feel free to correct any misconceptions I have. Can anyone
 advise me on what I'm doing wrong or offer a better way to set up my core
 to improve highlighting performance?

 A typical query would look like:
 /select?q=foo&start=0&rows=25&fl=id&hl=true
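
 For reference, a request like the one above could be assembled with the parameter separators written out, as in this small Python sketch. The hl.fl and hl.useFastVectorHighlighter parameters are standard Solr highlighting parameters, but passing them explicitly here is an assumption for illustration; in the setup described they may instead be set as request handler defaults. The function name is hypothetical:

```python
from urllib.parse import urlencode


def highlight_query(term, start=0, rows=25):
    """Build a /select URL that returns only doc IDs plus highlights."""
    params = [
        ("q", term),
        ("start", start),
        ("rows", rows),
        ("fl", "id"),
        ("hl", "true"),
        # Assumed explicit parameters; possibly handler defaults in practice.
        ("hl.fl", "name,description,content"),
        ("hl.useFastVectorHighlighter", "true"),
    ]
    return "/select?" + urlencode(params)


if __name__ == "__main__":
    print(highlight_query("foo"))
```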

 I'm using Solr 4.1. Below are the relevant core schema and config details:

 <!-- Misc fields -->
 <field name="_version_" type="long" indexed="true" stored="true"/>
 <field name="id" type="string" indexed="true" stored="true"
 required="true" multiValued="false"/>


 <!-- Fields for whole word matches -->
 <field name="name" type="text_general" indexed="true" stored="true"
 multiValued="true" termPositions="true" termVectors="true"
 termOffsets="true"/>
 <field name="description" type="text_general" indexed="true"
 stored="true" multiValued="true" termPositions="true" termVectors="true"
 termOffsets="true"/>
 <field name="content" type="text_general" indexed="true" stored="true"
 multiValued="true" termPositions="true" termVectors="true"
 termOffsets="true"/>
 <field name="text" type="text_general" indexed="true" stored="false"
 multiValued="true"/>

 <!-- Fields for partial word matches -->
 <field name="name_par" type="text_general_partial" indexed="true"
 stored="true" multiValued="true" termPositions="true" termVectors="true"
 termOffsets="true"/>
 <field name="description_par" type="text_general_partial" indexed="true"
 stored="true" multiValued="true" termPositions="true" termVectors="true"
 termOffsets="true"/>
 <field name="content_par" type="text_general_partial" indexed="true"
 stored="true" multiValued="true" termPositions="true" termVectors="true"
 termOffsets="true"/>
 field name=text_par type=text_general_partial indexed=true