Re: slow highlighting because of stemming

2011-08-01 Thread Orosz György
Thanks for the answers!
This was the solution! :) (My mistake was that I used the value "on"
instead of "true"; I don't know why.)
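
For anyone who finds this thread later, the change can be sketched roughly as
follows. This assumes the values in question are the three term vector
attributes on the field from my original post; everything else is copied
unchanged from there.

```xml
<!-- Assumption: "on" was replaced by "true" on the term vector attributes;
     the remaining attributes are as in the original schema.xml excerpt. -->
<field name="dokumentum_syn_query" type="huntext_syn" indexed="true"
       stored="true" multiValued="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

Term vectors are written at index time, so documents indexed before such a
schema change would need to be reindexed before the highlighter can use them.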
Gyuri

2011/7/30 Michael Sokolov soko...@ifactory.com

> On 7/30/2011 3:46 AM, Orosz György wrote:
>
>> Hi,
>>
>> Thanks for the answer!
>> I am logging the stemmer's activity, and I can see that a lot of tokens are
>> stemmed during highlighting. This is the strange part: I don't understand
>> why a highlighter would need to stem again.
>
> Consider that the highlighter needs to match terms from the query with
> terms from the document, just like search. If the indexed document has been
> stemmed, then the query also needs to be stemmed, or you won't see matches.
>
> -Mike



Re: slow highlighting because of stemming

2011-07-30 Thread Orosz György
Hi,

Thanks for the answer!
I am logging the stemmer's activity, and I can see that a lot of tokens are
stemmed during highlighting. This is the strange part: I don't understand
why a highlighter would need to stem again.
Anyway, my documents are not really large, just a few kilobytes, but thanks
for the suggestion.

If you could help me skip the stemming during highlighting, that would be
great!

Thanks,
Gyuri

2011/7/29 Mike Sokolov soko...@ifactory.com

> I'm not sure I would identify stemming as the culprit here.
>
> Do you have very large documents?  If so, there is a patch for FVH
> committed to limit the number of phrases it looks at; see hl.phraseLimit,
> but this won't be available until 3.4 is released.
>
> You can also limit the amount of each document that is analyzed by the
> regular Highlighter using maxDocCharsToAnalyze (and maybe this applies to
> FVH? not sure)
>
> Using RegexFragmenter is also probably slower than something like
> SimpleFragmenter.
>
> There is work to implement faster highlighting for Solr/Lucene, but it
> depends on some basic changes to the search architecture so it might be a
> while before that becomes available.  See
> https://issues.apache.org/jira/browse/LUCENE-3318 if you're interested in
> following that development.
>
> -Mike


Re: slow highlighting because of stemming

2011-07-30 Thread Ahmet Arslan
> I am logging the stemmer's activity, and I can see that a lot of tokens are
> stemmed during highlighting. This is the strange part: I don't understand
> why a highlighter would need to stem again.

Highlighting does re-analyze the text being highlighted.

> Anyway, my documents are not really large, just a few kilobytes, but thanks
> for the suggestion.
>
> If you could help me skip the stemming during highlighting, that would be
> great!

If you store term vectors, this re-analysis is skipped.
http://wiki.apache.org/solr/FieldOptionsByUseCase


Re: slow highlighting because of stemming

2011-07-30 Thread Michael Sokolov

On 7/30/2011 3:46 AM, Orosz György wrote:

> Hi,
>
> Thanks for the answer!
> I am logging the stemmer's activity, and I can see that a lot of tokens are
> stemmed during highlighting. This is the strange part: I don't understand
> why a highlighter would need to stem again.

Consider that the highlighter needs to match terms from the query with
terms from the document, just like search. If the indexed document has
been stemmed, then the query also needs to be stemmed, or you won't see
matches.


-Mike


slow highlighting because of stemming

2011-07-29 Thread Orosz György
Dear all,

I am quite new to Solr and would like to ask for your help.
I am developing an application that should highlight the results of a
query. For this I am using the regex fragmenter:
<highlighting>
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">500</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.pre"><![CDATA[<b>]]></str>
      <str name="hl.post"><![CDATA[</b>]]></str>
      <str name="hl.useFastVectorHighlighter">true</str>
      <str name="hl.regex.pattern">[-\w ,/\n\']{20,300}[.?!]</str>
      <str name="hl.fl">dokumentum_syn_query</str>
    </lst>
  </fragmenter>
</highlighting>
The field is indexed with term vectors and offsets:
<field name="dokumentum_syn_query" type="huntext_syn" indexed="true"
       stored="true" multiValued="true" termVectors="on" termPositions="on"
       termOffsets="on"/>
<fieldType name="huntext_syn" class="solr.TextField" stored="true"
           indexed="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.morphologic.solr.huntoken.HunTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true" />
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
            cache="alma"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true" />
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
            cache="alma"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms_query.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The highlighting works well, except that it is really slow. I realized that
this is because the highlighter/fragmenter stems all the result documents
again.

Could you please explain why this happens and how I can avoid it? (I
thought that using the FastVectorHighlighter would solve my problem, but it
didn't.)

Thanks in advance!
Gyuri Orosz


Re: slow highlighting because of stemming

2011-07-29 Thread Mike Sokolov

I'm not sure I would identify stemming as the culprit here.

Do you have very large documents?  If so, there is a patch for FVH 
committed to limit the number of phrases it looks at; see 
hl.phraseLimit, but this won't be available until 3.4 is released.


You can also limit the amount of each document that is analyzed by the 
regular Highlighter using maxDocCharsToAnalyze (and maybe this applies 
to FVH? not sure)


Using RegexFragmenter is also probably slower than something like 
SimpleFragmenter.
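
As a rough sketch only, the two suggestions above might look like this as
request handler defaults in solrconfig.xml. The handler name and the
character limit are placeholders, not values from this thread.
hl.maxAnalyzedChars is the Solr request parameter that corresponds to
maxDocCharsToAnalyze, and "gap" assumes a GapFragmenter is registered under
that name, as in the example solrconfig.xml shipped with Solr.

```xml
<!-- Hypothetical handler name; adapt to the handler you actually query. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">dokumentum_syn_query</str>
    <!-- Use the simpler gap-based fragmenter instead of the regex one. -->
    <str name="hl.fragmenter">gap</str>
    <!-- Placeholder limit: analyze at most this many characters per document. -->
    <int name="hl.maxAnalyzedChars">10000</int>
  </lst>
</requestHandler>
```

Both parameters can also be passed per request (for example
&hl.fragmenter=gap), which makes it easier to compare timings before changing
the defaults.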


There is work to implement faster highlighting for Solr/Lucene, but it 
depends on some basic changes to the search architecture so it might be 
a while before that becomes available.  See 
https://issues.apache.org/jira/browse/LUCENE-3318 if you're interested 
in following that development.


-Mike
