Re: slow highlighting because of stemming
Thanks for the answers! This was the solution! :)
(My mistake was that I used the value on instead of true; I don't know why.)

Gyuri

2011/7/30 Michael Sokolov soko...@ifactory.com

> On 7/30/2011 3:46 AM, Orosz György wrote:
> > Hi, Thanks for the answer! I am doing some logging of the stemming, and what I can see is that a lot of tokens are stemmed for the highlighting. That is the strange part, since I don't understand why any highlighter needs stemming again.
>
> Consider that the highlighter needs to match terms from the query with terms from the document, just like search. If the indexed document has been stemmed, then the query also needs to be stemmed, or you won't see matches.
>
> -Mike
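For reference, here is a sketch of the field definition from the original post with the fix described above applied. Only the values of the three term vector attributes differ from the configuration quoted in this thread. That all three needed changing is an assumption, since the message does not say which attribute was involved.

```xml
<!-- Field definition from the original post. Only termVectors,
     termPositions and termOffsets are changed from "on" to "true".
     Assumption: all three attributes needed the change. -->
<field name="dokumentum_syn_query" type="huntext_syn"
       indexed="true" stored="true" multiValued="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

After a schema change like this, the documents have to be reindexed before the term vectors are actually present in the index.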
Re: slow highlighting because of stemming
Hi,

Thanks for the answer! I am doing some logging of the stemming, and what I can see is that a lot of tokens are stemmed for the highlighting. That is the strange part, since I don't understand why any highlighter needs stemming again.

Anyway, my documents are not really large, just a few kilobytes, but thanks for the suggestion. If you could help me with how to skip the stemming for highlighting, that would be great!

Thanks,
Gyuri

2011/7/29 Mike Sokolov soko...@ifactory.com

> I'm not sure I would identify stemming as the culprit here. Do you have very large documents? If so, there is a patch for FVH committed to limit the number of phrases it looks at; see hl.phraseLimit, but this won't be available until 3.4 is released.
>
> You can also limit the amount of each document that is analyzed by the regular Highlighter using maxDocCharsToAnalyze (and maybe this applies to FVH? not sure).
>
> Using RegexFragmenter is also probably slower than something like SimpleFragmenter.
>
> There is work to implement faster highlighting for Solr/Lucene, but it depends on some basic changes to the search architecture, so it might be a while before that becomes available. See https://issues.apache.org/jira/browse/LUCENE-3318 if you're interested in following that development.
>
> -Mike
Re: slow highlighting because of stemming
> I am doing some logging of the stemming, and what I can see is that a lot of tokens are stemmed for the highlighting. That is the strange part, since I don't understand why any highlighter needs stemming again.

Highlighting does re-analyze the text being highlighted.

> Anyway, my documents are not really large, just a few kilobytes, but thanks for the suggestion. If you could help me with how to skip the stemming for highlighting, that would be great!

If you store term vectors, this re-analysis is skipped.
http://wiki.apache.org/solr/FieldOptionsByUseCase
Re: slow highlighting because of stemming
On 7/30/2011 3:46 AM, Orosz György wrote:

> Hi, Thanks for the answer! I am doing some logging of the stemming, and what I can see is that a lot of tokens are stemmed for the highlighting. That is the strange part, since I don't understand why any highlighter needs stemming again.

Consider that the highlighter needs to match terms from the query with terms from the document, just like search. If the indexed document has been stemmed, then the query also needs to be stemmed, or you won't see matches.

-Mike
slow highlighting because of stemming
Dear all,

I am quite new to Solr and would like to ask for your help. I am developing an application that should be able to highlight the results of a query. For this I am using the regex fragmenter:

    <highlighting>
      <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
        <lst name="defaults">
          <int name="hl.fragsize">500</int>
          <float name="hl.regex.slop">0.5</float>
          <str name="hl.pre"><![CDATA[<b>]]></str>
          <str name="hl.post"><![CDATA[</b>]]></str>
          <str name="hl.useFastVectorHighlighter">true</str>
          <str name="hl.regex.pattern">[-\w ,/\n\']{20,300}[.?!]</str>
          <str name="hl.fl">dokumentum_syn_query</str>
        </lst>
      </fragmenter>
    </highlighting>

The field is indexed with term vectors and offsets:

    <field name="dokumentum_syn_query" type="huntext_syn" indexed="true" stored="true"
           multiValued="true" termVectors="on" termPositions="on" termOffsets="on"/>

    <fieldType name="huntext_syn" class="solr.TextField" stored="true" indexed="true"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="com.morphologic.solr.huntoken.HunTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords_query.txt" enablePositionIncrements="true"/>
        <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
                lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex" cache="alma"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords_query.txt" enablePositionIncrements="true"/>
        <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
                lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex" cache="alma"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_query.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

The highlighting works well, except that it's really slow. I realized that this is because the highlighter/fragmenter stems all the result documents again. Could you please help me understand why this happens and how I can avoid it? (I thought that using FastVectorHighlighter would solve my problem, but it didn't.)

Thanks in advance!
Gyuri Orosz
Re: slow highlighting because of stemming
I'm not sure I would identify stemming as the culprit here. Do you have very large documents? If so, there is a patch for FVH committed to limit the number of phrases it looks at; see hl.phraseLimit, but this won't be available until 3.4 is released.

You can also limit the amount of each document that is analyzed by the regular Highlighter using maxDocCharsToAnalyze (and maybe this applies to FVH? not sure).

Using RegexFragmenter is also probably slower than something like SimpleFragmenter.

There is work to implement faster highlighting for Solr/Lucene, but it depends on some basic changes to the search architecture, so it might be a while before that becomes available. See https://issues.apache.org/jira/browse/LUCENE-3318 if you're interested in following that development.

-Mike

On 07/29/2011 04:55 AM, Orosz György wrote:

> [...] The highlighting works well, except that it's really slow. I realized that this is because the highlighter/fragmenter stems all the result documents again. Could you please help me understand why this happens and how I can avoid it? (I thought that using FastVectorHighlighter would solve my problem, but it didn't.) [...]
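The tuning suggestions above could look roughly like this in solrconfig.xml. This is a sketch, not a tested configuration. It assumes that GapFragmenter is Solr's counterpart to Lucene's SimpleFragmenter and that hl.maxAnalyzedChars is the Solr request parameter corresponding to maxDocCharsToAnalyze. The handler name and all numeric values are placeholders, not recommendations from the thread.

```xml
<highlighting>
  <!-- Assumption: GapFragmenter stands in for "something like SimpleFragmenter". -->
  <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
    <lst name="defaults">
      <int name="hl.fragsize">500</int>
    </lst>
  </fragmenter>
</highlighting>

<!-- Hypothetical handler name; use whichever handler serves the queries. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">dokumentum_syn_query</str>
    <!-- Select the cheaper fragmenter instead of "regex". -->
    <str name="hl.fragmenter">gap</str>
    <!-- Placeholder value: limits how much of each document is analyzed. -->
    <int name="hl.maxAnalyzedChars">10000</int>
    <!-- hl.phraseLimit applies to FVH and, per the message above,
         is not available before 3.4:
    <int name="hl.phraseLimit">1000</int>
    -->
  </lst>
</requestHandler>
```

Since the documents in question are only a few kilobytes, a limit on analyzed characters may make little difference here; the fragmenter change is the more likely of the two to be measurable.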