Take a look at Michael's two articles; they might help you clarify how 
highlighting works in Solr:

Changing Bits: Lucene's TokenStreams are actually graphs!
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

Also take a look at the 4th paragraph in another of his articles:

Changing Bits: A new Lucene highlighter is born
http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html

Currently, I can't figure out the cause of your problem without spare time 
to test it on my own, which I don't have these days (I have some projects 
to close)!

If you find a solution or workaround, please let us know. Good luck again!

Scott Chu, scott....@udngroup.com
2015/10/27 
----- Original Message ----- 
From: Scott Chu 
To: solr-user 
Date: 2015-10-27, 10:27:45
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Edward,

    I took a lot of time to see if there's anything that can help you pin 
down the cause of your problem. Maybe this might help you a bit: 

[SOLR-4722] Highlighter which generates a list of query term position(s) for 
each item in a list of documents, or returns null if highlighting is disabled.
https://issues.apache.org/jira/browse/SOLR-4722

This one is modified from FastVectorHighlighter, so make sure the three term* 
attributes are turned on.
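For reference, enabling all three term* attributes on a field declaration looks like this (a sketch; the field name "content" and type "text_chinese" are taken from later in this thread, so adjust them to your own schema):

```xml
<!-- Term vectors with positions and offsets are what
     FastVectorHighlighter-style highlighters rely on. -->
<field name="content" type="text_chinese" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```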

Scott Chu, scott....@udngroup.com
2015/10/27 
----- Original Message ----- 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-23, 10:42:32
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Scott,

Thank you for your response.

1. You said the problem only happens on "contents" field, so maybe there's
something wrong with the contents of that field. Does it contain anything
special, e.g. HTML tags or symbols? I recall SOLR-42 mentions that HTML
stripping can cause highlight problems. Maybe you can try purifying that
field to be closer to pure text and see if the highlighting comes out ok.
*A) I checked, and SOLR-42 refers to the HTMLStripWhiteSpaceTokenizerFactory,
which I'm not using. I believe that tokenizer is already deprecated too. I've
tried all kinds of content for rich-text documents, and all of them have the
same problem.*

2. Maybe something is incompatible between JiebaTokenizer and the Solr
highlighter. Try switching to other tokenizers, e.g. Standard, CJK,
SmartChinese (I don't use this since I am dealing with Traditional Chinese,
but I see you are dealing with Simplified Chinese), or the 3rd-party MMSeg,
and see if the problem goes away. However, when I was googling for similar
problems, I saw you asked the same question in August at Huaban/Jieba-analysis,
and somebody said he also uses JiebaTokenizer but doesn't have your problem.
So I see this as a less likely suspect.
*A) I was thinking about the incompatibility issue too, as I previously
thought that JiebaTokenizer was optimised for Solr 4.x, so it might have
issues in 5.x. But the person from Huaban/Jieba-analysis said that he doesn't
have this problem in Solr 5.1. I also face the same problem in Solr 5.1, and
although I'm using Solr 5.3.0 now, the same problem persists.*

I'm looking at the indexing process too, to see if there's any problem
there. But I just can't figure out why it only happens with JiebaTokenizer,
and only for the content field.
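The consistent one-character shift described in this thread is exactly what a highlighter produces when the character offsets reported by the tokenizer are off by one. A small self-contained Python sketch (just an illustration of the mechanism, not Solr's actual highlighter code) reproduces the symptom:

```python
def highlight(text, offsets, pre="<em>", post="</em>"):
    """Insert highlight tags around the given (start, end) character offsets,
    trusting the offsets the way a highlighter trusts the token stream."""
    out, last = [], 0
    for start, end in offsets:
        out.append(text[last:start])          # untouched text before the hit
        out.append(pre + text[start:end] + post)  # the highlighted span
        last = end
    out.append(text[last:])                   # untouched tail
    return "".join(out)

text = "corporate social responsibility"

# Correct offsets for the token "responsibility" (chars 17..30):
print(highlight(text, [(17, 31)]))
# -> corporate social <em>responsibility</em>

# The same offsets shifted left by one, as a buggy tokenizer might report:
print(highlight(text, [(16, 30)]))
# -> corporate social<em> responsibilit</em>y
```

The second output has the leading space swallowed into the tag and the final letter pushed outside it, which matches the reported "<em> responsibilit</em>y" behaviour; this points at the offsets the tokenizer writes at index time rather than at the highlighter configuration.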


Regards,
Edwin


On 23 October 2015 at 09:41, Scott Chu <scott....@udngroup.com> wrote:

> Hi Edwin,
>
> Since you've tested all my suggestions and the problem is still there, I
> can't think of anything wrong with your configuration. Now I can only
> suspect two things:
>
> 1. You said the problem only happens on "contents" field, so maybe there's
> something wrong with the contents of that field. Does it contain anything
> special, e.g. HTML tags or symbols? I recall SOLR-42 mentions that HTML
> stripping can cause highlight problems. Maybe you can try purifying that
> field to be closer to pure text and see if the highlighting comes out ok.
>
> 2. Maybe something is incompatible between JiebaTokenizer and the Solr
> highlighter. Try switching to other tokenizers, e.g. Standard, CJK,
> SmartChinese (I don't use this since I am dealing with Traditional Chinese,
> but I see you are dealing with Simplified Chinese), or the 3rd-party MMSeg,
> and see if the problem goes away. However, when I was googling for similar
> problems, I saw you asked the same question in August at Huaban/Jieba-analysis,
> and somebody said he also uses JiebaTokenizer but doesn't have your problem.
> So I see this as a less likely suspect.
>
> My theory is that something in the indexing process causes wrong position
> info for that field, so when Solr does the highlighting, it retrieves the
> wrong position info and marks the wrong positions of the highlighted terms.
>
> Scott Chu, scott....@udngroup.com
> 2015/10/23
>
> ----- Original Message -----
> *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> *To: *solr-user <solr-user@lucene.apache.org>
> *Date: *2015-10-22, 22:22:14
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Scott,
>
> Thank you for your response and suggestions.
>
> With respond to your questions, here are the answers:
>
> 1. I took a look at Jieba. It uses a dictionary and seems to do a good
> job on CJK. I suspect this problem may come from those filters (note: I can
> understand you may use CJKWidthFilter to convert Japanese, but I don't
> understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
> commenting out those filters, say leaving only Jieba and StopFilter, and
> seeing if this problem disappears?
> *A) Yes, I have tried commenting out the other filters and only left with
> Jieba and StopFilter. The problem is still there.*
>
> 2. Does this problem occur only on Chinese search words? Does it happen on
> English search words?
> *A) Yes, the same problem occurs on English words. For example, when I
> search for "word", it will highlight in this way: <em> wor<em>d*
>
> 3. To use FastVectorHighlighter, you seem to have to enable the 3 term*
> parameters in the field declaration. I see only one is enabled. Please refer
> to the answer in this Stack Overflow question:
>
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> *A) I have tried enabling all 3 term* parameters for the
> FastVectorHighlighter too, but the same problem persists as well.*
>
>
> Regards,
> Edwin
>
>
> On 22 October 2015 at 16:25, Scott Chu <scott....@udngroup.com> wrote:
>
> > Hi solr-user,
> >
> > I can't judge the cause from a quick glance at your definition, but here
> > are some suggestions I can give:
> >
> > 1. I took a look at Jieba. It uses a dictionary and seems to do a good
> > job on CJK. I suspect this problem may come from those filters (note: I
> > can understand you may use CJKWidthFilter to convert Japanese, but I don't
> > understand why you use CJKBigramFilter and EdgeNGramFilter). Have you
> > tried commenting out those filters, say leaving only Jieba and StopFilter,
> > and seeing if this problem disappears?
> >
> > 2. Does this problem occur only on Chinese search words? Does it happen
> > on English search words?
> >
> > 3. To use FastVectorHighlighter, you seem to have to enable the 3 term*
> > parameters in the field declaration. I see only one is enabled. Please
> > refer to the answer in this Stack Overflow question:
> >
> > http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> >
> >
> > Scott Chu, scott....@udngroup.com
> > 2015/10/22
> >
> > ----- Original Message -----
> > *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > *To: *solr-user <solr-user@lucene.apache.org>
> > *Date: *2015-10-20, 12:04:11
> > *Subject: *Re: Highlighting content field problem when using
>
> > JiebaTokenizerFactory
> >
> > Hi Scott,
> >
> > Here's my schema.xml for content and title, which both use text_chinese.
> > The problem only occurs in content, and not in title.
> >
> > <field name="content" type="text_chinese" indexed="true" stored="true"
> > omitNorms="true" termVectors="true"/>
> > <field name="title" type="text_chinese" indexed="true" stored="true"
> > omitNorms="true" termVectors="true"/>
> >
> >
> > <fieldType name="text_chinese" class="solr.TextField"
> > positionIncrementGap="100">
> > <analyzer type="index">
> > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > segMode="SEARCH"/>
> > <filter class="solr.CJKWidthFilterFactory"/>
> > <filter class="solr.CJKBigramFilterFactory"/>
> > <filter class="solr.StopFilterFactory"
> > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> > maxGramSize="15"/>
> > <filter class="solr.PorterStemFilterFactory"/>
> > </analyzer>
> > <analyzer type="query">
> > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > segMode="SEARCH"/>
> > <filter class="solr.CJKWidthFilterFactory"/>
> > <filter class="solr.CJKBigramFilterFactory"/>
> > <filter class="solr.StopFilterFactory"
> > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > <filter class="solr.PorterStemFilterFactory"/>
> > </analyzer>
> > </fieldType>
> >
> >
> > Here's my solrconfig.xml on the highlighting portion:
> >
> > <requestHandler name="/highlight" class="solr.SearchHandler">
> > <lst name="defaults">
> > <str name="echoParams">explicit</str>
> > <int name="rows">10</int>
> > <str name="wt">json</str>
> > <str name="indent">true</str>
> > <str name="df">text</str>
> > <str name="fl">id, title, content_type, last_modified, url, score </str>
> >
> > <str name="hl">on</str>
> > <str name="hl.fl">id, title, content, author, tag</str>
> > <str name="hl.highlightMultiTerm">true</str>
> > <str name="hl.preserveMulti">true</str>
> > <str name="hl.encoder">html</str>
> > <str name="hl.fragsize">200</str>
> > <str name="group">true</str>
> > <str name="group.field">signature</str>
> > <str name="group.main">true</str>
> > <str name="group.cache.percent">100</str>
> > </lst>
> > </requestHandler>
> >
> > <boundaryScanner name="breakIterator"
> > class="solr.highlight.BreakIteratorBoundaryScanner">
> > <lst name="defaults">
> > <str name="hl.bs.type">WORD</str>
> > <str name="hl.bs.language">en</str>
> > <str name="hl.bs.country">SG</str>
> > </lst>
> > </boundaryScanner>
> >
> >
> > Meanwhile, I'll take a look at the articles too.
> >
> > Thank you.
> >
> > Regards,
> > Edwin
> >
> >
> > On 20 October 2015 at 11:32, Scott Chu <scott....@udngroup.com> wrote:
> >
> > > Hi Edwin,
> > >
> > > I didn't use Jieba for Chinese (I use only CJK, very fundamental, I
> > > know), so I didn't experience this problem.
> > >
> > > I'd suggest you post your schema.xml so we can see how you define your
> > > content field and the field type it uses.
> > >
> > > In the meantime, refer to these articles; maybe the answer or a
> > > workaround can be deduced from them.
> > >
> > > https://issues.apache.org/jira/browse/SOLR-3390
> > >
> > > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words

> > >
> > > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> > >
> > > Good luck!
> > >
> > >
> > >
> > >
> > > Scott Chu, scott....@udngroup.com
> > > 2015/10/20
> > >
> > > ----- Original Message -----
> > > *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > *To: *solr-user <solr-user@lucene.apache.org>
>
> > > *Date: *2015-10-13, 17:04:29
> > > *Subject: *Highlighting content field problem when using
> > > JiebaTokenizerFactory
> > >
> > > Hi,
> > >
> > > I'm trying to use the JiebaTokenizerFactory to index Chinese characters
> > > in Solr. It works fine with the segmentation when I'm using
> > > the Analysis function on the Solr Admin UI.
> > >
> > > However, when I tried to do the highlighting in Solr, it is not
> > > highlighting in the correct place. For example, when I search for
> > > 自然環境与企業本身, it highlights 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
> > >
> > > Even when I search for an English word like responsibility, it
> > > highlights <em> *responsibilit<em>*y.
> > >
> > > Basically, the highlighting goes off by 1 character/space consistently.
> > >
> > > This problem only happens in the content field, and not in any other
> > > fields.
> > >
> > > Does anyone know what could be causing the issue?
> > >
> > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > >
> > > -----
> > > No virus found in this message.
> > > Checked by AVG - www.avg.com
> > > Version: 2015.0.6140 / Virus DB: 4447/10808 - Release date: 10/12/15
> > >
> > >
> >
> >
> >
> >
> >
>
>
>
>
>



