Re: Highlighting content field problem when using JiebaTokenizerFactory

Zheng Lin Edwin Yeo Tue, 27 Oct 2015 00:58:45 -0700

Hi Scott,

Thank you for providing the links and references. I will look through them
and let you know if I find a solution or workaround.


Regards,
Edwin


On 27 October 2015 at 11:13, Scott Chu <scott....@udngroup.com> wrote:

>
> Take a look at Michael's two articles; they might help you clarify how
> highlighting works in Solr:
>
> Changing Bits: Lucene's TokenStreams are actually graphs!
>
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>
> Also take a look at the 4th paragraph of another of his articles:
>
> Changing Bits: A new Lucene highlighter is born
>
> http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
>
> Currently, I can't figure out the possible cause of your problem unless I
> get some spare time to test it on my own, which I don't have these days (I
> have some projects to close)!
>
> If you find a solution or workaround, please let us know. Good luck again!
>
> Scott Chu，scott....@udngroup.com
> 2015/10/27
>
> ----- Original Message -----
> *From: *Scott Chu <scott....@udngroup.com>
> *To: *solr-user <solr-user@lucene.apache.org>
> *Date: *2015-10-27, 10:27:45
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Edwin,
>
> I spent a lot of time looking for anything that could help you pin down
> the cause of your problem. Maybe this will help you a bit:
>
> [SOLR-4722] Highlighter which generates a list of query term position(s)
> for each item in a list of documents, or returns null if highlighting is
> disabled. - AS...
> https://issues.apache.org/jira/browse/SOLR-4722
>
> This one is modified from FastVectorHighlighter, so make sure those 3 term*
> attributes (termVectors, termPositions, termOffsets) are on.
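>
> A rough sketch of what that means in schema.xml, assuming the content
> field quoted further down in this thread. Only termVectors is set in that
> schema; termPositions and termOffsets are the two additions here, and this
> is an illustration rather than a tested fix:
>
> ```xml
> <!-- Sketch only: the content field with all three term* attributes
>      enabled, which FastVectorHighlighter needs. -->
> <field name="content" type="text_chinese" indexed="true" stored="true"
>        omitNorms="true" termVectors="true" termPositions="true"
>        termOffsets="true"/>
> ```
>
> Existing documents have to be reindexed after a change like this, and in
> Solr 5.x the request also needs hl.useFastVectorHighlighter=true for the
> FastVectorHighlighter to be used.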
>
> Scott Chu，scott....@udngroup.com
> 2015/10/27
>
> ----- Original Message -----
> *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> *To: *solr-user <solr-user@lucene.apache.org>
> *Date: *2015-10-23, 10:42:32
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Scott,
>
> Thank you for your response.
>
> 1. You said the problem only happens on the "contents" field, so maybe
> there's something wrong with the contents of that field. Does it contain
> anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions
> that HTML stripping can cause highlighting problems. Maybe you can try
> cleaning that field so it is close to pure text and see if the
> highlighting comes out OK.
> *A) I checked, and SOLR-42 is about the
> HTMLStripWhiteSpaceTokenizerFactory, which I'm not using. I believe that
> tokenizer is already deprecated too. I've tried all kinds of content
> for rich-text documents, and all of them have the same problem.*
>
> 2. Maybe something is incompatible between JiebaTokenizer and the Solr
> highlighter. Try switching to other tokenizers, e.g. Standard, CJK,
> SmartChinese (I don't use this since I am dealing with Traditional Chinese,
> but I see you are dealing with Simplified Chinese), or the 3rd-party MMSeg,
> and see if the problem goes away. However, when I was googling similar
> problems, I saw you asked the same question in August at
> Huaban/Jieba-analysis, and somebody said he also uses JiebaTokenizer but
> doesn't have your problem. So I see this as the less likely suspect.
> *A) I was thinking about the incompatibility issue too, as I previously
> thought that JiebaTokenizer was optimised for Solr 4.x, so it may have issues
> in 5.x. But the person from Huaban/Jieba-analysis said that he doesn't have
> this problem in Solr 5.1. I also faced the same problem in Solr 5.1, and
> although I'm using Solr 5.3.0 now, the same problem persists.*
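>
> For anyone repeating the tokenizer swap suggested in point 2, a minimal
> sketch of a comparison field type is shown below. The name
> text_chinese_std is hypothetical; solr.StandardTokenizerFactory is the
> stock Solr tokenizer, and the two filters are carried over from the
> text_chinese type quoted further down in this thread.
>
> ```xml
> <!-- Hypothetical comparison type: the stock StandardTokenizer stands in
>      for JiebaTokenizerFactory, everything else is kept minimal. -->
> <fieldType name="text_chinese_std" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory"
>             words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>   </analyzer>
> </fieldType>
> ```
>
> If highlighting on a field of this type lands on the right characters for
> the same documents, that would point at the offsets produced by the Jieba
> tokenizer rather than at the content itself.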
>
> I'm looking at the indexing process too, to see if there's any problem
> there. But I just can't figure out why it only happens with JiebaTokenizer,
> and only for the content field.
>
>
> Regards,
> Edwin
>
>
> On 23 October 2015 at 09:41, Scott Chu <scott....@udngroup.com> wrote:
>
> > Hi Edwin,
> >
> > Since you've tested all my suggestions and the problem is still there, I
> > can't think of anything wrong with your configuration. Now I can only
> > suspect two things:
> >
> > 1. You said the problem only happens on the "contents" field, so maybe
> > there's something wrong with the contents of that field. Does it contain
> > anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions
> > that HTML stripping can cause highlighting problems. Maybe you can try
> > cleaning that field so it is close to pure text and see if the
> > highlighting comes out OK.
> >
> > 2. Maybe something is incompatible between JiebaTokenizer and the Solr
> > highlighter. Try switching to other tokenizers, e.g. Standard, CJK,
> > SmartChinese (I don't use this since I am dealing with Traditional
> > Chinese, but I see you are dealing with Simplified Chinese), or the
> > 3rd-party MMSeg, and see if the problem goes away. However, when I was
> > googling similar problems, I saw you asked the same question in August at
> > Huaban/Jieba-analysis, and somebody said he also uses JiebaTokenizer but
> > doesn't have your problem. So I see this as the less likely suspect.
> >
> > My theory is that something in the indexing process produces wrong
> > position info for that field, so when Solr does the highlighting, it
> > retrieves the wrong position info and marks the wrong positions for the
> > highlighted terms.
> >
> > Scott Chu，scott....@udngroup.com
> > 2015/10/23
> >
> > ----- Original Message -----
> > *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > *To: *solr-user <solr-user@lucene.apache.org>
> > *Date: *2015-10-22, 22:22:14
> > *Subject: *Re: Highlighting content field problem when using
> > JiebaTokenizerFactory
> >
> > Hi Scott,
> >
> > Thank you for your response and suggestions.
> >
> > In response to your questions, here are the answers:
> >
> > 1. I took a look at Jieba. It uses a dictionary and it seems to do a good
> > job on CJK. I suspect this problem may come from those filters (note: I
> > can understand why you may use CJKWidthFilter to convert Japanese, but I
> > don't understand why you use CJKBigramFilter and EdgeNGramFilter). Have
> > you tried commenting out those filters, say leaving only Jieba and
> > StopFilter, to see if this problem disappears?
> > *A) Yes, I have tried commenting out the other filters, leaving only
> > Jieba and StopFilter. The problem is still there.*
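> >
> > For reference, the reduced chain described in this answer would look
> > roughly like the sketch below. It is an illustration assembled from the
> > text_chinese definition quoted further down in this thread, not the exact
> > configuration that was tested.
> >
> > ```xml
> > <!-- Sketch only: text_chinese cut down to the Jieba tokenizer and the
> >      stop filter, with one analyzer used for both index and query. -->
> > <fieldType name="text_chinese" class="solr.TextField"
> >            positionIncrementGap="100">
> >   <analyzer>
> >     <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> >                segMode="SEARCH"/>
> >     <filter class="solr.StopFilterFactory"
> >             words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> >   </analyzer>
> > </fieldType>
> > ```
> >
> > Since the misplaced highlighting survives with a chain like this, the
> > CJKBigram, EdgeNGram and PorterStem filters look unlikely to be the cause.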
> >
> > 2. Does this problem occur only on Chinese search words? Does it happen on
> > English search words?
> > *A) Yes, the same problem occurs on English words. For example, when I
> > search for "word", it will highlight in this way: <em> wor<em>d*
> >
> > 3. To use FastVectorHighlighter, you seem to have to enable 3 term*
> > parameters in the field declaration. I see only one is enabled. Please
> > refer to the answer in this Stack Overflow question:
> >
> > http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> > *A) I have tried enabling all 3 term* parameters for the
> > FastVectorHighlighter too, but the same problem persists as well.*
> >
> >
> > Regards,
> > Edwin
> >
> >
> > On 22 October 2015 at 16:25, Scott Chu <scott....@udngroup.com> wrote:
> >
> > > Hi solr-user,
> > >
> > > I can't judge the cause from a quick glance at your definition, but
> > > here are some suggestions:
> > >
> > > 1. I took a look at Jieba. It uses a dictionary and it seems to do a
> > > good job on CJK. I suspect this problem may come from those filters
> > > (note: I can understand why you may use CJKWidthFilter to convert
> > > Japanese, but I don't understand why you use CJKBigramFilter and
> > > EdgeNGramFilter). Have you tried commenting out those filters, say
> > > leaving only Jieba and StopFilter, to see if this problem disappears?
> > >
> > > 2. Does this problem occur only on Chinese search words? Does it happen
> > > on English search words?
> > >
> > > 3. To use FastVectorHighlighter, you seem to have to enable 3 term*
> > > parameters in the field declaration. I see only one is enabled. Please
> > > refer to the answer in this Stack Overflow question:
> > >
> > > http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> > >
> > >
> > > Scott Chu，scott....@udngroup.com
> > > 2015/10/22
> > >
> > > ----- Original Message -----
> > > *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > *To: *solr-user <solr-user@lucene.apache.org>
> > > *Date: *2015-10-20, 12:04:11
> > > *Subject: *Re: Highlighting content field problem when using
> > > JiebaTokenizerFactory
> > >
> > > Hi Scott,
> > >
> > > Here's my schema.xml for content and title, which both use
> > > text_chinese. The problem only occurs in content, and not in title.
> > >
> > > <field name="content" type="text_chinese" indexed="true" stored="true"
> > > omitNorms="true" termVectors="true"/>
> > > <field name="title" type="text_chinese" indexed="true" stored="true"
> > > omitNorms="true" termVectors="true"/>
> > >
> > >
> > > <fieldType name="text_chinese" class="solr.TextField"
> > > positionIncrementGap="100">
> > > <analyzer type="index">
> > > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > > segMode="SEARCH"/>
> > > <filter class="solr.CJKWidthFilterFactory"/>
> > > <filter class="solr.CJKBigramFilterFactory"/>
> > > <filter class="solr.StopFilterFactory"
> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > > <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> > > maxGramSize="15"/>
> > > <filter class="solr.PorterStemFilterFactory"/>
> > > </analyzer>
> > > <analyzer type="query">
> > > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > > segMode="SEARCH"/>
> > > <filter class="solr.CJKWidthFilterFactory"/>
> > > <filter class="solr.CJKBigramFilterFactory"/>
> > > <filter class="solr.StopFilterFactory"
> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > > <filter class="solr.PorterStemFilterFactory"/>
> > > </analyzer>
> > > </fieldType>
> > >
> > >
> > > Here's my solrconfig.xml on the highlighting portion:
> > >
> > > <requestHandler name="/highlight" class="solr.SearchHandler">
> > > <lst name="defaults">
> > > <str name="echoParams">explicit</str>
> > > <int name="rows">10</int>
> > > <str name="wt">json</str>
> > > <str name="indent">true</str>
> > > <str name="df">text</str>
> > > <str name="fl">id, title, content_type, last_modified, url, score</str>
> > >
> > > <str name="hl">on</str>
> > > <str name="hl.fl">id, title, content, author, tag</str>
> > > <str name="hl.highlightMultiTerm">true</str>
> > > <str name="hl.preserveMulti">true</str>
> > > <str name="hl.encoder">html</str>
> > > <str name="hl.fragsize">200</str>
> > > <str name="group">true</str>
> > > <str name="group.field">signature</str>
> > > <str name="group.main">true</str>
> > > <str name="group.cache.percent">100</str>
> > > </lst>
> > > </requestHandler>
> > >
> > > <boundaryScanner name="breakIterator"
> > > class="solr.highlight.BreakIteratorBoundaryScanner">
> > > <lst name="defaults">
> > > <str name="hl.bs.type">WORD</str>
> > > <str name="hl.bs.language">en</str>
> > > <str name="hl.bs.country">SG</str>
> > > </lst>
> > > </boundaryScanner>
> > >
> > >
> > > Meanwhile, I'll take a look at the articles too.
> > >
> > > Thank you.
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 20 October 2015 at 11:32, Scott Chu <scott....@udngroup.com> wrote:
> > >
> > > > Hi Edwin,
> > > >
> > > > I don't use Jieba for Chinese (I use only CJK, which is very basic, I
> > > > know), so I haven't experienced this problem.
> > > >
> > > > I'd suggest you post your schema.xml so we can see how you define your
> > > > content field and the field type it uses.
> > > >
> > > > In the meantime, refer to these articles; maybe the answer or a
> > > > workaround can be deduced from them.
> > > >
> > > > https://issues.apache.org/jira/browse/SOLR-3390
> > > >
> > > > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
> > > >
> > > > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> > > >
> > > > Good luck!
> > > >
> > > >
> > > >
> > > >
> > > > Scott Chu，scott....@udngroup.com
> > > > 2015/10/20
> > > >
> > > > ----- Original Message -----
> > > > *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > > *To: *solr-user <solr-user@lucene.apache.org>
> > > > *Date: *2015-10-13, 17:04:29
> > > > *Subject: *Highlighting content field problem when using
> > > > JiebaTokenizerFactory
> > > >
> > > > Hi,
> > > >
> > > > I'm trying to use the JiebaTokenizerFactory to index Chinese
> > > > characters in Solr. The segmentation works fine when I use the
> > > > Analysis function in the Solr Admin UI.
> > > >
> > > > However, when I try highlighting in Solr, it does not highlight in the
> > > > correct place. For example, when I search for
> > > > 自然環境与企業本身,
> > > > it highlights 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
> > > >
> > > > Even when I search for an English word like responsibility, it
> > > > highlights
> > > > <em> *responsibilit<em>*y.
> > > >
> > > > Basically, the highlighting is consistently off by 1 character/space.
> > > >
> > > > This problem only happens in the content field, and not in any other
> > > > fields.
> > > >
> > > > Does anyone know what could be causing the issue?
> > > >
> > > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> > > >
> > > >
> > > > Regards,
> > > > Edwin
