Hi Scott,

I've created a Jira issue for this; the issue number is SOLR-8334.

Regards,
Edwin


On 24 November 2015 at 00:36, Scott Stults <sstu...@opensourceconnections.com> wrote:

> Edwin,
>
> Congrats on getting it to work! Would you please create a Jira issue for
> this and add the patch? You won't need the inline change comments -- a good
> description in the ticket itself will work best.
>
> k/r,
> Scott
>
> On Sun, Nov 22, 2015 at 10:13 PM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
> > I've made a minor modification to the code in JiebaSegmenter.java, and the
> > highlighting seems to be fine now.
> >
> > Basically, I created another int called offset2 in the process() method:
> > int offset2 = 0;
> >
> > Then I changed offset to offset2 in this part of the process() method.
> >
> >         if (sb.length() > 0)
> >             if (mode == SegMode.SEARCH) {
> >                 for (Word token : sentenceProcess(sb.toString())) {
> >                     // tokens.add(new SegToken(token, offset, offset += token.length()));
> >                     tokens.add(new SegToken(token, offset2, offset2 += token.length())); // Change to offset2 by Edwin
> >                 }
> >             } else {
> >                 for (Word token : sentenceProcess(sb.toString())) {
> >                     if (token.length() > 2) {
> >                         Word gram2;
> >                         int j = 0;
> >                         for (; j < token.length() - 1; ++j) {
> >                             gram2 = token.subSequence(j, j + 2);
> >                             if (wordDict.containsWord(gram2.getToken()))
> >                                 // tokens.add(new SegToken(gram2, offset + j, offset + j + 2));
> >                                 tokens.add(new SegToken(gram2, offset2 + j, offset2 + j + 2)); // Change to offset2 by Edwin
> >                         }
> >                     }
> >                     if (token.length() > 3) {
> >                         Word gram3;
> >                         int j = 0;
> >                         for (; j < token.length() - 2; ++j) {
> >                             gram3 = token.subSequence(j, j + 3);
> >                             if (wordDict.containsWord(gram3.getToken()))
> >                                 // tokens.add(new SegToken(gram3, offset + j, offset + j + 3));
> >                                 tokens.add(new SegToken(gram3, offset2 + j, offset2 + j + 3)); // Change to offset2 by Edwin
> >                         }
> >                     }
> >                     // tokens.add(new SegToken(token, offset, offset += token.length()));
> >                     tokens.add(new SegToken(token, offset2, offset2 += token.length())); // Change to offset2 by Edwin
> >                 }
> >             }
> >
> >
> > I'm not sure if this is just a workaround or can be used as a permanent
> > solution.
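> >
> > To illustrate why a separate counter matters, here's a simplified,
> > self-contained sketch (a hypothetical class with hand-picked sample
> > tokens, not the actual jieba-analysis code):
> >
> > import java.util.Arrays;
> > import java.util.List;
> >
> > // Highlighting relies on each token's [start, end) offsets indexing into
> > // the ORIGINAL input, so the counter used for offsets may only advance by
> > // the emitted token lengths. A counter that is also bumped elsewhere while
> > // scanning (presumably what happened to offset in process()) shifts every
> > // subsequent token.
> > public class OffsetSketch {
> >     public static void main(String[] args) {
> >         String input = "自然環境与企業本身";
> >         List<String> tokens = Arrays.asList("自然", "環境", "与", "企業", "本身");
> >         int emitOffset = 0; // plays the role of offset2 in the patch above
> >         for (String tok : tokens) {
> >             int start = emitOffset;
> >             int end = emitOffset += tok.length();
> >             // the offsets must slice the token back out of the input
> >             assert input.substring(start, end).equals(tok);
> >             System.out.println(tok + " [" + start + "," + end + ")");
> >         }
> >     }
> > }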
> >
> > Regards,
> > Edwin
> >
> >
> > On 28 October 2015 at 15:29, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > wrote:
> >
> > > Hi Scott,
> > >
> > > I have tried editing the SegToken.java file in the jieba-analysis-1.0.0
> > > package, adding +1 to both the startOffset and endOffset values (see code
> > > below), and the <em> tag is now shifted to the correct place in the
> > > content. However, this means that in the title and other fields, where
> > > the <em> tag was originally in the correct place, I now get the
> > > "org.apache.lucene.search.highlight.InvalidTokenOffsetsException"
> > > exception. For now, I have temporarily switched to another tokenizer for
> > > the other fields.
> > >
> > >     public SegToken(Word word, int startOffset, int endOffset) {
> > >         this.word = word;
> > >         this.startOffset = startOffset + 1; // +1 shift: workaround only
> > >         this.endOffset = endOffset + 1;     // +1 shift: workaround only
> > >     }
> > >
> > > However, I don't think this can be a permanent solution, so I'm zooming
> > > in further on the code to see what the difference is between the content
> > > field and the other fields.
> > >
> > > I have also found that although JiebaTokenizer works better for Chinese
> > > characters, it doesn't work well for English characters. For example, if
> > > I search for "water", the JiebaTokenizer will cut it as follows:
> > > w|at|er
> > > It can't keep it as a full word, which HMMChineseTokenizer is able to do.
> > >
> > > Here's my configuration in schema.xml:
> > >
> > > <fieldType name="text_chinese2" class="solr.TextField"
> > >            positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > >                segMode="SEARCH"/>
> > >     <filter class="solr.CJKWidthFilterFactory"/>
> > >     <filter class="solr.CJKBigramFilterFactory"/>
> > >     <filter class="solr.StopFilterFactory"
> > >             words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > >     <filter class="solr.PorterStemFilterFactory"/>
> > >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> > >             maxGramSize="15"/>
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > >                segMode="SEARCH"/>
> > >     <filter class="solr.CJKWidthFilterFactory"/>
> > >     <filter class="solr.CJKBigramFilterFactory"/>
> > >     <filter class="solr.StopFilterFactory"
> > >             words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > >     <filter class="solr.PorterStemFilterFactory"/>
> > >   </analyzer>
> > > </fieldType>
> > >
> > > Does anyone know if JiebaTokenizer is optimised to handle English
> > > characters as well?
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 27 October 2015 at 15:57, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > >
> > >> Hi Scott,
> > >>
> > >> Thank you for providing the links and references. I will look through
> > >> them and let you know if I find any solution or workaround.
> > >>
> > >> Regards,
> > >> Edwin
> > >>
> > >>
> > >> On 27 October 2015 at 11:13, Scott Chu <scott....@udngroup.com> wrote:
> > >>
> > >>>
> > >>> Take a look at Michael's two articles; they might help you clarify the
> > >>> idea of highlighting in Solr:
> > >>>
> > >>> Changing Bits: Lucene's TokenStreams are actually graphs!
> > >>>
> > >>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> > >>>
> > >>> Also take a look at the 4th paragraph in another article of his:
> > >>>
> > >>> Changing Bits: A new Lucene highlighter is born
> > >>>
> > >>> http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
> > >>>
> > >>> Currently, I can't figure out the possible cause of your problem unless
> > >>> I get spare time to test it on my own, which I don't have these days (I
> > >>> have some projects to close)!
> > >>>
> > >>> If you find a solution or workaround, please let us know. Good luck
> > >>> again!
> > >>>
> > >>> Scott Chu, scott....@udngroup.com
> > >>> 2015/10/27
> > >>>
> > >>> ----- Original Message -----
> > >>> *From: *Scott Chu <scott....@udngroup.com>
> > >>> *To: *solr-user <solr-user@lucene.apache.org>
> > >>> *Date: *2015-10-27, 10:27:45
> > >>> *Subject: *Re: Highlighting content field problem when using
> > >>> JiebaTokenizerFactory
> > >>>
> > >>> Hi Edward,
> > >>>
> > >>>     It took a lot of time to see if there's anything that can help you
> > >>> pin down the cause of your problem. Maybe this might help you a bit:
> > >>>
> > >>> [SOLR-4722] Highlighter which generates a list of query term
> > >>> position(s) for each item in a list of documents, or returns null if
> > >>> highlighting is disabled.
> > >>> https://issues.apache.org/jira/browse/SOLR-4722
> > >>>
> > >>> This one is modified from FastVectorHighlighter, so ensure those 3
> > >>> term* attributes are on.
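> > >>>
> > >>> In schema.xml that means enabling termVectors, termPositions and
> > >>> termOffsets on the field, something like this (field name and type are
> > >>> illustrative, taken from the schema posted earlier in this thread):
> > >>>
> > >>> <field name="content" type="text_chinese" indexed="true" stored="true"
> > >>>        termVectors="true" termPositions="true" termOffsets="true"/>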
> > >>>
> > >>> Scott Chu, scott....@udngroup.com
> > >>> 2015/10/27
> > >>>
> > >>> ----- Original Message -----
> > >>> *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > >>> *To: *solr-user <solr-user@lucene.apache.org>
> > >>> *Date: *2015-10-23, 10:42:32
> > >>> *Subject: *Re: Highlighting content field problem when using
> > >>> JiebaTokenizerFactory
> > >>>
> > >>> Hi Scott,
> > >>>
> > >>> Thank you for your respond.
> > >>>
> > >>> 1. You said the problem only happens on the "contents" field, so maybe
> > >>> there's something wrong with the contents of that field. Does it contain
> > >>> anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions
> > >>> something about HTML stripping causing highlight problems. Maybe you can
> > >>> try purifying that field to be closer to pure text and see if the
> > >>> highlighting comes out OK.
> > >>> *A) I checked that SOLR-42 is about the
> > >>> HTMLStripWhiteSpaceTokenizerFactory, which I'm not using. I believe that
> > >>> tokenizer is already deprecated too. I've tried all kinds of content in
> > >>> rich-text documents, and all of them have the same problem.*
> > >>>
> > >>> 2. Maybe something is incompatible between JiebaTokenizer and the Solr
> > >>> highlighter. You could switch to other tokenizers, e.g. Standard, CJK,
> > >>> SmartChinese (I don't use this since I am dealing with Traditional
> > >>> Chinese, but I see you are dealing with Simplified Chinese), or the
> > >>> 3rd-party MMSeg, and see if the problem goes away. However, when I was
> > >>> googling for similar problems, I saw you asked the same question in
> > >>> August at Huaban/Jieba-analysis, and somebody said he also uses
> > >>> JiebaTokenizer but doesn't have your problem. So I see this as a less
> > >>> likely suspect.
> > >>> *A) I was thinking about the incompatibility issue too, as I previously
> > >>> thought that JiebaTokenizer is optimised for Solr 4.x, so it may have
> > >>> issues in 5.x. But the person from Huaban/Jieba-analysis said that he
> > >>> doesn't have this problem in Solr 5.1. I also face the same problem in
> > >>> Solr 5.1, and although I'm using Solr 5.3.0 now, the same problem
> > >>> persists.*
> > >>>
> > >>> I'm looking at the indexing process too, to see if there's any problem
> > >>> there, but I just can't figure out why it only happens with
> > >>> JiebaTokenizer, and only for the content field.
> > >>>
> > >>>
> > >>> Regards,
> > >>> Edwin
> > >>>
> > >>>
> > >>> On 23 October 2015 at 09:41, Scott Chu <scott....@udngroup.com> wrote:
> > >>>
> > >>> > Hi Edwin,
> > >>> >
> > >>> > Since you've tested all my suggestions and the problem is still
> > >>> > there, I can't think of anything wrong with your configuration. Now I
> > >>> > can only suspect two things:
> > >>> >
> > >>> > 1. You said the problem only happens on the "contents" field, so
> > >>> > maybe there's something wrong with the contents of that field. Does
> > >>> > it contain anything special, e.g. HTML tags or symbols? I recall
> > >>> > SOLR-42 mentions something about HTML stripping causing highlight
> > >>> > problems. Maybe you can try purifying that field to be closer to pure
> > >>> > text and see if the highlighting comes out OK.
> > >>> >
> > >>> > 2. Maybe something is incompatible between JiebaTokenizer and the
> > >>> > Solr highlighter. You could switch to other tokenizers, e.g.
> > >>> > Standard, CJK, SmartChinese (I don't use this since I am dealing with
> > >>> > Traditional Chinese, but I see you are dealing with Simplified
> > >>> > Chinese), or the 3rd-party MMSeg, and see if the problem goes away.
> > >>> > However, when I was googling for similar problems, I saw you asked
> > >>> > the same question in August at Huaban/Jieba-analysis, and somebody
> > >>> > said he also uses JiebaTokenizer but doesn't have your problem. So I
> > >>> > see this as a less likely suspect.
> > >>> >
> > >>> > The theory of your problem could be that something in the indexing
> > >>> > process causes wrong position info for that field, and when Solr does
> > >>> > highlighting, it retrieves the wrong position info and marks the
> > >>> > wrong position of the highlighted target terms.
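> > >>> >
> > >>> > One way to check this theory is to dump the offsets the analysis
> > >>> > chain actually emits and compare them against the raw text (a sketch
> > >>> > using the Lucene 5.x TokenStream API; the class wrapper and the
> > >>> > "content" field name are placeholders):
> > >>> >
> > >>> > import java.io.IOException;
> > >>> > import org.apache.lucene.analysis.Analyzer;
> > >>> > import org.apache.lucene.analysis.TokenStream;
> > >>> > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> > >>> > import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
> > >>> >
> > >>> > public final class OffsetDump {
> > >>> >     // Print each term with the [start, end) offsets the analyzer
> > >>> >     // emits; if text.substring(start, end) doesn't give back the
> > >>> >     // token, the highlighter will mark the wrong span.
> > >>> >     public static void dump(Analyzer analyzer, String text) throws IOException {
> > >>> >         try (TokenStream ts = analyzer.tokenStream("content", text)) {
> > >>> >             CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
> > >>> >             OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
> > >>> >             ts.reset();
> > >>> >             while (ts.incrementToken()) {
> > >>> >                 System.out.println(term + " [" + off.startOffset()
> > >>> >                         + "," + off.endOffset() + ")");
> > >>> >             }
> > >>> >             ts.end();
> > >>> >         }
> > >>> >     }
> > >>> > }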
> > >>> >
> > >>> > Scott Chu, scott....@udngroup.com
> > >>> > 2015/10/23
> > >>> >
> > >>> > ----- Original Message -----
> > >>> > *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > >>> > *To: *solr-user <solr-user@lucene.apache.org>
> > >>> > *Date: *2015-10-22, 22:22:14
> > >>> > *Subject: *Re: Highlighting content field problem when using
> > >>> > JiebaTokenizerFactory
> > >>> >
> > >>> > Hi Scott,
> > >>> >
> > >>> > Thank you for your response and suggestions.
> > >>> >
> > >>> > In response to your questions, here are the answers:
> > >>> >
> > >>> > 1. I took a look at Jieba. It uses a dictionary and it seems to do a
> > >>> > good job on CJK. I suspect this problem may come from those filters
> > >>> > (note: I can understand you may use CJKWidthFilter to convert
> > >>> > Japanese, but I don't understand why you use CJKBigramFilter and
> > >>> > EdgeNGramFilter). Have you tried commenting out those filters, say
> > >>> > leaving only Jieba and StopFilter, to see if this problem disappears?
> > >>> > *A) Yes, I have tried commenting out the other filters, leaving only
> > >>> > Jieba and StopFilter. The problem is still there.*
> > >>> >
> > >>> > 2. Does this problem occur only on Chinese search words? Does it
> > >>> > happen on English search words?
> > >>> > *A) Yes, the same problem occurs on English words. For example, when
> > >>> > I search for "word", it will highlight in this way: <em> wor</em>d*
> > >>> >
> > >>> > 3. To use FastVectorHighlighter, you seem to have to enable 3 term*
> > >>> > parameters in the field declaration; I see only one is enabled.
> > >>> > Please refer to the answer in this stackoverflow question:
> > >>> > http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> > >>> > *A) I have tried enabling all 3 term* parameters for the
> > >>> > FastVectorHighlighter too, but the same problem persists.*
> > >>> >
> > >>> >
> > >>> > Regards,
> > >>> > Edwin
> > >>> >
> > >>> >
> > >>> > On 22 October 2015 at 16:25, Scott Chu <scott....@udngroup.com> wrote:
> > >>> >
> > >>> > > Hi solr-user,
> > >>> > >
> > >>> > > I can't judge the cause from a quick glimpse of your definition,
> > >>> > > but I can give some suggestions:
> > >>> > >
> > >>> > > 1. I took a look at Jieba. It uses a dictionary and it seems to do
> > >>> > > a good job on CJK. I suspect this problem may come from those
> > >>> > > filters (note: I can understand you may use CJKWidthFilter to
> > >>> > > convert Japanese, but I don't understand why you use
> > >>> > > CJKBigramFilter and EdgeNGramFilter). Have you tried commenting out
> > >>> > > those filters, say leaving only Jieba and StopFilter, to see if
> > >>> > > this problem disappears?
> > >>> > >
> > >>> > > 2. Does this problem occur only on Chinese search words? Does it
> > >>> > > happen on English search words?
> > >>> > >
> > >>> > > 3. To use FastVectorHighlighter, you seem to have to enable 3
> > >>> > > term* parameters in the field declaration; I see only one is
> > >>> > > enabled. Please refer to the answer in this stackoverflow question:
> > >>> > > http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> > >>> > >
> > >>> > >
> > >>> > > Scott Chu, scott....@udngroup.com
> > >>> > > 2015/10/22
> > >>> > >
> > >>> > > ----- Original Message -----
> > >>> > > *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > >>> > > *To: *solr-user <solr-user@lucene.apache.org>
> > >>> > > *Date: *2015-10-20, 12:04:11
> > >>> > > *Subject: *Re: Highlighting content field problem when using
> > >>> > > JiebaTokenizerFactory
> > >>> > >
> > >>> > > Hi Scott,
> > >>> > >
> > >>> > > Here's my schema.xml for content and title, which use
> > >>> > > text_chinese. The problem only occurs in content, and not in title.
> > >>> > >
> > >>> > > <field name="content" type="text_chinese" indexed="true"
> > >>> > >        stored="true" omitNorms="true" termVectors="true"/>
> > >>> > > <field name="title" type="text_chinese" indexed="true"
> > >>> > >        stored="true" omitNorms="true" termVectors="true"/>
> > >>> > >
> > >>> > >
> > >>> > > <fieldType name="text_chinese" class="solr.TextField"
> > >>> > >            positionIncrementGap="100">
> > >>> > >   <analyzer type="index">
> > >>> > >     <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > >>> > >                segMode="SEARCH"/>
> > >>> > >     <filter class="solr.CJKWidthFilterFactory"/>
> > >>> > >     <filter class="solr.CJKBigramFilterFactory"/>
> > >>> > >     <filter class="solr.StopFilterFactory"
> > >>> > >             words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > >>> > >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> > >>> > >             maxGramSize="15"/>
> > >>> > >     <filter class="solr.PorterStemFilterFactory"/>
> > >>> > >   </analyzer>
> > >>> > >   <analyzer type="query">
> > >>> > >     <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > >>> > >                segMode="SEARCH"/>
> > >>> > >     <filter class="solr.CJKWidthFilterFactory"/>
> > >>> > >     <filter class="solr.CJKBigramFilterFactory"/>
> > >>> > >     <filter class="solr.StopFilterFactory"
> > >>> > >             words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > >>> > >     <filter class="solr.PorterStemFilterFactory"/>
> > >>> > >   </analyzer>
> > >>> > > </fieldType>
> > >>> > >
> > >>> > >
> > >>> > > Here's my solrconfig.xml on the highlighting portion:
> > >>> > >
> > >>> > > <requestHandler name="/highlight" class="solr.SearchHandler">
> > >>> > >   <lst name="defaults">
> > >>> > >     <str name="echoParams">explicit</str>
> > >>> > >     <int name="rows">10</int>
> > >>> > >     <str name="wt">json</str>
> > >>> > >     <str name="indent">true</str>
> > >>> > >     <str name="df">text</str>
> > >>> > >     <str name="fl">id, title, content_type, last_modified, url, score</str>
> > >>> > >     <str name="hl">on</str>
> > >>> > >     <str name="hl.fl">id, title, content, author, tag</str>
> > >>> > >     <str name="hl.highlightMultiTerm">true</str>
> > >>> > >     <str name="hl.preserveMulti">true</str>
> > >>> > >     <str name="hl.encoder">html</str>
> > >>> > >     <str name="hl.fragsize">200</str>
> > >>> > >     <str name="group">true</str>
> > >>> > >     <str name="group.field">signature</str>
> > >>> > >     <str name="group.main">true</str>
> > >>> > >     <str name="group.cache.percent">100</str>
> > >>> > >   </lst>
> > >>> > > </requestHandler>
> > >>> > >
> > >>> > > <boundaryScanner name="breakIterator"
> > >>> > >                  class="solr.highlight.BreakIteratorBoundaryScanner">
> > >>> > >   <lst name="defaults">
> > >>> > >     <str name="hl.bs.type">WORD</str>
> > >>> > >     <str name="hl.bs.language">en</str>
> > >>> > >     <str name="hl.bs.country">SG</str>
> > >>> > >   </lst>
> > >>> > > </boundaryScanner>
> > >>> > >
> > >>> > >
> > >>> > > Meanwhile, I'll take a look at the articles too.
> > >>> > >
> > >>> > > Thank you.
> > >>> > >
> > >>> > > Regards,
> > >>> > > Edwin
> > >>> > >
> > >>> > >
> > >>> > > On 20 October 2015 at 11:32, Scott Chu <scott....@udngroup.com> wrote:
> > >>> > >
> > >>> > > > Hi Edwin,
> > >>> > > >
> > >>> > > > I haven't used Jieba on Chinese (I use only CJK, very
> > >>> > > > fundamental, I know), so I haven't experienced this problem.
> > >>> > > >
> > >>> > > > I'd suggest you post your schema.xml so we can see how you define
> > >>> > > > your content field and the field type it uses.
> > >>> > > >
> > >>> > > > In the meantime, refer to these articles; maybe the answer or a
> > >>> > > > workaround can be deduced from them.
> > >>> > > >
> > >>> > > > https://issues.apache.org/jira/browse/SOLR-3390
> > >>> > > >
> > >>> > > > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
> > >>> > > >
> > >>> > > > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> > >>> > > >
> > >>> > > > Good luck!
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > > Scott Chu, scott....@udngroup.com
> > >>> > > > 2015/10/20
> > >>> > > >
> > >>> > > > ----- Original Message -----
> > >>> > > > *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > >>> > > > *To: *solr-user <solr-user@lucene.apache.org>
> > >>> > > > *Date: *2015-10-13, 17:04:29
> > >>> > > > *Subject: *Highlighting content field problem when using
> > >>> > > > JiebaTokenizerFactory
> > >>> > > >
> > >>> > > > Hi,
> > >>> > > >
> > >>> > > > I'm trying to use the JiebaTokenizerFactory to index Chinese
> > >>> > > > characters in Solr. The segmentation works fine when I use the
> > >>> > > > Analysis function on the Solr Admin UI.
> > >>> > > >
> > >>> > > > However, when I try to do the highlighting in Solr, it does not
> > >>> > > > highlight in the correct place. For example, when I search for
> > >>> > > > 自然環境与企業本身, it highlights
> > >>> > > > 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
> > >>> > > >
> > >>> > > > Even when I search for an English word like responsibility, it
> > >>> > > > highlights <em> responsibilit</em>y.
> > >>> > > >
> > >>> > > > Basically, the highlighting goes off by 1 character/space
> > >>> > > > consistently.
> > >>> > > >
> > >>> > > > This problem only happens in the content field, and not in any
> > >>> > > > other fields. Does anyone know what could be causing the issue?
> > >>> > > >
> > >>> > > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> > >>> > > >
> > >>> > > >
> > >>> > > > Regards,
> > >>> > > > Edwin
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> >
> > >>> >
> > >>> >
> > >>> >
> > >>> >
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > >
> >
>
>
>
> --
> Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
> | 434.409.2780
> http://www.opensourceconnections.com
>
