I've made some minor modifications to the code in JiebaSegmenter.java, and the highlighting seems to be fine now.
Basically, I created another int called offset2 in the process() method (int offset2 = 0;), and then changed offset to offset2 in this part of process():

    if (sb.length() > 0)
        if (mode == SegMode.SEARCH) {
            for (Word token : sentenceProcess(sb.toString())) {
                // tokens.add(new SegToken(token, offset, offset += token.length()));
                tokens.add(new SegToken(token, offset2, offset2 += token.length())); // Changed to offset2 by Edwin
            }
        } else {
            for (Word token : sentenceProcess(sb.toString())) {
                if (token.length() > 2) {
                    Word gram2;
                    int j = 0;
                    for (; j < token.length() - 1; ++j) {
                        gram2 = token.subSequence(j, j + 2);
                        if (wordDict.containsWord(gram2.getToken()))
                            // tokens.add(new SegToken(gram2, offset + j, offset + j + 2));
                            tokens.add(new SegToken(gram2, offset2 + j, offset2 + j + 2)); // Changed to offset2 by Edwin
                    }
                }
                if (token.length() > 3) {
                    Word gram3;
                    int j = 0;
                    for (; j < token.length() - 2; ++j) {
                        gram3 = token.subSequence(j, j + 3);
                        if (wordDict.containsWord(gram3.getToken()))
                            // tokens.add(new SegToken(gram3, offset + j, offset + j + 3));
                            tokens.add(new SegToken(gram3, offset2 + j, offset2 + j + 3)); // Changed to offset2 by Edwin
                    }
                }
                // tokens.add(new SegToken(token, offset, offset += token.length()));
                tokens.add(new SegToken(token, offset2, offset2 += token.length())); // Changed to offset2 by Edwin
            }
        }

Not sure if this is just a workaround, or whether it can be used as a permanent solution.
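To check whether the offsets actually line up with the input text, I've been using a rough harness along the lines of the sketch below. It is only a sketch: it assumes the com.huaban.analysis.jieba package layout of jieba-analysis and the public startOffset/endOffset fields shown in SegToken.java, so treat the exact API details as assumptions. It prints the substring each token claims to cover; a correct segmenter should print back exactly the token's own text.

    import com.huaban.analysis.jieba.JiebaSegmenter;
    import com.huaban.analysis.jieba.JiebaSegmenter.SegMode;
    import com.huaban.analysis.jieba.SegToken;

    public class OffsetCheck {
        public static void main(String[] args) {
            // Sample texts taken from the examples discussed in this thread.
            String[] samples = { "water", "自然環境与企業本身" };
            JiebaSegmenter segmenter = new JiebaSegmenter();
            for (String text : samples) {
                for (SegToken token : segmenter.process(text, SegMode.SEARCH)) {
                    // If the offsets are correct, the substring printed here is
                    // exactly the token's text; an off-by-one shows up as a
                    // shifted substring or an exception from substring().
                    System.out.printf("[%d,%d) -> \"%s\"%n",
                            token.startOffset, token.endOffset,
                            text.substring(token.startOffset, token.endOffset));
                }
            }
        }
    }

If an endOffset runs past the end of the input, the substring call throws StringIndexOutOfBoundsException, which is the same class of problem the highlighter reports as InvalidTokenOffsetsException.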
Regards,
Edwin

On 28 October 2015 at 15:29, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi Scott,
>
> I have tried to edit the SegToken.java file in the jieba-analysis-1.0.0
> package with a +1 at both the startOffset and endOffset values (see code
> below), and now the <em> tag in the content is shifted to the correct
> place. However, this means that in the title and other fields, where the
> <em> tag is originally at the correct place, they will get the
> "org.apache.lucene.search.highlight.InvalidTokenOffsetsException"
> exception. I have temporarily used another tokenizer for the other
> fields first.
>
>     public SegToken(Word word, int startOffset, int endOffset) {
>         this.word = word;
>         this.startOffset = startOffset + 1;
>         this.endOffset = endOffset + 1;
>     }
>
> However, I don't think this can be a permanent solution, so I'm trying
> to zoom in further on the code, to see what the difference is between
> the content field and the other fields.
>
> I have also found that although JiebaTokenizer works better for Chinese
> characters, it doesn't work well for English. For example, if I search
> for "water", JiebaTokenizer cuts it as follows: w|at|er. It can't cut it
> as a full word, which HMMChineseTokenizer is able to do.
>
> Here's my configuration in schema.xml:
>
>     <fieldType name="text_chinese2" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <filter class="solr.CJKBigramFilterFactory"/>
>         <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>         <filter class="solr.PorterStemFilterFactory"/>
>         <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <filter class="solr.CJKBigramFilterFactory"/>
>         <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>         <filter class="solr.PorterStemFilterFactory"/>
>       </analyzer>
>     </fieldType>
>
> Does anyone know if JiebaTokenizer is optimised to handle English
> characters as well?
>
> Regards,
> Edwin
>
>
> On 27 October 2015 at 15:57, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
>> Hi Scott,
>>
>> Thank you for providing the links and references. I will look through
>> them, and let you know if I find any solution or workaround.
>>
>> Regards,
>> Edwin
>>
>>
>> On 27 October 2015 at 11:13, Scott Chu <scott....@udngroup.com> wrote:
>>
>>> Take a look at Michael's two articles; they might help you clarify
>>> how highlighting works in Solr:
>>>
>>> Changing Bits: Lucene's TokenStreams are actually graphs!
>>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>>
>>> Also take a look at the 4th paragraph of another of his articles:
>>>
>>> Changing Bits: A new Lucene highlighter is born
>>> http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
>>>
>>> Currently, I can't figure out the possible cause of your problem
>>> unless I get some spare time to test it on my own, which I don't have
>>> these days (I've got some projects to close)!
>>>
>>> If you find a solution or workaround, please let us know. Good luck
>>> again!
>>>
>>> Scott Chu, scott....@udngroup.com
>>> 2015/10/27
>>>
>>> ----- Original Message -----
>>> From: Scott Chu <scott....@udngroup.com>
>>> To: solr-user <solr-user@lucene.apache.org>
>>> Date: 2015-10-27, 10:27:45
>>> Subject: Re: Highlighting content field problem when using
>>> JiebaTokenizerFactory
>>>
>>> Hi Edwin,
>>>
>>> It took a lot of time to see whether there's anything that can help
>>> you pin down the cause of your problem. Maybe this might help you a
>>> bit:
>>>
>>> [SOLR-4722] Highlighter which generates a list of query term
>>> position(s) for each item in a list of documents, or returns null if
>>> highlighting is disabled. - AS...
>>> https://issues.apache.org/jira/browse/SOLR-4722
>>>
>>> This one is modified from FastVectorHighlighter, so ensure those 3
>>> term* attributes are on.
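(Interjecting here for anyone following the thread: my reading of Scott's "those 3 term* attributes" is the termVectors, termPositions and termOffsets flags on the field declaration. A minimal sketch using my own field and type names; the exact attribute set is my assumption, as Scott didn't spell it out:)

    <field name="content" type="text_chinese" indexed="true" stored="true"
           termVectors="true" termPositions="true" termOffsets="true"/>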
>>>
>>> Scott Chu, scott....@udngroup.com
>>> 2015/10/27
>>>
>>> ----- Original Message -----
>>> From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>> To: solr-user <solr-user@lucene.apache.org>
>>> Date: 2015-10-23, 10:42:32
>>> Subject: Re: Highlighting content field problem when using
>>> JiebaTokenizerFactory
>>>
>>> Hi Scott,
>>>
>>> Thank you for your response.
>>>
>>> 1. You said the problem only happens on the "contents" field, so maybe
>>> there's something wrong with the contents of that field. Does it
>>> contain anything special, e.g. HTML tags or symbols? I recall SOLR-42
>>> mentions something about HTML stripping causing a highlight problem.
>>> Maybe you can try purifying that field until it is close to pure text
>>> and see if the highlighting comes out OK.
>>> *A) I checked, and SOLR-42 is about HTMLStripWhiteSpaceTokenizerFactory,
>>> which I'm not using. I believe that tokenizer is already deprecated
>>> too. I've tried all kinds of content for rich-text documents, and all
>>> of them have the same problem.*
>>>
>>> 2. Maybe something is incompatible between JiebaTokenizer and the Solr
>>> highlighter. You could switch to other tokenizers, e.g. Standard, CJK,
>>> SmartChinese (I don't use this since I am dealing with Traditional
>>> Chinese, but I see you are dealing with Simplified Chinese), or the
>>> 3rd-party MMSeg, and see if the problem goes away. However, when I
>>> googled for similar problems, I saw you asked the same question in
>>> August at Huaban/Jieba-analysis, and somebody said he also uses
>>> JiebaTokenizer but doesn't have your problem. So I see this as a less
>>> likely suspect.
>>> *A) I was thinking about the incompatibility issue too, as I previously
>>> thought that JiebaTokenizer was optimised for Solr 4.x, so it might
>>> have issues in 5.x. But the person from Huaban/Jieba-analysis said
>>> that he doesn't have this problem in Solr 5.1. I also faced the same
>>> problem in Solr 5.1, and although I'm using Solr 5.3.0 now, the same
>>> problem persists.*
>>>
>>> I'm looking at the indexing process too, to see if there's any problem
>>> there. But I just can't figure out why it only happens with
>>> JiebaTokenizer, and only for the content field.
>>>
>>>
>>> Regards,
>>> Edwin
>>>
>>>
>>> On 23 October 2015 at 09:41, Scott Chu <scott....@udngroup.com> wrote:
>>>
>>> > Hi Edwin,
>>> >
>>> > Since you've tested all my suggestions and the problem is still
>>> > there, I can't think of anything wrong with your configuration. Now
>>> > I can only suspect two things:
>>> >
>>> > 1. You said the problem only happens on the "contents" field, so
>>> > maybe there's something wrong with the contents of that field. Does
>>> > it contain anything special, e.g. HTML tags or symbols? I recall
>>> > SOLR-42 mentions something about HTML stripping causing a highlight
>>> > problem. Maybe you can try purifying that field until it is close
>>> > to pure text and see if the highlighting comes out OK.
>>> >
>>> > 2. Maybe something is incompatible between JiebaTokenizer and the
>>> > Solr highlighter. You could switch to other tokenizers, e.g.
>>> > Standard, CJK, SmartChinese (I don't use this since I am dealing
>>> > with Traditional Chinese, but I see you are dealing with Simplified
>>> > Chinese), or the 3rd-party MMSeg, and see if the problem goes away.
>>> > However, when I googled for similar problems, I saw you asked the
>>> > same question in August at Huaban/Jieba-analysis, and somebody said
>>> > he also uses JiebaTokenizer but doesn't have your problem. So I see
>>> > this as a less likely suspect.
>>> >
>>> > The theory of your problem could be: something in the indexing
>>> > process writes wrong position info for that field, and when Solr
>>> > does the highlighting, it retrieves that wrong position info and
>>> > marks the highlight target terms in the wrong place.
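(Interjecting on my own thread: Scott's theory above is easy to reproduce outside Solr. Below is a self-contained sketch against the plain Lucene 5.x highlighter API, with lucene-core, lucene-queryparser and lucene-highlighter on the classpath. The ShiftOffsets filter is artificial and simply moves every token's offsets by a fixed delta. With delta -1 the <em> mark lands one character early, which matches my content-field symptom, and with delta +1 the last token ends past the end of the text and the highlighter throws the same InvalidTokenOffsetsException I hit with the SegToken +1 hack:)

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
    import org.apache.lucene.search.highlight.QueryScorer;
    import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

    public class ShiftedOffsetDemo {

        /** Artificially shifts every token's start/end offset by a fixed delta. */
        static final class ShiftOffsets extends TokenFilter {
            private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
            private final int delta;

            ShiftOffsets(TokenStream in, int delta) {
                super(in);
                this.delta = delta;
            }

            @Override
            public boolean incrementToken() throws IOException {
                if (!input.incrementToken()) {
                    return false;
                }
                offsetAtt.setOffset(offsetAtt.startOffset() + delta,
                                    offsetAtt.endOffset() + delta);
                return true;
            }
        }

        public static void main(String[] args) throws Exception {
            // The leading space keeps all offsets non-negative after the -1 shift.
            String text = " fresh water supply";
            Analyzer analyzer = new StandardAnalyzer();
            Query query = new QueryParser("content", analyzer).parse("water");
            Highlighter highlighter = new Highlighter(
                    new SimpleHTMLFormatter("<em>", "</em>"), new QueryScorer(query));

            // Correct offsets: " fresh <em>water</em> supply".
            System.out.println(highlighter.getBestFragment(
                    analyzer.tokenStream("content", text), text));

            // Offsets one too small: " fresh<em> wate</em>r supply",
            // i.e. the mark starts one character early.
            System.out.println(highlighter.getBestFragment(
                    new ShiftOffsets(analyzer.tokenStream("content", text), -1), text));

            // Offsets one too large: the last token now ends past the end of
            // the text, and the highlighter refuses to mark it.
            try {
                highlighter.getBestFragment(
                        new ShiftOffsets(analyzer.tokenStream("content", text), 1), text);
            } catch (InvalidTokenOffsetsException e) {
                System.out.println("caught: " + e);
            }
        }
    }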
>>> >
>>> > Scott Chu, scott....@udngroup.com
>>> > 2015/10/23
>>> >
>>> > ----- Original Message -----
>>> > From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>> > To: solr-user <solr-user@lucene.apache.org>
>>> > Date: 2015-10-22, 22:22:14
>>> > Subject: Re: Highlighting content field problem when using
>>> > JiebaTokenizerFactory
>>> >
>>> > Hi Scott,
>>> >
>>> > Thank you for your response and suggestions.
>>> >
>>> > In response to your questions, here are the answers:
>>> >
>>> > 1. I took a look at Jieba. It uses a dictionary and it seems to do
>>> > a good job on CJK. I suspect this problem may come from those
>>> > filters (note: I can understand that you may use CJKWidthFilter to
>>> > convert Japanese, but I don't understand why you use CJKBigramFilter
>>> > and EdgeNGramFilter). Have you tried commenting out those filters,
>>> > say leaving only Jieba and StopFilter, and seeing if this problem
>>> > disappears?
>>> > *A) Yes, I have tried commenting out the other filters, leaving only
>>> > Jieba and StopFilter. The problem is still there.*
>>> >
>>> > 2. Does this problem occur only on Chinese search words? Does it
>>> > happen on English search words?
>>> > *A) Yes, the same problem occurs on English words. For example, when
>>> > I search for "word", it will highlight in this way: <em> wor</em>d*
>>> >
>>> > 3. To use FastVectorHighlighter, you seem to have to enable the 3
>>> > term* parameters in the field declaration? I see only one is
>>> > enabled.
>>> > Please refer to the answer in this stackoverflow question:
>>> > http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
>>> > *A) I have tried enabling all 3 term* attributes for the
>>> > FastVectorHighlighter too, but the same problem persists as well.*
>>> >
>>> >
>>> > Regards,
>>> > Edwin
>>> >
>>> >
>>> > On 22 October 2015 at 16:25, Scott Chu <scott....@udngroup.com> wrote:
>>> >
>>> > > Hi solr-user,
>>> > >
>>> > > I can't judge the cause from a quick glimpse of your definition,
>>> > > but here are some suggestions I can give:
>>> > >
>>> > > 1. I took a look at Jieba. It uses a dictionary and it seems to do
>>> > > a good job on CJK. I suspect this problem may come from those
>>> > > filters (note: I can understand that you may use CJKWidthFilter to
>>> > > convert Japanese, but I don't understand why you use
>>> > > CJKBigramFilter and EdgeNGramFilter). Have you tried commenting
>>> > > out those filters, say leaving only Jieba and StopFilter, and
>>> > > seeing if this problem disappears?
>>> > >
>>> > > 2. Does this problem occur only on Chinese search words? Does it
>>> > > happen on English search words?
>>> > >
>>> > > 3. To use FastVectorHighlighter, you seem to have to enable the 3
>>> > > term* parameters in the field declaration? I see only one is
>>> > > enabled. Please refer to the answer in this stackoverflow question:
>>> > > http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
>>> > >
>>> > >
>>> > > Scott Chu, scott....@udngroup.com
>>> > > 2015/10/22
>>> > >
>>> > > ----- Original Message -----
>>> > > From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>> > > To: solr-user <solr-user@lucene.apache.org>
>>> > > Date: 2015-10-20, 12:04:11
>>> > > Subject: Re: Highlighting content field problem when using
>>> > > JiebaTokenizerFactory
>>> > >
>>> > > Hi Scott,
>>> > >
>>> > > Here's my schema.xml for content and title, which both use
>>> > > text_chinese. The problem only occurs in content, not in title.
>>> > >
>>> > >     <field name="content" type="text_chinese" indexed="true" stored="true"
>>> > >            omitNorms="true" termVectors="true"/>
>>> > >     <field name="title" type="text_chinese" indexed="true" stored="true"
>>> > >            omitNorms="true" termVectors="true"/>
>>> > >
>>> > >     <fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100">
>>> > >       <analyzer type="index">
>>> > >         <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
>>> > >         <filter class="solr.CJKWidthFilterFactory"/>
>>> > >         <filter class="solr.CJKBigramFilterFactory"/>
>>> > >         <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>>> > >         <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
>>> > >         <filter class="solr.PorterStemFilterFactory"/>
>>> > >       </analyzer>
>>> > >       <analyzer type="query">
>>> > >         <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
>>> > >         <filter class="solr.CJKWidthFilterFactory"/>
>>> > >         <filter class="solr.CJKBigramFilterFactory"/>
>>> > >         <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>>> > >         <filter class="solr.PorterStemFilterFactory"/>
>>> > >       </analyzer>
>>> > >     </fieldType>
>>> > >
>>> > > Here's the highlighting portion of my solrconfig.xml:
>>> > >
>>> > >     <requestHandler name="/highlight" class="solr.SearchHandler">
>>> > >       <lst name="defaults">
>>> > >         <str name="echoParams">explicit</str>
>>> > >         <int name="rows">10</int>
>>> > >         <str name="wt">json</str>
>>> > >         <str name="indent">true</str>
>>> > >         <str name="df">text</str>
>>> > >         <str name="fl">id, title, content_type, last_modified, url, score</str>
>>> > >
>>> > >         <str name="hl">on</str>
>>> > >         <str name="hl.fl">id, title, content, author, tag</str>
>>> > >         <str name="hl.highlightMultiTerm">true</str>
>>> > >         <str name="hl.preserveMulti">true</str>
>>> > >         <str name="hl.encoder">html</str>
>>> > >         <str name="hl.fragsize">200</str>
>>> > >
>>> > >         <str name="group">true</str>
>>> > >         <str name="group.field">signature</str>
>>> > >         <str name="group.main">true</str>
>>> > >         <str name="group.cache.percent">100</str>
>>> > >       </lst>
>>> > >     </requestHandler>
>>> > >
>>> > >     <boundaryScanner name="breakIterator" class="solr.highlight.BreakIteratorBoundaryScanner">
>>> > >       <lst name="defaults">
>>> > >         <str name="hl.bs.type">WORD</str>
>>> > >         <str name="hl.bs.language">en</str>
>>> > >         <str name="hl.bs.country">SG</str>
>>> > >       </lst>
>>> > >     </boundaryScanner>
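(A quick aside for anyone reproducing this: a request against the handler above looks something like the line below. The parameters are standard Solr; "collection1" is a placeholder core name, and highlighting and grouping are already switched on by the defaults shown:)

    http://localhost:8983/solr/collection1/highlight?q=content:water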
>>> > >
>>> > > Meanwhile, I'll take a look at the articles too.
>>> > >
>>> > > Thank you.
>>> > >
>>> > > Regards,
>>> > > Edwin
>>> > >
>>> > >
>>> > > On 20 October 2015 at 11:32, Scott Chu <scott....@udngroup.com> wrote:
>>> > >
>>> > > > Hi Edwin,
>>> > > >
>>> > > > I didn't use Jieba on Chinese (I use only CJK, which is very
>>> > > > fundamental, I know), so I didn't experience this problem.
>>> > > >
>>> > > > I'd suggest you post your schema.xml so we can see how you
>>> > > > define your content field and the field type it uses.
>>> > > >
>>> > > > In the meantime, refer to these articles; maybe the answer or a
>>> > > > workaround can be deduced from them.
>>> > > >
>>> > > > https://issues.apache.org/jira/browse/SOLR-3390
>>> > > >
>>> > > > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
>>> > > >
>>> > > > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
>>> > > >
>>> > > > Good luck!
>>> > > >
>>> > > >
>>> > > > Scott Chu, scott....@udngroup.com
>>> > > > 2015/10/20
>>> > > >
>>> > > > ----- Original Message -----
>>> > > > From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>> > > > To: solr-user <solr-user@lucene.apache.org>
>>> > > > Date: 2015-10-13, 17:04:29
>>> > > > Subject: Highlighting content field problem when using
>>> > > > JiebaTokenizerFactory
>>> > > >
>>> > > > Hi,
>>> > > >
>>> > > > I'm trying to use the JiebaTokenizerFactory to index Chinese
>>> > > > characters in Solr. The segmentation works fine when I'm using
>>> > > > the Analysis function on the Solr Admin UI.
>>> > > >
>>> > > > However, when I try to do the highlighting in Solr, it is not
>>> > > > highlighting in the correct place. For example, when I search
>>> > > > for 自然環境与企業本身, it highlights
>>> > > > 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
>>> > > >
>>> > > > Even when I search for an English word like responsibility, it
>>> > > > highlights <em> responsibilit</em>y.
>>> > > >
>>> > > > Basically, the highlighting goes off by 1 character/space
>>> > > > consistently.
>>> > > >
>>> > > > This problem only happens in the content field, and not in any
>>> > > > other fields. Does anyone know what could be causing the issue?
>>> > > >
>>> > > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
>>> > > >
>>> > > > Regards,
>>> > > > Edwin