Hi Scott,

I have tried editing the SegToken.java file in the jieba-analysis-1.0.0 package to add 1 to both the startOffset and endOffset values (see the code below), and with that change the <em> tags in the content field are now placed correctly. However, it also means that in the title and other fields, where the <em> tags were originally in the correct place, I now get an "org.apache.lucene.search.highlight.InvalidTokenOffsetsException". For now I have temporarily switched those other fields to a different tokenizer.

public SegToken(Word word, int startOffset, int endOffset) {
    this.word = word;
    // Temporary workaround: shift both offsets by one so that highlighting
    // on the content field lands on the right characters.
    this.startOffset = startOffset + 1;
    this.endOffset = endOffset + 1;
}

However, I don't think this can be a permanent solution, so I'm trying to zoom in further into the code to see what is different between the content field and the other fields.
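In the meantime, to check whether the offsets are already off when they leave the tokenizer (before the highlighter or any filters are involved), I have been using a small standalone check along the lines of the sketch below. This is only a rough sketch assuming Lucene 5.3 on the classpath; I use HMMChineseTokenizer here purely because it ships with Lucene and has a public no-arg constructor, and the Jieba tokenizer can be swapped in (however it is constructed in your setup) to compare the two:

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class OffsetCheck {
    public static void main(String[] args) throws Exception {
        String text = "This is about water and the natural environment";
        // Swap in the Jieba tokenizer here to compare its offsets.
        Tokenizer tok = new HMMChineseTokenizer();
        tok.setReader(new StringReader(text));
        CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = tok.addAttribute(OffsetAttribute.class);
        tok.reset();
        while (tok.incrementToken()) {
            // For a plain tokenizer (no stemming or other rewriting), the slice of
            // the original text covered by [startOffset, endOffset) should equal the term.
            String slice = text.substring(offset.startOffset(), offset.endOffset());
            System.out.println(term + " [" + offset.startOffset() + "," + offset.endOffset()
                    + "] -> \"" + slice + "\""
                    + (slice.contentEquals(term) ? "" : "   <-- offsets do not cover the term"));
        }
        tok.end();
        tok.close();
    }
}

If the slice and the term disagree by one position for every token coming out of the Jieba tokenizer, but not for HMMChineseTokenizer, that would pin the off-by-one on the tokenizer itself rather than on the highlighter or the filter chain.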
I have also found that although JiebaTokenizer works better for Chinese characters, it doesn't work well for English. For example, if I search for "water", JiebaTokenizer cuts it as follows:

w | at | er

It can't keep it as a full word, which HMMChineseTokenizer is able to do. Here's my configuration in schema.xml:

<fieldType name="text_chinese2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Does anyone know if JiebaTokenizer is optimised to handle English characters as well? (As a possible workaround for the English handling, see the P.S. below the quoted messages.)

Regards,
Edwin

On 27 October 2015 at 15:57, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi Scott,
>
> Thank you for providing the links and references. Will look through them, and let you know if I find any solutions or workarounds.
>
> Regards,
> Edwin
>
> On 27 October 2015 at 11:13, Scott Chu <scott....@udngroup.com> wrote:
>
>> Take a look at Michael's two articles; they might help you clarify the idea of highlighting in Solr:
>>
>> Changing Bits: Lucene's TokenStreams are actually graphs!
>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>
>> Also take a look at the 4th paragraph in another of his articles:
>>
>> Changing Bits: A new Lucene highlighter is born
>> http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
>>
>> Currently, I can't figure out the possible cause of your problem unless I get spare time to test it on my own, which is not available these days (got some projects to close)!
>>
>> If you find the solution or a workaround, please let us know. Good luck again!
>>
>> Scott Chu, scott....@udngroup.com
>> 2015/10/27
>>
>> ----- Original Message -----
>> From: Scott Chu <scott....@udngroup.com>
>> To: solr-user <solr-user@lucene.apache.org>
>> Date: 2015-10-27, 10:27:45
>> Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory
>>
>> Hi Edwin,
>>
>> It took a lot of time to see if there's anything that can help you pin down the cause of your problem. Maybe this might help you a bit:
>>
>> [SOLR-4722] Highlighter which generates a list of query term position(s) for each item in a list of documents, or returns null if highlighting is disabled.
>> https://issues.apache.org/jira/browse/SOLR-4722
>>
>> This one is modified from FastVectorHighlighter, so ensure those 3 term* attributes are on.
>>
>> Scott Chu, scott....@udngroup.com
>> 2015/10/27
>>
>> ----- Original Message -----
>> From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> To: solr-user <solr-user@lucene.apache.org>
>> Date: 2015-10-23, 10:42:32
>> Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory
>>
>> Hi Scott,
>>
>> Thank you for your response.
>>
>> 1. You said the problem only happens on the "contents" field, so maybe there's something wrong with the contents of that field. Does it contain anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions something about HTML stripping causing highlight problems. Maybe you can try reducing that field to something close to pure text and see if the highlighting comes out OK.
>> A) I checked that SOLR-42 is about the HTMLStripWhiteSpaceTokenizerFactory, which I'm not using. I believe that tokenizer is already deprecated too. I've tried with all kinds of content from rich-text documents, and all of them have the same problem.
>>
>> 2. Maybe something is incompatible between JiebaTokenizer and the Solr highlighter. You could switch to other tokenizers, e.g. Standard, CJK, SmartChinese (I don't use this since I am dealing with Traditional Chinese, but I see you are dealing with Simplified Chinese), or the 3rd-party MMSeg, and see if the problem goes away. However, when I was googling for similar problems, I saw you asked the same question in August at Huaban/Jieba-analysis, and somebody said he also uses JiebaTokenizer but doesn't have your problem. So I see this as less suspect.
>> A) I was thinking about the incompatibility issue too, as I previously thought that JiebaTokenizer was optimised for Solr 4.x, so it might have issues in 5.x. But the person from Huaban/Jieba-analysis said that he doesn't have this problem in Solr 5.1. I also face the same problem in Solr 5.1, and although I'm using Solr 5.3.0 now, the same problem persists.
>>
>> I'm looking at the indexing process too, to see if there's any problem there. But I just can't figure out why it only happens with JiebaTokenizer, and only for the content field.
>>
>> Regards,
>> Edwin
>>
>> On 23 October 2015 at 09:41, Scott Chu <scott....@udngroup.com> wrote:
>>
>> > Hi Edwin,
>> >
>> > Since you've tested all my suggestions and the problem is still there, I can't think of anything wrong with your configuration. Now I can only suspect two things:
>> >
>> > 1. You said the problem only happens on the "contents" field, so maybe there's something wrong with the contents of that field. Does it contain anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions something about HTML stripping causing highlight problems. Maybe you can try reducing that field to something close to pure text and see if the highlighting comes out OK.
>> >
>> > 2. Maybe something is incompatible between JiebaTokenizer and the Solr highlighter. You could switch to other tokenizers, e.g. Standard, CJK, SmartChinese (I don't use this since I am dealing with Traditional Chinese, but I see you are dealing with Simplified Chinese), or the 3rd-party MMSeg, and see if the problem goes away. However, when I was googling for similar problems, I saw you asked the same question in August at Huaban/Jieba-analysis, and somebody said he also uses JiebaTokenizer but doesn't have your problem. So I see this as less suspect.
>> >
>> > The theory of your problem could be that something in the indexing process produces wrong position info for that field, and when Solr does the highlighting, it retrieves that wrong position info and marks the highlight target terms at the wrong positions.
>> >
>> > Scott Chu, scott....@udngroup.com
>> > 2015/10/23
>> >
>> > ----- Original Message -----
>> > From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> > To: solr-user <solr-user@lucene.apache.org>
>> > Date: 2015-10-22, 22:22:14
>> > Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory
>> >
>> > Hi Scott,
>> >
>> > Thank you for your response and suggestions.
>> >
>> > With regard to your questions, here are the answers:
>> >
>> > 1. I took a look at Jieba. It uses a dictionary and it seems to do a good job on CJK. I suspect this problem may come from those filters (note: I can understand you may use CJKWidthFilter to convert Japanese, but I don't understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried commenting out those filters, say leaving only Jieba and StopFilter, and seeing if the problem disappears?
>> > A) Yes, I have tried commenting out the other filters and leaving only Jieba and StopFilter. The problem is still there.
>> >
>> > 2. Does this problem occur only on Chinese search words? Does it happen on English search words?
>> > A) Yes, the same problem occurs on English words. For example, when I search for "word", it will highlight in this way: <em> wor<em>d
>> >
>> > 3. To use FastVectorHighlighter, you seem to have to enable the 3 term* parameters in the field declaration? I see only one is enabled. Please refer to the answer in this Stack Overflow question:
>> > http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
>> > A) I have tried enabling all 3 term* attributes for the FastVectorHighlighter too, but the same problem persists as well.
>> >
>> > Regards,
>> > Edwin
>> >
>> > On 22 October 2015 at 16:25, Scott Chu <scott....@udngroup.com> wrote:
>> >
>> > > Hi solr-user,
>> > >
>> > > I can't judge the cause from a quick glance at your definition, but I can give some suggestions:
>> > >
>> > > 1. I took a look at Jieba. It uses a dictionary and it seems to do a good job on CJK. I suspect this problem may come from those filters (note: I can understand you may use CJKWidthFilter to convert Japanese, but I don't understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried commenting out those filters, say leaving only Jieba and StopFilter, and seeing if the problem disappears?
>> > >
>> > > 2. Does this problem occur only on Chinese search words? Does it happen on English search words?
>> > >
>> > > 3. To use FastVectorHighlighter, you seem to have to enable the 3 term* parameters in the field declaration? I see only one is enabled.
>> > > Please refer to the answer in this Stack Overflow question:
>> > > http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
>> > >
>> > > Scott Chu, scott....@udngroup.com
>> > > 2015/10/22
>> > >
>> > > ----- Original Message -----
>> > > From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> > > To: solr-user <solr-user@lucene.apache.org>
>> > > Date: 2015-10-20, 12:04:11
>> > > Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory
>> > >
>> > > Hi Scott,
>> > >
>> > > Here's my schema.xml for content and title, which use text_chinese. The problem only occurs in content, and not in title.
>> > >
>> > > <field name="content" type="text_chinese" indexed="true" stored="true" omitNorms="true" termVectors="true"/>
>> > > <field name="title" type="text_chinese" indexed="true" stored="true" omitNorms="true" termVectors="true"/>
>> > >
>> > > <fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100">
>> > >   <analyzer type="index">
>> > >     <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
>> > >     <filter class="solr.CJKWidthFilterFactory"/>
>> > >     <filter class="solr.CJKBigramFilterFactory"/>
>> > >     <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>> > >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
>> > >     <filter class="solr.PorterStemFilterFactory"/>
>> > >   </analyzer>
>> > >   <analyzer type="query">
>> > >     <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
>> > >     <filter class="solr.CJKWidthFilterFactory"/>
>> > >     <filter class="solr.CJKBigramFilterFactory"/>
>> > >     <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>> > >     <filter class="solr.PorterStemFilterFactory"/>
>> > >   </analyzer>
>> > > </fieldType>
>> > >
>> > > Here's the highlighting portion of my solrconfig.xml:
>> > >
>> > > <requestHandler name="/highlight" class="solr.SearchHandler">
>> > >   <lst name="defaults">
>> > >     <str name="echoParams">explicit</str>
>> > >     <int name="rows">10</int>
>> > >     <str name="wt">json</str>
>> > >     <str name="indent">true</str>
>> > >     <str name="df">text</str>
>> > >     <str name="fl">id, title, content_type, last_modified, url, score</str>
>> > >     <str name="hl">on</str>
>> > >     <str name="hl.fl">id, title, content, author, tag</str>
>> > >     <str name="hl.highlightMultiTerm">true</str>
>> > >     <str name="hl.preserveMulti">true</str>
>> > >     <str name="hl.encoder">html</str>
>> > >     <str name="hl.fragsize">200</str>
>> > >     <str name="group">true</str>
>> > >     <str name="group.field">signature</str>
>> > >     <str name="group.main">true</str>
>> > >     <str name="group.cache.percent">100</str>
>> > >   </lst>
>> > > </requestHandler>
>> > >
>> > > <boundaryScanner name="breakIterator" class="solr.highlight.BreakIteratorBoundaryScanner">
>> > >   <lst name="defaults">
>> > >     <str name="hl.bs.type">WORD</str>
>> > >     <str name="hl.bs.language">en</str>
>> > >     <str name="hl.bs.country">SG</str>
>> > >   </lst>
>> > > </boundaryScanner>
>> > > Meanwhile, I'll take a look at the articles too.
>> > >
>> > > Thank you.
>> > >
>> > > Regards,
>> > > Edwin
>> > >
>> > > On 20 October 2015 at 11:32, Scott Chu <scott....@udngroup.com> wrote:
>> > >
>> > > > Hi Edwin,
>> > > >
>> > > > I don't use Jieba for Chinese (I use only CJK, very fundamental, I know), so I haven't experienced this problem.
>> > > >
>> > > > I'd suggest you post your schema.xml so we can see how you define your content field and the field type it uses.
>> > > >
>> > > > In the meantime, refer to these articles; maybe the answer or a workaround can be deduced from them:
>> > > >
>> > > > https://issues.apache.org/jira/browse/SOLR-3390
>> > > >
>> > > > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
>> > > >
>> > > > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
>> > > >
>> > > > Good luck!
>> > > >
>> > > > Scott Chu, scott....@udngroup.com
>> > > > 2015/10/20
>> > > >
>> > > > ----- Original Message -----
>> > > > From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> > > > To: solr-user <solr-user@lucene.apache.org>
>> > > > Date: 2015-10-13, 17:04:29
>> > > > Subject: Highlighting content field problem when using JiebaTokenizerFactory
>> > > >
>> > > > Hi,
>> > > >
>> > > > I'm trying to use the JiebaTokenizerFactory to index Chinese characters in Solr. The segmentation works fine when I'm using the Analysis function in the Solr Admin UI.
>> > > >
>> > > > However, when I try to do highlighting in Solr, it does not highlight in the correct place. For example, when I search for 自然環境与企業本身, it highlights 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
>> > > >
>> > > > Even when I search for an English word like responsibility, it highlights <em> *responsibilit<em>*y.
>> > > >
>> > > > Basically, the highlighting goes off by 1 character/space consistently.
>> > > >
>> > > > This problem only happens in the content field, and not in any other fields. Does anyone know what could be causing the issue?
>> > > >
>> > > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
>> > > >
>> > > > Regards,
>> > > > Edwin
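P.S. Regarding the English tokenization issue above: rather than patching SegToken, one workaround I'm considering is to keep a second, English-analyzed copy of the field via copyField, and to search and highlight on both copies. This is only a rough sketch of the idea, not something I have fully tested; the field name content_en and the type name text_english are just placeholders, and the stopword file is the one from the standard example configs:

<field name="content_en" type="text_english" indexed="true" stored="true" termVectors="true"/>
<copyField source="content" dest="content_en"/>

<fieldType name="text_english" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

That way a query like "water" would be kept as a whole word by StandardTokenizer on content_en, while the Jieba-based analysis on the content field stays untouched for Chinese.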