Hi Scott,

Here are the content and title field definitions from my schema.xml; both
use text_chinese. The problem only occurs in content, not in title.

  <field name="content" type="text_chinese" indexed="true" stored="true"
         omitNorms="true" termVectors="true"/>
  <field name="title" type="text_chinese" indexed="true" stored="true"
         omitNorms="true" termVectors="true"/>


  <fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
      <filter class="solr.StopFilterFactory"
              words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
      <filter class="solr.StopFilterFactory"
              words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>
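Since the highlights are consistently shifted, I suspect the tokenizer may be
emitting wrong start/end offsets. A quick way to check is to run some text
through the Analysis screen (or the /analysis/field handler) and verify each
token's offsets against the original text. Here's a small sketch I'm using for
that check; the "text"/"start"/"end" keys mirror the token attributes shown in
the analysis response, and the sample data below is made up to illustrate an
off-by-one:

```python
# Verify that analyzer token offsets line up with the source text.
# Each token dict mimics the text/start/end attributes reported by
# Solr's analysis output (key names assumed for this sketch).

def check_offsets(text, tokens):
    """Return (token_text, actual_substring) pairs that don't match."""
    mismatches = []
    for tok in tokens:
        actual = text[tok["start"]:tok["end"]]
        if actual != tok["text"]:
            mismatches.append((tok["text"], actual))
    return mismatches

text = "responsibility"
# Correct offsets: the substring matches the token exactly.
good = [{"text": "responsibility", "start": 0, "end": 14}]
# Offsets shifted right by one, like the symptom I'm seeing.
bad = [{"text": "responsibility", "start": 1, "end": 15}]

print(check_offsets(text, good))  # []
print(check_offsets(text, bad))   # [('responsibility', 'esponsibility')]
```

If the mismatches show up only for the tokenizer output (and not for the later
filters), that would point at JiebaTokenizerFactory's offset handling.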


Here's the highlighting portion of my solrconfig.xml:

  <requestHandler name="/highlight" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="wt">json</str>
      <str name="indent">true</str>
      <str name="df">text</str>
      <str name="fl">id, title, content_type, last_modified, url, score</str>

      <str name="hl">on</str>
      <str name="hl.fl">id, title, content, author, tag</str>
      <str name="hl.highlightMultiTerm">true</str>
      <str name="hl.preserveMulti">true</str>
      <str name="hl.encoder">html</str>
      <str name="hl.fragsize">200</str>

      <str name="group">true</str>
      <str name="group.field">signature</str>
      <str name="group.main">true</str>
      <str name="group.cache.percent">100</str>
    </lst>
  </requestHandler>

  <boundaryScanner name="breakIterator"
                   class="solr.highlight.BreakIteratorBoundaryScanner">
    <lst name="defaults">
      <str name="hl.bs.type">WORD</str>
      <str name="hl.bs.language">en</str>
      <str name="hl.bs.country">SG</str>
    </lst>
  </boundaryScanner>
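In case it helps, here's roughly how I'm issuing the highlight request from a
script; the host, port, and collection name below are placeholders for my
setup, not the exact request:

```python
from urllib.parse import urlencode

# Parameters for the /highlight handler defined above; everything not
# set here falls back to the handler's defaults in solrconfig.xml.
params = {
    "q": "content:自然環境与企業本身",
    "hl": "on",
    "hl.fl": "content",
    "hl.fragsize": 200,
}
# "collection1" and localhost:8983 are placeholders for my environment.
url = "http://localhost:8983/solr/collection1/highlight?" + urlencode(params)
print(url)
```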


Meanwhile, I'll take a look at the articles too.

Thank you.

Regards,
Edwin


On 20 October 2015 at 11:32, Scott Chu <scott....@udngroup.com> wrote:

> Hi Edwin,
>
> I didn't use Jieba for Chinese (I use only CJK, which is very fundamental, I
> know), so I haven't experienced this problem.
>
> I'd suggest you post your schema.xml so we can see how you define your
> content field and the field type it uses.
>
> In the meantime, refer to these articles; maybe the answer or a workaround
> can be deduced from them.
>
> https://issues.apache.org/jira/browse/SOLR-3390
>
> http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
>
> http://qnalist.com/questions/667066/highlighting-marks-wrong-words
>
> Good luck!
>
>
>
>
> Scott Chu,scott....@udngroup.com
> 2015/10/20
>
> ----- Original Message -----
> *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> *To: *solr-user <solr-user@lucene.apache.org>
> *Date: *2015-10-13, 17:04:29
> *Subject: *Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi,
>
> I'm trying to use the JiebaTokenizerFactory to index Chinese characters in
> Solr. It works fine with the segmentation when I'm using the Analysis
> function on the Solr Admin UI.
>
> However, when I try highlighting in Solr, the highlight is not placed
> correctly. For example, when I search for 自然環境与企業本身, it highlights
> 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
>
> Even when I search for an English word like responsibility, it highlights
> <em> responsibilit</em>y.
>
> Basically, the highlighting is consistently off by one character/space.
>
> This problem only happens in the content field, and not in any other fields.
> Does anyone know what could be causing the issue?
>
> I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
>
>
> Regards,
> Edwin
>
>
>
> -----
> No virus was found in this message.
> Checked by AVG - www.avg.com
> Version: 2015.0.6140 / Virus database: 4447/10808 - Release date: 10/12/15
>
>