Hi Scott,

Here are the schema.xml definitions for content and title, both of which
use text_chinese. The problem occurs only in content, not in title.

<field name="content" type="text_chinese" indexed="true" stored="true"
       omitNorms="true" termVectors="true"/>
<field name="title" type="text_chinese" indexed="true" stored="true"
       omitNorms="true" termVectors="true"/>

<fieldType name="text_chinese" class="solr.TextField"
  <analyzer type="index">
    <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.StopFilterFactory"
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
    <filter class="solr.PorterStemFilterFactory"/>
  <analyzer type="query">
    <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.StopFilterFactory"
    <filter class="solr.PorterStemFilterFactory"/>
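
One way to look at the token offsets that text_chinese produces, outside the Admin UI, is to call Solr's field analysis handler directly. The following is only a sketch: the host, port and core name (`collection1`) are assumptions, and it presumes the `/analysis/field` handler is reachable on the core. It only builds the URL. The JSON returned by Solr should list the tokens emitted at each analysis stage together with their start and end offsets, which can then be compared with the highlighted fragments.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, core, field_type, text):
    """Build a URL for Solr's field analysis handler (/analysis/field).

    base_url and core are placeholders for a local setup, not values
    taken from the configuration above.
    """
    params = {
        "analysis.fieldtype": field_type,  # field type whose analyzer is run
        "analysis.fieldvalue": text,       # text passed through the index analyzer
        "wt": "json",
    }
    return f"{base_url}/{core}/analysis/field?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and core name; adjust to the actual installation.
    print(field_analysis_url("http://localhost:8983/solr", "collection1",
                             "text_chinese", "responsibility"))
```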

Here's the highlighting portion of my solrconfig.xml:

<requestHandler name="/highlight" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
    <str name="fl">id, title, content_type, last_modified, url, score </str>

    <str name="hl">on</str>
    <str name="hl.fl">id, title, content, author, tag</str>
    <str name="hl.highlightMultiTerm">true</str>
    <str name="hl.preserveMulti">true</str>
    <str name="hl.encoder">html</str>
    <str name="hl.fragsize">200</str>
    <str name="group">true</str>
    <str name="group.field">signature</str>
    <str name="group.main">true</str>
    <str name="group.cache.percent">100</str>

<boundaryScanner name="breakIterator"
  <lst name="defaults">
    <str name="hl.bs.type">WORD</str>
    <str name="hl.bs.language">en</str>
    <str name="hl.bs.country">SG</str>
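
To check the content field in isolation, the handler defaults above can be overridden per request. The sketch below builds such a request against the `/highlight` handler, restricting both the query and `hl.fl` to a single field. The host, port and core name are assumptions; only the handler path and parameter names come from the configuration above.

```python
from urllib.parse import urlencode


def highlight_url(base_url, core, field, term):
    """Build a /highlight request that searches and highlights one field only."""
    params = {
        "q": f"{field}:{term}",  # df is "text" in the defaults, so name the field
        "hl.fl": field,          # override the multi-field hl.fl default
        "fl": "id",              # keep the response small
    }
    return f"{base_url}/{core}/highlight?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and core name; adjust to the actual installation.
    print(highlight_url("http://localhost:8983/solr", "collection1",
                        "content", "responsibility"))
```

Running the same request with `title` in place of `content` gives a direct comparison between the field that highlights correctly and the one that does not.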

Meanwhile, I'll take a look at the articles too.

Thank you.


On 20 October 2015 at 11:32, Scott Chu <scott....@udngroup.com> wrote:

> Hi Edwin,
> I don't use Jieba for Chinese (I use only CJK, very fundamental, I
> know), so I haven't experienced this problem.
> I'd suggest you post your schema.xml so we can see how you define your
> content field and the field type it uses.
> In the meantime, refer to these articles; maybe the answer or a workaround
> can be deduced from them.
> https://issues.apache.org/jira/browse/SOLR-3390
> http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
> http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> Good luck!
> Scott Chu,scott....@udngroup.com
> 2015/10/20
> ----- Original Message -----
> *From: *Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> *To: *solr-user <solr-user@lucene.apache.org>
> *Date: *2015-10-13, 17:04:29
> *Subject: *Highlighting content field problem when using
> JiebaTokenizerFactory
> Hi,
> I'm trying to use the JiebaTokenizerFactory to index Chinese characters in
> Solr. The segmentation works fine when I use the Analysis function in
> the Solr Admin UI.
> However, when I tried highlighting in Solr, it does not highlight in the
> correct place. For example, when I search for 自然環境与企業本身,
> it highlights 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
> Even when I search for an English word like responsibility, it highlights
>  <em> responsibilit</em>y.
> Basically, the highlighting is consistently off by 1 character/space.
> This problem only happens in the content field, and not in any other fields.
> Does anyone know what could be causing the issue?
> I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> Regards,
> Edwin
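
The symptom in the quoted message, a highlight that starts one character early and ends one character early, is what results when correct token offsets are shifted left by one before the markup is inserted. The sketch below reproduces that effect in isolation. It is not Solr's highlighter and does not identify the cause; the `highlight` helper and the sample text are made up for illustration.

```python
def highlight(text, spans, pre="<em>", post="</em>"):
    """Wrap each (start, end) character span of text in pre/post markers."""
    out, pos = [], 0
    for start, end in sorted(spans):
        out.append(text[pos:start])              # text before the span
        out.append(pre + text[start:end] + post)  # the marked-up span
        pos = end
    out.append(text[pos:])                        # remaining tail
    return "".join(out)


text = "its responsibility"  # made-up sample text
start = text.index("responsibility")
end = start + len("responsibility")

# Offsets that match the stored text.
correct = highlight(text, [(start, end)])

# The same offsets shifted left by one character.
shifted = highlight(text, [(start - 1, end - 1)])

if __name__ == "__main__":
    print(correct)
    print(shifted)
```

With the shifted offsets, the marked span picks up the preceding space and drops the final "y", which matches the pattern reported for both the English and the Chinese query.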
