As far as I can tell, that is how it currently works (mine does the same, at
least). The HTML stripper seems to exclude the opening tag but include the
closing tag when it generates the start and end offsets of each text token. I
couldn't say why, though. (It may simply avoid the need to backtrack.)
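
To illustrate the effect, here is a minimal sketch in Python, not Solr code.
It assumes the token "bla" is stored with a start offset just after the
opening tag and an end offset after the closing tag; those offsets are
inferred from the observed snippet, not read out of Solr, and the `highlight`
helper is hypothetical.

```python
def highlight(stored, start, end, pre="<em>", post="</em>"):
    """Wrap stored[start:end] in the markers, working purely from offsets."""
    return stored[:start] + pre + stored[start:end] + post + stored[end:]


stored = "<span>bla</span>"

# Assumed offsets for the token "bla": the start skips the opening tag,
# while the end runs past the closing tag (inferred from the observed output).
start = len("<span>")
end = len(stored)
observed = highlight(stored, start, end)

# For comparison: offsets that cover only the visible text.
expected = highlight(stored, start, start + len("bla"))

print(observed)
print(expected)
```

With the assumed offsets the first line printed reproduces the snippet from
the report; the second is what offsets limited to the text itself would give.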

Play around in the Analysis section of the Admin UI to verify this.

Geraint


-----Original Message-----
From: Neumann, Dennis [mailto:neum...@sub.uni-goettingen.de]
Sent: 07 September 2016 18:16
To: solr-user@lucene.apache.org
Subject: AW: Wrong highlighting in stripped HTML field

Hello,
can anyone confirm this behavior of the highlighter? Otherwise my Solr 
installation might be misconfigured or something.
Or does anyone know if this is a known issue? In that case I probably should 
ask on the dev mailing list.

Thanks and cheers,
Dennis


________________________________________
From: Neumann, Dennis [neum...@sub.uni-goettingen.de]
Sent: Monday, 5 September 2016 18:00
To: solr-user@lucene.apache.org
Subject: Wrong highlighting in stripped HTML field

Hi guys

I am having a problem with the standard highlighter. I'm working with Solr 
5.4.1. The problem appears in my project, but it is easy to replicate:

I create a new core with the conf directory from configsets/basic_configs, so 
everything is set to defaults. I add the following in schema.xml:


    <field name="testfield" type="mytype" indexed="true" stored="true" required="false" multiValued="false" />

    <fieldType name="mytype" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory" />
        <tokenizer class="solr.StandardTokenizerFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
      </analyzer>
    </fieldType>


Now I add this document (in the admin interface):

{"id":"1","testfield":"<span>bla</span>"}

I search for: testfield:bla
with hl=on&hl.fl=testfield
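
For anyone who wants to replay the query outside the admin interface, the
request could be assembled roughly like this. It is only a sketch: the host,
port and core name ("testcore") are placeholders, `wt=json` is an assumption
based on the JSON response shown in this message, and `build_highlight_url`
is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, core, field, term):
    """Build a /select URL that searches one field and highlights it."""
    params = {
        "q": f"{field}:{term}",  # e.g. testfield:bla
        "hl": "on",              # turn highlighting on
        "hl.fl": field,          # field to highlight
        "wt": "json",            # assumed response format
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Placeholder host and core name; adjust to the local installation.
url = build_highlight_url("http://localhost:8983/solr", "testcore", "testfield", "bla")
print(url)
```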

What I get is a response with an incorrectly formatted HTML snippet:


  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "id": "1",
        "testfield": "<span>bla</span>",
        "_version_": 1544645963570741200
      }
    ]
  },
  "highlighting": {
    "1": {
      "testfield": [
        "<span><em>bla</span></em>"
      ]
    }
  }

Is there a way to tell the highlighter to enclose just the "bla"? That is, I
want to get

<span><em>bla</em></span>
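
Failing a highlighter setting, one client-side stopgap would be to repair the
returned snippet after the fact. The sketch below is not a Solr feature; it
assumes the markers are the <em> and </em> seen in the response above, and it
only handles closing tags that sit directly before the </em> with no nested
markup inside the highlighted text. The name `fix_snippet` is made up.

```python
import re

# <em>, then tag-free text, then one or more closing tags other than </em>,
# then the misplaced </em>.
MISPLACED = re.compile(r"<em>([^<]*)((?:</(?!em>)[^<>]+>)+)</em>")


def fix_snippet(snippet):
    """Move a trailing </em> in front of the closing tags it swallowed."""
    return MISPLACED.sub(r"<em>\1</em>\2", snippet)


print(fix_snippet("<span><em>bla</span></em>"))
```

Snippets that are already well-formed, such as "<em>foo</em> bar", do not
match the pattern and come back unchanged.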


Best regards
Dennis

