As far as I can tell, that is how it is currently set up (mine does the same, at least). When the HTML stripper generates the start and end offsets of each text token, it seems to exclude the tag before the text but include the tag after it. I couldn't say why, though; it may simply have avoided the need to backtrack.
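To make the offset behaviour concrete, here is a minimal Python sketch of what a highlighter does once it has the token offsets. It is not Solr code; `wrap_highlight` is a made-up helper, and the offsets 6 and 16 are an assumption inferred from the output reported below, not values read from Solr itself.

```python
def wrap_highlight(raw, start, end, pre="<em>", post="</em>"):
    """Wrap raw[start:end] in highlight tags (hypothetical helper, not a Solr API)."""
    return raw[:start] + pre + raw[start:end] + post + raw[end:]


raw = "<span>bla</span>"

# Assumed offsets matching the reported snippet: the start offset sits just
# after "<span>", but the end offset sits after "</span>" rather than after "bla".
reported = wrap_highlight(raw, 6, 16)

# Offsets that would enclose only the text "bla".
desired = wrap_highlight(raw, 6, 9)

print(reported)  # <span><em>bla</span></em>
print(desired)   # <span><em>bla</em></span>
```

If the Analysis screen shows an end offset of 16 for the token "bla", that would be consistent with the snippet you are getting.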
Play around in the Analysis section of the admin UI to verify this.

Geraint

-----Original Message-----
From: Neumann, Dennis [mailto:neum...@sub.uni-goettingen.de]
Sent: 07 September 2016 18:16
To: solr-user@lucene.apache.org
Subject: Re: Wrong highlighting in stripped HTML field

Hello,

can anyone confirm this behavior of the highlighter? Otherwise my Solr installation might be misconfigured or something.
Or does anyone know if this is a known issue? In that case I should probably ask on the dev mailing list.

Thanks and cheers,
Dennis

________________________________________
From: Neumann, Dennis [neum...@sub.uni-goettingen.de]
Sent: Monday, 5 September 2016 18:00
To: solr-user@lucene.apache.org
Subject: Wrong highlighting in stripped HTML field

Hi guys,

I am having a problem with the standard highlighter. I'm working with Solr 5.4.1. The problem appears in my project, but it is easy to replicate:

I create a new core with the conf directory from configsets/basic_configs, so everything is set to defaults. I add the following in schema.xml:

    <field name="testfield" type="mytype" indexed="true" stored="true" required="false" multiValued="false" />

    <fieldType name="mytype" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory" />
        <tokenizer class="solr.StandardTokenizerFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
      </analyzer>
    </fieldType>

Now I add this document (in the admin interface):

    {"id":"1","testfield":"<span>bla</span>"}

I search for:

    testfield:bla

with hl=on&hl.fl=testfield

What I get is a response with an incorrectly formatted HTML snippet:

    "response": {
      "numFound": 1,
      "start": 0,
      "docs": [
        {
          "id": "1",
          "testfield": "<span>bla</span>",
          "_version_": 1544645963570741200
        }
      ]
    },
    "highlighting": {
      "1": {
        "testfield": [
          "<span><em>bla</span></em>"
        ]
      }
    }

Is there a way to tell the highlighter to just enclose the "bla"? I.e.
I want to get <span><em>bla</em></span>

Best regards
Dennis