Re: Wrong highlighting in stripped HTML field

2016-09-08 Thread Alan Woodward
Hi, see https://issues.apache.org/jira/browse/SOLR-4686 - this is an ongoing
point of contention!

Alan Woodward
www.flax.co.uk


> On 8 Sep 2016, at 09:38, Duck Geraint (ext) GBJH wrote:
> 
> As far as I can tell, that is how it currently behaves (mine does the same, 
> at least). The HTML stripper seems to exclude the tag before the text but 
> include the tag after it when it generates the start and end offsets of each 
> text token. I couldn't say why, though (this may simply have avoided the 
> need to backtrack).
> 
> Play around in the analysis section of the admin UI to verify this.
> 
> Geraint
> 
> 
> -----Original Message-----
> From: Neumann, Dennis [mailto:neum...@sub.uni-goettingen.de]
> Sent: 07 September 2016 18:16
> To: solr-user@lucene.apache.org
> Subject: RE: Wrong highlighting in stripped HTML field
> 
> Hello,
> can anyone confirm this behavior of the highlighter? Otherwise my Solr 
> installation might be misconfigured or something.
> Or does anyone know if this is a known issue? In that case I probably should 
> ask on the dev mailing list.
> 
> Thanks and cheers,
> Dennis
> 
> 
> 
> From: Neumann, Dennis [neum...@sub.uni-goettingen.de]
> Sent: Monday, 5 September 2016 18:00
> To: solr-user@lucene.apache.org
> Subject: Wrong highlighting in stripped HTML field
> 
> Hi guys
> 
> I am having a problem with the standard highlighter. I'm working with Solr 
> 5.4.1. The problem appears in my project, but it is easy to replicate:
> 
> I create a new core with the conf directory from configsets/basic_configs, so 
> everything is set to defaults. I add the following in schema.xml:
> 
> 
> required="false" multiValued="false" />
> 
>
>  
>
>
>  
>  
>
>  
>
> 
> 
> Now I add this document (in the admin interface):
> 
> {"id":"1","testfield":"bla"}
> 
> I search for: testfield:bla
> with hl=on&hl.fl=testfield
> 
> What I get is a response with an incorrectly formatted HTML snippet:
> 
> 
>   "response": {
>     "numFound": 1,
>     "start": 0,
>     "docs": [
>       {
>         "id": "1",
>         "testfield": "bla",
>         "_version_": 1544645963570741200
>       }
>     ]
>   },
>   "highlighting": {
>     "1": {
>       "testfield": [
>         "bla"
>       ]
>     }
>   }
> 
> Is there a way to tell the highlighter to enclose just the "bla"? That is, I 
> want to get
> 
> bla
> 
> 
> Best regards
> Dennis
> 
> 
> 
> 
> 
> Syngenta Limited, Registered in England No 2710846; Registered Office : 
> Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire, 
> RG42 6EY, United Kingdom
> 
> This message may contain confidential information. If you are not the 
> designated recipient, please notify the sender immediately, and delete the 
> original and any copies. Any use of the message by you is prohibited.



RE: Wrong highlighting in stripped HTML field

2016-09-08 Thread Duck Geraint (ext) GBJH
As far as I can tell, that is how it currently behaves (mine does the same, at 
least). The HTML stripper seems to exclude the tag before the text but include 
the tag after it when it generates the start and end offsets of each text 
token. I couldn't say why, though (this may simply have avoided the need to 
backtrack).

Play around in the analysis section of the admin UI to verify this.
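
To illustrate the offsets described above, here is a minimal Python sketch. It
is not Solr's code: the stored value "<p>bla</p>", the function name
highlight, and the concrete offsets are assumptions chosen for illustration,
since the markup in the original post did not survive in the archive.

```python
def highlight(original: str, start: int, end: int,
              pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap original[start:end] in markers, the way an offset-based
    highlighter applies token offsets to the stored (unstripped) text."""
    return original[:start] + pre + original[start:end] + post + original[end:]


# Hypothetical stored field value (assumption, not taken from the thread).
stored = "<p>bla</p>"

# Offsets as described above: the start skips the tag before the text,
# but the end runs past the tag after it.
reported = (3, 10)

# Offsets that would cover only the visible text "bla".
expected = (3, 6)

broken = highlight(stored, *reported)
wanted = highlight(stored, *expected)

print(broken)  # closing </p> ends up inside the <em> element
print(wanted)  # only "bla" is enclosed
```

With offsets like the reported ones, the closing tag is swallowed by the
highlight markers, which would explain a badly nested snippet.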

Geraint


-----Original Message-----
From: Neumann, Dennis [mailto:neum...@sub.uni-goettingen.de]
Sent: 07 September 2016 18:16
To: solr-user@lucene.apache.org
Subject: RE: Wrong highlighting in stripped HTML field

Hello,
can anyone confirm this behavior of the highlighter? Otherwise my Solr 
installation might be misconfigured or something.
Or does anyone know if this is a known issue? In that case I probably should 
ask on the dev mailing list.

Thanks and cheers,
Dennis



From: Neumann, Dennis [neum...@sub.uni-goettingen.de]
Sent: Monday, 5 September 2016 18:00
To: solr-user@lucene.apache.org
Subject: Wrong highlighting in stripped HTML field

Hi guys

I am having a problem with the standard highlighter. I'm working with Solr 
5.4.1. The problem appears in my project, but it is easy to replicate:

I create a new core with the conf directory from configsets/basic_configs, so 
everything is set to defaults. I add the following in schema.xml:





  


  
  

  



Now I add this document (in the admin interface):

{"id":"1","testfield":"bla"}

I search for: testfield:bla
with hl=on&hl.fl=testfield
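
For anyone wanting to reproduce this, the request above could be built roughly
as follows. This is only a sketch: the host, port, and the core name
"testcore" are assumptions (the thread does not name the core), and wt=json
is added on the assumption that a JSON response like the one below is wanted.

```python
from urllib.parse import urlencode

# Assumed base URL; "testcore" is a hypothetical core name.
base = "http://localhost:8983/solr/testcore/select"

params = {
    "q": "testfield:bla",   # the query from the post
    "hl": "on",             # enable highlighting
    "hl.fl": "testfield",   # field to highlight
    "wt": "json",           # assumption: ask for a JSON response
}

# urlencode percent-encodes the colon in the query string.
url = base + "?" + urlencode(params)
print(url)
```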

What I get is a response with an incorrectly formatted HTML snippet:


  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "id": "1",
        "testfield": "bla",
        "_version_": 1544645963570741200
      }
    ]
  },
  "highlighting": {
    "1": {
      "testfield": [
        "bla"
      ]
    }
  }

Is there a way to tell the highlighter to enclose just the "bla"? That is, I 
want to get

bla


Best regards
Dennis




