: I have several XML files that contain HTML entities in some fields.

        ...

: If I set my field like this:
: 
: <field name="au">Brown &amp; Gammon</field>
: 
: Solr generates error "Undeclared general entity"

...because that's not valid XML...

: if I add CDATA like this:
: 
: <field name="au"><![CDATA[Brown &amp; Gammon]]></field>
: 
: it seems that I can't search with the &

...because that is valid XML, and it tells Solr you want the literal string 
"Brown &amp; Gammon" to be indexed -- given a typical analyzer you are 
probably getting either "&amp;" or "amp" as a term in your index.
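
A quick way to see the CDATA behaviour for yourself, as a minimal sketch: 
this uses Python's standard-library XML parser as a stand-in for what any 
conforming XML parser does with a CDATA section. It does not talk to Solr, 
and the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# The CDATA version of the field from the question.
cdata_doc = '<field name="au"><![CDATA[Brown &amp; Gammon]]></field>'

# Inside CDATA nothing is treated as markup, so the entity reference is
# NOT decoded: the parser hands back the five characters "&amp;" as-is.
indexed_value = ET.fromstring(cdata_doc).text

print(indexed_value)
```

The value the parser hands over still contains the characters "&amp;", 
which is why a search for a plain "&" does not match it.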

: Could you help me to find the right syntax ?

The client code you are using for indexing can either "parse" these HTML 
snippets using an HTML parser and then send Solr the *real* string you 
want to index, or you can configure Solr with something like 
HTMLStripFieldUpdateProcessorFactory (if you want both the indexed form 
and the stored form to be plain text) or HTMLStripCharFilterFactory (if 
you want to preserve the HTML markup in the stored value, but strip it as 
part of the analysis chain for indexing).


http://lucene.apache.org/solr/6_1_0/solr-core/org/apache/solr/update/processor/HTMLStripFieldUpdateProcessorFactory.html
http://lucene.apache.org/core/6_1_0/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilterFactory.html
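
For the client-side option, here is a rough sketch of the idea, assuming 
your indexing code is (or could be) Python. The helper name build_add_doc, 
the "ti" field and the sample values are made up for illustration. It only 
decodes HTML entities; it does not strip tags, so real HTML markup would 
still need an HTML parser or one of the two factories above. It decodes the 
entities first and then lets an XML library do the escaping when it builds 
the <add> message.

```python
import html
import xml.etree.ElementTree as ET

def build_add_doc(fields):
    """Build a Solr <add><doc> XML message from raw, HTML-escaped values.

    Hypothetical helper: `fields` maps field names to strings that may
    contain HTML entities such as &amp; or &eacute;.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, raw in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        # Decode HTML entities to the real characters you want indexed.
        field.text = html.unescape(raw)
    # Serialization re-escapes only what XML requires (&, <, >), so the
    # result is well-formed XML with no undeclared HTML entities in it.
    return ET.tostring(add, encoding="unicode")

payload = build_add_doc({"au": "Brown &amp; Gammon", "ti": "Caf&eacute; notes"})
print(payload)
```

On the wire the author field is still written as "Brown &amp; Gammon" 
without CDATA, because XML requires the ampersand to be escaped, but after 
XML parsing the value is the plain string "Brown & Gammon".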


-Hoss
http://www.lucidworks.com/
