Hi Mike,
When I add the following test to TestHTMLStripCharFilterFactory.java on Solr
trunk, it passes:
public void testNumericCharacterEntities() throws Exception {
  // Numeric character references for the registered and trademark signs
  final String text = "Bose&#174; &#8482;"; // |Bose® ™|
  HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
  htmlStripFactory.init(Collections.<String,String>emptyMap());
  // The char filter wraps the reader, so it runs before the tokenizer
  CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
  StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
  stdTokFactory.init(DEFAULT_VERSION_PARAM);
  Tokenizer stream = stdTokFactory.create(charStream);
  // Only "Bose" should come out as a token
  assertTokenStreamContents(stream, new String[] { "Bose" });
}
What's happening:
First, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™". Then
stdTokFactory declines to tokenize "®" and "™" because they belong to the
Unicode general category "Symbol, Other", and so they are not included in any
of the output tokens.
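
The category claim can be checked with the JDK alone. What follows is a minimal
sketch, not Solr or Lucene code: the class SymbolCategoryDemo and its helper
decodeNumericEntities are hypothetical names, and the helper only handles
decimal numeric character references, which is far less than
HTMLStripCharFilter does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SymbolCategoryDemo {
    // Matches decimal numeric character references such as &#174;
    private static final Pattern NUMERIC_ENTITY = Pattern.compile("&#(\\d+);");

    // Hypothetical helper: replaces each decimal reference with its code point.
    static String decodeNumericEntities(String s) {
        Matcher m = NUMERIC_ENTITY.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String replacement = new String(Character.toChars(codePoint));
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // True if the code point is in Unicode general category So ("Symbol, Other").
    static boolean isOtherSymbol(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_SYMBOL;
    }

    public static void main(String[] args) {
        String decoded = decodeNumericEntities("Bose&#174; &#8482;");
        // Report the category check for every code point in the decoded text
        for (int i = 0; i < decoded.length(); ) {
            int cp = decoded.codePointAt(i);
            System.out.println(String.format("U+%04X otherSymbol=%b", cp, isOtherSymbol(cp)));
            i += Character.charCount(cp);
        }
    }
}
```

U+00AE and U+2122 are the only code points in that input for which
isOtherSymbol returns true.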
StandardTokenizer uses the Word Break rules from UAX#29
<http://unicode.org/reports/tr29/> to find token boundaries, and then outputs
only alphanumeric tokens. See the JFlex grammar for details:
<http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>.
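
As a rough illustration of "find word boundaries, then keep only the
alphanumeric segments", here is a sketch built on the JDK's
java.text.BreakIterator. Treat it as a stand-in under stated assumptions:
BreakIterator's word rules are not the same as the UAX#29 rules in the JFlex
grammar, and the class and method names (WordBreakSketch,
alphanumericSegments, containsLetterOrDigit) are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {
    // Splits text at word boundaries and keeps segments with a letter or digit.
    static List<String> alphanumericSegments(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        List<String> tokens = new ArrayList<String>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (containsLetterOrDigit(segment)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    // True if at least one code point in s is a letter or a digit.
    static boolean containsLetterOrDigit(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // "Bose® ™" after entity decoding
        System.out.println(alphanumericSegments("Bose\u00AE \u2122"));
    }
}
```

Whitespace-only and symbol-only segments contain no letter or digit, so the
filter drops them. That mirrors the "outputs only alphanumeric tokens" step
described above.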
The behavior you're seeing is not consistent with the above test.
Steve
> -----Original Message-----
> From: Mike Hugo [mailto:[email protected]]
> Sent: Tuesday, January 24, 2012 1:34 PM
> To: [email protected]
> Subject: HTMLStripCharFilterFactory not working in Solr4?
>
> We recently updated to the latest build of Solr4 and everything is working
> really well so far! There is one case that is not working the same way it
> was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
> registered, for example) in a field as defined below - it was working in
> Solr3.4 with the configuration shown here, but is not working the same way
> in Solr4.
>
> The label field is defined as type="text_general"
> <field name="label" type="text_general" indexed="true" stored="false"
> required="false" multiValued="true"/>
>
> Here's the type definition for text_general field:
> <fieldType name="text_general" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"
>             enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"
>             enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
>
> In Solr 3.4, that configuration was completely stripping html constructs
> out of the indexed field which is exactly what we wanted. If for example,
> we then do a facet on the label field, like in the test below, we're
> getting some terms in the response that we would not like to be there.
>
>
> // test case (groovy)
> void specialHtmlConstructsGetStripped() {
>     SolrInputDocument inputDocument = new SolrInputDocument()
>     inputDocument.addField('label', 'Bose&#174; &#8482;')
>
>     solrServer.add(inputDocument)
>     solrServer.commit()
>
>     QueryResponse response = solrServer.query(new SolrQuery('bose'))
>     assert 1 == response.results.numFound
>
>     SolrQuery facetQuery = new SolrQuery('bose')
>     facetQuery.facet = true
>     facetQuery.set(FacetParams.FACET_FIELD, 'label')
>     facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
>
>     response = solrServer.query(facetQuery)
>     FacetField ff = response.facetFields.find {it.name == 'label'}
>
>     List suggestResponse = []
>
>     for (FacetField.Count facetField in ff?.values) {
>         suggestResponse << facetField.name
>     }
>
>     assert suggestResponse == ['bose']
> }
>
> With the upgrade to Solr4, the assertion fails: the suggested response
> contains 174 and 8482 as terms. Test output is:
>
> Assertion failed:
>
> assert suggestResponse == ['bose']
>        |               |
>        |               false
>        [174, 8482, bose]
>
>
> I just tried again using the latest build from today, namely:
> https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> getting the failing assertion. Is there a different way to configure the
> HTMLStripCharFilterFactory in Solr4?
>
> Thanks in advance for any tips!
>
> Mike