RE: HTMLStripCharFilterFactory not working in Solr4?

Steven A Rowe Tue, 24 Jan 2012 11:57:57 -0800

Hi Mike,

When I add the following test to TestHTMLStripCharFilterFactory.java on Solr 
trunk, it passes:
  
public void testNumericCharacterEntities() throws Exception {
  final String text = "Bose&#174; &#8482;";  // |Bose® ™|
  HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
  htmlStripFactory.init(Collections.<String,String>emptyMap());
  CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
  StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
  stdTokFactory.init(DEFAULT_VERSION_PARAM);
  Tokenizer stream = stdTokFactory.create(charStream);
  assertTokenStreamContents(stream, new String[] { "Bose" });
}


What's happening: 

First, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™".  Then 
stdTokFactory declines to tokenize "®" and "™" because they belong to the 
Unicode general category "Symbol, Other", so they are not included in any of 
the output tokens.

StandardTokenizer uses the Word Break rules from UAX#29 
<http://unicode.org/reports/tr29/> to find token boundaries, and then outputs 
only alphanumeric tokens.  See the JFlex grammar for details: 
<http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>.
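
To make the two steps above concrete without a Lucene checkout, here is a 
minimal, self-contained sketch in plain Java.  It is not the Lucene 
implementation: the class EntityCategoryDemo and its methods are hypothetical 
names, decodeDecimalEntities handles only decimal references like "&#174;", 
and alphanumericRuns is a crude stand-in for the UAX#29 rules that simply 
keeps runs of letters and digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; a simplified stand-in, not Lucene code.
class EntityCategoryDemo {
    // Assumption: only decimal numeric character references are handled.
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d+);");

    // Step 1: replace each "&#NNN;" with the character it names.
    static String decodeDecimalEntities(String text) {
        Matcher m = DECIMAL_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String decoded = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    // True if the code point is in Unicode general category "Symbol, Other" (So).
    static boolean isOtherSymbol(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_SYMBOL;
    }

    // Step 2 (rough approximation): emit only runs of letters and digits.
    static List<String> alphanumericRuns(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String raw = "Bose&#174; &#8482;";
        String decoded = decodeDecimalEntities(raw);
        System.out.println(isOtherSymbol(0x00AE) + " " + isOtherSymbol(0x2122));
        System.out.println(alphanumericRuns(decoded)); // entities decoded first
        System.out.println(alphanumericRuns(raw));     // entities left undecoded
    }
}
```

With the entities decoded first, the sketch keeps only "Bose".  Run on the 
raw text, it yields "Bose", "174" and "8482", the same extra terms that show 
up in the quoted assertion output, though the sketch by itself does not 
establish why that happens in your setup.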

The behavior you're seeing is not consistent with the above test.

Steve

> -----Original Message-----
> From: Mike Hugo [mailto:[email protected]]
> Sent: Tuesday, January 24, 2012 1:34 PM
> To: [email protected]
> Subject: HTMLStripCharFilterFactory not working in Solr4?
> 
> We recently updated to the latest build of Solr4 and everything is working
> really well so far!  There is one case that is not working the same way it
> was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
> registered, for example) in a field as defined below - it was working in
> Solr3.4 with the configuration shown here, but is not working the same way
> in Solr4.
> 
> The label field is defined as type="text_general"
> <field name="label" type="text_general" indexed="true" stored="false"
> required="false" multiValued="true"/>
> 
> Here's the type definition for text_general field:
> <fieldType name="text_general" class="solr.TextField"
> positionIncrementGap="100">
>             <analyzer type="index">
>                 <tokenizer class="solr.StandardTokenizerFactory"/>
>                 <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"
>                         enablePositionIncrements="true"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>             </analyzer>
>             <analyzer type="query">
>                 <tokenizer class="solr.StandardTokenizerFactory"/>
>                 <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"
>                         enablePositionIncrements="true"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>             </analyzer>
>         </fieldType>
> 
> 
> In Solr 3.4, that configuration was completely stripping html constructs
> out of the indexed field which is exactly what we wanted.  If for example,
> we then do a facet on the label field, like in the test below, we're
> getting some terms in the response that we would not like to be there.
> 
> 
> // test case (groovy)
> void specialHtmlConstructsGetStripped() {
>     SolrInputDocument inputDocument = new SolrInputDocument()
>     inputDocument.addField('label', 'Bose&#174; &#8482;')
> 
>     solrServer.add(inputDocument)
>     solrServer.commit()
> 
>     QueryResponse response = solrServer.query(new SolrQuery('bose'))
>     assert 1 == response.results.numFound
> 
>     SolrQuery facetQuery = new SolrQuery('bose')
>     facetQuery.facet = true
>     facetQuery.set(FacetParams.FACET_FIELD, 'label')
>     facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
> 
>     response = solrServer.query(facetQuery)
>     FacetField ff = response.facetFields.find {it.name == 'label'}
> 
>     List suggestResponse = []
> 
>     for (FacetField.Count facetField in ff?.values) {
>         suggestResponse << facetField.name
>     }
> 
>     assert suggestResponse == ['bose']
> }
> 
> With the upgrade to Solr4, the assertion fails, the suggested response
> contains 174 and 8482 as terms.  Test output is:
> 
> Assertion failed:
> 
> assert suggestResponse == ['bose']
>        |               |
>        |               false
>        [174, 8482, bose]
> 
> 
> I just tried again using the latest build from today, namely:
> https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> getting the failing assertion. Is there a different way to configure the
> HTMLStripCharFilterFactory in Solr4?
> 
> Thanks in advance for any tips!
> 
> Mike
