Thanks for the responses, everyone. Steve, the test method you provided also works for me. However, when I try a more end-to-end test with the HTMLStripCharFilterFactory configured for a field, I am still having the same problem. I attached a failing unit test and configuration to the following issue in JIRA:
https://issues.apache.org/jira/browse/LUCENE-3721

I appreciate all the prompt responses! Looking forward to finding the root cause of this guy :) If there's something I'm doing incorrectly in the configuration, please let me know!

Mike

On Tue, Jan 24, 2012 at 1:57 PM, Steven A Rowe <sar...@syr.edu> wrote:

> Hi Mike,
>
> When I add the following test to TestHTMLStripCharFilterFactory.java on
> Solr trunk, it passes:
>
> public void testNumericCharacterEntities() throws Exception {
>   final String text = "Bose&#174; &#8482;"; // |Bose® ™|
>   HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
>   htmlStripFactory.init(Collections.<String,String>emptyMap());
>   CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
>   StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
>   stdTokFactory.init(DEFAULT_VERSION_PARAM);
>   Tokenizer stream = stdTokFactory.create(charStream);
>   assertTokenStreamContents(stream, new String[] { "Bose" });
> }
>
> What's happening:
>
> First, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™".
> Then stdTokFactory declines to tokenize "®" and "™", because they
> belong to the Unicode general category "Symbol, Other", and so are not
> included in any of the output tokens.
>
> StandardTokenizer uses the Word Break rules from UAX#29
> <http://unicode.org/reports/tr29/> to find token boundaries, and then
> outputs only alphanumeric tokens. See the JFlex grammar for details:
> <http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>.
>
> The behavior you're seeing is not consistent with the above test.
>
> Steve
>
> > -----Original Message-----
> > From: Mike Hugo [mailto:m...@piragua.com]
> > Sent: Tuesday, January 24, 2012 1:34 PM
> > To: solr-user@lucene.apache.org
> > Subject: HTMLStripCharFilterFactory not working in Solr4?
> >
> > We recently updated to the latest build of Solr4 and everything is working
> > really well so far! There is one case that is not working the same way it
> > was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
> > registered, for example) in a field as defined below - it was working in
> > Solr3.4 with the configuration shown here, but is not working the same way
> > in Solr4.
> >
> > The label field is defined as type="text_general"
> > <field name="label" type="text_general" indexed="true" stored="false"
> >        required="false" multiValued="true"/>
> >
> > Here's the type definition for text_general field:
> > <fieldType name="text_general" class="solr.TextField"
> >            positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt"
> >             enablePositionIncrements="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt"
> >             enablePositionIncrements="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> >
> > In Solr 3.4, that configuration was completely stripping html constructs
> > out of the indexed field which is exactly what we wanted. If for example,
> > we then do a facet on the label field, like in the test below, we're
> > getting some terms in the response that we would not like to be there.
> >
> >
> > // test case (groovy)
> > void specialHtmlConstructsGetStripped() {
> >     SolrInputDocument inputDocument = new SolrInputDocument()
> >     inputDocument.addField('label', 'Bose&#174; &#8482;')
> >
> >     solrServer.add(inputDocument)
> >     solrServer.commit()
> >
> >     QueryResponse response = solrServer.query(new SolrQuery('bose'))
> >     assert 1 == response.results.numFound
> >
> >     SolrQuery facetQuery = new SolrQuery('bose')
> >     facetQuery.facet = true
> >     facetQuery.set(FacetParams.FACET_FIELD, 'label')
> >     facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
> >
> >     response = solrServer.query(facetQuery)
> >     FacetField ff = response.facetFields.find {it.name == 'label'}
> >
> >     List suggestResponse = []
> >
> >     for (FacetField.Count facetField in ff?.values) {
> >         suggestResponse << facetField.name
> >     }
> >
> >     assert suggestResponse == ['bose']
> > }
> >
> > With the upgrade to Solr4, the assertion fails, the suggested response
> > contains 174 and 8482 as terms. Test output is:
> >
> > Assertion failed:
> >
> > assert suggestResponse == ['bose']
> >        |               |
> >        |               false
> >        [174, 8482, bose]
> >
> >
> > I just tried again using the latest build from today, namely:
> > https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> > getting the failing assertion. Is there a different way to configure the
> > HTMLStripCharFilterFactory in Solr4?
> >
> > Thanks in advance for any tips!
> >
> > Mike
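
For reference, here is a minimal, Lucene-free sketch of the two steps Steve describes in the thread (decode the numeric character entities, then keep only alphanumeric tokens). The class EntitySketch and the helpers decodeNumericEntities and alphanumericTokens are hypothetical names made up for illustration; they are rough stand-ins, not the actual HTMLStripCharFilter or StandardTokenizer logic. Under those assumptions, skipping the decoding step yields the same stray terms (174 and 8482) seen in the failing facet test, which would be consistent with the char filter not being applied, though it does not prove that is the root cause.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntitySketch {

    // Hypothetical helper: decodes decimal numeric character references
    // such as &#174; into the corresponding Unicode character.
    static String decodeNumericEntities(String text) {
        Matcher m = Pattern.compile("&#(\\d+);").matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String decoded = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Hypothetical helper: emits runs of letters and digits as tokens and
    // drops everything else. A crude approximation of "outputs only
    // alphanumeric tokens", not the UAX#29 word break rules.
    static List<String> alphanumericTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String raw = "Bose&#174; &#8482;";

        // Entities decoded first: the symbols are dropped by the tokenizer step.
        System.out.println(alphanumericTokens(decodeNumericEntities(raw))); // [Bose]

        // Decoding skipped: the digits inside the entities survive as tokens.
        System.out.println(alphanumericTokens(raw)); // [Bose, 174, 8482]
    }
}
```

Both "®" (U+00AE) and "™" (U+2122) report Character.OTHER_SYMBOL from Character.getType, which matches the "Symbol, Other" category mentioned above.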