Thanks for the responses, everyone.

Steve, the test method you provided also works for me.  However, when I try
a more end-to-end test with the HTMLStripCharFilterFactory configured for a
field, I am still having the same problem.  I attached a failing unit test
and configuration to the following issue in JIRA:

https://issues.apache.org/jira/browse/LUCENE-3721

I appreciate all the prompt responses!  Looking forward to finding the root
cause of this guy :)  If there's something I'm doing incorrectly in the
configuration, please let me know!

Mike

On Tue, Jan 24, 2012 at 1:57 PM, Steven A Rowe <sar...@syr.edu> wrote:

> Hi Mike,
>
> When I add the following test to TestHTMLStripCharFilterFactory.java on
> Solr trunk, it passes:
>
> public void testNumericCharacterEntities() throws Exception {
>   final String text = "Bose&#174; &#8482;";  // |Bose® ™|
>   HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
>   htmlStripFactory.init(Collections.<String,String>emptyMap());
>   CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
>   StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
>   stdTokFactory.init(DEFAULT_VERSION_PARAM);
>   Tokenizer stream = stdTokFactory.create(charStream);
>   assertTokenStreamContents(stream, new String[] { "Bose" });
> }
>
> What's happening:
>
> First, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™".
> Then stdTokFactory declines to tokenize "®" and "™", because they belong
> to the Unicode general category "Symbol, Other", and so they are not
> included in any of the output tokens.
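>
> As an aside (this sketch is not part of the original test), the category
> claim is easy to verify with a standalone java.lang.Character check:
>
> public class SymbolCategoryCheck {
>   public static void main(String[] args) {
>     // U+00AE REGISTERED SIGN and U+2122 TRADE MARK SIGN both carry the
>     // Unicode general category "So" (Symbol, Other), which Character
>     // reports as OTHER_SYMBOL.
>     System.out.println(Character.getType('\u00AE') == Character.OTHER_SYMBOL); // true
>     System.out.println(Character.getType('\u2122') == Character.OTHER_SYMBOL); // true
>   }
> }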
>
> StandardTokenizer uses the Word Break rules from UAX#29
> <http://unicode.org/reports/tr29/> to find token boundaries, and then
> outputs only alphanumeric tokens.  See the JFlex grammar for details:
> <http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>.
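>
> For illustration only (this is a sketch, not something from the original
> test class), the same factories should show that a symbols-only input
> yields no tokens at all:
>
> public void testSymbolsOnlyInputYieldsNoTokens() throws Exception {
>   StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
>   stdTokFactory.init(DEFAULT_VERSION_PARAM);
>   // "® ™" consists solely of "Symbol, Other" characters plus whitespace,
>   // so StandardTokenizer should emit nothing.
>   Tokenizer stream = stdTokFactory.create(new StringReader("® ™"));
>   assertTokenStreamContents(stream, new String[] { });
> }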
>
> The behavior you're seeing is not consistent with the above test.
>
> Steve
>
> > -----Original Message-----
> > From: Mike Hugo [mailto:m...@piragua.com]
> > Sent: Tuesday, January 24, 2012 1:34 PM
> > To: solr-user@lucene.apache.org
> > Subject: HTMLStripCharFilterFactory not working in Solr4?
> >
> > We recently updated to the latest build of Solr4 and everything is working
> > really well so far!  There is one case that is not working the same way it
> > was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
> > registered, for example) in a field as defined below - it was working in
> > Solr 3.4 with the configuration shown here, but is not working the same way
> > in Solr4.
> >
> > The label field is defined as type="text_general":
> > <field name="label" type="text_general" indexed="true" stored="false"
> >        required="false" multiValued="true"/>
> >
> > Here's the type definition for the text_general field:
> > <fieldType name="text_general" class="solr.TextField"
> >            positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> >
> > In Solr 3.4, that configuration was completely stripping HTML constructs
> > out of the indexed field, which is exactly what we wanted.  If, for example,
> > we then do a facet on the label field, like in the test below, we're
> > getting some terms in the response that we would not like to be there.
> >
> >
> > // test case (groovy)
> > void specialHtmlConstructsGetStripped() {
> >     SolrInputDocument inputDocument = new SolrInputDocument()
> >     inputDocument.addField('label', 'Bose&#174; &#8482;')
> >
> >     solrServer.add(inputDocument)
> >     solrServer.commit()
> >
> >     QueryResponse response = solrServer.query(new SolrQuery('bose'))
> >     assert 1 == response.results.numFound
> >
> >     SolrQuery facetQuery = new SolrQuery('bose')
> >     facetQuery.facet = true
> >     facetQuery.set(FacetParams.FACET_FIELD, 'label')
> >     facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
> >
> >     response = solrServer.query(facetQuery)
> >     FacetField ff = response.facetFields.find {it.name == 'label'}
> >
> >     List suggestResponse = []
> >
> >     for (FacetField.Count facetField in ff?.values) {
> >         suggestResponse << facetField.name
> >     }
> >
> >     assert suggestResponse == ['bose']
> > }
> >
> > With the upgrade to Solr4, the assertion fails because the suggestResponse
> > list contains 174 and 8482 as terms.  Test output is:
> >
> > Assertion failed:
> >
> > assert suggestResponse == ['bose']
> >        |               |
> >        |               false
> >        [174, 8482, bose]
> >
> >
> > I just tried again using the latest build from today, namely:
> > https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> > getting the failing assertion.  Is there a different way to configure the
> > HTMLStripCharFilterFactory in Solr4?
> >
> > Thanks in advance for any tips!
> >
> > Mike
>
