Hi Mike,

Yonik committed a fix to Solr trunk - your test on LUCENE-3721 succeeds for me 
now.  (On Solr trunk, *all* CharFilters have been non-functional since 
LUCENE-3396 was committed in r1175297 on 25 Sept 2011, until Yonik's fix today 
in r1235810; Solr 3.x was not affected - CharFilters have been working there 
all along.)
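
In case it helps anyone reading the archives, below is a plain-Java sketch of the two-step behavior discussed in the quoted thread: the char filter decodes numeric character entities, and the tokenizer then drops the resulting symbol characters. It has no Lucene/Solr dependencies - both helper methods (decodeDecimalEntities, alphanumericTokens) are simplified stand-ins for illustration, not Solr's actual implementations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityStripDemo {

    // Decode decimal numeric character references like &#174; -> (R).
    // (HTMLStripCharFilter handles much more; this covers only the case at hand.)
    static String decodeDecimalEntities(String in) {
        Matcher m = Pattern.compile("&#(\\d+);").matcher(in);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(in, last, m.start());
            out.appendCodePoint(Integer.parseInt(m.group(1)));
            last = m.end();
        }
        out.append(in.substring(last));
        return out.toString();
    }

    // Roughly mimic StandardTokenizer's effect here: emit only runs of
    // letters/digits, so "Symbol, Other" characters like the registered
    // and trademark signs never appear in any output token.
    static List<String> alphanumericTokens(String in) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(in);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String decoded = decodeDecimalEntities("Bose&#174; &#8482;");
        System.out.println(decoded);                     // Bose® ™
        System.out.println(alphanumericTokens(decoded)); // [Bose]
    }
}
```

If the char filter never runs (the trunk bug), tokenization sees the raw "&#174;" text instead, and the digit runs "174" and "8482" survive as tokens - exactly the terms showing up in the facet output below.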

Steve

> -----Original Message-----
> From: Mike Hugo [mailto:m...@piragua.com]
> Sent: Tuesday, January 24, 2012 3:56 PM
> To: solr-user@lucene.apache.org
> Subject: Re: HTMLStripCharFilterFactory not working in Solr4?
> 
> Thanks for the responses everyone.
> 
> Steve, the test method you provided also works for me.  However, when I
> try a more end-to-end test with the HTMLStripCharFilterFactory configured
> for a field, I am still having the same problem.  I attached a failing
> unit test and configuration to the following issue in JIRA:
> 
> https://issues.apache.org/jira/browse/LUCENE-3721
> 
> I appreciate all the prompt responses!  Looking forward to finding the
> root cause of this guy :)  If there's something I'm doing incorrectly in
> the configuration, please let me know!
> 
> Mike
> 
> On Tue, Jan 24, 2012 at 1:57 PM, Steven A Rowe <sar...@syr.edu> wrote:
> 
> > Hi Mike,
> >
> > When I add the following test to TestHTMLStripCharFilterFactory.java on
> > Solr trunk, it passes:
> >
> > public void testNumericCharacterEntities() throws Exception {
> >   final String text = "Bose&#174; &#8482;";  // |Bose® ™|
> >   HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
> >   htmlStripFactory.init(Collections.<String,String>emptyMap());
> >   CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
> >   StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
> >   stdTokFactory.init(DEFAULT_VERSION_PARAM);
> >   Tokenizer stream = stdTokFactory.create(charStream);
> >   assertTokenStreamContents(stream, new String[] { "Bose" });
> > }
> >
> > What's happening:
> >
> > First, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™".
> >  Then stdTokFactory declines to tokenize "®" and "™", because they
> > belong to the Unicode general category "Symbol, Other", and so are not
> > included in any of the output tokens.
> >
> > StandardTokenizer uses the Word Break rules from UAX#29
> > <http://unicode.org/reports/tr29/> to find token boundaries, and then
> > outputs only alphanumeric tokens.  See the JFlex grammar for details:
> > <http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>.
> >
> > The behavior you're seeing is not consistent with the above test.
> >
> > Steve
> >
> > > -----Original Message-----
> > > From: Mike Hugo [mailto:m...@piragua.com]
> > > Sent: Tuesday, January 24, 2012 1:34 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: HTMLStripCharFilterFactory not working in Solr4?
> > >
> > > We recently updated to the latest build of Solr4 and everything is
> > > working really well so far!  There is one case that is not working the
> > > same way it was in Solr 3.4 - we strip out certain HTML constructs
> > > (like trademark and registered, for example) in a field as defined
> > > below - it was working in Solr 3.4 with the configuration shown here,
> > > but is not working the same way in Solr4.
> > >
> > > The label field is defined as type="text_general"
> > > <field name="label" type="text_general" indexed="true" stored="false"
> > > required="false" multiValued="true"/>
> > >
> > > Here's the type definition for text_general field:
> > > <fieldType name="text_general" class="solr.TextField"
> > > positionIncrementGap="100">
> > >             <analyzer type="index">
> > >                 <tokenizer class="solr.StandardTokenizerFactory"/>
> > >                 <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >                         words="stopwords.txt"
> > >                         enablePositionIncrements="true"/>
> > >                 <filter class="solr.LowerCaseFilterFactory"/>
> > >             </analyzer>
> > >             <analyzer type="query">
> > >                 <tokenizer class="solr.StandardTokenizerFactory"/>
> > >                 <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >                 <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >                         words="stopwords.txt"
> > >                         enablePositionIncrements="true"/>
> > >                 <filter class="solr.LowerCaseFilterFactory"/>
> > >             </analyzer>
> > >         </fieldType>
> > >
> > >
> > > In Solr 3.4, that configuration was completely stripping HTML
> > > constructs out of the indexed field, which is exactly what we wanted.
> > > If, for example, we then do a facet on the label field, like in the
> > > test below, we're getting some terms in the response that we would not
> > > like to be there.
> > >
> > >
> > > // test case (groovy)
> > > void specialHtmlConstructsGetStripped() {
> > >     SolrInputDocument inputDocument = new SolrInputDocument()
> > >     inputDocument.addField('label', 'Bose&#174; &#8482;')
> > >
> > >     solrServer.add(inputDocument)
> > >     solrServer.commit()
> > >
> > >     QueryResponse response = solrServer.query(new SolrQuery('bose'))
> > >     assert 1 == response.results.numFound
> > >
> > >     SolrQuery facetQuery = new SolrQuery('bose')
> > >     facetQuery.facet = true
> > >     facetQuery.set(FacetParams.FACET_FIELD, 'label')
> > >     facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
> > >
> > >     response = solrServer.query(facetQuery)
> > >     FacetField ff = response.facetFields.find {it.name == 'label'}
> > >
> > >     List suggestResponse = []
> > >
> > >     for (FacetField.Count facetField in ff?.values) {
> > >         suggestResponse << facetField.name
> > >     }
> > >
> > >     assert suggestResponse == ['bose']
> > > }
> > >
> > > With the upgrade to Solr4, the assertion fails; the suggest response
> > > contains 174 and 8482 as terms.  Test output is:
> > >
> > > Assertion failed:
> > >
> > > assert suggestResponse == ['bose']
> > >        |               |
> > >        |               false
> > >        [174, 8482, bose]
> > >
> > >
> > > I just tried again using the latest build from today, namely:
> > > https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're
> > > still getting the failing assertion.  Is there a different way to
> > > configure the HTMLStripCharFilterFactory in Solr4?
> > >
> > > Thanks in advance for any tips!
> > >
> > > Mike
> >
