Hi Mike, Yonik committed a fix to Solr trunk - your test on LUCENE-3721 succeeds for me now. (On Solr trunk, *all* CharFilters have been non-functional since LUCENE-3396 was committed in r1175297 on 25 Sept 2011, until Yonik's fix today in r1235810; Solr 3.x was not affected - CharFilters have been working there all along.)
Steve

> -----Original Message-----
> From: Mike Hugo [mailto:m...@piragua.com]
> Sent: Tuesday, January 24, 2012 3:56 PM
> To: solr-user@lucene.apache.org
> Subject: Re: HTMLStripCharFilterFactory not working in Solr4?
>
> Thanks for the responses everyone.
>
> Steve, the test method you provided also works for me. However, when I try
> a more end to end test with the HTMLStripCharFilterFactory configured for a
> field I am still having the same problem. I attached a failing unit test
> and configuration to the following issue in JIRA:
>
> https://issues.apache.org/jira/browse/LUCENE-3721
>
> I appreciate all the prompt responses! Looking forward to finding the root
> cause of this guy :) If there's something I'm doing incorrectly in the
> configuration, please let me know!
>
> Mike
>
> On Tue, Jan 24, 2012 at 1:57 PM, Steven A Rowe <sar...@syr.edu> wrote:
>
> > Hi Mike,
> >
> > When I add the following test to TestHTMLStripCharFilterFactory.java on
> > Solr trunk, it passes:
> >
> > public void testNumericCharacterEntities() throws Exception {
> >   final String text = "Bose&#174; &#8482;"; // |Bose® ™|
> >   HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
> >   htmlStripFactory.init(Collections.<String,String>emptyMap());
> >   CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
> >   StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
> >   stdTokFactory.init(DEFAULT_VERSION_PARAM);
> >   Tokenizer stream = stdTokFactory.create(charStream);
> >   assertTokenStreamContents(stream, new String[] { "Bose" });
> > }
> >
> > What's happening:
> >
> > First, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™".
> > Then stdTokFactory declines to tokenize "®" and "™", because they
> > belong to the Unicode general category "Symbol, Other", and so are not
> > included in any of the output tokens.
> >
> > StandardTokenizer uses the Word Break rules from UAX#29
> > <http://unicode.org/reports/tr29/> to find token boundaries, and then
> > outputs only alphanumeric tokens. See the JFlex grammar for details:
> > <http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>.
> >
> > The behavior you're seeing is not consistent with the above test.
> >
> > Steve
> >
> > > -----Original Message-----
> > > From: Mike Hugo [mailto:m...@piragua.com]
> > > Sent: Tuesday, January 24, 2012 1:34 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: HTMLStripCharFilterFactory not working in Solr4?
> > >
> > > We recently updated to the latest build of Solr4 and everything is working
> > > really well so far! There is one case that is not working the same way it
> > > was in Solr 3.4 - we strip out certain HTML constructs (like trademark and
> > > registered, for example) in a field as defined below - it was working in
> > > Solr3.4 with the configuration shown here, but is not working the same way
> > > in Solr4.
> > >
> > > The label field is defined as type="text_general"
> > > <field name="label" type="text_general" indexed="true" stored="false"
> > >        required="false" multiValued="true"/>
> > >
> > > Here's the type definition for text_general field:
> > > <fieldType name="text_general" class="solr.TextField"
> > >            positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >             words="stopwords.txt"
> > >             enablePositionIncrements="true"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >             words="stopwords.txt"
> > >             enablePositionIncrements="true"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >   </analyzer>
> > > </fieldType>
> > >
> > >
> > > In Solr 3.4, that configuration was completely stripping html constructs
> > > out of the indexed field which is exactly what we wanted. If for example,
> > > we then do a facet on the label field, like in the test below, we're
> > > getting some terms in the response that we would not like to be there.
> > >
> > >
> > > // test case (groovy)
> > > void specialHtmlConstructsGetStripped() {
> > >     SolrInputDocument inputDocument = new SolrInputDocument()
> > >     inputDocument.addField('label', 'Bose&#174; &#8482;')
> > >
> > >     solrServer.add(inputDocument)
> > >     solrServer.commit()
> > >
> > >     QueryResponse response = solrServer.query(new SolrQuery('bose'))
> > >     assert 1 == response.results.numFound
> > >
> > >     SolrQuery facetQuery = new SolrQuery('bose')
> > >     facetQuery.facet = true
> > >     facetQuery.set(FacetParams.FACET_FIELD, 'label')
> > >     facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
> > >
> > >     response = solrServer.query(facetQuery)
> > >     FacetField ff = response.facetFields.find {it.name == 'label'}
> > >
> > >     List suggestResponse = []
> > >
> > >     for (FacetField.Count facetField in ff?.values) {
> > >         suggestResponse << facetField.name
> > >     }
> > >
> > >     assert suggestResponse == ['bose']
> > > }
> > >
> > > With the upgrade to Solr4, the assertion fails, the suggested response
> > > contains 174 and 8482 as terms. Test output is:
> > >
> > > Assertion failed:
> > >
> > > assert suggestResponse == ['bose']
> > >        |               |
> > >        |               false
> > >        [174, 8482, bose]
> > >
> > >
> > > I just tried again using the latest build from today, namely:
> > > https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> > > getting the failing assertion. Is there a different way to configure the
> > > HTMLStripCharFilterFactory in Solr4?
> > >
> > > Thanks in advance for any tips!
> > >
> > > Mike
> >
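
For readers following the thread, the two behaviors discussed above can be roughly illustrated in plain Java without any Lucene or Solr classes. This is only a sketch under stated assumptions: the class EntityStripSketch and its two methods are hypothetical names invented here, the entity decoder handles only decimal numeric references, and the tokenizer is a crude letters-and-digits splitter, not the UAX#29 word break rules that StandardTokenizer actually implements. It shows why a working char filter yields only "Bose", while a char filter that is skipped lets "174" and "8482" through as terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not Lucene/Solr code.
class EntityStripSketch {
    // Matches decimal numeric character references such as &#174;
    private static final Pattern NUMERIC_ENTITY = Pattern.compile("&#(\\d+);");

    // Stand-in for the char filter step: replaces each decimal reference
    // with the character it names. Assumes every reference is a valid code point.
    static String decodeNumericEntities(String text) {
        Matcher m = NUMERIC_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String decoded = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Stand-in for the tokenizer step: emits maximal runs of letters and digits
    // and drops everything else, including "Symbol, Other" characters.
    static List<String> alphanumericTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String raw = "Bose&#174; &#8482;";
        // Char filter applied first: the symbols are decoded, then dropped.
        System.out.println(alphanumericTokens(decodeNumericEntities(raw))); // [Bose]
        // Char filter skipped: the digits inside the entities become tokens.
        System.out.println(alphanumericTokens(raw)); // [Bose, 174, 8482]
    }
}
```

The second call mirrors the failing facet result reported in the thread (174 and 8482 alongside bose), which is what one would expect if the tokenizer sees the raw field value because the char filter never runs.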