Words that need protection from stemming, i.e., protwords.txt

2009-01-16 Thread David Woodward
Hi. Any good protwords.txt out there? In a fairly standard Solr analyzer chain, we use the English Porter analyzer like so: For most purposes the Porter stemmer does just fine, but occasionally words come along that really don't work out too well, e.g., "maine" is stemmed to "main" - clearly goofing

Unicode Normalization

2007-04-11 Thread David Woodward
Hi. I have encountered a problem searching in my application because of inconsistent Unicode normalization forms in the corpus (and the queries). I would like to normalize to form NFKD in an analyzer (I think). I was thinking about creating a filter similar to the LowerCaseFilter that would do
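
For illustration only, here is a minimal sketch of the idea in Python rather than as a Lucene/Solr filter. It assumes the goal is simply to map every token to NFKD before indexing and before querying, so that differently encoded forms of the same text compare equal. The helper name `normalize_tokens` is hypothetical and is not part of Solr or Lucene.

```python
import unicodedata
from typing import Iterable, Iterator


def normalize_tokens(tokens: Iterable[str], form: str = "NFKD") -> Iterator[str]:
    """Yield each token converted to the given Unicode normalization form.

    Hypothetical helper: it plays the role a per-token filter would play
    in an analyzer chain, much as a lowercasing filter rewrites each token.
    """
    for token in tokens:
        yield unicodedata.normalize(form, token)


# The same word encoded two ways: precomposed U+00E9 versus "e" + U+0301.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# The raw strings differ, so a naive exact match would fail.
raw_match = composed == decomposed

# After normalization both sides are the same code point sequence.
indexed = list(normalize_tokens([composed]))
queried = list(normalize_tokens([decomposed]))
normalized_match = indexed == queried
```

The same normalization has to be applied on both the index side and the query side; normalizing only one of them leaves the mismatch in place. Note that NFKD also folds compatibility characters (for example, the "fi" ligature U+FB01 becomes the two letters "fi"), which may or may not be wanted for a given corpus.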