On Tue, 2012-09-11 at 17:23 +0200, Robert Muir wrote:
> Just a concern where things could act a little funky:
>
> today for example, If I set strength=primary, then its going to fold
> Test and test to the same unique term,
> but under this scheme you would have <bytes>Test and <bytes>test as two terms.
>
> this could be undesirable in the typical case that you just want
> case-insensitive facets: but we don't provide
> any way to preprocess the text to avoid this.

I seem to be missing something here. The ICUCollationKeyFilter can be at
the end of the analyzer chain, so why can't the input be normalized
before entering this filter?

> Really a lot of this is because factory-based analysis chains have no
> way to specify the AttributeFactory,
> e.g. i guess if we really wanted to fix this right we would need to
> pass in the AttributeFactory to TokenizerFactory's create() method.

Sounds like a larger change.

> But for now from Solr it would be a little hacky, e.g. someone is
> gonna have to fold the case client-side or whatever
> if they don't want these problems.

That would be a serious impediment. For some of our uncontrolled fields,
the same word can be cased very differently: CD, cd, Cd. To be on the
safe side, the client would have to ask for 3 times the wanted amount of
facet information. But if we cannot normalize at index time,
de-duplication on the server would require changes to the faceting code.

Regardless, it sounds like the idea passes the initial sanity check.
Should I open a JIRA issue for it?
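
To make the CD / cd / Cd case concrete, here is a minimal sketch of what folding the case before counting would buy us. It is not Lucene or Solr code: the class FoldedFacetSketch and its methods are hypothetical names, java.text.Collator from the JDK stands in for the ICU collator, and plain lower-casing stands in for whatever normalization the analyzer chain would do ahead of the collation key filter.

```java
import java.text.Collator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Lucene or Solr.
public class FoldedFacetSketch {

    // Stand-in for index-time normalization placed before the collation step.
    static String normalize(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    // Merge raw facet counts whose normalized forms coincide,
    // keeping the order in which the normalized terms first appear.
    static Map<String, Integer> mergeCounts(Map<String, Integer> raw) {
        Map<String, Integer> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : raw.entrySet()) {
            merged.merge(normalize(e.getKey()), e.getValue(), Integer::sum);
        }
        return merged;
    }

    // At primary strength, case differences are ignored by the collator,
    // so "Test" and "test" are treated as the same term.
    static boolean equalAtPrimaryStrength(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        return collator.equals(a, b);
    }

    public static void main(String[] args) {
        Map<String, Integer> raw = new LinkedHashMap<>();
        raw.put("CD", 5);
        raw.put("cd", 3);
        raw.put("Cd", 1);
        raw.put("DVD", 2);

        // Without folding, a client wanting the top N facets has to
        // over-request to cover the case variants; with folding it does not.
        System.out.println(mergeCounts(raw));
        System.out.println(equalAtPrimaryStrength("Test", "test"));
    }
}
```

If the folding happens at index time, the three variants become a single term and the faceting code needs no changes; the merge step above only shows what the server would otherwise have to do after the fact.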