On Tue, 2012-09-11 at 17:23 +0200, Robert Muir wrote:
> Just a concern where things could act a little funky:
> 
> Today, for example, if I set strength=primary, then it's going to fold
> Test and test to the same unique term,
> but under this scheme you would have <bytes>Test and <bytes>test as two terms.
>
> this could be undesirable in the typical case that you just want
> case-insensitive facets: but we don't provide
> any way to preprocess the text to avoid this.

I seem to be missing something here. The ICUCollationKeyFilter can be at
the end of the analyzer chain, so why can't the input be normalized
before entering this filter?
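
To make the question concrete, here is a minimal sketch of what I have in
mind. It is not the Lucene or ICU API: it uses the JDK's java.text.Collator
as a stand-in for the ICU collator, and the term layout (collation key bytes
followed by the display text) and the names SortableTermSketch and
sortableTerm are my own assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class SortableTermSketch {

    // Stand-in for the ICU collator: a JDK collator at primary strength,
    // which ignores case differences when comparing.
    static Collator primaryCollator() {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    // Hypothetical term layout: collation key bytes followed by the UTF-8
    // bytes of the display text. If foldCase is set, the text is lowercased
    // before both the key and the suffix are produced.
    static byte[] sortableTerm(Collator collator, String text, boolean foldCase) {
        String display = foldCase ? text.toLowerCase(Locale.ROOT) : text;
        byte[] key = collator.getCollationKey(display).toByteArray();
        byte[] suffix = display.getBytes(StandardCharsets.UTF_8);
        byte[] term = Arrays.copyOf(key, key.length + suffix.length);
        System.arraycopy(suffix, 0, term, key.length, suffix.length);
        return term;
    }

    public static void main(String[] args) {
        Collator collator = primaryCollator();
        boolean rawEqual = Arrays.equals(
                sortableTerm(collator, "Test", false),
                sortableTerm(collator, "test", false));
        boolean foldedEqual = Arrays.equals(
                sortableTerm(collator, "Test", true),
                sortableTerm(collator, "test", true));
        System.out.println("same term without folding: " + rawEqual);
        System.out.println("same term with folding:    " + foldedEqual);
    }
}
```

Without folding, the two terms differ because the appended text differs,
even though the collator itself treats "Test" and "test" as equal. With a
lowercase step in front, both inputs produce the same term.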

> Really a lot of this is because factory-based analysis chains have no
> way to specify the AttributeFactory,
> e.g. I guess if we really wanted to fix this right we would need to
> pass in the AttributeFactory to TokenizerFactory's create() method.

Sounds like a larger change.

> But for now from Solr it would be a little hacky, e.g. someone is
> gonna have to fold the case client-side or whatever
> if they don't want these problems.

That would be a serious impediment. For some of our uncontrolled fields,
the same word can be cased very differently: CD, cd, Cd. To be on the
safe side, the client would have to request three times the wanted
number of facet values. But if we cannot normalize at index time,
de-duplication on the server would require changes to the faceting code.
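
For illustration, this is roughly the merge the client would be stuck with.
It is only a sketch: the class and method names are hypothetical, the counts
are made up, and it assumes the facet response has already been parsed into
a map from term to count.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class FacetMergeSketch {

    // Collapses case variants of the same facet value into one entry,
    // summing their counts. Keys in the result are lowercased.
    static Map<String, Long> mergeCaseVariants(Map<String, Long> facetCounts) {
        Map<String, Long> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : facetCounts.entrySet()) {
            String folded = entry.getKey().toLowerCase(Locale.ROOT);
            merged.merge(folded, entry.getValue(), Long::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up counts for the casing variants mentioned above.
        Map<String, Long> fromServer = new LinkedHashMap<>();
        fromServer.put("CD", 5L);
        fromServer.put("cd", 3L);
        fromServer.put("Cd", 1L);
        fromServer.put("DVD", 4L);
        System.out.println(mergeCaseVariants(fromServer));
    }
}
```

The catch is that the merged counts are only correct if every variant made
it into the response, which is why the client would have to over-fetch.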


Regardless, it sounds like the idea passes the initial sanity check.
Should I open a JIRA issue for it?


