Tokenization is fine with facets; that caution is about, say, faceting on the tokenized body of a document, where you potentially have a huge number of unique tokens.
But if there is a controlled number of distinct values, you shouldn't have to do anything except index to a tokenized field. I'd remove stemming, WordDelimiterFilterFactory, etc., though; in fact, I'd probably just go with WhitespaceTokenizer and, maybe, LowerCaseFilter.

But if you have a huge number of unique values, it doesn't matter whether they are tokenized or strings; it'll still be a problem.

One note: when faceting for the first time on a newly started Solr instance, the caches are filled and the *first* query will be slower, so measure subsequent queries.

Best
Erick

On Thu, Jan 27, 2011 at 9:09 AM, Dennis Schafroth <den...@indexdata.com> wrote:

> Hi,
>
> Pretty novice at SOLR coding, but looking for hints about how (if not
> already done) to implement a PatternTokenizer that would index this into
> multivalued fields of solr.StrField for faceting. Ex.
>
> Water -- Irrigation ; Water -- Sewage
>
> should be tokenized into
>
> Water
> Irrigation
> Sewage
>
> in multi-valued non-tokenized fields due to performance. I could do it from
> the outside, but I would like to use this as an opportunity to learn about SOLR.
>
> It "works" as I want with the PatternTokenizerFactory when I am using
> solr.TextField, but not when I am using the non-tokenized solr.StrField. But
> according to what I have read, facet performance is better on non-tokenized fields.
> We need better performance on our faceted searches on these multi-valued
> fields. (25 million documents, three multi-valued facets)
>
> I would also need a filter that filters out identical values, as the
> feeds have redundant data as shown above.
>
> Can anyone point me in the right direction?
>
> cheers,
> :-Dennis
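
For the "do it from the outside" option mentioned in the quoted message, here is a minimal Python sketch of splitting the raw value and dropping duplicates before the values are sent to a multi-valued field. It assumes the only separators are "--" and ";" as in the example; the function name split_subjects is made up for illustration and is not part of Solr.

```python
import re

# Assumption: values are separated by "--" or ";" with optional
# surrounding whitespace, as in "Water -- Irrigation ; Water -- Sewage".
_SEPARATORS = re.compile(r"\s*(?:--|;)\s*")


def split_subjects(raw):
    """Return the distinct terms in raw, in first-seen order."""
    seen = set()
    terms = []
    for term in _SEPARATORS.split(raw.strip()):
        # Skip empty pieces (e.g. from a trailing separator) and repeats.
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms


if __name__ == "__main__":
    print(split_subjects("Water -- Irrigation ; Water -- Sewage"))
    # prints ['Water', 'Irrigation', 'Sewage']
```

Each returned term would then be added as one value of the multi-valued field, so a plain string field sees already-split, de-duplicated values and no analysis chain is needed for it.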