Best way to facet with value preprocessing (w/ docValues)

Konstantin Gribov Wed, 08 Jul 2015 16:53:53 -0700

Hi, folks.

Previously I used solr.TextField with preprocessing (ASCII folding,
lowercasing, etc.) on some fields for search and faceting. But on a larger
index it takes several minutes to uninvert those fields for faceting (I use
fieldValueCache and warmup queries with facets). That becomes too expensive
with frequent soft commits (every 5-10 minutes), so I want to migrate to
docValues to avoid the uninvert phase.


The documentation[1] says that only Trie*Field, StrField and UUIDField
(which is itself a subtype of StrField) support docValues="true".

I have tried two ways to work around this limitation:
1. Make a subtype of TextField that overrides `checkSchemaField`,
effectively turning docValues on for this "TextField". All preprocessing is
specified in a TokenizerChain analyzer with KeywordTokenizerFactory (so it
produces exactly one token for each value in this multivalued field),
defined via schema.xml. It seems to work, but I haven't tested it under
load. What are the potential caveats of such a scheme? Why isn't it used in
trunk Solr?
2. Make a subtype of StrField that performs hardcoded preprocessing
(like ASCII folding and lowercasing), but I can't find an appropriate point
to insert this behavior. The only method that worked was to override both
toInternal and createFields (since creating the BytesRef for docValues
doesn't use toInternal there) and do the value preprocessing in both. What
are the potential caveats? Search becomes case-insensitive (since toInternal
is used by createField and the default tokenizer), and facets become
lowercase because the createFields override creates lowercase docValues.
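
For illustration, here is a minimal sketch of the kind of hardcoded
preprocessing that option 2 describes, written against the JDK only. The
class and method names (FacetValueNormalizer, normalize) are hypothetical
and not part of Solr or Lucene. Stripping combining marks after NFD
decomposition only approximates ASCII folding; an analysis-chain folding
filter may map more characters than this does.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical helper for illustration; not a Solr or Lucene class. */
final class FacetValueNormalizer {
    private FacetValueNormalizer() {}

    /** Approximates ASCII folding plus lowercasing using JDK classes only. */
    static String normalize(String value) {
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        // Drop the combining marks (Unicode category M) left by the decomposition.
        String folded = decomposed.replaceAll("\\p{M}+", "");
        // Locale.ROOT keeps lowercasing independent of the JVM default locale.
        return folded.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Cr\u00e8me Br\u00fbl\u00e9e" is "Creme Brulee" written with accents.
        System.out.println(normalize("Cr\u00e8me Br\u00fbl\u00e9e"));
    }
}
```

Under this assumption, a StrField subtype would call the same function from
both overridden methods, so that the indexed term and the docValues bytes
cannot drift apart.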

The StrField-based variant should be faster than the TextField-based one,
since the TokenStream is reused internally in the first case and recreated
for each doc with TokenizerChain in the second. But the StrField-based
approach hardcodes the preprocessing.

The next issue is that I want to use prefix and suffix wildcard search for
some fields. As far as I understand from the code, it works only on
TextField (because it requires the Analyzer to be an instance of
TokenizerChain with ReversedWildcardFilterFactory in the TokenFilter chain).
Could I use it in the StrField-based variant by overriding
getIndexAnalyzer/getQueryAnalyzer, or would that break something?
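
To make the suffix-wildcard requirement concrete, here is a rough sketch of
the general reversed-token idea: each value is also indexed in reversed
form behind a marker character, so a query like `*ne` can be answered as a
prefix scan. This is plain Java over a sorted set, not Solr code. The class
name, method names, and the choice of marker are assumptions made for the
example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical illustration of reversed-token indexing; not a Solr class. */
final class ReversedWildcardSketch {
    // Marker assumed for this sketch; it keeps reversed terms apart from normal ones.
    static final char MARKER = '\u0001';

    private ReversedWildcardSketch() {}

    /** Builds the reversed form of a term, prefixed with the marker. */
    static String reversedTerm(String term) {
        return MARKER + new StringBuilder(term).reverse().toString();
    }

    /** Index side: store each value both as-is and in reversed form. */
    static TreeSet<String> index(List<String> values) {
        TreeSet<String> terms = new TreeSet<>();
        for (String value : values) {
            terms.add(value);
            terms.add(reversedTerm(value));
        }
        return terms;
    }

    /** Query side: "*suffix" becomes a prefix scan over the reversed terms. */
    static List<String> suffixSearch(TreeSet<String> terms, String suffix) {
        String prefix = reversedTerm(suffix);
        List<String> hits = new ArrayList<>();
        // tailSet starts at the first term >= prefix; stop once the prefix no longer matches.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            // Strip the marker and reverse back to recover the original value.
            hits.add(new StringBuilder(term.substring(1)).reverse().toString());
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = index(List.of("solr", "lucene", "tika"));
        System.out.println(suffixSearch(terms, "ne"));
    }
}
```

The point of the sketch is only that index-time and query-time handling
have to agree on the reversal and the marker, which is why the question of
where the analyzers come from matters for a StrField-based type.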

[1]: https://cwiki.apache.org/confluence/display/solr/DocValues

-- 
Best regards,
Konstantin Gribov
