Hello again. My earlier statement about TokenStream reuse was incorrect: the stream is reused even in the TextField case, so Analyzer#createComponents() is called infrequently. But analysis itself still takes some time (in the StrField case it's trivial, just a call to `toInternal`).
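As a rough illustration of why createComponents() is rarely called: Lucene's Analyzer caches its TokenStreamComponents per thread through its ReuseStrategy, so repeated tokenStream() calls normally just reset the cached chain with new input. This is a simplified plain-Java sketch of that caching pattern, not actual Lucene code (CachingAnalyzer and its methods are made up for the illustration):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in (NOT real Lucene classes) for the per-thread
// caching that Lucene's Analyzer does via its ReuseStrategy.
public class ReuseSketch {
    static class CachingAnalyzer {
        final AtomicInteger created = new AtomicInteger();
        // stand-in for Lucene's per-thread components cache
        final ThreadLocal<Object> components = new ThreadLocal<>();

        Object tokenStream() {
            Object c = components.get();
            if (c == null) {               // only on first use per thread
                c = createComponents();
                components.set(c);
            }
            return c;                      // reused; only reset with new input
        }

        Object createComponents() {
            created.incrementAndGet();     // expensive: builds tokenizer + filter chain
            return new Object();
        }
    }

    public static void main(String[] args) {
        CachingAnalyzer a = new CachingAnalyzer();
        for (int i = 0; i < 1000; i++) {
            a.tokenStream();               // one call per "document"
        }
        // components were built once, then reused 999 times
        System.out.println(a.created.get());
    }
}
```

So per-document cost is dominated by running the cached chain over the input, not by rebuilding it, which matches the observation that analysis (rather than component creation) is where the time goes.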
One suspicious detail is that solr.StrField doesn't call `FieldType#toInternal` when creating docValues, but does call it via DefaultAnalyzer when indexing/storing the field. It's currently not a problem since toInternal is a no-op.

I had to choose the first workaround mentioned in my previous message because I want efficient prefix and suffix wildcard search (and faceting on the same field). My solution[1] is to subclass `solr.TextField` and override `FieldType#checkSchemaField` (making it a no-op) and `FieldType#createFields` to produce the appropriate docValues fields. The resulting `createFields` method is similar to StrField's, but it passes the value through the analyzer chain and can produce several `SortedSetDocValuesField`s when the analyzer returns several tokens and multiValued="true" is set on the field. Some code:

    @Override
    public List<IndexableField> createFields(SchemaField field, Object value, float boost) {
      if (field.hasDocValues()) {
        List<IndexableField> fields = new ArrayList<>();
        fields.add(createField(field, value, boost));
        List<String> data = analyzedField(field, value);
        if (field.multiValued()) {
          for (String datum : data) {
            final BytesRef bytes = new BytesRef(datum);
            fields.add(new SortedSetDocValuesField(field.getName(), bytes));
          }
        } else {
          if (data.size() > 1) {
            throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                "Field analysis for " + field + " returned multiple analyzed values");
          }
          final BytesRef bytes = new BytesRef(data.get(0));
          fields.add(new SortedDocValuesField(field.getName(), bytes));
        }
        return fields;
      } else {
        return Collections.singletonList(createField(field, value, boost));
      }
    }

    private List<String> analyzedField(SchemaField field, Object value) {
      try {
        List<String> result = new ArrayList<>();
        TokenStream ts = field.getType().getIndexAnalyzer()
            .tokenStream(field.getName(), value.toString());
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        try {
          ts.reset();
          while (ts.incrementToken()) {
            result.add(term.toString());
          }
          ts.end();
        } finally {
          ts.close();
        }
        return result;
      } catch (IOException e) {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
            "Can't analyze " + value.toString() + " in " + field);
      }
    }

I've tested it on my laptop with a single Solr core (Solr 5.2.1) and ~1M documents (most fields stored and indexed, about 20 fields faceted using this class), and it works as expected. Indexing was done from a Java app with 4 parallel threads; disk writes were about 10 MiB/s, ~25k docs/minute. A schema example can be found on GitHub[1].

Any thoughts on it? Could something like this be merged into trunk to support docValues on `solr.TextField`?

[1]: https://github.com/grossws/solr-dvtf

Thu, 9 Jul 2015 at 2:52, Konstantin Gribov <gros...@gmail.com>:

> Hi, folks.
>
> Earlier I used solr.TextField with preprocessing (ASCII folding,
> lowercasing, etc.) on some fields for search and faceting. But on a larger
> index it takes several minutes to uninvert those fields for faceting (I use
> fieldValueCache & warmup queries with facets). That becomes too expensive
> with frequent soft commits (5-10 min), so I want to migrate to docValues to
> avoid the uninvert phase.
>
> The documentation[1] says that only Trie*Field, StrField and UUIDField
> (which is itself a subtype of StrField) support docValues="true".
>
> I have tried two ways to work around this issue:
> 1. Make a subtype of TextField which overrides `checkSchemaField`,
> effectively turning docValues on for this "TextField". All preprocessing is
> specified in a TokenizerChain analyzer with KeywordTokenizerFactory (so it
> produces exactly one token for each value in this multivalued field),
> defined via schema.xml. It seems to work, but I haven't tested it under
> load. What are the potential caveats of such a scheme? Why isn't it used in
> trunk Solr?
> 2. Make a subtype of StrField which performs hardcoded preprocessing
> (like ASCII folding and lowercasing), but I can't find an appropriate point
> to insert this behavior.
> The only working method was to override both toInternal and createFields
> (since creating the BytesRef for docValues doesn't use toInternal there)
> and do the value preprocessing there. What are the potential caveats?
> Search becomes case-insensitive (since toInternal is used by createField
> and the default tokenizer), and facets become lowercase because the
> docValues are created lowercased by the createFields override.
>
> The StrField-based variant should be faster than the TextField-based one,
> since the TokenStream is reused internally in the first case and recreated
> on each doc with TokenizerChain in the second. But the StrField-based
> approach hardcodes the preprocessing.
>
> The next issue is that I want to use prefix and suffix wildcard search on
> some fields. As I understood from the code, it works only on TextField
> (because it requires the Analyzer to be an instance of TokenizerChain with
> ReversedWildcardFilterFactory in the TokenFilter chain). Should I use it in
> the StrField-based variant by overriding getIndexAnalyzer/getQueryAnalyzer,
> or would that break something?
>
> [1]: https://cwiki.apache.org/confluence/display/solr/DocValues
>
> --
> Best regards,
> Konstantin Gribov

--
Best regards,
Konstantin Gribov