Hello again.

My earlier statement about TokenStream reuse was incorrect: the stream is
reused even in the TextField case, so Analyzer#createComponents() is
called infrequently. But analysis itself still takes some time (in the
StrField case it's trivial, just a call to `toInternal`).

One suspicious detail is that solr.StrField doesn't call
`FieldType#toInternal` when creating docValues, yet does call it (via
DefaultAnalyzer) when indexing/storing the field. That's currently not a
problem only because toInternal is a no-op there.
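
For reference, StrField's own `createFields` (paraphrased from my reading
of the 5.x source, so details may be slightly off) builds the docValues
bytes straight from the raw value:

  // Paraphrased sketch of Solr 5.x StrField#createFields: the docValues
  // bytes come from value.toString(), bypassing toInternal entirely.
  @Override
  public List<IndexableField> createFields(SchemaField field, Object value,
      float boost) {
    if (field.hasDocValues()) {
      List<IndexableField> fields = new ArrayList<>();
      // Indexed/stored path: goes through DefaultAnalyzer -> toInternal.
      fields.add(createField(field, value, boost));
      // docValues path: raw value, no toInternal.
      final BytesRef bytes = new BytesRef(value.toString());
      if (field.multiValued()) {
        fields.add(new SortedSetDocValuesField(field.getName(), bytes));
      } else {
        fields.add(new SortedDocValuesField(field.getName(), bytes));
      }
      return fields;
    }
    return Collections.singletonList(createField(field, value, boost));
  }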

I had to go with the first workaround mentioned in my previous message
because I want efficient prefix and suffix wildcard search (and faceting
on the same field).

My solution[1] is to subclass `solr.TextField` and override
`FieldType#checkSchemaField` (making it a no-op) and
`FieldType#createFields` to produce the appropriate docValues fields. The
resulting `createFields` method is similar to StrField's, but it passes
the value through the analyzer chain and can produce several
`SortedSetDocValuesField`s when the analyzer returns multiple tokens and
multiValued="true" is set on the field.
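
The `checkSchemaField` override itself is trivial; a minimal sketch
(assuming the 5.x `FieldType#checkSchemaField(SchemaField)` signature):

  @Override
  public void checkSchemaField(SchemaField field) {
    // no-op: don't reject docValues="true" on this TextField subclass
  }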

The `createFields` part:

  @Override
  public List<IndexableField> createFields(SchemaField field, Object value,
      float boost) {
    if (field.hasDocValues()) {
      List<IndexableField> fields = new ArrayList<>();
      // The regular indexed/stored field, same as without docValues.
      fields.add(createField(field, value, boost));

      // Run the value through the index-time analyzer chain (see below).
      List<String> data = analyzedField(field, value);
      if (field.multiValued()) {
        for (String datum : data) {
          final BytesRef bytes = new BytesRef(datum);
          fields.add(new SortedSetDocValuesField(field.getName(), bytes));
        }
      } else {
        if (data.size() > 1) {
          throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
              "Field analysis for " + field + " returned multiple analyzed values");
        }
        // Assumes the analyzer emitted at least one token,
        // otherwise data.get(0) throws.
        final BytesRef bytes = new BytesRef(data.get(0));
        fields.add(new SortedDocValuesField(field.getName(), bytes));
      }

      return fields;
    } else {
      return Collections.singletonList(createField(field, value, boost));
    }
  }

  private List<String> analyzedField(SchemaField field, Object value) {
    try {
      List<String> result = new ArrayList<>();
      TokenStream ts = field.getType().getIndexAnalyzer()
          .tokenStream(field.getName(), value.toString());
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      try {
        ts.reset();
        // Collect every token the analyzer produces for this value.
        while (ts.incrementToken()) {
          result.add(term.toString());
        }
        ts.end();
      } finally {
        ts.close();
      }
      return result;
    } catch (IOException e) {
      // Pass the cause along so it shows up in the logs.
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "Can't analyze " + value.toString() + " in " + field, e);
    }
  }
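
(Side note: the try/finally around the TokenStream could be a
try-with-resources instead, since TokenStream is Closeable.)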

I've tested it on my laptop with a single Solr core (on Solr 5.2.1) and
~1M documents (most fields are stored and indexed; about 20 fields use
this class for faceting), and it works as expected. Indexing was done by a
Java app with 4 parallel threads; disk writes were about 10 MiB/s, ~25k
docs/minute. A schema example can be found on GitHub[1].
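
For illustration, the fieldType declaration looks roughly like this (the
fieldType and class names here are placeholders, see the repo for the real
one):

  <fieldType name="text_dv" class="com.example.DocValuesTextField"
             docValues="true" multiValued="true">
    <analyzer>
      <!-- exactly one token per value, so facets behave like a folded StrField -->
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>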

Any thoughts on it? Can something like this be merged into trunk to support
docValues on `solr.TextField`?

[1]: https://github.com/grossws/solr-dvtf


Thu, 9 Jul 2015 at 2:52, Konstantin Gribov <gros...@gmail.com>:

> Hi, folks.
>
> Earlier I used solr.TextField with preprocessing (ASCII folding,
> lowercasing, etc.) on some fields for search and faceting. But on a
> larger index it takes several minutes to uninvert those fields for
> faceting (I use fieldValueCache & warmup queries with facets). That
> becomes too expensive with frequent soft commits (every 5-10 min), so I
> want to migrate to docValues to avoid the uninvert phase.
>
> Documentation[1] says that only Trie*Field, StrField and UUIDField
> (which is itself a subtype of StrField) support docValues="true".
>
> I have tried two ways to work around this issue:
> 1. Make a subtype of TextField which overrides `checkSchemaField`,
> effectively turning docValues on for this "TextField". All preprocessing
> is specified in a TokenizerChain analyzer with KeywordTokenizerFactory
> (so it produces exactly one token for each value of this multivalued
> field), defined via schema.xml. It seems to work, but I haven't tested
> it under load. What are the potential caveats of such a scheme? Why
> isn't it used in trunk Solr?
> 2. Make a subtype of StrField which performs hardcoded preprocessing
> (like ASCII folding and lowercasing), but I can't find an appropriate
> point to insert this behavior. The only working approach was to override
> both toInternal and createFields (since creating the BytesRef for
> docValues doesn't use toInternal there) and do the value preprocessing
> there. What are the potential caveats? Search becomes case-insensitive
> (since toInternal is used by createField and the default tokenizer), and
> facets become lowercase because the docValues are created lowercased by
> the createFields override.
>
> The StrField-based variant should be faster than the TextField-based
> one, since the TokenStream is reused internally in the first case and
> recreated for each doc by the TokenizerChain in the second. But the
> StrField-based approach hardcodes the preprocessing.
>
> The next issue is that I want prefix and suffix wildcard search on some
> fields. As I understand from the code, it works only on TextField
> (because it requires the Analyzer to be an instance of TokenizerChain
> with ReversedWildcardFilterFactory in the token filter chain). Should I
> use it in the StrField-based variant by overriding
> getIndexAnalyzer/getQueryAnalyzer, or would that break something?
>
> [1]: https://cwiki.apache.org/confluence/display/solr/DocValues
>
> --
> Best regards,
> Konstantin Gribov
>
-- 
Best regards,
Konstantin Gribov
