Hello again.
My statement about TokenStream reuse was incorrect: it's reused even in the
TextField case, so Analyzer#createComponents() is called infrequently. But
analysis itself still takes some time (for StrField it's trivial, just a
call to `toInternal`).
One suspicious detail is that solr.StrField doesn't call
`FieldType#toInternal` when creating docValues, but does call it (via
DefaultAnalyzer) when indexing/storing the field. It's currently not a
problem, since toInternal is a no-op.
I chose the first workaround mentioned in my previous message because I
want efficient prefix and suffix wildcard search (and faceting on the same
field).
My solution[1] is to subclass `solr.TextField` and override
`FieldType#checkSchemaField` (making it a no-op) and `FieldType#createFields`
to produce the appropriate docValues fields. The resulting `createFields`
method is similar to StrField's, but it passes the value through the
analyzer chain and can produce several `SortedSetDocValuesField`s when the
analyzer returns several tokens and the field has multiValued="true".
Some code:
@Override
public List<IndexableField> createFields(SchemaField field, Object value,
                                         float boost) {
  if (field.hasDocValues()) {
    List<IndexableField> fields = new ArrayList<>();
    fields.add(createField(field, value, boost));
    List<String> data = analyzedField(field, value);
    if (field.multiValued()) {
      for (String datum : data) {
        final BytesRef bytes = new BytesRef(datum);
        fields.add(new SortedSetDocValuesField(field.getName(), bytes));
      }
    } else {
      if (data.size() > 1) {
        throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
            "Field analysis for " + field + " returned multiple analyzed values");
      }
      final BytesRef bytes = new BytesRef(data.get(0));
      fields.add(new SortedDocValuesField(field.getName(), bytes));
    }
    return fields;
  } else {
    return Collections.singletonList(createField(field, value, boost));
  }
}
private List<String> analyzedField(SchemaField field, Object value) {
  try {
    List<String> result = new ArrayList<>();
    TokenStream ts = field.getType().getIndexAnalyzer()
        .tokenStream(field.getName(), value.toString());
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    try {
      ts.reset();
      while (ts.incrementToken()) {
        result.add(term.toString());
      }
      ts.end();
    } finally {
      ts.close();
    }
    return result;
  } catch (IOException e) {
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
        "Can't analyze " + value.toString() + " in " + field);
  }
}
I've tested it on my laptop with a single Solr core (Solr 5.2.1) and ~1M
documents (most fields are stored and indexed; about 20 fields use this
class for faceting), and it works as expected. Indexing was done from a
Java app with 4 parallel threads; disk writes were about 10 MiB/s, ~25k
docs/minute. A schema example can be found on GitHub[1].
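For context, a minimal fieldType definition for such a class could look
roughly like the following (the class and field names here are illustrative
placeholders, not necessarily the exact ones from the linked repo):

<!-- "com.example.DocValuesTextField" is a placeholder for the custom
     TextField subclass; KeywordTokenizerFactory keeps each value as a
     single token, so one SortedSetDocValuesField is produced per value -->
<fieldType name="string_folded" class="com.example.DocValuesTextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="category" type="string_folded" indexed="true" stored="true"
       docValues="true" multiValued="true"/>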
Any thoughts on this? Could something like this be merged into trunk to
support docValues on `solr.TextField`?
[1]: https://github.com/grossws/solr-dvtf
Thu, 9 Jul 2015 at 2:52, Konstantin Gribov:
> Hi, folks.
>
> Earlier I used solr.TextField with preprocessing (ASCII folding,
> lowercasing, etc.) on some fields for search and faceting. But on a larger
> index it takes several minutes to uninvert those fields for faceting (I use
> fieldValueCache & warmup queries with facets). That becomes too expensive
> with frequent soft commits (every 5-10 minutes), so I want to migrate to
> docValues to avoid the uninversion phase.
>
> Documentation[1] says that only Trie*Field, StrField and UUIDField (which
> is itself a subtype of StrField) support docValues="true".
>
> I have tried two ways to work around this issue:
> 1. Make a subclass of TextField which overrides `checkSchemaField`,
> effectively turning docValues on for this "TextField". All preprocessing is
> specified in a TokenizerChain analyzer with KeywordTokenizerFactory (so it
> produces exactly one token for each value of this multivalued field),
> defined via schema.xml. It seems to work, but I haven't tested it under
> load. What are the potential caveats of such a scheme? Why isn't it used in
> trunk Solr?
> 2. Make a subclass of StrField which performs hardcoded preprocessing
> (like ASCII folding and lowercasing), but I couldn't find an appropriate
> point to insert this behavior. The only working approach was to override
> both toInternal and createFields (since creating the BytesRef for docValues
> doesn't use toInternal there) and do the value preprocessing in both. What
> are the potential caveats? Search becomes case-insensitive (since
> toInternal is used by createField and the default tokenizer), and facets
> become lowercase because the createFields override creates lowercase
> docValues.
>
> The StrField-based variant should be faster than the TextField-based one,
> since the TokenStream is reused internally in the first case but recreated
> for each document with TokenizerChain in the second.