Re: Best way to facets with value preprocessing (w/ docValues)

2015-07-14 Thread Harry Yoo
I had the same issue, and here is my solution.

Basically, it is option #1 that Konstantin suggested:

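import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.index.StorableField;
import org.apache.lucene.util.BytesRef;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.schema.TextField;

public class TextDocValueField extends TextField {

  @Override
  public List<StorableField> createFields(SchemaField field, Object value, float boost) {
    if (field.hasDocValues()) {
      List<StorableField> fields = new ArrayList<>();
      fields.add(createField(field, value, boost));
      // docValues get the raw, unanalyzed value
      final BytesRef bytes = new BytesRef(value.toString());
      if (field.multiValued()) {
        fields.add(new SortedSetDocValuesField(field.getName(), bytes));
      } else {
        fields.add(new SortedDocValuesField(field.getName(), bytes));
      }
      return fields;
    } else {
      // return Collections.singletonList(createField(field, value, boost));
      return super.createFields(field, value, boost);
    }
  }

  @Override
  public void checkSchemaField(final SchemaField field) {
    // no-op, so that docValues="true" is not rejected for this field type
  }

  @Override
  public boolean multiValuedFieldCache() {
    return false;
  }
}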


I have had no problems so far, but I haven’t compared performance. I wish Solr
allowed docValues on TextField.

Best,
Harry






Re: Best way to facets with value preprocessing (w/ docValues)

2015-07-12 Thread Toke Eskildsen
Konstantin Gribov wrote:
> Any thoughts on it? Can something like this be merged into trunk to support
> docValues on `solr.TextField`?

> [1]: https://github.com/grossws/solr-dvtf

This looks like a perfect fit for one of our setups, where current index 
re-open time is several minutes due to 2 analyzed-but-single-token text fields 
with 10-20M values that we use for faceting.

I am not a committer and on vacation anyway, so this is just a thumbs up to the 
initiative.

- Toke Eskildsen


Re: Best way to facets with value preprocessing (w/ docValues)

2015-07-10 Thread Konstantin Gribov
Hello again.

My statement about TokenStream reuse was incorrect: the stream is reused in
the TextField case too, so Analyzer#createComponents() is called infrequently.
The analysis itself still takes some time, though (for StrField it is trivial,
just a call to `toInternal`).

One suspicious point is that solr.StrField doesn't call
`FieldType#toInternal` when creating docValues, but does call it in
DefaultAnalyzer when indexing/storing the field. This is currently not a
problem since toInternal is a no-op.

I had to choose the first workaround from my previous message because I
want efficient prefix and suffix wildcard search (and faceting on the
same field).

My solution[1] is to subclass `solr.TextField` and override
`FieldType#checkSchemaField` (making it a no-op) and `FieldType#createFields`
to produce the appropriate docValues fields. The resulting `createFields`
method is similar to StrField's, but it passes the value through the analyzer
chain and can produce several `SortedSetDocValuesField` instances if the
analyzer returns several tokens and multiValued="true" is set for the field.

Some code:

  @Override
  public List<StorableField> createFields(SchemaField field, Object value, float boost) {
    if (field.hasDocValues()) {
      List<StorableField> fields = new ArrayList<>();
      fields.add(createField(field, value, boost));

      // run the value through the index analyzer; each token becomes a docValues entry
      List<String> data = analyzedField(field, value);
      if (field.multiValued()) {
        for (String datum : data) {
          final BytesRef bytes = new BytesRef(datum);
          fields.add(new SortedSetDocValuesField(field.getName(), bytes));
        }
      } else {
        if (data.size() > 1) {
          throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
              "Field analysis for " + field + " returned multiple analyzed values");
        }
        // assumes the analyzer produced at least one token
        final BytesRef bytes = new BytesRef(data.get(0));
        fields.add(new SortedDocValuesField(field.getName(), bytes));
      }

      return fields;
    } else {
      return Collections.singletonList(createField(field, value, boost));
    }
  }

  private List<String> analyzedField(SchemaField field, Object value) {
    try {
      List<String> result = new ArrayList<>();
      TokenStream ts = field.getType().getIndexAnalyzer()
          .tokenStream(field.getName(), value.toString());
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      try {
        ts.reset();
        while (ts.incrementToken()) {
          result.add(term.toString());
        }
        ts.end();
      } finally {
        ts.close();
      }
      return result;
    } catch (IOException e) {
      // keep the original exception as the cause
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "Can't analyze " + value.toString() + " in " + field, e);
    }
  }
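
The branching in `createFields` above can be tried without Solr on the
classpath. What follows is only a sketch in plain Java: the class
`DocValuesPlan`, the method `docValuesFor` and the two `Function` constants
are hypothetical stand-ins for the analyzer chain, not Solr or Lucene API. It
shows which strings would end up as docValues entries for a multiValued and a
single-valued field. Unlike the snippet above, it also rejects the case where
analysis yields no token at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

public class DocValuesPlan {

    // Hypothetical stand-ins for analyzer chains: raw value in, list of tokens out.
    static final Function<String, List<String>> KEYWORD =
        raw -> Collections.singletonList(raw.toLowerCase(Locale.ROOT));
    static final Function<String, List<String>> WHITESPACE =
        raw -> Arrays.asList(raw.toLowerCase(Locale.ROOT).trim().split("\\s+"));

    /**
     * Returns the strings that would each become one docValues entry:
     * every token for a multiValued field, exactly one token otherwise.
     */
    static List<String> docValuesFor(String raw, boolean multiValued,
                                     Function<String, List<String>> analysis) {
        List<String> tokens = analysis.apply(raw);
        if (multiValued) {
            return new ArrayList<>(tokens);
        }
        // Stricter than the snippet above: zero tokens is rejected as well.
        if (tokens.size() != 1) {
            throw new IllegalStateException(
                "Single-valued field needs exactly one token, got " + tokens.size());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(docValuesFor("New York", false, KEYWORD));
        System.out.println(docValuesFor("New York", true, WHITESPACE));
    }
}
```

With a keyword-style chain the whole value stays one facet value, which is
why a single-valued field works with it; a chain that splits the value needs
multiValued="true".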

I've tested it on my laptop with a single Solr core (Solr 5.2.1) and ~1M
documents (most fields are stored and indexed, and about 20 faceting fields
use this class); it works as expected. Indexing was done by a Java app with
4 parallel threads; the disk write rate is about 10 MiB/s, at ~25k
docs/minute. An example of the schema part can be found on GitHub[1].

Any thoughts on it? Can something like this be merged into trunk to support
docValues on `solr.TextField`?

[1]: https://github.com/grossws/solr-dvtf


On Thu, 9 July 2015 at 2:52, Konstantin Gribov wrote:

> Hi, folks.
>
> Earlier I used solr.TextField with preprocessing (ASCII folding, lowercasing,
> etc.) on some fields for search and faceting. But on a larger index it takes
> several minutes to uninvert those fields for faceting (I use fieldValueCache
> and warmup queries with facets). This becomes too expensive with frequent
> soft commits (5-10 mins), so I want to migrate to docValues to avoid the
> uninvert phase.
>
> Documentation[1] says that only Trie*Field, StrField and UUIDField (which
> is itself a subtype of StrField) support docValues="true".
>
> I have tried two ways to work around this issue:
> 1. Make a subtype of TextField that overrides `checkSchemaField`,
> effectively turning docValues on for this "TextField". All preprocessing is
> specified in a TokenizerChain analyzer with KeywordTokenizerFactory (so it
> produces exactly one token for each value in this multivalued field),
> defined via schema.xml. It seems to work, but I haven't tested it under
> load. What are the potential caveats of such a scheme? Why isn't it used in
> trunk Solr?
> 2. Make a subtype of StrField that performs hardcoded preprocessing
> (like ASCII folding and lowercasing), but I can't find an appropriate point
> to insert this behavior. The only working method was to override both
> toInternal and createFields (since creating the BytesRef for docValues
> doesn't use toInternal there) and do the value preprocessing there. What are
> the potential caveats? Search becomes case-insensitive (since toInternal is
> used by createField and the default tokenizer), and facets become lowercase
> because the createFields override creates lowercase docValues.
>
> The StrField-based variant should be faster than the TextField-based one since
> TokenStream is reused internally in the first case and recreated for each doc
> with TokenizerChain in the second.

Best way to facets with value preprocessing (w/ docValues)

2015-07-08 Thread Konstantin Gribov
Hi, folks.

Earlier I used solr.TextField with preprocessing (ASCII folding, lowercasing,
etc.) on some fields for search and faceting. But on a larger index it takes
several minutes to uninvert those fields for faceting (I use fieldValueCache
and warmup queries with facets). This becomes too expensive with frequent
soft commits (5-10 mins), so I want to migrate to docValues to avoid the
uninvert phase.

Documentation[1] says that only Trie*Field, StrField and UUIDField (which
is itself a subtype of StrField) support docValues="true".

I have tried two ways to work around this issue:
1. Make a subtype of TextField that overrides `checkSchemaField`,
effectively turning docValues on for this "TextField". All preprocessing is
specified in a TokenizerChain analyzer with KeywordTokenizerFactory (so it
produces exactly one token for each value in this multivalued field),
defined via schema.xml. It seems to work, but I haven't tested it under
load. What are the potential caveats of such a scheme? Why isn't it used in
trunk Solr?
2. Make a subtype of StrField that performs hardcoded preprocessing
(like ASCII folding and lowercasing), but I can't find an appropriate point
to insert this behavior. The only working method was to override both
toInternal and createFields (since creating the BytesRef for docValues
doesn't use toInternal there) and do the value preprocessing there. What are
the potential caveats? Search becomes case-insensitive (since toInternal is
used by createField and the default tokenizer), and facets become lowercase
because the createFields override creates lowercase docValues.

The StrField-based variant should be faster than the TextField-based one since
TokenStream is reused internally in the first case and recreated for each doc
with TokenizerChain in the second. But the StrField-based approach hardcodes
the preprocessing.
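
To make the hardcoded preprocessing of the second variant concrete, here is a
minimal sketch in plain Java. The class `FacetValueNormalizer` and its
`normalize` method are hypothetical names, and the folding only approximates
Solr's ASCII folding: it strips combining marks after Unicode decomposition,
so letters without a decomposition (the German sharp s, for example) pass
through unchanged. In the StrField-based variant the same function would have
to back both `toInternal` and the docValues bytes built in `createFields`, so
that search terms and facet values agree.

```java
import java.text.Normalizer;
import java.util.Locale;

public class FacetValueNormalizer {

    /**
     * Decomposes the value (NFD), drops combining marks and lowercases the rest.
     * An approximation of ASCII folding plus lowercasing, not Solr's filter chain.
     */
    static String normalize(String value) {
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        return withoutMarks.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Cafe" with an acute accent on the e, written with an escape to stay ASCII-only
        System.out.println(normalize("Caf\u00e9"));
        System.out.println(normalize("Plain ASCII"));
    }
}
```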

The next issue is that I want to use prefix and suffix wildcard search for some
fields. As I understand from the code, it works only on TextField (because it
requires the Analyzer to be an instance of TokenizerChain with
ReversedWildcardFilterFactory in the TokenFilter chain). Could I use it in the
StrField-based variant by overriding getIndexAnalyzer/getQueryAnalyzer, or
would that break something?
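
For context on why the reversed-wildcard filter matters here, the idea behind
it can be sketched in plain Java: each term is also stored reversed behind a
marker character, so a search for terms ending in some suffix becomes a cheap
prefix scan over the reversed entries. The class `ReversedTermsSketch`, its
methods and the marker value are illustrative assumptions; this is not how
Solr stores terms internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class ReversedTermsSketch {

    // Marker that keeps reversed entries apart from ordinary terms (assumed value).
    static final char MARKER = '\u0001';

    // Sorted term set standing in for the term dictionary of one field.
    private final TreeSet<String> terms = new TreeSet<>();

    /** Stores the term itself plus its reversed form behind the marker. */
    void index(String term) {
        terms.add(term);
        terms.add(MARKER + new StringBuilder(term).reverse().toString());
    }

    /** Finds terms ending with the suffix by prefix-scanning the reversed entries. */
    List<String> endingWith(String suffix) {
        String prefix = MARKER + new StringBuilder(suffix).reverse().toString();
        List<String> hits = new ArrayList<>();
        for (String candidate : terms.tailSet(prefix, true)) {
            if (!candidate.startsWith(prefix)) {
                break; // left the contiguous range of matching entries
            }
            // drop the marker and undo the reversal to recover the original term
            hits.add(new StringBuilder(candidate.substring(1)).reverse().toString());
        }
        return hits;
    }

    public static void main(String[] args) {
        ReversedTermsSketch fieldTerms = new ReversedTermsSketch();
        fieldTerms.index("report.pdf");
        fieldTerms.index("notes.txt");
        fieldTerms.index("summary.pdf");
        System.out.println(fieldTerms.endingWith(".pdf"));
    }
}
```

Without the reversed entries, a leading-wildcard query has to walk every term
of the field, which is the cost the filter is meant to avoid.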

[1]: https://cwiki.apache.org/confluence/display/solr/DocValues

-- 
Best regards,
Konstantin Gribov