Re: The most efficient way to get un-inverted view of the index?

2016-08-17 Thread Roman Chyla
In case this helps someone, here is a solution (probably efficient
enough already, but I didn't profile it). It can deal with fields that
have DocValues and with fields that have to be un-inverted through the
FieldCache (the old 'stored' values).



private void unInvertedTheDamnThing(
    SolrIndexSearcher searcher,
    List<String> fields,
    KVSetter setter) throws IOException {

  LeafReader reader = searcher.getLeafReader();
  IndexSchema schema = searcher.getCore().getLatestSchema();
  List<LeafReaderContext> leaves = reader.getContext().leaves();

  Bits liveDocs;
  LeafReader lr;
  Transformer transformer;
  for (LeafReaderContext leave : leaves) {
    int docBase = leave.docBase;
    lr = leave.reader();
    liveDocs = lr.getLiveDocs();
    FieldInfos fInfo = lr.getFieldInfos();

    for (String field : fields) {

      FieldInfo fi = fInfo.fieldInfo(field);
      if (fi == null) {
        continue; // no document in this segment has the field
      }
      SchemaField fSchema = schema.getField(field);
      DocValuesType fType = fi.getDocValuesType();
      Map<String, Type> mapping = new HashMap<>();
      final LeafReader unReader;

      if (fType.equals(DocValuesType.NONE)) {
        // No DocValues on disk: choose the un-inversion type from the
        // schema field type (fType is the DocValuesType enum, not the
        // field class).
        Class<?> c = fSchema.getType().getClass();
        if (TextField.class.isAssignableFrom(c) ||
            StrField.class.isAssignableFrom(c)) {
          if (fSchema.multiValued()) {
            mapping.put(field, Type.SORTED_SET_BINARY);
          }
          else {
            mapping.put(field, Type.SORTED);
          }
        }
        else if (TrieIntField.class.isAssignableFrom(c)) {
          if (fSchema.multiValued()) {
            mapping.put(field, Type.SORTED_SET_INTEGER);
          }
          else {
            // Trie fields are indexed as legacy numeric terms, not points
            mapping.put(field, Type.LEGACY_INTEGER);
          }
        }
        else {
          continue;
        }
        unReader = new UninvertingReader(lr, mapping);
      }
      else {
        unReader = lr;
      }

      // Switch on the type the (possibly un-inverting) reader exposes;
      // fType is still NONE for a field that was just un-inverted.
      switch (unReader.getFieldInfos().fieldInfo(field).getDocValuesType()) {
        case NUMERIC:
          transformer = new Transformer() {
            NumericDocValues dv = unReader.getNumericDocValues(field);
            @Override
            public void process(int docBase, int docId) {
              int v = (int) dv.get(docId);
              setter.set(docBase, docId, v);
            }
          };
          break;
        case SORTED_NUMERIC:
          transformer = new Transformer() {
            SortedNumericDocValues dv =
                unReader.getSortedNumericDocValues(field);
            @Override
            public void process(int docBase, int docId) {
              dv.setDocument(docId);
              int max = dv.count();
              int v;
              for (int i = 0; i < max; i++) {
                v = (int) dv.valueAt(i);
                setter.set(docBase, docId, v);
              }
            }
          };
          break;
        case SORTED_SET:
          transformer = new Transformer() {
            SortedSetDocValues dv = unReader.getSortedSetDocValues(field);
            @Override
            public void process(int docBase, int docId) {
              // if (... > 5)   // left-hand side is cut off in the original post
              //   return;
              dv.setDocument(docId);
              for (long ord = dv.nextOrd();
                   ord != SortedSetDocValues.NO_MORE_ORDS;
                   ord = dv.nextOrd()) {
                final BytesRef value = dv.lookupOrd(ord);
                setter.set(docBase, docId, value.utf8ToString());
              }
            }
          };
          break;
        case SORTED:
          transformer = new Transformer() {
            SortedDocValues dv = unReader.getSortedDocValues(field);
            @Override
            public void process(int docBase, int docId) {
              BytesRef v = dv.get(docId);
              if (v.length == 0)
                return; // no value for this document
              setter.set(docBase, docId, v.utf8ToString());
            }
          };
          break;
        default:
          throw new IllegalArgumentException("The field " + field
              + " is of type that cannot be un-inverted");
      }

      int i = 0;
      while (i < lr.maxDoc()) {
        // skip deleted documents (liveDocs is null when nothing is deleted)
        if (liveDocs != null && !(i < liveDocs.length() && liveDocs.get(i))) {
          i++;
          continue;
        }
        transformer.process(docBase, i);
        i++;
      }
    }

  }
}
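
The bookkeeping in the loop above (a per-segment docBase plus a liveDocs check) is easier to see in isolation. Below is a minimal, self-contained sketch in plain Java, with no Lucene dependency. SegmentWalk, Segment and liveGlobalIds are hypothetical names made up for the illustration, and a java.util.BitSet stands in for Lucene's Bits.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of walking index segments; this is not the Lucene API. */
final class SegmentWalk {

    /** Hypothetical stand-in for a segment: its size and an optional live-docs bitmap. */
    static final class Segment {
        final int maxDoc;
        final BitSet liveDocs; // null means "no deletions in this segment"

        Segment(int maxDoc, BitSet liveDocs) {
            this.maxDoc = maxDoc;
            this.liveDocs = liveDocs;
        }
    }

    /** Returns the index-wide ids (docBase + segment-local id) of all live documents. */
    static List<Integer> liveGlobalIds(List<Segment> segments) {
        List<Integer> ids = new ArrayList<>();
        int docBase = 0;
        for (Segment seg : segments) {
            for (int doc = 0; doc < seg.maxDoc; doc++) {
                if (seg.liveDocs == null || seg.liveDocs.get(doc)) {
                    ids.add(docBase + doc);
                }
            }
            // deleted documents still occupy an id, so advance by maxDoc
            docBase += seg.maxDoc;
        }
        return ids;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet(3);
        live.set(0);
        live.set(2); // local doc 1 is deleted
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment(3, live));
        segments.add(new Segment(2, null));
        System.out.println(liveGlobalIds(segments));
    }
}
```

In the real code the docBase comes from LeafReaderContext.docBase rather than being summed by hand; the sketch only shows why both numbers are passed to the setter.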

Re: The most efficient way to get un-inverted view of the index?

2016-08-17 Thread Roman Chyla
Joel, thanks, but which of them? I've counted at least four different
ways to get DocValues. Are there so many functionally equivalent
approaches just because the devs can't agree on one API? Or is there a
deeper reason?

Btw, the FieldCache is still there, both in Lucene (to be deprecated)
and in Solr, but it has become package-private.

This is what removed the FieldCache:
https://issues.apache.org/jira/browse/LUCENE-5666
This is what followed: https://issues.apache.org/jira/browse/SOLR-8096

And there is still code that un-inverts data from the index when no
doc-values are available.
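
For readers unfamiliar with the term, here is a toy illustration of what "un-inverting" means: turning the term -> documents postings of an inverted index into a document -> value array. It is a sketch in plain Java under simplifying assumptions (a single-valued field, everything in memory). UninvertSketch and uninvert are hypothetical names, and this is not how Lucene implements it internally.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of un-inverting a single-valued field; this is not the Lucene API. */
final class UninvertSketch {

    /**
     * Converts postings (term -> ids of the documents containing it) into a
     * forward view (document id -> term). Documents without a term stay null.
     */
    static String[] uninvert(Map<String, int[]> postings, int maxDoc) {
        String[] forward = new String[maxDoc];
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                forward[doc] = entry.getKey();
            }
        }
        return forward;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("apple", new int[] {0, 3});
        postings.put("pear", new int[] {1});
        // document 2 has no value for the field
        System.out.println(Arrays.toString(uninvert(postings, 4)));
    }
}
```

The cost is visible even in the toy: every posting has to be visited once to fill the array, which is the work DocValues avoid by storing the forward view at index time.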

--roman


Re: The most efficient way to get un-inverted view of the index?

2016-08-16 Thread Joel Bernstein
You'll want to use org.apache.lucene.index.DocValues. The DocValues API has
replaced the field cache.





Joel Bernstein
http://joelsolr.blogspot.com/


The most efficient way to get un-inverted view of the index?

2016-08-16 Thread Roman Chyla
I need to read data from the index in order to build a special cache.
Previously, in Solr 4, this was accomplished with FieldCache or
DocTermOrds.

Now I'm struggling to see which API to use; there are many of them.

At the Lucene level:

UninvertingReader.getNumericDocValues (and others)
.getNumericValues()
MultiDocValues.getNumericValues()
MultiFields.getTerms()

At the Solr level:

reader.getNumericValues()
UninvertingReader.getNumericDocValues()
and extensions to FilterLeafReader, e.g. the very interesting but
undocumented facet accumulators (such as NumericAcc)


I need this for Solr, and ideally I would reuse the existing cache [i.e.
the special cache uses other fields, so in the old Solr those got loaded
only once and were reused, which was a win-win situation].

If I use reader.getValues() or FilterLeafReader, will I be reading the
data every time the object is created? What would be the best way to read
the data only once?
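
One generic way to "read data only once" is to memoize the loaded values per segment, so an unchanged segment is never read twice. The sketch below shows that pattern in plain Java. SegmentCache, its loader function and the string keys are hypothetical names for the illustration. It assumes a stable key object per segment is available and says nothing about what Solr's own caches do internally.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Toy compute-once cache keyed by segment; this is not a Lucene or Solr class. */
final class SegmentCache<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    /** @param loader reads the values of one segment; it must not return null */
    SegmentCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached values, calling the loader only on the first request per key. */
    V get(K segmentKey) {
        return cache.computeIfAbsent(segmentKey, loader);
    }

    /** Drops an entry, for example when its segment has been merged away. */
    void evict(K segmentKey) {
        cache.remove(segmentKey);
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        SegmentCache<String, int[]> cache = new SegmentCache<>(key -> {
            loads.incrementAndGet(); // stands in for the expensive read
            return new int[] {1, 2, 3};
        });
        cache.get("segment-a");
        cache.get("segment-a"); // served from the cache
        System.out.println("loads: " + loads.get());
    }
}
```

Keying by segment rather than by the whole searcher matters because segments are immutable: after a commit only the new segments need loading, while entries for unchanged segments stay valid.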

Thanks,

--roman