UnInvertedField implementation details

Kydryavtsev Andrey Mon, 13 Oct 2014 14:12:02 -0700

Hi,

I have a question about UnInvertedField internals. I use it for faceting on a 
multi-term field, and once got the exception "Too many values for UnInvertedField 
faceting on field ...". I googled it a bit 
(http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/68000) and 
looked into the code. According to the UnInvertedField 
(http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/request/UnInvertedField.java)
  JavaDoc:


There is a single int[maxDoc()] which either contains a pointer into a byte[] 
for the termNumber lists, or directly contains the termNumber list if it fits 
in the 4 bytes of an integer. If the first byte in the integer is 1, the next 3 
bytes are a pointer into a byte[] where the termNumber list starts. There are 
actually 256 byte arrays, to compensate for the fact that the pointers into the 
byte arrays are only 3 bytes long. The correct byte array for a document is a 
function of it's id.
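
The packing described in this JavaDoc can be sketched roughly as below. This is only an illustration, not the Solr code: the class and method names are made up, and it assumes that "the first byte" means the low-order byte of the int. It shows why a 3-byte pointer caps each byte array at 2^24 addressable positions.

```java
public class PackedIndexSketch {
    // Assumption: the marker byte is the low-order byte of the int.
    static boolean isPointer(int code) {
        return (code & 0xff) == 1;
    }

    // The remaining 3 bytes hold the offset into one of the byte arrays.
    static int pointerOffset(int code) {
        return code >>> 8;
    }

    // Hypothetical helper: pack an offset into an int with the marker byte set.
    static int encodePointer(int offset) {
        if (offset < 0 || offset >= (1 << 24)) {
            throw new IllegalArgumentException("offset does not fit in 3 bytes: " + offset);
        }
        return (offset << 8) | 1;
    }

    public static void main(String[] args) {
        int code = encodePointer(12345);
        System.out.println(isPointer(code));      // true
        System.out.println(pointerOffset(code));  // 12345
        System.out.println((1 << 24) - 1);        // largest offset a 3-byte pointer can hold
    }
}
```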

This means that UnInvertedField holds an int[maxDoc()] with one entry per 
document and a byte[256][] for the termNumber lists those entries point into. 
The following formula maps a docId to one of the 256 arrays:

int whichArray = (doc >>> 16) & 0xff;

So it is not a "mod" that would give a uniform distribution: each consecutive 
block of 65536 doc IDs shares one array, and the larger max_doc is, the more 
arrays are used. The exception happens when one of these 256 arrays becomes too big. 

 I am sure that my documents don't have > 4 million unique terms. But I 
do index "updates":
     - I have only one segment, with max_doc=N
     - Every 15 minutes I load a new batch of documents (new document versions) 
whose unique IDs already exist in the index
     - Index them. Now I have two segments - the old one with "deleted" documents 
and a new one with new "versions" of the deleted documents.
     - Merge segments - run an index "optimize" to get one segment with the same 
max_doc=N

I started to check the UnInvertedField state during daily updates and found a 
strange thing. My max_doc is about 700000. According to the formula above, 
pointers are stored in the first 10 arrays. During daily updates the last array 
(the tenth array) became bigger and bigger, even though max_doc after the merge 
is constant. Finally it exceeded the limit. 
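
A minimal sketch to check which arrays a given max_doc touches, assuming doc IDs run from 0 to max_doc - 1. The class and method names are invented for illustration; only the whichArray expression comes from the formula above. For a max_doc of exactly 700000 it gives indices 0 through 10, so the exact count depends on the real max_doc.

```java
public class WhichArraySketch {
    // Same expression as the formula above: bits 16..23 of the doc ID.
    static int whichArray(int doc) {
        return (doc >>> 16) & 0xff;
    }

    // Number of distinct arrays touched by doc IDs 0 .. maxDoc-1 (assumes maxDoc >= 1).
    static int arraysInUse(int maxDoc) {
        int lastDoc = maxDoc - 1;
        // From doc ID 2^24 onward the index wraps around, so all 256 arrays are in use.
        if (lastDoc >= (1 << 24)) {
            return 256;
        }
        return whichArray(lastDoc) + 1;
    }

    public static void main(String[] args) {
        System.out.println(whichArray(65535));    // 0: last doc of the first block
        System.out.println(whichArray(65536));    // 1: first doc of the second block
        System.out.println(whichArray(699999));   // 10
        System.out.println(arraysInUse(700000));  // 11
    }
}
```

Only the last array in use is partially filled by doc IDs, which is the one reported above as growing.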

How could this happen? Could it be a problem with my "updates", so that I 
somehow increase the number of terms per document instead of replacing them? Or 
is it an UnInvertedField implementation detail? What would happen if I changed 
the formula to give a uniform distribution? And what is the point of using 
fewer big arrays instead of many relatively small ones? 

Thanks for any help.
