[ https://issues.apache.org/jira/browse/SOLR-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-475:
------------------------------

    Attachment: UnInvertedField.java

Prototype attached.
This is completely untested code, and it is still missing the Solr interface
and caching.
The approach is described in the comments (cut-n-pasted here).
Any thoughts or comments on the approach?

I may not have time to work on this immediately (fix the bugs, add tests, hook
it up to Solr, add caching of the un-inverted field, etc.), so additional
contributions in this direction are welcome!

{code}
/**
 * Final form of the un-inverted field:
 *   Each document points to a list of term numbers that are contained in that document.
 *
 *   Term numbers are in sorted order, and are encoded as variable-length deltas from the
 *   previous term number.  Real term numbers start at 2 since 0 and 1 are reserved.  A
 *   term number of 0 signals the end of the termNumber list.
 *
 *   There is a single int[maxDoc()] which either contains a pointer into a byte[] for
 *   the termNumber lists, or directly contains the termNumber list if it fits in the 4
 *   bytes of an integer.  If the first byte in the integer is 1, the next 3 bytes
 *   are a pointer into a byte[] where the termNumber list starts.
 *
 *   There are actually 256 byte arrays, to compensate for the fact that the pointers
 *   into the byte arrays are only 3 bytes long.  The correct byte array for a document
 *   is a function of its id.
 *
 *   To save space and speed up faceting, any term that matches enough documents will
 *   not be un-inverted... it will be skipped while building the un-inverted field structure,
 *   and will use a set intersection method during faceting.
 *
 *   To further save memory, the terms (the actual string values) are not all stored in
 *   memory, but a TermIndex is used to convert term numbers to term values only
 *   for the terms needed after faceting has completed.  Only every 128th term value
 *   is stored, along with its corresponding term number, and this is used as an
 *   index to find the closest term and iterate until the desired number is hit (very
 *   much like Lucene's own internal term index).
 */
 */
{code}
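
To make the layout in the comment above easier to discuss, here is a rough,
self-contained sketch of the per-document encoding. It is not the attached
UnInvertedField.java. The class and method names (TermNumberListSketch,
encodeDeltas, toSlot, fromSlot) are made up for illustration. It also assumes
a variable-length format of 7 bits per byte with the low-order group first and
the high bit meaning "more bytes follow", it takes the "first byte" of the int
to be the most significant one, and it uses a single byte store in place of
the 256 byte arrays.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

final class TermNumberListSketch {
    // Assumption: a slot whose most significant byte is 1 holds a 3-byte pointer.
    static final int POINTER_MARKER = 1;

    /** Encodes sorted term numbers (each >= 2) as variable-length deltas, without a terminator. */
    static byte[] encodeDeltas(int[] sortedTermNumbers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int termNumber : sortedTermNumbers) {
            if (termNumber < 2 || termNumber <= previous) {
                throw new IllegalArgumentException("term numbers must be >= 2 and strictly increasing");
            }
            int delta = termNumber - previous;
            while ((delta & ~0x7F) != 0) {          // more than 7 bits left
                out.write((delta & 0x7F) | 0x80);   // low 7 bits plus continuation flag
                delta >>>= 7;
            }
            out.write(delta);
            previous = termNumber;
        }
        return out.toByteArray();
    }

    /** Returns the int slot for one document: the list itself if it fits in 4 bytes, else a pointer. */
    static int toSlot(byte[] encoded, ByteArrayOutputStream store) {
        if (encoded.length <= 4) {
            int slot = 0;
            for (int i = 0; i < encoded.length; i++) {
                slot |= (encoded[i] & 0xFF) << (24 - 8 * i);
            }
            return slot;                            // unused low bytes stay 0 and act as the terminator
        }
        int pointer = store.size();
        if (pointer > 0xFFFFFF) {
            throw new IllegalStateException("pointer does not fit in 3 bytes");
        }
        store.write(encoded, 0, encoded.length);
        store.write(0);                             // a term number of 0 ends the list
        return (POINTER_MARKER << 24) | pointer;
    }

    /** Decodes the term numbers for one document from its slot. */
    static List<Integer> fromSlot(int slot, byte[] store) {
        if ((slot >>> 24) == POINTER_MARKER) {
            return decode(store, slot & 0xFFFFFF, store.length);
        }
        byte[] inline = {(byte) (slot >>> 24), (byte) (slot >>> 16), (byte) (slot >>> 8), (byte) slot};
        return decode(inline, 0, inline.length);
    }

    /** Reads deltas from bytes[start, end) until a 0 delta or the end of the range. */
    static List<Integer> decode(byte[] bytes, int start, int end) {
        List<Integer> termNumbers = new ArrayList<>();
        int pos = start;
        int termNumber = 0;
        while (pos < end) {
            int delta = 0;
            int shift = 0;
            int b;
            do {
                b = bytes[pos++] & 0xFF;
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            if (delta == 0) {
                break;
            }
            termNumber += delta;
            termNumbers.add(termNumber);
        }
        return termNumbers;
    }
}
```

Under these assumptions, the first delta is always at least 2, so the first
byte of an inline list can never be 0 or 1. That is what leaves 1 free to mark
a pointer and 0 free to mark the end of a list.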

> multi-valued faceting via un-inverted field
> -------------------------------------------
>
>                 Key: SOLR-475
>                 URL: https://issues.apache.org/jira/browse/SOLR-475
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Yonik Seeley
>         Attachments: UnInvertedField.java
>
>
> Facet multi-valued fields via a counting method (like the FieldCache method) 
> on an un-inverted representation of the field.  For each doc, look at its 
> terms and increment a count for that term.
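
The counting method in the description can be sketched roughly as follows.
This is only an illustration, not code from the attachment. The names
FacetCountSketch and countFacets are hypothetical, and termsPerDoc stands in
for the un-inverted field with each document's term numbers already decoded
into an int[].

```java
import java.util.BitSet;

final class FacetCountSketch {
    /**
     * Counts how many matching documents contain each term number.
     * termsPerDoc[doc] is assumed to hold the term numbers of document doc.
     */
    static int[] countFacets(int[][] termsPerDoc, BitSet matchingDocs, int numTerms) {
        int[] counts = new int[numTerms];
        // Visit only the documents in the result set.
        for (int doc = matchingDocs.nextSetBit(0); doc >= 0; doc = matchingDocs.nextSetBit(doc + 1)) {
            for (int termNumber : termsPerDoc[doc]) {
                counts[termNumber]++;   // one increment per (doc, term) pair
            }
        }
        return counts;
    }
}
```

The cost is proportional to the number of term instances in the matching
documents, not to the number of distinct terms in the field.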

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
