[
https://issues.apache.org/jira/browse/SOLR-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yonik Seeley updated SOLR-475:
------------------------------
Attachment: UnInvertedField.java
Prototype attached.
This is completely untested code, and it is still missing the Solr interface
and caching.
The approach is described in the comments (cut-n-pasted here).
Any thoughts or comments on the approach?
I may not have time to work on this immediately (fix the bugs, add tests, hook
up to Solr, add caching of the un-inverted field, etc.), so additional
contributions in this direction are welcome!
{code}
/**
* Final form of the un-inverted field:
* Each document points to a list of term numbers that are contained in that
* document.
*
* Term numbers are in sorted order, and are encoded as variable-length
* deltas from the previous term number. Real term numbers start at 2 since
* 0 and 1 are reserved. A term number of 0 signals the end of the
* termNumber list.
*
* There is a single int[maxDoc()] which either contains a pointer into a
* byte[] for the termNumber lists, or directly contains the termNumber list
* if it fits in the 4 bytes of an integer. If the first byte in the integer
* is 1, the next 3 bytes are a pointer into a byte[] where the termNumber
* list starts.
*
* There are actually 256 byte arrays, to compensate for the fact that the
* pointers into the byte arrays are only 3 bytes long. The correct byte
* array for a document is a function of its id.
*
* To save space and speed up faceting, any term that matches enough
* documents will not be un-inverted... it will be skipped while building
* the un-inverted field structure, and will use a set intersection method
* during faceting.
*
* To further save memory, the terms (the actual string values) are not all
* stored in memory, but a TermIndex is used to convert term numbers to term
* values only for the terms needed after faceting has completed. Only every
* 128th term value is stored, along with its corresponding term number, and
* this is used as an index to find the closest term and iterate until the
* desired number is hit (very much like Lucene's own internal term index).
*/
{code}
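To make the encoding concrete, here is a minimal sketch of the term-number list format described in the comment above. It is not taken from the attached prototype. The class and method names are hypothetical, and two details are assumptions: the variable-length ints store the low 7 bits first with the high bit as a continuation flag, and the first byte of a list sits in the most significant byte of the packed int.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and byte orders are assumptions, not the prototype's.
class TermNumberListSketch {

    // Writes v as a variable-length int: 7 bits per byte, low bits first,
    // with the high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Encodes strictly increasing term numbers (each >= 2) as deltas from the
    // previous term number, followed by a 0 byte that ends the list.
    static byte[] encode(int[] sortedTermNums) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int t : sortedTermNums) {
            if (t < 2 || t <= prev) {
                throw new IllegalArgumentException("term numbers must be >= 2 and increasing");
            }
            writeVInt(out, t - prev);
            prev = t;
        }
        out.write(0); // terminator
        return out.toByteArray();
    }

    // Decodes term numbers starting at pos until a 0 delta or the end of buf.
    static List<Integer> decode(byte[] buf, int pos) {
        List<Integer> result = new ArrayList<>();
        int termNum = 0;
        while (pos < buf.length) {
            int delta = 0;
            int shift = 0;
            int b;
            do {
                b = buf[pos++] & 0xFF;
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            if (delta == 0) {
                break; // end-of-list marker
            }
            termNum += delta;
            result.add(termNum);
        }
        return result;
    }

    // True if the list (without its terminator) fits in the 4 bytes of an int.
    static boolean fitsInline(byte[] encoded) {
        return encoded.length - 1 <= 4;
    }

    // Packs up to 4 list bytes into one int; unused trailing bytes stay 0.
    static int packInline(byte[] encoded) {
        int packed = 0;
        for (int i = 0; i < 4; i++) {
            int b = i < encoded.length ? (encoded[i] & 0xFF) : 0;
            packed = (packed << 8) | b;
        }
        return packed;
    }

    // Marks the int as a pointer: first byte is 1, next 3 bytes hold the offset.
    static int packPointer(int offset) {
        if (offset < 0 || offset > 0xFFFFFF) {
            throw new IllegalArgumentException("offset must fit in 3 bytes");
        }
        return (1 << 24) | offset;
    }

    // Splits a packed int back into its 4 bytes, most significant byte first.
    static byte[] unpack(int packed) {
        return new byte[] {
            (byte) (packed >>> 24), (byte) (packed >>> 16), (byte) (packed >>> 8), (byte) packed
        };
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new int[] {2, 5, 300});
        int slot = fitsInline(encoded) ? packInline(encoded) : packPointer(0);
        System.out.println(decode(unpack(slot), 0));
    }
}
```

Under these assumptions the first byte of a non-empty inline list is always 2 or greater, because the first delta is the first term number itself. That leaves 0 (empty list) and 1 (pointer) unambiguous as first-byte values.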
> multi-valued faceting via un-inverted field
> -------------------------------------------
>
> Key: SOLR-475
> URL: https://issues.apache.org/jira/browse/SOLR-475
> Project: Solr
> Issue Type: New Feature
> Reporter: Yonik Seeley
> Attachments: UnInvertedField.java
>
>
> Facet multi-valued fields via a counting method (like the FieldCache method)
> on an un-inverted representation of the field. For each doc, look at its
> terms and increment a count for that term.
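
The counting method in the issue summary can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the un-inverted field is a plain int[][] (document id to term numbers) instead of the packed byte format, term numbers are 0-based array indices, and the matching documents arrive as a java.util.BitSet. The class and method names are hypothetical.

```java
import java.util.BitSet;

// Hypothetical sketch of counting facets over an un-inverted field.
class UninvertedCountSketch {

    // termNumsByDoc[doc] holds the term numbers contained in that document.
    // Returns counts[termNum] = number of matching documents containing the term.
    static int[] countFacets(int[][] termNumsByDoc, BitSet matchingDocs, int numTerms) {
        int[] counts = new int[numTerms];
        for (int doc = matchingDocs.nextSetBit(0);
             doc >= 0 && doc < termNumsByDoc.length;
             doc = matchingDocs.nextSetBit(doc + 1)) {
            // For each matching doc, look at its terms and increment each count.
            for (int termNum : termNumsByDoc[doc]) {
                counts[termNum]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] termNumsByDoc = { {0, 2}, {1}, {0, 1} };
        BitSet matching = new BitSet();
        matching.set(0);
        matching.set(2);
        int[] counts = countFacets(termNumsByDoc, matching, 3);
        System.out.println(java.util.Arrays.toString(counts));
    }
}
```

The cost is proportional to the number of term occurrences in the matching documents, not to the number of distinct terms, which is the appeal of this approach for fields with many unique values.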