[jira] Commented: (SOLR-106) new facet params: facet.sort, facet.mincount, facet.offset

J.J. Larrea (JIRA) Mon, 15 Jan 2007 00:57:48 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464698
 ]


J.J. Larrea commented on SOLR-106:
----------------------------------

Case for Facet Count Caching: Paging through the hitlist (as well as paging 
through the facet list).  In some cases it appears that generating the facet 
counts takes much longer than generating the hitlist.  And that's certainly the 
case when the hitlist is retrieved from cache.

Case for Facet Paging:  The UI design I'm doing back-end for has a list of 
facets with 5 top values each, and a "More..." link when there are indeed more 
than 5 facet values.  Traversing that link is supposed to show a page with all 
facet values which fit, and Prev and Next paging buttons to access those which 
don't.  This browser shows counts and can be sorted by count but by default is 
sorted alphabetically by term.  Next to each term is a checkbox; after browsing 
and checking, a button returns to the hitlist but adds a big OR of the checked 
terms as an fq. So for example if a user searches and gets 437 hits with 
rutabaga in the title, having 264 unique author names, they might want to 
browse the list looking for friends.  Then after browsing and checking they can 
see a hitlist of all articles written by friends with rutabaga in the title.

I don't have any idea what proportion of facet queries would have offset > 0, 
i.e. where the user has moved to the next page, but I assume it's not rare.

It occurs to me that facet.limit should NOT do double-duty for paging: In a 
world where facet counts are cached, facet.limit should continue to play its 
current role, and limit the number of ranked values that make it into the 
BoundedTreeSet and thus the cache.  Then facet.offset and facet.count could be 
used to return a subset.  facet.limit==0 --> no limit, but can still be paged.
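
To illustrate the split between facet.limit and facet.offset/facet.count, here 
is a minimal Java sketch of paging over an already-cached ranked list.  
FacetPager and page(...) are hypothetical names, not existing Solr code, and 
treating a negative count as "everything from offset on" is my assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr: returns one page of an
// already-cached, already-ranked list of facet values.
final class FacetPager {
    private FacetPager() {}

    /**
     * Returns at most count entries starting at offset.
     * Assumption: a negative count means "no limit from offset on".
     * Offsets past the end yield an empty page rather than an error.
     */
    static <T> List<T> page(List<T> cached, int offset, int count) {
        int from = Math.min(Math.max(offset, 0), cached.size());
        int to = count < 0
                ? cached.size()
                : (int) Math.min((long) from + count, (long) cached.size());
        // Copy so callers cannot modify the cached list through the page.
        return new ArrayList<>(cached.subList(from, to));
    }
}
```

With something like this, facet.limit would only bound what is computed and 
cached, while each page request is a cheap slice of the cached values.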

Case for pulling response generation out of getFieldCacheCounts and 
getFacetTermEnumCounts:  I (truly) have a 37 million document index which I 
need to facet on Author, of which there are millions.  The TermEnum algorithm 
is clearly unsuited, and the FieldCache algorithm requires an inordinate amount 
of memory; I had to disable it.  So rather than tell management "can't be 
done", I think I need to plug in at least one more algorithm, e.g. using 
TermFreqVectors, to SimpleFacets.  Would love not to have to replicate the 
response generation code.

Or the sorting code.  Just had an idea:  It would be even nicer if the counting 
logic could be passed some object, say an implementation of TermCountRecorder, 
which has an add(String term, int count) method.
  - That object would encapsulate and isolate the generation of CountPair 
objects, the filtering for mincount, and whatever varieties of sorting are 
supported.
  - Rather than have one object with multiple pathways e.g. for term vs. count 
vs. no sorting, a static factory method could take the field, sort, and 
mincount arguments and return an anonymous implementation based on a List or a 
TreeSet or whatever.
  - The factory could also be told whether the counting logic guarantees adding 
terms in term (index) order, and if not but if term order were requested it 
could return an implementation which sorts by term text, otherwise a simple 
List.
  - It could be the object that gets cached for that query for that field.
  - It could have a generateResponse(offset, count) method which generates the 
<list name="<facetfield>"> element.
  - It could optimize memory when multiple TermCountRecorders corresponding to 
different queries are cached for a field, by maintaining a single WeakHashMap 
of term strings for the field, so each TermCountRecorder with the same term has 
a pointer to the same String object -- essentially like String.intern() but the 
scope is the field and the master value would disappear once all cached 
TermCountRecorders referencing it disappear.
  - It would make life much easier for a faceting approach where rather than 
iterating field->document it might be more efficient to iterate document->field 
(e.g. TermFreqVectors?): A TermCountRecorder could be allocated for each 
faceting field using that algorithm and have add(...) called in a round-robin 
fashion as documents are iterated. At the end all could be added to the cache 
and, whether added or retrieved, would have generateResponse called. 
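
A rough Java sketch of the recorder idea, to make the shape concrete.  Only 
TermCountRecorder, add(String term, int count) and generateResponse(offset, 
count) come from the proposal above; the factory class, the Entry holder and 
the Map return type are hypothetical stand-ins (a real version would build the 
Solr response structure instead), and the per-field string sharing is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// The counting logic only ever sees this interface.
interface TermCountRecorder {
    void add(String term, int count);

    // Returns term -> count in response order; negative count = no limit
    // (assumption for this sketch).
    Map<String, Integer> generateResponse(int offset, int count);
}

// Hypothetical factory: picks an implementation from the sort and
// mincount arguments.  The field argument is accepted but unused here.
final class TermCountRecorders {
    private TermCountRecorders() {}

    static TermCountRecorder create(String field, boolean sortByCount, int mincount) {
        return new ListRecorder(sortByCount, mincount);
    }

    // Stand-in for a CountPair-like holder.
    private static final class Entry {
        final String term;
        final int count;

        Entry(String term, int count) {
            this.term = term;
            this.count = count;
        }
    }

    private static final class ListRecorder implements TermCountRecorder {
        private final List<Entry> entries = new ArrayList<>();
        private final boolean sortByCount;
        private final int mincount;

        ListRecorder(boolean sortByCount, int mincount) {
            this.sortByCount = sortByCount;
            this.mincount = mincount;
        }

        @Override
        public void add(String term, int count) {
            // mincount filtering happens here, not in the counting code.
            if (count >= mincount) {
                entries.add(new Entry(term, count));
            }
        }

        @Override
        public Map<String, Integer> generateResponse(int offset, int count) {
            List<Entry> sorted = new ArrayList<>(entries);
            Comparator<Entry> byTerm = Comparator.comparing(e -> e.term);
            // Count descending with term as tie-breaker, or plain term order.
            sorted.sort(sortByCount
                    ? Comparator.comparingInt((Entry e) -> e.count).reversed().thenComparing(byTerm)
                    : byTerm);
            int from = Math.min(Math.max(offset, 0), sorted.size());
            int to = count < 0
                    ? sorted.size()
                    : (int) Math.min((long) from + count, (long) sorted.size());
            Map<String, Integer> page = new LinkedHashMap<>();
            for (Entry e : sorted.subList(from, to)) {
                page.put(e.term, e.count);
            }
            return page;
        }
    }
}
```

This version sorts on every generateResponse call to keep the sketch short; an 
implementation meant for caching would more likely sort once, or keep a 
TreeSet, as described in the bullets above.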


> new facet params: facet.sort, facet.mincount, facet.offset
> ----------------------------------------------------------
>
>                 Key: SOLR-106
>                 URL: https://issues.apache.org/jira/browse/SOLR-106
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Yonik Seeley
>         Attachments: facet_params.patch
>
>
> a couple of new facet params:
> facet lists become pageable with facet.offset, facet.limit  (idea from Erik)
> facet.sort explicitly specifies sort order (true for count descending, false 
> for natural index order)
> facet.mincount: minimum count for facets included in response (idea from JJ, 
> deprecate zeros)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
