[
https://issues.apache.org/jira/browse/SOLR-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464698
]
J.J. Larrea commented on SOLR-106:
----------------------------------
Case for Facet Count Caching: Paging through the hitlist (as well as paging
through the facet list). In some cases it appears that generating the facet
counts takes much longer than generating the hitlist. And that's certainly the
case when the hitlist is retrieved from cache.
Case for Facet Paging: The UI design I'm doing back-end for has a list of
facets with 5 top values each, and a "More..." link when there are indeed more
than 5 facet values. Traversing that link is supposed to show a page with all
facet values which fit, and Prev and Next paging buttons to access those which
don't. This browser shows counts and can be sorted by count but by default is
sorted alphabetically by term. Next to each term is a checkbox; after browsing
and checking, a button returns to the hitlist but adds a big OR of the checked
terms as an fq. So for example if a user searches and gets 437 hits with
rutabaga in the title, having 264 unique author names, they might want to
browse the list looking for friends. Then after browsing and checking they can
see a hitlist of all articles written by friends with rutabaga in the title.
I don't have any idea what proportion of facet queries would have offset >
0, i.e. where the user has moved to the next page, but I assume it's not rare.
It occurs to me that facet.limit should NOT do double-duty for paging: In a
world where facet counts are cached, facet.limit should continue to play its
current role, and limit the number of ranked values that make it into the
BoundedTreeSet and thus the cache. Then facet.offset and facet.count could be
used to return a subset. facet.limit==0 --> no limit, but can still be paged.
Case for pulling response generation out of getFieldCacheCounts and
getFacetTermEnumCounts: I (truly) have a 37 million document index which I
need to facet on Author, of which there are millions. The TermEnum algorithm
is clearly unsuited, and the FieldCache algorithm requires an inordinate amount
of memory; I had to disable it. So rather than tell management "can't be
done", I think I need to plug in at least one more algorithm, e.g. using
TermFreqVectors, to SimpleFacets. Would love not to have to replicate the
response generation code.
Or the sorting code. Just had an idea: It would be even nicer if the counting
logic could be passed some object, say an implementation of TermCountRecorder,
which has an add(String term, int count) method.
- That object would encapsulate and isolate the generation of CountPair
objects, the filtering for mincount, and whatever varieties of sorting are
supported.
- Rather than have one object with multiple pathways e.g. for term vs. count
vs. no sorting, a static factory method could take the field, sort, and
mincount arguments and return an anonymous implementation based on a List or a
TreeSet or whatever.
- The factory could also be told whether the counting logic guarantees adding
terms in term (index) order, and if not but if term order were requested it
could return an implementation which sorts by term text, otherwise a simple
List.
- It could be the object that gets cached for that query for that field.
- It could have a generateResponse(offset, count) method which generates the
<list name="<facetfield>">
- It could optimize memory when multiple TermCountRecorders corresponding to
different queries are cached for a field, by maintaining a single WeakHashMap
of term strings for the field, so each TermCountRecorder with the same term has
a pointer to the same String object -- essentially like String.intern() but the
scope is the field and the master value would disappear once all cached
TermCountRecorders referencing it disappear.
- It would make life much easier for a faceting approach where rather than
iterating field->document it might be more efficient to iterate document->field
(e.g. TermFreqVectors?): A TermCountRecorder could be allocated for each
faceting field using that algorithm and have add(...) called in a round-robin
fashion as documents are iterated. At the end all could be added to the cache
and, whether added or retrieved, would have generateResponse called.
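A rough sketch of the idea in the list above, under several assumptions: only add(String term, int count) and generateResponse(offset, count) come from the description; the TermCountRecorders factory class, its boolean arguments, and the "term=count" string output are placeholders made up for illustration (a real version would build the response list and take the field as well). Term order here is plain String order, which may not match index order exactly.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical interface: the counting logic only ever calls add(). */
interface TermCountRecorder {
    void add(String term, int count);

    /** Stand-in for response generation: returns "term=count" entries. */
    List<String> generateResponse(int offset, int count);
}

final class TermCountRecorders {
    private TermCountRecorders() {}

    private static final class Entry {
        final String term;
        final int count;
        Entry(String term, int count) { this.term = term; this.count = count; }
    }

    /**
     * Factory choosing an implementation from the arguments.
     * sortByCount: true = count descending (ties broken by term), false = term order.
     * addsInTermOrder: whether the caller guarantees add() is called in term order.
     */
    static TermCountRecorder create(boolean sortByCount, int minCount,
                                    boolean addsInTermOrder) {
        final Comparator<Entry> order;
        if (sortByCount) {
            order = Comparator.comparingInt((Entry e) -> e.count).reversed()
                    .thenComparing((Entry e) -> e.term);
        } else if (!addsInTermOrder) {
            order = Comparator.comparing((Entry e) -> e.term);
        } else {
            order = null; // insertion order is already term order: a simple List
        }
        return new TermCountRecorder() {
            private final List<Entry> entries = new ArrayList<>();
            private boolean sorted = (order == null);

            @Override
            public void add(String term, int count) {
                if (count < minCount) {
                    return; // mincount filtering lives here, not in the counter
                }
                entries.add(new Entry(term, count));
                sorted = (order == null);
            }

            @Override
            public List<String> generateResponse(int offset, int count) {
                if (!sorted) {
                    entries.sort(order); // sort lazily, once per batch of adds
                    sorted = true;
                }
                int from = Math.min(Math.max(offset, 0), entries.size());
                int to = (int) Math.min((long) from + Math.max(count, 0),
                                        entries.size());
                List<String> out = new ArrayList<>();
                for (Entry e : entries.subList(from, to)) {
                    out.add(e.term + "=" + e.count);
                }
                return out;
            }
        };
    }
}
```

A counting algorithm, whatever it iterates over, would then only need a recorder per faceting field and a call to add(...) for each term it finishes counting; the same recorder instance could be cached and asked for any page later.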
> new facet params: facet.sort, facet.mincount, facet.offset
> ----------------------------------------------------------
>
> Key: SOLR-106
> URL: https://issues.apache.org/jira/browse/SOLR-106
> Project: Solr
> Issue Type: Improvement
> Components: search
> Reporter: Yonik Seeley
> Attachments: facet_params.patch
>
>
> a couple of new facet params:
> facet lists become pageable with facet.offset, facet.limit (idea from Erik)
> facet.sort explicitly specifies sort order (true for count descending, false
> for natural index order)
> facet.mincount: minimum count for facets included in response (idea from JJ,
> deprecate zeros)