Varun Thacker created SOLR-12820:
------------------------------------

             Summary: Auto pick method:dvhash based on thresholds
                 Key: SOLR-12820
                 URL: https://issues.apache.org/jira/browse/SOLR-12820
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Facet Module
            Reporter: Varun Thacker
I worked with two users last week where explicitly setting method:dvhash improved faceting speed drastically. The common theme in both use-cases was: one collection hosting data for multiple users, where we always filter documents down to a single user (thereby limiting the number of documents drastically) and then perform a complex nested JSON facet. Both use-cases fit the criterion that [~yo...@apache.org] mentioned on SOLR-9142:

{quote}
faceting on a string field with a high cardinality compared to its domain is less efficient than it could be.
{quote}

And DVHASH was the perfect optimization for these use-cases.

One of the use-cases goes through the facet stream expression, which doesn't expose the method param. We could expose the method param on the facet stream, but I feel the better approach to this problem would be to address this TODO within the JSON Facet Module code:

{code:java}
if (mincount > 0 && prefix == null && (ntype != null || method == FacetMethod.DVHASH)) {
  // TODO can we auto-pick for strings when term cardinality is much greater than DocSet cardinality?
  // or if we don't know cardinality but DocSet size is very small
  return new FacetFieldProcessorByHashDV(fcontext, this, sf);
}
{code}

I thought about this a little, and this is the approach I'm currently considering:

{code:java}
int matchingDocs = fcontext.base.size();
int totalDocs = fcontext.searcher.getIndexReader().maxDoc();
// If matchingDocs is close to totalDocs then we aren't filtering many documents,
// so the array approach would probably beat the dvhash approach.
// Computing the cardinality of matchingDocs would be expensive, and at index
// time we only have per-segment (not global) term cardinality, so the number
// of matches serves as a cheap alternate heuristic.
if (matchingDocs < totalDocs * 0.01) { // threshold is a strawman and would need tuning
  return new FacetFieldProcessorByHashDV(fcontext, this, sf);
}
{code}

Any thoughts on whether this approach makes sense? It could be that I'm partial to it just because both users I worked with last week fell into this category.

cc [~dsmiley] [~joel.bernstein]
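
For context, the explicit opt-in that helped both users looks roughly like this as a JSON facet request posted to the /query endpoint; the collection filter, field names, and nested stat below are made up for illustration:

{code:json}
{
  "query": "user_id:u12345",
  "facet": {
    "top_categories": {
      "type": "terms",
      "field": "category_s",
      "method": "dvhash",
      "facet": { "avg_price": "avg(price_d)" }
    }
  }
}
{code}

With the auto-pick in place, requests like this would get the hash-based processor without the explicit method override, and the facet stream expression would benefit without needing a new param.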