[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915650#comment-13915650 ]
Rob Audenaerde commented on LUCENE-5476:
----------------------------------------

I'm currently experimenting with this. To increase the speed, it seems logical to me that the {{FacetsCollector}} needs to return fewer hits. I have a slightly modified version that I will attach.

It uses a sampling technique that divides the total hits into 'bins' of a given size and takes one sample from each bin. I have implemented it by keeping that one sample as a 'hit' of the search if it was a hit, and clearing all other bits. See the attached file.

By using this technique, the distribution of the results should not be altered too much, while the performance gains can be significant. A quick test revealed that for 1M results and a bin size of 500, the sampled version is twice as fast.

The problem is that the resulting {{FacetResult}}s are not correct, as the number of hits is reduced. For counting facets this can be fixed afterwards by multiplying by the bin size, but for other facets it will be more difficult or will require other approaches.

What do you think?

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?
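
For readers without the attachment, the bin-sampling idea from the comment can be sketched roughly as follows. This is not the attached patch: it uses plain {{java.util.BitSet}} instead of Lucene's own classes, the names {{BinSampler}}, {{sample}} and {{estimateCount}} are hypothetical, and it assumes the bins are consecutive ranges of the doc ID space, which is only one reading of the description.

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical helper, not part of Lucene: illustrates bin sampling on a hit bitset.
final class BinSampler {

    /**
     * Returns a sampled copy of {@code hits}. The range [0, maxDoc) is split into
     * consecutive bins of {@code binSize} positions. One position per bin is picked
     * at random; its bit is kept only if it was set in {@code hits}, and every
     * other bit in the bin is left cleared.
     */
    static BitSet sample(BitSet hits, int maxDoc, int binSize, Random random) {
        if (binSize <= 0) {
            throw new IllegalArgumentException("binSize must be positive");
        }
        BitSet sampled = new BitSet(maxDoc);
        for (int binStart = 0; binStart < maxDoc; binStart += binSize) {
            // The last bin may be shorter than binSize, so its positions are
            // sampled at a slightly higher rate (a small bias in this sketch).
            int binLength = Math.min(binSize, maxDoc - binStart);
            int pick = binStart + random.nextInt(binLength);
            if (hits.get(pick)) {
                sampled.set(pick);
            }
        }
        return sampled;
    }

    /** Scales a count measured on the sample back up by the bin size. */
    static long estimateCount(long sampledCount, int binSize) {
        return sampledCount * binSize;
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        int binSize = 500;
        Random random = new Random();

        // Assumed test data: roughly every third document is a hit.
        BitSet hits = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (random.nextInt(3) == 0) {
                hits.set(doc);
            }
        }

        BitSet sampled = sample(hits, maxDoc, binSize, random);
        System.out.println("actual hits:    " + hits.cardinality());
        System.out.println("sampled hits:   " + sampled.cardinality());
        System.out.println("estimated hits: " + estimateCount(sampled.cardinality(), binSize));
    }
}
```

The scaled-up value is only an estimate, and it gets noisier for facet values with few hits, since a rare value may not land in the sample at all. As the comment notes, this simple multiplication only makes sense for counting facets.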