[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13930111#comment-13930111 ]
Shai Erera commented on LUCENE-5476: ------------------------------------ * Javadocs: ** From the class javadocs: _Note: the original set of hits will be available as documents..._ I think we should just write "the original set of hits can be retrieved from getOriginal.." - I don't want anyone to be confused with the wording "will be available as documents". ** Can you make NOT_CALCULATED static? ** Typo: samplingRato ** randomSeed: I think it should say "if 0..." not "if null". * getMatchingDocs -- can you move the totalHits calculation to {{getTotalHits()}}? And then call it only {{if (sampledDocs==null)}}? * needsSampling -- I know it was suggested to make it protected for inheritance purposes, but as it looks now, all members are private so I don't see how can one extend the class only to override this method (e.g. he won't have access to sampleSize even). Maybe we should keep it private and when someone asks to extend, we know better what needs to be protected? For instance, I think it's also important that we allow overriding createSampledDocList, but for now let's keep it simple. * I think that we need to document somewhere (maybe in class javadocs) that the returned sampled docs may include empty MatchingDocs instances (i.e. when no docs were "sampled" from a segment). Just so that we don't surprise anyone with empty instances. If people work w/ MatchingDocs as they should, by obtaining an iterator, it shouldn't be a problem, but better document it explicitly. ** Perhaps we should also say something about the returned MatchingDocs.totalHits, which are the original totalHits and not the sampled set size? * About carryOver: ** Doesn't it handle the TODO at the beginning of createSample? ** Why does it need to always include the first document of the first segment? Rather you could initialize it to -1 and if {{carryOver == -1}} set it to a random index within that bin? Seems more "random" to me. * amortizedFacetCounts: ** It's a matter of style so I don't mind if you keep this like you wrote, but I would write it as {{if (!needsSampling() || res == null)}} and then the rest of the method isn't indented. Your call. ** I think it's better to allocate {{childPath}} so that the first element is already the dimension. See what FacetsConfig.pathToString currently does. Currently it results in re-allocating the String[] for every label. Then you can call the pathToString variant which takes the array and the length. *** Separately, would be good if FacetsConfig had an appendElement(Appendable, int idx) to append a specific element to the appendable. Then you could use a StringBuilder once, and skip the encoding done for the entire path except the last element. ** Perhaps we should cap this {{(int) (res.value.doubleValue() / this.samplingRate)}} by e.g. the number of non-deleted documents? * About test: ** This comment is wrong: _//there will be 5000 hits._ ** Why do you initialize your own Random? It's impossible to debug the test like that. You should use {{random()}} instead. ** This comment is wrong? _//because a randomindexer is used, the results will vary each time._ -- it's not because the random indexer, but because of random sampling no? ** Would you mind formatting the test code? I see empty lines, curly braces that are sometimes open in the next line etc. I can do it too before I commit it. * I see that createSample still has the scores array bug -- it returns the original scores array, irrespective of the sampled docs. Before you fix it, can you please add a testcase which covers scores and fails? I think we're very close! > Facet sampling > -------------- > > Key: LUCENE-5476 > URL: https://issues.apache.org/jira/browse/LUCENE-5476 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Rob Audenaerde > Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, > LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, > LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, > SamplingFacetsCollector.java > > > With LUCENE-5339 facet sampling disappeared. > When trying to display facet counts on large datasets (>10M documents) > counting facets is rather expensive, as all the hits are collected and > processed. > Sampling greatly reduced this and thus provided a nice speedup. Could it be > brought back? -- This message was sent by Atlassian JIRA (v6.2#6252) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org