[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530144#comment-17530144 ]
Michael Sokolov commented on LUCENE-10541:
------------------------------------------

I think the probability of choosing a particular item is based on the length of the item just prior (ignoring the skip points; I don't understand what Mike means by those!). But if there is no correlation between length(line N) and length(line N+1), we can probably ignore that. In other words, the item following the longest line L is the most likely item to be chosen. However, its expected length is no different from the expected length of all the lines, right? In that case I don't think the seek-and-scan method changes the probabilities at all. So I think we can simply count the lines of a given length (or above some threshold) and divide by the total number of lines to get P(line length).

> What to do about massive terms in our Wikipedia EN LineFileDocs?
> ----------------------------------------------------------------
>
>                 Key: LUCENE-10541
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10541
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Spinoff from this fun build failure that [~dweiss] root caused:
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of
> the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's
> ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the
> test.
> It's crazy that it took so long for Lucene's randomized tests to discover
> this too-massive term in Lucene's nightly benchmarks. It's like searching
> for Nessie, or
> [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple
> options: fix this test to not use {{LineFileDocs}}, remove all "massive"
> terms from all tests (nightly and git) {{LineFileDocs}}, fix
> {{MockTokenizer}} to trim such ridiculous terms (I think this is the best
> option?), ...
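
The argument in Michael Sokolov's comment can be checked numerically. The sketch below is not taken from Lucene; it assumes a simplified seek-and-scan model in which a uniformly random byte offset is chosen, the scan runs to the next newline, and the line after it is returned (wrapping from the last line to the first). The skip points mentioned in the comment are not modeled. The corpus, the function names, and the 30,000-character threshold are all hypothetical, picked only to mimic a file with a few massive lines whose lengths are independent of their neighbours.

```python
import random


def make_line_lengths(n, rng, massive_fraction=0.01):
    """Hypothetical corpus: mostly short lines plus a few massive ones,
    with every length drawn independently of its neighbours."""
    return [
        rng.randint(30_000, 40_000) if rng.random() < massive_fraction
        else rng.randint(100, 2_000)
        for _ in range(n)
    ]


def seek_and_scan_probs(lengths):
    """Exact P(line i is chosen) under the assumed model: the random offset
    lands somewhere in line i-1 (newline included) and the scan returns
    the line that follows it."""
    sizes = [n + 1 for n in lengths]  # +1 for the trailing newline
    total = sum(sizes)
    # sizes[i - 1] with i == 0 wraps to the last line (an assumption).
    return [sizes[i - 1] / total for i in range(len(lengths))]


def p_long_by_count(lengths, threshold):
    """Fraction of lines at or above the threshold (simple counting)."""
    return sum(1 for n in lengths if n >= threshold) / len(lengths)


def p_long_by_seek_and_scan(lengths, threshold):
    """Probability that seek-and-scan returns a line at or above the threshold."""
    probs = seek_and_scan_probs(lengths)
    return sum(p for n, p in zip(lengths, probs) if n >= threshold)


if __name__ == "__main__":
    rng = random.Random(42)
    lengths = make_line_lengths(200_000, rng)
    print("by count:        ", p_long_by_count(lengths, 30_000))
    print("by seek-and-scan:", p_long_by_seek_and_scan(lengths, 30_000))
```

With independent lengths the two estimates should land close to each other, which is the point made in the comment. The caveat about correlation matters, though: if massive lines cluster together (for example, lengths `[10, 1000, 1000, 1000, 10, 10, 10, 10]`), a massive line is often preceded by another massive line, and the seek-and-scan probability rises above the simple count.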