[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529416#comment-17529416 ]
Michael McCandless commented on LUCENE-10541:
---------------------------------------------

{quote}enwiki lines contains 2 million lines. It'd be nice to calculate the probability of any of the k faulty (long-term) lines being drawn in n tries and distribute it over time - this would address Mike's question about why it took so long to discover them. :)
{quote}

LOL this is indeed fun to work out. There are a couple of wrinkles to modeling this, though :)

First, it's not really "randomly picking N lines for each test run"; it's seeking to one spot and then reading N sequential lines from there. Assuming the file is well shuffled (I think it is), this probably doesn't change the result versus picking N random lines, since those N sequential lines were already randomized.

Second, the way the seeking works is to pick a random spot (byte location), seek there, scan to the end of that line, and start reading forwards from the following line. Many of the lines are very short, some are longer, and even fewer are truly massive (and might have an evil Darth Term in them). One wrinkle here is that if you seek into the middle of one of the Darth Terms, you then scan to end of line and skip that large term entirely. Given that these massive lines span more bytes, it seems more likely that the seeking will land inside, and therefore skip, the Darth Term lines.

Finally, there is one more crazy wrinkle: the nightly LineFileDocs is no longer a simple text file – it also has a pre-chunked "index" so test randomization can jump to one of the pre-computed known skip points. Maybe that chunking introduced some sort of bias?

Fun to think about the Darth Terms!!

> What to do about massive terms in our Wikipedia EN LineFileDocs?
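The two wrinkles above can be sketched numerically. This is a toy model with hypothetical numbers (the real k, window size, and run count are unknown): it estimates the per-run and cumulative probability of drawing a Darth Term line, then simulates the byte-position seek to show that landing inside a huge line skips that very line.

```python
import random

# Toy model of the quoted question: N shuffled lines, k "Darth Term"
# lines, each test run reads a window of w consecutive lines.
# k, w, and r are hypothetical; only N = 2 million comes from the comment.
N = 2_000_000     # total lines in enwiki lines (from the comment)
k = 1             # assumed count of massive-term lines (hypothetical)
w = 1000          # lines read per test run (hypothetical)
r = 365 * 5       # e.g. five years of nightly runs (hypothetical)

# In a well-shuffled file, a window of w consecutive lines behaves
# roughly like w independent draws:
p_run = 1 - (1 - k / N) ** w
p_ever = 1 - (1 - p_run) ** r
print(f"P(hit in one run)  ~ {p_run:.6f}")
print(f"P(hit in {r} runs) ~ {p_ever:.3f}")

# The seek bias: pick a random byte, scan to end of that line, start
# reading at the NEXT line. Landing anywhere inside a huge line means
# that huge line itself is skipped, so the Darth line is rarely the
# first line actually read.
lines = ["x" * 10] * 99 + ["x" * 10_000]   # index 99 is the "Darth" line
starts, pos = [], 0
for ln in lines:
    starts.append(pos)
    pos += len(ln) + 1                      # +1 for the newline
total = pos

first_read = [0] * len(lines)
rng = random.Random(42)
for _ in range(100_000):
    b = rng.randrange(total)
    # the line containing byte b is skipped; reading starts at the next one
    i = max(j for j, s in enumerate(starts) if s <= b)
    first_read[(i + 1) % len(lines)] += 1

print("times the Darth line was first read:", first_read[99])
print("times the line after it was first read:", first_read[0])
```

With these toy numbers the per-run hit probability is tiny (~0.05%), which is consistent with the failure taking years to surface, and the simulation shows the line *after* the Darth line starting the vast majority of reads while the Darth line itself almost never does.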
> ----------------------------------------------------------------
>
>                 Key: LUCENE-10541
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10541
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Spinoff from this fun build failure that [~dweiss] root caused:
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's ~32 KB limit) term, and IW threw an {{IllegalArgumentException}}, failing the test.
> It's crazy that it took so long for Lucene's randomized tests to discover this too-massive term in Lucene's nightly benchmarks. It's like searching for Nessie, or [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple options: fix this test to not use {{LineFileDocs}}, remove all "massive" terms from all tests' (nightly and git) {{LineFileDocs}}, fix {{MockTokenizer}} to trim such ridiculous terms (I think this is the best option?), ...

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
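The "trim such ridiculous terms" option from the issue description amounts to dropping any token whose UTF-8 encoding exceeds IndexWriter's per-term byte limit (the ~32 KB limit mentioned above). A minimal sketch of that idea, in Python rather than Lucene's actual Java MockTokenizer, purely for illustration:

```python
# Hypothetical sketch of the "trim ridiculous terms" option: drop any
# token whose UTF-8 encoding exceeds IndexWriter's single-term byte
# limit, so indexing never throws IllegalArgumentException.
MAX_TERM_BYTES = 32766  # IndexWriter.MAX_TERM_LENGTH, in UTF-8 bytes

def drop_massive_terms(tokens):
    """Yield only tokens small enough for IndexWriter to accept."""
    for tok in tokens:
        if len(tok.encode("utf-8")) <= MAX_TERM_BYTES:
            yield tok

# The over-long term is silently filtered out:
print(list(drop_massive_terms(["nessie", "x" * 40_000])))
```

An alternative design would truncate the token instead of dropping it; dropping better mirrors what MockTokenizer-level filtering would do, since a truncated Darth Term is still a meaningless blob.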