[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529416#comment-17529416 ]
Michael McCandless commented on LUCENE-10541:
---------------------------------------------

{quote}enwiki lines contains 2 million lines. It'd be nice to calculate the probability of any of the k faulty (long-term) lines being drawn in n tries and distribute it over time - this would address Mike's question about why it took so long to discover them. :)
{quote}

LOL this is indeed fun to work out. There are a couple of wrinkles to modeling this, though :)

First, it's not really "randomly picking N lines for each test run"; it's seeking to one spot and then reading N sequential lines from there. Assuming the file is well shuffled (I think it is), this probably doesn't change the result versus picking N random lines, since those N sequential lines were already randomized.

Second, the way the seeking works is to pick a random spot (byte location), seek there, scan to the end of that line, and start reading forwards from the following line. Many of the lines are very short, some are longer, and even fewer are truly massive (and might have an evil Darth Term in them). One wrinkle here is that if you seek into the middle of one of the Darth Terms, you then scan to end of line and skip that large term entirely. Given that these massive lines span more bytes, it seems more likely that the seeking will land inside, and therefore skip, the Darth Term lines.

Finally, there is one more crazy wrinkle: the nightly LineFileDocs is no longer a simple text file – it also has a pre-chunked "index" so test randomization can jump to one of the pre-computed known skip points. Maybe that chunking introduced some sort of bias?

Fun to think about the Darth Terms!!

> What to do about massive terms in our Wikipedia EN LineFileDocs?
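The two wrinkles above can be sketched numerically. This is a toy model with hypothetical numbers (the real k, window size, and run count are unknown): it estimates the per-run and cumulative probability of drawing a Darth Term line, then simulates the byte-position seek to show that landing inside a huge line skips that very line.

```python
import random

# Toy model of the quoted question: N shuffled lines, k "Darth Term"
# lines, each test run reads a window of w consecutive lines.
# k, w, and r are hypothetical; only N = 2 million comes from the comment.
N = 2_000_000     # total lines in enwiki lines (from the comment)
k = 1             # assumed count of massive-term lines (hypothetical)
w = 1000          # lines read per test run (hypothetical)
r = 365 * 5       # e.g. five years of nightly runs (hypothetical)

# In a well-shuffled file, a window of w consecutive lines behaves
# roughly like w independent draws:
p_run = 1 - (1 - k / N) ** w
p_ever = 1 - (1 - p_run) ** r
print(f"P(hit in one run)  ~ {p_run:.6f}")
print(f"P(hit in {r} runs) ~ {p_ever:.3f}")

# The seek bias: pick a random byte, scan to end of that line, start
# reading at the NEXT line. Landing anywhere inside a huge line means
# that huge line itself is skipped, so the Darth line is rarely the
# first line actually read.
lines = ["x" * 10] * 99 + ["x" * 10_000]   # index 99 is the "Darth" line
starts, pos = [], 0
for ln in lines:
    starts.append(pos)
    pos += len(ln) + 1                      # +1 for the newline
total = pos

first_read = [0] * len(lines)
rng = random.Random(42)
for _ in range(100_000):
    b = rng.randrange(total)
    # the line containing byte b is skipped; reading starts at the next one
    i = max(j for j, s in enumerate(starts) if s <= b)
    first_read[(i + 1) % len(lines)] += 1

print("times the Darth line was first read:", first_read[99])
print("times the line after it was first read:", first_read[0])
```

With these toy numbers the per-run hit probability is tiny (~0.05%), which is consistent with the failure taking years to surface, and the simulation shows the line *after* the Darth line starting the vast majority of reads while the Darth line itself almost never does.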
> ----------------------------------------------------------------
>
>                 Key: LUCENE-10541
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10541
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Spinoff from this fun build failure that [~dweiss] root caused:
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's ~32 KB limit) term, and IW threw an {{IllegalArgumentException}}, failing the test.
> It's crazy that it took so long for Lucene's randomized tests to discover this too-massive term in Lucene's nightly benchmarks. It's like searching for Nessie, or [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple options: fix this test to not use {{LineFileDocs}}, remove all "massive" terms from all tests' (nightly and git) {{LineFileDocs}}, fix {{MockTokenizer}} to trim such ridiculous terms (I think this is the best option?), ...

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
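The "trim such ridiculous terms" option from the issue description amounts to dropping any token whose UTF-8 encoding exceeds IndexWriter's per-term byte limit (the ~32 KB limit mentioned above). A minimal sketch of that idea, in Python rather than Lucene's actual Java MockTokenizer, purely for illustration:

```python
# Hypothetical sketch of the "trim ridiculous terms" option: drop any
# token whose UTF-8 encoding exceeds IndexWriter's single-term byte
# limit, so indexing never throws IllegalArgumentException.
MAX_TERM_BYTES = 32766  # IndexWriter.MAX_TERM_LENGTH, in UTF-8 bytes

def drop_massive_terms(tokens):
    """Yield only tokens small enough for IndexWriter to accept."""
    for tok in tokens:
        if len(tok.encode("utf-8")) <= MAX_TERM_BYTES:
            yield tok

# The over-long term is silently filtered out:
print(list(drop_massive_terms(["nessie", "x" * 40_000])))
```

An alternative design would truncate the token instead of dropping it; dropping better mirrors what MockTokenizer-level filtering would do, since a truncated Darth Term is still a meaningless blob.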