Michael McCandless created LUCENE-10541:
-------------------------------------------

             Summary: What to do about massive terms in our Wikipedia EN LineFileDocs?
                 Key: LUCENE-10541
                 URL: https://issues.apache.org/jira/browse/LUCENE-10541
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Michael McCandless


Spinoff from this fun build failure that [~dweiss] root-caused:
[https://lucene.markmail.org/thread/pculfuazll4oebra]

Thank you and sorry [~dweiss]!!

This test failure happened because the test case randomly indexed a chunk of 
the nightly (many GBs) LineFileDocs Wikipedia file that contained a massive 
term (larger than {{IndexWriter}}'s ~32 KB limit), so IW threw an 
{{IllegalArgumentException}}, failing the test.

It's crazy that it took so long for Lucene's randomized tests to discover this 
too-massive term in Lucene's nightly benchmarks.  It's like searching for 
Nessie, or 
[SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].

We need to prevent such false failures, somehow, and there are multiple 
options: fix this test to not use {{LineFileDocs}}, remove all "massive" 
terms from all test {{LineFileDocs}} files (nightly and git), fix 
{{MockTokenizer}} to trim such ridiculous terms (I think this is the best 
option?), ...
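
To make the "trim such ridiculous terms" option concrete, here is a minimal sketch in plain Java of what trimming could look like. It is not {{MockTokenizer}} code and uses no Lucene APIs: the class name {{TermTrimSketch}} and the method {{trimToMaxUtf8Bytes}} are hypothetical, and the limit constant is assumed to mirror {{IndexWriter.MAX_TERM_LENGTH}} (32766 bytes, the "~32 KB" above). It cuts a term at a code point boundary so that its UTF-8 encoding fits within the limit.

```java
import java.nio.charset.StandardCharsets;

public class TermTrimSketch {
    // Assumption: mirrors IndexWriter.MAX_TERM_LENGTH (32766 bytes of UTF-8).
    public static final int MAX_TERM_BYTES = 32766;

    /**
     * Hypothetical helper: returns the longest prefix of term whose UTF-8
     * encoding is at most maxBytes, never splitting a surrogate pair.
     */
    public static String trimToMaxUtf8Bytes(String term, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < term.length()) {
            int cp = term.codePointAt(i);
            // UTF-8 width of this code point (unpaired surrogates are
            // conservatively counted as 3 bytes).
            int cpBytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + cpBytes > maxBytes) {
                break; // adding this code point would exceed the limit
            }
            bytes += cpBytes;
            i += Character.charCount(cp);
        }
        return term.substring(0, i);
    }

    public static void main(String[] args) {
        // A 40,000-character ASCII "term", larger than the assumed limit.
        String huge = "a".repeat(40_000);
        String trimmed = trimToMaxUtf8Bytes(huge, MAX_TERM_BYTES);
        System.out.println(trimmed.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Whether trimming (as sketched) or dropping the over-long term entirely is the better behavior for tests is an open design choice; the sketch only illustrates the trimming variant.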



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
