Michael McCandless created LUCENE-10541:
-------------------------------------------
Summary: What to do about massive terms in our Wikipedia EN LineFileDocs?
Key: LUCENE-10541
URL: https://issues.apache.org/jira/browse/LUCENE-10541
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless

Spinoff from this fun build failure that [~dweiss] root caused: [https://lucene.markmail.org/thread/pculfuazll4oebra]

Thank you and sorry [~dweiss]!!

This test failure happened because the test case randomly indexed a chunk of the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the test.

It's crazy that it took so long for Lucene's randomized tests to discover this too-massive term in Lucene's nightly benchmarks. It's like searching for Nessie, or [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].

We need to prevent such false failures, somehow, and there are multiple options:
 * fix this test to not use {{LineFileDocs}}
 * remove all "massive" terms from all tests (nightly and git) {{LineFileDocs}}
 * fix {{MockTokenizer}} to trim such ridiculous terms (I think this is the best option?)
 * ...
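
To make the "trim such ridiculous terms" option a bit more concrete, here is a minimal standalone sketch of the trimming step only. It is not a patch to {{MockTokenizer}} and does not use any Lucene classes. The class name {{TermTrimmer}} and the method {{trimToUtf8Bytes}} are hypothetical, and the 32766 constant is an assumption meant to mirror {{IndexWriter.MAX_TERM_LENGTH}} (the "~32 KB" limit above, measured in UTF-8 bytes). The idea is to keep the longest prefix of a term whose UTF-8 encoding fits under the limit, without splitting a surrogate pair.

```java
import java.nio.charset.StandardCharsets;

public class TermTrimmer {
    // Assumption: mirrors IndexWriter.MAX_TERM_LENGTH (UTF-8 bytes); not read from Lucene here.
    public static final int MAX_TERM_BYTES = 32766;

    /**
     * Returns the longest prefix of term whose UTF-8 encoding is at most maxBytes long.
     * Walks code points (not chars) so a surrogate pair is never cut in half.
     */
    public static String trimToUtf8Bytes(String term, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < term.length()) {
            int cp = term.codePointAt(i);
            // UTF-8 width of this code point; a lone surrogate is counted as 3 bytes,
            // which is a conservative upper bound.
            int cpBytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + cpBytes > maxBytes) {
                break;
            }
            bytes += cpBytes;
            i += Character.charCount(cp);
        }
        return term.substring(0, i);
    }

    public static void main(String[] args) {
        // A 40,000-char ASCII "term", well over the assumed limit.
        String massive = "x".repeat(40_000);
        String trimmed = trimToUtf8Bytes(massive, MAX_TERM_BYTES);
        System.out.println(trimmed.getBytes(StandardCharsets.UTF_8).length); // prints 32766
    }
}
```

Counting bytes rather than chars matters because the limit applies to the encoded term: Wikipedia text is full of non-ASCII characters, so a term can be far fewer than 32766 chars and still be too big. Whether trimming belongs in the tokenizer, in a filter, or in {{LineFileDocs}} itself is exactly the open question of this issue; the sketch only shows the length check.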