[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529261#comment-17529261 ]

Dawid Weiss commented on LUCENE-10541:
--------------------------------------

Filed a PR at https://github.com/apache/lucene/pull/850. I picked the default 
from CharTokenizer.DEFAULT_MAX_WORD_LEN, although it can't be referenced 
directly (it isn't accessible from the test framework). I had to tweak the 
defaults in one or two failing tests that expected the tokenizer to return 
longer tokens, so a second set of eyes would be good.
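For illustration, here is a rough plain-Python sketch of the behavior that limit implies (an assumption on my part: like CharTokenizer, a token that exceeds the limit is chunked, with the remainder starting a new token rather than being dropped; the value 255 is CharTokenizer.DEFAULT_MAX_WORD_LEN):

```python
DEFAULT_MAX_WORD_LEN = 255  # value of CharTokenizer.DEFAULT_MAX_WORD_LEN

def tokenize(text, max_len=DEFAULT_MAX_WORD_LEN):
    """Whitespace-tokenize, chunking any run longer than max_len
    so the remainder becomes the next token (CharTokenizer-style)."""
    out = []
    for word in text.split():
        for i in range(0, len(word), max_len):
            out.append(word[i : i + max_len])
    return out
```

So a 600-character run would come out as tokens of 255, 255, and 90 characters rather than one massive term.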

The enwiki line file contains 2 million lines. It'd be nice to calculate the 
probability of any of the k faulty (long-term) lines being drawn in n tries and 
distribute it over time - this would address Mike's question about why it took 
so long to discover them. :)
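Back-of-the-envelope, assuming each draw picks a line uniformly at random with replacement (a simplifying model of how tests sample the file, not necessarily what LineFileDocs actually does):

```python
N = 2_000_000  # lines in the enwiki line file

def p_hit(k, n, total=N):
    # Probability that at least one of the k faulty lines is drawn
    # in n independent uniform draws (with replacement):
    # 1 minus the chance of missing all of them every time.
    return 1.0 - (1.0 - k / total) ** n
```

Under that model, with a single faulty line (k=1) it takes on the order of N draws to reach even a ~63% chance of ever hitting it, which would go some way toward explaining the long incubation time.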

> What to do about massive terms in our Wikipedia EN LineFileDocs?
> ----------------------------------------------------------------
>
>                 Key: LUCENE-10541
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10541
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>
> Spinoff from this fun build failure that [~dweiss] root caused: 
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of 
> the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's 
> ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the 
> test.
> It's crazy that it took so long for Lucene's randomized tests to discover 
> this too-massive term in Lucene's nightly benchmarks.  It's like searching 
> for Nessie, or 
> [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple 
> options: fix this test to not use {{LineFileDocs}}, remove all "massive" 
> terms from all tests (nightly and git) {{LineFileDocs}}, fix 
> {{MockTokenizer}} to trim such ridiculous terms (I think this is the best 
> option?), ...
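As a side note on the "remove all massive terms" option: the IndexWriter limit is on the encoded byte length of a term (IndexWriter.MAX_TERM_LENGTH, 32766 bytes), not the character count, so a scan of the line file would need to check UTF-8 bytes. A minimal sketch, assuming whitespace tokenization for the scan (the real analyzers differ):

```python
IW_MAX_TERM_LENGTH = 32766  # IndexWriter.MAX_TERM_LENGTH, in UTF-8 bytes

def is_massive(term: str) -> bool:
    # The limit applies to the encoded length, so multi-byte
    # characters trip it well before 32766 characters.
    return len(term.encode("utf-8")) > IW_MAX_TERM_LENGTH

def massive_terms(line: str):
    # Hypothetical helper: flag offending whitespace-separated terms.
    return [t for t in line.split() if is_massive(t)]
```

Non-ASCII text makes this easy to underestimate by eyeballing character counts alone.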



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
