[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?

Michael Sokolov (Jira) Fri, 29 Apr 2022 10:22:08 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530144#comment-17530144
 ]


Michael Sokolov commented on LUCENE-10541:
------------------------------------------

I think the probability of choosing a particular item is proportional to the 
length of the item just before it (ignoring the skip points; I don't understand 
what Mike means by those!). But if there is no correlation between 
length(line N) and length(line N+1), we can probably ignore that. In other 
words, the item following the longest line L is the most likely item to be 
chosen. However, its expected length is no different from the expected length 
of all the lines, right? In that case I don't think the seek-and-scan method 
changes the line-length probabilities at all. So I think we can simply count 
the lines of a given length (or above some threshold) and divide by the total 
number of lines to get P(line length).
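
To make the argument above concrete, here is a rough sketch in Python. It is
not how LineFileDocs is implemented; it assumes a simplified model in which we
seek to a uniformly random byte, scan forward to the start of the next line,
wrap around at end of file, and ignore skip points entirely. All function
names and the synthetic length distribution are made up for illustration.

```python
import random


def seek_and_scan_probs(lengths):
    """P(line i is chosen) under the simplified seek-and-scan model.

    Landing anywhere inside line i-1 selects line i, so the probability
    is len(line i-1) / total bytes. Wrapping at EOF is an assumption.
    """
    total = sum(lengths)
    n = len(lengths)
    return [lengths[(i - 1) % n] / total for i in range(n)]


def p_long_seek_and_scan(lengths, threshold):
    """P(chosen line has length >= threshold) under seek-and-scan."""
    probs = seek_and_scan_probs(lengths)
    return sum(p for p, length in zip(probs, lengths) if length >= threshold)


def p_long_by_counting(lengths, threshold):
    """The simple estimate: fraction of lines with length >= threshold."""
    return sum(1 for length in lengths if length >= threshold) / len(lengths)


def synthetic_lengths(n, seed=0, p_massive=0.01):
    """Hypothetical, independent line lengths: mostly short, a few massive."""
    rng = random.Random(seed)
    return [
        40_000 if rng.random() < p_massive else rng.randint(50, 500)
        for _ in range(n)
    ]


if __name__ == "__main__":
    lengths = synthetic_lengths(100_000)
    print("seek-and-scan:", p_long_seek_and_scan(lengths, 32_000))
    print("counting:     ", p_long_by_counting(lengths, 32_000))
```

When the lengths of adjacent lines are independent, as in the synthetic data
here, the two numbers should come out close to each other. They would diverge
if long lines tended to follow long lines, which is exactly the correlation
caveat above.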

> What to do about massive terms in our Wikipedia EN LineFileDocs?
> ----------------------------------------------------------------
>
>                 Key: LUCENE-10541
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10541
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Spinoff from this fun build failure that [~dweiss] root caused: 
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of 
> the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's 
> ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the 
> test.
> It's crazy that it took so long for Lucene's randomized tests to discover 
> this too-massive term in Lucene's nightly benchmarks.  It's like searching 
> for Nessie, or 
> [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple 
> options: fix this test to not use {{LineFileDocs}}, remove all "massive" 
> terms from all tests (nightly and git) {{LineFileDocs}}, fix 
> {{MockTokenizer}} to trim such ridiculous terms (I think this is the best 
> option?), ...



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
