[jira] [Commented] (LUCENE-3738) Be consistent about negative vInt/vLong

Robert Muir (Commented) (JIRA) Tue, 31 Jan 2012 15:36:23 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197383#comment-13197383
 ]


Robert Muir commented on LUCENE-3738:
-------------------------------------

After investigating: it's difficult to prevent negative offsets, even after 
fixing the term vectors writer (LUCENE-3739).

At first I tried a simple assert in BaseTokenStreamTestCase:
{code}
        assertTrue("offsets must not go backwards", offsetAtt.startOffset() >= lastStartOffset);
        lastStartOffset = offsetAtt.startOffset();
{code}

Then these analyzers failed:
* MockCharFilter itself had a bug, but that's easy to fix (LUCENE-3741)
* synonymsfilter failed sometimes (LUCENE-3742) because it wrote zeros for 
offsets in situations like "a -> b c"
* (edge)ngramtokenizers failed, because ngrams(1,2) of "ABCD" are not A, AB, B, 
BC, C, CD, D but instead A, B, C, D, AB, BC, CD, ...
* (edge)ngramfilters failed for similar reasons.
* worddelimiterfilter failed, because it doesn't break "AB" into A, AB, B but 
instead into A, B, AB
* trimfilter failed when 'offsets changing' is enabled, because if you have " 
rob", "robert" as synonyms then it trims the first, and the second's offsets "go 
backwards"
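
To make the ngram case concrete, here is a minimal standalone sketch (plain 
Java, not the Lucene tokenizers; the class and method names are made up for 
illustration). It lists the start offsets produced by the two emission orders 
for ngrams(1,2) of "ABCD" and applies the same "must not go backwards" check 
as the assert above:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of Lucene.
public class NGramOffsetOrder {

    /** Start offsets when grams are emitted position by position: A, AB, B, BC, C, CD, D. */
    static List<Integer> byPosition(String text, int minGram, int maxGram) {
        List<Integer> starts = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int n = minGram; n <= maxGram && start + n <= text.length(); n++) {
                starts.add(start);
            }
        }
        return starts;
    }

    /** Start offsets when grams are emitted size by size: A, B, C, D, AB, BC, CD. */
    static List<Integer> bySize(String text, int minGram, int maxGram) {
        List<Integer> starts = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= text.length(); start++) {
                starts.add(start);
            }
        }
        return starts;
    }

    /** True if any start offset is smaller than the one before it. */
    static boolean offsetsGoBackwards(List<Integer> starts) {
        int last = 0;
        for (int s : starts) {
            if (s < last) {
                return true;
            }
            last = s;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Integer> a = byPosition("ABCD", 1, 2);
        List<Integer> b = bySize("ABCD", 1, 2);
        System.out.println(a + " backwards=" + offsetsGoBackwards(a));
        System.out.println(b + " backwards=" + offsetsGoBackwards(b));
    }
}
```

In the size-by-size order the start offset drops from 3 (for D) back to 0 
(for AB), which is exactly what trips the assert.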

These are all bugs.

In general I think offsets, once set, should not be changed, because 
filters don't have access to any CharFilter's offset correction 
(correctOffset()) anyway, so they shouldn't be mucking with offsets.

So really: only the creator of tokens should make the offsets. And if that's a 
filter, it should be done in a standard way, 
only inherited from existing offsets, with no 'offset mathematics', and not A, 
AB, B in some places and A, B, AB in others.

Really I think we need to step it up if we want highlighting to be a first-class 
citizen in Lucene: nothing checks the offsets anywhere at all, 
even to check/assert whether they are negative, and there are few tests... all we 
have is some newish stuff in BaseTokenStreamTestCase and 
a few trivial test cases.

On the other hand, for example, position increment's impl actually throws an 
exception if you give it something like a negative number...
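
For comparison, a rough sketch of what the same kind of guard could look like 
for offsets. CheckedOffset is a hypothetical class written only for this 
comment, not an existing Lucene API, and the exact conditions (no negative 
start, end not before start) are my assumption of what a sane check would be:

```java
// Hypothetical illustration class, not part of Lucene.
public class CheckedOffset {
    private int startOffset;
    private int endOffset;

    /** Rejects negative or inverted offsets instead of silently storing them. */
    public void setOffset(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException(
                "startOffset must be non-negative and endOffset must be >= startOffset, got "
                    + "startOffset=" + startOffset + ", endOffset=" + endOffset);
        }
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    public int startOffset() {
        return startOffset;
    }

    public int endOffset() {
        return endOffset;
    }

    public static void main(String[] args) {
        CheckedOffset offsets = new CheckedOffset();
        offsets.setOffset(0, 3); // accepted
        try {
            offsets.setOffset(-1, 2); // rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This only catches a single bad token; the "must not go backwards" property 
spans tokens, so it still needs a stream-level check like the assert above.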

                
> Be consistent about negative vInt/vLong
> ---------------------------------------
>
>                 Key: LUCENE-3738
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3738
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3738.patch, LUCENE-3738.patch
>
>
> Today, write/readVInt "allows" a negative int, in that it will encode and 
> decode correctly, just horribly inefficiently (5 bytes).
> However, read/writeVLong fails (trips an assert).
> I'd prefer that both vInt/vLong trip an assert if you ever try to write a 
> negative number... it's badly trappy today.  But, unfortunately, we sometimes 
> rely on this... had we had this assert in 'since the beginning' we could have 
> avoided that.
> So, if we can't add that assert in today, I think we should at least fix 
> readVLong to handle negative longs... but then you quietly spend 9 bytes 
> (even more trappy!).
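
To illustrate the "5 bytes" point in the description, here is a small 
standalone sketch of the 7-bits-per-byte variable-length scheme, modeled on the 
usual writeVInt/readVInt loop. VIntSketch and its byte-array helpers are 
illustrative names, not the actual DataOutput/DataInput API, and only the int 
case is shown:

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class, not part of Lucene.
public class VIntSketch {

    /** Encodes i using 7 data bits per byte; the high bit marks "more bytes follow". */
    static byte[] writeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7; // unsigned shift, so a negative int still terminates
        }
        out.write(i);
        return out.toByteArray();
    }

    /** Decodes a value produced by writeVInt. */
    static int readVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return value;
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, 127, 128, 16384, -1}) {
            byte[] encoded = writeVInt(v);
            System.out.println(v + " -> " + encoded.length + " byte(s), decoded " + readVInt(encoded));
        }
    }
}
```

Because the sign bit is set, a negative int always has all 32 bits in play and 
so takes the maximum of 5 bytes, while small non-negative values take 1 or 2.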
