[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625935#comment-14625935 ] David Smiley commented on LUCENE-6595:

At first I was a little confused by the interleaved representation you used, but then I figured it out. Nice work on the PPT, Cao :-)

CharFilter offsets correction is wonky
--
Key: LUCENE-6595
URL: https://issues.apache.org/jira/browse/LUCENE-6595
Project: Lucene - Core
Issue Type: Bug
Reporter: Michael McCandless
Attachments: LUCENE-6595.patch, LUCENE-6595.patch, LUCENE-6595.patch, Lucene-6595.pptx

Spinoff from this original Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/11726

If I make a MappingCharFilter with these mappings:
{noformat}
( ->
) ->
{noformat}
i.e., just erase left and right paren, then tokenizing the string (F31) with e.g. WhitespaceTokenizer produces a single token F31, with start offset 1 (good). But for its end offset I would expect/want 4, yet it produces 5 today.

This is easily explained by how the mapping works: each time a mapping rule matches, we update the cumulative offset difference, conceptually as an array like this (it's encoded more compactly):
{noformat}
Output offset: 0 1 2 3
Input offset:  1 2 3 5
{noformat}
When the tokenizer produces F31, it assigns it startOffset=0 and endOffset=3 based on the characters it sees (F, 3, 1). It then asks the CharFilter to correct those offsets, mapping them backwards through the above arrays, which produces startOffset=1 (good) and endOffset=5 (bad).

At first I thought this was an off-by-1, and that when correcting the endOffset we really should return 1+correct(outputEndOffset-1), which would return the correct value (4) here. But that's too naive; e.g., here's another example:
{noformat}
-> cc
{noformat}
If I then tokenize , today we produce the correct offsets (0, 4), but with this off-by-1 fix for endOffset we would get the wrong endOffset (2).

I'm not sure what to do here...

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
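The cumulative-diff table from the issue description can be simulated in a few lines. This is a toy sketch, not Lucene's actual BaseCharFilter encoding (the class and method names here are hypothetical); it just replays the "(F31)" example to show today's wrong endOffset and the proposed off-by-1 fix:

```java
// Toy model of the correction table for "(F31)" with '(' and ')' erased.
// outputOffsets[i] maps back to inputOffsets[i]; offsets between entries
// carry the same cumulative difference as the entry before them.
public class OffsetCorrectionSketch {
    static final int[] OUTPUT = {0, 1, 2, 3};
    static final int[] INPUT  = {1, 2, 3, 5};

    // Naive backwards mapping, as described in the issue.
    static int correct(int outputOffset) {
        for (int i = OUTPUT.length - 1; i >= 0; i--) {
            if (OUTPUT[i] <= outputOffset) {
                return INPUT[i] + (outputOffset - OUTPUT[i]);
            }
        }
        return outputOffset;
    }

    public static void main(String[] args) {
        // Token F31 spans output offsets [0, 3).
        System.out.println(correct(0));         // 1: corrected startOffset (good)
        System.out.println(correct(3));         // 5: today's corrected endOffset (bad, want 4)
        System.out.println(1 + correct(3 - 1)); // 4: the off-by-1 fix works on this example
    }
}
```

As the issue goes on to show, the off-by-1 variant is only right for erasing rules; it breaks on mappings that expand a sequence.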
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619292#comment-14619292 ] Michael McCandless commented on LUCENE-6595:

Thanks [~caomanhdat]! Is this the failing case if you pass {{off}} instead of {{0}} to {{addOffCorrectMap}}?

{noformat}
@@ -215,7 +230,8 @@
 };
 int numRounds = RANDOM_MULTIPLIER * 1;
-checkRandomData(random(), analyzer, numRounds);
+//checkRandomData(random(), analyzer, numRounds);
+checkAnalysisConsistency(random(),analyzer,true,m?(y ');
 analyzer.close();
 }
{noformat}

Best to add a {{// nocommit}} comment when making such temporary changes... And it's spooky the test fails, because with the right default here (hmm, maybe it should be {{off + cumulativeDiff}}, since it's an input offset) it should behave exactly as before?

Can you mark the old {{addOffCorrectMap}} as deprecated? We can remove it in trunk but leave it deprecated in 5.x... It seems like any subclasses here really need to tell us the input offset.

For the default impl of {{CharFilter.correctEnd}} should we just use {{CharFilter.correct}}? Can we rename correctOffset -> correctStartOffset now that we also have a correctEndOffset? Does {{correctOffset(endOffset-1)+1}} not work? It would be nice not to add the new method to {{CharFilter}} (only to {{Tokenizer}}).
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619449#comment-14619449 ] Robert Muir commented on LUCENE-6595:

I am lost in all the correct() methods now for charfilters. I think Tokenizer should have at most one such method.
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618122#comment-14618122 ] Cao Manh Dat commented on LUCENE-6595:

[~mikemccand] Sorry for the delay, I will submit a patch tonight (6 hours from now).
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619814#comment-14619814 ] Cao Manh Dat commented on LUCENE-6595:

Thanks [~mikemccand]!

{quote}
@@ -215,7 +230,8 @@
 };
 int numRounds = RANDOM_MULTIPLIER * 1;
-checkRandomData(random(), analyzer, numRounds);
+//checkRandomData(random(), analyzer, numRounds);
+checkAnalysisConsistency(random(),analyzer,true,m?(y ');
 analyzer.close();
 }
{quote}

My fault, I played around with the test and forgot to roll it back.

{quote}
It's spooky the test fails, because with the right default here (hmm, maybe it should be {{off + cumulativeDiff}}, since it's an input offset) it should behave exactly as before?
{quote}

Nice idea. I changed it to {{off - cumulativeDiff}} and it works perfectly.

{quote}
For the default impl of CharFilter.correctEnd should we just use CharFilter.correct? Can we rename correctOffset -> correctStartOffset now that we also have a correctEndOffset?
{quote}

Nice refactoring.

{quote}
Does (correctOffset(endOffset-1)+1) not work? It would be nice not to add the new method to CharFilter (only to Tokenizer).
{quote}

I tried to do that, but it can't work, because the information for the special case lives in BaseCharFilter.

[~rcmuir] I will try to explain the solution in a slide deck; I'm not very good at explaining :(
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616411#comment-14616411 ] Michael McCandless commented on LUCENE-6595:

[~caomanhdat], will you have time to fold in some of the feedback above, to minimize API changes? Or I can try to, if you're too busy...
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604614#comment-14604614 ] Michael McCandless commented on LUCENE-6595:

I think we'll also need to conditionalize this behavior change by version for back compat ...
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604599#comment-14604599 ] Michael McCandless commented on LUCENE-6595:

I think the API change here is necessary, but maybe we can minimize it? E.g., can we fix the existing BaseCharFilter.addOffCorrectMap method to forward to the new one that now takes an inputOffset? And can it just pass {{off}} as the inputOffset (instead of filling with 0)?

I think we may not need the new method BaseCharFilter.correctEnd, but we do need Tokenizer.correctEndOffset; can we just implement it as LUCENE-5734 proposed ({{correctOffset(endOffset-1)+1}})?
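The tension around the proposed {{correctOffset(endOffset-1)+1}} default can be made concrete with the issue's two examples. This is a hedged sketch against a toy correction table (not Lucene's real BaseCharFilter; the 4-character input sequence standing in for the stripped entity in the second example is an assumption):

```java
// Toy cumulative-diff table replaying the two examples from this issue.
public class CorrectEndSketch {
    final int[] output;
    final int[] input;

    CorrectEndSketch(int[] output, int[] input) {
        this.output = output;
        this.input = input;
    }

    // Map an output offset back to an input offset through the table.
    int correct(int off) {
        for (int i = output.length - 1; i >= 0; i--) {
            if (output[i] <= off) return input[i] + (off - output[i]);
        }
        return off;
    }

    // The default proposed on LUCENE-5734: correctOffset(endOffset-1)+1.
    int correctEndOffset(int end) {
        return correct(end - 1) + 1;
    }

    public static void main(String[] args) {
        // "(F31)" with both parens erased; token F31 spans output [0, 3).
        CorrectEndSketch erase =
            new CorrectEndSketch(new int[]{0, 1, 2, 3}, new int[]{1, 2, 3, 5});
        System.out.println(erase.correct(3));          // 5: today's end (wrong, want 4)
        System.out.println(erase.correctEndOffset(3)); // 4: proposed default works here

        // Hypothetical 4-char input sequence mapped to "cc"; token spans [0, 2).
        CorrectEndSketch expand =
            new CorrectEndSketch(new int[]{0, 2}, new int[]{0, 4});
        System.out.println(expand.correct(2));          // 4: today's end (correct)
        System.out.println(expand.correctEndOffset(2)); // 2: proposed default breaks here
    }
}
```

Neither rule is right for both tables, which is why the patch discussed above records an extra input offset per correction instead of picking one formula.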
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604729#comment-14604729 ] Cao Manh Dat commented on LUCENE-6595:

{quote}
Any HTML entity that maps to empty string (e.g. em, /em, b, etc., I think?) would not be included within the output token's start/endOffset, unless that entity was inside a token.
{quote}

I think it will not be a problem, because we only ask for the start/end offsets of a token.
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604408#comment-14604408 ] Michael McCandless commented on LUCENE-6595:

bq. So finalOffset should be 3 or 6?

In this example finalOffset should be 6.
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604419#comment-14604419 ] Michael McCandless commented on LUCENE-6595:

bq. And do you agree this issue is the same as LUCENE-5734?

This looks like the same issue to me, although since HTMLStripCharFilter knows it's replacing HTML entities (I think?) it could be smarter about correcting offsets, vs. e.g. MappingCharFilter, which needs to be generic/agnostic as to what exactly it's remapping.

My first idea was the same one proposed on LUCENE-5734: add a new correctEndOffset method which defaults to {{correctOffset(endOffset-1)+1}}, but this fails the -> cc case.

[~caomanhdat]'s approach here is to store another int per correction: the input offset where the correction first applied. It's a neat solution: it seems to solve my two examples, and I think it would solve LUCENE-5734 as well? Any HTML entity that maps to empty string (e.g. em, /em, b, etc., I think?) would not be included within the output token's start/endOffset, unless that entity was inside a token.
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597883#comment-14597883 ] Cao Manh Dat commented on LUCENE-6595:

Thanks [~mikemccand]. I'm quite confused about the finalOffset of Tokenizer. For example:
{code}
Input : ABC)))
Output : ABC
{code}
The end offset of the last term is 3. So should finalOffset be 3 or 6?
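The answer later in the thread (finalOffset should be 6) follows from correcting the full output length, not the last token's end. A toy sketch under the same assumptions as above (this is my reading of the "ABC)))" example, not Lucene's actual code):

```java
// "ABC)))" with ')' erased becomes "ABC"; after the three parens are
// consumed, the toy table's last entry maps output offset 3 to input offset 6.
public class FinalOffsetSketch {
    static final int[] OUTPUT = {3};
    static final int[] INPUT  = {6};

    static int correct(int off) {
        for (int i = OUTPUT.length - 1; i >= 0; i--) {
            if (OUTPUT[i] <= off) return INPUT[i] + (off - OUTPUT[i]);
        }
        return off; // offsets before any correction map to themselves
    }

    public static void main(String[] args) {
        int outputLength = "ABC".length();
        // finalOffset corrects the end of the whole stream: the full input
        // length 6, even though the last token's characters end at 3.
        System.out.println(correct(outputLength)); // 6
    }
}
```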
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596742#comment-14596742 ] Michael McCandless commented on LUCENE-6595:
--

Thanks [~caomanhdat], I'll try to understand your proposed change. But some tests seem to be failing with this patch, e.g.:

{noformat}
[junit4] Suite: org.apache.lucene.analysis.core.TestBugInSomething
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestBugInSomething -Dtests.method=test -Dtests.seed=FD8C8301DD07CEFD -Dtests.locale=hr -Dtests.timezone=SystemV/PST8PDT -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
[junit4] FAILURE 0.00s J2 | TestBugInSomething.test
[junit4] Throwable #1: java.lang.AssertionError: finalOffset expected:<16> but was:<20>
[junit4]    at __randomizedtesting.SeedInfo.seed([FD8C8301DD07CEFD:75D8BCDB73FBA305]:0)
[junit4]    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:280)
[junit4]    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:295)
[junit4]    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:299)
[junit4]    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:812)
[junit4]    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:674)
[junit4]    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:670)
[junit4]    at org.apache.lucene.analysis.core.TestBugInSomething.test(TestBugInSomething.java:77)
[junit4]    at java.lang.Thread.run(Thread.java:745)
[junit4] IGNOR/A 0.01s J2 | TestBugInSomething.testUnicodeShinglesAndNgrams
[junit4] Assumption #1: 'slow' test group is disabled (@Slow())
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene53): {}, docValues:{}, sim=DefaultSimilarity, locale=hr, timezone=SystemV/PST8PDT
[junit4] 2> NOTE: Linux 3.13.0-46-generic amd64/Oracle Corporation 1.8.0_40 (64-bit)/cpus=8,threads=1,free=370188896,total=519569408
{noformat}

and

{noformat}
[junit4] Suite: org.apache.lucene.analysis.charfilter.TestHTMLStripCharFilterFactory
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestHTMLStripCharFilterFactory -Dtests.method=testSingleEscapedTag -Dtests.seed=FD8C8301DD07CEFD -Dtests.locale=lt_LT -Dtests.timezone=America/Thule -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
[junit4] ERROR 0.00s J3 | TestHTMLStripCharFilterFactory.testSingleEscapedTag
[junit4] Throwable #1: java.lang.NullPointerException
[junit4]    at __randomizedtesting.SeedInfo.seed([FD8C8301DD07CEFD:36A72464D080D0F1]:0)
[junit4]    at org.apache.lucene.analysis.charfilter.BaseCharFilter.correctEnd(BaseCharFilter.java:82)
[junit4]    at org.apache.lucene.analysis.CharFilter.correctEndOffset(CharFilter.java:93)
[junit4]    at org.apache.lucene.analysis.Tokenizer.correctEndOffset(Tokenizer.java:84)
[junit4]    at org.apache.lucene.analysis.MockTokenizer.incrementToken(MockTokenizer.java:176)
[junit4]    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:177)
[junit4]    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:295)
[junit4]    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:299)
[junit4]    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:303)
[junit4]    at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:327)
[junit4]    at org.apache.lucene.analysis.charfilter.TestHTMLStripCharFilterFactory.testSingleEscapedTag(TestHTMLStripCharFilterFactory.java:99)
[junit4]    at java.lang.Thread.run(Thread.java:745)
{noformat}
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595289#comment-14595289 ] Cao Manh Dat commented on LUCENE-6595:
--

Currently CharFilter has two problems.

Problem 1:
{code}
Input  : A B C ) ) )
Output : A B C
{code}
When the Tokenizer asks to correct offset 3 (the end of C in the output), that offset relates to offsets 3, 4, 5, 6 in the input, so CharFilter corrects it to 6 (the end of the range). So why does cccc -> cc get the correct offset?
{code}
Input  : c c c c
Output : c c
{code}
Because offset 2 (the end of the second c in the output) relates to offsets 2, 3, 4 in the input, and CharFilter corrects it to 4 (the end of the range, which is correct).

The difference between the two examples: in Ex1 the replacement happens right at the corrected point (at 3), while in Ex2 the replacement happens before the corrected point (at 0). So I store an inputOffsets[] array holding the start of each replacement.

Problem 2:
{code}
Input  : A space ( C
Output : A space C
{code}
When the Tokenizer asks to correct offset 3 (the end of C in the output), that offset relates to offsets 3, 4 in the input, and CharFilter corrects it to 4 (the end of the range, which is correct). But in this example the replacement also happens right at the corrected point. So correcting a startOffset must be treated differently from correcting an endOffset.

The root of the problem is that we map N -> 1 and then ask for an inverse 1 -> 1 mapping. [~dsmiley] I will look at LUCENE-5734 and try to fix that bug.

CharFilter offsets correction is wonky
--
Key: LUCENE-6595
URL: https://issues.apache.org/jira/browse/LUCENE-6595
Project: Lucene - Core
Issue Type: Bug
Reporter: Michael McCandless
Attachments: LUCENE-6595.patch

Spinoff from this original Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/11726

If I make a MappingCharFilter with these mappings:
{noformat}
( ->
) ->
{noformat}
i.e., just erase left and right paren, then tokenizing the string (F31) with e.g.
WhitespaceTokenizer, produces a single token F31, with start offset 1 (good). But for its end offset, I would expect/want 4, but it produces 5 today. This can be easily explained given how the mapping works: each time a mapping rule matches, we update the cumulative offset difference, conceptually as an array like this (it's encoded more compactly):
{noformat}
Output offset: 0 1 2 3
Input offset:  1 2 3 5
{noformat}
When the tokenizer produces F31, it assigns it startOffset=0 and endOffset=3 based on the characters it sees (F, 3, 1). It then asks the CharFilter to correct those offsets, mapping them backwards through the above arrays, which creates startOffset=1 (good) and endOffset=5 (bad). At first, to fix this, I thought this is an off-by-1 and when correcting the endOffset we really should return 1+correct(outputEndOffset-1), which would return the correct value (4) here. But that's too naive, e.g. here's another example:
{noformat}
cccc -> cc
{noformat}
If I then tokenize cccc, today we produce the correct offsets (0, 4) but if we do this off-by-1 fix for endOffset, we would get the wrong endOffset (2). I'm not sure what to do here...

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
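The cumulative-diff table above can be modeled in a few lines to show concretely why the naive off-by-1 fix fails. The sketch below is a simplified stand-in for Lucene's BaseCharFilter (which stores these points compactly and looks them up with a binary search), not the actual implementation; the class name OffsetCorrectionSketch is made up for illustration, though addOffCorrectMap mirrors the name of the real BaseCharFilter hook:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of BaseCharFilter's offset correction: each time a mapping rule
// matches, record (outputOffset, cumulativeDiff); correct(off) then adds the
// diff of the last recorded point <= off.
public class OffsetCorrectionSketch {
    private final List<int[]> points = new ArrayList<>(); // {outputOffset, cumulativeDiff}

    public void addOffCorrectMap(int outputOffset, int cumulativeDiff) {
        points.add(new int[] {outputOffset, cumulativeDiff});
    }

    public int correct(int outputOffset) {
        int diff = 0;
        for (int[] p : points) {
            if (p[0] <= outputOffset) diff = p[1]; // points are added in order
        }
        return outputOffset + diff;
    }

    public static void main(String[] args) {
        // Example 1: "(F31)" with "(" and ")" erased.
        // "(" matched at input 0 -> diff 1 recorded at output 0;
        // ")" matched at input 4 -> diff 2 recorded at output 3.
        OffsetCorrectionSketch ex1 = new OffsetCorrectionSketch();
        ex1.addOffCorrectMap(0, 1);
        ex1.addOffCorrectMap(3, 2);
        // Tokenizer sees "F31": startOffset=0, endOffset=3.
        System.out.println(ex1.correct(0));         // 1 (good)
        System.out.println(ex1.correct(3));         // 5 (bad: want 4)
        System.out.println(1 + ex1.correct(3 - 1)); // 4 (off-by-1 fix works here)

        // Example 2: "cccc" -> "cc": one match, diff 2 recorded at output 2.
        OffsetCorrectionSketch ex2 = new OffsetCorrectionSketch();
        ex2.addOffCorrectMap(2, 2);
        // Tokenizer sees "cc": startOffset=0, endOffset=2.
        System.out.println(ex2.correct(2));         // 4 (good)
        System.out.println(1 + ex2.correct(2 - 1)); // 2 (off-by-1 fix is wrong here)
    }
}
```

The same correct() call gives the right endOffset in the second example and the wrong one in the first, which is exactly the N -> 1 versus 1 -> 1 inverse-mapping tension discussed in the comments.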
[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595169#comment-14595169 ] David Smiley commented on LUCENE-6595:
--

Cao, Mike's last words were "I'm not sure what to do here..." Could you please describe how you fixed this? And do you agree this issue is the same as LUCENE-5734?