[jira] [Updated] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cao Manh Dat updated LUCENE-6595:
---------------------------------
    Attachment: LUCENE-6595.patch

I changed correctEndOffset to correctEndTokenOffset, because correctFinalOffset still uses correctOffset(). I also applied [~mikemccand]'s suggestions in this patch.

CharFilter offsets correction is wonky
--------------------------------------

                Key: LUCENE-6595
                URL: https://issues.apache.org/jira/browse/LUCENE-6595
            Project: Lucene - Core
         Issue Type: Bug
           Reporter: Michael McCandless
       Attachments: LUCENE-6595.patch, LUCENE-6595.patch, LUCENE-6595.patch, LUCENE-6595.patch, Lucene-6595.pptx

Spinoff from this original Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/11726

If I make a MappingCharFilter with these mappings:
{noformat}
( ->
) ->
{noformat}
i.e., just erase the left and right parens, then tokenizing the string (F31) with e.g. WhitespaceTokenizer produces a single token, F31, with start offset 1 (good). But for its end offset I would expect/want 4, yet it produces 5 today.

This is easily explained by how the mapping works: each time a mapping rule matches, we update the cumulative offset difference, conceptually as arrays like this (it's encoded more compactly):
{noformat}
Output offset: 0 1 2 3
Input offset:  1 2 3 5
{noformat}
When the tokenizer produces F31, it assigns it startOffset=0 and endOffset=3 based on the characters it sees (F, 3, 1). It then asks the CharFilter to correct those offsets, mapping them backwards through the above arrays, which produces startOffset=1 (good) and endOffset=5 (bad).

At first I thought this was an off-by-1, and that when correcting the endOffset we should instead return 1+correct(outputEndOffset-1), which would give the correct value (4) here. But that's too naive; here's another example:
{noformat}
cccc -> cc
{noformat}
If I then tokenize cccc, today we produce the correct offsets (0, 4), but with the off-by-1 fix for endOffset we would get the wrong endOffset (2). I'm not sure what to do here...

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
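The cumulative-difference table above can be sketched in plain Java. This is a standalone illustration of the mapping described in the issue, not Lucene code; the class, array names, and correct() method are assumptions made for the example.

```java
public class OffsetBugSketch {
    // Correction points recorded while filtering "(F31)" -> "F31",
    // exactly the conceptual arrays from the issue description:
    //   output offset: 0 1 2 3
    //   input offset:  1 2 3 5
    static final int[] OUTPUT = {0, 1, 2, 3};
    static final int[] INPUT  = {1, 2, 3, 5};

    // Map an output-side offset back through the table, the way the
    // issue describes correctOffset() behaving.
    static int correct(int outputOffset) {
        int result = outputOffset;
        for (int i = 0; i < OUTPUT.length; i++) {
            if (OUTPUT[i] <= outputOffset) {
                result = INPUT[i] + (outputOffset - OUTPUT[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // WhitespaceTokenizer sees token "F31" at output offsets [0, 3).
        System.out.println(correct(0));     // startOffset -> 1 (good)
        System.out.println(correct(3));     // endOffset   -> 5 (bad; 4 is wanted)
        // The naive off-by-1 fix, 1 + correct(outputEndOffset - 1),
        // looks right for this example:
        System.out.println(1 + correct(2)); // -> 4
        // ...but as the counterexample in the issue shows, for a
        // "cccc" -> "cc" mapping the same fix would break a correct end offset.
    }
}
```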
[jira] [Updated] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cao Manh Dat updated LUCENE-6595:
---------------------------------
    Attachment: LUCENE-6595.patch

Refactored some code inside BaseCharFilter to make it cleaner. I think this patch is final. [~mikemccand], I changed
{code}
addOffCorrectMap(off, cumulativeDiff, 0);
{code}
to
{code}
addOffCorrectMap(off, cumulativeDiff, off);
{code}
but it fails some tests in HTMLStripCharFilterTest. I'm not sure what is going on in HTMLStripCharFilter.
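One way to picture why a start offset and an end offset may need different corrections at the same point (the distinction the patch's correctEndTokenOffset seems to draw): at a correction point caused by a deletion, a start offset should jump past the erased text while an end offset should stop before it. The Point class, arrays, and semantics below are illustrative assumptions, not the attached patch's actual API or behavior.

```java
public class EndOffsetSketch {
    // A correction point: at this output offset, a start offset maps to
    // startInput and an end offset maps to endInput.
    static final class Point {
        final int output, startInput, endInput;
        Point(int output, int startInput, int endInput) {
            this.output = output;
            this.startInput = startInput;
            this.endInput = endInput;
        }
    }

    // "(F31)" with "(" and ")" erased: at output offset 3 a start offset
    // should skip past the erased ")" (input 5), but an end offset should
    // stop before it (input 4).
    static final Point[] PARENS = { new Point(0, 1, 1), new Point(3, 5, 4) };

    // "cccc" -> "cc": at output offset 2 both start and end map to input 4,
    // which is why a blanket off-by-1 adjustment breaks this case.
    static final Point[] CCCC = { new Point(2, 4, 4) };

    // Map an output-side offset back to the input, choosing the start- or
    // end-flavored correction at each recorded point.
    static int correct(Point[] points, int outputOffset, boolean end) {
        int result = outputOffset;
        for (Point p : points) {
            if (p.output <= outputOffset) {
                result = (end ? p.endInput : p.startInput) + (outputOffset - p.output);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(correct(PARENS, 0, false)); // start of "F31" -> 1
        System.out.println(correct(PARENS, 3, true));  // end of "F31"   -> 4
        System.out.println(correct(CCCC, 2, true));    // end of "cc"    -> 4
    }
}
```

Under this (assumed) model both examples from the issue description come out right, which is consistent with keeping correctOffset() for start and final offsets while adding a separate end-offset correction.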
[jira] [Updated] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cao Manh Dat updated LUCENE-6595:
---------------------------------
    Attachment: Lucene-6595.pptx

[~rcmuir] I think it's quite clear now :)
[jira] [Updated] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cao Manh Dat updated LUCENE-6595:
---------------------------------
    Attachment: LUCENE-6595.patch

Here is a patch that passes all the CharFilter tests. I think this patch is only a prototype, because it changes the Tokenizer API, which needs agreement from the committers.
[jira] [Updated] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cao Manh Dat updated LUCENE-6595:
---------------------------------
    Attachment: LUCENE-6595.patch

Initial patch (it does not pass all the tests, but it solves the problem described above). I will keep working on this bug.