[jira] [Updated] (LUCENE-6595) CharFilter offsets correction is wonky

2015-07-14 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated LUCENE-6595:
-
Attachment: LUCENE-6595.patch

I renamed correctEndOffset to correctEndTokenOffset because 
correctFinalOffset still uses correctOffset().
I also applied [~mikemccand]'s suggestions to this patch.

 CharFilter offsets correction is wonky
 --

 Key: LUCENE-6595
 URL: https://issues.apache.org/jira/browse/LUCENE-6595
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Michael McCandless
 Attachments: LUCENE-6595.patch, LUCENE-6595.patch, LUCENE-6595.patch, 
 LUCENE-6595.patch, Lucene-6595.pptx


 Spinoff from this original Elasticsearch issue: 
 https://github.com/elastic/elasticsearch/issues/11726
 If I make a MappingCharFilter with these mappings:
 {noformat}
   ( ->
   ) ->
 {noformat}
 i.e., just erase left and right paren, then tokenizing the string
 (F31) with e.g. WhitespaceTokenizer, produces a single token F31,
 with start offset 1 (good).
 But for its end offset, I would expect/want 4, but it produces 5
 today.
 This can be easily explained given how the mapping works: each time a
 mapping rule matches, we update the cumulative offset difference,
 conceptually as an array like this (it's encoded more compactly):
 {noformat}
   Output offset: 0 1 2 3
    Input offset: 1 2 3 5
 {noformat}
 When the tokenizer produces F31, it assigns it startOffset=0 and
 endOffset=3 based on the characters it sees (F, 3, 1).  It then asks
 the CharFilter to correct those offsets, mapping them backwards
 through the above arrays, which creates startOffset=1 (good) and
 endOffset=5 (bad).
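To make the backward mapping concrete, here is a toy sketch of that lookup (a plain array stand-in for illustration only, not Lucene's actual BaseCharFilter encoding, which is more compact):

```java
// Toy model of the offset-correction table for input "(F31)" after
// both parens are erased: output offsets 0..3 map back to input
// offsets 1, 2, 3, 5. Hypothetical helper, not a Lucene API.
public class OffsetMapDemo {
    public static final int[] INPUT_OFFSET = {1, 2, 3, 5};

    public static int correct(int outputOffset) {
        return INPUT_OFFSET[outputOffset];
    }

    public static void main(String[] args) {
        // The tokenizer sees "F31" and reports startOffset=0, endOffset=3.
        System.out.println(correct(0)); // 1: corrected startOffset (good)
        System.out.println(correct(3)); // 5: corrected endOffset (bad; we want 4)
    }
}
```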
 At first, to fix this, I thought this is an off-by-1 and when
 correcting the endOffset we really should return
 1+correct(outputEndOffset-1), which would return the correct value (4)
 here.
 But that's too naive, e.g. here's another example:
 {noformat}
   <br> -> cc
 {noformat}
 If I then tokenize <br>, today we produce the correct offsets (0, 4)
 but if we do this off-by-1 fix for endOffset, we would get the wrong
 endOffset (2).
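To see concretely why the 1+correct(end-1) idea fixes the first example but breaks the second, here is a small sketch. The arrays are toy stand-ins for the correction tables, assuming the second mapping replaces a four-character source (such as <br>) with cc; this is not Lucene's actual BaseCharFilter encoding:

```java
// Toy comparison of endOffset-correction strategies on both examples
// (hypothetical model, not BaseCharFilter internals).
public class EndOffsetDemo {
    // "(F31)" with "(" -> "" and ")" -> "": output 0..3 -> input 1, 2, 3, 5
    public static final int[] PAREN_MAP = {1, 2, 3, 5};
    // "<br>" -> "cc": output 0..2 -> input 0, 1, 4 (no correction applies
    // before the match's end point, so output 1 maps to input 1)
    public static final int[] BR_MAP = {0, 1, 4};

    public static int correct(int[] map, int off) { return map[off]; }

    public static void main(String[] args) {
        // Example 1: token "F31", tokenizer endOffset=3.
        System.out.println(correct(PAREN_MAP, 3));     // 5: today (bad)
        System.out.println(1 + correct(PAREN_MAP, 2)); // 4: off-by-1 fix (good)

        // Example 2: token "cc", tokenizer endOffset=2.
        System.out.println(correct(BR_MAP, 2));        // 4: today (good)
        System.out.println(1 + correct(BR_MAP, 1));    // 2: off-by-1 fix (bad)
    }
}
```

The same formula cannot be right in both cases, which is exactly the dilemma described above.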
 I'm not sure what to do here...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6595) CharFilter offsets correction is wonky

2015-07-08 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated LUCENE-6595:
-
Attachment: LUCENE-6595.patch

Refactored some code inside BaseCharFilter to make it cleaner. I think this 
patch is final.

[~mikemccand] I changed 
{code}
addOffCorrectMap(off, cumulativeDiff, 0);
{code}
to
{code}
addOffCorrectMap(off, cumulativeDiff, off);
{code}
But it fails some tests in HTMLStripCharFilterTest. I'm not sure what is going 
on in HTMLStripCharFilter.





[jira] [Updated] (LUCENE-6595) CharFilter offsets correction is wonky

2015-07-08 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated LUCENE-6595:
-
Attachment: Lucene-6595.pptx

[~rcmuir] I think it's quite clear now :)




[jira] [Updated] (LUCENE-6595) CharFilter offsets correction is wonky

2015-06-23 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated LUCENE-6595:
-
Attachment: LUCENE-6595.patch

Here is a patch that passes all the CharFilter tests. I think this patch is only a 
prototype, because it changes the Tokenizer API, which needs agreement from the 
committers.




[jira] [Updated] (LUCENE-6595) CharFilter offsets correction is wonky

2015-06-21 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated LUCENE-6595:
-
Attachment: LUCENE-6595.patch

Initial patch (it does not pass all the tests yet, but it solves the problem above). 
I will continue working on this bug.
