[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-07-14 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625935#comment-14625935
 ] 

David Smiley commented on LUCENE-6595:
--

 At first I was a little confused by the interleaved representation you used, 
but then I figured it out.  Nice work on the PPT, Cao :-) 

 CharFilter offsets correction is wonky
 --

 Key: LUCENE-6595
 URL: https://issues.apache.org/jira/browse/LUCENE-6595
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Michael McCandless
 Attachments: LUCENE-6595.patch, LUCENE-6595.patch, LUCENE-6595.patch, 
 Lucene-6595.pptx


 Spinoff from this original Elasticsearch issue: 
 https://github.com/elastic/elasticsearch/issues/11726
 If I make a MappingCharFilter with these mappings:
 {noformat}
   ( ->
   ) ->
 {noformat}
 i.e., just erase left and right paren, then tokenizing the string
 (F31) with e.g. WhitespaceTokenizer, produces a single token F31,
 with start offset 1 (good).
 But for its end offset, I would expect/want 4, but it produces 5
 today.
 This can be easily explained given how the mapping works: each time a
 mapping rule matches, we update the cumulative offset difference,
 conceptually as an array like this (it's encoded more compactly):
 {noformat}
   Output offset: 0 1 2 3
   Input offset:  1 2 3 5
 {noformat}
 When the tokenizer produces F31, it assigns it startOffset=0 and
 endOffset=3 based on the characters it sees (F, 3, 1).  It then asks
 the CharFilter to correct those offsets, mapping them backwards
 through the above arrays, which creates startOffset=1 (good) and
 endOffset=5 (bad).
 At first, to fix this, I thought this is an off-by-1 and when
 correcting the endOffset we really should return
 1+correct(outputEndOffset-1), which would return the correct value (4)
 here.
 But that's too naive, e.g. here's another example:
 {noformat}
    -> cc
 {noformat}
 If I then tokenize that input, today we produce the correct offsets (0, 4)
 but if we do this off-by-1 fix for endOffset, we would get the wrong
 endOffset (2).
 I'm not sure what to do here...
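The dilemma above can be reproduced with a small self-contained sketch (plain Java, no Lucene dependency; the plain-array table walk here stands in for the compact encoding BaseCharFilter actually uses, and all names are mine):

```java
// Toy offset correction: map an output offset back to an input offset
// using (outputOffset, cumulativeDiff) pairs, where the last applicable
// pair's diff wins -- a stand-in for BaseCharFilter's compact encoding.
public class WonkyOffsets {
    static int correct(int outputOff, int[][] corrections) {
        int diff = 0;
        for (int[] c : corrections) {
            if (c[0] <= outputOff) diff = c[1]; // last applicable diff wins
        }
        return outputOff + diff;
    }

    public static void main(String[] args) {
        // "(F31)" with both parens erased: '(' adds diff 1 at output 0,
        // ')' adds diff 2 at output 3
        int[][] erase = {{0, 1}, {3, 2}};
        System.out.println(correct(0, erase));     // startOffset: 1 (good)
        System.out.println(correct(3, erase));     // endOffset: 5 (bad, want 4)
        System.out.println(correct(2, erase) + 1); // off-by-1 fix: 4 (good here)

        // second example: a 4-char input replaced by "cc", so a single
        // correction with diff 2 at output offset 2
        int[][] replace = {{2, 2}};
        System.out.println(correct(2, replace));     // endOffset: 4 (good today)
        System.out.println(correct(1, replace) + 1); // off-by-1 fix: 2 (bad)
    }
}
```

The same off-by-1 adjustment that fixes the first example's end offset breaks the second example, which is exactly the dilemma described above.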



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-07-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619292#comment-14619292
 ] 

Michael McCandless commented on LUCENE-6595:


Thanks [~caomanhdat]!

Is this the failing case if you pass {{off}} instead of {{0}} to 
{{addOffCorrectMap}}?

{noformat}
@@ -215,7 +230,8 @@
 };
 
 int numRounds = RANDOM_MULTIPLIER * 1;
-checkRandomData(random(), analyzer, numRounds);
+//checkRandomData(random(), analyzer, numRounds);
+checkAnalysisConsistency(random(),analyzer,true,m?(y ');
 analyzer.close();
   }
{noformat}

Best to add a {{// nocommit}} comment when making such temporary changes... and 
it's spooky that the test fails, because with the right default here (hmm, maybe 
it should be {{off + cumulativeDiff}}, since it's an input offset?) it should 
behave exactly as before.

Can you mark the old {{addCorrectMap}} as deprecated?  We can remove that in 
trunk but leave deprecated in 5.x ... seems like any subclasses here really 
need to tell us the input offset...

For the default impl for {{CharFilter.correctEnd}} should we just use 
{{CharFilter.correct}}?

Can we rename correctOffset -> correctStartOffset now that we also have a 
correctEndOffset?

Does {{(correctOffset(endOffset-1)+1)}} not work?  It would be nice not to add 
the new method to {{CharFilter}} (only to {{Tokenizer}}).




[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-07-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619449#comment-14619449
 ] 

Robert Muir commented on LUCENE-6595:
-

I am lost in all the correct() methods now for charfilters. I think at most 
tokenizer should only have one such method.




[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-07-08 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618122#comment-14618122
 ] 

Cao Manh Dat commented on LUCENE-6595:
--

[~mikemccand] Sorry for the delay, I will submit a patch tonight (in 6 hours)




[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-07-08 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619814#comment-14619814
 ] 

Cao Manh Dat commented on LUCENE-6595:
--

Thanks [~mikemccand]!
{quote}
@@ -215,7 +230,8 @@
 };
 
 int numRounds = RANDOM_MULTIPLIER * 1;
-checkRandomData(random(), analyzer, numRounds);
+//checkRandomData(random(), analyzer, numRounds);
+checkAnalysisConsistency(random(),analyzer,true,m?(y ');
 analyzer.close();
   }
{quote}
My fault, I played around with the test and forgot to roll back. 

{quote}
It's spooky the test fails, because with the right default here (hmm, maybe it 
should be {{off + cumulativeDiff}}, since it's an input offset?) it should 
behave exactly as before.
{quote}
Nice idea. I changed it to {{off - cumulativeDiff}} and it works perfectly.

{quote}
For the default impl for CharFilter.correctEnd should we just use 
CharFilter.correct?
Can we rename correctOffset -- correctStartOffset now that we also have a 
correctEndOffset?
{quote}
Nice refactoring.

{quote}
Does (correctOffset(endOffset-1)+1) not work? It would be nice not to add the 
new method to CharFilter (only to Tokenizer).
{quote}
I tried to do that, but it can't work, because the information for the special 
case lives in BaseCharFilter.

[~rcmuir] I will try to explain the solution in a slide deck; I'm not very good 
at explaining it :( 





[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-07-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616411#comment-14616411
 ] 

Michael McCandless commented on LUCENE-6595:


[~caomanhdat] will you have time to fold in some of the feedback above, to 
minimize API changes?  Or I can try to, if you're too busy...




[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-06-28 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604614#comment-14604614
 ] 

Michael McCandless commented on LUCENE-6595:


I think we'll also need to conditionalize this behavior change by version for 
back compat ...




[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-06-28 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604599#comment-14604599
 ] 

Michael McCandless commented on LUCENE-6595:


I think the API change here is necessary, but maybe we can minimize it?

E.g., can we fix the existing BaseCharFilter.addOffCorrectMap method to forward 
to the new one that now takes an inputOffset?  And can it just pass {{off}} as 
the inputOffset (instead of filling with 0)?

I think we may not need the new method BaseCharFilter.correctEnd, but we do 
need Tokenizer.correctEndOffset; can we just implement it as LUCENE-5734 
proposed ({{correctOffset(endOffset-1)+1}})?
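A minimal sketch of that forwarding idea (illustrative names modeled on {{BaseCharFilter.addOffCorrectMap}}; the actual patch may pick a different default for the input offset):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: the legacy two-arg addOffCorrectMap is kept as a deprecated
// forwarder to a new three-arg variant that also records an input offset,
// so existing subclasses keep compiling with unchanged behavior.
public class ForwardingSketch {
    // each recorded correction: {outputOff, cumulativeDiff, inputOff}
    final List<int[]> corrections = new ArrayList<>();

    /** @deprecated use the three-argument variant */
    @Deprecated
    protected void addOffCorrectMap(int off, int cumulativeDiff) {
        // default input offset: the thread debates off vs. off + cumulativeDiff
        // vs. off - cumulativeDiff; this sketch forwards off itself, as this
        // comment proposes
        addOffCorrectMap(off, cumulativeDiff, off);
    }

    protected void addOffCorrectMap(int off, int cumulativeDiff, int inputOff) {
        corrections.add(new int[] {off, cumulativeDiff, inputOff});
    }

    public static void main(String[] args) {
        ForwardingSketch f = new ForwardingSketch();
        f.addOffCorrectMap(3, 2); // legacy call site, still works
        int[] c = f.corrections.get(0);
        System.out.println(c[0] + " " + c[1] + " " + c[2]); // 3 2 3
    }
}
```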






[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-06-28 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604729#comment-14604729
 ] 

Cao Manh Dat commented on LUCENE-6595:
--

{quote}
Any HTML entity that maps to empty string (e.g. {{<em>}}, {{</em>}}, {{<b>}}, 
etc., I think?) would not be included within the output token's 
start/endOffset, unless that entity was inside a token.
{quote}
I think it will not be a problem, because we only ask for the start/end offsets 
of a token.





[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-06-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604408#comment-14604408
 ] 

Michael McCandless commented on LUCENE-6595:


bq. So finalOffset should be 3 or 6?

In this example finalOffset should be 6.




[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-06-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604419#comment-14604419
 ] 

Michael McCandless commented on LUCENE-6595:


bq. And do you agree this issue is the same as LUCENE-5734 ?

This looks like the same issue to me, although since HTMLStripCharFilter 
knows it's replacing HTML entities (I think?) it could be smarter about 
correcting offsets, vs e.g. MappingCharFilter which needs to be 
generic/agnostic as to what exactly it's remapping.

My first idea was the same idea proposed on LUCENE-5734: add a new 
correctEndOffset method, which defaults to {{correctOffset(endOffset-1)+1}}, 
but then this fails the {{-> cc}} example above.

[~caomanhdat]'s approach here is to store another int per correction, which is 
the input offset where the correction first applied, which is a neat solution: 
it seems to solve my two examples, and I think would solve LUCENE-5734 as well? 
 Any HTML entity that maps to empty string (e.g. {{<em>}}, {{</em>}}, {{<b>}}, 
etc., I think?) would not be included within the output token's 
start/endOffset, unless that entity was inside a token.
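A toy reconstruction of that per-correction input offset (my guess at the described behavior, not the actual patch; all names and the clipping rule here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the idea described above: each correction additionally stores
// the input offset where it first applied, so correcting an *end* offset
// can stop before text that was purely erased after the token, while a
// replacement that produced the token's own characters still extends the
// end offset past the replaced input.
public class EndCorrectionSketch {
    // each entry: {outputOff, cumulativeDiff, inputOffWhereCorrectionBegan}
    final List<int[]> corrections = new ArrayList<>();

    void add(int off, int cumulativeDiff, int inputOff) {
        corrections.add(new int[] {off, cumulativeDiff, inputOff});
    }

    int correct(int outputOff) {
        int diff = 0;
        for (int[] c : corrections) {
            if (c[0] <= outputOff) diff = c[1]; // last applicable diff wins
        }
        return outputOff + diff;
    }

    int correctEnd(int outputEndOff) {
        int prevDiff = 0; // cumulative diff strictly before the end offset
        int[] at = null;  // correction landing exactly at the end offset
        for (int[] c : corrections) {
            if (c[0] < outputEndOff) prevDiff = c[1];
            else if (c[0] == outputEndOff) at = c;
        }
        // if the correction's input span begins at or after where the token
        // already ends, the corrected-away text was purely trailing: clip
        if (at != null && at[2] >= outputEndOff + prevDiff) {
            return at[2];
        }
        return correct(outputEndOff);
    }

    public static void main(String[] args) {
        // "(F31)" with parens erased: token "F31" ends at output offset 3
        EndCorrectionSketch erase = new EndCorrectionSketch();
        erase.add(0, 1, 0); // '(' erased, beginning at input 0
        erase.add(3, 2, 4); // ')' erased, beginning at input 4
        System.out.println(erase.correctEnd(3)); // 4, not 5

        // a 4-char input replaced by "cc": token "cc" ends at output offset 2
        EndCorrectionSketch replace = new EndCorrectionSketch();
        replace.add(2, 2, 0); // replacement began at input 0
        System.out.println(replace.correctEnd(2)); // still 4
    }
}
```

Under this rule both of the issue's examples come out right: the trailing erased paren is excluded (end 4, not 5), while the replaced-by-{{cc}} input keeps end 4 rather than collapsing to 2.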




[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-06-23 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597883#comment-14597883
 ] 

Cao Manh Dat commented on LUCENE-6595:
--

Thanks [~mikemccand]. 
I'm quite confused about the finalOffset of Tokenizer. For example:
{code}
Input : ABC))) 
Output : ABC
{code}
The end offset of the last term is 3, so should finalOffset be 3 or 6?
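For what it's worth, a tiny sketch of what today's correction table does at the end of that input (plain Java; it assumes, per the issue description, that each matched mapping rule bumps the cumulative diff, here once per erased paren at output offset 3):

```java
// Toy illustration: input "ABC)))" with ')' erased gives output "ABC";
// each erasure bumps the cumulative diff at output offset 3, so the
// corrected end of the whole output maps back to 6, past the parens.
public class FinalOffsetSketch {
    // (outputOff, cumulativeDiff) pairs for the three erased ')' chars
    static final int[][] CORRECTIONS = {{3, 1}, {3, 2}, {3, 3}};

    static int correct(int outputOff) {
        int diff = 0;
        for (int[] c : CORRECTIONS) {
            if (c[0] <= outputOff) diff = c[1]; // last applicable diff wins
        }
        return outputOff + diff;
    }

    public static void main(String[] args) {
        System.out.println(correct(3)); // corrected end of output "ABC": 6
    }
}
```

Correcting the end of the entire output yields 6, consistent with the answer that finalOffset should be 6.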





[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-06-22 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596742#comment-14596742
 ] 

Michael McCandless commented on LUCENE-6595:


Thanks [~caomanhdat], I'll try to understand your proposed change.  But some 
tests seem to be failing with this patch, e.g.:

{noformat}
   [junit4] Suite: org.apache.lucene.analysis.core.TestBugInSomething
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestBugInSomething 
-Dtests.method=test -Dtests.seed=FD8C8301DD07CEFD -Dtests.locale=hr 
-Dtests.timezone=SystemV/PST8PDT -Dtests.asserts=true 
-Dtests.file.encoding=US-ASCII
   [junit4] FAILURE 0.00s J2 | TestBugInSomething.test 
   [junit4] Throwable #1: java.lang.AssertionError: finalOffset 
expected:<16> but was:<20>
   [junit4]at 
__randomizedtesting.SeedInfo.seed([FD8C8301DD07CEFD:75D8BCDB73FBA305]:0)
   [junit4]at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:280)
   [junit4]at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:295)
   [junit4]at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:299)
   [junit4]at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:812)
   [junit4]at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:674)
   [junit4]at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:670)
   [junit4]at 
org.apache.lucene.analysis.core.TestBugInSomething.test(TestBugInSomething.java:77)
   [junit4]at java.lang.Thread.run(Thread.java:745)
   [junit4] IGNOR/A 0.01s J2 | TestBugInSomething.testUnicodeShinglesAndNgrams
   [junit4] Assumption #1: 'slow' test group is disabled (@Slow())
   [junit4]   2 NOTE: test params are: codec=Asserting(Lucene53): {}, 
docValues:{}, sim=DefaultSimilarity, locale=hr, timezone=SystemV/PST8PDT
   [junit4]   2 NOTE: Linux 3.13.0-46-generic amd64/Oracle Corporation 
1.8.0_40 (64-bit)/cpus=8,threads=1,free=370188896,total=519569408
{noformat}

and

{noformat}
   [junit4] Suite: 
org.apache.lucene.analysis.charfilter.TestHTMLStripCharFilterFactory
   [junit4]   2 NOTE: reproduce with: ant test  
-Dtestcase=TestHTMLStripCharFilterFactory -Dtests.method=testSingleEscapedTag 
-Dtests.seed=FD8C8301DD07CEFD -Dtests.locale=lt_LT 
-Dtests.timezone=America/Thule -Dtests.asserts=true 
-Dtests.file.encoding=US-ASCII
   [junit4] ERROR   0.00s J3 | 
TestHTMLStripCharFilterFactory.testSingleEscapedTag 
   [junit4] Throwable #1: java.lang.NullPointerException
   [junit4]at 
__randomizedtesting.SeedInfo.seed([FD8C8301DD07CEFD:36A72464D080D0F1]:0)
   [junit4]at 
org.apache.lucene.analysis.charfilter.BaseCharFilter.correctEnd(BaseCharFilter.java:82)
   [junit4]at 
org.apache.lucene.analysis.CharFilter.correctEndOffset(CharFilter.java:93)
   [junit4]at 
org.apache.lucene.analysis.Tokenizer.correctEndOffset(Tokenizer.java:84)
   [junit4]at 
org.apache.lucene.analysis.MockTokenizer.incrementToken(MockTokenizer.java:176)
   [junit4]at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:177)
   [junit4]at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:295)
   [junit4]at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:299)
   [junit4]at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:303)
   [junit4]at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:327)
   [junit4]at 
org.apache.lucene.analysis.charfilter.TestHTMLStripCharFilterFactory.testSingleEscapedTag(TestHTMLStripCharFilterFactory.java:99)
   [junit4]at java.lang.Thread.run(Thread.java:745)
{noformat}


[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-06-21 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595289#comment-14595289
 ] 

Cao Manh Dat commented on LUCENE-6595:
--

Currently, CharFilter has two problems.
Problem 1:
{code}
Input : A B C ) ) )
Output :  A B C
{code}
When the Tokenizer asks to correct offset 3 (the end of C in the output), that 
offset relates to offsets 3, 4, 5, and 6 in the input, and CharFilter corrects 
it to 6 (the end of the range), which is wrong.

So why does cccc -> cc produce a correct offset?
{code}
Input : c c c c
Output : c c
{code}
Because offset 2 (the end of the second c in the output) relates to offsets 2, 
3, and 4 in the input, and CharFilter corrects offset 2 to 4 (the end of the 
range, which is correct).

The difference between the two examples: in Ex1 the replacement happens right 
at the point being corrected (at 3), while in Ex2 the replacement happens 
before that point (at 0). So I store an inputOffsets[] array, which holds the 
start offset of each replacement.
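One way to read the inputOffsets[] idea above (a rough, hypothetical sketch, 
NOT the attached patch; all names are invented): record, for each replacement, 
where it started in the input and in the output, and snap an end offset back 
to the replacement's input start when the token ended exactly where the 
replaced text began.

```java
// One replaced span: inputStart..inputEnd in the input was rewritten to
// outputStart..outputEnd in the output.
final class Replacement {
    final int inputStart, inputEnd;
    final int outputStart, outputEnd;
    Replacement(int is, int ie, int os, int oe) {
        inputStart = is; inputEnd = ie; outputStart = os; outputEnd = oe;
    }
}

final class TwoWayCorrector {
    private final Replacement[] repls; // sorted by outputStart

    TwoWayCorrector(Replacement... repls) { this.repls = repls; }

    /** Correct an END offset, using the stored input start of each replacement. */
    int correctEnd(int off) {
        Replacement last = null;
        for (Replacement r : repls) {
            if (r.outputStart <= off) last = r;
        }
        if (last == null) return off;                    // no change before this point
        if (off == last.outputStart) return last.inputStart; // token ended before the replacement
        return off + (last.inputEnd - last.outputEnd);   // token covers the replacement
    }

    public static void main(String[] args) {
        // Ex1: "ABC)))" with ")" erased -> "ABC"; the ")))" replacement
        // maps input [3,6) to output [3,3).
        TwoWayCorrector ex1 = new TwoWayCorrector(new Replacement(3, 6, 3, 3));
        System.out.println(ex1.correctEnd(3)); // 3, not 6

        // Ex2: "cccc" -> "cc"; the replacement maps input [0,4) to output [0,2).
        TwoWayCorrector ex2 = new TwoWayCorrector(new Replacement(0, 4, 0, 2));
        System.out.println(ex2.correctEnd(2)); // 4 (still correct)
    }
}
```

In Ex1 the token's end coincides with the replacement's output start, so the 
end offset snaps back to the replacement's input start (3); in Ex2 the token 
extends past the replacement's output start, so the plain cumulative diff 
still applies (4).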

Problem 2:
{code}
Input : A space ( C
Output :  A space C
{code}
When the Tokenizer asks to correct offset 3 (which is C in the output), that 
offset relates to offsets 3 and 4 in the input, and CharFilter corrects it to 
4 (the end of the range, which is correct). But in this example the 
replacement also happens right at the point being corrected, so the correct 
rule for startOffset differs from the one for endOffset.

The root of the problem is that we map N -> 1 and then ask for an inverse 
mapping of 1 -> 1.

[~dsmiley] I will look at LUCENE-5734 and try to fix that bug.




[jira] [Commented] (LUCENE-6595) CharFilter offsets correction is wonky

2015-06-21 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595169#comment-14595169
 ] 

David Smiley commented on LUCENE-6595:
--

Cao,
Mike's last words were "I'm not sure what to do here..."  Could you please 
describe how you fixed this?  And do you agree this issue is the same as 
LUCENE-5734?

