[ https://issues.apache.org/jira/browse/LUCENE-7729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amrit Sarkar updated LUCENE-7729: --------------------------------- Attachment: LUCENE-7729.patch Thank you David for looking into this. Updated: LUCENE-7729.patch bq. One issue with the implementation I see is that if it starts to find a match but ultimately doesn't, then the position is not reset back to the start (plus 1). This means hypothetically a string separator of ab would fail to find the substring in the input aab. I didn't try with your patch but do you concur? It is taken care in the following section in the original patch: CustomSeparatorBreakIterator::advanceForward()::72 CustomSeparatorBreakIterator::advanceBackward()::121 {code} if(sep_index != separator.length() - 1) { // separator len > 1 sep_index = separator.length() - 1; if(c == separator.charAt(sep_index)){ //check the current token match with last element sep_index --; } } {code} {code} if(sep_index != 0) { //separator len > 0 sep_index = 0; if (c == separator.charAt(sep_index)) { //check the current token match with first element sep_index ++; } } {code} I have added relevant test cases to prove the same: TestCustomSeparatorBreakIterator::testFollowingPrecedingBreakOnCustomSeparator::100 {code}separator = "xz";{code} bq. I'm a little concerned about possible overhead for this mode. Maybe subclassing to override advanceForward and advanceBackward would make sense. If this were an inner class to do the string, then a factory method instead of constructor could be used. I think CustomSeparatorBreakIterator should continue to accept a single char constructor arg; I imagine most uses of this would remain to be one character. I am not able to find an overhead for this implementation. String of length>0 is acceptable which is kind of better than just single char, no? I understand the most use cases will not demand more than single char, that's why we specially have whitespace, but having string arg as default brings no harm as internally char-by-char matching is done. Thank you for the valuable coding standard tips too. Ishan corrected me on above stated points on other JIRA and it slipped my mind that I already attached a patch for this one with improper indentation and style. I will take care of this in future for sure. > Support for string type separator for CustomSeparatorBreakIterator > ------------------------------------------------------------------ > > Key: LUCENE-7729 > URL: https://issues.apache.org/jira/browse/LUCENE-7729 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter > Reporter: Amrit Sarkar > Attachments: LUCENE-7729.patch, LUCENE-7729.patch > > > LUCENE-6485: currently CustomSeparatorBreakIterator breaks the text when the > _char_ passed is found. > Improved CustomSeparatorBreakIterator; as it now supports separator of string > type of arbitrary length. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org