Hi,
The Lucene fragmenter cuts fragments off in mid-sentence. I couldn't find
a solution for that, so I wrote a quick and dirty one myself: a
SentenceFragmenter that starts a new fragment when it encounters a '.', '?'
or '!'. Because the fragmenter only receives tokens (words) and not
punctuation marks, it has to keep a reference to the whole text and look the
punctuation up there. For the same reason I also had to change the Lucene
highlighter: in the getBestTextFragments method, the fragment boundary is
now newText.length() + 1, so that the punctuation mark is also shown in the
search result. This works more or less, but it's not pretty. Does anyone
know of an existing solution or have ideas for a cleaner solution?
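
To make the rule concrete, here is a minimal standalone sketch of the
boundary decision with no Lucene types involved. The class name
SentenceBoundarySketch, the method names and the regex-based word scan in
main are made up for illustration only; the real fragmenter gets its offsets
from Lucene tokens instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Standalone illustration of the fragment-boundary rule (hypothetical names). */
final class SentenceBoundarySketch {

    static boolean isSentenceEnd(char c) {
        return c == '.' || c == '?' || c == '!';
    }

    /** Checks the characters 2 to 4 positions before the token, skipping invalid indexes. */
    static boolean endsSentenceBefore(String text, int tokenStart) {
        for (int back = 2; back <= 4; back++) {
            int i = tokenStart - back;
            if (i >= 0 && i < text.length() && isSentenceEnd(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    /**
     * A new fragment starts at a sentence end once the current fragment is at
     * least half full, or unconditionally once the full size is reached.
     */
    static boolean isNewFragment(String text, int tokenStart, int tokenEnd,
                                 int fragmentSize, int currentNumFrags) {
        boolean pastHalf = tokenEnd >= fragmentSize * (currentNumFrags - 1) + fragmentSize / 2;
        boolean pastFull = tokenEnd >= fragmentSize * currentNumFrags;
        return (pastHalf && endsSentenceBefore(text, tokenStart)) || pastFull;
    }

    public static void main(String[] args) {
        String text = "One two. Three four five.";
        int fragmentSize = 10;
        int currentNumFrags = 1;
        // Stand-in for a token stream: every run of word characters is a "token".
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            boolean newFrag = isNewFragment(text, m.start(), m.end(), fragmentSize, currentNumFrags);
            if (newFrag) {
                currentNumFrags++;
            }
            System.out.println(m.group() + " -> new fragment: " + newFrag);
        }
    }
}
```

The index guard in endsSentenceBefore matters for tokens near the start of
the text, where a fixed look-back would otherwise produce a negative index.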

This is the part I changed in the getBestTextFragments(TokenStream, String,
int, String) method:
if (textFragmenter.isNewFragment(token))
{
        currentFrag.setScore(fragmentScorer.getFragmentScore());
        // +1 to get the punctuation mark
        currentFrag.textEndPos = newText.length() + 1;
        currentFrag = new TextFragment(newText, newText.length() + 1, docFrags.size());
        fragmentScorer.startFragment(currentFrag);
        docFrags.add(currentFrag);
}


This is the SentenceFragmenter:

package be.smartlounge.lucene.util;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.search.highlight.Fragmenter;


public class SentenceFragmenter implements Fragmenter {
        
        private static final int DEFAULT_FRAGMENT_SIZE = 300;
        private int currentNumFrags;
        private int fragmentSize;
        private String text;
        


        public SentenceFragmenter()
        {
                this(DEFAULT_FRAGMENT_SIZE);
        }


        
        public SentenceFragmenter(int fragmentSize)
        {
                this.fragmentSize=fragmentSize;
        }

        
        public void start(String originalText)
        {
                setText(originalText);
                currentNumFrags=1;
        }

        public boolean isCriticalChar (char c) {
                return (c == '.'|| c == '?' || c == '!');
        }

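        public boolean isNewFragment(Token token)
        {
                // The tokenizer drops punctuation, so look in the original text at
                // the characters just before this token for a sentence-ending mark.
                int start = token.startOffset();
                boolean afterSentenceEnd = isCriticalCharAt(start - 2)
                                || isCriticalCharAt(start - 3)
                                || isCriticalCharAt(start - 4);

                // Break at a sentence end once the current fragment is at least half
                // full, or unconditionally once it reaches the full fragment size.
                boolean pastHalf = token.endOffset() >=
                                (fragmentSize * (currentNumFrags - 1) + (fragmentSize / 2));
                boolean pastFull = token.endOffset() >= (fragmentSize * currentNumFrags);

                boolean isNewFrag = (pastHalf && afterSentenceEnd) || pastFull;
                if (isNewFrag)
                {
                        currentNumFrags++;
                }
                return isNewFrag;
        }

        // Bounds-checked lookup: for tokens near the start of the text a plain
        // charAt() would throw a StringIndexOutOfBoundsException.
        private boolean isCriticalCharAt(int index)
        {
                return index >= 0 && index < text.length()
                                && isCriticalChar(text.charAt(index));
        }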


        public int getFragmentSize()
        {
                return fragmentSize;
        }


        public void setFragmentSize(int size)
        {
                fragmentSize = size;
        }
        
        public String getText() {
                return text;
        }
        
        public void setText(String newText) {
                text = newText;
        }
        

}
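
One direction that might be cleaner, sketched here as an untested idea
rather than something I have wired into the highlighter: the JDK's
java.text.BreakIterator can locate sentence boundaries in the original text
directly, which would replace the hand-rolled look-back for punctuation. The
class name SentenceOffsetsSketch and its methods are hypothetical, and
Locale.ENGLISH is an assumption about the indexed content.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: sentence boundaries via the JDK's BreakIterator. */
final class SentenceOffsetsSketch {

    /** Returns the end offset (exclusive) of every sentence in the text. */
    static List<Integer> sentenceEnds(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> ends = new ArrayList<>();
        it.first();
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            ends.add(end);
        }
        return ends;
    }

    /** True if a token starting at tokenStart begins a new sentence. */
    static boolean startsSentence(String text, int tokenStart) {
        if (tokenStart < 0 || tokenStart > text.length()) {
            return false; // isBoundary() rejects offsets outside the text
        }
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        return it.isBoundary(tokenStart);
    }

    public static void main(String[] args) {
        String text = "One two. Three four five.";
        System.out.println("sentence ends: " + sentenceEnds(text));
        System.out.println("offset 9 starts a sentence: " + startsSentence(text, 9));
    }
}
```

A fragmenter could call something like startsSentence with
token.startOffset() in place of the three charAt() checks. It would still
need the size thresholds, and creating a BreakIterator per call as above is
only for brevity; a real version would create it once in start().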
-- 
View this message in context: 
http://www.nabble.com/Fragmenter-ending-with-full-sentences-tp14690255p14690255.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

