Hi,

The Lucene fragmenter cuts fragments off in mid-sentence. I couldn't find a solution for that, so I wrote a quick and dirty one myself: a SentenceFragmenter that starts a new fragment when it encounters a '.', '?' or '!'. Because a Fragmenter only receives tokens (words) and not punctuation marks, it has to keep a reference to the whole text and look at the characters just before each token. For the same reason I also had to change the Lucene Highlighter: in the getBestTextFragments method, the fragment boundary is now set at newText.length() + 1, so that the punctuation mark is also shown in the search result.

This works more or less, but it's not pretty. Does anyone know of an existing solution, or have ideas for a cleaner one?
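To make the break rule easier to try out in isolation, here is a minimal standalone sketch of it with no Lucene dependency. The class name SentenceBreakSketch, the regex that stands in for the analyzer, the bounds-checked isCriticalCharAt helper and the sample text are all assumptions made up for this illustration; only the rule itself (break after a sentence end once the fragment is half full, or unconditionally once it is full) is what the SentenceFragmenter does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone illustration of the break rule; no Lucene classes are involved.
class SentenceBreakSketch {

    static boolean isCriticalChar(char c) {
        return c == '.' || c == '?' || c == '!';
    }

    // True if index is inside the text and the character there ends a sentence.
    static boolean isCriticalCharAt(String text, int index) {
        return index >= 0 && index < text.length() && isCriticalChar(text.charAt(index));
    }

    // Returns the start offsets of the words at which a new fragment would begin.
    static List<Integer> breakOffsets(String text, int fragmentSize) {
        List<Integer> breaks = new ArrayList<>();
        int currentNumFrags = 1;
        // Assumption: every run of letters or digits is one token (stand-in for the analyzer).
        Matcher words = Pattern.compile("[\\p{L}\\p{N}]+").matcher(text);
        while (words.find()) {
            int start = words.start();
            int end = words.end();
            // Same look-behind as the fragmenter: 2, 3 and 4 characters before the token.
            boolean afterSentenceEnd = isCriticalCharAt(text, start - 2)
                    || isCriticalCharAt(text, start - 3)
                    || isCriticalCharAt(text, start - 4);
            boolean halfFull = end >= fragmentSize * (currentNumFrags - 1) + fragmentSize / 2;
            boolean full = end >= fragmentSize * currentNumFrags;
            if ((halfFull && afterSentenceEnd) || full) {
                currentNumFrags++;
                breaks.add(start);
            }
        }
        return breaks;
    }

    public static void main(String[] args) {
        String text = "Lucene is a search library. It is written in Java. "
                + "Fragments should end at a full stop.";
        System.out.println(breakOffsets(text, 40)); // prints [28, 51]
    }
}
```

With a fragment size of 40, the sample text breaks at offsets 28 ("It is written ...") and 51 ("Fragments should ..."), i.e. at the first word of each following sentence. Note that the thresholds are computed from a fixed grid of fragmentSize, not from where the previous fragment actually started, which is one of the things that makes this quick and dirty.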
This is the part I changed in the getBestTextFragments(TokenStream, String, int, String) method:

    if (textFragmenter.isNewFragment(token))
    {
        currentFrag.setScore(fragmentScorer.getFragmentScore());
        currentFrag.textEndPos = newText.length() + 1; // +1 to get the punctuation mark
        currentFrag = new TextFragment(newText, newText.length() + 1, docFrags.size());
        fragmentScorer.startFragment(currentFrag);
        docFrags.add(currentFrag);
    }

This is the SentenceFragmenter:

    package be.smartlounge.lucene.util;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.search.highlight.Fragmenter;

    public class SentenceFragmenter implements Fragmenter
    {
        private static final int DEFAULT_FRAGMENT_SIZE = 300;

        private int currentNumFrags;
        private int fragmentSize;
        private String text;

        public SentenceFragmenter()
        {
            this(DEFAULT_FRAGMENT_SIZE);
        }

        public SentenceFragmenter(int fragmentSize)
        {
            this.fragmentSize = fragmentSize;
        }

        public void start(String originalText)
        {
            setText(originalText);
            currentNumFrags = 1;
        }

        public boolean isCriticalChar(char c)
        {
            return (c == '.' || c == '?' || c == '!');
        }

        // Bounds-checked lookup, so tokens near the start of the text
        // do not cause a StringIndexOutOfBoundsException.
        private boolean isCriticalCharAt(int index)
        {
            return index >= 0 && index < text.length() && isCriticalChar(text.charAt(index));
        }

        public boolean isNewFragment(Token token)
        {
            int start = token.startOffset();
            // Look 2, 3 and 4 characters before the token for a sentence end.
            boolean afterSentenceEnd = isCriticalCharAt(start - 2)
                || isCriticalCharAt(start - 3)
                || isCriticalCharAt(start - 4);
            // Break after a sentence end once the fragment is at least half full,
            // or unconditionally once it has reached the full fragment size.
            boolean isNewFrag =
                ((token.endOffset() >= (fragmentSize * (currentNumFrags - 1) + (fragmentSize / 2))
                    && afterSentenceEnd)
                || (token.endOffset() >= (fragmentSize * currentNumFrags)));
            if (isNewFrag)
            {
                currentNumFrags++;
            }
            return isNewFrag;
        }

        public int getFragmentSize()
        {
            return fragmentSize;
        }

        public void setFragmentSize(int size)
        {
            fragmentSize = size;
        }

        public String getText()
        {
            return text;
        }

        public void setText(String newText)
        {
            text = newText;
        }
    }

--
View this message in context: http://www.nabble.com/Fragmenter-ending-with-full-sentences-tp14690255p14690255.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.