RE: lucene 4.10.4 punctuation

Trevor Nicholls Thu, 26 Aug 2021 03:31:55 -0700

Hi

You want to write your own analyzer, one which does not lowercase terms and 
which splits terms at non-alphanumeric characters, and use that same analyzer 
for both indexing and searching. Then when building the index, S.O.S is 
indexed as the five terms S . O . S, and a search for S.O.S becomes a search 
for those five consecutive terms. If you don't split terms like this, a word 
at the end of a sentence is indexed with its trailing punctuation attached 
("from." rather than "from"), so it is indexed separately from the same word 
within a sentence.


So something like the following, which is adapted from an application I have 
running here (note that I'm using Lucene 8.6.3, so you will need to make the 
appropriate adjustments for 4.10.4):
 
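import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class MyAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Split on whitespace only, then let MyTokenFilter break each token
        // at punctuation. Nothing lowercases, so matching is case-sensitive.
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        TokenStream result = new MyTokenFilter(src);
        return new TokenStreamComponents(src, result);
    }
}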

and

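import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.AttributeSource;

public class MyTokenFilter extends TokenFilter {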
    private final CharTermAttribute termAttr;
    private final PositionIncrementAttribute posIncAttr;
    private final ArrayList<String> termStack;
    private AttributeSource.State current;
    private final TypeAttribute typeAttr;

    public MyTokenFilter(TokenStream tokenStream) {
        super(tokenStream);
        termStack = new ArrayList<>();
        termAttr = addAttribute(CharTermAttribute.class);
        posIncAttr = addAttribute(PositionIncrementAttribute.class);
        typeAttr = addAttribute(TypeAttribute.class);
    }
    
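    @Override
    public void reset() throws IOException {
        super.reset();
        // Drop any parts left over from a previous, partially consumed stream.
        termStack.clear();
        current = null;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Refill the stack from the next whitespace-delimited token once all
        // parts of the previous one have been emitted.
        if (termStack.isEmpty() && input.incrementToken()) {
            final String currentTerm = termAttr.toString();
            if (!currentTerm.isEmpty()) {
                termStack.addAll(Arrays.asList(myTokens(currentTerm)));
                current = captureState();
            }
        }

        if (termStack.isEmpty()) {
            return false;
        }

        // Emit the next part as its own term at its own position. The other
        // attributes (e.g. offsets) are those of the whole original token.
        String part = termStack.remove(0);
        restoreState(current);
        termAttr.setEmpty().append(part);
        posIncAttr.setPositionIncrement(1);
        return true;
    }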
    
    // Splits one whitespace-free token into runs of letters, digits and '_',
    // plus one single-character term for every other character.
    public static String[] myTokens(String t) {
        List<String> tokenlist = new ArrayList<>();
        StringBuilder next = new StringBuilder();

        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            // Compare chars directly: "_".equals(c) is always false.
            if (Character.isLetterOrDigit(c) || c == '_') {
                next.append(c);
            } else {
                // A non-word character ends the current run, if any.
                if (next.length() > 0) {
                    tokenlist.add(next.toString());
                    next.setLength(0);
                }
                // Whitespace shouldn't be possible here because the input
                // stream has been tokenized on whitespace; skip it if seen.
                if (!Character.isWhitespace(c)) {
                    tokenlist.add(String.valueOf(c));
                }
            }
        }
        if (next.length() > 0) {
            tokenlist.add(next.toString());
        }
        return tokenlist.toArray(new String[0]);
    }
}
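
As a rough, plain-JDK illustration of why this tokenization meets both 
requirements in the question quoted below, the sketch here mimics the 
splitting and then checks whether the query's terms occur consecutively in a 
text, which is approximately what a phrase search over the index does. The 
names PunctuationMatchSketch, split and containsPhrase are made up for this 
example and are not part of Lucene; a real search would go through the 
analyzer and a phrase query instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: hypothetical helper, no Lucene involved.
class PunctuationMatchSketch {

    // Mimics whitespace tokenization followed by the splitting filter:
    // letters, digits and '_' form word terms, every other non-whitespace
    // character becomes a one-character term. Case is preserved.
    static List<String> split(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                word.append(c);
                continue;
            }
            if (word.length() > 0) {
                terms.add(word.toString());
                word.setLength(0);
            }
            if (!Character.isWhitespace(c)) {
                terms.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            terms.add(word.toString());
        }
        return terms;
    }

    // True when the query's terms occur consecutively, in order, among the
    // text's terms.
    static boolean containsPhrase(String text, String query) {
        List<String> q = split(query);
        return !q.isEmpty()
                && Collections.indexOfSubList(split(text), q) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(split("S.O.S")); // [S, ., O, ., S]
        System.out.println(containsPhrase(
                "they sent an S.O.S from", "S.O.S")); // true
        System.out.println(containsPhrase(
                "she wrote SOS, but she meant Soz", "S.O.S")); // false
        System.out.println(containsPhrase(
                "mike says hello to bob", "Hello")); // false
    }
}
```

Because case is never folded, "Hello" matches only "Hello said bob" among the 
four example texts in the question.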

Cheers
T

-----Original Message-----
From: Younes Bahloul <[email protected]> 
Sent: Thursday, 26 August 2021 22:07
To: [email protected]
Subject: Re: lucene 4.10.4 punctuation

Hi, thanks for getting back to me so quickly. To give some context, there are 
two things we would like to be able to do:

1. We want to have the option to be able to search on terms that include 
punctuation. So for example, if we have the two texts: "they sent an S.O.S 
from", and "she wrote SOS, but she meant Soz", the user may want to search for 
the acronym of 'Save our Souls', which would be "S.O.S", and in this instance 
they only want to match the first text, i.e. "they sent an S.O.S from", and not 
the second.

2. We want to have the option to make our searches case-sensitive. By default, 
I think in Lucene with the StandardAnalyzer everything is converted to 
lower-case at both index and search time. Instead we want upper/lower case to 
be important, so that for example the texts "Hello said bob", "mike says hello 
to bob", "The project HeLlO" and "HELLO is the acronym for" are all different 
texts, and if the user were to search for "Hello" they would only match one of 
the texts, i.e. "Hello said bob".

Does that help?

On Wed, 25 Aug 2021 at 18:43, Uwe Schindler <[email protected]> wrote:

> Hi,
>
> you should explain to us what exactly you want to do: how do you want 
> to search, and what do your documents look like? Why is it important to 
> match on punctuation, and what should this matching look like?
>
> Uwe
>
> -----
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: [email protected]
>
> > -----Original Message-----
> > From: Younes Bahloul <[email protected]>
> > Sent: Wednesday, August 25, 2021 6:34 PM
> > To: [email protected]
> > Subject: lucene 4.10.4 punctuation
> >
> > Hello,
> > I'm part of a team that maintains
> > http://exist-db.org/exist/apps/homepage/index.html
> > It's an open-source XML database,
> > and we use Lucene 4.10.4.
> > I'm trying to introduce punctuation into the search feature. Is there an 
> > analyzer that provides that, or a way to do it in the 4.10.4 API?
> >
> > Thanks, Younes
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


