Hi
You'll want to write your own analyzer which does not lowercase terms and which
splits terms at every non-alphanumeric character, keeping each punctuation
character as a term of its own. You'd use the same analyzer for indexing and
for searching. Thus when building the index S.O.S is indexed as the five terms
S . O . S and if you search for S.O.S you search for the five consecutive terms
S . O . S. If you don't split terms like this then words at the end of
sentences will be indexed separately from the same word within a sentence
(e.g. "bob." and "bob" would end up as two different terms).
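To make the end-of-sentence point concrete, here is a small stand-alone sketch
in plain Java (no Lucene involved). The class name SplitDemo, the method names
and the regex are just my own illustration and only roughly approximate the
splitting rule described above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitDemo {
    // Either a run of letters/digits/underscores, or one single
    // non-whitespace character that is none of those (i.e. punctuation).
    private static final Pattern TERM =
            Pattern.compile("[\\p{L}\\p{N}_]+|[^\\s\\p{L}\\p{N}_]");

    // Whitespace-only splitting: punctuation stays glued to the word.
    public static List<String> whitespaceTerms(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Case-preserving split where every punctuation character is its own term.
    public static List<String> splitTerms(String text) {
        List<String> terms = new ArrayList<>();
        Matcher m = TERM.matcher(text);
        while (m.find()) {
            terms.add(m.group());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTerms("hello to bob.")); // [hello, to, bob.]
        System.out.println(splitTerms("hello to bob."));      // [hello, to, bob, .]
        System.out.println(splitTerms("an S.O.S from"));      // [an, S, ., O, ., S, from]
    }
}
```

With whitespace-only splitting a search for bob would miss the "bob." at the
end of the sentence; with the split version both occurrences share the term bob.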
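So something like the following, which is adapted from an application I have
running here (note that I'm using Lucene 8.6.3, so you will need to make the
appropriate adjustments for 4.10.4; for example, in 4.x createComponents also
takes a Reader, which is passed on to the tokenizer):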
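import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenize on whitespace only, so case and punctuation are preserved
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        // Then split each token into words and single punctuation characters
        TokenStream result = new MyTokenFilter(src);
        return new TokenStreamComponents(src, result);
    }
}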
and
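import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.AttributeSource;

public class MyTokenFilter extends TokenFilter {
    private final CharTermAttribute termAttr;
    private final PositionIncrementAttribute posIncAttr;
    private final TypeAttribute typeAttr;
    // Parts of the current input token that have not been emitted yet
    private final ArrayList<String> termStack;
    private AttributeSource.State current;

    public MyTokenFilter(TokenStream tokenStream) {
        super(tokenStream);
        termStack = new ArrayList<>();
        termAttr = addAttribute(CharTermAttribute.class);
        posIncAttr = addAttribute(PositionIncrementAttribute.class);
        typeAttr = addAttribute(TypeAttribute.class);
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        // Drop any parts left over from a previous use of the stream
        termStack.clear();
        current = null;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Nothing pending: pull the next whitespace-delimited token and split it
        if (termStack.isEmpty() && input.incrementToken()) {
            final String currentTerm = termAttr.toString();
            if (termAttr.length() > 0) {
                termStack.addAll(Arrays.asList(myTokens(currentTerm)));
                current = captureState();
            }
        }
        // Emit the pending parts one at a time, each at its own position
        if (!termStack.isEmpty()) {
            String part = termStack.remove(0);
            restoreState(current);
            termAttr.setEmpty().append(part);
            posIncAttr.setPositionIncrement(1);
            return true;
        }
        return false;
    }
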
    // Splits a whitespace-free token into runs of letters/digits/underscores
    // and single punctuation characters, so "S.O.S" becomes S . O . S
    public static String[] myTokens(String t) {
        List<String> tokenlist = new ArrayList<>();
        StringBuilder next = new StringBuilder();
        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                next.append(c);
            }
            else {
                // End of a word: flush what has been collected so far
                if (next.length() > 0) {
                    tokenlist.add(next.toString());
                    next.setLength(0);
                }
                if (Character.isWhitespace(c)) {
                    // shouldn't be possible because the input stream
                    // has been tokenized on whitespace
                }
                else {
                    // Each punctuation character becomes a term of its own
                    tokenlist.add(String.valueOf(c));
                }
            }
        }
        // Flush the trailing word, if any
        if (next.length() > 0) {
            tokenlist.add(next.toString());
        }
        return tokenlist.toArray(new String[0]);
    }
}
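
To see what this buys you at search time, here is a rough stand-alone sketch,
again in plain Java rather than Lucene. PhraseDemo, terms and containsPhrase
are hypothetical names of mine, and the regex only approximates myTokens. It
splits the text and the query the same way and then checks whether the query
terms occur at consecutive positions, which is roughly what a phrase search
does against the index:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhraseDemo {
    // A run of letters/digits/underscores, or one single punctuation character.
    private static final Pattern TERM =
            Pattern.compile("[\\p{L}\\p{N}_]+|[^\\s\\p{L}\\p{N}_]");

    // Case-preserving split; the same rule is applied to text and query.
    public static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TERM.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // True if the query's terms appear as consecutive terms in the text.
    public static boolean containsPhrase(String text, String query) {
        return Collections.indexOfSubList(terms(text), terms(query)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsPhrase("they sent an S.O.S from", "S.O.S"));          // true
        System.out.println(containsPhrase("she wrote SOS, but she meant Soz", "S.O.S")); // false
        System.out.println(containsPhrase("Hello said bob", "Hello"));                   // true
        System.out.println(containsPhrase("mike says hello to bob", "Hello"));           // false
    }
}
```

Because nothing is lowercased and the punctuation is kept as terms, S.O.S
matches only the first text and Hello only matches "Hello said bob".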
Cheers
T
-----Original Message-----
From: Younes Bahloul <[email protected]>
Sent: Thursday, 26 August 2021 22:07
To: [email protected]
Subject: Re: lucene 4.10.4 punctuation
Hi, thanks for getting back to me so quickly. To give some context, there are
two things we would like to be able to do:
1. We want to have the option to be able to search on terms that include
punctuation. So for example, if we have the two texts: "they sent an S.O.S
from", and "she wrote SOS, but she meant Soz", the user may want to search for
the acronym of 'Save our Souls', which would be "S.O.S", and in this instance
they only want to match the first text, i.e. "they sent an S.O.S from", and not
the second.
2. We want to have the option to make our searches case-sensitive. By default,
I think in Lucene with the StandardAnalyzer everything is converted to
lower-case at both index and search time. Instead we want upper/lower case to
be important, so that for example the texts "Hello said bob", "mike says hello
to bob", "The project HeLlO" and "HELLO is the acronym for" are all different
texts, and if the user were to search for "Hello" they would only match one of
the texts, i.e. "Hello said bob".
Does that help?
On Wed, 25 Aug 2021 at 18:43, Uwe Schindler <[email protected]> wrote:
> Hi,
>
> you should explain to us what exactly you want to do: How do you want
> to search, and what do your documents look like? Why is it important to
> match on punctuation, and what should this matching look like?
>
> Uwe
>
> -----
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: [email protected]
>
> > -----Original Message-----
> > From: Younes Bahloul <[email protected]>
> > Sent: Wednesday, August 25, 2021 6:34 PM
> > To: [email protected]
> > Subject: lucene 4.10.4 punctuation
> >
> > Hello,
> > I'm part of a team that maintains
> > http://exist-db.org/exist/apps/homepage/index.html
> > It's an open source XML database
> > and we use Lucene 4.10.4.
> > I'm trying to introduce punctuation into our search feature. Is there an
> > analyzer that provides that, or a way to do it with the 4.10.4 API?
> >
> > thanks Younes
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>