Erik Hatcher wrote:
> Rather than changing StandardAnalyzer, you could create a custom
> Analyzer that is something along the lines of StandardTokenizer ->
> custom apostrophe splitting filter -> ISOLatinFilter.
Why not do that in the FrenchStemFilter "next()" method itself?
Would that be bad design?
I'm also quite concerned about performance, but it seems to me that
your solution would only affect tokens of type "APOSTROPHE", so the
overhead would be negligible, right?
> You get a special type for words with interior apostrophes from
> StandardTokenizer (look at StandardFilter to see how that works). You
> could create a simple TokenFilter that splits apostrophe'd tokens
> into two.
I'm not sure how to do that efficiently. Is it something like
this?
<code>
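private Stack subTokens; // previously initialized

public final Token next() throws IOException {
    // Emit any parts left over from a previous split first.
    if (!subTokens.isEmpty()) {
        return (Token) subTokens.pop();
    }
    Token t = input.next();
    if (t != null && APOSTROPHE_TYPE.equals(t.type())) {
        // Pushes the parts onto the stack only if the token gets split.
        tokenizeApostrophe(t, subTokens);
        if (!subTokens.isEmpty()) {
            // Return the first part instead of the unsplit token.
            t = (Token) subTokens.pop();
        }
    }
    return t;
}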
</code>
where "tokenizeApostrophe(Token, Stack)" splits the token in two when
certain conditions are met, and pushes the parts onto the stack.
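For illustration only, here is a minimal sketch of that splitting step.
It works on plain strings rather than Lucene Token objects so that it
runs on its own; the class name ApostropheSplitter, the method
splitApostrophe and the "interior apostrophe only" rule are assumptions
of mine, not anything from Lucene.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ApostropheSplitter {

    /**
     * Splits a term at its first interior apostrophe and pushes both parts
     * so that the part before the apostrophe is popped first.
     * Returns false, and pushes nothing, when there is nothing to split.
     */
    public static boolean splitApostrophe(String term, Deque<String> subTokens) {
        int pos = term.indexOf('\'');
        // Only split on an interior apostrophe: not the first or last character.
        if (pos <= 0 || pos == term.length() - 1) {
            return false;
        }
        // A stack is last-in, first-out, so push the second part first.
        subTokens.push(term.substring(pos + 1));
        subTokens.push(term.substring(0, pos));
        return true;
    }

    public static void main(String[] args) {
        Deque<String> subTokens = new ArrayDeque<>();
        splitApostrophe("l'avion", subTokens);
        while (!subTokens.isEmpty()) {
            System.out.println(subTokens.pop());
        }
    }
}
```

A real filter would also have to build new Token objects with adjusted
start and end offsets, which this sketch leaves out.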
> Maybe it's simple enough also to expand "j" and "l" into "je" and
> "le" in the same step too?
It would be simple, but I'm not sure yet that I want to expand them
back. It may turn out to be useful to index the "j" token after all.
Anyway, thanks for your quick answer,
--
Hugo