On 05/12/2011 22:38, Marvin Humphrey wrote:
Hi, Nick,
Awesome stuff coming through on the new Lucy::Analysis::StandardTokenizer!
On Mon, Dec 05, 2011 at 09:02:42PM -0000, [email protected] wrote:
PolyAnalyzer*
PolyAnalyzer_new(const CharBuf *language, VArray *analyzers) {
@@ -43,7 +43,7 @@ PolyAnalyzer_init(PolyAnalyzer *self, co
     else if (language) {
         self->analyzers = VA_new(3);
         VA_Push(self->analyzers, (Obj*)CaseFolder_new());
-        VA_Push(self->analyzers, (Obj*)RegexTokenizer_new(NULL));
+        VA_Push(self->analyzers, (Obj*)StandardTokenizer_new());
         VA_Push(self->analyzers, (Obj*)SnowStemmer_new(language));
     }
This will cause a backwards compatibility break. I really want to make your
StandardTokenizer the default, but I think we might want to go about it
differently.
I made that change mainly to see if the test suite breaks (and it
didn't). I plan to revert it before committing StandardTokenizer to trunk.
How about we leave PolyAnalyzer alone, but add a new class called
"EasyAnalyzer", with the following default stack:
1. StandardTokenizer
2. Normalizer
3. SnowballStemmer
This integrates both of your recent contributions, and it changes the order to
avoid the Highlighter problems you identified and to be more in line with the
potential refactoring you talked about.
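To make the proposed ordering concrete, here is a rough, self-contained sketch of a tokenize-first pipeline. It does not use the Lucy API: `TokenList`, `toy_tokenize`, `toy_normalize`, `toy_stem`, and `toy_easy_analyze` are all hypothetical stand-ins, and the "stemmer" only strips a trailing "s". The only point is the stage order (tokenize, then normalize, then stem), as opposed to the case-fold-first order in the PolyAnalyzer diff above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS    16
#define MAX_TOKEN_LEN 32

/* Hypothetical fixed-size token container; not a Lucy type. */
typedef struct {
    char   text[MAX_TOKENS][MAX_TOKEN_LEN];
    size_t count;
} TokenList;

/* Stage 1, stand-in for StandardTokenizer: split on non-alphanumerics.
 * Overlong tokens are truncated; tokens past MAX_TOKENS are dropped. */
void
toy_tokenize(const char *input, TokenList *out) {
    assert(input != NULL && out != NULL);
    out->count = 0;
    size_t len = 0;
    for (const char *p = input; ; p++) {
        if (*p != '\0' && isalnum((unsigned char)*p)) {
            if (len < MAX_TOKEN_LEN - 1) {
                out->text[out->count][len++] = *p;
            }
        }
        else {
            if (len > 0) {
                out->text[out->count][len] = '\0';
                out->count++;
                len = 0;
            }
            if (*p == '\0' || out->count == MAX_TOKENS) { break; }
        }
    }
}

/* Stage 2, stand-in for Normalizer: lowercase each token (ASCII only). */
void
toy_normalize(TokenList *tokens) {
    for (size_t i = 0; i < tokens->count; i++) {
        for (char *c = tokens->text[i]; *c != '\0'; c++) {
            *c = (char)tolower((unsigned char)*c);
        }
    }
}

/* Stage 3, stand-in for the Snowball stemmer: strip one trailing 's'
 * from tokens longer than three characters. Not a real stemmer. */
void
toy_stem(TokenList *tokens) {
    for (size_t i = 0; i < tokens->count; i++) {
        size_t len = strlen(tokens->text[i]);
        if (len > 3 && tokens->text[i][len - 1] == 's') {
            tokens->text[i][len - 1] = '\0';
        }
    }
}

/* Run the three stages in the order proposed for EasyAnalyzer. */
void
toy_easy_analyze(const char *input, TokenList *out) {
    toy_tokenize(input, out);
    toy_normalize(out);
    toy_stem(out);
}
```

Calling `toy_easy_analyze("Running Dogs, BARKS!", &tokens)` leaves three tokens in `tokens`: "running", "dog", and "bark". Because tokenization runs before normalization in this sketch, token boundaries are found on the original, unmodified text.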
It would be nice to benchmark this just to see what sort of performance impact
changing the order has before we finalize it.
If this works out, we can then swap out PolyAnalyzer for EasyAnalyzer
throughout the tutorial and other high-level documentation.
Sounds like a good idea.
Nick