Hi Ted,
On Nov 3, 2009, at 6:37pm, Ted Dunning wrote:
> I would opt for the most specific tokenization that is feasible
> (no stemming, as much compounding as possible).
By "as much compounding as possible", do you mean you want the
tokenizer to do as much splitting as possible, or as little?
E.g. "super-duper" should be left as-is, or turned into "super" and
"duper"?
Is there a particular configuration of Lucene tokenizers that you'd
suggest?
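For concreteness, the kind of minimal chain I'm picturing is below:
whitespace tokenizing plus lowercasing, no stemming, no stop words, so
"super-duper" comes through as a single term. (Just a sketch against
the Lucene 2.9-era API; the class name is mine.)

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Whitespace-only splitting keeps hyphenated compounds intact;
// lowercasing is the only normalization applied.
public class CompoundPreservingAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(new WhitespaceTokenizer(reader));
  }
}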
Thanks,
-- Ken
> The rationale for this is that stemming and uncompounding can be
> added by linear transformations of the matrix at any time.
>
> The only serious issue with this is the problem of overlapping
> compound words.
> On Tue, Nov 3, 2009 at 2:39 PM, Ken Krugler <[email protected]> wrote:
>> I assume there would also be an issue of which tokenizer to use to
>> create the terms from the text.
>>
>> And possibly issues around storing separate vectors for (at least)
>> title vs. content?
>>
>> Anybody have input on either of these?
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g