Hi Ted,
On Nov 3, 2009, at 6:37pm, Ted Dunning wrote:
> I would opt for the most specific tokenization that is feasible
> (no stemming, as much compounding as possible).
By "as much compounding as possible", do you mean you want the
tokenizer to do as much splitting as possible, or as little?
E.g. "super-duper" should be left as-is, or turned into "super" and
"duper"?
Is there a particular configuration of Lucene tokenizers that you'd
suggest?
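For concreteness, the kind of minimal chain I'm picturing is below:
whitespace tokenizing plus lowercasing, no stemming, no stop words, so
"super-duper" comes through as a single term. (Just a sketch against
the Lucene 2.9-era API; the class name is mine.)

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Whitespace-only splitting keeps hyphenated compounds intact;
// lowercasing is the only normalization applied.
public class CompoundPreservingAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(new WhitespaceTokenizer(reader));
  }
}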
Thanks,
-- Ken
> The rationale for this is that stemming and uncompounding can be
> added by linear transformations of the matrix at any time.
>
> The only serious issue with this is the problem of overlapping
> compound words.
> On Tue, Nov 3, 2009 at 2:39 PM, Ken Krugler <[email protected]> wrote:
>> I assume there would also be an issue of which tokenizer to use to
>> create the terms from the text.
>>
>> And possibly issues around storing separate vectors for (at least)
>> title vs. content?
>>
>> Anybody have input on either of these?
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g