Re: what if my database data contains other language (like danish, german).

Chris Collins Mon, 11 May 2009 08:28:40 -0700

Is anyone aware of either of these two things:

1) The ability to plug in an external source for DF. This would allow you to circumvent the problem you mentioned below. (Of course, you would have to compute a DF set for each language you care to have meaningful weights for.)
2) Any open source segmenters, primarily for German, but also for CJK at a long shot :-}
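
To make the "DF set for each language" idea concrete, here is a minimal sketch in plain Python, not Lucene code. It assumes every document already carries a language tag and has been tokenized; the names `build_df_tables` and `idf` are hypothetical, and the smoothed IDF formula is just one common choice.

```python
import math
from collections import Counter, defaultdict


def build_df_tables(docs):
    """Build one document-frequency table per language.

    docs: iterable of (language, tokens) pairs (an assumed input shape).
    Returns (df, n_docs): df[lang][term] is the number of documents in
    that language containing the term, n_docs[lang] is the number of
    documents in that language.
    """
    df = defaultdict(Counter)
    n_docs = Counter()
    for lang, tokens in docs:
        n_docs[lang] += 1
        # set() so that each term is counted at most once per document
        df[lang].update(set(tokens))
    return df, n_docs


def idf(term, lang, df, n_docs):
    """Smoothed IDF computed against the language's own sub-corpus.

    Only meaningful for a language that has at least one document.
    """
    return 1.0 + math.log(n_docs[lang] / (df[lang][term] + 1))


# Tiny made-up corpus, for illustration only
docs = [
    ("da", ["og", "hund"]),
    ("da", ["og", "kat"]),
    ("de", ["und", "hund"]),
]
df, n_docs = build_df_tables(docs)
```

With tables like these, a term such as "og" is weighed against the Danish documents only, so it no longer looks rare merely because Danish is a small share of the whole corpus.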


Thanks

C

On May 11, 2009, at 8:13 AM, Ted Dunning wrote:

Yes. Lucene can handle that. You have to select which stemmer to use. You
may have to improve the German and Danish stemmers a little bit.

You may also have some issues with the fact that if Danish is 5% of your
corpus, then words that occur in 100% of your Danish documents will tend to
have too high weights since they only occur in 5% of your documents. Any
term that occurs in more than 20% of a sub-corpus should generally be
discarded from your query. This can be difficult in multi-lingual
situations.

For a first pass, I would ignore this issue, however.
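
The weighting problem and the 20% rule can be sketched with a few lines of plain Python. This is only an illustration under stated assumptions: the corpus sizes and term counts are made up, `filter_query_terms` is a hypothetical helper rather than a Lucene API, and the unsmoothed `log(N / df)` is used purely to show the effect.

```python
import math


def filter_query_terms(terms, lang_df, lang_n_docs, max_fraction=0.20):
    """Drop query terms that occur in more than max_fraction of the
    documents of one language sub-corpus.

    lang_df: dict mapping term -> document frequency within that language.
    lang_n_docs: number of documents in that language.
    """
    kept = []
    for term in terms:
        fraction = lang_df.get(term, 0) / lang_n_docs if lang_n_docs else 0.0
        if fraction <= max_fraction:
            kept.append(term)
    return kept


# Hypothetical numbers: 1000 documents in total, 50 of them Danish (5%)
total_docs = 1000
danish_docs = 50
danish_df = {"og": 50, "hund": 3}  # "og" occurs in every Danish document

# Against the whole corpus "og" looks rare, so its weight is inflated...
global_idf_og = math.log(total_docs / danish_df["og"])
# ...while against the Danish sub-corpus it carries no information at all.
danish_idf_og = math.log(danish_docs / danish_df["og"])

kept = filter_query_terms(["og", "hund"], danish_df, danish_docs)
```

Here `kept` contains only "hund", because "og" exceeds the 20% threshold within the Danish documents even though it occurs in just 5% of the corpus as a whole.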

On Mon, May 11, 2009 at 4:07 AM, uday kumar maddigatla <[email protected]> wrote:
What if my database data contains other languages (like Danish, German)?
Will Lucene handle that?

If yes, how?
--
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
