On Mon, May 11, 2009 at 8:28 AM, Chris Collins <[email protected]> wrote:
> Is anyone aware of either of the two things:
>
> 1) ability to plug in an external source for DF, this would allow you to
> circumvent the problem you mentioned below. (Of course you would have to
> compute a df set for each language you care to have meaningful weights for).

See
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Query.html#weight(org.apache.lucene.search.Searcher)

The typical idiom is to extend Searcher with a specialized structure that
knows the term frequencies that you want it to know. This is what Katta does
to propagate cluster-global term frequencies to shard-specific searches.
Presumably Solr does likewise.

> 2) any open source segmenters, primarily for german, but also for CJK at a
> longshot :-}

Lucene has a rudimentary German stemmer which may be sufficient. Real lemma
identification in German can be difficult because of the large number of
morphological variants and word compounding. For text retrieval, however,
compounding is your friend, and very simple stemmers typically suffice.

For CJK, the approach that I favor lately is this one:

http://technology.chtsai.org/mmseg/

Basically, it is a longest dictionary match method with the addition that it
picks the next token by choosing the one that starts the longest match over
the next three tokens. This gets rid of the garden-path problems that greedy
algorithms without look-ahead have. It depends a bit on the assumption that
long words in the dictionary have higher frequency than would be expected if
their possible components occurred independently. This means that picking
the longer match in the dictionary is equivalent to doing a more subtle
statistical test. (See here for more details on the stats involved in bigram
detection:
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)

-- 
Ted Dunning, CTO
DeepDyve
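The Searcher-override idiom described above is specific to Lucene's Java API,
but the underlying idea — score terms with document frequencies supplied from
outside the local index, as Katta does with cluster-global statistics — can be
sketched language-agnostically. Everything below (the names `LocalIndex` and
`term_idf`, and the idf formula in the style of Lucene's default similarity)
is illustrative only, not the actual Lucene API:

```python
import math

def idf(doc_freq, num_docs):
    # Inverse document frequency in the style of Lucene's default
    # similarity: log(N / (df + 1)) + 1. Illustrative, not Lucene's code.
    return math.log(num_docs / (doc_freq + 1)) + 1

class LocalIndex:
    """A toy stand-in for one shard's term statistics."""
    def __init__(self, doc_freqs, num_docs):
        self.doc_freqs = doc_freqs   # term -> df within this shard
        self.num_docs = num_docs

def term_idf(index, term, global_stats=None):
    """Use externally supplied (df, corpus size) when provided,
    falling back to the shard-local statistics otherwise."""
    if global_stats is not None:
        df, n = global_stats[term]
    else:
        df, n = index.doc_freqs.get(term, 0), index.num_docs
    return idf(df, n)

shard = LocalIndex({"hund": 3}, num_docs=100)
# Pretend a coordinator computed these across the whole cluster:
global_stats = {"hund": (50_000, 1_000_000)}

local = term_idf(shard, "hund")
clusterwide = term_idf(shard, "hund", global_stats)
# The term is proportionally more common corpus-wide than in this
# shard, so the cluster-wide idf comes out lower than the local one.
```

The point of the idiom is that the scoring formula never changes; only the
source of the (df, N) pair does, which is why overriding the statistics
lookup is enough to get consistent cross-shard (or cross-language) weights.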
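The look-ahead idea behind the MMSEG link above can be sketched in a few
lines: rather than greedily taking the longest single dictionary match,
consider every chunk of up to three consecutive dictionary words starting at
the cursor and commit only the first word of the chunk with the greatest
total length. The dictionary and input below are toy Latin-alphabet
stand-ins for CJK text, and real MMSEG adds further tie-breaking rules not
shown here:

```python
def words_at(text, pos, lexicon, max_len=6):
    """All dictionary words that begin at text[pos]."""
    found = [text[pos:pos + n] for n in range(1, max_len + 1)
             if text[pos:pos + n] in lexicon]
    return found or [text[pos:pos + 1]]  # fall back to a single character

def chunks(text, pos, lexicon, depth=3):
    """All sequences of up to `depth` dictionary words starting at `pos`."""
    if pos >= len(text) or depth == 0:
        return [[]]
    result = []
    for w in words_at(text, pos, lexicon):
        for rest in chunks(text, pos + len(w), lexicon, depth - 1):
            result.append([w] + rest)
    return result

def segment(text, lexicon):
    out, pos = [], 0
    while pos < len(text):
        # Pick the three-word chunk covering the most text...
        best = max(chunks(text, pos, lexicon), key=lambda c: sum(map(len, c)))
        out.append(best[0])   # ...but commit only its first word, then re-scan.
        pos += len(best[0])
    return out

lexicon = {"ab", "abc", "cde", "fg"}
print(segment("abcdefg", lexicon))   # → ['ab', 'cde', 'fg']
```

A greedy segmenter would take "abc" first and then be stuck with the
single-character fallbacks "d" and "e"; the look-ahead sees that starting
with the shorter "ab" lets the next three words cover more of the input,
which is exactly the garden-path avoidance described above.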
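The "more subtle statistical test" mentioned above refers to the
log-likelihood ratio (G²) statistic described in the linked blog post, which
measures how surprising the co-occurrence count of a bigram is relative to
independence. A small sketch of that statistic for a 2x2 contingency table
(the function name `llr_2x2` is mine, not from the post):

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 table of bigram counts:
    k11 = count(A B), k12 = count(A not-B),
    k21 = count(not-A B), k22 = count(not-A not-B)."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22

    def term(k, row, col):
        # 0 * log(0) is taken as 0, per the usual G^2 convention.
        return 0.0 if k == 0 else k * math.log(k * n / (row * col))

    return 2 * (term(k11, row1, col1) + term(k12, row1, col2)
                + term(k21, row2, col1) + term(k22, row2, col2))

# A table whose counts exactly match independence scores (near) zero;
# a strongly associated bigram scores high.
independent = llr_2x2(10, 90, 90, 810)
associated = llr_2x2(100, 10, 10, 880)
```

High G² values flag word pairs that occur together far more often than
chance, which is the same pressure that makes long dictionary entries
informative for the segmentation heuristic above.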
