Markus Fischer a écrit :
Hi,
[Resent: guess I sent the first before I completed my subscription,
just in case it comes up twice ...]
the subject may be a bit weird but I couldn't find a better way to
describe a problem I'm trying to solve.
If I'm not mistaken, one factor of scoring is the distance of the word
within the document and the length of the document. I'm titling my
problem as the cauliflower-problem, but it's related to any "type of"
problem.
When searching for "cauliflower", I get x hits. Now documents in the
top range (pos < 10) are most likely selected by the user.
Unfortunately, although "cauliflower" is in the documents and it's
word position is like around 3000 characters in the document, the
document itself has nothing much to do with "cauliflower" or
"vegetables" or "eating", etc.
Unfortunately, the most relevant documents come much later in the
index (pos >= 10) because the "cauliflower" word is positioned like
around 5000 characters within the document.
Based on the relation on the content, these later documents are much
more appropriate to the search term, because the also deal with
"vegetables" and "eating", etc.
I'm stuck here how I can signal Lucene to boost those later documents,
because frankly I don't know on what. I would probably have to tag the
relation of the document (is-about vegetables, is-about eating) and
also detect that the searched term is-a vegetable. This gets even more
complex with non-single-term queries.
On a related topic, I'm also searching for a way to suggest alternate
spelling of words to the user, when we found a word which is very less
frequent used in the index or not in the index at all. I'm Austrian
based, when I e.g. search for "retthich" (wrong spelled "rettich"
which is radish), Google suggests me the proper spelled word. I'm
searching for a way to figure how to accomplish this, but again this
may be lucene off-topic and/or I should properly start a separate
thread ...
you can use the ngram pattern and levestein distance to find near words.
I try with phonetic and aspell dictionnary.
M.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]