I work with Lucene 2.0. I boost some documents:
Document doc = new Document();
// adding fields
doc.setBoost(2.0f);
indexwriter.addDocument(doc);
If I look at my index with Luke (0.6), the boost value of all documents
is still 1.0.
How can I boost documents?
Thanks. Sören
Peter Bloem wrote:
[...]
+(A B) C D E
[...]
In other words, Lucene considers all documents that
have both A and B, and ranks them higher if they also have C D or E.
Hello Peter,
To my understanding, +(A B) C D E means that at least one of the terms A
or B must be contained, and the terms C, D,
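Peter's rule can be sketched in plain Java (a toy model of the boolean semantics, not Lucene code; the term names A through E are placeholders):

```java
import java.util.Set;

public class BooleanMatchDemo {
    // +(A B) C D E: the required clause (A OR B) decides matching ...
    static boolean matches(Set<String> docTerms) {
        return docTerms.contains("A") || docTerms.contains("B");
    }

    // ... while the optional terms C, D, E only raise the rank of
    // documents that already match.
    static int optionalHits(Set<String> docTerms) {
        int hits = 0;
        for (String t : Set.of("C", "D", "E")) {
            if (docTerms.contains(t)) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(matches(Set.of("A", "C")));           // true
        System.out.println(matches(Set.of("C", "D", "E")));      // false
        System.out.println(optionalHits(Set.of("A", "C", "E"))); // 2
    }
}
```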
bhecht wrote:
I want to be able to split tokens by giving a list of substring words.
So I can give a list of subwords like: strasse, gasse,
And the token mainstrasse or maingasse will be split into two tokens, main
and strasse.
IMBEMBA, PASQUALINO: A Splitter for German Compound Words. Free
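A minimal sketch of the suffix-list idea (plain Java only; in a real analyzer this logic would be wrapped in a custom TokenFilter, and the class name here is made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class SuffixSplitter {
    private final List<String> suffixes;

    public SuffixSplitter(List<String> suffixes) {
        this.suffixes = suffixes;
    }

    // Splits a token into (head, suffix) when it ends with a known subword
    // and something is left over in front; otherwise the token is unchanged.
    public List<String> split(String token) {
        for (String s : suffixes) {
            if (token.endsWith(s) && token.length() > s.length()) {
                List<String> parts = new ArrayList<>();
                parts.add(token.substring(0, token.length() - s.length()));
                parts.add(s);
                return parts;
            }
        }
        return List.of(token);
    }

    public static void main(String[] args) {
        SuffixSplitter sp = new SuffixSplitter(List.of("strasse", "gasse"));
        System.out.println(sp.split("mainstrasse")); // [main, strasse]
        System.out.println(sp.split("maingasse"));   // [main, gasse]
        System.out.println(sp.split("berlin"));      // [berlin]
    }
}
```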
DECAFFMEYER MATHIEU wrote:
The score depends on
1. the query
2. the matched document
3. the index.
I don't really understand why the index must influence the score (why it
has been implemented that way).
The score should be the similarity (inverse distance) between the query
and the matched
DECAFFMEYER MATHIEU wrote:
Both are the same document but in different indexes;
the only difference is that the second index has more documents than the
first one, and the first one contains only that page.
I would like to have the same score as in the second index,
Simply speaking, the score
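The index-dependent part of the score is idf. A small sketch using the classic formula from Lucene's DefaultSimilarity, idf = 1 + ln(numDocs / (docFreq + 1)): the same term in the same document gets a different weight when the surrounding index changes.

```java
public class IdfDemo {
    // idf as in Lucene's DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1))
    static double idf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // Same term (docFreq = 1), two different indexes:
        System.out.println(idf(1, 1));    // index containing only that page
        System.out.println(idf(1000, 1)); // larger index: much larger idf
    }
}
```

This is why the first (one-document) index cannot reproduce the score of the larger one: the term statistics of the whole collection enter the formula.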
Karl Koch wrote:
If I do not misunderstand that extract, I would say it suggests the combination of coordination level matching with IDF. I am interested in your view, and in those of the others who read this.
I understand that sentence:
The natural solution is to correlate a term's matching value with its
Hello Mario,
I had a similar problem a few weeks ago (thread How to get Term Weights
(document term matrix)?, 2006-11-02,
http://www.gossamer-threads.com/lists/lucene/java-user/41726).
I think there is no simple function for creating or accessing a document
term matrix. I extracted the
Soeren Pekrul wrote:
The score for a document is the sum of the term weights w(tf, idf) for
each term it contains. So you already have the combination of
coordination level matching with IDF. Now it is possible that your query
requests three terms A, B and C. Two of them (A and B) are quite
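The combination described above can be sketched as follows (a simplified model, not Lucene's full formula: the sum of tf*idf weights over the matched terms, multiplied by the coordination factor matched/total):

```java
import java.util.Map;
import java.util.Set;

public class CoordIdfDemo {
    // Simplified score: sum of tf * idf over the query terms found in the
    // document, multiplied by the coordination factor matched / total.
    static double score(Set<String> queryTerms,
                        Map<String, Integer> docTf,
                        Map<String, Double> idf) {
        double sum = 0.0;
        int matched = 0;
        for (String t : queryTerms) {
            Integer tf = docTf.get(t);
            if (tf != null) {
                matched++;
                sum += tf * idf.getOrDefault(t, 0.0);
            }
        }
        return ((double) matched / queryTerms.size()) * sum;
    }

    public static void main(String[] args) {
        Set<String> q = Set.of("A", "B", "C");
        Map<String, Double> idf = Map.of("A", 1.0, "B", 1.0, "C", 5.0);
        // Document 1 matches the two common terms A and B,
        // document 2 matches only the rare term C:
        System.out.println(score(q, Map.of("A", 1, "B", 1), idf));
        System.out.println(score(q, Map.of("C", 1), idf));
    }
}
```

With these made-up idf values, the single rare term C outweighs the two common terms A and B despite the lower coordination factor.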
mariolone wrote:
I managed to extract the matrix.
But with large document collections, isn't that too expensive a
solution?
I have a quite small collection with 14,960 documents and 29,828 unique
terms. If I remember right it took a few minutes on a normal laptop
computer to
Hello Karl,
I’m very interested in the details of Lucene’s scoring as well.
Karl Koch wrote:
For this reason, I do not understand why Lucene (in version 1.2) normalises the query(!) with
norm_q : sqrt(sum_t((tf_q*idf_t)^2))
which is also called cosine normalisation. This is a technique that
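The normalisation Karl quotes, written out (a plain-Java sketch; in Lucene this value serves as a divisor for the query term weights):

```java
public class QueryNormDemo {
    // Cosine normalisation of the query, as in the formula above:
    // norm_q = sqrt(sum_t (tf_q * idf_t)^2)
    static double queryNorm(double[] tfq, double[] idf) {
        double sumSq = 0.0;
        for (int i = 0; i < tfq.length; i++) {
            double w = tfq[i] * idf[i];
            sumSq += w * w;
        }
        return Math.sqrt(sumSq);
    }

    public static void main(String[] args) {
        // Three query terms with term frequencies 1, 1, 2 and some idf values:
        System.out.println(queryNorm(new double[]{1, 1, 2},
                                     new double[]{0.5, 2.0, 1.0}));
    }
}
```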
If there were a boost factor for a single keyword (term) at index
time, I would index a class as a document with the keys as keywords and
the values as boost factors. Unfortunately, you can only boost documents
and fields at index time. Single terms can only be boosted at search time
Hello Lisheng,
a search process usually has to do two things. First, it has to find the
term in the index. I don't know the implementation of finding a term in
Lucene. I hope that the index is at least a sorted list or a binary
tree, so it can do a binary search. The time for finding a term depends on
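If the term dictionary is sorted, lookup is logarithmic in the number of unique terms. A toy illustration in plain Java (Lucene's actual term dictionary is more elaborate; this only shows the binary-search idea):

```java
import java.util.Arrays;

public class TermLookupDemo {
    // A sorted term list allows a term to be found in O(log n) comparisons,
    // so lookup time grows with the logarithm of the number of unique terms.
    static int find(String[] sortedTerms, String term) {
        return Arrays.binarySearch(sortedTerms, term);
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "lucene", "query", "score", "term"};
        System.out.println(find(terms, "query"));     // 2
        System.out.println(find(terms, "zebra") < 0); // true (not present)
    }
}
```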
Hello Nils,
how about having one index for all documents with two fields, date and
content? You can search documents for a specific date, and the score
uses the global idf of all documents.
Sören
Nils Höller wrote:
I thought of making the idf function a NOOP, since this is somehow one
of
Hello Van,
it looks like splitting of compound words. This topic was discussed in
the thread Analysis/tokenization of compound words
(http://www.gossamer-threads.com/lists/lucene/java-user/40164?do=post_view_threaded).
The main idea is as follows:
You have a corpus (lexicon/dictionary). You
Chris Hostetter wrote:
that's a pretty specific and not altogether intuitive ranking... can you
elaborate on your actual use case? ... why is B+C better than A+B? .. are
these rules specific to a known list of terms, or is it a general rule
relating to how you parse the user's input?
The
How can I manipulate the score depending on the combination of query
terms contained in the result document? No single term is important
on its own (that could be boosted). Important is the combination of terms.
The user searches for the terms A, B, C and D.
Of course, the document containing all
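One hypothetical way to express this, outside Lucene's own scoring, is to re-rank the hits with a bonus per boosted combination. Everything here (the bonus table, the class and method names) is made up for illustration:

```java
import java.util.Map;
import java.util.Set;

public class ComboBoostDemo {
    // Hypothetical combination bonuses: pairs of terms, not single terms.
    static final Map<Set<String>, Double> COMBO_BOOSTS = Map.of(
        Set.of("B", "C"), 2.0,
        Set.of("A", "B"), 1.2
    );

    // Adds a bonus to a base score for every boosted combination
    // fully contained in the document's matched terms.
    static double rescore(double baseScore, Set<String> matchedTerms) {
        double score = baseScore;
        for (Map.Entry<Set<String>, Double> e : COMBO_BOOSTS.entrySet()) {
            if (matchedTerms.containsAll(e.getKey())) {
                score += e.getValue();
            }
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(rescore(1.0, Set.of("B", "C"))); // 3.0
        System.out.println(rescore(1.0, Set.of("A", "B"))); // 2.2
        System.out.println(rescore(1.0, Set.of("A", "D"))); // 1.0
    }
}
```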
Chris Hostetter wrote:
You really, *REALLY* don't want to be doing this using the Hits class
like in your example ...
1) this will re-execute your search behind the scenes many many times
2) the scores returned by Hits are pseudo-normalized ... they will be
meaningless for any sort of
Chris Hostetter wrote:
I don't really know what a term matrix is, but when you ask about
'weight', is it possible you are just looking for the TermDocs.freq() of the
term/doc pair?
Thank you Chris,
that was also my first idea. I wanted to get the document frequency
Hello,
I would like to extract and store the document term matrix externally. I
iterate the terms and the documents for each term:
TermEnum terms = reader.terms();          // reader: an open IndexReader
while (terms.next()) {
    TermDocs docs = reader.termDocs(terms.term());
    while (docs.next()) {
        // docs.doc() and docs.freq() give one matrix entry
    }
}
Hello Antony,
I have a similar problem. My collection contains mainly German
documents, but some in English and a few in French, Spanish and Latin. I
know that each language has its own stemming rules.
Language detection is not my domain. But I can imagine it could be
possible to detect the