I work with Lucene 2.0. I boost some documents:
Document doc = new Document();
// adding fields
doc.setBoost(2.0f); // index-time boost for the whole document
indexwriter.addDocument(doc);
If I look at my index with Luke (0.6), the boost value of all documents
is still 1.0.
How can I boost documents?
Thanks. Sören
Peter Bloem wrote:
[...]
"+(A B) C D E"
[...]
In other words, Lucene considers all documents that
have both A and B, and ranks them higher if they also have C D or E.
Hello Peter,
to my understanding, "+(A B) C D E" means at least one of the terms "A"
or "B" must be contained, and the terms
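That required/optional reading can be illustrated with a small self-contained sketch (plain Java that simulates the clause logic; it does not use Lucene's BooleanQuery, and the term names are just the ones from the example):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RequiredClauseDemo {
    // A document matches "+(A B) C D E" only if it contains A or B;
    // its rank then grows with each optional term (C, D, E) it also contains.
    static int score(Set<String> doc) {
        if (!doc.contains("A") && !doc.contains("B")) {
            return -1; // required clause (A OR B) failed: no match at all
        }
        int optionalHits = 0;
        for (String t : Arrays.asList("C", "D", "E")) {
            if (doc.contains(t)) optionalHits++;
        }
        return optionalHits;
    }

    public static void main(String[] args) {
        Set<String> d1 = new HashSet<>(Arrays.asList("A", "C", "D"));
        Set<String> d2 = new HashSet<>(Arrays.asList("C", "D", "E"));
        System.out.println(score(d1)); // contains A, plus C and D -> 2
        System.out.println(score(d2)); // no A or B -> -1 (filtered out)
    }
}
```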
bhecht wrote:
I want to be able to split tokens by giving a list of substring words.
So I can give a list of subwords like "strasse", "gasse",
and the token "mainstrasse" or "maingasse" will be split into two tokens,
e.g. "main" and "strasse".
IMBEMBA, PASQUALINO: A Splitter for German Compound Words. F
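A minimal sketch of such suffix-based splitting (plain Java, not a Lucene TokenFilter; only "strasse" and "gasse" come from the question, everything else is a hypothetical helper):

```java
import java.util.Arrays;
import java.util.List;

public class SuffixSplitter {
    // Known subwords; "strasse" and "gasse" are from the question.
    static final List<String> SUBWORDS = Arrays.asList("strasse", "gasse");

    // Split a token into (prefix, subword) if it ends with a known subword.
    static List<String> split(String token) {
        for (String sub : SUBWORDS) {
            if (token.endsWith(sub) && token.length() > sub.length()) {
                String prefix = token.substring(0, token.length() - sub.length());
                return Arrays.asList(prefix, sub);
            }
        }
        return Arrays.asList(token); // no known subword: keep token unchanged
    }

    public static void main(String[] args) {
        System.out.println(split("mainstrasse")); // [main, strasse]
        System.out.println(split("maingasse"));   // [main, gasse]
        System.out.println(split("hauptplatz"));  // [hauptplatz]
    }
}
```

In a real analyzer this logic would live in a TokenFilter that emits both tokens; the sketch only shows the string-level decision.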
Mike O'Leary wrote:
Please forgive the laziness inherent in this question, as I haven't looked
through the PDFBox code yet. I am wondering if that code supports extracting
text from PDF files while preserving such things as sequences of whitespace
between characters and other layout and formatting
DECAFFMEYER MATHIEU wrote:
The score depends on
1. the query
2. the matched document
3. the index.
I don't really understand why the index must influence the score (why it
has been implemented that way).
The score should be the similarity (inverse distance) between the query
and the matched document
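One reason the index does influence the score: idf is computed from index-wide statistics (the number of documents and the document frequency of the term). A rough sketch, assuming the idf formula of Lucene's DefaultSimilarity, idf = ln(numDocs / (docFreq + 1)) + 1 (treat the exact formula as an assumption for this version):

```java
public class IdfDemo {
    // idf as in Lucene's DefaultSimilarity (assumed formula):
    // idf = ln(numDocs / (docFreq + 1)) + 1
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Same term, same docFreq, but indexes of different size
        // give different idf values, hence different scores:
        System.out.println(idf(1, 1));    // tiny index with one document
        System.out.println(idf(1, 1000)); // large index: much higher idf
    }
}
```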
DECAFFMEYER MATHIEU wrote:
Both are the same document but in different indexes;
the only difference is that the second index has more documents than the
first one, and the first one contains only that page.
I would like to have the same score as in the second index.
Simply speaking, the score dep
mariolone wrote:
I managed to extract the matrix.
But for large document collections, isn't that a far too expensive
solution?
I have a quite small collection with 14,960 documents and 29,828 unique
terms. If I remember right it took a few minutes on a normal laptop
computer to
Soeren Pekrul wrote:
The score for a document is the sum of the term weights w(tf, idf) for
each term it contains. So you already have the combination of
coordination level matching with IDF. Now it is possible that your query
requests three terms A, B and C. Two of them (A and B) are quite
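A toy illustration of that combination, assuming a simplified per-term weight w = tf * idf and a coord factor of matchedTerms / queryTerms (both are simplifications of Lucene's actual scoring formula):

```java
public class ScoreSketch {
    // Hypothetical per-term weight: tf * idf (a simplification).
    static float weight(int tf, float idf) {
        return tf * idf;
    }

    // score = coord * sum of weights of the query terms the doc contains,
    // where coord = matchedTerms / queryTerms (coordination level matching).
    static float score(int[] tfs, float[] idfs, int queryTerms) {
        float sum = 0f;
        int matched = 0;
        for (int i = 0; i < tfs.length; i++) {
            if (tfs[i] > 0) {
                sum += weight(tfs[i], idfs[i]);
                matched++;
            }
        }
        return (matched / (float) queryTerms) * sum;
    }

    public static void main(String[] args) {
        // Query A B C; the document contains A twice, B once, C not at all:
        System.out.println(score(new int[]{2, 1, 0}, new float[]{1.2f, 0.8f, 2.0f}, 3));
    }
}
```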
Hello Mario,
I had a similar problem a few weeks ago (thread "How to get Term Weights
(document term matrix)?", 2006-11-02,
http://www.gossamer-threads.com/lists/lucene/java-user/41726).
I think there is no simple function for creating or accessing a document
term matrix. I extracted the matrix
Karl Koch wrote:
If I do not misunderstand that extract, I would say it suggests the combination of coordination level matching with IDF. I am interested in your view, and in the views of others reading this.
I understand that sentence:
"The natural solution is to correlate a term's matching value with its
co
Hello Karl,
I’m very interested in the details of Lucene’s scoring as well.
Karl Koch wrote:
For this reason, I do not understand why Lucene (in version 1.2) normalises the query(!) with
norm_q = sqrt(sum_t((tf_q * idf_t)^2))
which is also called cosine normalisation. This is a technique that
If there were a boost factor for a single keyword (term) at index
time, I would index a class as a document with the keys as keywords and
the values as boost factors. Unfortunately you can only boost documents and
fields at index time. Single terms can only be boosted at search time
(TermQuery.se
Hello Lisheng,
a search process usually has to do two things. First, it has to find the
term in the index. I don't know the implementation of finding a term in
Lucene. I hope that the index is at least a sorted list or a binary
tree, so that binary search is possible. The time to find a term depends on t
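For a sorted term dictionary, binary search finds a term in O(log n) comparisons. A plain-Java sketch of that idea (not Lucene's actual term index, which also uses a skip structure over the on-disk dictionary):

```java
import java.util.Arrays;

public class TermLookup {
    // If the term dictionary is a sorted array, a term can be found in
    // O(log n) comparisons via binary search.
    static int find(String[] sortedTerms, String term) {
        return Arrays.binarySearch(sortedTerms, term);
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "banana", "cherry", "date"};
        System.out.println(find(terms, "cherry"));  // index 2
        System.out.println(find(terms, "fig") < 0); // not found
    }
}
```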
Hello Nils,
how about having one index for all documents with two fields "date" and
"content"? You can search documents for a specific date and the score
uses the global idf of all documents.
Sören
Nils Höller wrote:
I thought of making the idf function a NOOP, since this is somehow one
o
Hello Van,
this looks like compound word splitting. This topic was discussed in
the thread "Analysis/tokenization of compound words"
(http://www.gossamer-threads.com/lists/lucene/java-user/40164?do=post_view_threaded).
The main idea is as follows:
You have a corpus (lexicon/dictionary). You
Chris Hostetter wrote:
that's a pretty specific and not altogether intuitive ranking... can you
elaborate on your actual use case? ... why is B+C better than A+B? .. are
these rules specific to a known list of terms, or is it a general rule
relating to how you parse the user's input?
The origina
How can I manipulate the score depending on the combination of query
terms contained in the result document? No single term is important on
its own; such a term could simply be boosted. What is important is the
combination of terms.
The user searches for the terms A, B, C and D.
Of course, the document containing all terms
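One way to sketch such combination-dependent scoring is a post-search rescoring step in plain Java (the combination weights below are made up purely for illustration; in Lucene this would more likely be done in a custom Similarity or by re-ranking the top hits):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ComboBoost {
    // Hypothetical rescoring: multiply a base score by a factor chosen per
    // combination of matched query terms (here B+C outranks A+B).
    static float rescore(float baseScore, Set<String> matched) {
        if (matched.containsAll(Arrays.asList("A", "B", "C", "D"))) return baseScore * 4f;
        if (matched.containsAll(Arrays.asList("B", "C"))) return baseScore * 3f;
        if (matched.containsAll(Arrays.asList("A", "B"))) return baseScore * 2f;
        return baseScore; // no preferred combination matched
    }

    public static void main(String[] args) {
        Set<String> bc = new HashSet<>(Arrays.asList("B", "C"));
        Set<String> ab = new HashSet<>(Arrays.asList("A", "B"));
        System.out.println(rescore(1f, bc)); // 3.0
        System.out.println(rescore(1f, ab)); // 2.0
    }
}
```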
Chris Hostetter wrote:
You really, *REALLY* don't want to be doing this using the "Hits" class
like in your example ...
1) this will re-execute your search behind the scenes many many times
2) the scores returned by "Hits" are pseudo-normalized ... they will be
meaningless for any sort
Chris Hostetter wrote:
I don't really know what a "term matrix" is, but when you ask about
"weight", is it possible you are just looking for the TermDocs.freq() of the
term/doc pair?
Thank you Chris,
that was also my first idea. I wanted to get the document frequency
indexreader.docFreq(
Hello,
I would like to extract and store the document term matrix externally. I
iterate over the terms and, for each term, over its documents (reader is
an opened IndexReader instance):
TermEnum terms = reader.terms();
while (terms.next()) {
    TermDocs docs = reader.termDocs(terms.term());
    while (docs.next()) {
        // store matrix entry: (terms.term(), docs.doc(), docs.freq())
    }
}
Hello Antony,
I have a similar problem. My collection contains mainly German
documents, but some in English and a few in French, Spanish and Latin. I
know that each language has its own stemming rules.
Language detection is not my domain, but I can imagine it could be
possible to detect the language