Document.setBoost() doesn't work

2008-02-27 Thread Soeren Pekrul
I work with Lucene 2.0. I boost some documents:

    Document doc = new Document();
    // adding fields
    doc.setBoost(2.0f);
    indexwriter.addDocument(doc);

If I look at my index with Luke (0.6), the boost value of all documents is still 1.0. How can I boost documents? Thanks. Sören

Re: Optional terms in BooleanQuery

2007-05-21 Thread Soeren Pekrul
Peter Bloem wrote: [...] +(A B) C D E [...] In other words, Lucene considers all documents that have both A and B, and ranks them higher if they also have C, D or E. Hello Peter, to my understanding +(A B) C D E means at least one of the terms A or B must be contained and the terms C, D,
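The reading given in the reply can be sketched in plain Java. This is only an illustration of the clause semantics, assuming the default OR operator inside the parentheses; `BooleanClauseSketch`, `matches` and `optionalHits` are hypothetical names and not part of Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not Lucene code: models which documents +(A B) C D E selects.
class BooleanClauseSketch {
    // The required group (A B) is satisfied when at least one of A or B is present.
    static boolean matches(Set<String> docTerms) {
        return docTerms.contains("A") || docTerms.contains("B");
    }

    // Counts the optional terms C, D, E that are present.
    // They do not decide whether a document matches; they can only raise its rank.
    static int optionalHits(Set<String> docTerms) {
        int hits = 0;
        for (String term : Arrays.asList("C", "D", "E")) {
            if (docTerms.contains(term)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> doc1 = new HashSet<>(Arrays.asList("A", "C"));
        Set<String> doc2 = new HashSet<>(Arrays.asList("C", "D", "E"));
        System.out.println("doc1 matches=" + matches(doc1) + " optional=" + optionalHits(doc1));
        System.out.println("doc2 matches=" + matches(doc2) + " optional=" + optionalHits(doc2));
    }
}
```

Under this reading a document with only A and C is a hit, while a document with C, D and E but neither A nor B is not.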

Re: What is the best way to split substring words

2007-05-20 Thread Soeren Pekrul
bhecht wrote: I want to be able to split tokens by giving a list of substring words. So I can give a list of subwords like: strasse, gasse, And the token mainstrasse or maingasse will be split into 2 tokens, main and strasse. IMBEMBA, PASQUALINO: A Splitter for German Compound Words. Free
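The request in the question could be sketched as a simple suffix check against the sub-word list. This is a minimal plain-Java illustration, not the approach of the cited paper and not a Lucene TokenFilter; `SubwordSplitter` and `split` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits a token when it ends with one of the given sub-words.
class SubwordSplitter {
    // Returns [prefix, subword] for the first sub-word the token ends with,
    // or a single-element list with the unchanged token if none applies.
    static List<String> split(String token, List<String> subwords) {
        List<String> parts = new ArrayList<>();
        for (String sub : subwords) {
            // Require a non-empty prefix so that "strasse" itself is not split.
            if (token.length() > sub.length() && token.endsWith(sub)) {
                parts.add(token.substring(0, token.length() - sub.length()));
                parts.add(sub);
                return parts;
            }
        }
        parts.add(token);
        return parts;
    }

    public static void main(String[] args) {
        List<String> subwords = Arrays.asList("strasse", "gasse");
        System.out.println(split("mainstrasse", subwords));
        System.out.println(split("maingasse", subwords));
        System.out.println(split("haus", subwords));
    }
}
```

A real splitter would also have to handle sub-words in the middle of a token and linking letters between parts, which this sketch ignores.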

Re: IDFrequency

2007-02-02 Thread Soeren Pekrul
DECAFFMEYER MATHIEU wrote: The score depends on 1. the query 2. the matched document 3. the index. I don't really understand why the index must influence the score (why it has been implemented that way). The score should be the similarity (inverse distance) between the query and the matched

Re: Score

2007-01-29 Thread Soeren Pekrul
DECAFFMEYER MATHIEU wrote: Both are the same document but in different indexes, the only difference is that the second index has more documents than the first one, the first one contains only that page. I would like to have the same score as in the second index, Simply speaking, the score
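Why the same document scores differently in the two indexes can be illustrated with the inverse document frequency. The sketch below assumes the classic formula idf = ln(numDocs / (docFreq + 1)) + 1, which is how I recall Lucene's DefaultSimilarity of that era; `IdfSketch` is a hypothetical class and the index sizes are made-up numbers.

```java
// Hypothetical helper illustrating that idf, and therefore the score, depends on the index.
class IdfSketch {
    // Assumed formula (classic DefaultSimilarity): ln(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // The same term occurring in one document only:
        double smallIndex = idf(1, 1);    // index that contains only that page
        double largeIndex = idf(1, 1000); // assumed larger index with 1000 documents
        System.out.println("idf in 1-document index:    " + smallIndex);
        System.out.println("idf in 1000-document index: " + largeIndex);
    }
}
```

The term is rarer relative to the larger index, so its idf is higher there even though the document itself is unchanged.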

Re: Lucene scoring: coord_q_d factor

2006-12-14 Thread Soeren Pekrul
Karl Koch wrote: If I do not misunderstand that extract, I would say it suggests the combination of coordination level matching with IDF. I am interested in your view and those who read this? I understand that sentence: The natural solution is to correlate a term's matching value with its

Re: Lucene LSA

2006-12-14 Thread Soeren Pekrul
Hello Mario, I had a similar problem a few weeks ago (thread How to get Term Weights (document term matrix)?, 2006-11-02, http://www.gossamer-threads.com/lists/lucene/java-user/41726). I think there is no simple function creating a document term matrix or accessing it. I extracted the

Re: Lucene scoring: coord_q_d factor

2006-12-14 Thread Soeren Pekrul
Soeren Pekrul wrote: The score for a document is the sum of the term weights w(tf, idf) for each term it contains. So you already have the combination of coordination level matching with IDF. Now it is possible that your query requests three terms A, B and C. Two of them (A and B) are quite

Re: Lucene LSA

2006-12-14 Thread Soeren Pekrul
mariolone wrote: I succeeded in extracting the matrix. But with collections of large documents, is this not too expensive a solution? I have a quite small collection with 14,960 documents and 29,828 unique terms. If I remember right it took a few minutes on a normal laptop computer to

Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

2006-12-12 Thread Soeren Pekrul
Hello Karl, I’m very interested in the details of Lucene’s scoring as well. Karl Koch wrote: For this reason, I do not understand why Lucene (in version 1.2) normalises the query(!) with norm_q : sqrt(sum_t((tf_q*idf_t)^2)) which is also called cosine normalisation. This is a technique that
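The quoted normalisation factor norm_q = sqrt(sum_t((tf_q*idf_t)^2)) can be computed as follows. This is a plain-Java sketch of the formula as written in the question, not Lucene's own code; `QueryNormSketch` and its method names are hypothetical.

```java
// Hypothetical helper: cosine normalisation of a query vector with weights tf_q * idf_t.
class QueryNormSketch {
    // Euclidean length of the query vector: sqrt(sum_t((tf_q * idf_t)^2)).
    static double norm(double[] tf, double[] idf) {
        double sumOfSquares = 0.0;
        for (int t = 0; t < tf.length; t++) {
            double weight = tf[t] * idf[t];
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Divides every term weight by the norm so that the vector has length 1.
    static double[] normalise(double[] tf, double[] idf) {
        double n = norm(tf, idf);
        double[] weights = new double[tf.length];
        for (int t = 0; t < tf.length; t++) {
            weights[t] = tf[t] * idf[t] / n;
        }
        return weights;
    }

    public static void main(String[] args) {
        double[] tf = {1.0, 1.0};
        double[] idf = {3.0, 4.0};
        System.out.println("norm_q = " + norm(tf, idf));
        System.out.println(java.util.Arrays.toString(normalise(tf, idf)));
    }
}
```

Since the factor is the same for every document matched by one query, dividing by it does not change the ranking within that query.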

Re: Store a document-like map

2006-12-06 Thread Soeren Pekrul
If there were a boost factor for a single keyword (term) at index time, I would index a class as a document with the keys as keywords and the values as boost factors. Unfortunately, you can only boost documents and fields at index time. Single terms can only be boosted at search time

Re: Lucene search performance: linear?

2006-12-05 Thread Soeren Pekrul
Hello Lisheng, a search process usually has to do two things. First it has to find the term in the index. I don’t know the implementation of finding a term in Lucene. I hope that the index is at least a sorted list or a binary tree, so it can search binary. The time finding a term depends on
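The first step, finding a term in a sorted list, can be sketched with a binary search from the Java standard library. This only illustrates the assumption made in the message; it says nothing about how Lucene actually stores its term dictionary, and `TermLookupSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: looks up a term in a sorted term list in O(log n) comparisons.
class TermLookupSketch {
    // Returns the position of the term, or a negative value if it is not in the list.
    // The list must already be sorted in natural (lexicographic) order.
    static int find(List<String> sortedTerms, String term) {
        return Collections.binarySearch(sortedTerms, term);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "lucene", "score", "term");
        System.out.println("position of 'score': " + find(terms, "score"));
        System.out.println("found 'zebra': " + (find(terms, "zebra") >= 0));
    }
}
```

With a sorted list the lookup cost grows with the logarithm of the number of unique terms rather than linearly.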

Re: Incremental Index and Comparing different Scores from different Index

2006-12-04 Thread Soeren Pekrul
Hello Nils, how about having one index for all documents with two fields, date and content? You can search documents for a specific date and the score uses the global idf of all documents. Sören Nils Höller wrote: I thought of making the idf function a NOOP, since this is somehow one of

Re: any ides on this type of analyzer?

2006-12-01 Thread Soeren Pekrul
Hello Van, it looks like splitting of compound words. This topic was discussed in the thread Analysis/tokenization of compound words (http://www.gossamer-threads.com/lists/lucene/java-user/40164?do=post_view_threaded). The main idea is as follows: You have a corpus (lexicon/dictionary). You

Re: Scoring depending on terms combination

2006-11-13 Thread Soeren Pekrul
Chris Hostetter wrote: that's a pretty specific and not altogether intuitive ranking... can you elaborate on your actual use case? ... why is B+C better than A+B ? .. are these rules specific to a known list of terms, or is a general rule relating to how you parse the users input? The

Scoring depending on terms combination

2006-11-09 Thread Soeren Pekrul
How can I manipulate the score depending on the combination of query terms contained in the result document? Not a single term is important. That could be boosted. Important is the combination of terms. The user searches for the terms A, B, C and D. Of course, the document containing all

Re: How to get Term Weights (document term matrix)?

2006-11-04 Thread Soeren Pekrul
Chris Hostetter wrote: You really, *REALLY* don't want to be doing this using the Hits class like in your example ... 1) this will re-execute your search behind the scenes many many times 2) the scores returned by Hits are pseudo-normalized ... they will be meaningless for any sort of

Re: How to get Term Weights (document term matrix)?

2006-11-03 Thread Soeren Pekrul
Chris Hostetter wrote: I don't really know what a term matrix is, but when you ask about 'weight', is it possible you are just looking for the TermDoc.freq() of the term/doc pair? Thank you Chris, that was also my first idea. I wanted to get the document frequency

How to get Term Weights (document term matrix)?

2006-11-02 Thread Soeren Pekrul
Hello, I would like to extract and store the document term matrix externally. I iterate the terms and the documents for each term: TermEnum terms=IndexReader.terms(); while(terms.next()) { TermDocs docs=IndexReader.termDocs(terms.term()); while(docs.next()) {
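The shape of the matrix being extracted can be sketched without an index. The example below is a plain-Java stand-in, not Lucene code: it builds the term-by-document frequency table from already tokenized documents and then walks it terms first, documents second, like the loop in the question. `TermMatrixSketch` and `build` are hypothetical names, and the two sample documents are invented.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a document term matrix: term -> (doc number -> frequency).
class TermMatrixSketch {
    static Map<String, Map<Integer, Integer>> build(String[][] docs) {
        // TreeMap keeps the terms sorted, similar to iterating terms in index order.
        Map<String, Map<Integer, Integer>> matrix = new TreeMap<>();
        for (int doc = 0; doc < docs.length; doc++) {
            for (String term : docs[doc]) {
                // Increment the frequency of this term in this document.
                matrix.computeIfAbsent(term, t -> new TreeMap<>()).merge(doc, 1, Integer::sum);
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"lucene", "score", "lucene"},
            {"score", "term"}
        };
        // Outer loop over terms, inner loop over the documents containing each term.
        for (Map.Entry<String, Map<Integer, Integer>> row : build(docs).entrySet()) {
            for (Map.Entry<Integer, Integer> cell : row.getValue().entrySet()) {
                System.out.println(row.getKey() + "\tdoc=" + cell.getKey() + "\tfreq=" + cell.getValue());
            }
        }
    }
}
```

Only non-zero cells are stored, which matters for a sparse matrix with many more terms than any single document contains.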

Re: Analyzers and multiple languages

2006-10-13 Thread Soeren Pekrul
Hello Antony, I have a similar problem. My collection contains mainly German documents, but some in English and a few in French, Spanish and Latin. I know that each language has its own stemming rules. Language detection is not my domain. But I can imagine it could be possible to detect the