Preventing phrase queries from matching across lines

2006-04-28 Thread Eric Jain
What is the best way to prevent a phrase query such as "eggs white" matching "fried eggs\nwhite snow"? Two possibilities I have thought about: 1. Replace all line breaks with a special string, e.g. "newline". 2. Have an analyzer somehow increment the position of a term for each line break it e
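The second idea above (bumping the term position at each line break) can be illustrated with a toy positional index. This is a pure-Python sketch, not Lucene code: the names `index_positions`, `phrase_matches`, and `LINE_GAP` are hypothetical, and the gap value of 100 is an arbitrary assumption. It only shows why a position jump keeps an exact phrase from matching across a line break.

```python
import re

LINE_GAP = 100  # assumed gap; anything larger than the longest expected phrase


def index_positions(text, line_gap=LINE_GAP):
    """Map each lowercased token to the list of positions where it occurs."""
    positions = {}
    pos = 0
    for line_no, line in enumerate(text.split("\n")):
        if line_no > 0:
            pos += line_gap  # jump ahead so terms on different lines are never adjacent
        for token in re.findall(r"\w+", line.lower()):
            positions.setdefault(token, []).append(pos)
            pos += 1
    return positions


def phrase_matches(positions, phrase):
    """Exact phrase match: every term must sit at consecutive positions."""
    terms = re.findall(r"\w+", phrase.lower())
    if not terms:
        return False
    return any(
        all(start + i in positions.get(term, []) for i, term in enumerate(terms))
        for start in positions.get(terms[0], [])
    )
```

With the gap in place, "eggs white" no longer matches "fried eggs\nwhite snow", while "white snow" still does. A sloppy phrase query would need a gap larger than its slop for the same effect.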

Ask for a better solution for the case

2006-04-28 Thread hu andy
Hi, I have an application that needs to mark the retrieved documents which have been read, so that next time I needn't read the marked documents again. I have an idea: adding a particular field to the indexed document. But as Lucene has no update method, I have to delete that document, and

Vector space model

2006-04-28 Thread trupti mulajkar
Hi, I am trying to implement the vector space model for Lucene. I did find some code for generating the vectors, but can anyone suggest a better way of creating the IndexReader object, as it is the only way that can return the index created? Cheers, trupti mulajkar MSc Advanced Computer Science -

Re: Vector space model

2006-04-28 Thread jason
Hi, I am also interested in this problem. Regards Jason On 4/28/06, trupti mulajkar <[EMAIL PROTECTED]> wrote: > > hi > > i am trying to implement the vector space model for lucene. > i did find some code for generating the vectors, but can any1 suggest a > better > way of creating the IndexRead

Re: Ask for a better solution for the case

2006-04-28 Thread Erick Erickson
This one's fairly wild, I'm interested to see what the gurus think... You could create a bitset and mark each document retrieved by the appropriate bit position (using the Lucene document id). Persist this bitset (assuming you need it across sessions). Be careful, I wouldn't persist it via the to
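A minimal sketch of the bitset idea, assuming document ids are small non-negative integers that stay stable between sessions. That assumption is the weak point: Lucene's internal document numbers can change when segments are merged, so a real application would probably need its own stable key. The class `ReadMarks` and its methods are hypothetical names, not part of any library.

```python
class ReadMarks:
    """Tracks which document ids have been read, one bit per id."""

    def __init__(self, bits=0):
        self.bits = bits  # a Python int used as an arbitrarily long bitset

    def mark(self, doc_id):
        self.bits |= 1 << doc_id

    def is_read(self, doc_id):
        return (self.bits >> doc_id) & 1 == 1

    def unread(self, doc_ids):
        """Filter a result list down to documents not yet marked as read."""
        return [d for d in doc_ids if not self.is_read(d)]

    def to_bytes(self):
        """Serialize for persistence across sessions."""
        length = max(1, (self.bits.bit_length() + 7) // 8)
        return self.bits.to_bytes(length, "little")

    @classmethod
    def from_bytes(cls, data):
        return cls(int.from_bytes(data, "little"))
```

Usage would be to call `mark` as each hit is displayed, then pass later result lists through `unread` before showing them.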

Re: Efficiently paginating results.

2006-04-28 Thread Marc Dauncey
I read somewhere recently (maybe even on this list) a recommendation to requery each time for successive pages as this avoids some of the complexity involved in session management. What's people's view of this? Marc --- karl wettin <[EMAIL PROTECTED]> wrote: > > 27 apr 2006 kl. 20.44 skrev Jean

Re: Efficiently paginating results.

2006-04-28 Thread Hannes Carl Meyer
Hi Marc, I'm using this method for a web-application. I'm storing only the current viewable set of documents in the session and re-query if the user scrolls to the next page. This method is pretty fast and has a minimal session- and processing-footprint. But, if your index is changed during scr
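The requery-per-page approach described here can be sketched as follows. The `search` function is a hypothetical stand-in for a real search call (it ranks by a naive substring count over an in-memory dict), and `fetch_page` is likewise an illustrative name; only the slicing pattern is the point.

```python
def search(index, query):
    """Hypothetical stand-in for a search call: matching doc ids, best first."""
    needle = query.lower()
    scored = [(text.lower().count(needle), doc_id) for doc_id, text in index.items()]
    ranked = sorted(scored, key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for score, doc_id in ranked if score > 0]


def fetch_page(index, query, page, page_size=10):
    """Re-run the query and return only the requested page (0-based)."""
    hits = search(index, query)
    start = page * page_size
    return hits[start:start + page_size]
```

Nothing but the query string and page number has to live in the session. The trade-off raised in this thread still applies: if the index changes between requests, the same page number may return different documents.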

Re: Efficiently paginating results.

2006-04-28 Thread Marc Dauncey
Yes, I was thinking about index updates. Getting a different result set when you go back to a previous page might be an issue - could always cache each page as it's opened rather than the entire result set. --- Hannes Carl Meyer <[EMAIL PROTECTED]> wrote: > Hi Marc, > > I'm using this met

Re: Preventing phrase queries from matching across lines

2006-04-28 Thread Erik Hatcher
On Apr 28, 2006, at 5:35 AM, Eric Jain wrote: What is the best way to prevent a phrase query such as "eggs white" matching "fried eggs\nwhite snow"? Two possibilities I have thought about: 1. Replace all line breaks with a special string, e.g. "newline". 2. Have an analyzer somehow increment

Re: Efficiently paginating results.

2006-04-28 Thread Hannes Carl Meyer
p.s. To avoid that issue you could store the result set's document ids in the session. Marc Dauncey wrote: Yes, I was thinking about index updates. Getting a different result set when you go back to a previous page might be an issue - could always cache each page as it's opened rather than

RE: Partial token matches

2006-04-28 Thread Eric Isakson
Thank you all for the ideas and thanks to the developers for producing such a great tool. I hadn't considered the "too many clauses" problem in my original implementation and I'm definitely hitting it. I decided to use a bi-gram tokenization approach combined with a PhraseQuery to get the "term
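One way to picture the bi-gram plus phrase-query combination is the sketch below. It assumes character bi-grams within a single term, which the truncated message does not confirm, and the function names are hypothetical. A fragment matches when its bi-grams appear consecutively among the term's bi-grams, which is what a phrase query over bi-gram tokens would check.

```python
def bigrams(term):
    """Split a term into overlapping two-character tokens."""
    return [term[i:i + 2] for i in range(len(term) - 1)]


def contains_partial(term, fragment):
    """True if the fragment's bi-grams occur consecutively within the term's."""
    grams = bigrams(term.lower())
    wanted = bigrams(fragment.lower())
    if not wanted:
        return False  # fragments shorter than two characters produce no bi-grams
    return any(
        grams[i:i + len(wanted)] == wanted
        for i in range(len(grams) - len(wanted) + 1)
    )
```

Because each fragment becomes one phrase over a fixed set of tokens instead of an expansion into every matching term, this style of matching sidesteps the "too many clauses" problem mentioned above.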

Re: Efficiently paginating results.

2006-04-28 Thread Volodymyr Bychkoviak
I'm caching hits by query. When more documents are accessed, Lucene automatically re-queries the index to retrieve them. When the index changes, I reopen the IndexReader and clear the cache. Marc Dauncey wrote: I read somewhere recently (maybe even on this list) a recommendation to requery each time f

RE: Efficiently paginating results.

2006-04-28 Thread Kinnar Kumar Sen, Noida
Hi Marc, Can you give some statistics about the amount of data you are indexing? Do you not think requerying for pagination will increase the time taken to bring back the hits, compared with bringing all the hits into memory once and then displaying them as and when the user clicks on the next but

RE: Efficiently paginating results.

2006-04-28 Thread Marc Dauncey
Hi Kinnar, Well, I have quite a few indexes, some of which get updated infrequently with large loads (quarterly) and then some indexes which will have approx 2000 additions a day. Originally I planned to store the results on the session - but I have to design for growth, both in users and in data

Re: for the similarity measure

2006-04-28 Thread Sebastian Marius Kirsch
On Fri, Apr 28, 2006 at 01:54:51PM +0800, jason wrote: > After reading the code, I found the similarity measure in Lucene is not the > same as the cosine coefficient measure commonly used. I don't know whether it is > correct. And I wonder whether I can use the cosine coefficient measure in > Lucene or mayb
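For reference, the textbook cosine coefficient the question refers to can be computed as below. This is a plain-Python sketch over raw term frequencies with whitespace tokenization, both of which are simplifying assumptions; it makes no claim about how Lucene's own scoring is implemented.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine of the angle between two raw term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

Identical texts score 1, texts with no shared terms score 0, and the result does not depend on document length, only on the relative term proportions.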

Re: Ask for a better solution for the case

2006-04-28 Thread Doug Cutting
hu andy wrote: Hi, I have an application that needs to mark the retrieved documents which have been read, so that next time I needn't read the marked documents again. You could mark the documents as deleted, then later clear deletions. So long as you don't close the IndexReader, the deletions wil

Tips on building a better BooleanQuery

2006-04-28 Thread Daniel Shane
Hi! [I'm sorry for also posting this on the dev mailing list, but I was not sure which one would be best, so if there is a moderator, please kill either one.] I'm planning on contributing to Lucene by adding a new kind of query. I don't know what to call it yet, but it would be a mix of Bool

Scoring without floating point calculations

2006-04-28 Thread Otis Gospodnetic
Hello, Apparently Sun's Niagara servers have a weak FPU, and I don't need my matches to contain floating point scores, so I would like to avoid floating point calculations when scoring, if possible. Doing a quick `grep -R ' float ' *` in the source tree shows a number of places where floats ar

Re: Scoring without floating point calculations

2006-04-28 Thread Ken Krugler
Apparently Sun's Niagara servers have a weak FPU, and I don't need my matches to contain floating point scores, so I would like to avoid floating point calculations when scoring, if possible. Doing a quick `grep -R ' float ' *` in the source tree shows a number of places where floats are used:

SpanFirstQuery and SpanNotQuery

2006-04-28 Thread Chris Hostetter
I'm looking at SpanQueries as I work on new test cases for LUCENE-557, and I'm confused by the implementation of SpanFirstQuery.getSpans(). In the anonymous Spans instance returned, start() and end() are always the start() and end() of the inner SpanQuery for the current doc -- shouldn't the sta

Highlighter and complex queries

2006-04-28 Thread Marios Skounakis
Hi all, Suppose the user enters the following query using a textbox interface: "rate based optimization" (as a phrase query, including the quotes). The query is parsed using QueryParser, then it is rewritten, and given to the highlighter. Then, method getBestTextFragments is called. The met

RE: Efficiently paginating results.

2006-04-28 Thread Kinnar Kumar Sen, Noida
Hi Marc, I have basically gone through the book Lucene in Action where it suggests requerying would be better, but I believe it depends on the kind of application you have. In my case I need to rank the hits according to some other parameters so I need the total hits at a time then rank it accord