Re: Removing similar documents from search results

2005-03-15 Thread David Spencer
Miles Barr wrote: On Mon, 2005-03-14 at 20:48 +0100, Dawid Weiss wrote: I think what they do at Google is a fancy heuristic -- as David Spencer mentioned, suburls of a given page, identical snippets, or titles... My idea was more towards providing a 'realistic overview' of subjects in pages. So

Using JDBCDirectory

2005-03-15 Thread Ravi Rao
All, I got JDBCDirectory from information on the lucene-user's mailing list. http://mail-archives.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1644063 I cannot get basic searches to work. I tried to merge the JDBC directory with a filesystem index and search the filesystem index. That produced

Congratulations to Otis and Erik

2005-03-15 Thread Chuck Williams
Nice write-up in today's Search Day on Lucene in Action! If you don't get it, you can see it here (currently the top article): http://searchenginewatch.com/ Chuck

Re: Search in appendable fields

2005-03-15 Thread Otis Gospodnetic
Hello, PhraseQuery will help you do that, as will BooleanQuery (make both clauses required), or Boolean operators (using + in front of each term or AND between them) if you are parsing the query string with QueryParser: http://www.lucenebook.com/search?query=phrase+query http://www.lucenebook.
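The "+ in front of each term" option mentioned above can be sketched as a small helper that builds the query string. This is only an illustration: the class name, method name, and the sample field and terms are hypothetical, the helper does not escape query-syntax special characters, and the resulting string would still have to be handed to QueryParser.

```java
import java.util.List;
import java.util.StringJoiner;

public class RequiredTermsQuery {

    // Builds a query string in which every term is required in the given
    // field, e.g. "+Author:smith +Author:jones".
    // Assumption: the terms contain no query-syntax special characters.
    static String allRequired(String field, List<String> terms) {
        StringJoiner joiner = new StringJoiner(" ");
        for (String term : terms) {
            joiner.add("+" + field + ":" + term);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Hypothetical author names used only for illustration.
        System.out.println(allRequired("Author", List.of("smith", "jones")));
    }
}
```

Writing the same clauses with AND between them instead of a leading + should be equivalent for this purpose, as the message notes.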

Survey on Understanding Code

2005-03-15 Thread Vineet Sinha
Hi, We are running a set of small surveys, in an attempt to understand developers' problems when attempting to understand code. Results from this survey will be used in refining (open-source) tools that we are building. If you have looked at the code of any of the (Java) projects below, we woul

Search in appendable fields

2005-03-15 Thread Jerónimo López Bezanilla
I want to index articles: My document is: - Title - Authors There are one or more authors, and I index the field with "Appendable Fields" (page 68, Lucene in action). Document doc = new Document(); doc.add(Field.Text("Title", title)); doc.add(Field.Text("Author", author1)); doc.add(Field.Text("Au

Re: search performace

2005-03-15 Thread Erik Hatcher
I've been effectively off-line for a few days, so I'm not sure if anyone has replied on this thread yet. Using boosts will definitely use fewer resources than sorting. If you do use sorting for dates, be sure you're doing it numerically rather than lexicographically. Erik On Mar 10, 20
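The difference between numeric and lexicographic ordering can be seen in a small standalone sketch. It does not use Lucene at all; the class name and the sample values are hypothetical and stand in for date values stored as unpadded strings.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DateSortSketch {

    // Sorts the values as plain strings (lexicographic order).
    static List<String> sortLexicographically(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Sorts the values by the number each string represents.
    static List<String> sortNumerically(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Comparator.comparingLong((String s) -> Long.parseLong(s)));
        return copy;
    }

    public static void main(String[] args) {
        // Hypothetical values; note the differing string lengths.
        List<String> stamps = List.of("9", "10", "200");
        System.out.println(sortLexicographically(stamps)); // [10, 200, 9]
        System.out.println(sortNumerically(stamps));       // [9, 10, 200]
    }
}
```

The string sort places "10" and "200" before "9" because it compares character by character, which is why a date sort should compare the values as numbers.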

RE: SPECIFIC HIT

2005-03-15 Thread Robichaud, Jean-Philippe
Hi Guys, It is somewhat difficult to suggest something useful without more details. If you are pretty sure of the quality of the query, then here is my suggestion: Index the documents with an extra field called "last_word" that will contain the last word in the document. So from your exa

Re: Removing similar documents from search results

2005-03-15 Thread sergiu gordea
Chris Lamprecht wrote: It's a nice idea, and makes sense. I think that it can be broken if boosting is used and the search is performed on multiple fields, especially unstored ones. In this case the distance between very similar documents might be increased. I think that also the duplications sho

Querying multiple indexes and combining results

2005-03-15 Thread iain . d . keddie
Hi, I am currently evaluating a system that uses Lucene, so please excuse any lack of understanding. Could somebody tell me if it is possible to query across separate indexes with different criteria, but then to join/merge the results. An analogy is querying two separate tables then joining ba

Re: Removing similar documents from search results

2005-03-15 Thread Chris Lamprecht
Miles, I'm assuming that you want to detect documents that are "almost" exactly the same (since if they were identical, you could just do a straight string compare or md5 compare, etc). If you're storing term vectors in your index, you could compare the term vectors for the search results, and if
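One way to compare term vectors is cosine similarity over term frequencies. The sketch below is an assumption, not necessarily the measure the message goes on to describe. It takes each term vector as a plain map from term to frequency rather than reading it from an index, and the class name, sample maps, and any cut-off for "almost the same" are hypothetical.

```java
import java.util.Map;

public class TermVectorSimilarity {

    // Cosine similarity between two term-frequency maps.
    // Returns a value in [0, 1]; 0.0 if either map is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Hypothetical term vectors for two search results.
        Map<String, Integer> doc1 = Map.of("lucene", 3, "index", 2, "search", 1);
        Map<String, Integer> doc2 = Map.of("lucene", 3, "index", 2, "query", 1);
        System.out.println(cosine(doc1, doc2));
    }
}
```

Results whose similarity exceeds a chosen threshold could then be treated as near-duplicates; the threshold would need tuning on real data.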

Re: Removing similar documents from search results

2005-03-15 Thread Miles Barr
On Mon, 2005-03-14 at 20:48 +0100, Dawid Weiss wrote: > I think what they do at Google is a fancy heuristic -- as David Spencer > mentioned, suburls of a given page, identical snippets, or titles... My > idea was more towards providing a 'realistic overview' of subjects in > pages. So you could

Re: Removing similar documents from search results

2005-03-15 Thread Miles Barr
On Mon, 2005-03-14 at 10:24 -0800, David Spencer wrote: > Yes, in theory the "similarity" package in the sandbox can help. > The code generates a query for a source document to find documents that > are similar to it - the MoreLikeThis class uses the heuristic that 2 > docs are similar if they sh