Re: Similar Document Search

2003-08-27 Thread Brian Mila
> As a user of Lucene I missed some features. Part of the OSS culture is
> for me to tell others about this and maybe to try to find solutions.
> Mark's code seems to be one, so I proposed to consider adding it into
> some spot with better exposure for testing. And I don't seem to be the
> only pe

RE: Similar Document Search

2003-08-26 Thread Gregor Heinrich
ichen [mailto:[EMAIL PROTECTED] Sent: Thursday, August 21, 2003 2:54 PM To: Lucene Users List Subject: Re: Similar Document Search Hi Peter, I took a look at Mark's thesis and briefly at some of his code. It appears to me that what he's done with the so-called forward indexing is to (a)

Re: Similar Document Search

2003-08-25 Thread Peter Becker
Brian Mila wrote: amounts). I failed to find a way to get Lucene to give me this information without hacking this or that. Considering the attention IR Excuse me if this is off-topic, but isn't hacking the code what open source software is all about? Not always, but quite often :-) I mean

Re: Similar Document Search

2003-08-25 Thread Brian Mila
> amounts). I failed to find a way to get Lucene to give me this
> information without hacking this or that. Considering the attention IR
Excuse me if this is off-topic, but isn't hacking the code what open source software is all about? I mean, it's always better to try to do it with existing meth

RE: Similar Document Search

2003-08-21 Thread Eric Hahn
Apologies for asking the obvious, but could someone explain why Documents.Document is a sealed class? Seems like many of us would love to implement UniqueDocument to support this oft-requested uniqueness field. Would still have the task of implementing an IndexWriterEx.AddDocument(UniqueDocument)

Re: Similar Document Search

2003-08-21 Thread Peter Becker
hat end? Regards, Terry - Original Message - From: "Peter Becker" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Thursday, August 21, 2003 1:37 AM Subject: Re: Similar Document Search Hi all, it seems there are quite a few people l

Re: Similar Document Search

2003-08-21 Thread Terry Steichen
s, Terry - Original Message - From: "Peter Becker" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Thursday, August 21, 2003 1:37 AM Subject: Re: Similar Document Search > Hi all, > > it seems there are quite a few people looking

Re: Similar Document Search

2003-08-20 Thread Peter Becker
yet, therefore no code at this time... Best regards, Gregor -Original Message- From: Peter Becker [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 19, 2003 3:06 AM To: Lucene Users List Subject: Re: Similar Document Search Hi Terry, we have been thinking about the same problem and in the

RE: Similar Document Search

2003-08-20 Thread Gregor Heinrich
d yet, therefore no code at this time... Best regards, Gregor -Original Message- From: Peter Becker [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 19, 2003 3:06 AM To: Lucene Users List Subject: Re: Similar Document Search Hi Terry, we have been thinking about the same problem and i

Re: Similar Document Search

2003-08-19 Thread Magnus Johansson
Hi Peter, I guess you are right. I've implemented this for an index with ten million really small documents, all of which are stored in the index. The documents are never more than a thousand words, so re-indexing is quick enough. However, it is probably not advisable to do this with bigger documen

Re: Similar Document Search

2003-08-19 Thread Peter Becker
Hi Magnus, thanks for the offer, but unfortunately I can't/don't want to make the assumption that I can easily access the documents to re-index them. And I don't think this approach would be feasible unless you can keep the documents in memory somehow. Storing the other/non-inverted/normal/wha

Re: Similar Document Search

2003-08-19 Thread Magnus Johansson
Ok, here it is. It's part of a JSP that prints out all keywords in a document.
/magnus
<%@ page import="org.apache.lucene.index.IndexReader,
                 org.apache.lucene.document.Document,
                 com.technohuman.search.language.SwedishAnalyzer,
                 java.io.StringReader,

Re: Similar Document Search

2003-08-19 Thread Rociel Buico
Hello Magnus, can I ask for your sample script? --buics Hi Peter, if the original document is available, you could extract keywords from the document at query time. That is, when someone asks for documents similar to document a, you re-analyze document a and, in combination with statistics from t

Re: Similar Document Search

2003-08-19 Thread Magnus Johansson
Hi Peter, if the original document is available, you could extract keywords from the document at query time. That is, when someone asks for documents similar to document a, you re-analyze document a and, in combination with statistics from the Lucene index, you extract keywords from document a that
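
The query-time keyword extraction described in this message can be sketched roughly as follows. This is not Magnus's JSP code, which is not reproduced in this listing. It is a minimal sketch in plain Java that assumes tf * idf is the ranking statistic; the excerpt does not say which statistic was actually used. The class `KeywordExtractor`, the method `topKeywords`, and the `docFreq` map are hypothetical names. In a real setup the document frequencies and document count would presumably come from the index, for example via `IndexReader.docFreq(Term)` and `IndexReader.numDocs()`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeywordExtractor {

    /**
     * Ranks the tokens of one re-analyzed document by tf * idf and returns
     * the best maxKeywords of them.
     *
     * tokens  - the document after analysis (assumed already lower-cased, stemmed, etc.)
     * docFreq - hypothetical stand-in for index statistics: term -> number of documents containing it
     * numDocs - total number of documents in the index
     */
    public static List<String> topKeywords(List<String> tokens,
                                           Map<String, Integer> docFreq,
                                           int numDocs,
                                           int maxKeywords) {
        // Term frequency inside the document being compared.
        Map<String, Integer> termFreq = new HashMap<>();
        for (String token : tokens) {
            termFreq.merge(token, 1, Integer::sum);
        }

        // Score each distinct term; rare terms in the index get a higher idf.
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            score.put(e.getKey(), e.getValue() * idf);
        }

        // Highest score first; ties are broken alphabetically for stable output.
        List<String> ranked = new ArrayList<>(score.keySet());
        ranked.sort(Comparator.comparingDouble((String t) -> -score.get(t))
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(ranked.subList(0, Math.min(maxKeywords, ranked.size())));
    }
}
```

The returned keywords would then be combined into an ordinary query against the index to find documents similar to document a.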

Re: Similar Document Search

2003-08-18 Thread Terry Steichen
of reality, maybe Doug could comment?) Regards, Terry - Original Message - From: "Peter Becker" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Monday, August 18, 2003 9:05 PM Subject: Re: Similar Document Search > Hi Terry, > >

Re: Similar Document Search

2003-08-18 Thread Peter Becker
Hi Terry, we have been thinking about the same problem and in the end we decided that most likely the only good solution to this is to keep a non-inverted index, i.e. a map from the documents to the terms. Then you can query the most terms for the documents and query other documents matching p
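
The non-inverted index idea in this message, a map from documents to their terms kept next to the usual term-to-documents map, can be sketched as below. This is an illustration in plain Java, not code from the thread or from Lucene. The class `ForwardIndex` and its methods are hypothetical, and ranking candidates by the number of shared terms is an assumption, since the excerpt is cut off before it says how matches are scored.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ForwardIndex {
    // Non-inverted ("forward") index: document id -> terms it contains.
    private final Map<String, Set<String>> termsByDoc = new HashMap<>();
    // Ordinary inverted index: term -> ids of documents containing it.
    private final Map<String, Set<String>> docsByTerm = new HashMap<>();

    /** Records a document in both directions. */
    public void add(String docId, Collection<String> terms) {
        termsByDoc.computeIfAbsent(docId, k -> new TreeSet<>()).addAll(terms);
        for (String term : terms) {
            docsByTerm.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Looks up a document's terms without re-reading or re-analyzing its text. */
    public Set<String> termsOf(String docId) {
        return termsByDoc.getOrDefault(docId, Collections.<String>emptySet());
    }

    /** Other documents sharing at least one term, most shared terms first. */
    public List<String> similarTo(String docId) {
        Map<String, Integer> shared = new TreeMap<>();
        for (String term : termsOf(docId)) {
            for (String other : docsByTerm.get(term)) {
                if (!other.equals(docId)) {
                    shared.merge(other, 1, Integer::sum);
                }
            }
        }
        // TreeMap keys are alphabetical and List.sort is stable,
        // so documents with equal counts stay in alphabetical order.
        List<String> ranked = new ArrayList<>(shared.keySet());
        ranked.sort((a, b) -> shared.get(b) - shared.get(a));
        return ranked;
    }
}
```

The cost of this design is storage: every document's term list is kept a second time, which is the trade-off against re-analyzing the original text at query time.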

Re: Similar Document Search

2003-08-18 Thread Erik Hatcher
Using the QueryFilter would help with refining a search based on hits from a previous search, but it wouldn't help with the "like" part you asked about. I'm interested in what you turn up with this though. Erik On Monday, August 18, 2003, at 01:11 PM, Terry Steichen wrote: Is it possib