Re: Similar Document Search

Peter Becker Wed, 20 Aug 2003 22:43:07 -0700

Hi all,

it seems there are quite a few people looking for similar features, i.e. (a) document identity and (b) forward indexing. So far we hacked (a) by using a wrapper implementing equals/hashCode based on a unique field, but of course that assumes maintaining a unique field in the index. (b) is something we haven't tackled yet, but plan to.
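
For anyone wondering what the wrapper for (a) might look like, here is a minimal sketch, not our actual code. The class name DocumentKey is made up for illustration, and a plain Map stands in for the stored fields of a Lucene document so the snippet compiles without Lucene on the classpath. It assumes every document carries the unique field.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical wrapper: two wrappers are equal exactly when their
// unique field values are equal, whatever the other fields contain.
// The Map is an assumed stand-in for a document's stored fields.
final class DocumentKey {
    private final String uniqueId;
    private final Map<String, String> fields;

    DocumentKey(Map<String, String> fields, String uniqueFieldName) {
        // Fail early if the unique field is missing, since identity depends on it.
        this.uniqueId = Objects.requireNonNull(
                fields.get(uniqueFieldName),
                "document lacks the unique field: " + uniqueFieldName);
        this.fields = fields;
    }

    Map<String, String> fields() {
        return fields;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof DocumentKey)) {
            return false;
        }
        return uniqueId.equals(((DocumentKey) other).uniqueId);
    }

    @Override
    public int hashCode() {
        // Must agree with equals, so it uses the same single field.
        return uniqueId.hashCode();
    }
}
```

With something like this, the same document fetched by two different searches collapses to a single entry in a HashSet or HashMap, provided the unique field really is unique.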

The source code for Mark's thesis seems to be part of the Haystack distribution. The comments in the files put it under the Apache license. This seems to make it a good candidate to be included at least in the Lucene sandbox -- although I haven't tried it myself yet. But it sounds like a good candidate for us to use.

Since the Haystack source is a bit larger and I actually couldn't get the download at the moment, here is a copy of the relevant bit grabbed from one of my colleagues' machines:

http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb)

Note that this is just a tarball of src/org/apache/lucene out of some Haystack source. Untested, unmodified.

I'd love to see something like this supported in the Lucene context where people might actually find it :-)

Peter

Gregor Heinrich wrote:

Hello Terry,

Lucene can do forward indexing, as Mark Rosen outlines in his Master's
thesis: http://citeseer.nj.nec.com/rosen03email.html.

We use a similar approach for (probabilistic) latent semantic analysis and
vector space searches. However, the solution is not really completely fixed
yet, therefore no code at this time...

Best regards,

Gregor


-----Original Message-----
From: Peter Becker [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 19, 2003 3:06 AM
To: Lucene Users List
Subject: Re: Similar Document Search

Hi Terry,

we have been thinking about the same problem and in the end we decided
that most likely the only good solution to this is to keep a
non-inverted index, i.e. a map from the documents to the terms. Then you
can look up the top terms for a document and query for other documents
matching some of them (where you get the usual question of which terms
are actually interesting: high frequency, low frequency or the mid range).
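
To make the idea concrete, a rough in-memory sketch follows. It is only an illustration under stated assumptions: ForwardIndex and its methods are hypothetical names, tokenization is assumed to happen elsewhere, "top terms" is taken to mean highest frequency (just one of the options mentioned above), and the second step scans the map instead of querying Lucene's inverted index as a real implementation presumably would.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical forward index: document id -> (term -> frequency).
final class ForwardIndex {
    private final Map<String, Map<String, Integer>> termsByDoc = new HashMap<>();

    // Record the already-tokenized terms of one document.
    void add(String docId, List<String> terms) {
        Map<String, Integer> freq =
                termsByDoc.computeIfAbsent(docId, k -> new HashMap<>());
        for (String term : terms) {
            freq.merge(term, 1, Integer::sum);
        }
    }

    // The n most frequent terms of a document, ties broken alphabetically.
    List<String> topTerms(String docId, int n) {
        Map<String, Integer> freq = termsByDoc.getOrDefault(docId, new HashMap<>());
        List<String> terms = new ArrayList<>(freq.keySet());
        terms.sort((a, b) -> {
            int byFreq = Integer.compare(freq.get(b), freq.get(a));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    // Other documents, ranked by how many of the seed's top n terms they contain.
    List<String> similarTo(String docId, int n) {
        List<String> seedTerms = topTerms(docId, n);
        Map<String, Integer> overlap = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> entry : termsByDoc.entrySet()) {
            if (entry.getKey().equals(docId)) {
                continue; // skip the seed document itself
            }
            int shared = 0;
            for (String term : seedTerms) {
                if (entry.getValue().containsKey(term)) {
                    shared++;
                }
            }
            if (shared > 0) {
                overlap.put(entry.getKey(), shared);
            }
        }
        List<String> ranked = new ArrayList<>(overlap.keySet());
        ranked.sort((a, b) -> {
            int byOverlap = Integer.compare(overlap.get(b), overlap.get(a));
            return byOverlap != 0 ? byOverlap : a.compareTo(b);
        });
        return ranked;
    }
}
```

Swapping the frequency ordering in topTerms for a mid-range or low-frequency cut would be the place to experiment with which terms are actually interesting.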

Indexing would probably be quite expensive since Lucene doesn't seem to
support changes in the index, and the index for the terms would change
all the time. We haven't implemented it yet, but it shouldn't be hard to
code. I just wouldn't expect good performance when indexing large
collections.

Peter

Terry Steichen wrote:

Is it possible without extensive additional coding to use Lucene to conduct a search based on a document rather than a query? (One use of this would be to refine a search by selecting one of the hits returned from the initial query and subsequently retrieving other documents "like" the selected one.)

Regards,

Terry


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
