Re: Token retrieval question

2001-10-12 Thread Dmitry Serebrennikov
Anders Nielsen wrote: > Can't you just keep 2 fields, one with the stemmed version of the text used for indexing purposes (indexed but not stored) and a second field with the original text (un-indexed but stored)? Then when you know you got a match on the nth term in the stemmed version, you ca
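The two-field idea quoted above can be sketched in a few lines. This is only an illustration in plain Python, not Lucene code: `naive_stem`, `build_fields`, and `original_terms_for` are hypothetical names, and the suffix-stripping stemmer is a stand-in assumption for whatever stemmer the application really uses.

```python
def naive_stem(word):
    """Toy stemmer (an assumption for this sketch): strip a few English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_fields(text):
    """Return two position-aligned token lists for one document.

    "stemmed" plays the role of the indexed-but-not-stored field,
    "original" the stored-but-not-indexed field.
    """
    original = text.split()
    stemmed = [naive_stem(w.lower()) for w in original]
    return {"stemmed": stemmed, "original": original}


def original_terms_for(doc, stem):
    """Map every position where `stem` matched back to the original word."""
    return [doc["original"][i] for i, t in enumerate(doc["stemmed"]) if t == stem]
```

Because both lists share the same positions, a match on the nth stemmed term points directly at the nth original word.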

RE: Token retrieval question

2001-10-12 Thread Anders Nielsen
tober 2001 03:44 To: [EMAIL PROTECTED] Subject: RE: Token retrieval question From what I remember, Lucene indices are structures like: ...> where for every TERM there is a list of DOCs in which it appears and the respective POSitions in that DOC. Our problem is that TERM, usually, is a n

RE: Token retrieval question

2001-10-12 Thread Alex Murzaku
From what I remember, Lucene indices are structures like: ...> where for every TERM there is a list of DOCs in which it appears and the respective POSitions in that DOC. Our problem is that TERM, usually, is a non-word (or stem). For display purposes, having a real word as the representative f
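As a rough illustration of the structure described above (for every TERM, the DOCs it appears in and its POSitions in each), here is a small in-memory sketch in plain Python. It is not Lucene's actual index format. The `first_form` map, which remembers the first real word seen for each stem, is one assumed way to get a displayable representative; `toy_stem` and `build_index` are hypothetical names.

```python
from collections import defaultdict


def toy_stem(word):
    """Toy stemmer (assumption): drop one trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_index(docs, stem=toy_stem):
    """Build TERM -> DOC -> [POS, ...] postings plus a stem -> first surface form map."""
    postings = defaultdict(lambda: defaultdict(list))
    first_form = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split()):
            term = stem(word.lower())
            postings[term][doc_id].append(pos)
            # Keep only the first un-stemmed form seen for this stem.
            first_form.setdefault(term, word)
    return postings, first_form
```

For display, `first_form[term]` gives a real word to show in place of the stem, while `postings[term]` still answers where the stem occurs.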

Re: Token retrieval question

2001-10-12 Thread Maurits van Wijland
Hi, This is a nice discussion :) > Yes, I see that. One additional problem that I need to solve for my application is that I need to map from stemmed forms of the terms to at least one un-stemmed form. Ideally it would be all un-stemmed forms, but I can live with the first one. I r

Re: Token retrieval question

2001-10-11 Thread Dmitry Serebrennikov
Excellent! This is a good confirmation of my direction. I have a question to the list - are there any votes out there for including this kind of "stem reversal" into Lucene, or does it more properly belong outside of Lucene, in the application using it? (I'm leaving the text below for easy refe

RE: Token retrieval question

2001-10-11 Thread Alex Murzaku
From what I remember, Lucene indices are structures like: ...> where for every TERM there is a list of DOCs in which it appears and the respective POSitions in that DOC. Our problem is that TERM, usually, is a non-word (or stem). For display purposes, having a real word as the representative f

Re: Token retrieval question

2001-10-11 Thread Dmitry Serebrennikov
Doug Cutting wrote: >> From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]] >> Doug, thanks for posting these. I may end up going in this direction in the next few days and will use this as a blueprint. Maybe I'll end up putting in the first pass implementation and then you can la

RE: Token retrieval question

2001-10-11 Thread Doug Cutting
> From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]] > Doug, thanks for posting these. I may end up going in this direction in the next few days and will use this as a blueprint. Maybe I'll end up putting in the first pass implementation and then you can later further tune it

RE: Token retrieval question

2001-10-10 Thread Dave Kor
You can count me in on this :) --- Doug Cutting <[EMAIL PROTECTED]> wrote: > Right now, Lucene does not have good support for what you're doing. Lucene as it stands is designed to support basic search, not other statistical text processing. However there are two features that I would

Re: Token retrieval question

2001-10-10 Thread Dmitry Serebrennikov
Doug, thanks for posting these. I may end up going in this direction in the next few days and will use this as a blueprint. Maybe I'll end up putting in the first pass implementation and then you can later further tune it when you get to it. Question on term numbers though: what would be an a

RE: Token retrieval question

2001-10-10 Thread Doug Cutting
Right now, Lucene does not have good support for what you're doing. Lucene as it stands is designed to support basic search, not other statistical text processing. However there are two features that I would like to add to Lucene that would help you. 1. Seekable TermDocs. This would let you ef

Re: Token retrieval question

2001-10-10 Thread Dmitry Serebrennikov
I'm actually working on exactly the same problem. Just yesterday, I implemented a new query (called CooccuranceQuery) that, given a list of terms, acts as a BooleanQuery with all of the terms being required and then reports back a list of other terms in the index with a count of how many docum
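The behaviour described for this query (require all given terms, then report the other terms along with a count) can be sketched roughly as follows. This is plain Python over an in-memory dict, not the poster's CooccuranceQuery and not Lucene's API. `cooccurring_terms` is a hypothetical name, and because the message is cut off, counting matching documents per term is an assumption.

```python
from collections import Counter


def cooccurring_terms(docs, required):
    """Count, for each other term, how many documents contain it
    together with ALL of the required terms.

    docs: mapping of doc id -> iterable of terms in that document.
    """
    required = set(required)
    counts = Counter()
    for terms in docs.values():
        terms = set(terms)
        if required <= terms:  # acts like an all-terms-required boolean match
            counts.update(terms - required)
    return counts
```

Sorting the result with `counts.most_common()` lists the most frequently co-occurring terms first.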