Re: Token retrieval question

Dmitry Serebrennikov Thu, 11 Oct 2001 18:46:06 -0700

Excellent! This is a good confirmation of my direction.
I have a question for the list - are there any votes out there for 
including this kind of "stem reversal" into Lucene, or does it more 
properly belong outside of Lucene, in the application using it?


(I'm leaving the text below for easy reference)

Alex Murzaku wrote:

>From what I remember, Lucene indices are structures like:
>
><term, <doc(i), pos1, ...>...>
>
>where for every TERM there is a list of DOCs in which it appears and the
>respective POSitions in that DOC.
>
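To make the structure above concrete, here is a minimal in-memory sketch of the <term, <doc(i), pos1, ...>...> layout. It is only a toy model under stated assumptions: the class and method names (ToyPostings, add, lookup) are hypothetical and are not Lucene's actual classes or on-disk format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of <term, <doc(i), pos1, ...>...>.
// Not Lucene's real index classes or file format.
final class ToyPostings {
    // term -> (document number -> positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Record that `term` occurs in document `doc` at `position`.
    void add(String term, int doc, int position) {
        postings.computeIfAbsent(term, t -> new TreeMap<>())
                .computeIfAbsent(doc, d -> new ArrayList<>())
                .add(position);
    }

    // All documents (with positions) for a term; empty if the term is unknown.
    Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyMap());
    }
}
```

Note that the key here is whatever the analyzer emitted, so with a stemming analyzer it would be the stem, not a displayable word.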
>Our problem is that TERM, usually, is a non-word (or stem). For display
>purposes, having a real word as the representative for all the words that
>end up in that stem could be very helpful.
>
>1) Since you are getting at a very low level, would it be terribly expensive
>to add one more field to the above structure which holds the first unstemmed
>form that creates the entry (or all forms that end up at the same stem)?
>
><term, <form1, ...>, <doc(i), pos1, ...>, ...>
>
>This would mean that the analyzer will have to return both the stem and the
>original word and the two will have to be passed along at every step.
>
>2) Or more simply, as you suggest, create some kind of map that contains the
>stem as key and the forms as values. The stemmed word and its originator are
>intercepted after every call to the stemmer and fed to the map.
>
>The function of the map would become some kind of reverse stemming
>(generation of all forms from a given stem). This map would grow
>asymptotically since there is a finite number of words in every language.
>It seems that the purpose of this feature would be to display the keywords
>in a more human friendly fashion, therefore, the map doesn't have to be
>extremely fast - it will be accessed in real time only when some view or
>result is generated. When it is written, it could be queued in its own
>thread so that the rest of indexing keeps going at the same speed.
>
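Option 2 above could be sketched roughly as follows: a map kept outside the index, with the stem as key and the original forms as values. This is only an illustration. StemFormMap, record, formsOf and firstForm are hypothetical names, and the sketch assumes the caller can see both the stem and the original word right after each call to the stemmer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stem -> original forms map ("stem reversal").
// Methods are synchronized so a separate writer thread could feed it.
final class StemFormMap {
    // LinkedHashSet keeps forms unique and in first-seen order.
    private final Map<String, Set<String>> forms = new HashMap<>();

    // Call after each stemmer invocation with the stem and the word it came from.
    synchronized void record(String stem, String original) {
        forms.computeIfAbsent(stem, s -> new LinkedHashSet<>()).add(original);
    }

    // All forms seen so far for a stem, in first-seen order (a copy).
    synchronized List<String> formsOf(String stem) {
        return new ArrayList<>(forms.getOrDefault(stem, Collections.emptySet()));
    }

    // The first form that produced this stem, or null if the stem is unknown.
    synchronized String firstForm(String stem) {
        Set<String> seen = forms.get(stem);
        return (seen == null) ? null : seen.iterator().next();
    }
}
```

For display, firstForm would give one readable representative per stem, while formsOf would cover the "all forms" case.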
>Alex
>
>-----Original Message-----
>From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]]
>
>Yes, I see that. One additional problem that I need to solve for my
>application is that I need to map from stemmed forms of the terms to at
>least one un-stemmed form. Ideally it would be all un-stemmed forms, but
>I can live with the first one. I realize that Lucene does not easily
>support this because of the separation of church and state (I mean the
>term filtering prior to indexing and querying), but I still need this
>functionality... So, the question is, is this going to be common enough
>to add a concept of a TermDictionary to Lucene and provide methods to
>access it on the IndexReader and IndexWriter? If not, I could implement
>this externally, but then I would not be able to use the IO framework
>and whole concept of directories. Also, since the Term numbers are going
>to be ephemeral just like doc numbers, externally I would have to refer
>to them by text, slowing down the translation process, etc., etc., etc.
>
>It's not yet clear enough in my mind to put an API together. Maybe the
>way to do this is to create an Analyzer that outputs a subclass of Term
>that has additional data, namely: String original_text, and int data.
>The data int is to keep application-specific flags such as term
>classification. Then the indexing code can be extended to support these
>extra fields and maintain the TermDictionary with them. The first entry
>for a given term wins in terms of the original_text and the data int.
>
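The "first entry wins" rule described above could look something like this sketch. It is not an existing Lucene API: TermInfo and ToyTermDictionary are hypothetical names, and only the two extra fields from the proposal (original_text and the int data) are modeled, as plain Java outside any index.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-term payload: the original text plus application flags.
final class TermInfo {
    final String originalText; // un-stemmed form that first produced the term
    final int data;            // application-specific flags, e.g. term classification

    TermInfo(String originalText, int data) {
        this.originalText = originalText;
        this.data = data;
    }
}

// Hypothetical term dictionary keyed by the stemmed term text.
final class ToyTermDictionary {
    private final Map<String, TermInfo> entries = new HashMap<>();

    // First entry for a given term wins; later ones are ignored.
    void add(String term, String originalText, int data) {
        entries.putIfAbsent(term, new TermInfo(originalText, data));
    }

    // Payload for a term, or null if the term was never added.
    TermInfo get(String term) {
        return entries.get(term);
    }
}
```

Keying by term text rather than by term number sidesteps the ephemeral-number problem, at the lookup cost mentioned above.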
>Any ideas to make this less of a hack?
>
>Dmitry.
>
>>
>>Doug
>>
>
>
>
