RE: Token retrieval question

Anders Nielsen Fri, 12 Oct 2001 01:20:16 -0700

Can't you just keep two fields: one with the stemmed version of the text,
used for indexing (indexed but not stored), and a second with the original
text (stored but not indexed)? Then, when you know you got a match on the
nth term in the stemmed version, you can run the same Analyzer, minus the
stemming, over the stored text field and take the nth term from that.


The only trouble I can see with that is if the stemmer either skips terms or
makes two terms into one.
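
A minimal sketch of the position lookup, in plain Java rather than against
the Lucene API. The class name, the tokenizer, and the toy stemmer are all
made up for illustration; the only point is that both passes share the same
tokenization, so position n means the same thing in each. It only holds as
long as the stemmer neither drops nor merges tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: illustrates position alignment only.
class PositionAlignment {

    // Shared tokenization step: lower-case, then split on runs of non-letters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Toy stand-in for a real stemmer: strips a trailing "ing" or "s".
    static String toyStem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Position of the first token whose stem equals the given stem, or -1.
    static int firstPositionOfStem(String storedText, String stem) {
        List<String> tokens = tokenize(storedText);
        for (int i = 0; i < tokens.size(); i++) {
            if (toyStem(tokens.get(i)).equals(stem)) {
                return i;
            }
        }
        return -1;
    }

    // Re-tokenizes the stored text without stemming and returns the nth token.
    static String originalAt(String storedText, int position) {
        return tokenize(storedText).get(position);
    }
}
```

Given a match on the stem at position n, originalAt(storedText, n) returns
the unstemmed word at that position.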

regards,
Anders Nielsen

-----Original Message-----
From: Alex Murzaku [mailto:[EMAIL PROTECTED]]
Sent: 12 October 2001 03:44
To: [EMAIL PROTECTED]
Subject: RE: Token retrieval question


From what I remember, Lucene indices are structured like:

<term, <doc(i), pos1, ...>...>

where for every TERM there is a list of DOCs in which it appears and the
respective POSitions in that DOC.
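
As a rough illustration of that shape, here is a toy in-memory model in
plain Java. It is only a sketch of the logical structure described above,
not Lucene's actual on-disk format, and the class and method names are
invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of <term, <doc(i), pos1, ...>...>; not a Lucene class.
class ToyPostings {
    // term -> (document number -> positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Records one occurrence of a term at a position within a document.
    void add(String term, int docId, int position) {
        postings.computeIfAbsent(term, t -> new TreeMap<>())
                .computeIfAbsent(docId, d -> new ArrayList<>())
                .add(position);
    }

    // Positions of the term in the document; empty if there are none.
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null || !docs.containsKey(docId)) {
            return new ArrayList<>();
        }
        return docs.get(docId);
    }
}
```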

Our problem is that TERM, usually, is a non-word (or stem). For display
purposes, having a real word as the representative for all the words that
end up in that stem could be very helpful.

1) Since you are getting at a very low level, would it be terribly expensive
to add one more field to the above structure which holds the first unstemmed
form that creates the entry (or all forms that end up at the same stem)?

<term, <form1, ...>, <doc(i), pos1, ...>, ...>

This would mean that the analyzer will have to return both the stem and the
original word and the two will have to be passed along at every step.

2) Or more simply, as you suggest, create some kind of map that contains the
stem as key and the forms as values. The stemmed word and its originator are
intercepted after every call to the stemmer and fed to the map.

The function of the map would become some kind of reverse stemming
(generation of all forms from a given stem). The map's growth would level
off asymptotically, since there is a finite number of words in every language.
It seems that the purpose of this feature would be to display the keywords
in a more human friendly fashion, therefore, the map doesn't have to be
extremely fast - it will be accessed in real time only when some view or
result is generated. When it is written, it could be queued in its own
thread so that the rest of indexing keeps going at the same speed.
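
A minimal sketch of such a map, assuming the caller has both the stem and
the original word in hand right after each stemmer call. StemFormMap and its
methods are hypothetical names, not anything in Lucene. The methods are
synchronized so that a separate writer thread could feed it, but the queuing
itself is left out.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical class for illustration; nothing like it ships with Lucene.
class StemFormMap {
    // stem -> original forms, kept in first-seen order
    private final Map<String, Set<String>> forms = new LinkedHashMap<>();

    // Call after every stemmer invocation with the stem and the word it came from.
    synchronized void record(String stem, String original) {
        forms.computeIfAbsent(stem, s -> new LinkedHashSet<>()).add(original);
    }

    // All forms seen so far for the stem ("reverse stemming"); empty if unknown.
    synchronized Set<String> formsOf(String stem) {
        Set<String> copy = new LinkedHashSet<>();
        Set<String> seen = forms.get(stem);
        if (seen != null) {
            copy.addAll(seen);
        }
        return copy;
    }

    // First form seen for the stem, usable as its display word; null if unknown.
    synchronized String displayForm(String stem) {
        Set<String> seen = forms.get(stem);
        return seen == null ? null : seen.iterator().next();
    }
}
```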

Alex

-----Original Message-----
From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]]

Yes, I see that. One additional problem that I need to solve for my
application is that I need to map from stemmed forms of the terms to at
least one un-stemmed form. Ideally it would be all un-stemmed forms, but
I can live with the first one. I realize that Lucene does not easily
support this because of the separation of church and state (I mean the
term filtering prior to indexing and querying), but I still need this
functionality... So, the question is, is this going to be common enough
to add a concept of a TermDictionary to Lucene and provide methods to
access it on the IndexReader and IndexWriter? If not, I could implement
this externally, but then I would not be able to use the IO framework
and the whole concept of directories. Also, since the Term numbers are going
to be ephemeral just like doc numbers, externally I would have to refer
to them by text, slowing down the translation process, etc.

It's not yet clear enough in my mind to put an API together. Maybe the
way to do this is to create an Analyzer that outputs a subclass of Term
that has additional data, namely: String original_text, and int data.
The data int is to keep application-specific flags such as term
classification. Then the indexing code can be extended to support these
extra fields and maintain the TermDictionary with them. The first entry
for a given term wins in terms of the original_text and the data int.
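
To make the "first entry wins" rule concrete, here is a rough stand-alone
sketch in plain Java. ToyTermDictionary and Entry are made-up names; this is
not an existing or proposed Lucene API, and it ignores persistence through
the Directory framework entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the TermDictionary idea; not a Lucene class.
class ToyTermDictionary {

    // Extra data carried alongside a stemmed term.
    static final class Entry {
        final String originalText; // un-stemmed form that first produced the term
        final int data;            // application-specific flags, e.g. a term class

        Entry(String originalText, int data) {
            this.originalText = originalText;
            this.data = data;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    // First entry for a term wins; later ones for the same term are ignored.
    void add(String term, String originalText, int data) {
        entries.putIfAbsent(term, new Entry(originalText, data));
    }

    // Returns the entry for the term, or null if the term was never added.
    Entry lookup(String term) {
        return entries.get(term);
    }
}
```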

Any ideas to make this less of a hack?

Dmitry.

>
>
>Doug
>
