"To confuse matters more, it is not really a matter of synonyms, as the orginal term is discarded from the index and there is only one mapped term"

I'm not sure I fully understand this: am I right in thinking that you will be searching using these controlled vocabulary words, and that the search must then find any of the ordinary words which map to the controlled vocabulary words, and highlight them?

Because if that's the case, I think it's relatively simple: you create a separate index, which only maps the controlled vocabulary to the ordinary words. That's your "synonyms" index. Then you index your target document as normal. When you search, you first look up your search term against the synonyms index. So, following your example, if you looked up "dog" in the synonyms index, you'd get back "chien", "canis" and "cane". (Achieving this part is easy: you just keep adding "synonyms" to the field at the same position.) Whether or not the returned list also contains the original "dog" is up to you when you create your synonyms index. (In a typical synonym ring, the original word would have to be in there, because you don't know which word will be used to search.)
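
Very roughly, building and consulting such a synonyms index could look
something like this. (Dry coded against the Lucene 2.x API; here I simply
store the mapped terms on one document per source word rather than stacking
tokens at the same position, and all the field and variable names are just
placeholders.)

    // one document per ordinary word; "syn" holds the mapped terms
    Document d = new Document();
    d.add(new Field("word", "dog",   Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("syn",  "canis", Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("syn",  "cane",  Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("syn",  "chien", Field.Store.YES, Field.Index.UN_TOKENIZED));
    synonymsWriter.addDocument(d);

    // at search time, look up the query term and collect the mapped terms
    Hits lookup = synonymsSearcher.search(new TermQuery(new Term("word", "dog")));
    String[] mapped =
        lookup.length() > 0 ? lookup.doc(0).getValues("syn") : new String[0];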

Now all you have to do is combine those returned terms as Boolean OR clauses in a single BooleanQuery and search on the main index. You'll find all documents containing any of those three words, and you can use the highlighting code from the Lucene contrib project to highlight them.
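
Again dry coded, that step could look something like this ("contents" is a
placeholder for your stored text field, and "mapped" is the array returned by
the synonyms lookup above):

    // OR the mapped terms together and run them against the main index
    BooleanQuery query = new BooleanQuery();
    for (int i = 0; i < mapped.length; i++) {
        query.add(new TermQuery(new Term("contents", mapped[i])),
                  BooleanClause.Occur.SHOULD);
    }
    Hits results = mainSearcher.search(query);

    // contrib highlighter: mark up the matching terms in the stored text
    // (assumes the text field was stored in the main index)
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    String fragment = highlighter.getBestFragment(
        analyzer, "contents", results.doc(0).get("contents"));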

Does this help? Forgive me if I've misunderstood or underestimated the problem!

Regards,
-John


1world1love wrote:
First off Karl, thanks for your reply and your time.



karl wettin-3 wrote:
One could also say you are classifying your data based on keywords in
the text?


I probably didn't explain myself very well or, more specifically, didn't
provide a good example. In my case, there really isn't any relationship between the
mapped terms per document. That is to say that an individual term or phrase
in the document is mapped to a concrete concept in a controlled vocabulary.
The concept doesn't represent a class of anything and no relationship exists
between the concepts. They would never be grouped by any means. It is more a
matter of replacing some arbitrary word or phrase with an adjudicated
version.

The example I gave did in fact use classifications for the terms, but that
is not exactly the point that I was trying to convey. I suppose a better
example would be where each term or phrase in the sentence is mapped to an
equivalent in another language:

dog -> canis
dog -> cane
dog -> chien

So if you searched for "canis", then any document with "dog" would be
returned (unless the context implied that "dog" meant something else). By the
same token, if the text was "here we go" or "let's go", then it may map to
"vamos" or "vamonos".

To confuse matters more, it is not really a matter of synonyms, as the
original term is discarded from the index and there is only one mapped term
per original term or phrase, and the algorithm determines the controlled
meaning from the context.


karl wettin-3 wrote:
You can always store values in a field, but the term and the stored
value are not coupled. Thus you would need to store the positions per
document in each field in a machine-readable format that you then parse:

doc.addField("f", "keyword:12,32;54,32", Field.Store.YES, ..

But that is a way expensive solution.
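
(If I follow you, that would mean decoding the stored value back into offsets
at display time, roughly like this, using your example format -- I'm guessing
the second number in each pair is a length, and "doc"/"text" stand for the
retrieved document and its original text:)

    // stored value looks like "keyword:12,32;54,32"
    String stored = doc.get("f");
    String[] parts = stored.split(":");
    String keyword = parts[0];
    for (String pair : parts[1].split(";")) {
        String[] offsets = pair.split(",");
        int start  = Integer.parseInt(offsets[0]);
        int length = Integer.parseInt(offsets[1]);
        // highlight text.substring(start, start + length) for this keyword
    }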


Indeed, though doesn't an analyzed field have some other information attached
to it?

Forgive me if this is a naive question. I am fairly new to Lucene.


karl wettin-3 wrote:
This is known as faceted classification.

<http://en.wikipedia.org/wiki/Faceted_classification>
<http://www.nabble.com/forum/Search.jtp?query=facets&local=y&forum=44>


Again, I am not overly familiar with these disciplines, but I always thought
of facets as an organizational strategy. As I said, my example betrayed me a
bit: I am not that interested in organizing these documents, but rather in
providing a controlled vocabulary to search against, as opposed to any
random text.



karl wettin-3 wrote:
Are you aware of the highlighter contrib module?

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/

The simplest solution is to add a new facet Term per classification in the
text, use the start and end positions within the text field, and have the
highlighter load the text and highlight this text field.


This is actually not a web-based application, and the highlighting would
really only be used for analyzing the performance of the mapping algorithms. The
main issue is that we do need to be able to provide the location of the
original term for each mapped keyword.



karl wettin-3 wrote:
Matching a document in which the same term occurs multiple times will
produce a greater score than one in which it occurs only once. This is probably
problematic for you.


It may not be that big of an issue.


karl wettin-3 wrote:
Instead you could add a single Term, ignore the built-in positions, and
store all the positions in the payload of that single Term.


for (String facet : facets) {
   doc.addField(
       "f", new SingleTokenTokenStream(
           facet, new Payload(offsets.toByteArray())
       )
   );
}

(This is dry coded; you will need to implement some of these things.)

You also need to modify the highlighter so it can read this data.



Something like this seems like it might work well for my purposes. I will
look at this further.
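
For reference, I imagine reading those payloads back out would go through
TermPositions, roughly like this (also dry coded, assuming Lucene 2.2+
payload support; "f", "facet" and the offset encoding are whatever gets used
at index time):

    // fetch the payload attached to the single facet term in each document
    IndexReader reader = IndexReader.open(directory);
    TermPositions tp = reader.termPositions(new Term("f", facet));
    while (tp.next()) {              // one entry per document containing the facet
        tp.nextPosition();           // the single position the term was added at
        if (tp.isPayloadAvailable()) {
            byte[] bytes = tp.getPayload(new byte[tp.getPayloadLength()], 0);
            // decode bytes back into the original start/end offsets so the
            // (modified) highlighter knows which spans of the text to mark up
        }
    }
    tp.close();
    reader.close();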

Thanks again,

J

