"To confuse matters more, it is not really a matter of synonyms, as the orginal term is discarded from the index and there is only one mapped term"

I'm not sure I fully understand this: am I right in thinking that you will be searching using these controlled vocabulary words, and that the search must then find any of the ordinary words which map to the controlled vocabulary words, and highlight them?

Because if that's the case, I think it's relatively simple: you create a separate index, which only maps the controlled vocabulary to the ordinary words. That's your "synonyms" index. Then you index your target document as normal. When you search, you first look up your search term against the synonyms index. So, following your example, if you looked up "dog" in the synonyms index, you'd get back "chien", "canis" and "cane". (Achieving this part is easy: you just keep adding "synonyms" to the field at the same position.) Whether or not the returned list also contains the original "dog" is up to you when you create your synonyms index. (In a typical synonym ring, the original word would have to be in there, because you don't know which word will be used to search.)
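
Very roughly, building and consulting such a synonyms index could look
something like this. (Dry coded against the Lucene 2.x API; here I simply
store the mapped terms on one document per source word rather than stacking
tokens at the same position, and all the field and variable names are just
placeholders.)

    // one document per ordinary word; "syn" holds the mapped terms
    Document d = new Document();
    d.add(new Field("word", "dog",   Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("syn",  "canis", Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("syn",  "cane",  Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("syn",  "chien", Field.Store.YES, Field.Index.UN_TOKENIZED));
    synonymsWriter.addDocument(d);

    // at search time, look up the query term and collect the mapped terms
    Hits lookup = synonymsSearcher.search(new TermQuery(new Term("word", "dog")));
    String[] mapped =
        lookup.length() > 0 ? lookup.doc(0).getValues("syn") : new String[0];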

Now all you have to do is combine those returned terms as Boolean OR clauses in a single BooleanQuery and search on the main index. You'll find all documents containing any of those three words, and you can use the highlighting code from the Lucene contrib project to highlight them.
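
Again dry coded, that step could look something like this ("contents" is a
placeholder for your stored text field, and "mapped" is the array returned by
the synonyms lookup above):

    // OR the mapped terms together and run them against the main index
    BooleanQuery query = new BooleanQuery();
    for (int i = 0; i < mapped.length; i++) {
        query.add(new TermQuery(new Term("contents", mapped[i])),
                  BooleanClause.Occur.SHOULD);
    }
    Hits results = mainSearcher.search(query);

    // contrib highlighter: mark up the matching terms in the stored text
    // (assumes the text field was stored in the main index)
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    String fragment = highlighter.getBestFragment(
        analyzer, "contents", results.doc(0).get("contents"));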

Does this help? Forgive me if I've misunderstood or underestimated the problem!

Regards,
-John


1world1love wrote:
First off Karl, thanks for your reply and your time.



karl wettin-3 wrote:
One could also say you are classifying your data based on keywords in
the text?


I probably didn't explain myself very well or, more specifically, didn't
provide a good example. In my case, there really isn't any relationship between the
mapped terms per document. That is to say that an individual term or phrase
in the document is mapped to a concrete concept in a controlled vocabulary.
The concept doesn't represent a class of anything and no relationship exists
between the concepts. They would never be grouped by any means. It is more a
matter of replacing some arbitrary word or phrase with an adjudicated
version.

The example I gave did in fact use classifications for the terms, but that
is not exactly the point that I was trying to convey. I suppose a better
example would be where each term or phrase in the sentence is mapped to an
equivalent in another language:

dog -> canis
dog -> cane
dog -> chien

So if you searched for "canis", then any document with "dog" would be
returned (unless the context implied that "dog" meant something else). By the
same token, if the text was "here we go" or "let's go", then it may map to
"vamos" or "vamonos".

To confuse matters more, it is not really a matter of synonyms, as the
original term is discarded from the index and there is only one mapped term
per original term or phrase, and the algorithm determines the controlled
meaning from the context.


karl wettin-3 wrote:
You can always store values in a field, but the term and the stored
value are not coupled. Thus you would need to store the positions per
document in each field in a machine-readable format that you then parse:

doc.addField("f", "keyword:12,32;54,32", Field.Store.YES, ..

But that is a way expensive solution.
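
(If I follow you, that would mean decoding the stored value back into offsets
at display time, roughly like this, using your example format -- I'm guessing
the second number in each pair is a length, and "doc"/"text" stand for the
retrieved document and its original text:)

    // stored value looks like "keyword:12,32;54,32"
    String stored = doc.get("f");
    String[] parts = stored.split(":");
    String keyword = parts[0];
    for (String pair : parts[1].split(";")) {
        String[] offsets = pair.split(",");
        int start  = Integer.parseInt(offsets[0]);
        int length = Integer.parseInt(offsets[1]);
        // highlight text.substring(start, start + length) for this keyword
    }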


Indeed, though doesn't an analyzed field have some other information attached
to it?

Forgive me if this is a naive question. I am fairly new to Lucene.


karl wettin-3 wrote:
This is known as faceted classification.

<http://en.wikipedia.org/wiki/Faceted_classification>
<http://www.nabble.com/forum/Search.jtp?query=facets&local=y&forum=44>


Again, I am not overly familiar with these disciplines, but I always thought
of facets as an organizational strategy. As I said, my example betrayed me a
bit: I am not that interested in organizing these documents, but rather in
providing a controlled vocabulary to search against, as opposed to any
random text.



karl wettin-3 wrote:
Are you aware of the highlighter contrib module?

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/

The simplest solution is to add a new facet Term per classification in the
text, use the start and end positions within the text field, and have the
highlighter load the text and highlight this text field.


This is actually not a web-based application, and the highlighting would
really only be used for analyzing the performance of the mapping algorithms. The
main issue is that we do need to be able to provide the location of the
original term for each mapped keyword.



karl wettin-3 wrote:
Matching a document in which the same term occurs multiple times will
produce a greater score than one in which it occurs only once. This is probably
problematic for you.


It may not be that big of an issue.


karl wettin-3 wrote:
Instead you could add a single Term, ignore the built-in positions, and
store all the positions in the payload of that single Term.


for (String facet : facets) {
   doc.addField(
       "f", new SingleTokenTokenStream(
           facet, new Payload(offsets.toByteArray())
       )
   );
}

(This is dry coded; you will need to implement some of these things.)

You also need to modify the highlighter so it can read this data.



Something like this seems like it might work well for my purposes. I will
look at this further.
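
For reference, I imagine reading those payloads back out would go through
TermPositions, roughly like this (also dry coded, assuming Lucene 2.2+
payload support; "f", "facet" and the offset encoding are whatever gets used
at index time):

    // fetch the payload attached to the single facet term in each document
    IndexReader reader = IndexReader.open(directory);
    TermPositions tp = reader.termPositions(new Term("f", facet));
    while (tp.next()) {              // one entry per document containing the facet
        tp.nextPosition();           // the single position the term was added at
        if (tp.isPayloadAvailable()) {
            byte[] bytes = tp.getPayload(new byte[tp.getPayloadLength()], 0);
            // decode bytes back into the original start/end offsets so the
            // (modified) highlighter knows which spans of the text to mark up
        }
    }
    tp.close();
    reader.close();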

Thanks again,

J

