1world1love wrote:
Greetings all. I am indexing a set of documents where I am extracting terms
and mapping them to a controlled vocabulary and then placing the matched
vocabulary in a keyword field.

One could also say you are classifying your data based on keywords in
the text?

What I want to know is if there is a way to store the original term location
with the keyword field?

You can always store values in a field, but the term and the stored
value are not coupled. Thus you would need to store the positions per
document in each field, in a machine-readable format that you then parse:

doc.addField("f", "keyword:12,32;54,32", Field.Store.YES, ..

But that is a very expensive solution.
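
Parsing such a stored value back could look roughly like the sketch
below. It is plain Java and touches no Lucene API. The class name
StoredOffsets is made up for illustration, and I am assuming the format
is "<keyword>:<a>,<b>;<a>,<b>;..." with every pair being two integers;
what the two integers mean (start and end, or start and length) is up to
whoever writes the field.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: parses a stored value such as "keyword:12,32;54,32". */
final class StoredOffsets {
    final String keyword;
    final List<int[]> pairs; // each element holds the two integers of one pair

    private StoredOffsets(String keyword, List<int[]> pairs) {
        this.keyword = keyword;
        this.pairs = pairs;
    }

    static StoredOffsets parse(String stored) {
        // Split on the last ':' so that a keyword containing ':' still parses.
        int colon = stored.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing ':' in " + stored);
        }
        String keyword = stored.substring(0, colon);
        List<int[]> pairs = new ArrayList<>();
        for (String pair : stored.substring(colon + 1).split(";")) {
            String[] parts = pair.split(",");
            if (parts.length != 2) {
                throw new IllegalArgumentException("bad pair: " + pair);
            }
            pairs.add(new int[] {
                Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim())
            });
        }
        return new StoredOffsets(keyword, pairs);
    }

    public static void main(String[] args) {
        StoredOffsets parsed = parse("keyword:12,32;54,32");
        System.out.println(parsed.keyword + " has " + parsed.pairs.size() + " pairs");
    }
}
```

You would have to run this for every stored field of every hit you want
to highlight, which is part of why it gets expensive.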

Example Text: "The quick brown fox jumped over the lazy dog" -->

Controlled Vocabulary Terms: "physical activity", "exercise", "sedentary
lifestyle", "canine"

I am storing these controlled terms in a keyword field so they are stored
and searchable exactly.

This is known as faceted classification.

<http://en.wikipedia.org/wiki/Faceted_classification>
<http://www.nabble.com/forum/Search.jtp?query=facets&local=y&forum=44>

What I would like to be able to do is to highlight the context of the
original term or phrase that is associated with a mapped term. So in the
example above, if the controlled term is "sedentary lifestyle", I would like
to highlight "lazy".

There can be multiple mapped terms for an original term or phrase.

The algorithm that handles the mapping provides the start and end
position of the original text.


Are you aware of the highlighter contrib module?

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/

The simplest solution is to add a new facet Term per classification in
the text, give it the start and end positions of the matched text in the
text field, and have the highlighter load the text and highlight that
text field.

Matching a document in which the same term occurs multiple times will
yield a greater score than if it occurred only once. This is probably
problematic for you.
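
Whichever way you store them, once you have the start and end positions
the highlighting step itself is simple. Here is a minimal sketch in plain
Java; it does not use the contrib highlighter, the class name
OffsetHighlighter is made up, and I am assuming ranges are character
offsets with an exclusive end that do not overlap.

```java
import java.util.Arrays;

/** Hypothetical helper: wraps character ranges of a text in <b> tags. */
final class OffsetHighlighter {

    /** Each range is {start, end} with end exclusive; ranges must not overlap. */
    static String highlight(String text, int[][] ranges) {
        int[][] sorted = ranges.clone();
        Arrays.sort(sorted, (a, b) -> Integer.compare(a[0], b[0]));

        StringBuilder out = new StringBuilder();
        int cursor = 0; // first character not yet copied to the output
        for (int[] r : sorted) {
            if (r[0] < cursor || r[1] < r[0] || r[1] > text.length()) {
                throw new IllegalArgumentException("overlapping or invalid range");
            }
            out.append(text, cursor, r[0]);   // untouched text before the match
            out.append("<b>").append(text, r[0], r[1]).append("</b>");
            cursor = r[1];
        }
        out.append(text, cursor, text.length()); // tail after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dog";
        int start = text.indexOf("lazy");
        int[][] ranges = { { start, start + "lazy".length() } };
        // prints: The quick brown fox jumped over the <b>lazy</b> dog
        System.out.println(highlight(text, ranges));
    }
}
```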

Instead you could add a single Term, ignore the built-in positions, and
store all of the positions in the payload of that single Term.


for (String facet : facets) {
  doc.addField(
      "f", new SingleTokenTokenStream(
          facet, new Payload(offsets.toByteArray())
      )
  );
}

(This is dry-coded; you will need to implement some of these things yourself.)
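
One of the things left to implement is turning the offsets into the byte
array that goes into the payload. A possible layout, sketched in plain
Java: two 4-byte big-endian ints per range, start then end. The class
name OffsetPayloadCodec and the layout are my assumptions, nothing
Lucene prescribes; any encoding works as long as the reading side agrees.

```java
import java.nio.ByteBuffer;

/** Hypothetical codec: packs {start, end} ranges into a byte array and back. */
final class OffsetPayloadCodec {
    private static final int BYTES_PER_RANGE = 2 * Integer.BYTES;

    static byte[] encode(int[][] ranges) {
        ByteBuffer buf = ByteBuffer.allocate(ranges.length * BYTES_PER_RANGE);
        for (int[] r : ranges) {
            buf.putInt(r[0]).putInt(r[1]); // start, then end
        }
        return buf.array();
    }

    static int[][] decode(byte[] bytes) {
        if (bytes.length % BYTES_PER_RANGE != 0) {
            throw new IllegalArgumentException("truncated payload");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int[][] ranges = new int[bytes.length / BYTES_PER_RANGE][2];
        for (int[] r : ranges) {
            r[0] = buf.getInt();
            r[1] = buf.getInt();
        }
        return ranges;
    }

    public static void main(String[] args) {
        byte[] payload = encode(new int[][] { { 36, 40 }, { 41, 44 } });
        System.out.println(payload.length + " bytes for " + decode(payload).length + " ranges");
    }
}
```

The result of encode is what you would hand to the Payload constructor,
and decode is what the modified highlighter would call when it reads the
payload back.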

You also need to modify the highlighter so it can read this data.



   karl

