You would parse the XML (or whatever format you have) into separate
strings, and put each piece into its own Field in a Lucene Document.
For instance:
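Document doc = new Document();
String bodyText = getBody(input);
String peopleText = getPeople(input);
// The Lucene 2.x Field constructor also needs Store and Index options
doc.add(new Field("body", bodyText, Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("people", peopleText, Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);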
Essentially, you just need to implement the getPeople and getBody
methods to extract the appropriate content from your text.
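As a rough sketch of what those two methods might look like, assuming the
annotations are still inline in the text as <Person>...</Person> tags: the
class name AnnotationExtractor and both regular expressions are my own
assumptions for illustration, not anything from GATE or Lucene. If your
annotations already live in an array, you would build the strings from that
array instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits inline-annotated text into body and people strings. */
class AnnotationExtractor {
    // Matches <Person> ... </Person> and captures the name without surrounding spaces.
    private static final Pattern PERSON =
        Pattern.compile("<Person>\\s*(.*?)\\s*</Person>", Pattern.DOTALL);

    // Matches any simple opening or closing tag so it can be stripped from the body.
    private static final Pattern ANY_TAG = Pattern.compile("</?\\w+>");

    /** Returns the annotated person names, space-separated, for the "people" field. */
    static String getPeople(String input) {
        List<String> names = new ArrayList<>();
        Matcher m = PERSON.matcher(input);
        while (m.find()) {
            names.add(m.group(1));
        }
        return String.join(" ", names);
    }

    /** Returns the text with annotation tags removed, for the "body" field. */
    static String getBody(String input) {
        String withoutTags = ANY_TAG.matcher(input).replaceAll("");
        // Collapse the whitespace left behind where the tags used to be.
        return withoutTags.replaceAll("\\s+", " ").trim();
    }
}
```

For the input "<Person> John </Person> loves fishing", getPeople returns
"John" and getBody returns "John loves fishing".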
On Mar 17, 2008, at 5:05 PM, lucene-seme1 s wrote:
I already have the document preprocessed and the annotations (i.e.
<Person>John</Person>) are already stored in an array with features
attached to some annotations (such as the root and lemma of the word).
Can you please elaborate some more on how to "index them as normally would"?
Regards,
JK
On Mon, Mar 17, 2008 at 4:33 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
I think there are a couple of ways you can approach this, although I
have never used GATE.
If these annotations are marked inline in your content, then you can
either preprocess the files to separate them out and index as you
normally would, or you can use the relatively new TeeTokenFilter and
SinkTokenizer to extract them as you go for use in other fields. I
have done this successfully for some apps that I have worked on and I
think it works quite nicely and beats preprocessing IMO. Essentially,
you set up a TeeTokenFilter that recognizes your Person tokens and sets
each one aside in the Sink. Then, when you construct the Person
field, you use the SinkTokenizer as its token stream.
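To make the tee/sink idea concrete, here is a library-free sketch of the
pattern in plain Java. It deliberately does NOT use the Lucene classes: Tok,
Tee and analyze are hypothetical names made up for illustration, and the
"Person" type label is an assumption about how your tokens are tagged. The
point is only that one pass over the tokens feeds the body while matching
tokens are copied aside for a second field.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

/** Library-free illustration of the tee/sink pattern; these are not Lucene classes. */
class TeeSinkSketch {
    /** Hypothetical token: its text plus an annotation type such as "Person". */
    static final class Tok {
        final String text;
        final String type;
        Tok(String text, String type) { this.text = text; this.type = type; }
    }

    /** Passes every token through unchanged and copies matching ones into the sink. */
    static final class Tee implements Iterator<Tok> {
        private final Iterator<Tok> input;
        private final Predicate<Tok> wanted;
        private final List<Tok> sink;

        Tee(Iterator<Tok> input, Predicate<Tok> wanted, List<Tok> sink) {
            this.input = input;
            this.wanted = wanted;
            this.sink = sink;
        }

        public boolean hasNext() { return input.hasNext(); }

        public Tok next() {
            Tok t = input.next();
            if (wanted.test(t)) {
                sink.add(t); // set the token aside for the other field
            }
            return t;
        }
    }

    /** Consumes the token stream once; returns { body text, people text }. */
    static String[] analyze(List<Tok> tokens) {
        List<Tok> sink = new ArrayList<>();
        Iterator<Tok> tee = new Tee(tokens.iterator(), t -> "Person".equals(t.type), sink);

        // First field: every token goes into the body.
        StringBuilder body = new StringBuilder();
        while (tee.hasNext()) {
            if (body.length() > 0) body.append(' ');
            body.append(tee.next().text);
        }

        // Second field: built from the sink, without re-reading the input.
        StringBuilder people = new StringBuilder();
        for (Tok t : sink) {
            if (people.length() > 0) people.append(' ');
            people.append(t.text);
        }
        return new String[] { body.toString(), people.toString() };
    }
}
```

Note that the sink only fills up as the main stream is consumed, so the body
has to be processed before the field that reads from the sink.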
HTH,
Grant
On Mar 17, 2008, at 8:54 AM, lucene-seme1 s wrote:
Hello,
I am a newbie here and still experimenting with Lucene. I have
annotations and features generated by GATE for many documents and would
like to index the original content of the documents in addition to the
generated annotations. The annotations are in the form of
[<Person> John </Person> loves fishing]. I would like to be able to
search using the Person attribute.
Any hint or suggestion is highly appreciated.
regards,
JK
--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]