Re: Indexing/Querying Annotations and Fields for a document

mark harwood Tue, 18 Mar 2008 03:49:29 -0700

I've used a custom analyzer before now to "blend in" GATE annotations as tokens 
at the same position as the words they relate to.


E.g. 
    Fred Smith works for Microsoft

would be tokenized ordinarily as the following tokens:

    position    offset    text
    ========    ======    ====
    1           0         fred
    2           5         smith
    3           11        works
    ....

But in a custom analyzer you would know the offsets of all these normal tokens 
plus have visibility of the GATE annotations, including their offsets. Your custom 
analyzer can blend these to produce the following:

    position    offset    text
    ========    ======    ====
    1           0         fred
    1           0         GATE_PERSON
    2           5         smith
    3           11        works

The trick to adding "GATE_PERSON" at the same position as "fred" is to set the 
"position increment" of this token to zero. 
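
To make the blending step concrete, here is a minimal library-free sketch in plain Java. It is a model of the idea, not Lucene code: the Tok, GateAnnotation and AnnotationBlender classes are hypothetical names made up for illustration, the tokenizer is a simple whitespace splitter, and an annotation is assumed to be just a label plus a start offset. In a real Lucene analyzer of this era the zero increment would be set on the injected token with Token.setPositionIncrement(0).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Library-free stand-in for an analyzer token (not Lucene's Token class). */
final class Tok {
    final String text;
    final int startOffset;
    final int positionIncrement; // 1 = next position, 0 = same position as previous token

    Tok(String text, int startOffset, int positionIncrement) {
        this.text = text;
        this.startOffset = startOffset;
        this.positionIncrement = positionIncrement;
    }
}

/** Hypothetical shape of a GATE annotation: a label plus the offset where it starts. */
final class GateAnnotation {
    final String label;
    final int startOffset;

    GateAnnotation(String label, int startOffset) {
        this.label = label;
        this.startOffset = startOffset;
    }
}

final class AnnotationBlender {
    /** Splits on whitespace, lowercases, and records each token's start offset. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<Tok>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) { i++; continue; }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            out.add(new Tok(text.substring(start, i).toLowerCase(), start, 1));
        }
        return out;
    }

    /** Emits each word token, then any annotation starting at the same offset with increment 0. */
    static List<Tok> blend(List<Tok> tokens, List<GateAnnotation> annotations) {
        List<Tok> out = new ArrayList<Tok>();
        for (Tok t : tokens) {
            out.add(t);
            for (GateAnnotation a : annotations) {
                if (a.startOffset == t.startOffset) {
                    out.add(new Tok(a.label, t.startOffset, 0)); // stacked on the word
                }
            }
        }
        return out;
    }

    /** Absolute positions are the running sum of the increments (first token is 1). */
    static int[] positions(List<Tok> tokens) {
        int[] result = new int[tokens.size()];
        int pos = 0;
        for (int i = 0; i < tokens.size(); i++) {
            pos += tokens.get(i).positionIncrement;
            result[i] = pos;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Tok> blended = blend(tokenize("Fred Smith works for Microsoft"),
                Arrays.asList(new GateAnnotation("GATE_PERSON", 0)));
        int[] pos = positions(blended);
        for (int i = 0; i < blended.size(); i++) {
            Tok t = blended.get(i);
            System.out.println(pos[i] + "\t" + t.startOffset + "\t" + t.text);
        }
    }
}
```

Because the annotation token carries an increment of zero, it never shifts the positions of the words that follow it, so phrase queries over the plain words behave as they would without the annotations.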

Now you can construct a Lucene query that makes use of this position info, 
i.e. instead of searching for the specific:

    "Fred works for Microsoft"~5

you can now search for the more general:

    "GATE_PERSON works for microsoft"~5
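
To show why the stacked token lets the general query match, here is a simplified, library-free model of positional phrase matching in plain Java. It is deliberately not Lucene's sloppy-phrase algorithm: the PositionalPhraseMatcher class is a hypothetical name, terms must appear in query order, and "slop" is taken to be the total number of positions skipped between consecutive terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified in-order phrase matcher over token positions (a model, not Lucene's scorer). */
final class PositionalPhraseMatcher {
    /** positions.get(p) holds every term indexed at position p; stacked tokens share a slot. */
    private final List<Set<String>> positions = new ArrayList<Set<String>>();

    /** Adds a token; an increment of 0 stacks it on the previous token's position. */
    void add(String term, int positionIncrement) {
        int steps = positions.isEmpty() ? Math.max(1, positionIncrement) : positionIncrement;
        for (int i = 0; i < steps; i++) {
            positions.add(new HashSet<String>());
        }
        positions.get(positions.size() - 1).add(term);
    }

    /** True if the terms occur in order with at most 'slop' skipped positions in total. */
    boolean matches(List<String> phrase, int slop) {
        if (phrase.isEmpty()) return false;
        for (int start = 0; start < positions.size(); start++) {
            if (positions.get(start).contains(phrase.get(0))
                    && matchFrom(phrase, 1, start, slop)) {
                return true;
            }
        }
        return false;
    }

    /** Tries to place phrase term 'termIdx' after 'prevPos', spending slop on any gap. */
    private boolean matchFrom(List<String> phrase, int termIdx, int prevPos, int slopLeft) {
        if (termIdx == phrase.size()) return true;
        for (int gap = 0; gap <= slopLeft; gap++) {
            int pos = prevPos + 1 + gap;
            if (pos >= positions.size()) break;
            if (positions.get(pos).contains(phrase.get(termIdx))
                    && matchFrom(phrase, termIdx + 1, pos, slopLeft - gap)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PositionalPhraseMatcher doc = new PositionalPhraseMatcher();
        doc.add("fred", 1);
        doc.add("GATE_PERSON", 0); // same position as "fred"
        doc.add("smith", 1);
        doc.add("works", 1);
        doc.add("for", 1);
        doc.add("microsoft", 1);
        List<String> general = Arrays.asList("GATE_PERSON", "works", "for", "microsoft");
        System.out.println(doc.matches(general, 5)); // "smith" is skipped using the slop
        System.out.println(doc.matches(general, 0)); // no slop, so "smith" blocks the match
    }
}
```

One practical point with the real thing: the query-side analyzer has to leave the annotation term in the same form it was indexed in (for example, not lowercase "GATE_PERSON"), otherwise the term will not be found.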

The GATE tokens, e.g. "GATE_PERSON", would have to be terms you wouldn't expect 
to find in normal text, so that they don't clash with real words. 
Another way of doing this, which avoids this problem, might be to look at the new 
payloads API. 
Anyone care to weigh in on whether this is feasible and on the state of play with 
payloads?

Cheers
Mark


----- Original Message ----
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, 18 March, 2008 12:24:02 AM
Subject: Re: Indexing/Querying Annotations and Fields for a document

You would parse the XML (or whatever) into separate strings, and put  
each piece into its own Field in a Lucene Document.

For instance:

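import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
String body = getBody(input);
String people = getPeople(input);
// Store/Index options are illustrative; pick what suits your application
Field bodyField = new Field("body", body, Field.Store.YES, Field.Index.TOKENIZED);
Field peopleField = new Field("people", people, Field.Store.YES, Field.Index.TOKENIZED);
doc.add(bodyField);
doc.add(peopleField);

writer.addDocument(doc);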


Essentially, you just need to implement the getPeople and getBody  
methods to extract the appropriate content from your text.
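
As a rough illustration of what those two methods might look like, here is a minimal sketch in plain Java. It assumes the input is a String with inline <Person>...</Person> tags as in the original question; the AnnotationExtractor class name and the regular expressions are made up for this example, and a real GATE export would more likely be read through GATE's own API or an XML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that splits inline-annotated text into field values. */
final class AnnotationExtractor {
    // Captures the text between <Person> and </Person>, ignoring padding spaces.
    private static final Pattern PERSON =
            Pattern.compile("<Person>\\s*(.*?)\\s*</Person>");
    // Matches any simple opening or closing tag such as <Person> or </Person>.
    private static final Pattern ANY_TAG =
            Pattern.compile("</?[A-Za-z][A-Za-z0-9]*>");

    /** Returns all annotated person names, separated by single spaces. */
    static String getPeople(String input) {
        List<String> names = new ArrayList<String>();
        Matcher m = PERSON.matcher(input);
        while (m.find()) {
            names.add(m.group(1));
        }
        return String.join(" ", names);
    }

    /** Returns the original text with the annotation tags stripped out. */
    static String getBody(String input) {
        String withoutTags = ANY_TAG.matcher(input).replaceAll("");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String input = "<Person> John </Person> loves fishing";
        System.out.println(getPeople(input));
        System.out.println(getBody(input));
    }
}
```

Joining names with spaces loses the boundary between multi-word names; adding one "people" Field per name is an alternative if that matters.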


On Mar 17, 2008, at 5:05 PM, lucene-seme1 s wrote:

> I already have the document preprocessed and the annotations (i.e.
> <Person>John</Person>) are already stored in an array with features
> attached to some annotations (such as the root and lemma of the word).
> Can you please elaborate some more on how to "index them as normally
> would"?
>
> Regards,
> JK
>
>
> On Mon, Mar 17, 2008 at 4:33 PM, Grant Ingersoll <[EMAIL PROTECTED]>
> wrote:
>
>> I think there are a couple of ways you can approach this, although I
>> have never used GATE.
>>
>> If these annotations are marked in line in your content, then you can
>> either preprocess the files to have them separately and index as you
>> normally would, or you can use the relatively new TeeTokenFilter and
>> SinkTokenizer to extract them as you go for use in other fields.  I
>> have done this successfully for some apps that I have worked on and I
>> think it works quite nicely and beats preprocessing IMO.  Essentially,
>> you set up a TeeTokenFilter that recognizes your Person and then set
>> that token aside in the Sink.  Then, when you construct the Person
>> field, you use the SinkTokenizer.
>>
>> HTH,
>> Grant
>>
>> On Mar 17, 2008, at 8:54 AM, lucene-seme1 s wrote:
>>
>>> Hello,
>>>
>>> I am a newbie here and still experimenting with Lucene. I have
>>> annotations and features generated by GATE for many documents and
>>> would like to index the original content of the documents in addition
>>> to the generated annotations. The annotations are in the form of
>>> [<Person> John </Person> loves fishing]. I would like to be able to
>>> search using the Person attribute.
>>>
>>> Any hint or suggestion is highly appreciated
>>>
>>> regards,
>>> JK
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucenebootcamp.com
>> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]