I think Mark's idea is better for this, although I seem to recall
there being some caveats with multiple tokens at the same position; I
don't remember the details. I _think_ term vectors don't like it,
so if you need them, you might have trouble. Perhaps a search of
the mailing lists
----- Original Message -----
From: Amit Kumar <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, 12 July, 2006 4:15:34 PM
Subject: Re: Storing Part of Speech information in Lucene Indices
We need to be able to search by word and POS, and also have the POS
available for each occurrence. Appending the POS to the terms will
create a post-processing nightmare for retrieving term frequencies,
right? (I would have to add up all the foo_NN and foo_ADJ counts, etc.)
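The summation described above could look roughly like the following. This is only a sketch in plain Java with no Lucene dependency: the class name PosFrequencySketch, the method totalFrequency, and the in-memory map of term frequencies are all hypothetical stand-ins for whatever the index would actually return. It also assumes that POS tags never contain an underscore, so the last underscore in a term separates the word from its tag.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PosFrequencySketch {

    // Sums the frequencies of every tagged variant of a word,
    // e.g. foo_NN + foo_ADJ for the word "foo".
    // Assumption: tags contain no underscore, so the last '_' splits word and tag.
    static int totalFrequency(Map<String, Integer> termFreqs, String word) {
        int total = 0;
        for (Map.Entry<String, Integer> entry : termFreqs.entrySet()) {
            String term = entry.getKey();
            int sep = term.lastIndexOf('_');
            if (sep > 0 && term.substring(0, sep).equals(word)) {
                total += entry.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical per-term frequencies, as if read back from an index.
        Map<String, Integer> termFreqs = new LinkedHashMap<>();
        termFreqs.put("foo_NN", 3);
        termFreqs.put("foo_ADJ", 2);
        termFreqs.put("food_NN", 7);

        System.out.println(totalFrequency(termFreqs, "foo")); // prints 5
    }
}
```

Every frequency lookup for a plain word needs a pass like this over all of its tagged variants, which is the extra work in question.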
I can store the POS in a parallel field
----- Original Message -----
From: Amit Kumar <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Cc: Amit Kumar <[EMAIL PROTECTED]>
Sent: Wednesday, 12 July, 2006 6:36:24 AM
Subject: Storing Part of Speech information in Lucene Indices
Hi Amit,
This is definitely something you can do. What are your goals for
it? Do you want to search by word and POS or do you just want POS
available for post processing?
You could just append the POS tag onto the end of your token as it
gets indexed, something like foo_NN or foo_ADJ.
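A minimal sketch of the tagging step, in plain Java with no Lucene dependency. The class PosTermSketch and its methods are hypothetical names for illustration only; it assumes the words and their POS tags arrive as two parallel arrays from whatever tagger is used, and it only builds the strings that would then be handed to the indexer.

```java
import java.util.ArrayList;
import java.util.List;

public class PosTermSketch {

    // Joins a word and its POS tag with an underscore, e.g. foo + NN -> foo_NN.
    static String tagTerm(String word, String pos) {
        return word + "_" + pos;
    }

    // Builds the tagged terms from parallel arrays of words and tags.
    static List<String> tagTerms(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            terms.add(tagTerm(words[i], tags[i]));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Hypothetical tagger output for a two-word input.
        String[] words = {"time", "flies"};
        String[] tags = {"NN", "VBZ"};
        System.out.println(tagTerms(words, tags)); // prints [time_NN, flies_VBZ]
    }
}
```

With terms built this way, a search for foo_NN matches only the noun occurrences, while a search for plain foo matches nothing unless the untagged word is indexed as well.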
Hi,
A new project that I am investigating Lucene for needs part-of-speech
information for the tokens. I can get that information using NLP
techniques (GATE etc.) by preprocessing the documents, but I would
like to store that information in the indices. Something along the
lines of