Hi Amit,
This is definitely something you can do. What are your goals for
it? Do you want to search by word and POS, or do you just want the POS
available for post-processing?
You could just append the POS tag onto the end of your token as it
gets indexed, something like foo_NN or foo_ADJ. This approach may
mean you have to use a prefix query when you want to search against
just "foo". You could also have a parallel field alongside your main
field that stores the POS. Then you could access it via the term
vectors array.
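To make the two layouts concrete, here is a rough sketch in plain Java. It does not call any Lucene API: PosIndexSketch, tagToken, wordPrefix and posAt are hypothetical names made up for illustration, and the lists only stand in for indexed fields. It assumes "_" never occurs inside a word.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: illustrates the two layouts only. */
public class PosIndexSketch {

    /** Option 1: fold the POS tag into the token text, e.g. "foo" + "NN" -> "foo_NN". */
    static String tagToken(String word, String pos) {
        return word + "_" + pos;
    }

    /**
     * Prefix to search for when only the word is known. Including the
     * separator keeps "foo" from also matching "food_NN".
     */
    static String wordPrefix(String word) {
        return word + "_";
    }

    /** Option 2: keep POS tags in a second list aligned by token position. */
    static String posAt(List<String> posField, int position) {
        return posField.get(position);
    }

    public static void main(String[] args) {
        // Assumed output of an external tagger (GATE or similar).
        String[][] tagged = { {"the", "DT"}, {"food", "NN"}, {"foo", "NN"} };

        List<String> combined = new ArrayList<>();   // option 1: one field
        List<String> words = new ArrayList<>();      // option 2: main field
        List<String> tags = new ArrayList<>();       // option 2: parallel POS field
        for (String[] t : tagged) {
            combined.add(tagToken(t[0], t[1]));
            words.add(t[0]);
            tags.add(t[1]);
        }

        // Option 1: emulate a prefix query for the bare word "foo".
        for (String token : combined) {
            if (token.startsWith(wordPrefix("foo"))) {
                System.out.println("prefix match: " + token);
            }
        }

        // Option 2: look up the POS of the token at the same position.
        int position = words.indexOf("foo");
        System.out.println("POS of foo: " + posAt(tags, position));
    }
}
```

With the first layout the separator has to be part of the prefix, otherwise a search for "foo" would also pick up "food_NN". With the second layout the word field stays untouched, at the cost of keeping the two fields aligned by position.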
Also, we have been discussing on the developers' list how to add
payloads to a posting (i.e., store related information at a position
in the index), similar to what Google discusses in their original
paper. Unfortunately, this isn't implemented yet, but if you feel
like helping out, check out the discussion on the developers' list
(see Flexible Indexing).
-Grant
On Jul 12, 2006, at 1:36 AM, Amit Kumar wrote:
Hi,
A new project for which I am investigating Lucene needs
part-of-speech information for the tokens. I can get that
information using NLP techniques (GATE etc.) by preprocessing
the documents, but I would like to store that
information in the indices. Something along the lines of
TermVectorOffsetInfo[?].getPartofSpeech();
I am writing to ask for your advice. You can tell me I am b o n k e r s
or let me know where I should start digging :).
Is that a good idea? Or would it just be less trouble for me to
store the offset information along with parts of speech
outside Lucene?
Has anyone else done that?
Best,
Amit
P.S. Thank you for putting the LuceneInAction source online; it was
a great help to see the CategorizerTest.java.
I am ordering my copy of the book tomorrow :)
---------------------------------------------------------
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
---------------------------------------------------------
--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886