Re: Storing Part of Speech information in Lucene Indices

Grant Ingersoll Wed, 12 Jul 2006 09:51:07 -0700

I think Mark's idea is better for this. Although I seem to recall there being some caveats w/ multiple tokens at the same position, I don't remember the details. I _think_ term vectors don't like it, so if you need them, you might have trouble. Perhaps a search of the mailing lists and JIRA might turn up something, or maybe someone else remembers. At any rate, it may not affect you, so I would try Mark's suggestion and see if it works.


-Grant


On Jul 12, 2006, at 11:15 AM, Amit Kumar wrote:

We need to be able to search by word and POS and also have POS available for each occurrence. Appending POS to the terms will create a post-processing nightmare to retrieve term frequencies, right? (I would have to add up all the foo_NN and foo_ADJ etc.)
I can store the POS in a parallel field and access it via term vectors, but that wouldn't allow any kind of search on POS-related fields, right? For example, if I wanted to search for any adjective within 3 words of a given term, or if I wanted to get all the patterns that follow the sequence ADJ NN ADJ.
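To make the two example queries above concrete, here is a toy sketch in plain Java. It makes no Lucene calls; the class `PosPositionSketch` and its methods are made-up names, and the parallel `words`/`tags` arrays merely stand in for whatever positional structure would hold a word and its POS tag at each position of a single document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not a Lucene class or API.
final class PosPositionSketch {
    private final String[] words; // token text at each position
    private final String[] tags;  // POS tag at the same position

    PosPositionSketch(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must align");
        }
        this.words = words;
        this.tags = tags;
    }

    // Positions of tokens tagged `tag` that have an occurrence of `term`
    // at most `slop` positions away (on either side).
    List<Integer> tagNear(String tag, String term, int slop) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (!tags[i].equals(tag)) {
                continue;
            }
            int from = Math.max(0, i - slop);
            int to = Math.min(words.length - 1, i + slop);
            for (int j = from; j <= to; j++) {
                if (j != i && words[j].equals(term)) {
                    hits.add(i);
                    break;
                }
            }
        }
        return hits;
    }

    // Start positions where consecutive tags match the pattern exactly,
    // e.g. tagSequence("ADJ", "NN", "ADJ").
    List<Integer> tagSequence(String... pattern) {
        List<Integer> starts = new ArrayList<>();
        if (pattern.length == 0) {
            return starts;
        }
        for (int i = 0; i + pattern.length <= tags.length; i++) {
            String[] window = Arrays.copyOfRange(tags, i, i + pattern.length);
            if (Arrays.equals(window, pattern)) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        PosPositionSketch doc = new PosPositionSketch(
            new String[] {"big", "dog", "lazy", "and", "old", "cat", "sleeps"},
            new String[] {"ADJ", "NN", "ADJ", "CC", "ADJ", "NN", "VB"});
        System.out.println(doc.tagNear("ADJ", "cat", 3));
        System.out.println(doc.tagSequence("ADJ", "NN", "ADJ"));
    }
}
```

Both lookups need the word and the tag to be addressable by the same position, which is why a parallel field that is only readable through term vectors does not answer them at search time.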
Let me look in the developer archives for the payload discussions; perhaps implementing that might satisfy my use cases.
Comments?

-Thanks
Amit



On Jul 12, 2006, at 6:39 AM, Grant Ingersoll wrote:
Hi Amit,
This is definitely something you can do. What are your goals for it? Do you want to search by word and POS, or do you just want POS available for post-processing?
You could just append the POS tag onto the end of your token as it gets indexed, something like foo_NN or foo_ADJ. This approach may mean you have to use a prefix query when you want to search against just "foo". You could also have a parallel field to your main field that stores the POS. Then you could access it via the term vectors array.
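As a rough illustration of the suffix approach, the sketch below uses a plain `TreeMap` as a stand-in for a term dictionary. It is not Lucene code; `PosSuffixTerms` and its methods are hypothetical names. It shows why a lookup on just "foo" becomes a prefix scan over "foo_", and why the total frequency of a word has to be summed across its tagged variants.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration only: a sorted map standing in for a term dictionary.
final class PosSuffixTerms {
    // term ("word_TAG") -> number of occurrences
    private final TreeMap<String, Integer> termFreqs = new TreeMap<>();

    // Record one occurrence of a token as "word_TAG", e.g. "foo_NN".
    void add(String word, String posTag) {
        termFreqs.merge(word + "_" + posTag, 1, Integer::sum);
    }

    // All tagged variants of a word: every key that starts with "word_".
    // Assumes the word itself contains no '_' (otherwise "foo" would
    // also match "foo_bar_NN").
    SortedMap<String, Integer> byWord(String word) {
        String start = word + "_";
        // The character after '_' gives an exclusive upper bound for the prefix range.
        String end = word + (char) ('_' + 1);
        return termFreqs.subMap(start, end);
    }

    // Frequency of the word regardless of tag: sum over all tagged variants.
    int totalFrequency(String word) {
        int total = 0;
        for (int freq : byWord(word).values()) {
            total += freq;
        }
        return total;
    }

    public static void main(String[] args) {
        PosSuffixTerms index = new PosSuffixTerms();
        index.add("foo", "NN");
        index.add("foo", "NN");
        index.add("foo", "ADJ");
        index.add("food", "NN");
        System.out.println(index.byWord("foo"));
        System.out.println(index.totalFrequency("foo"));
    }
}
```

The separator matters here: scanning for the bare prefix "foo" would also pick up "food_NN", so the scan is done on "foo_" instead.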
Also, we have been discussing on the developers list how to add payloads to a posting (i.e. store related information at a position in the index), similar to what Google discusses in their original paper. Unfortunately, this isn't implemented yet, but if you feel like helping out, check out the discussion on the developer's list (see Flexible Indexing).
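For readers unfamiliar with the idea, the following is a minimal conceptual sketch of a posting that carries extra information per position. It is plain Java with invented names (`PayloadPostingsSketch`, `Posting`); it is not the design under discussion on the developers list and not a Lucene API, which per the above does not exist yet.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: an in-memory postings map with a per-position payload.
final class PayloadPostingsSketch {
    static final class Posting {
        final int docId;
        final int position;
        final String payload; // here: the POS tag for this occurrence

        Posting(int docId, int position, String payload) {
            this.docId = docId;
            this.position = position;
            this.payload = payload;
        }
    }

    // The term is the bare word, so its frequency is not split across tags.
    private final Map<String, List<Posting>> postings = new HashMap<>();

    void add(String term, int docId, int position, String payload) {
        postings.computeIfAbsent(term, k -> new ArrayList<>())
                .add(new Posting(docId, position, payload));
    }

    List<Posting> postingsFor(String term) {
        return postings.getOrDefault(term, Collections.<Posting>emptyList());
    }

    // Occurrences of a term grouped by payload value.
    Map<String, Integer> countByPayload(String term) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Posting p : postingsFor(term)) {
            counts.merge(p.payload, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        PayloadPostingsSketch index = new PayloadPostingsSketch();
        index.add("foo", 1, 0, "NN");
        index.add("foo", 1, 5, "ADJ");
        index.add("foo", 2, 3, "NN");
        System.out.println(index.postingsFor("foo").size());
        System.out.println(index.countByPayload("foo"));
    }
}
```

In this arrangement the word stays a single term, and the tag travels with each occurrence instead of being baked into the term text.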
-Grant

On Jul 12, 2006, at 1:36 AM, Amit Kumar wrote:
Hi,
A new project that I am investigating Lucene for needs part-of-speech information for the tokens. I can get that information using NLP techniques (GATE etc.) by pre-processing the documents, but I would like to store that information in the indices. Something along the lines of

TermVectorOffsetInfo[?].getPartofSpeech();
I am writing to ask for your advice; you can tell me I am bonkers or let me know where I should start digging :). Is that a good idea? Or would it be just less trouble for me to store the offset information along with parts of speech outside Lucene?

Has anyone else done that?

Best,
Amit
ps: Thank you for putting the LuceneInAction source online, it was a great help to see the CategorizerTest.java.
I am ordering my copy of the book tomorrow :)

---------------------------------------------------------
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
---------------------------------------------------------
--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Fax: 315-443-6886




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






