Thanks, Erick - Indeed every word will have a part of speech token, but is this how the slop actually works? My understanding was that if I have two tokens in the same location, neither will affect searches involving the other in terms of slop, since slop indicates the number of words *between* search terms in a phrase.
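
To make the question concrete, here is a minimal sketch in plain Python (not Lucene code) of the position model I have in mind. It assumes each token carries a position increment, that an increment of 0 stacks a token on the previous token's position, and that the slop needed for two in-order terms is the number of positions between them. The helper names and the sample sentence are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, int]  # (term, position increment)


def assign_positions(tokens: List[Token]) -> Dict[str, List[int]]:
    """Turn (term, increment) pairs into absolute positions per term."""
    positions: Dict[str, List[int]] = {}
    pos = -1  # the first token with increment 1 lands on position 0
    for term, increment in tokens:
        pos += increment
        positions.setdefault(term, []).append(pos)
    return positions


def same_position(positions: Dict[str, List[int]], a: str, b: str) -> bool:
    """True if some occurrence of a and some occurrence of b share a position."""
    return bool(set(positions.get(a, [])) & set(positions.get(b, [])))


def words_between(positions: Dict[str, List[int]], a: str, b: str) -> Optional[int]:
    """Smallest count of positions strictly between a and a later b, or None."""
    gaps = [pb - pa - 1
            for pa in positions.get(a, [])
            for pb in positions.get(b, [])
            if pb > pa]
    return min(gaps) if gaps else None


# Hypothetical sentence "the report was filed", three ways of indexing it.
PLAIN = [("the", 1), ("report", 1), ("was", 1), ("filed", 1)]

# Tags stacked on their words (increment 0), as in my tokenizer.
STACKED = [("the", 1), ("_det", 0), ("report", 1), ("_n", 0),
           ("was", 1), ("_v", 0), ("filed", 1), ("_v", 0)]

# Tags emitted as ordinary tokens (increment 1), for contrast.
SEQUENTIAL = [(term, 1) for term, _ in STACKED]

plain = assign_positions(PLAIN)
stacked = assign_positions(STACKED)
sequential = assign_positions(SEQUENTIAL)
```

Under these assumptions, `words_between(stacked, "report", "filed")` matches the plain case, while the sequential variant pushes the two words further apart, which is the "doubled slop" effect. `same_position(stacked, "report", "_n")` is the check I would like a query to express.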
Are tokens at the same location actually adjacent in their ordinal values, thus affecting the slop as you describe? If so, is there a predictable way to determine which comes before the other - perhaps the order they are inserted when being tokenized?

thanks,

C>T>

On Thu, Nov 19, 2009 at 8:35 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> If I'm reading this right, your tokenizer creates two tokens. One
> "report" and one "_n"... I suspect if so that this will create some
> "interesting" behaviors. For instance, if you put two tokens in place,
> are you going to double the slop when you don't care about part of
> speech? Is every word going to get a marker? etc.
>
> I'm not sure payloads would be useful here, but you might check it out...
>
> What I'd think about, though, is a variant of synonyms. That is, index
> report and report_n (note no space) at the same location. Then, when
> you wanted to create a part-of-speech-aware query, you'd attach the
> various markers to your terms (_n, _v, _adj, _adv etc.) and not have to
> worry about unexpected side-effects.
>
> HTH
> Erick
>
> On Wed, Nov 18, 2009 at 5:20 PM, Christopher Tignor <ctig...@thinkmap.com> wrote:
>
> > Hello,
> >
> > I have indexed words in my documents with part of speech tags at the
> > same location as these words using a custom Tokenizer as described,
> > very helpfully, here:
> >
> > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200607.mbox/%3c20060712115026.38897.qm...@web26002.mail.ukl.yahoo.com%3e
> >
> > I would like to do a search that retrieves documents when a given word
> > is used with a specific part of speech, e.g. all docs where "report" is
> > used as a noun.
> >
> > I was hoping I could use something like a PhraseQuery with "report _n"
> > (_n is my noun part of speech tag) with some sort of identifier that
> > describes the words as having to be at the same location - like a null
> > slop or something.
> >
> > Any thoughts on how to do this?
> >
> > thanks so much,
> >
> > C>T>
> >
> > --
> > TH!NKMAP
> >
> > Christopher Tignor | Senior Software Architect
> > 155 Spring Street NY, NY 10012
> > p.212-285-8600 x385 f.212-285-8999

--
TH!NKMAP

Christopher Tignor | Senior Software Architect
155 Spring Street NY, NY 10012
p.212-285-8600 x385 f.212-285-8999