Ahhh, I should have followed the link. I was interpreting your first note as emitting two tokens NOT at the same offset. My mistake, ignore my nonsense about unexpected consequences. Your original assumption is correct, zero offsets are pretty transparent.
What do you really want to do here? Mark's email (at the link) allows you to create queries queries expressing "find all phrases of the form noun-verb-adverb" say. The slop allows for intervening words. Your original post seems to want different semantics. <<<I would like to do a search that retrieves documents when a given word is used with a specific part of speech, e.g. all docs where "report" is used as a noun>>>. For that, my suggestion seems simpler, which is not surprising since it addresses a less general problem. So instead of including a general part of speech token, just suffix your original word with your marker and use that for your "synonym. Then expressing your intent is simply tacking on the part of speech marker to the words you care about (e.g. report_n when you wanted report as a noun). No phrases or slop required, at the expense of more terms. Hmmmm, if you wanted to, say, "find all the nouns in the index", you could *prefix* the word (e.g. n_report) which would group all the nouns together in the term enumerations.... Sorry for the confusion Erick On Thu, Nov 19, 2009 at 9:38 AM, Christopher Tignor <ctig...@thinkmap.com>wrote: > Thanks, Erick - > > Indeed every word will have a part of speech token but Is this how the slop > actually works? My understanding was that if I have two tokens in the same > location then each will not effect searches involving other in terms of the > slop as slop indicates the number of words *between* search terms in a > phrase. > > Are tokens at the same location actually adjacent in their ordinal values, > thus affecting the slop as you describe? > > If so, Is there a predictable way to determine which comes before the other > - perhaps the order they are inserted when being tokenized? > > thanks, > > C>T> > > On Thu, Nov 19, 2009 at 8:35 AM, Erick Erickson <erickerick...@gmail.com > >wrote: > > > If I'm reading this right, your tokenizer creates two tokens. One > > "report" and one "_n"... I suspect if so that this will create some > > "interesting" > > behaviors. For instance, if you put two tokens in place, are you going > > to double the slop when you don't care about part of speech? Is every > > word going to get a marker? etc. > > > > I'm not sure payloads would be useful here, but you might check it out... > > > > What I'd think about, though, is a variant of synonyms. That is, index > > report and report_n (note no space) at the same location. Then, when > > you wanted to create a part-of-speech-aware query, you'd attach the > > various markers to your terms (_n, _v, _adj, _adv etc.) and not have to > > worry about unexpected side-effects. > > > > HTH > > Erick > > > > On Wed, Nov 18, 2009 at 5:20 PM, Christopher Tignor < > ctig...@thinkmap.com > > >wrote: > > > > > Hello, > > > > > > I have indexed words in my documents with part of speech tags at the > same > > > location as these words using a custom Tokenizer as described, very > > > helpfully, here: > > > > > > > > > > > > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200607.mbox/%3c20060712115026.38897.qm...@web26002.mail.ukl.yahoo.com%3e > > > > > > I would like to do a search that retrieves documents when a given word > is > > > used with a specific part of speech, e.g. all docs where "report" is > used > > > as > > > a noun. > > > > > > I was hoping I could use something like a PhraseQuery with "report _n" > > (_n > > > is my noun part of speech tag) with some sort of identifier that > > describes > > > the words as having to be at the same location - like a null slop or > > > something. > > > > > > Any thoughts on how to do this? > > > > > > thanks so much, > > > > > > C>T> > > > > > > -- > > > TH!NKMAP > > > > > > Christopher Tignor | Senior Software Architect > > > 155 Spring Street NY, NY 10012 > > > p.212-285-8600 x385 f.212-285-8999 > > > > > > > > > -- > TH!NKMAP > > Christopher Tignor | Senior Software Architect > 155 Spring Street NY, NY 10012 > p.212-285-8600 x385 f.212-285-8999 >