Well, I tried that, and it still doesn't seem to work. I would be happy to zip up the new files so you can see what I'm using -- maybe you can get it to work.

The first time, I built the documents without quotes surrounding each phrase. Then I retried with every phrase enclosed in double quotes. Neither seemed to work. When constructing the query string for the search, I always added the double quotes (otherwise, it would be treated as multiple terms). (I didn't even test the underscore and hyphenated terms.)

I thought Lucene was (sort of by default) set up to search quoted phrases. From http://lucene.apache.org/java/docs/api/index.html --> A Phrase is a group of words surrounded by double quotes such as "hello dolly".

So, this should be easy, right? I must be missing something stupid.
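To check my understanding of the suggested change, here is a rough, Lucene-free sketch of what I take the QUOTED rule and the '-' / '_' additions (quoted below) to do. It is only an approximation in plain Java: the class name QuotedTokenSketch, the method tokenize, and the regular expression are my own stand-ins, not anything from Lucene or the generated tokenizer, and \p{L} / \p{N} cover more characters than the grammar's LETTER and DIGIT ranges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the Lucene StandardTokenizer.
public class QuotedTokenSketch {

    // First alternative mirrors <QUOTED: "\"" (~["\""])+ "\"">:
    //   a double quote, one or more non-quote characters, a closing quote.
    // Second alternative approximates ALPHANUM once '-' and '_' count as
    //   letters, so they no longer break a word apart.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"]+\"|[\\p{L}\\p{N}_-]+");

    // Returns the tokens in order of appearance; characters that match
    // neither alternative (spaces, a stray unmatched quote) are skipped.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The quoted phrase comes back as ONE token, quotes included.
        System.out.println(
                tokenize("say \"hello dolly\" to foo_bar and well-known"));
    }
}
```

If this reading is right, the quoted phrase is a single token that still carries its quote characters, so the term stored at index time and the term produced from the query string would have to be identical, quotes and all. I have not verified whether the query side actually hands the quotes through to the tokenizer.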
Thanks,
Philip

Mark Miller-5 wrote:
>
> So this will recognize anything in quotes as a single token and '_' and
> '-' will not break up words. There may be some repercussions for the NUM
> token but nothing I'd worry about. maybe you want to use Unicode for '-'
> and '_' as well...I wouldn't worry about it myself.
>
> - Mark
>
>
> TOKEN : {                                 // token patterns
>
>   // basic word: a sequence of digits & letters
>   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
>
> | <QUOTED: "\"" (~["\""])+ "\"">
>
>   // internal apostrophes: O'Reilly, you're, O'Reilly's
>   // use a post-filter to remove possesives
> | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
>
>   // acronyms: U.S.A., I.B.M., etc.
>   // use a post-filter to remove dots
> | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
>
>   // company names like AT&T and [EMAIL PROTECTED]
> | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
>
>   // email addresses
> | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
>           (("."|"-") <ALPHANUM>)+ >
>
>   // hostname
> | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
>
>   // floating point, serial, model numbers, ip addresses, etc.
>   // every other segment must have at least one digit
> | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>        | <HAS_DIGIT> <P> <ALPHANUM>
>        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>         )
>   >
> | <#P: ("_"|"-"|"/"|"."|",") >
> | <#HAS_DIGIT:                            // at least one digit
>     (<LETTER>|<DIGIT>)*
>     <DIGIT>
>     (<LETTER>|<DIGIT>)*
>   >
>
> | < #ALPHA: (<LETTER>)+>
> | < #LETTER:                              // unicode letters
>       [
>        "\u0041"-"\u005a",
>        "\u0061"-"\u007a",
>        "\u00c0"-"\u00d6",
>        "\u00d8"-"\u00f6",
>        "\u00f8"-"\u00ff",
>        "\u0100"-"\u1fff",
>        "-", "_"
>       ]
>   >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

--
View this message in context: http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
Sent from the Lucene - Java Users forum at Nabble.com.