Sorry to hear you're having trouble. You indeed need the double quotes in the source text. You will also need them in the query string. Make sure they are in both places. My machine is hosed right now or I would do it for you real quick. My guess is that I forgot to mention...no only do you need to add the <QUOTED> definiton to the TOKEN section, but below that you will find the grammer...you need to add <QUOTED> to the grammer. If you look how <NUM> and <APOSTROPHE> are done you will prob see what you should do. If not, my machine should be back up tomarrow...
- Mark On 9/1/06, Philip Brown <[EMAIL PROTECTED]> wrote:
Well, I tried that, and it doesn't seem to work still. I would be happy to zip up the new files, so you can see what I'm using -- maybe you can get it to work. The first time, I tried building the documents without quotes surrounding each phrase. Then, I retried by enclosing every phrase within double quotes. Neither seemed to work. When constructing the query string for the search, I always added the double quotes (otherwise, it'd think it was multiple terms). (I didn't even test the underscore and hyphenated terms.) I thought Lucene was (sort of by default) set up to search quoted phrases. From http://lucene.apache.org/java/docs/api/index.html --> A Phrase is a group of words surrounded by double quotes such as "hello dolly". So, this should be easy, right? I must be missing something stupid. Thanks, Philip Mark Miller-5 wrote: > > So this will recognize anything in quotes as a single token and '_' and > '-' will not break up words. There may be some repercussions for the NUM > token but nothing I'd worry about. maybe you want to use Unicode for '-' > and '_' as well...I wouldn't worry about it myself. > > - Mark > > > TOKEN : { // token patterns > > // basic word: a sequence of digits & letters > <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ > > > | <QUOTED: "\"" (~["\""])+ "\""> > > // internal apostrophes: O'Reilly, you're, O'Reilly's > // use a post-filter to remove possesives > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ > > > // acronyms: U.S.A., I.B.M., etc. > // use a post-filter to remove dots > | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ > > > // company names like AT&T and [EMAIL PROTECTED] > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> > > > // email addresses > | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM> > (("."|"-") <ALPHANUM>)+ > > > // hostname > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ > > > // floating point, serial, model numbers, ip addresses, etc. > // every other segment must have at least one digit > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT> > | <HAS_DIGIT> <P> <ALPHANUM> > | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+ > | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+ > | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+ > | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+ > ) > > > | <#P: ("_"|"-"|"/"|"."|",") > > | <#HAS_DIGIT: // at least one digit > (<LETTER>|<DIGIT>)* > <DIGIT> > (<LETTER>|<DIGIT>)* > > > > | < #ALPHA: (<LETTER>)+> > | < #LETTER: // unicode letters > [ > "\u0041"-"\u005a", > "\u0061"-"\u007a", > "\u00c0"-"\u00d6", > "\u00d8"-"\u00f6", > "\u00f8"-"\u00ff", > "\u0100"-"\u1fff", > "-", "_" > ] > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > -- View this message in context: http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920 Sent from the Lucene - Java Users forum at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]