Re: Phrase search using quotes -- special Tokenizer

Mark Miller Fri, 01 Sep 2006 15:22:05 -0700

Sorry to hear you're having trouble. You do indeed need the double quotes
in the source text, and you will also need them in the query string. Make
sure they are in both places. My machine is hosed right now or I would do
it for you real quick. My guess is that I forgot to mention: not only do
you need to add the <QUOTED> definition to the TOKEN section, but below
that you will find the grammar, and you need to add <QUOTED> to the grammar
as well. If you look at how <NUM> and <APOSTROPHE> are done, you will
probably see what you should do. If not, my machine should be back up
tomorrow...
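
For what it's worth, here is a rough way to see what the <QUOTED> rule is
meant to match, and why the quotes have to be present in the text being
tokenized. This is only a sketch using java.util.regex, not the JavaCC
tokenizer itself: the class name QuotedTokenSketch and the word pattern are
made up for illustration, and the word pattern is a loose stand-in for
<ALPHANUM> with '-' and '_' treated as letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Lucene.
public class QuotedTokenSketch {
    // First alternative mirrors <QUOTED>: a double quote, one or more
    // non-quote characters, then a closing double quote.
    // Second alternative is an assumed stand-in for word tokens in which
    // '-' and '_' count as letters, so they do not split a word.
    private static final Pattern TOKEN =
        Pattern.compile("\"[^\"]+\"|[\\p{L}\\p{N}_-]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Quotes present: the phrase comes out as one token, quotes included.
        // prints [say, "hello dolly", to, foo_bar, and, well-known]
        System.out.println(
            tokenize("say \"hello dolly\" to foo_bar and well-known"));

        // Quotes absent: the same words come out as separate tokens.
        // prints [say, hello, dolly]
        System.out.println(tokenize("say hello dolly"));
    }
}
```

Note that the quoted token keeps its quote characters, so whatever ends up
in the index and whatever the query side produces have to agree on that.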


- Mark

On 9/1/06, Philip Brown <[EMAIL PROTECTED]> wrote:



Well, I tried that, and it doesn't seem to work still.  I would be happy to
zip up the new files, so you can see what I'm using -- maybe you can get it
to work.  The first time, I tried building the documents without quotes
surrounding each phrase.  Then, I retried by enclosing every phrase within
double quotes.  Neither seemed to work.  When constructing the query string
for the search, I always added the double quotes (otherwise, it'd think it
was multiple terms).  (I didn't even test the underscore and hyphenated
terms.)  I thought Lucene was (sort of by default) set up to search quoted
phrases.  From http://lucene.apache.org/java/docs/api/index.html --> A
Phrase is a group of words surrounded by double quotes such as "hello
dolly".  So, this should be easy, right?  I must be missing something stupid.

Thanks,

Philip


Mark Miller-5 wrote:
>
> So this will recognize anything in quotes as a single token and '_' and
> '-' will not break up words. There may be some repercussions for the NUM
> token but nothing I'd worry about. Maybe you want to use Unicode for '-'
> and '_' as well...I wouldn't worry about it myself.
>
> - Mark
>
>
> TOKEN : {                      // token patterns
>
>   // basic word: a sequence of digits & letters
>   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
>
> | <QUOTED:     "\"" (~["\""])+ "\"">
>
>   // internal apostrophes: O'Reilly, you're, O'Reilly's
>   // use a post-filter to remove possessives
> | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
>
>   // acronyms: U.S.A., I.B.M., etc.
>   // use a post-filter to remove dots
> | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
>
>   // company names like AT&T and [EMAIL PROTECTED]
> | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
>
>   // email addresses
> | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
>          (("."|"-") <ALPHANUM>)+ >
>
>   // hostname
> | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
>
>   // floating point, serial, model numbers, ip addresses, etc.
>   // every other segment must have at least one digit
> | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>        | <HAS_DIGIT> <P> <ALPHANUM>
>        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>         )
>   >
> | <#P: ("_"|"-"|"/"|"."|",") >
> | <#HAS_DIGIT:                      // at least one digit
>     (<LETTER>|<DIGIT>)*
>     <DIGIT>
>     (<LETTER>|<DIGIT>)*
>   >
>
> | < #ALPHA: (<LETTER>)+>
> | < #LETTER:                      // unicode letters
>       [
>        "\u0041"-"\u005a",
>        "\u0061"-"\u007a",
>        "\u00c0"-"\u00d6",
>        "\u00d8"-"\u00f6",
>        "\u00f8"-"\u00ff",
>        "\u0100"-"\u1fff",
>        "-", "_"
>       ]
>   >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>

--
View this message in context:
http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
Sent from the Lucene - Java Users forum at Nabble.com.


