Well Philip...bad news. I should have thought of this before...I think the query parser is the problem. You are tokenizing "all in the quotes" to one token...but when QueryParser sees that, it doesn't matter what analyzer you use, it's going to see the quotes and strip them right off. Then it passes what's inside the quotes to be analyzed. Now what's inside the quotes will be missing stop words, have no quotes, etc. Sorry I hadn't thought of this. You're getting deeper...modify the QueryParser? It might only take passing in the entire quoted token in QueryParser.jj. Instead of
q = getFieldQuery(field, term.image.substring(1, term.image.length()-1), s);
just try term.image.

Then queryparser's analyzer would get the whole quoted hunk of goodness. Think it out a bit though...I haven't yet.

- Mark
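
To make the difference concrete, here is a minimal standalone sketch (plain Java, no Lucene on the classpath). Only the substring expression is taken from the message above; the class name, stripQuotes, and analyze are hypothetical stand-ins, and analyze merely imitates an analyzer whose tokenizer keeps a double-quoted run as a single token. It is meant to show why the analyzer never sees the quotes, not how QueryParser is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration only; none of these names come from Lucene.
class QuoteStrippingDemo {

    // Same expression as in the message above: drop the first and last
    // character of the token image, i.e. the surrounding double quotes.
    static String stripQuotes(String image) {
        return image.substring(1, image.length() - 1);
    }

    // Hypothetical stand-in for an analyzer with a QUOTED token rule:
    // text that is exactly one double-quoted run stays a single token,
    // anything else is split on whitespace.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        boolean singleQuotedRun = text.length() >= 3
                && text.startsWith("\"")
                && text.indexOf('"', 1) == text.length() - 1;
        if (singleQuotedRun) {
            tokens.add(text);
            return tokens;
        }
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String image = "\"hello dolly\"";
        // Quotes stripped before analysis: the phrase falls apart into words.
        System.out.println(analyze(stripQuotes(image)));
        // Whole image passed through: the quoted phrase survives as one token.
        System.out.println(analyze(image));
    }
}
```

If the index holds the quoted phrase as one token, only the second form can produce a matching term, which would explain why quoted searches come back empty.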

No, I've never used Luke.  Is there an easy way to examine my RAMDirectory
index?  I can create the index with no quoted keywords, and when I search
for a keyword, I get back the expected results (just can't search for a
phrase that has whitespace in it).  If I create the index with phrases in
quotes, then when I search for anything in double quotes, I get back
nothing.  If I create the index with everything in quotes, then when I
search for anything by the keyword field, I get nothing, regardless of
whether I use quotes in the query string or not.  (I can get results back by
searching on other fields.)  What do you think?

Philip


Erick Erickson wrote:
OK, I've gotta ask. Have you examined your index with Luke to see if what
you *think* is in the index actually *is*???

Erick

On 9/1/06, Philip Brown <[EMAIL PROTECTED]> wrote:
Interesting...just ran a test where I put double quotes around everything
(including single keywords) of source text and then ran searches for a known
keyword with and without double quotes -- doesn't find either time.


Mark Miller-5 wrote:
Sorry to hear you're having trouble. You indeed need the double quotes in
the source text. You will also need them in the query string. Make sure they
are in both places. My machine is hosed right now or I would do it for you
real quick. My guess is that I forgot to mention...not only do you need to
add the <QUOTED> definition to the TOKEN section, but below that you will
find the grammar...you need to add <QUOTED> to the grammar. If you look at how
<NUM> and <APOSTROPHE> are done you will prob see what you should do. If
not, my machine should be back up tomorrow...

- Mark

On 9/1/06, Philip Brown <[EMAIL PROTECTED]> wrote:
Well, I tried that, and it doesn't seem to work still.  I would be happy to
zip up the new files, so you can see what I'm using -- maybe you can get it
to work.  The first time, I tried building the documents without quotes
surrounding each phrase.  Then, I retried by enclosing every phrase within
double quotes.  Neither seemed to work.  When constructing the query string
for the search, I always added the double quotes (otherwise, it'd think it
was multiple terms).  (I didn't even test the underscore and hyphenated
terms.)  I thought Lucene was (sort of by default) set up to search quoted
phrases.  From http://lucene.apache.org/java/docs/api/index.html --> A
Phrase is a group of words surrounded by double quotes such as "hello
dolly".  So, this should be easy, right?  I must be missing something
stupid.

Thanks,

Philip


Mark Miller-5 wrote:
So this will recognize anything in quotes as a single token, and '_' and
'-' will not break up words. There may be some repercussions for the NUM
token but nothing I'd worry about. Maybe you want to use Unicode escapes for
'-' and '_' as well...I wouldn't worry about it myself.

- Mark


TOKEN : {                      // token patterns

  // basic word: a sequence of digits & letters
  <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >

| <QUOTED:     "\"" (~["\""])+ "\"">

  // internal apostrophes: O'Reilly, you're, O'Reilly's
  // use a post-filter to remove possessives
| <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >

  // acronyms: U.S.A., I.B.M., etc.
  // use a post-filter to remove dots
| <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >

  // company names like AT&T and [EMAIL PROTECTED]
| <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >

  // email addresses
| <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM> (("."|"-") <ALPHANUM>)+ >

  // hostname
| <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >

  // floating point, serial, model numbers, ip addresses, etc.
  // every other segment must have at least one digit
| <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
       | <HAS_DIGIT> <P> <ALPHANUM>
       | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
       | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
        )
  >
| <#P: ("_"|"-"|"/"|"."|",") >
| <#HAS_DIGIT:                      // at least one digit
    (<LETTER>|<DIGIT>)*
    <DIGIT>
    (<LETTER>|<DIGIT>)*
  >

| < #ALPHA: (<LETTER>)+>
| < #LETTER:                      // unicode letters
      [
       "\u0041"-"\u005a",
       "\u0061"-"\u007a",
       "\u00c0"-"\u00d6",
       "\u00d8"-"\u00f6",
       "\u00f8"-"\u00ff",
       "\u0100"-"\u1fff",
       "-", "_"
      ]
  >
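
For anyone who wants to see what the <QUOTED> rule does without rebuilding the tokenizer, the following is a rough approximation using java.util.regex, assuming a much simplified word rule (ASCII letters, digits, '-' and '_' only). The class and method names are made up for this sketch, and a regex alternation takes the first alternative that matches rather than JavaCC's longest match, so treat it as an illustration of the pattern and not as a replacement for the grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; approximates two of the token rules with a regex.
class QuotedTokenSketch {

    // First alternative mirrors <QUOTED>: a double quote, one or more
    // non-quote characters, a closing double quote.
    // Second alternative is a simplified word: ASCII letters, digits,
    // '-' and '_' (so hyphens and underscores do not split words).
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"]+\"|[A-Za-z0-9_-]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints each token on its own line; the quoted phrase stays whole,
        // quotes included.
        for (String token : tokenize("find \"hello dolly\" in user_name or e-mail")) {
            System.out.println(token);
        }
    }
}
```

Note that the quotes are part of the emitted token, so the indexed term and the query term only match when both sides keep them.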


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
View this message in context:

http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
Sent from the Lucene - Java Users forum at Nabble.com.




