No, I've never used Luke. Is there an easy way to examine my RAMDirectory
index? I can create the index with no quoted keywords, and when I search
for a keyword, I get back the expected results (just can't search for a
phrase that has whitespace in it). If I create the index with phrases in
quotes, then when I search for anything in double quotes, I get back
nothing. If I create the index with everything in quotes, then when I
search for anything by the keyword field, I get nothing, regardless of
whether I use quotes in the query string or not. (I can get results back
by searching on other fields.) What do you think?
Philip
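
A note on the symptoms above: the sketch below is plain Java, not Lucene or
JavaCC code. It only mimics what the <QUOTED> rule quoted further down
appears to match, as a way to reason about what might end up in the index.
The class name QuotedTokenSketch, the tokenize helper, and the ASCII-only
word pattern are assumptions made for illustration; the real grammar covers
far more than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; this is NOT Lucene or JavaCC code.
public class QuotedTokenSketch {

    // First alternative mimics <QUOTED>: a double quote, one or more
    // non-quote characters, then a closing double quote.
    // Second alternative is a simplified stand-in for <ALPHANUM> with
    // '-' and '_' treated as letters (ASCII only, an assumption).
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"]+\"|[A-Za-z0-9_-]+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // The whole match is kept, so a quoted token
            // still carries its quote characters.
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Index side: the source text contains a quoted phrase.
        System.out.println(tokenize("red \"hello dolly\" well-known"));
        // Query side, assuming the quotes were stripped before analysis.
        System.out.println(tokenize("hello dolly"));
    }
}
```

In this sketch the first call yields three tokens, the middle one being
"hello dolly" with its quote characters still attached, while the second
call yields the two separate tokens hello and dolly. If the modified
tokenizer behaves similarly, and if the query side removes the quotes
before the text is analyzed (an assumption worth checking), then the query
terms would never equal the indexed term. That would be consistent with
both the quoted and unquoted searches coming back empty.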
Erick Erickson wrote:
>
> OK, I've gotta ask. Have you examined your index with Luke to see if what
> you *think* is in the index actually *is*???
>
> Erick
>
> On 9/1/06, Philip Brown <[EMAIL PROTECTED]> wrote:
>>
>>
>> Interesting...just ran a test where I put double quotes around everything
>> (including single keywords) in the source text and then ran searches for a
>> known keyword with and without double quotes -- doesn't find it either time.
>>
>>
>> Mark Miller-5 wrote:
>> >
>> > Sorry to hear you're having trouble. You do indeed need the double quotes
>> > in the source text. You will also need them in the query string. Make sure
>> > they are in both places. My machine is hosed right now or I would do it for
>> > you real quick. My guess is that I forgot to mention...not only do you need
>> > to add the <QUOTED> definition to the TOKEN section, but below that you
>> > will find the grammar...you need to add <QUOTED> to the grammar. If you
>> > look at how <NUM> and <APOSTROPHE> are done you will probably see what you
>> > should do. If not, my machine should be back up tomorrow...
>> >
>> > - Mark
>> >
>> > On 9/1/06, Philip Brown <[EMAIL PROTECTED]> wrote:
>> >>
>> >>
>> >> Well, I tried that, and it doesn't seem to work still. I would be happy
>> >> to zip up the new files, so you can see what I'm using -- maybe you can
>> >> get it to work. The first time, I tried building the documents without
>> >> quotes surrounding each phrase. Then, I retried by enclosing every phrase
>> >> within double quotes. Neither seemed to work. When constructing the query
>> >> string for the search, I always added the double quotes (otherwise, it'd
>> >> think it was multiple terms). (I didn't even test the underscore and
>> >> hyphenated terms.) I thought Lucene was (sort of by default) set up to
>> >> search quoted phrases. From
>> >> http://lucene.apache.org/java/docs/api/index.html --> A Phrase is a group
>> >> of words surrounded by double quotes such as "hello dolly". So, this
>> >> should be easy, right? I must be missing something stupid.
>> >>
>> >> Thanks,
>> >>
>> >> Philip
>> >>
>> >>
>> >> Mark Miller-5 wrote:
>> >> >
>> >> > So this will recognize anything in quotes as a single token and '_' and
>> >> > '-' will not break up words. There may be some repercussions for the NUM
>> >> > token but nothing I'd worry about. Maybe you want to use Unicode for '-'
>> >> > and '_' as well...I wouldn't worry about it myself.
>> >> >
>> >> > - Mark
>> >> >
>> >> >
>> >> > TOKEN : { // token patterns
>> >> >
>> >> > // basic word: a sequence of digits & letters
>> >> > <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
>> >> >
>> >> > | <QUOTED: "\"" (~["\""])+ "\"">
>> >> >
>> >> > // internal apostrophes: O'Reilly, you're, O'Reilly's
>> >> > // use a post-filter to remove possessives
>> >> > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
>> >> >
>> >> > // acronyms: U.S.A., I.B.M., etc.
>> >> > // use a post-filter to remove dots
>> >> > | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
>> >> >
>> >> > // company names like AT&T and [EMAIL PROTECTED]
>> >> > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
>> >> >
>> >> > // email addresses
>> >> > | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
>> >> > (("."|"-") <ALPHANUM>)+ >
>> >> >
>> >> > // hostname
>> >> > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
>> >> >
>> >> > // floating point, serial, model numbers, ip addresses, etc.
>> >> > // every other segment must have at least one digit
>> >> > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>> >> > | <HAS_DIGIT> <P> <ALPHANUM>
>> >> > | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>> >> > | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>> >> > | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>> >> > | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>> >> > )
>> >> > >
>> >> > | <#P: ("_"|"-"|"/"|"."|",") >
>> >> > | <#HAS_DIGIT: // at least one digit
>> >> > (<LETTER>|<DIGIT>)*
>> >> > <DIGIT>
>> >> > (<LETTER>|<DIGIT>)*
>> >> > >
>> >> >
>> >> > | < #ALPHA: (<LETTER>)+>
>> >> > | < #LETTER: // unicode letters
>> >> > [
>> >> > "\u0041"-"\u005a",
>> >> > "\u0061"-"\u007a",
>> >> > "\u00c0"-"\u00d6",
>> >> > "\u00d8"-"\u00f6",
>> >> > "\u00f8"-"\u00ff",
>> >> > "\u0100"-"\u1fff",
>> >> > "-", "_"
>> >> > ]
>> >> > >
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >> > For additional commands, e-mail: [EMAIL PROTECTED]
>> >> >
>> >> >
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
>> >> Sent from the Lucene - Java Users forum at Nabble.com.