On Feb 22, 2005, at 8:50 PM, Chris Hostetter wrote:


: >>> Just curious: it would seem easier to use multiple fields for the
: >>> original case and lowercase searching. Is there any particular reason
: >>> you analyzed the documents to multiple indexes instead of multiple
: >>> fields?
: >>
: >> I considered that approach, however to expose QueryParser I'd have to
: >> get tricky. If I have title_orig and title_lc fields, how would I
: >> allow freeform queries of title:something?


Why have separate fields?

Why not index the title into the "title" field twice, once with each term
lowercased and once with the case left alone? (Using an analyzer that
tokenizes "The Quick BrOwN fox" as "[the] [quick] [brown] [fox] [The]
[Quick] [BrOwN] [fox]")


Then at search time, depending on the value of the checkbox, construct
your QueryParser using the appropriate Analyzer.

I assume you mean to stack the tokens in the same positions, so it'd be like this:


        [the]   [quick] [brown] [fox]
        [The]   [Quick] [BrOwN] [fox]

Otherwise, if you simply string the tokens together as you show, then a phrase query for "fox The Quick" matches, even though that phrase is not in the original document. Putting a large position gap between the two runs would do the trick in your example, though.
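
To make the position issue concrete, here is a rough sketch in plain Python. It does not use the Lucene or PyLucene analyzer API; the toy positional index and all function names (stacked_tokens, appended_tokens, build_index, phrase_matches) are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def stacked_tokens(text):
    """Emit (term, position) pairs with both case forms at the SAME position."""
    for pos, word in enumerate(text.split()):
        yield word.lower(), pos
        yield word, pos


def appended_tokens(text, gap=0):
    """Emit the lowercased run first, then the original run after it.

    `gap` is the number of extra empty positions left between the two runs.
    """
    words = text.split()
    for pos, word in enumerate(words):
        yield word.lower(), pos
    offset = len(words) + gap
    for pos, word in enumerate(words):
        yield word, offset + pos


def build_index(tokens):
    """Map each term to the set of positions where it occurs."""
    index = defaultdict(set)
    for term, pos in tokens:
        index[term].add(pos)
    return index


def phrase_matches(index, phrase):
    """True if the phrase terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms:
        return False
    return any(
        all(start + i in index.get(term, ()) for i, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )


doc = "The Quick BrOwN fox"
phrase = "fox The Quick"
print(phrase_matches(build_index(appended_tokens(doc)), phrase))           # True (false hit)
print(phrase_matches(build_index(stacked_tokens(doc)), phrase))            # False
print(phrase_matches(build_index(appended_tokens(doc, gap=100)), phrase))  # False
```

With the tokens simply appended, the last lowercased term sits directly before the first original-case term, so the bogus phrase matches. Stacking the two forms at the same position, or leaving a gap between the runs, avoids that in this toy model.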

There is a fiddly issue with this technique that I'm not quite seeing at the moment, but I'll brainstorm on it and hopefully remember it or perhaps be proven wrong.

I'm Lucene-brain-dead.... I just did a presentation to our local Unix Users Group. I built a man page indexer/searcher with PyLucene (thank you Andi!). I had to learn Python as well, which was a good exercise, and learned lots from Andi's helpful private e-mails coaching me through my learning curve. Now that I've seen the beast known as Python, I'm yearning for a Ruby version based on GCJ/SWIG. A local Ruby guru and I are planning on meeting for a few hours each week and taking a stab at it. I'll commit whatever we do directly to a /ruby directory in Subversion.

Here's an example of my PyLucene output:

$ mansearch.py interface section:5
remote - remote host description file
rtadvd.conf - config file for router advertisement daemon
ipnat - IP NAT file format
groff_out - groff intermediate output format
xinetd.conf - Extended Internet Services Daemon configuration file
plist - property list format
racoon.conf - configuration file for racoon
ssh_config - OpenSSH SSH client configuration files
sudoers - list of which users may execute what

Even with custom formatting:

$ mansearch.py --format=#filename interface section:5
/usr/share/man/man5/remote.5
/usr/share/man/man5/rtadvd.conf.5
/usr/share/man/man5/ipnat.5
/usr/share/man/man5/groff_out.5
/usr/share/man/man5/xinetd.conf.5
/usr/share/man/man5/plist.5
/usr/share/man/man5/racoon.conf.5
/usr/share/man/man5/ssh_config.5
/usr/share/man/man5/sudoers.5

suitable for xargs :)

        Erik



The only problem I can think of would be inflated scores for terms that
are naturally lowercased, because they would wind up getting added to the
index twice, but based on what I've seen of the data you are working
with, I imagine that if you used UPPERCASE instead of lowercase you
could drastically reduce the likelihood of any problems with that.
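
As a rough illustration of the double counting, the sketch below tallies raw term occurrences only. It is plain Python rather than Lucene, real Lucene scoring involves more than a raw count, and the helper name term_frequencies is hypothetical.

```python
from collections import Counter


def term_frequencies(text, fold):
    """Count each word once as written and once after case folding."""
    counts = Counter()
    for word in text.split():
        counts[word] += 1
        counts[fold(word)] += 1
    return counts


doc = "The Quick BrOwN fox"
lower_tf = term_frequencies(doc, str.lower)
upper_tf = term_frequencies(doc, str.upper)

print(lower_tf["fox"])                  # 2: "fox" is already lowercase, so it is counted twice
print(upper_tf["fox"], upper_tf["FOX"])  # 1 1: folding to uppercase keeps the two forms distinct
```

Folding to uppercase only double counts terms that are already all uppercase in the source text, which is the reason it should be the rarer case for mostly lowercase data.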




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

