Re: Lucene in the Humanities

Erik Hatcher Tue, 22 Feb 2005 19:37:05 -0800


On Feb 22, 2005, at 8:50 PM, Chris Hostetter wrote:

: >>> Just curious: it would seem easier to use multiple fields for the
: >>> original case and lowercase searching. Is there any particular reason
: >>> you analyzed the documents to multiple indexes instead of multiple
: >>> fields?
: >>
: >> I considered that approach, however to expose QueryParser I'd have to
: >> get tricky. If I have title_orig and title_lc fields, how would I
: >> allow freeform queries of title:something?

Why have separate fields?
Why not index the title into the "title" field twice, once with each term lowercased and once with the case left alone? (Using an analyzer that tokenizes "The Quick BrOwN fox" as "[the] [quick] [brown] [fox] [The] [Quick] [BrOwN] [fox]")

Then at search time, depending on the value of the checkbox, construct your QueryParser using the appropriate Analyzer.

I assume you mean to stack the tokens in the same positions, so it'd be like this:

        [the]   [quick] [brown] [fox]
        [The]   [Quick] [BrOwN] [fox]

Otherwise, if you simply string the tokens together as you show, then the phrase "fox The Quick" matches, even though it is not in the original document. Putting a large position gap between the two runs would also do the trick in your example.
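
To make the position argument concrete, here is a rough sketch in plain Python. It is only a simulation of token positions and a naive exact phrase match, not Lucene's or PyLucene's API; the names analyze and phrase_matches are made up for illustration. In Lucene itself, stacking is normally done by giving the second token a position increment of zero.

```python
def analyze(text, stacked=True, gap=0):
    """Return (term, position) pairs: lowercased tokens plus original-case tokens."""
    words = text.split()
    lower = [(w.lower(), i) for i, w in enumerate(words)]
    if stacked:
        # Original-case tokens share the positions of their lowercased twins.
        orig = [(w, i) for i, w in enumerate(words)]
    else:
        # Original-case tokens follow the lowercased run, optionally after a gap.
        offset = len(words) + gap
        orig = [(w, offset + i) for i, w in enumerate(words)]
    return lower + orig


def phrase_matches(tokens, phrase):
    """True if the phrase terms occur at consecutive positions."""
    postings = {}
    for term, pos in tokens:
        postings.setdefault(term, set()).add(pos)
    terms = phrase.split()
    starts = postings.get(terms[0], set())
    return any(
        all(start + i in postings.get(t, set()) for i, t in enumerate(terms))
        for start in starts
    )


title = "The Quick BrOwN fox"
concatenated = analyze(title, stacked=False)
stacked = analyze(title, stacked=True)
gapped = analyze(title, stacked=False, gap=100)

print(phrase_matches(concatenated, "fox The Quick"))  # the false match
print(phrase_matches(stacked, "fox The Quick"))
print(phrase_matches(gapped, "fox The Quick"))
print(phrase_matches(stacked, "The Quick"), phrase_matches(stacked, "the quick"))
```

With the tokens simply concatenated, the lowercased [fox] sits right before the original-case [The], which is where the false match comes from. Stacking or a large gap removes that adjacency.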

There is a fiddly issue with this technique that I'm not quite seeing at the moment, but I'll brainstorm on it and hopefully remember it or perhaps be proven wrong.

I'm Lucene-brain-dead.... I just did a presentation to our local Unix Users Group. I built a man page indexer/searcher with PyLucene (thank you Andi!). I had to learn Python as well, which was a good exercise, and learned lots from Andi's helpful private e-mails coaching me through my learning curve. Now that I've seen the beast known as Python, I'm yearning for a Ruby version based on GCJ/SWIG. A local Ruby guru and I are planning on meeting for a few hours each week and taking a stab at it. I'll commit whatever we do directly to a /ruby directory in Subversion.

Here's an example of my PyLucene output:

$ mansearch.py interface section:5
remote - remote host description file
rtadvd.conf - config file for router advertisement daemon
ipnat - IP NAT file format
groff_out - groff intermediate output format
xinetd.conf - Extended Internet Services Daemon configuration file
plist - property list format
racoon.conf - configuration file for racoon
ssh_config - OpenSSH SSH client configuration files
sudoers - list of which users may execute what

Even with custom formatting:

$ mansearch.py --format=#filename interface section:5
/usr/share/man/man5/remote.5
/usr/share/man/man5/rtadvd.conf.5
/usr/share/man/man5/ipnat.5
/usr/share/man/man5/groff_out.5
/usr/share/man/man5/xinetd.conf.5
/usr/share/man/man5/plist.5
/usr/share/man/man5/racoon.conf.5
/usr/share/man/man5/ssh_config.5
/usr/share/man/man5/sudoers.5

suitable for xargs :)

        Erik

The only problem i can think of would be inflated scores for terms that are naturally lowercased, because they would wind up getting added to the index twice, but based on what i've seen of the data you are working with, i imagine that if you used UPPERCASE instead of lowercase you could drastically reduce the likelihood of any problems with that.
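
A small sketch of the double-counting concern, in plain Python rather than Lucene. It only tallies raw term counts (Lucene's scoring involves more than that), and the helper name term_counts is invented for illustration.

```python
from collections import Counter


def term_counts(text, fold):
    """Count terms when each word is indexed as-is and again in case-folded form."""
    words = text.split()
    return Counter(words + [fold(w) for w in words])


title = "The Quick BrOwN fox"
lowercased = term_counts(title, str.lower)
uppercased = term_counts(title, str.upper)

# "fox" is already lowercase, so folding to lowercase adds it a second time.
print(lowercased["fox"], lowercased["quick"])

# Folding to uppercase keeps "fox" and "FOX" as separate terms.
print(uppercased["fox"], uppercased["FOX"])
```

The same doubling would hit any word that is already all uppercase when folding to uppercase, so the trick only helps if such words are rare in the data.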
-Hoss
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

