Re: Lucene in the Humanities
: Just curious: it would seem easier to use multiple fields for the
: original case and lowercase searching. Is there any particular reason
: you analyzed the documents to multiple indexes instead of multiple
: fields?
:
: I considered that approach, however to expose QueryParser I'd have to
: get tricky. If I have title_orig and title_lc fields, how would I
: allow freeform queries of title:something?

Why have separate fields? Why not index the title into the title field twice, once with each term lowercased and once with the case left alone? (Using an analyzer that tokenizes "The Quick BrOwN fox" as [the] [quick] [brown] [fox] [The] [Quick] [BrOwN] [fox].) Then at search time, depending on the value of the checkbox, construct your QueryParser using the appropriate Analyzer.

The only problem I can think of would be inflated scores for terms that are naturally lowercased, because they would wind up getting added to the index twice. But based on what I've seen of the data you are working with, I imagine that if you used UPPERCASE instead of lowercase you could drastically reduce the likelihood of any problems with that.

-Hoss

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
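Editorial sketch of the tokenization Hoss describes, written in plain Java with no Lucene dependency so it runs on its own. The class and method names (DualCaseTokens, indexTokens, queryToken) are hypothetical, and the whitespace splitting is an assumption; a real version would live inside a Lucene Analyzer/TokenFilter pair rather than a list-building helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: emits each term twice,
// once lowercased and once with its original case.
final class DualCaseTokens {

    // Index side: all lowercased terms first, then all original terms,
    // matching the token order shown in the message above.
    static List<String> indexTokens(String text) {
        List<String> tokens = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return tokens;
        }
        String[] words = trimmed.split("\\s+"); // assumption: whitespace tokenization
        for (String word : words) {
            tokens.add(word.toLowerCase(Locale.ROOT));
        }
        for (String word : words) {
            tokens.add(word);
        }
        return tokens;
    }

    // Query side: the checkbox decides whether the query term is lowercased.
    static String queryToken(String term, boolean caseSensitive) {
        return caseSensitive ? term : term.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prints [the, quick, brown, fox, The, Quick, BrOwN, fox]
        System.out.println(indexTokens("The Quick BrOwN fox"));
    }
}
```

Note that in this sketch a case-sensitive query for an all-lowercase term such as fox also matches the lowercased copies of Fox or FOX, which is the same overlap that inflates scores; folding to UPPERCASE instead moves that overlap to all-uppercase terms.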
Re: Lucene in the Humanities
Erik,

On Saturday 19 February 2005 01:33, Erik Hatcher wrote:
> On Feb 18, 2005, at 6:37 PM, Paul Elschot wrote:
> > On Friday 18 February 2005 21:55, Erik Hatcher wrote:
> > > On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
> > > > Erik,
> > > >
> > > > Just curious: it would seem easier to use multiple fields for the original case and lowercase searching. Is there any particular reason you analyzed the documents to multiple indexes instead of multiple fields?
> > >
> > > I considered that approach, however to expose QueryParser I'd have to get tricky. If I have title_orig and title_lc fields, how would I allow freeform queries of title:something?
> >
> > By lowercasing the querytext and searching in title_lc?
>
> Well sure, but how about this query:
>
>     title:Something AND anotherField:someOtherValue
>
> QueryParser, as-is, won't be able to do field-name swapping. I could certainly apply that technique on all the structured queries that I build up with the API, but with QueryParser it is trickier.
>
> I'm definitely open for suggestions on improving how case is handled.

Overriding this (1.4.3 QueryParser.jj, line 286) might work:

    protected Query getFieldQuery(String field, String queryText)
        throws ParseException { ... }

It will be called by the parser for both parts of the query above, so one could change the field depending on the requested type of search and the field name in the query.

> The only drawback now is that I'm duplicating indexes, but that is only an issue in how long it takes to rebuild the index from scratch (currently about 20 minutes or so on a good day - when the machine isn't swamped).

Once the users get the hang of this, you might end up having to quadruple the index, or more.

Regards,
Paul Elschot
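Editorial sketch of the field swap that such an override could delegate to, kept free of Lucene so it compiles standalone. CaseFieldResolver and its methods are hypothetical names; the _orig and _lc suffixes come from the thread, while the idea of a configured set of dual-case fields is an assumption.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: decides which indexed field (and which query text)
// a parsed clause should use, given the case-sensitivity switch.
final class CaseFieldResolver {
    private final Set<String> dualCaseFields; // fields indexed as <name>_orig and <name>_lc
    private final boolean caseSensitive;      // state of the search-form switch

    CaseFieldResolver(Set<String> dualCaseFields, boolean caseSensitive) {
        this.dualCaseFields = dualCaseFields;
        this.caseSensitive = caseSensitive;
    }

    // Fields without a dual-case pair pass through unchanged.
    String resolveField(String field) {
        if (!dualCaseFields.contains(field)) {
            return field;
        }
        return field + (caseSensitive ? "_orig" : "_lc");
    }

    // Only text aimed at a *_lc field is lowercased here.
    String resolveText(String field, String queryText) {
        if (dualCaseFields.contains(field) && !caseSensitive) {
            return queryText.toLowerCase(Locale.ROOT);
        }
        return queryText;
    }

    public static void main(String[] args) {
        CaseFieldResolver resolver = new CaseFieldResolver(Set.of("title"), true);
        // Both clauses of "title:Something AND anotherField:someOtherValue"
        // would pass through the same hook.
        System.out.println(resolver.resolveField("title"));        // prints title_orig
        System.out.println(resolver.resolveField("anotherField")); // prints anotherField
    }
}
```

Inside a QueryParser subclass, the overriding getFieldQuery(field, queryText) would then presumably call super.getFieldQuery with the resolved field and text.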
Re: Lucene in the Humanities
On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote:
> > > By lowercasing the querytext and searching in title_lc?
> >
> > Well sure, but how about this query:
> >
> >     title:Something AND anotherField:someOtherValue
> >
> > QueryParser, as-is, won't be able to do field-name swapping. I could certainly apply that technique on all the structured queries that I build up with the API, but with QueryParser it is trickier.
> >
> > I'm definitely open for suggestions on improving how case is handled.
>
> Overriding this (1.4.3 QueryParser.jj, line 286) might work:
>
>     protected Query getFieldQuery(String field, String queryText)
>         throws ParseException { ... }
>
> It will be called by the parser for both parts of the query above, so one could change the field depending on the requested type of search and the field name in the query.

But that wouldn't work for any other type of query:

    title:somethingFuzzy~

Though now that I think more about it, a simple s/title:/title_orig:/ before parsing would work, and of course make the default field dynamic. I need to evaluate how many fields would need to be done this way - it'd be several.

Thanks for the food for thought!

> > The only drawback now is that I'm duplicating indexes, but that is only an issue in how long it takes to rebuild the index from scratch (currently about 20 minutes or so on a good day - when the machine isn't swamped).
>
> Once the users get the hang of this, you might end up having to quadruple the index, or more.

Why would that be? They want a case sensitive/insensitive switch. How would it expand beyond that?

    Erik
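Editorial sketch of the s/title:/title_orig:/ idea, generalized to several fields, in plain Java. FieldPrefixRewriter is a hypothetical name, and the regular expression is a deliberately naive assumption: it treats any word followed by a colon as a field prefix, so it would also rewrite text inside a quoted phrase such as "title: a poem".

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-parse step: rename selected field prefixes in the raw
// query string before it is handed to the query parser.
final class FieldPrefixRewriter {
    // A run of word characters followed by ':' that does not start mid-word.
    private static final Pattern FIELD_PREFIX = Pattern.compile("(?<!\\w)(\\w+):");

    static String rewrite(String query, Set<String> fields, String suffix) {
        Matcher m = FIELD_PREFIX.matcher(query);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            // Only configured fields get the suffix; everything else is kept as matched.
            String replacement = fields.contains(name) ? name + suffix + ":" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints title_orig:somethingFuzzy~
        System.out.println(rewrite("title:somethingFuzzy~", Set.of("title"), "_orig"));
    }
}
```

Because the rewrite happens on the string, it covers fuzzy, range, and prefix clauses alike, which is the advantage over overriding a single parser method; the default field still has to be chosen separately when the parser is constructed.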
Re: Lucene in the Humanities
On Saturday 19 February 2005 11:02, Erik Hatcher wrote:
> On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote:
> > > > By lowercasing the querytext and searching in title_lc?
> > >
> > > Well sure, but how about this query:
> > >
> > >     title:Something AND anotherField:someOtherValue
> > >
> > > QueryParser, as-is, won't be able to do field-name swapping. I could certainly apply that technique on all the structured queries that I build up with the API, but with QueryParser it is trickier.
> > >
> > > I'm definitely open for suggestions on improving how case is handled.
> >
> > Overriding this (1.4.3 QueryParser.jj, line 286) might work:
> >
> >     protected Query getFieldQuery(String field, String queryText)
> >         throws ParseException { ... }
> >
> > It will be called by the parser for both parts of the query above, so one could change the field depending on the requested type of search and the field name in the query.
>
> But that wouldn't work for any other type of query:
>
>     title:somethingFuzzy~

To get that it would be necessary to override all query parser methods that take a field argument.

> Though now that I think more about it, a simple s/title:/title_orig:/ before parsing would work, and of course make the default field dynamic.

In the overriding getFieldQuery method, something like:

    if (caseSensitiveSearch(field) && originalFieldIndexed(field)) {
        field = field + "_orig";
    } else {
        // the other 3 cases ...
    }
    return super.getFieldQuery(field, queryText);

The if statement could be factored out for the other overriding methods.

> I need to evaluate how many fields would need to be done this way - it'd be several.
>
> Thanks for the food for thought!
>
> > > The only drawback now is that I'm duplicating indexes, but that is only an issue in how long it takes to rebuild the index from scratch (currently about 20 minutes or so on a good day - when the machine isn't swamped).
> >
> > Once the users get the hang of this, you might end up having to quadruple the index, or more.
>
> Why would that be? They want a case sensitive/insensitive switch. How would it expand beyond that?

With an index for every combination of fields and case sensitivity for these fields.

Regards,
Paul Elschot
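Editorial sketch of the arithmetic behind "quadruple the index, or more", under one reading of Paul's remark: each whole index is analyzed a single way, and users eventually want to toggle case sensitivity per field, so n such fields need 2^n index variants. IndexVariants and the author field are hypothetical, introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical enumeration of index variants when every field can be
// independently case-sensitive (orig) or case-insensitive (lc).
final class IndexVariants {

    static List<String> variants(List<String> fields) {
        List<String> out = new ArrayList<>();
        int n = fields.size();
        // Each bit of the mask picks the case handling for one field.
        for (int mask = 0; mask < (1 << n); mask++) {
            StringBuilder label = new StringBuilder();
            for (int i = 0; i < n; i++) {
                if (i > 0) {
                    label.append(',');
                }
                boolean sensitive = (mask & (1 << i)) != 0;
                label.append(fields.get(i)).append(sensitive ? "=orig" : "=lc");
            }
            out.add(label.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Two fields already give four variants, i.e. a quadrupled index.
        for (String variant : variants(Arrays.asList("title", "author"))) {
            System.out.println(variant);
        }
    }
}
```

By contrast, keeping an _orig and an _lc field for each of the n fields inside one index grows linearly (2n fields) rather than exponentially.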
Re: Lucene in the Humanities
And before too many replies happen on this thread, I've corrected the spelling mistake in the subject! :O
Re: Lucene in the Humanities
On Feb 18, 2005, at 3:25 PM, Luke Shannon wrote:
> Nice work Erik. I would like to spend more time playing with it, but I saw a few things I really liked. When a specific query turns up no results you prompt the client to perform a free form search. Less savvy search users will benefit from this strategy.

That's merely an artifact of all searches going to the results page, which just shows the free-form search on it.

> I also like the display of information when you select a result. Everything is at your fingertips without clutter.

For comparison, the older site's search is here:

    http://jefferson.village.virginia.edu:2020/search.html

(don't bother trying it - it's SLOOOW)

And also for comparison, here is an older look:

    http://jefferson.village.virginia.edu:8090/styler/servlet/SaxonServlet?source=http://jefferson.village.virginia.edu:2020/tamino/files/1-1847.s244.raw.xml&style=http://jefferson.village.virginia.edu:2020/tamino/rossetti.xsl&clear-stylesheet-cache=yes

Dig that URL! The new look and URL are here:

    http://www.rossettiarchive.org/docs/1-1847.s244.raw.html

> I did get this error when a name search failed to turn up results and I clicked 'help' in the free form search row (the second row).
>
> Page 'help-freeform.html' not found in application namespace.

I've fixed this, and the fix will be in my next deployment :) So nice to have a community of testers! Thanks.

    Erik
Re: Lucene in the Humanities
Erik,

Just curious: it would seem easier to use multiple fields for the original case and lowercase searching. Is there any particular reason you analyzed the documents to multiple indexes instead of multiple fields?

Regards,
Paul Elschot
Re: Lucene in the Humanities
On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
> Erik,
>
> Just curious: it would seem easier to use multiple fields for the original case and lowercase searching. Is there any particular reason you analyzed the documents to multiple indexes instead of multiple fields?

I considered that approach, however to expose QueryParser I'd have to get tricky. If I have title_orig and title_lc fields, how would I allow freeform queries of title:something?

    Erik

p.s. It's fun to see the types of queries folks have already tried since I sent this e-mail (repeated queries are possibly someone paging):

INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = rosetti : hits = 3
INFO: Query = +year:[ TO 1911] +(archivetype:radheader OR archivetype:rap) : hits = 2182
INFO: Query = advil : hits = 0
INFO: Query = test : hits = 24
INFO: Query = td : hits = 1
INFO: Query = td : hits = 1
INFO: Query = woman : hits = 363
INFO: Query = woman : hits = 363
INFO: Query = hello : hits = 0
INFO: Query = +rosetta +archivetype:rap : hits = 0
INFO: Query = +year:[ TO 1911] +(archivetype:radheader OR archivetype:rap) : hits = 2182
INFO: Query = poem : hits = 316
INFO: Query = crisis : hits = 7
INFO: Query = crisis at every moment : hits = 1
INFO: Query = toy : hits = 41
INFO: Query = title:echer : hits = 0
INFO: Query = senori : hits = 0
INFO: Query = +dear +sirs : hits = 11
INFO: Query = title:more : hits = 0
INFO: Query = more : hits = 365
INFO: Query = title:rossetti : hits = 329
INFO: Query = +blessed +damozel : hits = 103
INFO: Query = title:test : hits = 0
INFO: Query = +test +archivetype:radheader : hits = 3
INFO: Query = crisis at every moment : hits = 1
INFO: Query = rome : hits = 70
INFO: Query = fdshjkfjkhkfad : hits = 0
INFO: Query = stone : hits = 153
INFO: Query = +title:shakespeare +archivetype:radheader : hits = 1
INFO: Query = title:xx i ll : hits = 0
INFO: Query = +dog +cat : hits = 6
INFO: Query = +year:[1280 TO 1305] +archivetype:radheader : hits = 0
INFO: Query = guru : hits = 0
INFO: Query = philosophy : hits = 14
INFO: Query = title:install : hits = 0
INFO: Query = +title:install +archivetype:radheader : hits = 0
INFO: Query = help freeform.html : hits = 0
INFO: Query = help freeform.html : hits = 0
INFO: Query = install : hits = 1
INFO: Query = life : hits = 554
INFO: Query = life : hits = 554
INFO: Query = life : hits = 554
Re: Lucene in the Humanities
On Friday 18 February 2005 21:55, Erik Hatcher wrote:
> On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
> > Erik,
> >
> > Just curious: it would seem easier to use multiple fields for the original case and lowercase searching. Is there any particular reason you analyzed the documents to multiple indexes instead of multiple fields?
>
> I considered that approach, however to expose QueryParser I'd have to get tricky. If I have title_orig and title_lc fields, how would I allow freeform queries of title:something?

By lowercasing the querytext and searching in title_lc?

Regards,
Paul Elschot.
Re: Lucene in the Humanities
On Feb 18, 2005, at 6:37 PM, Paul Elschot wrote:
> On Friday 18 February 2005 21:55, Erik Hatcher wrote:
> > On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
> > > Erik,
> > >
> > > Just curious: it would seem easier to use multiple fields for the original case and lowercase searching. Is there any particular reason you analyzed the documents to multiple indexes instead of multiple fields?
> >
> > I considered that approach, however to expose QueryParser I'd have to get tricky. If I have title_orig and title_lc fields, how would I allow freeform queries of title:something?
>
> By lowercasing the querytext and searching in title_lc?

Well sure, but how about this query:

    title:Something AND anotherField:someOtherValue

QueryParser, as-is, won't be able to do field-name swapping. I could certainly apply that technique on all the structured queries that I build up with the API, but with QueryParser it is trickier.

I'm definitely open for suggestions on improving how case is handled. The only drawback now is that I'm duplicating indexes, but that is only an issue in how long it takes to rebuild the index from scratch (currently about 20 minutes or so on a good day - when the machine isn't swamped).

    Erik