Re: Lucene in the Humanities

2005-02-22 Thread Chris Hostetter

:  Just curious: it would seem easier to use multiple fields for the
:  original case and lowercase searching. Is there any particular reason
:  you analyzed the documents to multiple indexes instead of multiple
:  fields?
: 
:  I considered that approach, however to expose QueryParser I'd have to
:  get tricky.  If I have title_orig and title_lc fields, how would I
:  allow freeform queries of title:something?

Why have separate fields?

Why not index the title into the title field twice, once with each term
lowercased and once with the case left alone? (Using an analyzer that
tokenizes "The Quick BrOwN fox" as [the] [quick] [brown] [fox] [The]
[Quick] [BrOwN] [fox].)

Then at search time, depending on the value of the checkbox, construct
your QueryParser using the appropriate Analyzer.

The only problem I can think of would be inflated scores for terms that
are naturally lowercased, because they would wind up getting added to the
index twice, but based on what I've seen of the data you are working
with, I imagine that if you used UPPERCASE instead of lowercase you
could drastically reduce the likelihood of any problems with that.
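
A rough sketch of the idea above, in plain Java rather than as a real
Lucene Analyzer (the class and method names are made up for
illustration, and splitting on whitespace is a simplifying assumption).
It emits the case-folded form of every token followed by the tokens as
written, then counts terms to show where the double-counting comes from
and why folding to UPPERCASE would avoid it for naturally lowercased
words:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class DualCaseTokens {

    /** Emits the case-folded form of every token first, then every token as written. */
    public static List<String> tokenize(String text, boolean foldToUpper) {
        String[] words = text.trim().split("\\s+");
        List<String> tokens = new ArrayList<>();
        for (String w : words) {
            tokens.add(foldToUpper ? w.toUpperCase(Locale.ROOT) : w.toLowerCase(Locale.ROOT));
        }
        for (String w : words) {
            tokens.add(w);
        }
        return tokens;
    }

    /** Counts occurrences per term, a stand-in for term frequency in one field. */
    public static Map<String, Integer> termCounts(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String title = "The Quick BrOwN fox";
        // Folding to lowercase: "fox" is counted twice.
        System.out.println(termCounts(tokenize(title, false)));
        // Folding to uppercase: "FOX" and "fox" are distinct terms.
        System.out.println(termCounts(tokenize(title, true)));
    }
}
```

With lowercase folding the term "fox" ends up with a count of 2; with
uppercase folding every term in this example has a count of 1. A
case-insensitive search would then have to fold the query text the same
way.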



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene in the Humanities

2005-02-19 Thread Paul Elschot
Erik,

On Saturday 19 February 2005 01:33, Erik Hatcher wrote:
 
 On Feb 18, 2005, at 6:37 PM, Paul Elschot wrote:
 
  On Friday 18 February 2005 21:55, Erik Hatcher wrote:
 
  On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
 
  Erik,
 
  Just curious: it would seem easier to use multiple fields for the
  original case and lowercase searching. Is there any particular reason
  you analyzed the documents to multiple indexes instead of multiple
  fields?
 
  I considered that approach, however to expose QueryParser I'd have to
  get tricky.  If I have title_orig and title_lc fields, how would I
  allow freeform queries of title:something?
 
  By lowercasing the querytext and searching in title_lc ?
 
 Well sure, but how about this query:
 
   title:Something AND anotherField:someOtherValue
 
 QueryParser, as-is, won't be able to do field-name swapping.  I could 
 certainly apply that technique on all the structured queries that I 
 build up with the API, but with QueryParser it is trickier.   I'm 
 definitely open for suggestions on improving how case is handled.  The 

Overriding this (1.4.3 QueryParser.jj, line 286) might work:

protected Query getFieldQuery(String field, String queryText)
throws ParseException { ... }

It will be called by the parser for both parts of the query above, so one
could change the field depending on the requested type of search
and the field name in the query.

 only drawback now is that I'm duplicating indexes, but that is only an 
 issue in how long it takes to rebuild the index from scratch (currently 
 about 20 minutes or so on a good day - when the machine isn't swamped).

Once the users get the hang of this, you might end up having to quadruple
the index, or more.

Regards,
Paul Elschot





Re: Lucene in the Humanities

2005-02-19 Thread Erik Hatcher
On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote:
>>> By lowercasing the querytext and searching in title_lc ?
>>
>> Well sure, but how about this query:
>>
>>   title:Something AND anotherField:someOtherValue
>>
>> QueryParser, as-is, won't be able to do field-name swapping.  I could
>> certainly apply that technique on all the structured queries that I
>> build up with the API, but with QueryParser it is trickier.   I'm
>> definitely open for suggestions on improving how case is handled.  The
>
> Overriding this (1.4.3 QueryParser.jj, line 286) might work:
>
> protected Query getFieldQuery(String field, String queryText)
> throws ParseException { ... }
>
> It will be called by the parser for both parts of the query above, so one
> could change the field depending on the requested type of search
> and the field name in the query.

But that wouldn't work for any other type of query:

  title:somethingFuzzy~

Though now that I think more about it, a simple s/title:/title_orig:/ 
before parsing would work, and of course make the default field 
dynamic.   I need to evaluate how many fields would need to be done 
this way - it'd be several.  Thanks for the food for thought!
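
For illustration, the s/title:/title_orig:/ idea might look roughly like
this in Java. It is only a sketch: the class name, the set of
case-aware fields, and the _lc suffix for the insensitive case are
assumptions, and the naive regex would also rewrite something that looks
like a field prefix inside a quoted phrase.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldPrefixRewriter {

    // A field prefix: a run of word characters followed by ':' that does not
    // directly follow another word character or ':'.
    private static final Pattern FIELD = Pattern.compile("(?<![\\w:])(\\w+):");

    /** Rewrites field prefixes in the raw query text before it is handed to a parser. */
    public static String rewrite(String query, Set<String> caseAwareFields, boolean caseSensitive) {
        String suffix = caseSensitive ? "_orig" : "_lc"; // hypothetical naming scheme
        Matcher m = FIELD.matcher(query);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String field = m.group(1);
            // Only fields known to have case variants are renamed; others pass through.
            String target = caseAwareFields.contains(field) ? field + suffix : field;
            m.appendReplacement(out, Matcher.quoteReplacement(target + ":"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> fields = new HashSet<>(Arrays.asList("title"));
        System.out.println(rewrite("title:Something AND anotherField:someOtherValue", fields, true));
        System.out.println(rewrite("title:somethingFuzzy~", fields, false));
    }
}
```

Because the rewrite happens on the query text, fuzzy, range and other
query types are covered without touching the parser. The default field
would still have to be chosen separately when the parser is constructed.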

>> only drawback now is that I'm duplicating indexes, but that is only an
>> issue in how long it takes to rebuild the index from scratch (currently
>> about 20 minutes or so on a good day - when the machine isn't swamped).
>
> Once the users get the hang of this, you might end up having to quadruple
> the index, or more.

Why would that be?   They want a case sensitive/insensitive switch.
How would it expand beyond that?

Erik


Re: Lucene in the Humanities

2005-02-19 Thread Paul Elschot
On Saturday 19 February 2005 11:02, Erik Hatcher wrote:
 
 On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote:
  By lowercasing the querytext and searching in title_lc ?
 
  Well sure, but how about this query:
 
 title:Something AND anotherField:someOtherValue
 
  QueryParser, as-is, won't be able to do field-name swapping.  I could
  certainly apply that technique on all the structured queries that I
  build up with the API, but with QueryParser it is trickier.   I'm
  definitely open for suggestions on improving how case is handled.  The
 
  Overriding this (1.4.3 QueryParser.jj, line 286) might work:
 
  protected Query getFieldQuery(String field, String queryText)
  throws ParseException { ... }
 
  It will be called by the parser for both parts of the query above, so 
  one
  could change the field depending on the requested type of search
  and the field name in the query.
 
 But that wouldn't work for any other type of query 
 title:somethingFuzzy~

To get that it would be necessary to override all query parser
methods that take a field argument.

 
 Though now that I think more about it, a simple s/title:/title_orig:/ 
 before parsing would work, and of course make the default field 

In the overriding getFieldQuery method something like:

if (caseSensitiveSearch(field) && originalFieldIndexed(field)) {
  field = field + "_orig";
} else { //the other 3 cases
 ...
}
return super.getFieldQuery(field, queryText);

The if statement could be factored out for the other overriding methods.
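
As a sketch of that factoring, the field decision could live in one
small helper that every overriding parser method calls before delegating
to super. Nothing below is Lucene API: FieldResolver and resolve are
made-up names, and the handling of the remaining cases (the _lc suffix,
or the field name unchanged) is an assumption, not something settled in
this thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FieldResolver {

    private final Set<String> caseVariantFields; // fields assumed to be indexed as <name>_orig and <name>_lc
    private final boolean caseSensitive;         // the user's case-sensitivity switch

    public FieldResolver(Set<String> caseVariantFields, boolean caseSensitive) {
        this.caseVariantFields = caseVariantFields;
        this.caseSensitive = caseSensitive;
    }

    /** Maps the field name typed in the query to the field that is actually searched. */
    public String resolve(String field) {
        if (!caseVariantFields.contains(field)) {
            return field; // assumption: fields without case variants are searched as named
        }
        // Assumption: the lowercased variant carries an _lc suffix.
        return caseSensitive ? field + "_orig" : field + "_lc";
    }

    public static void main(String[] args) {
        Set<String> fields = new HashSet<>(Arrays.asList("title"));
        FieldResolver sensitive = new FieldResolver(fields, true);
        FieldResolver insensitive = new FieldResolver(fields, false);
        System.out.println(sensitive.resolve("title"));
        System.out.println(insensitive.resolve("title"));
        System.out.println(sensitive.resolve("archivetype"));
    }
}
```

Each overriding method would then start with something like
field = resolver.resolve(field) and pass the result on to super.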

 dynamic.   I need to evaluate how many fields would need to be done 
 this way - it'd be several.  Thanks for the food for thought!
 
  only drawback now is that I'm duplicating indexes, but that is only an
  issue in how long it takes to rebuild the index from scratch 
  (currently
  about 20 minutes or so on a good day - when the machine isn't 
  swamped).
 
  Once the users get the hang of this, you might end up having to 
  quadruple
  the index, or more.
 
 Why would that be?   They want a case sensitive/insensitive switch.  
 How would it expand beyond that?

With an index for every combination of fields and case sensitivity for these
fields.

Regards,
Paul Elschot





Re: Lucene in the Humanities

2005-02-18 Thread Erik Hatcher
And before too many replies happen on this thread, I've corrected the 
spelling mistake in the subject!  :O



Re: Lucene in the Humanities

2005-02-18 Thread Erik Hatcher
On Feb 18, 2005, at 3:25 PM, Luke Shannon wrote:
> Nice work Erik. I would like to spend more time playing with it, but I saw a
> few things I really liked. When a specific query turns up no results you
> prompt the client to perform a free form search. Less savvy search users
> will benefit from this strategy.

That's merely an artifact of all searches going to the results page,
which just shows the free-form search on it.

> I also like the display of information when
> you select a result. Everything is at your finger tips without clutter.

For comparison, the older site's search is here:

http://jefferson.village.virginia.edu:2020/search.html

(don't bother trying it - it's SLOOOW)

And also for comparison, here is an older look:

http://jefferson.village.virginia.edu:8090/styler/servlet/SaxonServlet?source=http://jefferson.village.virginia.edu:2020/tamino/files/1-1847.s244.raw.xml&style=http://jefferson.village.virginia.edu:2020/tamino/rossetti.xsl&clear-stylesheet-cache=yes

Dig that URL!  The new look and URL is here:

http://www.rossettiarchive.org/docs/1-1847.s244.raw.html

> I did get this error when a name search failed to turn up results and I
> clicked 'help' in the free form search row (the second row).
>
> Page 'help-freeform.html' not found in application namespace.

I've corrected this and the fix will be in my next deployment :)
So nice to have a community of testers!  Thanks.

Erik


Re: Lucene in the Humanities

2005-02-18 Thread Paul Elschot
Erik,

Just curious: it would seem easier to use multiple fields for the
original case and lowercase searching. Is there any particular reason
you analyzed the documents to multiple indexes instead of multiple fields?

Regards,
Paul Elschot





Re: Lucene in the Humanities

2005-02-18 Thread Erik Hatcher
On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
> Erik,
>
> Just curious: it would seem easier to use multiple fields for the
> original case and lowercase searching. Is there any particular reason
> you analyzed the documents to multiple indexes instead of multiple
> fields?

I considered that approach, however to expose QueryParser I'd have to
get tricky.  If I have title_orig and title_lc fields, how would I
allow freeform queries of title:something?

Erik
p.s. It's fun to see the types of queries folks have already tried 
since I sent this e-mail (repeated queries are possibly someone 
paging):

INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = rosetti : hits = 3
INFO: Query = +year:[ TO 1911] +(archivetype:radheader OR archivetype:rap) : hits = 2182
INFO: Query = advil : hits = 0
INFO: Query = test : hits = 24
INFO: Query = td : hits = 1
INFO: Query = td : hits = 1
INFO: Query = woman : hits = 363
INFO: Query = woman : hits = 363
INFO: Query = hello : hits = 0
INFO: Query = +rosetta +archivetype:rap : hits = 0
INFO: Query = +year:[ TO 1911] +(archivetype:radheader OR archivetype:rap) : hits = 2182
INFO: Query = poem : hits = 316
INFO: Query = crisis : hits = 7
INFO: Query = crisis at every moment : hits = 1
INFO: Query = toy : hits = 41
INFO: Query = title:echer : hits = 0
INFO: Query = senori : hits = 0
INFO: Query = +dear +sirs : hits = 11
INFO: Query = title:more : hits = 0
INFO: Query = more : hits = 365
INFO: Query = title:rossetti : hits = 329
INFO: Query = +blessed +damozel : hits = 103
INFO: Query = title:test : hits = 0
INFO: Query = +test +archivetype:radheader : hits = 3
INFO: Query = crisis at every moment : hits = 1
INFO: Query = rome : hits = 70
INFO: Query = fdshjkfjkhkfad : hits = 0
INFO: Query = stone : hits = 153
INFO: Query = +title:shakespeare +archivetype:radheader : hits = 1
INFO: Query = title:xx i ll : hits = 0
INFO: Query = +dog +cat : hits = 6
INFO: Query = +year:[1280 TO 1305] +archivetype:radheader : hits = 0
INFO: Query = guru : hits = 0
INFO: Query = philosophy : hits = 14
INFO: Query = title:install : hits = 0
INFO: Query = +title:install +archivetype:radheader : hits = 0
INFO: Query = help freeform.html : hits = 0
INFO: Query = help freeform.html : hits = 0
INFO: Query = install : hits = 1
INFO: Query = life : hits = 554
INFO: Query = life : hits = 554
INFO: Query = life : hits = 554




Re: Lucene in the Humanities

2005-02-18 Thread Paul Elschot
On Friday 18 February 2005 21:55, Erik Hatcher wrote:
 
 On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
 
  Erik,
 
  Just curious: it would seem easier to use multiple fields for the
  original case and lowercase searching. Is there any particular reason
  you analyzed the documents to multiple indexes instead of multiple 
  fields?
 
 I considered that approach, however to expose QueryParser I'd have to 
 get tricky.  If I have title_orig and title_lc fields, how would I 
 allow freeform queries of title:something?

By lowercasing the querytext and searching in title_lc ?

Regards,
Paul Elschot.





Re: Lucene in the Humanities

2005-02-18 Thread Erik Hatcher
On Feb 18, 2005, at 6:37 PM, Paul Elschot wrote:
> On Friday 18 February 2005 21:55, Erik Hatcher wrote:
>> On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
>>> Erik,
>>>
>>> Just curious: it would seem easier to use multiple fields for the
>>> original case and lowercase searching. Is there any particular reason
>>> you analyzed the documents to multiple indexes instead of multiple
>>> fields?
>>
>> I considered that approach, however to expose QueryParser I'd have to
>> get tricky.  If I have title_orig and title_lc fields, how would I
>> allow freeform queries of title:something?
>
> By lowercasing the querytext and searching in title_lc ?

Well sure, but how about this query:

  title:Something AND anotherField:someOtherValue

QueryParser, as-is, won't be able to do field-name swapping.  I could
certainly apply that technique on all the structured queries that I
build up with the API, but with QueryParser it is trickier.   I'm
definitely open for suggestions on improving how case is handled.  The
only drawback now is that I'm duplicating indexes, but that is only an
issue in how long it takes to rebuild the index from scratch (currently
about 20 minutes or so on a good day - when the machine isn't swamped).

Erik