Re: Lucene in the Humanities

Praveen Peddi Fri, 18 Feb 2005 12:01:38 -0800

Good work Erik (even though the UI could be made prettier). We use Lucene, so I have some knowledge of it. I could see the features you are using with Lucene (like paging, highlighting, and different kinds of phrases). Overall, good stuff.

Praveen

----- Original Message -----
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene User" <lucene-user@jakarta.apache.org>
Sent: Friday, February 18, 2005 2:46 PM
Subject: Lucene in the Humanities

It's about time I actually did something real with Lucene....  :)
I have been working with the Applied Research in Patacriticism group at the University of Virginia for a few months and am finally ready to present what I've been doing. The primary focus of my group is working with the Rossetti Archive - poems, artwork, interpretations, collections, and so on of Dante Gabriel Rossetti. I was initially brought on to build a collection and exhibit system, though I got detoured a bit as I got involved in applying Lucene to the archive to replace their existing search system. The existing system used an old version of Tamino with XPath queries. Tamino is not at fault here, at least not entirely, because our data is in a very complicated set of XML files with a lot of non-normalized and legacy metadata - getting at things via XPath is challenging and practically impossible in many cases.
My work is now presentable at
http://www.rossettiarchive.org/rose
(rose is for ROssetti SEarch)
This system is implicitly designed for academics who are delving into Rossetti's work, so it may not be all that interesting for most of you. Have fun and send me any interesting things you discover, especially any issues you may encounter.

Here are some numbers to give you a sense of what is going on underneath... There are currently 4,983 XML files, totaling about 110MB. Without getting into a lot of details of the confusing domain, there are basically 3 types of XML files (works, pictures, and transcripts).

It is important that there be both case-sensitive and case-insensitive searches. To accomplish that, a custom analyzer is used in two different modes, one applying a LowerCaseFilter and one not, with the same documents written to two different indexes. There is one particular type of XML file that gets indexed as two different types of documents (a specialized summary/header type). In this first set of indexes, it is basically a one-to-one mapping of XML file to Lucene Document (with one type being indexed twice in different ways) - all said, there are 5,539 documents in each of the two main indexes.

The transcript type gets sliced into another set of original-case and lowercased indexes, with each document in that index representing a document division (a <div> element in the XML). There are 12,326 documents in each of these <div>-level indexes.

All said, the 4 indexes built total about 3GB in size - I'm storing several fields in order to hit-highlight. Only one of these indexes is hit at a time; which one depends on the parameters used when querying.
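The two-mode analyzer setup described above can be illustrated with a minimal sketch. This is plain Python with a toy inverted index, not Lucene code, and it is not the archive's actual implementation; the function names (analyze, build_index, search) and the sample documents are assumptions made up for illustration.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")

def analyze(text, lowercase):
    """Split text into word tokens; optionally lowercase them
    (standing in for the LowerCaseFilter step)."""
    tokens = TOKEN.findall(text)
    return [t.lower() for t in tokens] if lowercase else tokens

def build_index(docs, lowercase):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text, lowercase):
            index[token].add(doc_id)
    return index

# Hypothetical sample documents, for illustration only.
docs = {
    "d1": "The Blessed Damozel leaned out",
    "d2": "a blessed day in spring",
}

# The same documents are written to two indexes, one per analyzer mode.
# The dictionary key is the "case_sensitive" query parameter.
indexes = {
    True: build_index(docs, lowercase=False),   # original-case index
    False: build_index(docs, lowercase=True),   # lowercased index
}

def search(term, case_sensitive):
    """Pick exactly one index from the query parameter and analyze the
    query the same way that index was analyzed."""
    index = indexes[case_sensitive]
    tokens = analyze(term, lowercase=not case_sensitive)
    if not tokens:
        return []
    # A document matches only if it contains every query token.
    return sorted(set.intersection(*(index.get(t, set()) for t in tokens)))
```

With this sketch, search("Blessed", True) consults only the original-case index, while search("Blessed", False) consults only the lowercased one. The point is that the query must be analyzed in the same mode as the index it is sent to, which is why two separate indexes are needed rather than one.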

Lucene brought the search times into a state that is usable, and impressive to the scholars. The previous search solution often timed the browser out! Searches now return in the milliseconds range.

The amount of data is tiny compared to most usages of Lucene, but things are getting interesting in other ways. There has been little tuning in terms of ranking quality so far, but this is the next area of work. There is one document type that is more important than the others, and it is being boosted during indexing. There is now a growing interest in tinkering with all the new knobs and dials that are now possible. Putting in similar and more-like-this features is desired and will be relatively straightforward to implement. I'm currently using a catch-all aggregate field technique for a default field for QueryParser searching. Using a multi-field expansion instead is desirable, though. So, I've got my homework to do and catch up on all the goodness that has been mentioned in this list recently regarding all of these techniques.
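To make the difference between the two default-field techniques concrete, here is a minimal sketch in plain Python. It is not Lucene or QueryParser code; the field names, the sample records, and the boost values are all assumptions chosen for illustration, and the scoring is a simple sum of boosts rather than Lucene's ranking formula.

```python
# Hypothetical field names and sample records, for illustration only.
FIELDS = ["title", "body", "notes"]

docs = {
    "w1": {"title": "Proserpine", "body": "oil painting", "notes": "pomegranate"},
    "p1": {"title": "Sonnet", "body": "Proserpine in the sonnet", "notes": ""},
}

def tokens(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def with_catch_all(doc):
    """Catch-all approach: copy every field's text into one aggregate
    field at index time."""
    aggregated = dict(doc)
    aggregated["all"] = " ".join(doc[f] for f in FIELDS)
    return aggregated

def search_catch_all(term):
    """An unfielded term is looked up in the single aggregate field."""
    term = term.lower()
    return sorted(
        doc_id for doc_id, doc in docs.items()
        if term in tokens(with_catch_all(doc)["all"])
    )

def expand(term, boosts):
    """Multi-field expansion: rewrite one term into a list of
    (field, term, boost) clauses that are OR'ed together."""
    return [(field, term.lower(), boosts.get(field, 1.0)) for field in FIELDS]

def search_multi_field(term, boosts):
    """Score each document by summing the boosts of the clauses it matches."""
    scores = {}
    for field, t, boost in expand(term, boosts):
        for doc_id, doc in docs.items():
            if t in tokens(doc[field]):
                scores[doc_id] = scores.get(doc_id, 0.0) + boost
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

The catch-all field finds both records but cannot tell a title match from a body match. The expanded form keeps that distinction, so a call such as search_multi_field("proserpine", {"title": 2.0}) can rank the title match first, which is the kind of knob the ranking work would want.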

An area where I'd like to solicit more help from the community relates to something akin to personalization. The scholars would like to be able to tune results based on the role (such as "art historian") that is searching the site. This would involve some type of training or continual learning process so that someone searching feeds back preferences implicitly for their queries by visiting the actual documents that are of interest. Now that the scholars have seen what is possible (I showed them the cool SearchMorph comparison page searching Wikipedia for "rossetti"), they want more and more!

So - here's where I'm soliciting feedback - who's doing these types of things in the realm of the Humanities? Where should we go from here in terms of researching and applying the types of features dreamed about here? How would you recommend implementing these types of features?

I'd be happy to share more about what I've done under the covers. As you may be able to tell, the web UI is Tapestry for the search and results pages (though you won't be able to tell from the URLs you'll see :). The UI was designed primarily by one of our very graphics/CSS-savvy postdoc research associates, and was designed with the research scholar in mind. I continue to push for the "free form" search box to transcend structural search, but there are some good scholarly reasons to have focused structural searching also. The documents beyond the links in the search results were another area where I've applied some elbow grease - they were originally dynamically generated with hideous URLs through Tamino and a Saxon transformation servlet. These are documents that _rarely_ change - so they are now being XSL'd into all sorts of HTML views during a build process and statically served with Apache with quite clean URLs (and Google friendly, which, interestingly, is not something scholars care that much about for these types of archives, it seems). The XML files are accessible directly as hyperlinks at the bottoms of the document pages too - so you can see the parsing fun I've had (thank you JDOM and XPath!).
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]