we (DENIC) are the world's second largest domain registry (.de-zone has almost 6.9 million domains) and are using Lucene to index and search our website in a high-traffic scenario. Most of our web pages are available in English in addition to our native language German. If you want to try our Lucene-based search engine, please start here:
http://www.denic.de/en/special/index.jsp
Use the input field on the page to search our website. Don't use the input field at the top right, that is only for searching domains in our domain database, it has nothing to do with Lucene.
The indexes for German and English are seperate, so you should find only English pages from that page.
A somewhat interesting feature is the summarizer, on the results page you'll get a short summary of the page. These are not hand-written blurbs, rather they are generated automatically from the HTML pages at indexing time. I'd be especially interested in improvement suggestions in this area.
Naturally, the automatically generated texts don't have the same quality as hand-written ones. But they're better than nothing and in my eyes more useful than Google-style excerpts. How many times has it happened to you that the Google excerpt doesn't really tell you anything, because it's totally out of context? Summaries tell you what the whole page is about, irregardless of the context within which your search terms may appear. After reading the summary you should (hopefully) be able to decide whether the page contains the info you're looking for. Comments welcome!
We're using the snowball stemmers/analyzers for German and English, custom stopword lists and the HTML parser from the Sourceforge htmlparser project. Apart from that it's vanilla Lucene.
cheers,
Ulrich
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]