RE: New Lucene-powered Website
Ulrich,

Well done! I too would love to know how you implemented the summarizer. If you are unable to provide the details, could you steer a person in the right direction?

I've experimented with a few applications that do this, some my own, some found via searches, but none are as clear-cut and professional as yours (most simply grab the first 200 or so characters of a page, etc.).

Regards,

John

-----Original Message-----
From: news [mailto:[EMAIL PROTECTED]] On Behalf Of Ulrich Mayring
Sent: Thursday, November 27, 2003 8:30 PM
To: [EMAIL PROTECTED]
Subject: New Lucene-powered Website

Hello,

we (DENIC) are the world's second largest domain registry (the .de zone has almost 6.9 million domains) and are using Lucene to index and search our website in a high-traffic scenario. Most of our web pages are available in English in addition to our native language, German.

If you want to try our Lucene-based search engine, please start here:

http://www.denic.de/en/special/index.jsp

Use the input field on the page to search our website. Don't use the input field at the top right; that one only searches domains in our domain database and has nothing to do with Lucene. The indexes for German and English are separate, so you should find only English pages from that page.

A somewhat interesting feature is the summarizer: on the results page you'll get a short summary of each page. These are not hand-written blurbs; rather, they are generated automatically from the HTML pages at indexing time. I'd be especially interested in improvement suggestions in this area.

Naturally, the automatically generated texts don't have the same quality as hand-written ones. But they're better than nothing and, in my eyes, more useful than Google-style excerpts. How many times has it happened to you that the Google excerpt doesn't really tell you anything because it's totally out of context? Summaries tell you what the whole page is about, regardless of the context within which your search terms may appear. After reading the summary you should (hopefully) be able to decide whether the page contains the info you're looking for.

Comments welcome!

We're using the Snowball stemmers/analyzers for German and English, custom stopword lists and the HTML parser from the SourceForge htmlparser project. Apart from that it's vanilla Lucene.

cheers,

Ulrich

-----
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
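
The thread does not say how the DENIC summarizer works, so the following is only a minimal sketch of one common approach (frequency-based sentence extraction), not Ulrich's method. It assumes the HTML has already been reduced to plain text, for instance by an HTML parser, and that it runs once per page at indexing time. The class name `ExtractiveSummarizer`, its tiny stopword list and the sentence-splitting rule are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: picks the sentences whose words are most frequent
// in the page as a whole, and returns them in their original order.
final class ExtractiveSummarizer {
    // Tiny illustrative stopword list (an assumption; real lists are much longer).
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "for", "in", "is", "it", "of", "on", "the", "to", "we"));

    // Lower-cases the text, splits on anything that is not a letter or digit,
    // and drops stopwords.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!w.isEmpty() && !STOPWORDS.contains(w)) {
                out.add(w);
            }
        }
        return out;
    }

    static String summarize(String plainText, int maxSentences) {
        // Naive sentence split: whitespace that follows '.', '!' or '?'.
        String[] sentences = plainText.trim().split("(?<=[.!?])\\s+");

        // Term frequencies over the whole page.
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words(plainText)) {
            freq.merge(w, 1, Integer::sum);
        }

        // Score each sentence by the average page-wide frequency of its words.
        double[] score = new double[sentences.length];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            List<String> ws = words(sentences[i]);
            double total = 0;
            for (String w : ws) {
                total += freq.get(w);
            }
            score[i] = ws.isEmpty() ? 0 : total / ws.size();
            order.add(i);
        }

        // Best score first; the sort is stable, so ties keep document order.
        order.sort((a, b) -> Double.compare(score[b], score[a]));
        int n = Math.max(0, Math.min(maxSentences, order.size()));
        List<Integer> chosen = new ArrayList<>(order.subList(0, n));
        Collections.sort(chosen); // restore original sentence order

        StringBuilder sb = new StringBuilder();
        for (int i : chosen) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(sentences[i]);
        }
        return sb.toString();
    }
}
```

Because the summary is computed from the whole page rather than from the neighbourhood of the query terms, it can be stored in the index alongside the document and shown unchanged for every query, which matches the "generated at indexing time" behaviour described above.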
RE: commercial websites powered by Lucene?
Ulrich, Vince,

I think a big "I'm a dummy" post may be in order. ;-)

I'll do as you suggested immediately.

Regards,

John

-----Original Message-----
From: news [mailto:[EMAIL PROTECTED]] On Behalf Of Ulrich Mayring
Sent: Thursday, June 26, 2003 1:30 AM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?

John Takacs wrote:

> Good idea. I was just following the install directions, but if I don't have
> to pay attention to the install directions, I'll find a much better one.
>
> Any hints? Previous email discussion maybe? I found some references via
> searching the archives, but I'm not 100% convinced they are applicable to my
> situation.

I'm not sure what you mean by install directions; Lucene is just a JAR file, and you use it like any other Java class library. There's also the WAR file with a few demos, which you can just drop into Tomcat.

Perhaps you were trying to build it? I just downloaded the binary distribution and used it.

Ulrich
RE: commercial websites powered by Lucene?
Good idea. I was just following the install directions, but if I don't have to pay attention to the install directions, I'll find a much better one.

Any hints? Previous email discussion maybe? I found some references via searching the archives, but I'm not 100% convinced they are applicable to my situation.

John

-----Original Message-----
From: news [mailto:[EMAIL PROTECTED]] On Behalf Of Ulrich Mayring
Sent: Thursday, June 26, 2003 12:48 AM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?

John Takacs wrote:

> I'd love to try Lucene with the above, but the Lucene install fails because
> of JavaCC issues. Surprised more people haven't encountered this problem,
> as the install instructions are out of date.

Well, what do you need JavaCC for? Isn't it just the technology for building the supplied HTML parser? There are much better HTML parsers out there that you can use.

Ulrich
RE: commercial websites powered by Lucene?
Tatu,

I agree 100% with everything you've said.

Let's look at MySQL, for example. Great database. No doubt about it. BUT, looking at the full-text indexing/searching part... it's not up to snuff.

Currently, I'm using MySQL's full-text search support. I have a database of 3-5 million rows. Each row is unique, let's say a product. Each row has several columns, but the two I search on are title and description. I created a full-text index on title and description. Title has approximately 100 characters, and description has 255 characters.

At the moment, MySQL is taking 50 seconds or more to return results on simple one-word searches. My dedicated server is a 2.0 GHz P4 with 1.5 GB of RAM running RedHat Linux 7.3, with nothing else running on it; another server handles HTTP requests. It is a dedicated MySQL box. In addition, I'm the only person making queries.

Obviously, the above performance is unacceptable for real-world web applications.

I'd love to try Lucene with the above, but the Lucene install fails because of JavaCC issues. Surprised more people haven't encountered this problem, as the install instructions are out of date.

Regards,

John

-----Original Message-----
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 25, 2003 12:26 PM
To: Lucene Users List
Subject: Re: commercial websites powered by Lucene?

On Tuesday 24 June 2003 07:36, Ulrich Mayring wrote:
> Chris Miller wrote:
...
> Well, nothing against Lucene, but it doesn't solve your problem, which
> is an overloaded DB-Server. It may temporarily alleviate the effects,
> but you'll soon be at the same load again. So I'd recommend to install

I don't think that would necessarily be the case. As you mention later on, indexing data stored in a DB does flatten it to allow faster indexing (and retrieval), and faster in this context means more efficient: not only sharing the load between the DB and the search engine, but potentially lowering the total load?

The alternative, data-warehouse-like preprocessing of data for faster search, would likely be doable too, but it's usually more useful for running reports. For actual searches Lucene does its job nicely and efficiently; the biggest problems I've seen are more related to relevancy questions. But that's where tuning Lucene's ranking should be easier than trying to build your own ranking from raw database hits (except if one uses OracleText or the like, which is pretty much a search engine on top of the DB itself).

So, to me it all comes down to the "right tool for the job" aspect: DBs are good at mass retrieval of data, or at using aggregate functions (on the read-only side), whereas dedicated search engines are better for, well, searching.

...
> Of course, in real life there may be political obstacles which will
> prevent you from doing the right thing as detailed above for example,
> and your only chance is to circumvent in some way - and then Lucene is a
> great way to do that. But keep in mind that you are basically
> reinventing the functionality that is already built-in in a database :)

It depends on the type of queries, but Lucene certainly has much more advanced text-searching functionality, even if the indexed content comes from a rigid structure like an RDBMS. I'm not sure using a ready product like Lucene is reinventing much functionality, even considering synchronization issues?

So I would go as far as saying that, for searching purposes, plain vanilla RDBMSs are not all that great in the first place. Even if queries need not use advanced search features (advanced as in not just using % and _ in addition to exact matches), Lucene may well offer better search performance and functionality.

-+ Tatu +-
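
To make the "flattening" point concrete, here is a minimal sketch of an inverted index built from rows with a title and a description, as in the product table described above. It is not how Lucene stores its index internally (Lucene's on-disk format is far more elaborate) and it does no ranking; the class `TinyInvertedIndex` and its methods are hypothetical names used only for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration: each database row is flattened into one bag of
// terms, and every term maps to the sorted set of row IDs that contain it.
final class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // "Flatten" one row: title and description become a single text field.
    void addRow(int rowId, String title, String description) {
        String flattened = title + " " + description;
        for (String term : flattened.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(rowId);
            }
        }
    }

    // A one-word search is a single map lookup, independent of the row count.
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return Collections.emptySortedSet();
        }
        return Collections.unmodifiableSortedSet(hits);
    }
}
```

The cost of a one-word query here depends on the number of matching rows rather than on the size of the table, which is the efficiency argument made above for moving search out of the database.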
RE: commercial websites powered by Lucene?
Hi Nader,

This thread is by far one of the best, and most practical. It will only be topped when someone provides benchmarks for a DMOZ.org-type directory of 3 million plus URLs. I would love to, but the whole JavaCC thing is a show stopper.

Questions:

I noticed that search is a little slow. What has been your experience? Perhaps it was a bandwidth issue, but I'm living in a country with the greatest internet connectivity and penetration in the world (South Korea), so I don't think that is an issue on my end.

You have 500,000 resumes. Based on the steps you took to get to 500,000, do you think your current setup will scale to millions, say 3 million or so?

What is your hardware like? CPU/RAM?

Warm regards, and thanks for sharing. If I can ever get past the Lucene/JavaCC installation failure, I'll share my benchmarks on the above directory scenario.

John

-----Original Message-----
From: Nader S. Henein [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 24, 2003 5:30 PM
To: 'Lucene Users List'
Subject: RE: commercial websites powered by Lucene?

I handle updates and inserts the same way: first I delete the document from the index and then I insert it (better safe than sorry). I batch my updates/inserts every twenty minutes. I would do it at smaller intervals, but since I have to sync the XML files created from the DB to three machines (I maintain three separate Lucene indices on my three separate web servers), it takes a little longer. You have to batch your changes because updating the index takes time, as opposed to deletes, which I batch every two minutes.

You won't have a problem updating the index and searching at the same time, because Lucene updates the index on a separate set of files and then, when it's done, overwrites the old version. I've had to provide for backups and for things like server crashes mid-indexing, but I was using Oracle Intermedia before and Lucene BLOWS IT AWAY.
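
The delete-then-insert batching described above can be sketched roughly as follows. This is not Nader's code: `BatchedIndexUpdater`, its nested `Index` interface and the in-memory stand-in are hypothetical, and a real implementation would put Lucene's own delete and add calls behind that interface. The two flush methods are meant to be called by whatever scheduler is in use, for example every two minutes for deletes and every twenty minutes for updates/inserts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: changes are queued as they arrive and applied in batches.
final class BatchedIndexUpdater {
    // Assumed minimal index API; a real one would wrap a Lucene index.
    interface Index {
        void deleteById(String id);
        void add(String id, String text);
    }

    // In-memory stand-in that, like a plain add to an index, does not
    // enforce unique IDs: adding the same ID twice yields two entries.
    static final class InMemoryIndex implements Index {
        final List<String[]> docs = new ArrayList<>();

        public void deleteById(String id) {
            docs.removeIf(d -> d[0].equals(id));
        }

        public void add(String id, String text) {
            docs.add(new String[] {id, text});
        }

        int count(String id) {
            int n = 0;
            for (String[] d : docs) {
                if (d[0].equals(id)) {
                    n++;
                }
            }
            return n;
        }
    }

    private final Index index;
    private final Map<String, String> pendingUpserts = new LinkedHashMap<>();
    private final Set<String> pendingDeletes = new LinkedHashSet<>();

    BatchedIndexUpdater(Index index) {
        this.index = index;
    }

    // The latest queued change for an ID wins.
    synchronized void queueUpsert(String id, String text) {
        pendingDeletes.remove(id);
        pendingUpserts.put(id, text);
    }

    synchronized void queueDelete(String id) {
        pendingUpserts.remove(id);
        pendingDeletes.add(id);
    }

    // Cheap batch: deletes only. Returns the number of IDs processed.
    synchronized int flushDeletes() {
        int n = pendingDeletes.size();
        for (String id : pendingDeletes) {
            index.deleteById(id);
        }
        pendingDeletes.clear();
        return n;
    }

    // Expensive batch: updates and inserts are handled identically,
    // delete first and then add, so an update never leaves a duplicate.
    synchronized int flushUpserts() {
        int n = pendingUpserts.size();
        for (Map.Entry<String, String> e : pendingUpserts.entrySet()) {
            index.deleteById(e.getKey());
            index.add(e.getKey(), e.getValue());
        }
        pendingUpserts.clear();
        return n;
    }
}
```

Treating every change as "delete, then add" means the caller never has to know whether a document is already indexed, which is the "better safe than sorry" point made above.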
-----Original Message-----
From: news [mailto:[EMAIL PROTECTED]] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 12:06 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?

Hi Nader,

I was wondering if you'd mind me asking you a couple of questions about your implementation?

The main thing I'm interested in is how you handle updates to Lucene's index. I'd imagine you have a fairly high turnover of CVs and jobs, so index updates must place a reasonable load on the CPU/disk. Do you keep CVs and jobs in the same index or two different ones? And what is the process you use to update the index(es) - do you batch-process updates or do you handle them in real-time as changes are made?

Any insight you can offer would be much appreciated as I'm about to implement something similar and am a little unsure of the best approach to take. We need to be able to handle indexing about 60,000 documents/day, while allowing (many) searches to continue operating alongside.

Thanks!
Chris

"Nader S. Henein" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> We use Lucene http://www.bayt.com , we're basically an on-line
> Recruitment site and up until now we've got around 500 000 CVs and
> documents indexed with results that stump Oracle Intermedia.
>
> Nader Henein
> Senior Web Dev
>
> Bayt.com
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, June 04, 2003 6:09 PM
> To: [EMAIL PROTECTED]
> Subject: commercial websites powered by Lucene?
>
> Hello All,
>
> I've been trying to find examples of large commercial websites that
> use Lucene to power their search. Having such examples would make
> Lucene an easy sell to management
>
> Does anyone know of any good examples? The bigger the better, and the
> more the better.
>
> TIA,
> -John