RE: New Lucene-powered Website

2003-11-28 Thread Dr. John Takacs
Ulrich,

Well done!

I too would love to know how you implemented the summarizer.  If you are
unable to provide the details, would you be able to steer a person in the
right direction?  I've experimented with a few applications that will do it,
some my own, some found via searches, but none are as clear-cut and
professional as yours (i.e. most were simply grabbing the first 200 or so
characters of a page, etc.).

Regards,

John

-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Ulrich Mayring
Sent: Thursday, November 27, 2003 8:30 PM
To: [EMAIL PROTECTED]
Subject: New Lucene-powered Website


Hello,

we (DENIC) are the world's second largest domain registry (.de-zone has
almost 6.9 million domains) and are using Lucene to index and search our
website in a high-traffic scenario. Most of our web pages are available
in English in addition to our native language German. If you want to try
our Lucene-based search engine, please start here:

http://www.denic.de/en/special/index.jsp

Use the input field on the page to search our website. Don't use the
input field at the top right; that one only searches our domain
database and has nothing to do with Lucene.

The indexes for German and English are separate, so you should find only
English pages from that page.

A somewhat interesting feature is the summarizer: on the results page
you'll get a short summary of each page. These are not hand-written
blurbs; rather, they are generated automatically from the HTML pages at
indexing time. I'd be especially interested in improvement suggestions
in this area.

Naturally, the automatically generated texts don't have the same quality
as hand-written ones. But they're better than nothing and in my eyes
more useful than Google-style excerpts. How many times has it happened
to you that the Google excerpt doesn't really tell you anything, because
it's totally out of context? Summaries tell you what the whole page is
about, regardless of the context within which your search terms may
appear. After reading the summary you should (hopefully) be able to
decide whether the page contains the info you're looking for. Comments
welcome!
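
The post does not say how the DENIC summarizer works, so the following is
only a minimal sketch of one common approach (frequency-based extractive
summarization), not the method described above. It assumes the HTML has
already been reduced to plain text; the class name, the tiny stopword list
and the naive sentence splitting are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: picks the sentences whose words are most frequent in the page.
class ExtractiveSummarizer {
    // Tiny illustrative stopword list; a real one would be much longer.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "in", "is", "of", "or", "the", "to"));

    /** Returns up to maxSentences of the highest-scoring sentences, in original order. */
    static String summarize(String text, int maxSentences) {
        // Naive sentence split: whitespace that follows '.', '!' or '?'.
        String[] sentences = text.trim().split("(?<=[.!?])\\s+");

        // Count how often each non-stopword occurs in the whole text.
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences) {
            for (String w : words(s)) {
                freq.merge(w, 1, Integer::sum);
            }
        }

        // Score each sentence by the average frequency of its words.
        final double[] scores = new double[sentences.length];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            List<String> ws = words(sentences[i]);
            double sum = 0;
            for (String w : ws) {
                sum += freq.get(w);
            }
            scores[i] = ws.isEmpty() ? 0 : sum / ws.size();
            order.add(i);
        }

        // Highest score first; the sort is stable, so ties keep the earlier sentence.
        order.sort((a, b) -> Double.compare(scores[b], scores[a]));
        int n = Math.max(0, Math.min(maxSentences, order.size()));
        List<Integer> chosen = new ArrayList<>(order.subList(0, n));
        Collections.sort(chosen); // restore document order

        StringBuilder sb = new StringBuilder();
        for (int i : chosen) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(sentences[i]);
        }
        return sb.toString();
    }

    // Lowercases, splits on anything that is not a letter or digit, drops stopwords.
    private static List<String> words(String sentence) {
        List<String> out = new ArrayList<>();
        for (String w : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!w.isEmpty() && !STOPWORDS.contains(w)) {
                out.add(w);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String page = "Lucene is a search library. "
            + "Lucene indexes text and Lucene searches text. "
            + "Cats sleep.";
        System.out.println(summarize(page, 1));
    }
}
```

Because the summary depends only on the page text and not on the query, it
could be computed once at indexing time and stored alongside the document,
which matches the behaviour described above.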

We're using the snowball stemmers/analyzers for German and English,
custom stopword lists and the HTML parser from the Sourceforge
htmlparser project. Apart from that it's vanilla Lucene.

cheers,

Ulrich



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







RE: commercial websites powered by Lucene?

2003-06-25 Thread John Takacs
Ulrich, Vince,

I think a big, "I'm a dummy" post may be in order.  ;-)

I'll do as you suggested immediately.

Regards,

John

-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Ulrich Mayring
Sent: Thursday, June 26, 2003 1:30 AM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?


John Takacs wrote:
> Good idea.  I was just following the install directions, but if I don't
> have to pay attention to the install directions, I'll find a much better
> one.
>
> Any hints?  Previous email discussion maybe?  I found some references via
> searching the archives, but I'm not 100% convinced they are applicable to
> my situation.

I'm not sure what you mean by install directions. Lucene is just a JAR
file, and you use it like any other Java class library. There's also the
WAR file with a few demos, which you can just drop into Tomcat.

Perhaps you were trying to build it? I just downloaded the binary
distribution and used it.

Ulrich






RE: commercial websites powered by Lucene?

2003-06-25 Thread John Takacs
Good idea.  I was just following the install directions, but if I don't have
to pay attention to the install directions, I'll find a much better one.

Any hints?  Previous email discussion maybe?  I found some references via
searching the archives, but I'm not 100% convinced they are applicable to my
situation.

John



-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Ulrich Mayring
Sent: Thursday, June 26, 2003 12:48 AM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?


John Takacs wrote:
 >
> I'd love to try Lucene with the above, but the Lucene install fails
> because of JavaCC issues.  Surprised more people haven't encountered this
> problem, as the install instructions are out of date.

Well, what do you need JavaCC for? Isn't it just the technology for
building the supplied HTML-Parser? There are much better HTML parsers
out there, which you can use.

Ulrich






RE: commercial websites powered by Lucene?

2003-06-25 Thread John Takacs
Tatu,

I agree 100% with everything you've said.

Let's look at MySQL for example.  Great database.  No doubt about it.

BUT, looking at the full-text indexing/searching part... it's not up to snuff.

Currently, I'm using MySQL's full-text search support. I have a database of
3-5 million rows. Each row is unique, let's say a product. Each row has
several columns, but the two I search on are title and description. I
created a full-text index on title and description. Title has approximately
100 characters, and description has 255 characters.

At the moment, MySQL is taking more than 50 seconds to return results on
simple one-word searches. My dedicated server is a 2.0 GHz P4 with 1.5 GB of
RAM running Red Hat Linux 7.3, with nothing else running on it, i.e. another
server is handling HTTP requests. It is a dedicated MySQL box.  In addition,
I'm the only person making queries.

Obviously, the above performance is unacceptable for real world web
applications.

I'd love to try Lucene with the above, but the Lucene install fails because
of JavaCC issues.  Surprised more people haven't encountered this problem,
as the install instructions are out of date.

Regards,

John



-Original Message-
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 25, 2003 12:26 PM
To: Lucene Users List
Subject: Re: commercial websites powered by Lucene?


On Tuesday 24 June 2003 07:36, Ulrich Mayring wrote:
> Chris Miller wrote:
...
> Well, nothing against Lucene, but it doesn't solve your problem, which
> is an overloaded DB-Server. It may temporarily alleviate the effects,
> but you'll soon be at the same load again. So I'd recommend to install

I don't think that would necessarily be the case. As you mention later on,
indexing data stored in a DB flattens it to allow faster indexing (and
retrieval), and faster in this context means more efficient: not only
sharing the load between DB and search engine, but potentially lowering the
total load?

The alternative, data-warehouse-like preprocessing of data for faster
search, would likely be doable too, but it's usually more useful for running
reports. For actual searches Lucene does its job nicely and efficiently; the
biggest problems I've seen are more related to relevancy questions. But
that's where tuning Lucene's ranking should be easier than trying to build
your own ranking from raw database hits (unless one uses OracleText or
similar, which is pretty much a search engine on top of the DB itself).

So, to me it all comes down to the "right tool for the job" aspect: DBs are
good at mass retrieval of data, or at using aggregate functions (on the
read-only side), whereas dedicated search engines are better for, well,
searching.
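
To illustrate why a dedicated search engine can answer one-word queries
without scanning rows, here is a minimal in-memory sketch of an inverted
index (a map from each term to the ids of the rows that contain it). It is a
deliberately simplified stand-in, not Lucene's actual data structure or API;
the class and method names are assumptions made for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical toy index: term -> ids of the rows whose text contains that term.
class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Tokenizes the text once, at indexing time, and records the row id under each term. */
    void add(int rowId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(rowId);
            }
        }
    }

    /** One-word search: a single map lookup, no scan over the indexed rows. */
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return new TreeSet<>();
        }
        return Collections.unmodifiableSortedSet(hits);
    }

    /** AND search: intersect the posting sets of all given terms. */
    SortedSet<Integer> searchAll(String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> hits = search(term);
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Red leather wallet");
        index.add(2, "Blue leather belt");
        index.add(3, "Red wool scarf");
        System.out.println(index.search("leather"));
        System.out.println(index.searchAll("red", "leather"));
    }
}
```

The tokenizing cost is paid once when a row is indexed; ranking, stemming
and on-disk storage, which a real engine adds on top, are left out here.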

...
> Of course, in real life there may be political obstacles which will
> prevent you from doing the right thing as detailed above for example,
> and your only chance is to circumvent in some way - and then Lucene is a
> great way to do that. But keep in mind that you are basically
> reinventing the functionality that is already built-in in a database :)

It depends on the type of queries, but Lucene certainly has much more
advanced text searching functionality, even if the indexed content comes from
a rigid structure like an RDBMS. I'm not sure using a ready product like
Lucene is reinventing much functionality, even considering synchronization
issues?

So I would go as far as saying that for searching purposes, plain vanilla
RDBMSs are not all that great in the first place. Even if queries need not
use advanced search features (advanced as in not just using % and _ in
addition to exact matches), Lucene may well offer better search performance
and functionality.

-+ Tatu +-





RE: commercial websites powered by Lucene?

2003-06-24 Thread John Takacs
Hi Nader,

This thread is by far one of the best, and most practical.  It will only be
topped when someone provides benchmarks for a DMOZ.org type directory of 3
million plus urls.  I would love to, but the whole JavaCC thing is a show
stopper.

Questions:

I noticed that search is a little slow.  What has been your experience?
Perhaps it was a bandwidth issue, but I'm living in a country with the
greatest internet connectivity and penetration in the world (South Korea),
so I don't think that is an issue on my end.

You have 500,000 resumes.  Based on the steps you took to get to 500,000, do
you think your current setup will scale to millions, like say, 3 million or
so?

What is your hardware like?  CPU/RAM?

Warm regards, and thanks for sharing.  If I can ever get past the
Lucene/JavaCC installation failure, I'll share my benchmarks on the above
directory scenario.

John



-Original Message-
From: Nader S. Henein [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 24, 2003 5:30 PM
To: 'Lucene Users List'
Subject: RE: commercial websites powered by Lucene?


I handle updates and inserts the same way: first I delete the document
from the index, then I insert it (better safe than sorry). I batch my
updates/inserts every twenty minutes. I would do it in smaller intervals,
but since I have to sync the XML files created from the DB to three
machines (I maintain three separate Lucene indices on my three separate
web servers), it takes a little longer. You have to batch your changes
because updating the index takes time, as opposed to deletes, which I
batch every two minutes. You won't have a problem updating the index and
searching at the same time, because Lucene updates the index on a
separate set of files and only overwrites the old version when it's
done. I've had to provide for backups and for things like server crashes
mid-indexing, but I was using Oracle Intermedia before and Lucene BLOWS
IT AWAY.
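
The batching scheme described above could be sketched roughly as follows.
This is an illustration under stated assumptions, not the Bayt.com code: a
plain in-memory map stands in for the Lucene index, the class and method
names are hypothetical, and the copy-then-swap step only mimics the idea of
building the new version separately and replacing the old one once it is
complete.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: queue changes, then apply them in batches.
class BatchedIndexUpdater {
    // Stand-in for the search index: document id -> indexed text.
    // Searches read this reference and therefore always see a complete version.
    private volatile Map<String, String> live = new HashMap<>();

    private final Map<String, String> pendingUpserts = new LinkedHashMap<>();
    private final Set<String> pendingDeletes = new HashSet<>();

    /** Inserts and updates are queued the same way; the latest change per id wins. */
    synchronized void queueUpsert(String id, String text) {
        pendingDeletes.remove(id);
        pendingUpserts.put(id, text);
    }

    synchronized void queueDelete(String id) {
        pendingUpserts.remove(id);
        pendingDeletes.add(id);
    }

    /** Cheap batch (every two minutes in the post): drop deleted documents. */
    synchronized void flushDeletes() {
        if (pendingDeletes.isEmpty()) {
            return;
        }
        Map<String, String> next = new HashMap<>(live);
        next.keySet().removeAll(pendingDeletes);
        pendingDeletes.clear();
        live = next; // swap in the finished copy
    }

    /** Expensive batch (every twenty minutes in the post): delete first, then insert. */
    synchronized void flushUpserts() {
        if (pendingUpserts.isEmpty()) {
            return;
        }
        Map<String, String> next = new HashMap<>(live);
        for (Map.Entry<String, String> e : pendingUpserts.entrySet()) {
            next.remove(e.getKey());            // no-op for a brand-new document
            next.put(e.getKey(), e.getValue()); // then insert the current version
        }
        pendingUpserts.clear();
        live = next; // swap in the finished copy
    }

    /** Read path used by searches; never blocks on a running batch. */
    String lookup(String id) {
        return live.get(id);
    }

    int size() {
        return live.size();
    }

    public static void main(String[] args) {
        BatchedIndexUpdater updater = new BatchedIndexUpdater();
        updater.queueUpsert("cv-1", "java developer");
        updater.queueUpsert("cv-1", "senior java developer");
        updater.flushUpserts();
        System.out.println(updater.lookup("cv-1"));
    }
}
```

In a running system the two flush methods would be called on timers, for
example with `ScheduledExecutorService.scheduleAtFixedRate` from
`java.util.concurrent`, using the two-minute and twenty-minute intervals
mentioned above.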

-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Chris Miller
Sent: Tuesday, June 24, 2003 12:06 PM
To: [EMAIL PROTECTED]
Subject: Re: commercial websites powered by Lucene?


Hi Nader,

I was wondering if you'd mind me asking you a couple of questions about
your implementation?

The main thing I'm interested in is how you handle updates to Lucene's
index. I'd imagine you have a fairly high turnover of CVs and jobs, so
index updates must place a reasonable load on the CPU/disk. Do you keep
CVs and jobs in the same index or two different ones? And what is the
process you use to update the index(es) - do you batch-process updates
or do you handle them in real-time as changes are made?

Any insight you can offer would be much appreciated as I'm about to
implement something similar and am a little unsure of the best approach
to take. We need to be able to handle indexing about 60,000
documents/day, while allowing (many) searches to continue operating
alongside.

Thanks!
Chris

"Nader S. Henein" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> We use Lucene at http://www.bayt.com. We're basically an on-line
> recruitment site, and up until now we've got around 500,000 CVs and
> documents indexed, with results that stump Oracle Intermedia.
>
> Nader Henein
> Senior Web Dev
>
> Bayt.com
>
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, June 04, 2003 6:09 PM
> To: [EMAIL PROTECTED]
> Subject: commercial websites powered by Lucene?
>
>
>
> Hello All,
>
> I've been trying to find examples of large commercial websites that
> use Lucene to power their search.  Having such examples would make
> Lucene an easy sell to management.
>
> Does anyone know of any good examples?  The bigger the better, and the
> more the better.
>
> TIA,
> -John
>
>
>



