Re: commercial websites powered by Lucene?

2003-06-25 Thread Ulrich Mayring
Chris Miller wrote:
> I'm not clear on why you think we'll soon be back up to the same load on the
> DB server?

Experience ;-)

If my DB server is overloaded and everyone knows that, people will not 
come to me with less-than-important ideas for additional searches and 
the like. And if they do, I can turn them away. However, once the DB 
server actually has capacity again, flocks of people will demand 
(rightly so) that their wishes be realized now: "Cool, now you can run 
my report every minute instead of every hour."

> What is going to increase the load? Our volume of data is not
> increasing, all that will change is that the DB will no longer get hit for
> searches. We'll still be pulling content etc from the database at roughly
> the same rate, but that doesn't appear to be a source of any problems.
> Whether we offload the searching to MySQL DBs or Lucene makes no difference
> as far as I can see.

Not as far as the load increase is concerned. But there are some 
differences you'll have to consider:

SQL vs. Lucene Query Language
Users/Groups/Permissions vs. nothing
Transactions vs. nothing
Vanilla indexing vs. powerful, flexible, customizable indexing
All kinds of APIs vs. Java API
need to write a replicator vs. need to write an indexer

cheers,

Ulrich



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: commercial websites powered by Lucene?

2003-06-25 Thread Nader S. Henein
On a more realistic note, we were running our search off interMedia,
which is Oracle's proprietary search engine. Our Oracle setup was
running on our monster DB server: 8 Sun UltraSPARC IIIs, 450 MHz,
with 8 gigs of RAM. We were looking at alternative search engine
solutions, and we tried about six before I read about Lucene. Had
everything been going smoothly we wouldn't have needed anything else,
but the more our site grew, the slower the searches were getting, and
I was seeing loads of 3 and 4 on the CPUs (on an 8-CPU machine). So we
decided to go with an external search engine. When Lucene finally came
on-line, CPU loads came down to 0.5, the DB was serving things faster,
and Lucene hasn't given me trouble since.

You say "However, once the DB server actually has capacity again, flocks
of people will demand (rightly so) that their wishes be realized now".
If you're talking about using Lucene to query statistical information
about data you have in the DB, then you don't need a search engine, you
need a data-miner. Lucene will pull data based on dates, keywords and
ranges, but if you want to do sums and group-bys then you shouldn't be
looking at Lucene. As for joins, the way you structure the data that
Lucene digests is up to you: I pass Lucene XML files that collect
information from 12 tables (a 12-table join). You have to think in
terms of giving Lucene your business objects, not just raw data from
tables. You can run Lucene on a smaller computer and totally isolate
its load from your DB server.
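
To make the "business objects, not raw tables" point concrete, here is a minimal sketch in Python rather than Lucene's Java API. The tables (`products`, `categories`, `reviews`) and their field names are hypothetical and not taken from the setup described above. The idea is only that the join happens once, at indexing time, and the indexer receives one flat document per object.

```python
# Hypothetical rows, as they might come back from three separate tables.
products = [
    {"id": 1, "title": "Trail tent", "category_id": 10},
    {"id": 2, "title": "Camp stove", "category_id": 11},
]
categories = {10: "Shelter", 11: "Cooking"}
reviews = [
    {"product_id": 1, "text": "Light and waterproof"},
    {"product_id": 1, "text": "Easy to pitch"},
]


def flatten(products, categories, reviews):
    """Denormalize related rows into one flat document per product."""
    # Group review text by product so each product is visited only once.
    reviews_by_product = {}
    for review in reviews:
        reviews_by_product.setdefault(review["product_id"], []).append(review["text"])

    documents = []
    for product in products:
        documents.append({
            "id": str(product["id"]),
            "title": product["title"],
            "category": categories.get(product["category_id"], ""),
            # All review text is merged into a single searchable field.
            "reviews": " ".join(reviews_by_product.get(product["id"], [])),
        })
    return documents


docs = flatten(products, categories, reviews)
```

Each dictionary in `docs` stands in for what would be handed to the indexer, so the search side never has to run the join itself.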


Nader Henein





how lucene search?

2003-06-25 Thread di99mwo
Hi.

I have read some documents about Lucene, and I have questions about how 
Lucene searches for matching words.

I think I got it wrong, but here is what I understood from the text I read:
Documents contain fields. One of the field values holds the terms that are 
searchable. When the searched term is found in the field, Lucene returns the 
field that says where the document is.

Questions:
To check whether more terms match, do you have to search through all 
documents? 
So every time you search for a word, you have to search through all documents?


I thought that when you create an inverted index, you go through all the words 
in the documents and make a list. Each word appears only once in the list. 
Each word in the list points to the documents that contain that word. And 
there is a table that says where every document is located. This way you 
don't have to search through all documents every time you do a search.

I'm confused...
Sorry for my English




Re: how lucene search?

2003-06-25 Thread Otis Gospodnetic
The search is made through the inverted index, just like you are
thinking, with word lists and postings.
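
As a rough illustration of that idea, here is a toy inverted index in Python. It is not Lucene's actual implementation or file format. The function names and sample documents are made up for the example, and tokenizing is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each distinct term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Each term appears once as a key; its value is the postings list.
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, term):
    """Look the term up directly; no document is scanned at query time."""
    return index.get(term.lower(), [])


documents = {
    1: "Lucene builds an inverted index",
    2: "a database scans rows",
    3: "the index maps terms to documents",
}
index = build_inverted_index(documents)
```

The documents are read once, when the index is built. After that, `search(index, "index")` is a single dictionary lookup that returns the ids of the matching documents.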

Otis




RE: commercial websites powered by Lucene?

2003-06-25 Thread John Takacs
Tatu,

I agree 100% with everything you've said.

Let's look at MySQL for example.  Great database.  No doubt about it.

BUT, looking at the full-text indexing/searching part... it's not up to snuff.

Currently, I'm using MySQL's full-text search support. I have a database of
3-5 million rows. Each row is unique, let's say a product. Each row has
several columns, but the two I search on are title and description. I
created a full text index on title and description. Title has approximately
100 characters, and description has 255 characters.

At the moment, MySQL is taking 50-plus seconds to return results on simple
one-word searches. My dedicated server is a P4, 2.0 GHz, 1.5 GB RAM,
RedHat Linux 7.3 platform, with nothing else running on it, i.e. another
server is handling HTTP requests. It is a dedicated MySQL box.  In addition,
I'm the only person making queries.

Obviously, the above performance is unacceptable for real world web
applications.

I'd love to try Lucene with the above, but the Lucene install fails because
of JavaCC issues.  Surprised more people haven't encountered this problem,
as the install instructions are out of date.

Regards,

John



-----Original Message-----
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 25, 2003 12:26 PM
To: Lucene Users List
Subject: Re: commercial websites powered by Lucene?


On Tuesday 24 June 2003 07:36, Ulrich Mayring wrote:
> Chris Miller wrote:
...
> Well, nothing against Lucene, but it doesn't solve your problem, which
> is an overloaded DB-Server. It may temporarily alleviate the effects,
> but you'll soon be at the same load again. So I'd recommend to install

I don't think that would necessarily be the case. Like you mention later on,
indexing data stored in DB does flatten it to allow faster indexing (and
retrieval), and faster in this context means more efficient, not only sharing
the load between DB and search engine, but potentially lowering total load?

The alternative, data warehouse-like preprocessing of data for faster
search, would likely be doable too, but it's usually more useful for running
reports. For actual searches Lucene does its job nicely and efficiently; the
biggest problems I've seen are more related to relevancy questions. But
that's where tuning of Lucene ranking should be easier than trying to build
your own ranking from raw database hits (except if one uses OracleText or
such that's pretty much a search engine on top of DB itself).

So, to me it all comes down to the "right tool for the job" aspect: DBs are
good at mass retrieval of data, or using aggregate functions (on the read-only
side), whereas dedicated search engines are better for, well, searching.

...
> Of course, in real life there may be political obstacles which will
> prevent you from doing the right thing as detailed above for example,
> and your only chance is to circumvent in some way - and then Lucene is a
> great way to do that. But keep in mind that you are basically
> reinventing the functionality that is already built-in in a database :)

It depends on type of queries, but Lucene certainly has much more advanced
text searching functionality, even if indexed content comes from a rigid
structure like RDBMS. I'm not sure using a ready product like Lucene is
reinventing much functionality, even considering synchronization issues?

So I would go as far as saying that for searching purposes, plain vanilla
RDBMSs are not all that great in the first place. Even if queries need not use
advanced search features (advanced as in not just using % and _ in addition
to exact matches) Lucene may well offer better search performance and
functionality.

-+ Tatu +-





Re: commercial websites powered by Lucene?

2003-06-25 Thread Ulrich Mayring
John Takacs wrote:
> I'd love to try Lucene with the above, but the Lucene install fails because
> of JavaCC issues.  Surprised more people haven't encountered this problem,
> as the install instructions are out of date.

Well, what do you need JavaCC for? Isn't it just the technology for 
building the supplied HTML-Parser? There are much better HTML parsers 
out there, which you can use.

Ulrich





RE: commercial websites powered by Lucene?

2003-06-25 Thread John Takacs
Good idea.  I was just following the install directions, but if I don't have
to pay attention to the install directions, I'll find a much better one.

Any hints?  Previous email discussion maybe?  I found some references via
searching the archives, but I'm not 100% convinced they are applicable to my
situation.

John






Re: commercial websites powered by Lucene?

2003-06-25 Thread Ulrich Mayring
John Takacs wrote:
> Good idea.  I was just following the install directions, but if I don't have
> to pay attention to the install directions, I'll find a much better one.
>
> Any hints?  Previous email discussion maybe?  I found some references via
> searching the archives, but I'm not 100% convinced they are applicable to my
> situation.

I'm not sure what you mean by install directions. Lucene is just a JAR 
file and you use it like any other Java class library. There's also the 
WAR file with a few demos, which you can just drop into Tomcat.

Perhaps you were trying to build it? I just downloaded the binary 
distribution and used it.

Ulrich





RE: commercial websites powered by Lucene?

2003-06-25 Thread John Takacs
Ulrich, Vince,

I think a big, "I'm a dummy" post may be in order.  ;-)

I'll do as you suggested immediately.

Regards,

John




Re: commercial websites powered by Lucene?

2003-06-25 Thread Leo Galambos


> BUT, looking at the full-text indexing/searching part... it's not up to snuff.
>
> Currently, I'm using MySQL's full-text search support. I have a database of
> 3-5 million rows. Each row is unique, let's say a product. Each row has
> several columns, but the two I search on are title and description. I
> created a full text index on title and description. Title has approximately
> 100 characters, and description has 255 characters.

Store the two columns in an extra table. It would help you.

> At the moment, MySQL is taking 50-plus seconds to return results on simple
> one-word searches. My dedicated server is a P4, 2.0 GHz, 1.5 GB RAM,
> RedHat Linux 7.3 platform, with nothing else running on it, i.e. another
> server is handling HTTP requests. It is a dedicated MySQL box.  In addition,
> I'm the only person making queries.

Did you report it to the MySQL team?

-g-





Re: commercial websites powered by Lucene?

2003-06-25 Thread Jonathan_Wasson

Believe JavaCC is used in QueryParser too.



   




RE: commercial websites powered by Lucene?

2003-06-25 Thread Otis Gospodnetic
> I'd love to try Lucene with the above, but the Lucene install fails
> because
> of JavaCC issues.  Surprised more people haven't encountered this
> problem,
> as the install instructions are out of date.

The JavaCC fix is in the queue.  Check Bugzilla for details (link on
Lucene home page).

Otis





Re: commercial websites powered by Lucene?

2003-06-25 Thread Otis Gospodnetic
> Well, what do you need JavaCC for? Isn't it just the technology for 
> building the supplied HTML-Parser? There are much better HTML parsers
> out there, which you can use.

Its primary use in the Lucene package is for parsing users' queries.

Otis





Re: commercial websites powered by Lucene?

2003-06-25 Thread Tatu Saloranta
On Wednesday 25 June 2003 09:47, Ulrich Mayring wrote:
> John Takacs wrote:
> > I'd love to try Lucene with the above, but the Lucene install fails
> > because of JavaCC issues.  Surprised more people haven't encountered this
> > problem, as the install instructions are out of date.
>
> Well, what do you need JavaCC for? Isn't it just the technology for
> building the supplied HTML-Parser? There are much better HTML parsers
> out there, which you can use.

On a related note; has anyone done performance measurements for various
HTML parsers used for indexing?

I have written a couple of XML/HTML parsers that were optimized for speed 
(and/or leniency, to be able to handle/fix non-valid documents), and was 
wondering if they might be useful for indexing purposes for other people (one 
is in general pretty optimal if document contents are fully in memory 
already, like when fetching from DB; another uses very little memory, while 
being only slightly slower). However, using those as opposed to more standard 
ones would only make sense if there are significant speed improvements.
And to judge that, it would be good to have baseline measurements, and/or to 
know what the current best candidates are from a performance perspective.

The thing is that creating a parser that only cares about textual content (and 
perhaps in some cases about the surrounding element, but not about attributes, 
structure, DTD/Schema, validity etc) is fairly easy, and since indexing is 
often the most CPU-intensive part of a search engine, it may make sense to try 
to optimize this part heavily, up to and including using specialized parsers.
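
To show what "only cares about textual content" might look like, here is a small sketch built on Python's standard `html.parser` module. It is not one of the parsers described above and makes no performance claim. The class name `TextOnlyParser` and the helper `extract_text` are made up for the example, and skipping `script` and `style` content is an assumption about what an indexer would want.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect character data; ignore attributes, structure and validity."""

    SKIPPED = {"script", "style"}  # content of these elements is not indexable text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Attributes are received but deliberately never inspected.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = '<p class="x">Hello <b>world</b></p><style>p {color: red}</style><p>never closed'
text = extract_text(sample)
```

The unclosed final `<p>` in `sample` is tolerated because nothing here checks nesting, which is the kind of leniency mentioned above. Any baseline measurement would still need to be run separately against the candidate parsers.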

-+ Tatu +-

