Re: Why we use Lucene for Database search like Oracle / Sybase ?

Kan Deng Tue, 17 Jan 2006 15:37:56 -0800

For many applications, a search engine and a database
are competing solutions. Choosing between them takes
careful thought, and in some cases the border is
blurred.

There are several issues to consider when selecting a
database, Lucene, or both as the solution to a
specific application.

1. Data change frequency. 

   Suppose there is a "restaurant" table, which
contains a "current customers" column. Since the
content of this column is updated very frequently,
Lucene is not a good choice: every change to an
indexed document means deleting it and adding it
again.

   However, for the "menu" column, which changes
rarely, Lucene is definitely a strong candidate for
indexing and searching.

2. Query. 

   Suppose the "restaurant" table contains a "menu"
column. There are two ways to store the content in
that column. One is to put all the dishes into it as
one big comma-delimited string. The other is to give
each dish its own row, so that a single restaurant
may have multiple rows.

   If the menu is stored as one big string, a
conventional RDBMS cannot do the search efficiently,
because "select * from restaurant where menu like
'%BBQ%'" will induce a full table scan: the leading
wildcard keeps a B+tree index from being used.

   However, if the dishes are stored in separate
rows, an RDBMS is among the candidates. More analysis
is needed before making the decision.
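
As a rough illustration of the two layouts, here is a minimal
sketch using Python's built-in sqlite3 module as a stand-in for
an RDBMS. The "dish" table, the index names and the choice of
SQLite are my assumptions for the example; other databases
report their plans differently.

```python
import sqlite3

def query_plan(db, sql):
    """Return the 'detail' text (last column) of each EXPLAIN QUERY PLAN row."""
    return [row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + sql)]

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Layout 1: all dishes in one comma-delimited string.
    CREATE TABLE restaurant (id INTEGER PRIMARY KEY, name TEXT, menu TEXT);
    CREATE INDEX idx_restaurant_menu ON restaurant (menu);
    -- Layout 2 (hypothetical table): one row per dish.
    CREATE TABLE dish (restaurant_id INTEGER, name TEXT);
    CREATE INDEX idx_dish_name ON dish (name);
""")

# The leading wildcard means the index on "menu" cannot narrow the search.
like_plan = query_plan(db, "SELECT * FROM restaurant WHERE menu LIKE '%BBQ%'")
# An exact match on a single dish can use the index on dish.name.
exact_plan = query_plan(db, "SELECT restaurant_id FROM dish WHERE name = 'BBQ'")

print(like_plan)
print(exact_plan)
```

The first plan should report a scan of the whole table, the
second a search through idx_dish_name.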

3. Performance. 

   Intuitively, B+tree based search sounds faster than
inverted-index search. However, since an inverted
index keeps its data sorted, its search performance
does not fall far behind B+tree search. Imagine an
inverted index "0, 1, 2, 3, 4, 5, 6, 7, 8". The
"hopping" algorithm can be implemented in a way
identical to binary search: start with "4", then go
to either "2" or "6", and so on. Lucene uses
fixed-interval hopping, mainly because it is
convenient to implement and performs satisfactorily,
even though it is theoretically not as good as binary
search or B+tree search.
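
To make the two hopping strategies concrete, here is a small
sketch over the same posting list. It is a simplified
illustration, not Lucene's actual code; the function names and
the interval of 4 are made up for the example.

```python
from bisect import bisect_left

def advance_binary(postings, target):
    """Index of the first posting >= target, found by binary search."""
    return bisect_left(postings, target)

def advance_fixed_skip(postings, target, interval=4):
    """Same result, but hop `interval` entries at a time, then scan linearly."""
    i = 0
    # Hop forward while the entry one interval ahead is still below the target.
    while i + interval < len(postings) and postings[i + interval] < target:
        i += interval
    # Finish with a linear scan inside the last interval.
    while i < len(postings) and postings[i] < target:
        i += 1
    return i

postings = [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(advance_binary(postings, 6), advance_fixed_skip(postings, 6))
```

Both functions land on the same position. Binary search needs
O(log n) comparisons, while fixed-interval hopping needs roughly
n/interval hops plus up to interval scan steps, which is the
theoretical gap mentioned above.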

   The above comparison assumes the B+tree and the
inverted index are loaded fully into RAM. If the
index is so big that it must be swapped between RAM
and disk, the inverted index tends to outperform the
B+tree index. The reason is that a B+tree index
occupies more space, which induces more disk IO and
consequently reduces overall performance.

4. Infrastructure. 

   RDBMS products consume a lot of computational
resources, unless the user is willing to break into
the RDBMS's source code and strip out unwanted
functionality such as the SQL language. As a result,
the RDBMS usually runs on a standalone machine and
uses JDBC/ODBC to interact, over the network, with an
application running on another machine. Keep in mind
that on average it costs 10-1000 milliseconds to
transmit data through the network, depending on the
size of your data. If one wants a server to run fast
and scale well like Google.com, even 10 milliseconds
is a big cost that one has to think hard about how to
save.
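
A back-of-the-envelope sketch of why 10 milliseconds matters.
The 10 ms figure is the low end of the range above; the count
of 5 queries per page is a hypothetical number chosen only for
illustration.

```python
def max_sequential_requests_per_second(round_trip_ms):
    """Upper bound on round trips one connection can make back to back."""
    return 1000.0 / round_trip_ms

def network_time_ms(queries_per_request, round_trip_ms):
    """Time spent on the network if the queries are issued one after another."""
    return queries_per_request * round_trip_ms

# At 10 ms per round trip, one connection tops out at 100 round trips/second.
print(max_sequential_requests_per_second(10))
# A page that issues 5 sequential queries spends 50 ms on the network alone.
print(network_time_ms(5, 10))
```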

   Therefore, for better performance, it may be more
appropriate to put the application logic on the same
machine as the data repository. However, if one wants
to use a conventional RDBMS as a black box, the
one-tier solution may not work well, because the RDBMS
consumes so much computational resource that little
remains for the application logic.

   As a best practice, an embedded database is
preferable to a standalone black-box RDBMS,
especially for a non-critical data repository, and it
usually performs better.

   The comparison between an embedded database and
Lucene is subtle. Sometimes it is reasonable to merge
Lucene with an embedded database.

5. Integrate Lucene with embedded database.

   One reason to integrate Lucene with an embedded
database is to support blind queries like "select *
from restaurant where menu like '%BBQ%'".

   It is possible to integrate Lucene with an embedded
database, but one difficulty is how to update the
inverted indexes frequently. This is an interesting
problem; however, to the best of my knowledge, there
is no good solution for the time being.
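
To show the shape of such an integration, here is a toy sketch.
It uses sqlite3 as the embedded database and a plain in-memory
dictionary in place of Lucene, so it only illustrates the
bookkeeping, not Lucene's API. The class and method names are
hypothetical, and the search matches whole words rather than
arbitrary substrings as LIKE '%BBQ%' would.

```python
import re
import sqlite3
from collections import defaultdict

class RestaurantStore:
    """Embedded database as primary store, plus a toy inverted index."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE restaurant (id INTEGER PRIMARY KEY, name TEXT, menu TEXT)")
        self.index = defaultdict(set)  # term -> set of restaurant ids

    @staticmethod
    def _terms(menu):
        # Lowercased alphanumeric words of the menu text.
        return set(re.findall(r"[a-z0-9]+", menu.lower()))

    def add(self, name, menu):
        # Write the row and its postings together, so both stay in sync.
        cur = self.db.execute(
            "INSERT INTO restaurant (name, menu) VALUES (?, ?)", (name, menu))
        rid = cur.lastrowid
        for term in self._terms(menu):
            self.index[term].add(rid)
        return rid

    def update_menu(self, rid, new_menu):
        row = self.db.execute(
            "SELECT menu FROM restaurant WHERE id = ?", (rid,)).fetchone()
        if row is None:
            raise KeyError(rid)
        # An update is a delete of the old postings followed by a re-add.
        for term in self._terms(row[0]):
            self.index[term].discard(rid)
        self.db.execute(
            "UPDATE restaurant SET menu = ? WHERE id = ?", (new_menu, rid))
        for term in self._terms(new_menu):
            self.index[term].add(rid)

    def search(self, term):
        # Look up ids in the inverted index, then fetch rows by primary key.
        ids = sorted(self.index.get(term.lower(), ()))
        if not ids:
            return []
        marks = ",".join("?" * len(ids))
        rows = self.db.execute(
            f"SELECT name FROM restaurant WHERE id IN ({marks}) ORDER BY id", ids)
        return [r[0] for r in rows]
```

Every update touches the postings of both the old and the new
text, which hints at why frequent updates are the hard part.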

Kan

--- "Peter A. Daly" <[EMAIL PROTECTED]> wrote:

> In many cases that essentially require traditional RDBMS but also
> require lucene like functionality, I would use the database as the
> primary data store.  I would then either:
> 1.  Update the lucene index using data from the database based on a
>     scheduled process.
> 2.  As records are added, add them to both Lucene and the database.
>
> It's a little extra work (and space,) but you get the best of both
> worlds.
> 
> -Pete
> 
> On 1/17/06, John Powers <[EMAIL PROTECTED]> wrote:
> >
> > Would you say as a best practice that you can use both?  When would
> > you and when wouldn't you?  I trust databases more then free files,
> > so I am putting my more sensitive and volatile data in the database.
> > If you built a commenting system.. like a blog or an flickr type
> > app, would just a lucene solution be best?  The problem with both of
> > course is syncing..
> >
> >
