RE: Why we use Lucene for Database search like Oracle / Sybase ?

John Powers Tue, 17 Jan 2006 13:02:33 -0800

Would you say as a best practice that you can use both? When would
you and when wouldn't you? I trust databases more than free files, so I
am putting my more sensitive and volatile data in the database. If you
built a commenting system, like a blog or a Flickr-type app, would
a Lucene-only solution be best? The problem with using both, of course,
is syncing.

-----Original Message-----
From: Kan Deng [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, January 17, 2006 2:23 PM
To: java-user@lucene.apache.org
Subject: Re: Why we use Lucene for Database search like Oracle / Sybase ?

1. A conventional database uses a B+tree as its
indexing mechanism, while a search engine uses an
inverted index.

   When users need to update the data frequently, a
B+tree is the better choice. For a search engine,
however, the data and index don't change very often.

   Inverted indexes are tables. For example, Lucene's .tis
index file is a table with columns including "Field
name", "Field value" and "doc freq". However, there
is no B+tree associated with those tables.
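
   The table view described above can be sketched in a few lines of
Python. This is only an illustration under simple assumptions
(whitespace tokenizing, lowercasing, in-memory dicts); the helper
names build_inverted_index and term_table are made up for this example
and are not Lucene APIs, and the real .tis file is a binary format.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map (field name, field value) -> sorted list of doc ids."""
    postings = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for field, text in doc.items():
            # set() so a doc is listed once per term even if the term repeats
            for term in set(text.lower().split()):
                postings[(field, term)].append(doc_id)
    return postings

def term_table(postings):
    """Rows of (field name, field value, doc freq), sorted by field and term."""
    return [(field, term, len(ids))
            for (field, term), ids in sorted(postings.items())]

# Hypothetical sample documents
docs = [
    {"title": "lucene in action", "body": "search with lucene"},
    {"title": "oracle tuning", "body": "search the database"},
]
postings = build_inverted_index(docs)
for row in term_table(postings):
    print(row)
```

   Doc ids are appended in increasing order, so each posting list
stays sorted without any tree structure on top of it.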

   An inverted index consumes less space than a
B+tree index. Imagine a table with an ID column and
N rows. Both indexes grow roughly as O(N), but the
B+tree carries extra overhead for its interior nodes
and for pages kept partly empty to absorb inserts,
while the inverted index can be stored densely packed.

2. Searching:

   A conventional RDBMS searches by descending the B+tree. A
search engine instead "hops" along the inverted index, which
is sorted. Suppose a sorted list in the inverted index is
"0, 3, 4, 6, 20, 29, 39, 60, 202". To search for a
given value, say 6, the search engine may start at
"0", then hop with a fixed length, say 4, to "20"
(and on to "202" for a larger target). Once it has
hopped past the target, it hops backwards with a
smaller pace.
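
   The hopping described above can be sketched as follows. This is a
minimal illustration, not Lucene's actual skip implementation: the
function name hop_search and the default stride of 4 are assumptions
taken from the example in the text, and the "smaller pace" is realized
here by repeatedly halving the last window that was hopped over.

```python
def hop_search(values, target, stride=4):
    """Return the index of target in the sorted list values, or -1."""
    if not values:
        return -1
    last = len(values) - 1
    pos = 0
    # Phase 1: hop forward with a fixed stride until we reach or pass the target.
    while values[pos] < target and pos < last:
        pos = min(pos + stride, last)
    # Phase 2: the target, if present, lies in the window just hopped over.
    # Narrow that window by halving it, i.e. hopping with a shrinking pace.
    lo, hi = max(pos - stride, 0), pos
    while lo <= hi:
        mid = (lo + hi) // 2
        if values[mid] == target:
            return mid
        if values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

values = [0, 3, 4, 6, 20, 29, 39, 60, 202]
print(hop_search(values, 6))    # index of 6 in the list
print(hop_search(values, 5))    # -1, because 5 is not in the list
```

   Both phases rely on the list being sorted, which is the strong
assumption discussed in the next paragraph.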

   A search engine assumes the inverted indexes are
sorted. This is a strong assumption, and it is
very hard to maintain if the user can update the data,
and thus the index, at any time. A B+tree doesn't rely
on this strong assumption.

   When the dataset and index are small, a B+tree search
is faster than an inverted index search. With gigabytes
of data, however, inverted index search tends to be
faster, because the inverted index is smaller, so less
disk I/O is required.

3. Compression. 

  Lucene compresses the data in its inverted indexes,
because it assumes the indexes do not change very
frequently. A B+tree typically isn't compressed, because
it can't assume the same stability of its indexes.
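
  One common way to compress a sorted posting list is to store the
gaps between consecutive doc ids as variable-length integers, low
seven bits first, with the high bit marking "more bytes follow". The
sketch below assumes that scheme; the function names are made up for
this example, and it is an illustration of the idea rather than a copy
of Lucene's file format code.

```python
def encode_vint(n):
    """Encode a non-negative int, 7 bits per byte, high bit = more bytes follow."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def compress_postings(doc_ids):
    """Store each doc id as the gap from the previous one (ids must be sorted)."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += encode_vint(doc_id - prev)
        prev = doc_id
    return bytes(out)

def decompress_postings(data):
    """Inverse of compress_postings."""
    ids, current, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # more bytes belong to this gap
        else:
            current += value    # gap complete: add it to the running doc id
            ids.append(current)
            value, shift = 0, 0
    return ids

doc_ids = [0, 3, 4, 6, 20, 29, 39, 60, 202]
packed = compress_postings(doc_ids)
print(len(packed), "bytes for", len(doc_ids), "doc ids")
```

  Because every value depends on the one before it, inserting a doc id
in the middle means re-encoding everything after it, which is why this
kind of compression suits indexes that rarely change.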

4. An inverted index doesn't require a fixed set of
columns.

  A conventional RDBMS assumes that every row of a table
has the same columns. When a row needs some extra
columns, a workaround is to use "flex". Each Lucene
Document, however, may have a different set of
fields.
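
  To illustrate documents with different fields, here is a small
sketch in which each document is a plain dict with its own set of
keys. The sample documents and the search helper are hypothetical, and
the index is an in-memory dict rather than anything Lucene provides.

```python
from collections import defaultdict

# Hypothetical documents; no two share the same set of fields.
docs = [
    {"title": "holiday photos", "tags": "beach sunset"},
    {"title": "release notes", "version": "1.4"},
    {"title": "blog comment", "author": "john", "tags": "lucene"},
]

index = defaultdict(set)
for doc_id, doc in enumerate(docs):
    for field, text in doc.items():
        for term in text.lower().split():
            index[(field, term)].add(doc_id)

def search(field, term):
    """Doc ids whose given field contains term; empty if no doc has that field."""
    return sorted(index.get((field, term), ()))

print(search("tags", "lucene"))
print(search("version", "1.4"))
```

  No schema has to be declared up front: a field exists in the index
only for the documents that actually carry it.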

5. An inverted index is convenient for aliases and synonyms.

  Since inverted indexes refer to the original dataset
by offset pointers, it is convenient to inject aliases
and synonyms into the inverted indexes, as long as they
point to the same offset in the original dataset.

  However, it is not very convenient to do the same
job with a conventional database powered by a B+tree.
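
  A rough sketch of the alias idea, assuming the index is a dict from
term to posting list: an alias is just one more key that refers to the
same list. The add_alias helper and the sample terms are made up for
this example; a real engine would more likely inject synonyms while
analyzing the text.

```python
# Hypothetical posting lists: term -> sorted doc ids
postings = {
    "car": [1, 4, 7],
    "boat": [2, 4],
}

def add_alias(postings, alias, term):
    """Make alias resolve to the very same posting list as term."""
    postings[alias] = postings[term]

add_alias(postings, "automobile", "car")

# A later addition under one name is visible under the other,
# because both keys point at the same list object.
postings["car"].append(9)
print(postings["automobile"])
```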

6. Ranking. 

  A search engine's implementation makes it convenient
to rank the search results. With a conventional
database, it is not so convenient.
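
  As an illustration of why ranking comes naturally, the sketch below
scores documents with a simplified TF-IDF weight (term count times
log(N / doc freq) + 1). This formula and the rank function are
assumptions chosen for brevity; they are not Lucene's actual scoring.

```python
import math
from collections import Counter

def rank(docs, query):
    """Return doc ids that match any query term, best score first."""
    n = len(docs)
    counts_per_doc = [Counter(text.lower().split()) for text in docs]
    scored = []
    for doc_id, counts in enumerate(counts_per_doc):
        score = 0.0
        for term in query.lower().split():
            # doc freq: number of docs containing the term
            df = sum(1 for c in counts_per_doc if term in c)
            if df == 0 or counts[term] == 0:
                continue
            score += counts[term] * (math.log(n / df) + 1.0)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by lower doc id
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

# Hypothetical sample documents
docs = ["lucene search engine", "database search", "lucene lucene index"]
print(rank(docs, "lucene"))
```

  The term counts and doc freq values needed for the score are exactly
what the inverted index already stores, so ranking needs no extra pass
over the data.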

7. A search engine doesn't bother with SQL.

  A conventional database usually supports SQL
to make it convenient for users to organize data, set
up indexes, run queries, etc. However, SQL's convenience
comes at a price, because the RDBMS engine has to
compile most queries and figure out an execution plan
for them.

  SQL is not mandatory for a database, however.
An embedded database like BerkeleyDB (Sleepycat)
doesn't support SQL and avoids that overhead, so it
can be faster than going through SQL.

  Lucene doesn't support an SQL-like language, but it would
be possible to add one if people want SQL's convenience.

In summary, for many applications, a search engine and
a database are competing solutions. One has to
think carefully before choosing either one,
and in some cases the border is blurred.

Kan


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

