Re: Why we use Lucene for Database search like Oracle / Sybase ?

Peter A. Daly Tue, 17 Jan 2006 13:35:45 -0800

In many cases that essentially require a traditional RDBMS but also need
Lucene-like functionality, I would use the database as the primary data
store.  I would then either:
1.  Update the Lucene index from the database with a scheduled process.
2.  As records are added, add them to both Lucene and the database.
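
A minimal sketch of both options, assuming SQLite (through Python's
sqlite3 module) stands in for the RDBMS and a toy in-memory map stands in
for the Lucene index. The comments table, ToyIndex, add_comment and
sync_index are hypothetical names for illustration, not Lucene API, and
the sketch covers inserts only (no updates or deletes).

```python
import sqlite3
from collections import defaultdict


class ToyIndex:
    """Stand-in for a Lucene index: maps each term to a set of record ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, rec_id, text):
        for term in text.lower().split():
            self.postings[term].add(rec_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))


def add_comment(db, index, text):
    """Option 2: write the record to the database, then to the index."""
    cur = db.execute("INSERT INTO comments (body) VALUES (?)", (text,))
    db.commit()
    index.add(cur.lastrowid, text)
    return cur.lastrowid


def sync_index(db, index, last_id):
    """Option 1: called on a schedule, indexes rows added after last_id."""
    rows = db.execute(
        "SELECT id, body FROM comments WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    for rec_id, body in rows:
        index.add(rec_id, body)
        last_id = rec_id
    return last_id  # remember this value for the next scheduled run


# The database is the primary store; the index can always be rebuilt from it.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
```

Because the database stays authoritative, a lost or stale index can be
rebuilt by calling sync_index with last_id set to 0.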


It's a little extra work (and space), but you get the best of both worlds.

-Pete

On 1/17/06, John Powers <[EMAIL PROTECTED]> wrote:
>
> Would you say as a best practice that you can use both? When would
> you and when wouldn't you? I trust databases more than free files, so I
> am putting my more sensitive and volatile data in the database. If you
> built a commenting system, like a blog or a Flickr-type app, would
> just a Lucene solution be best? The problem with both, of course, is
> syncing.
>
> -----Original Message-----
> From: Kan Deng [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 17, 2006 2:23 PM
> To: java-user@lucene.apache.org
> Subject: Re: Why we use Lucene for Database search like Oracle / Sybase
> ?
>
>
> 1. A conventional database uses a B+tree as its
> indexing mechanism, while a search engine uses an
> inverted index.
>
>    When the user needs to update the data frequently,
> a B+tree is the better choice. For a search engine,
> however, the data and index don't change very often.
>
>    Inverted indexes are tables. For example, the .tis
> index file is a table with columns including "Field
> name", "Field value" and "doc freq". However, there
> is no B+tree associated with those tables.
>
>    An inverted index consumes less space than a
> B+tree index. Imagine a table that has an ID column.
> If there are N rows in this table, the size of the
> B+tree is O(N log(N)), but the size of the inverted
> index is O(N).
>
> 2. Searching:
>
>    A conventional RDBMS searches the B+tree, but a
> search engine "hops" along the inverted index, which
> is sorted. Suppose the sorted entries of an inverted
> index are "0, 3, 4, 6, 20, 29, 39, 60, 202". To search
> for a given value, say 6, the search engine may start
> at "0", then hop with a fixed stride, say 4, to "20",
> then to "202". If necessary, it hops backwards, but
> with a shorter stride.
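
The hopping described above can be sketched as follows. This is a toy
illustration, not Lucene's actual skip-list code: hop_search and its
stride parameter are hypothetical names, and instead of overshooting and
hopping back it peeks at the next landing point and then scans the one
block that can contain the target.

```python
def hop_search(postings, target, stride=4):
    """Return the index of target in the sorted list postings, or -1."""
    if not postings:
        return -1
    pos = 0
    # Hop forward while the next landing point does not overshoot the target.
    while pos + stride < len(postings) and postings[pos + stride] <= target:
        pos += stride
    # If present, the target lies in postings[pos : pos + stride]; scan it.
    for i in range(pos, min(pos + stride, len(postings))):
        if postings[i] == target:
            return i
        if postings[i] > target:
            break  # sorted input: the target cannot appear later
    return -1


postings = [0, 3, 4, 6, 20, 29, 39, 60, 202]
print(hop_search(postings, 6))
print(hop_search(postings, 60))
```

The search is only correct because the list is sorted, which is the
assumption the next paragraph discusses.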
>
>    A search engine assumes the inverted indexes are
> sorted. This is a strong assumption, and it is very
> hard to maintain if the user can update the data,
> and thus the index, at any time. A B+tree does not
> depend on this strong assumption.
>
>    When the dataset and index are small, a B+tree is
> faster than an inverted index search. With gigabytes
> of data, however, inverted index search tends to be
> faster, because the inverted index is smaller and so
> requires less disk I/O.
>
>
> 3. Compression.
>
>   Lucene compresses the data in its inverted indexes,
> because it assumes the indexes do not change very
> frequently. A B+tree, however, is not compressed,
> because it cannot assume the same stability of its
> indexes.
>
>
> 4. An inverted index doesn't require a fixed set of
> columns.
>
>   A conventional RDBMS assumes that every row of a
> table has the same columns. When a row may have some
> extra columns, a workaround is to use "flex".
> However, each Lucene Document may have different
> fields.
>
>
> 5. An inverted index is convenient for aliases and
> synonyms.
>
>   Since inverted indexes refer to the original dataset
> by offset pointers, it is convenient to inject aliases
> and synonyms into the inverted indexes, as long as
> they point to the same offset in the original dataset.
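
A small sketch of that idea, assuming a toy in-memory index in which
each alias is stored as an extra term pointing at the same document id.
The SYNONYMS table and build_index are hypothetical and are not part of
Lucene.

```python
from collections import defaultdict

# Hypothetical synonym table: indexed term -> aliases to inject.
SYNONYMS = {"fast": ["quick", "rapid"]}


def build_index(docs, synonyms=SYNONYMS):
    """Map each term, and each alias of that term, to the ids of its documents."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
            # The alias entry points at the same document as the real term.
            for alias in synonyms.get(term, ()):
                index[alias].add(doc_id)
    return index


docs = ["fast search", "slow scan"]
index = build_index(docs)
print(sorted(index["quick"]))  # ids of documents matched through the alias
```

The original documents are untouched; only the index gains extra entries.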
>
>   However, it is not very convenient to do the same
> job with a conventional database powered by a B+tree.
>
> 6. Ranking.
>
>   A search engine's implementation makes it convenient
> to rank the search results. With a conventional
> database, it is not so convenient.
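
To illustrate why ranking comes naturally with term statistics, here is
a minimal sketch that scores documents with a simple TF-IDF formula.
The rank function and the scoring formula are assumptions chosen for
illustration, not Lucene's actual scoring.

```python
import math


def rank(docs, query):
    """Return ids of matching documents, best TF-IDF score first."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    scores = {}
    for term in query.lower().split():
        # Document frequency: how many documents contain the term.
        df = sum(1 for tokens in tokenized if term in tokens)
        if df == 0:
            continue
        idf = math.log(n / df) + 1.0  # rarer terms weigh more
        for doc_id, tokens in enumerate(tokenized):
            tf = tokens.count(term)  # term frequency within this document
            if tf:
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = ["lucene search search", "database search", "lucene index"]
print(rank(docs, "search"))
```

The document frequency used here is the same kind of "doc freq" value
the term table in point 1 already stores.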
>
> 7. A search engine doesn't bother with SQL.
>
>   A conventional database usually supports SQL to
> make it convenient for the user to organize data, set
> up indexes, query, etc. However, SQL's convenience
> comes at a price, because the RDBMS engine has to
> handle compilation and figure out an execution plan
> for most queries.
>
>   However, SQL is not mandatory for an RDBMS: an
> embedded database like BerkeleyDB (Sleepycat)
> doesn't support SQL and is therefore faster than one
> that uses SQL.
>
>   Lucene doesn't support an SQL-like language, but it
> would be possible to add one if people like SQL's
> convenience.
>
>
> In summary, for many applications, a search engine
> and a database are competing solutions. One has to
> consider the choice between them in depth, and in
> some cases the border is blurred.
>
>
> Kan
