In many cases that essentially require a traditional RDBMS but also need Lucene-like functionality, I would use the database as the primary data store. I would then either:

1. Update the Lucene index from the database with a scheduled process.
2. As records are added, add them to both Lucene and the database.
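
Here is a rough sketch of both options. RecordStore and SearchIndex are hypothetical in-memory stand-ins for the database table and the Lucene index (they are not real Lucene or JDBC classes), and CommentService is just an illustrative name, so treat this as the shape of the idea rather than working integration code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical stand-in for the database table (id -> text); the primary store.
final class RecordStore {
    private final Map<Integer, String> rows = new LinkedHashMap<>();

    void insert(int id, String text) { rows.put(id, text); }

    Map<Integer, String> all() { return rows; }
}

// Hypothetical stand-in for the Lucene index: term -> sorted ids of matching records.
final class SearchIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    void add(int id, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    void clear() { postings.clear(); }

    List<Integer> search(String term) {
        String key = term.toLowerCase(Locale.ROOT);
        return new ArrayList<>(postings.getOrDefault(key, new TreeSet<>()));
    }
}

final class CommentService {
    final RecordStore db = new RecordStore();
    final SearchIndex index = new SearchIndex();

    // Option 2: write to the database first, then mirror the record into the index.
    void addComment(int id, String text) {
        db.insert(id, text);
        index.add(id, text);
    }

    // Option 1: rebuild the index from the database; meant to be called by a scheduled job.
    void reindexFromDatabase() {
        index.clear();
        for (Map.Entry<Integer, String> row : db.all().entrySet()) {
            index.add(row.getKey(), row.getValue());
        }
    }
}
```

Because the database stays the source of truth, a failed or missed index write is recoverable: the next reindexFromDatabase() run brings the index back in line.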
It's a little extra work (and space), but you get the best of both worlds.

-Pete

On 1/17/06, John Powers <[EMAIL PROTECTED]> wrote:
>
> Would you say as a best practice that you can use both? When would
> you and when wouldn't you? I trust databases more than free files, so I
> am putting my more sensitive and volatile data in the database. If you
> built a commenting system, like a blog or a Flickr-type app, would
> just a Lucene solution be best? The problem with both, of course, is
> syncing.
>
> -----Original Message-----
> From: Kan Deng [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 17, 2006 2:23 PM
> To: java-user@lucene.apache.org
> Subject: Re: Why we use Lucene for Database search like Oracle / Sybase ?
>
> 1. The conventional database uses a B+tree as the
> indexing mechanism, while a search engine uses an
> inverted index.
>
> When the user needs to update the data frequently, a
> B+tree is the better choice. For a search engine, however,
> the data and index don't change too often.
>
> Inverted indexes are tables. For example, the .tis
> index file is a table of columns including "Field
> name", "Field value" and "doc freq". However, there
> is no B+tree associated with those tables.
>
> An inverted index consumes less space than a
> B+tree index. Imagine there is a table which has
> an ID column. Suppose there are N rows in this table:
> the size of the B+tree is O(Nlog(N)), but the size of
> the inverted index is O(N).
>
> 2. Searching:
>
> A conventional RDBMS searches within the B+tree. But a
> search engine "hops" along the inverted index, which
> is sorted. Suppose a field value of the inverted index is
> "0, 3, 4, 6, 20, 29, 39, 60, 202". To search for a
> given value, say 6, the search engine may start with
> "0", then hop with a fixed length, say 4, to "20",
> then to "202". If necessary, it "hops" backwards, but
> at a reduced pace.
>
> A search engine assumes the inverted indexes are
> sorted.
> This is a strong assumption, and it is
> very hard to maintain if the user can update the data,
> and thus the index, at any time. A B+tree doesn't rely on this
> strong assumption.
>
> When the dataset and index are small, a B+tree is
> faster than inverted index search. However, with
> gigabytes, inverted index search tends to be faster,
> because the inverted index is smaller in size, and thus less
> disk IO is required.
>
> 3. Compression.
>
> Lucene compresses the data in its inverted indexes,
> because it assumes the indexes do not change very
> frequently. However, a B+tree doesn't compress, because
> it doesn't assume the same stability of its indexes.
>
> 4. An inverted index doesn't require a fixed set of
> columns.
>
> A conventional RDBMS assumes that every row of a table
> must have the same columns. When one row may have
> some extra columns, a workaround is to use "flex".
> However, each Document in Lucene may have different
> fields.
>
> 5. An inverted index is convenient for aliases/synonyms.
>
> Since inverted indexes refer to the original dataset
> by offset pointers, it is convenient to inject aliases
> and synonyms into the inverted indexes, as long as they
> point to the same offset of the original dataset.
>
> However, it is not very convenient to do the same
> job with a conventional database powered by a B+tree.
>
> 6. Ranking.
>
> A search engine's implementation makes it convenient
> to rank the search results. But with a conventional
> database, it is not so convenient.
>
> 7. A search engine doesn't bother with the SQL language.
>
> Usually a conventional database supports the SQL language
> to make it convenient for the user to organize data, set
> up indexes, query, etc. However, SQL's convenience comes
> at a price, because the RDBMS engine has to handle the
> compilation and figure out an execution plan for most
> queries.
>
> However, the SQL language is not mandatory for an RDBMS.
> An embedded database like BerkeleyDB (Sleepycat)
> doesn't support SQL, and is therefore faster than
> using SQL.
>
> Lucene doesn't support an SQL-like language. But it would be
> possible to add one if people like SQL's convenience.
>
> In summary, for many applications, a search engine and a
> database are competing solutions. One has to
> consider in depth whether to choose a search engine or a
> database, and in some cases, the border is blurred.
>
> Kan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
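
P.S. For anyone curious about the "hopping" search in Kan's point 2 quoted above, here is a minimal sketch of the idea over a sorted array of values. It is not Lucene's actual implementation; HopSearch and find are hypothetical names, and this variant probes ahead before each hop instead of overshooting and hopping backwards, then halves the hop length until it reaches 1.

```java
// Sketch of a fixed-hop search over a sorted list, with a shrinking hop length.
final class HopSearch {
    // Returns the position of target in sorted[], or -1 if it is absent.
    // sorted[] must be in ascending order; hop is the initial hop length.
    static int find(int[] sorted, int target, int hop) {
        if (sorted.length == 0 || sorted[0] > target) {
            return -1;
        }
        int pos = 0; // invariant: sorted[pos] <= target
        for (int step = Math.max(hop, 1); step >= 1; step /= 2) {
            // Hop forward only while the landing spot does not pass the target.
            while (pos + step < sorted.length && sorted[pos + step] <= target) {
                pos += step;
            }
        }
        // pos is now the last position whose value is <= target.
        return sorted[pos] == target ? pos : -1;
    }
}
```

With Kan's list {0, 3, 4, 6, 20, 29, 39, 60, 202}, a target of 6 and an initial hop of 4, the search sees that position 4 holds 20 (too far), drops to hops of 2 and then 1, and ends at position 3. The sorted order is what makes the hops safe, which is why frequent in-place updates are a poor fit for this structure.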