Would you say as a best practice that you can use both? When would you and when wouldn't you? I trust databases more then free files, so I am putting my more sensitive and volatile data in the database. If you built a commenting system.. like a blog or an flickr type app, would just a lucene solution be best? The problem with both of course is syncing..
-----Original Message----- From: Kan Deng [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 17, 2006 2:23 PM To: java-user@lucene.apache.org Subject: Re: Why we use Lucene for Database search like Oracle / Sybase ? 1. The conventional database uses B+tree as the indexing mechanism, while search engine uses inverted-index. When user needs to update the data frequently, then B+tree is a better choice. However, for search engine, the data and index doesn't change too often. Inverted indexes are tables. For example, .tis index file is a table of columns including "Field name", "Field value" and "doc freq". However, there are no B+tree associated with those tables. Inverted index is less space consuming comparing with B+tree index. Imagine there is a table which has a ID column. Suppose there are N rows in this table, the size of B+tree is O(Nlog(N)), but the size of inverted index is O(N). 2. Searching: Conventional RDBMS searches among the B+tree. But a search engine "hops" along the inverted index, which is sorted. Suppose a field value of inverted index is "0, 3, 4, 6, 20, 29, 39, 60, 202", to search for a given value say 6, the search engine may start with "0", then hop with a fixed length, say 4, to "20", then to "202". If nesssary, it "hops" backforwards but with shrinked pace. A search engine assumes the inverted indexes are sorted. This is a strong assumption, especially it is very hard to maintain if the user can update the data thus index at any time. B+tree doesn't comply to this strong assumption. When the dataset and index is small, B+tree is faster than inverted index search. However, with gigabytes, inverted index search tends to be faster, because inverted index is smaller in size, thus less disk IO required. 3. Compression. Lucene compressed the data in its inverted indexes, because it assumes the indexes do not change very frequently. However, B+tree doesn't compress, because it doesn't assume the same stability of its indexes. 4. Inverted index doesn't required fix-length of columns. Conventional RDBMS assumes that every row of a table must be of the same columns. When one row may have some extra columns, a workaround is to use "flex". However, each Document of Lucene may be of different fields. 5. Inverted index is convenient for alias/synonyms. Since inverted indexes refer to the original dataset by offset pointers, it is convenient to inject alias and synonym into the inverted indexes, as long as they point to the same offset of the original dataset. However, it is not very convenient to do the same job with conventional database enpowered by B+tree. 6. Ranking. A search engine's implementation makes it convenient to rank the search result. But with conventional database, it is not so convenient. 7. Search engine doesn't bother SQL language. Usually conventional database suppose SQL language to make it convenient for user to organize data, set up index, query, etc. However, SQL's convenience comes with price, because RDBMS engine has to handle the compilation and figure out execution plan for most queries. However, SQL language is not mandatory for RDBMS, the embedded database like BerkeleyDB (Sleepycat) doesn't support SQL, therefore, it is faster than using SQL. Lucene doesn't support SQL-like language. But it is possible to do so if people like SQL's convenient. In summary, for many applications, search engine and database are competitive solutions. One has to consider in depth to choose either search engine or database, and in some cases, the border is blurred. Kan __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]