For many applications, search engine and database are competitive solutions. One has to consider in depth to choose either search engine or database, and in some cases, the border is blurred.
There are several issues to consider when selecting database or Lucene or both as the solution to a specific application. 1. Data change frequency. Suppose there is a "restaurant" table, which contains a "current customers" column. Since the content of this column is updated very frequently, Lucene is not a good choice. However, for "menu" column, Lucene definitely a strong candidate for indexing and searching. 2. Query. Suppose the "restaurant" table which contains a "menu" column, there are two ways to store the content into the "menu" column, one is put all the dishes into it as a big string delimited by comma. Another way is to separate each dish into different rows, so that one single restaurant may have multiple rows. If the menu is stored as a big string, then conventional RDBMS is not capable of doing the search. Because "select * from restaurant where menu like '%BBQ%'" will induce a full table scan. However, if the dishes are stored in separated rows, RDBMS is among the candidates. More analysis needs to be considered before making the decision. 3. Performance. Intuitively, B+tree based search sounds faster than inverted-index search. However, since inverted-index organizes the data in a sorted way, its search performance doesn't fall behind B+tree search for too far. Imagine there is an inverted-index, "0, 1, 2, 3, 4, 5, 6, 7, 8", the "hopping" algorithm can be implemented in a way identical to the binary search, if starting with "4", then either "2" or "6", so on. Lucene uses fixed interval hopping, mainly because it is convenient to implement with satisfying performance, even though theoritically not so good as binary search or B+tree search. The above comparison assume B+tree and inverted-index are loaded fully into RAM. In case the index is so big that it is mandatory to swap between RAM and disk, inverted-index tends to excel B+tree index. The reason is that B+tree index occupies more space, which induces more disk IO, and consequently reduces the overall performance. 4. Infrastructure. RDBMS products consume a lot of computational resource, if the user doesn't want to break into the RDBMS's source code and get rid of the unwanted functionality like SQL language. As a result, usually RDBMS is running on a standalone machine, and use JDBC/ODBC to interact with application which is running on another machine, via network. Keep in mind that average it costs 10-1000 milliseconds to transmit data through network, depending on the size of your data. If one wants his server running fast and scaling well like Google.com, even 10 millisecond is a big issue that he has to think hard how to save. Therefore, for better performance, maybe it is more appropriate to put the application logic on the same machine as the data repository. However, if one wants to use conventional RDBMS as a blackbox, the one tier solution may not work well because RDBMS consumes so much computational resource that there is little remained for application logic to use. As a best practice, an embedded database is more favorable than the standalone RDBMS blackbox, especially for non-critical data repository, and usually with better performance. The comparison between embedded database and Lucene is subtle. Sometimes it is reasonable to merge Lucene with embedded database. 5. Integrate Lucene with embedded database. One reason is to integrate Lucene with embedded database is to support blind query like "select * from restaurant where menu like '%BBQ%'". It is possible to integrate Lucene with embedded database, but a difficulty is how to update the inverted-indexes frequently. This is an interesting problem, however, to my best knowledge, there is no good solution for the time being. Kan --- "Peter A. Daly" <[EMAIL PROTECTED]> wrote: > In many cases that essentially require traditional > RDBMS but also require > lucene like functionality, I would use the database > as the primary data > store. I would then either: > 1. Update the lucene index using data from the > database based on a > scheduled process. > 2. As records are added, add them to both Lucene > and the database. > > It's a little extra work (and space,) but you get > the best of both worlds. > > -Pete > > On 1/17/06, John Powers <[EMAIL PROTECTED]> > wrote: > > > > Would you say as a best practice that you can use > both? When would > > you and when wouldn't you? I trust databases more > then free files, so I > > am putting my more sensitive and volatile data in > the database. If you > > built a commenting system.. like a blog or an > flickr type app, would > > just a lucene solution be best? The problem > with both of course is > > syncing.. > > > > __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]