I'm going to try and cover all replies so far, but for the most part this first one since it had the most help so far. Thanks to everyone who replied so far, you've given me a lot of great ideas to think about and look into. I'm going to begin some small test indexes with our data so we have something to play with today and next week.
>>> erickerick...@gmail.com 3/26/2009 4:49 PM >>> >You've got a great grasp of the issues, comments below. But before you >do, a lot of this kind if thing is incorporated in SOLR, which is build on >Lucene. Particularly updating an index then using it. >So you might take a look over there. It even has a DataImportHandler... We're actually using Zend_Lucene, as our entire database and web structure is based on PHP with the Zend framework, so it will be more efficient for us to do searching with a php based lucene rather than running from within Tomcat. I was considering using Java and looking into SOLR to do the majority of the index work, and using Zend_Lucene to do the searching, but it looks like Zend doesn't support the newer 2.4 lucene indexes yet. For our uses, I'm thinking Zend's implementation is our best option, both because of lack of support for 2.4 and my being more comfortable writing in PHP rather than Java. >How many rows/columns are we talking here? the answer for >100,000,000 is different than for 1,000. For start... 18,000 rows, theoretically 56,000 if we start to scope out Lucene for other uses and some in-house apps. >How many fields is this? And are they searched or displayed? Both? >Neither? If you don't search these fields, you could keep an auxiliary index >with your primary key and the new fields. For that matter, if it was >a small enough amount of data you could keep it in an internal map >keyed by your DB unique id. Size matters here..... >From my initial estimates it looks like 25 searchable fields. A mix of keywords and text, possibly one or two unstored. >I wouldn't rebuild the entire index, just the changed rows. Assuming that >your 15 minute deltas aren't for a large percentage of the rows. I was under the impression that even for a few rows it can still be an expensive process. Am I wrong? >This may be a fine choice. What is the throughput you expect to have to >satisfy? 1 qpm or 1000 qps? And how many rows of results do you expect >to display? How fast can you get rows from the DB? Our server itself is pretty fast, so returning rows isn't a huge problem. Our main web table is returned in ~3 seconds. Our bigger in-house database takes ~16 seconds. In most cases we'll only be returning 100 or so results. >Possible, but it really depends upon whether the difference between >fetching a few fields really shows up compared to fetching them all. Which >in turn depends upon the answers for some of the questions above. The fields that change are just a decimal and a keyword. In comparison to the entire result set we get currently from our search results stored procedure it's minuscule, but like I said, the database speed isn't a huge deal. We are mainly interested in Lucene for it's more accurate full text searching and more powerful searching compared to MS SQL 2000. That being said, however, I am very interested in maximizing our speed and I want to do it in as efficient as well as standardized a way as possible. Since I'm new to Lucene I'm not sure how much of what I'm looking into is on a per-site basis and requires coding both and seeing the performance hit for either, and how much is based on the "proper" uses of Lucene. >It depends (tm). How long does it take to build the index in the first >place? We haven't gotten as far as building the index yet, as to use Lucene we have to make some rather major changes to how our Database is setup to get the most out of Lucene's indexing. I'm simply trying to scope out the project and make sure that I am headed in the right direction and getting the concepts of the software down. >>> chris...@gmail.com 3/26/2009 5:11 PM >>> >There are many things you need to synchronize with database. Besides >just changed fields, you may need to deal with deleted database records, >etc. Records are deleted very infrequently from our database. A once an evening full update of our index is more than enough to handle that. >Seems you are not sure the proper index structure yet. >I think you can use DBSight Free version, to rapidly prototype and >experiment with all these choices, without coding any XML etc. I have a pretty good idea of the index structure so far, I think. The main issue is that there are two specific fields that need to be fairly accurate (cronjob every 15 minutes currently) when returned with search results. The rest can wait ~24 hours to appear live on the site, but would be nice to keep those up to date. So really I'm just trying to find out if it's worth updating the index for that data, or if it's better to store just index information in Lucene and use it for pure searching, and the database to retrieve the data based on the results Lucene returns. I want to make sure this is done right the first time, so we don't have to go back and correct it later if/when we discover it was implemented poorly. Sadly I really don't know an real world numbers on how often the data changes, just that it needs to be up to date. >>> william...@gmail.com 3/26/2009 7:03 PM >>> >Not sure what ORM framework, if any, you might be using, but some >colleagues have had some success using Hibernate Search[1] for this >sorta thing. Currently we entirely use the database. Lucene will be the first time we have to map to a source off of the database. Correct me if I'm wrong though, but from my research it appears Hibernate requires you to use it to actually make your SQL queries which would in turn also make the same change to the Lucene index? Since we are building this on a already large site it would be too large of a project to try and convert all of our current in house applications and web apps to attach to a new service.