I'm going to try to cover all the replies, but mostly this first one
since it was the most helpful. Thanks to everyone who has responded;
you've given me a lot of great ideas to think about and look into. I'm
going to start building some small test indexes with our data so we
have something to play with today and next week.

>>> erickerick...@gmail.com 3/26/2009 4:49 PM >>>
>You've got a great grasp of the issues, comments below. But before
>you do, a lot of this kind of thing is incorporated in SOLR, which is
>built on Lucene. Particularly updating an index then using it.
>So you might take a look over there. It even has a DataImportHandler...

We're actually using Zend_Lucene, as our entire database and web
structure is based on PHP with the Zend Framework, so it will be more
efficient for us to do searching with a PHP-based Lucene rather than
running from within Tomcat. I considered using Java and SOLR for the
majority of the index work and Zend_Lucene for the searching, but it
looks like Zend doesn't support the newer 2.4 Lucene index format yet.
For our uses, I'm thinking Zend's implementation is our best option,
both because of that lack of 2.4 support and because I'm more
comfortable writing PHP than Java.

>How many rows/columns are we talking here? the answer for
>100,000,000 is different than for 1,000.

To start: 18,000 rows, theoretically 56,000 if we scope out Lucene for
other uses and some in-house apps.

>How many fields is this? And are they searched or displayed? Both?
>Neither? If you don't search these fields, you could keep an auxiliary
>index with your primary key and the new fields. For that matter, if it
>was a small enough amount of data you could keep it in an internal map
>keyed by your DB unique id. Size matters here.....
 
From my initial estimates it looks like 25 searchable fields: a mix of
keywords and text, possibly one or two unstored.
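For anyone following along, a rough sketch of how that mix of field
types could look in Zend_Search_Lucene (the field names and index path
here are made up for illustration, not our real schema):

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Create a new index at a hypothetical path
$index = Zend_Search_Lucene::create('/tmp/test-index');

$doc = new Zend_Search_Lucene_Document();
// Keyword: stored and indexed as a single untokenized term
$doc->addField(Zend_Search_Lucene_Field::Keyword('sku', 'AB-1234'));
// Text: tokenized, indexed, and stored for display in results
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'Example product title'));
// UnStored: tokenized and searchable, but not retrievable from the index
$doc->addField(Zend_Search_Lucene_Field::UnStored('description', 'Long body text...'));

$index->addDocument($doc);
$index->commit();
```

The unstored fields keep the index smaller since their raw text never
has to be written out, only the terms.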

>I wouldn't rebuild the entire index, just the changed rows. Assuming
>that your 15 minute deltas aren't for a large percentage of the rows.

I was under the impression that updating the index can still be an
expensive process even for a few rows. Am I wrong?
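If I understand the suggestion, in Lucene an "update" is really a
delete of the old document followed by re-adding it. A minimal sketch
of what that 15-minute delta job might look like in Zend_Search_Lucene
(field, variable, and path names are invented for illustration):

```php
<?php
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/tmp/test-index');

// $changedRows = rows touched since the last run, e.g. from a
// "modified since" query against the database (assumed to exist)
foreach ($changedRows as $row) {
    // Locate the old document by its stored database primary key
    $term = new Zend_Search_Lucene_Index_Term($row['id'], 'db_id');
    foreach ($index->termDocs($term) as $docId) {
        $index->delete($docId);
    }

    // Re-add the row with its current values
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Keyword('db_id', $row['id']));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $row['title']));
    $index->addDocument($doc);
}

$index->commit();
```

The per-row cost should be small as long as the deltas stay a small
fraction of the index, as the reply above suggests.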

>This may be a fine choice. What is the throughput you expect to have
>to satisfy? 1 qpm or 1000 qps? And how many rows of results do you
>expect to display? How fast can you get rows from the DB?
 
Our server itself is pretty fast, so returning rows isn't a huge
problem. A query over our main web table returns in ~3 seconds; our
bigger in-house database takes ~16 seconds. In most cases we'll only be
returning 100 or so results.

>Possible, but it really depends upon whether the difference between
>fetching a few fields really shows up compared to fetching them all.
>Which in turn depends upon the answers for some of the questions above.

The fields that change are just a decimal and a keyword. Compared to
the entire result set we currently get from our search stored
procedure, that's minuscule, but like I said, database speed isn't a
huge deal. We're mainly interested in Lucene for full-text searching
that's more accurate and more powerful than what MS SQL 2000 offers.
That said, I am very interested in maximizing our speed, and I want to
do it in as efficient and standardized a way as possible. Since I'm new
to Lucene, I'm not sure how much of what I'm looking into varies per
site (and would require coding both approaches and measuring the
performance of each) and how much comes down to the "proper" use of
Lucene.

>It depends (tm). How long does it take to build the index in the
>first place?
 
We haven't gotten as far as building the index yet; to get the most out
of Lucene's indexing, we have to make some rather major changes to how
our database is set up. I'm simply trying to scope out the project and
make sure that I'm headed in the right direction and have the concepts
of the software down.

>>> chris...@gmail.com 3/26/2009 5:11 PM >>>
>There are many things you need to synchronize with database. Besides
>just changed fields, you may need to deal with deleted database
>records, etc.
 
Records are deleted very infrequently from our database. A full rebuild
of the index once an evening is more than enough to handle that.
 
>Seems you are not sure the proper index structure yet.
>I think you can use DBSight Free version, to rapidly prototype and 
>experiment with all these choices, without coding any XML etc.
 
I have a pretty good idea of the index structure so far, I think. The
main issue is that there are two specific fields that need to be fairly
fresh (a cron job runs every 15 minutes currently) when returned with
search results. The rest can wait ~24 hours to appear live on the site,
but it would be nice to keep those two up to date. So really I'm trying
to find out whether it's worth updating the index for that data, or
whether it's better to store just identifying information in Lucene,
use it purely for searching, and retrieve the actual data from the
database based on the results Lucene returns. I want to get this right
the first time so we don't have to go back and correct it later if/when
we discover it was implemented poorly. Sadly, I don't know any
real-world numbers on how often the data changes, just that it needs to
be up to date.
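To make the second option concrete, here's a sketch of the
"Lucene for matching, database for data" approach I'm describing. The
table, column, and index names are invented, and `$db` is assumed to be
a Zend_Db adapter we'd already have:

```php
<?php
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/tmp/test-index');

// Run the user's search against Lucene
$hits = $index->find($userQuery);

// Collect only the stored primary keys from the hits
$ids = array();
foreach ($hits as $hit) {
    $ids[] = (int) $hit->db_id;
}

if (!empty($ids)) {
    // Fetch the always-fresh columns (the decimal and keyword that
    // change every 15 minutes) straight from the database
    $sql  = 'SELECT id, price, status FROM products WHERE id IN ('
          . implode(',', $ids) . ')';
    $rows = $db->fetchAll($sql);
}
```

With this layout the index only goes stale on searchable text, never on
the two volatile fields, at the cost of one extra database round trip
per search.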
 
>>> william...@gmail.com 3/26/2009 7:03 PM >>>
>Not sure what ORM framework, if any, you might be using, but some
>colleagues have had some success using Hibernate Search[1] for this
>sorta thing.
 
Currently everything runs directly against the database; Lucene will be
the first time we have to map our data to a source outside it. Correct
me if I'm wrong, but from my research it appears Hibernate requires you
to route your actual SQL queries through it so that it can mirror the
same changes into the Lucene index? Since we're building this on an
already large site, it would be too big a project to try to convert all
of our current in-house applications and web apps to a new service.
