: now I have simple Lucene indexes that are basically re-created once daily and
: that simply isn't doing the job for about 30% of my content.

do you mean it takes too long to index all your content so you can only do
part of it, or do you mean it's not indexing some of your content "well"?

: For indexing news articles for instance, I want the article, all reader
: comments, photos, links, multimedia files associated with the article to be
: indexed together as one entity so that if Chris Hostetter commented on the
: "high cost of heating oil in Maine" article, I can find the article by
: searching on your name, etc....

this is a great example of the last 20% of the problem i was talking about
... knowing *when* to reindex a modified record.  even if you have a
perfect mechanism for identifying/flattening all of the data that should go
in a Document, and a perfect method for detecting when any of that data
has changed, it probably isn't practical/efficient to reindex every time
something changes.  you might want to say that creating, deleting or
modifying the "core" aspects of a news article (ie: title, dek, byline,
body, categories, publish date) should trigger an immediate index update,
but for things like user comments it might make more sense to have a batch
process that runs every N minutes and reindexes any article that has had
comments added in the last N minutes ... except maybe you want to be more
responsive to comments added to "recent" articles, so maybe you configure
two separate instances of that cron job: one where N is small but it only
looks at articles published today, and another where N is larger and it
looks at older articles.
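
to make that concrete, here is a rough sketch of the two-tier policy in
Python.  everything in it is an assumption for illustration: the Article
fields, the function names, and the 5 minute / 1 hour intervals are made up,
and it only *selects* article ids, it doesn't talk to Solr or Lucene at all.
you would feed the selected ids to whatever code flattens an article and its
comments into a Document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Article:
    # Hypothetical record; field names are illustrative, not from any real schema.
    id: str
    published: datetime
    core_modified: datetime            # last change to title, dek, byline, body, ...
    last_comment: Optional[datetime]   # time of the most recent comment, if any


def needs_immediate_reindex(article: Article, last_indexed: datetime) -> bool:
    """Changes to the 'core' fields should trigger an index update right away."""
    return article.core_modified > last_indexed


def select_for_comment_batch(
    articles: List[Article],
    now: datetime,
    interval: timedelta,
    published_after: Optional[datetime] = None,
    published_before: Optional[datetime] = None,
) -> List[str]:
    """Return ids of articles that got a comment within the last `interval`
    and whose publish date falls inside the optional window."""
    cutoff = now - interval
    selected = []
    for a in articles:
        if a.last_comment is None or a.last_comment < cutoff:
            continue  # no new comments in this batch window
        if published_after is not None and a.published < published_after:
            continue  # too old for this job
        if published_before is not None and a.published >= published_before:
            continue  # too recent for this job
        selected.append(a.id)
    return selected


def recent_job(articles: List[Article], now: datetime) -> List[str]:
    """Small N, only articles published today (assumed N = 5 minutes)."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return select_for_comment_batch(
        articles, now, timedelta(minutes=5), published_after=midnight
    )


def older_job(articles: List[Article], now: datetime) -> List[str]:
    """Larger N, only articles published before today (assumed N = 1 hour)."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return select_for_comment_batch(
        articles, now, timedelta(hours=1), published_before=midnight
    )
```

each job would be run from cron at its own interval, so a comment on an old
article can wait up to an hour before it is searchable while a comment on
today's article shows up within about five minutes.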

...these are the kinds of tradeoffs that typically have to be made between
indexing data quickly and getting good performance out of your index, and
it's why i've never tried to build a "general purpose" indexer for Solr --
the needs of different indexes are too different for it to make much
sense.

Besides: if it were that easy, Google would have a hosted solution
with a REST API and everyone would just use them to search their sites. :)




-Hoss
