Hi everybody,
     Let's say I have an index with 10M large-ish documents, and as people
log into a website and view them, each document's "last viewed date" is
updated to the current time. We index a document's last-viewed-date because
it lets users a) search on this last-viewed-date alongside all other
searchable criteria, and b) order the results of any search by it.
     The problem is that in a given 5-minute period, we may have many
thousands of updated documents (due to this simple last-viewed-date). We
have a task that looks for changed documents, loads the full documents, and
then feeds them into Solr to update the index. Unfortunately, reading
these changed documents and continually feeding them to Solr generates *far*
more load on our system (both Solr and the database) than any of the
searches. In a given day, *we may have more updates to documents than we
have total documents indexed*. (Databases don't handle this well either:
contention on rows for updates slows the database down significantly.)
     How should we approach this problem? It seems like such a waste of
resources to be doing so much work in applications/database/solr only for
last-viewed-dates.

     Solutions we've looked at include:
     1) Update only part of a document. --Apparently this isn't supported in
Solr yet (we're currently using nightly Solr 1.4 builds).
     2) Use "near-real-time updates". --Not supported yet. Also, the
"freshness" of the data isn't as much of a concern as the sheer volume of
changes that we have to make here. For example, we could update Solr
less frequently, but then we'd just have many more documents to update
each time. The data only has to be, say, fresh to within 30 minutes.
     3) Use a separate index for the last-viewed-date. --This won't work
because we need to search on the last-viewed-date alongside other criteria,
and we use it as a scoring criterion for all our searches.
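
     To make the batching trade-off in 2) concrete, here is a rough sketch
of collapsing view events inside one 30-minute window so that each document
is queued for reindexing at most once. It is only an illustration: the
function name coalesce_views, the (doc_id, viewed_at) event shape, and the
sample data are all made up for this example, not taken from our actual
task or from any Solr API.

```python
from datetime import datetime, timedelta


def coalesce_views(view_events, window_start, window=timedelta(minutes=30)):
    """Return {doc_id: newest viewed_at} for events inside one window.

    view_events is a hypothetical iterable of (doc_id, viewed_at) pairs.
    Events outside [window_start, window_start + window) are ignored.
    """
    window_end = window_start + window
    latest = {}
    for doc_id, viewed_at in view_events:
        if not (window_start <= viewed_at < window_end):
            continue
        # Keep only the most recent view per document.
        if doc_id not in latest or viewed_at > latest[doc_id]:
            latest[doc_id] = viewed_at
    return latest


# Made-up sample data: doc-1 is viewed twice within the window.
start = datetime(2009, 1, 1, 12, 0)
events = [
    ("doc-1", start + timedelta(minutes=1)),
    ("doc-2", start + timedelta(minutes=2)),
    ("doc-1", start + timedelta(minutes=20)),
    ("doc-3", start + timedelta(minutes=45)),  # falls in the next window
]

pending = coalesce_views(events, start)
print(len(pending))  # 2 documents to reindex for 3 in-window views
```

     This only saves work when the same documents are viewed repeatedly
within a window. Every document left in the batch still has to be loaded in
full and re-fed to Solr, which is the expensive part.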

     Any suggestions?

Sincerely,

     Daryl.
