On 3/3/2016 11:36 PM, sangs8788 wrote:
> When a commit fails, the document doesn't get cleared out from MQ and
> there is a task which runs in the background to republish the files to
> SOLR. If we do a batch commit, we will not know, and we will end up
> redoing the same batch commit again. We currently have a client side
> commit which issues the command to SOLR. commit() returns a status code.
> If we are planning to use commitWithin(), I don't think it will actually
> return any result from Solr since it is time oriented.

Do your indexing and commits in batches, as already recommended.  I'd
start with 1000 and go up or down from there as needed.  If the batch
indexing fails, or the commit fails, consider the entire batch failed. 
That may not be the end of the world, though -- if the indexing was
successful (and didn't use ConcurrentUpdateSolrClient), then those
updates will be stored in the Solr transaction log, and will be replayed
if Solr is restarted or the core is reloaded.
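Here's a rough SolrJ sketch of that pattern.  The method name and the
batch handling around it are just placeholders for your own code -- the
point is that the add and the commit succeed or fail as a unit:

  import java.util.List;
  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.common.SolrInputDocument;

  // Returns true only if both the add and the commit succeeded, so the
  // caller knows whether it is safe to clear this batch from the MQ.
  boolean indexBatch(SolrClient solr, List<SolrInputDocument> batch) {
    try {
      solr.add(batch);   // one request for the whole batch
      solr.commit();     // explicit commit, not commitWithin
      return true;
    } catch (Exception e) {
      return false;      // consider the entire batch failed
    }
  }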

If you want to be absolutely certain that the update/commit succeeded by
verifying data, one thing you *could* do is send a batch update, do a
commit, and then request every document in the batch with a query that
includes a limited fl parameter, and verify that the document is present
and the values of the fields requested in the fl parameter are correct. 
I would probably do that query with {!cache=false} to avoid polluting
Solr's caches.
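Something like this for each document in the batch, again with SolrJ.
The "id" and "title" field names and the expected* variables are only
illustrative:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.response.QueryResponse;

  SolrQuery q = new SolrQuery("{!cache=false}id:" + expectedId);
  q.setFields("id", "title");   // limited fl parameter
  q.setRows(1);
  QueryResponse rsp = solr.query(q);
  boolean ok = rsp.getResults().getNumFound() == 1
      && expectedTitle.equals(rsp.getResults().get(0).getFieldValue("title"));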

Almost every index update you make can be simply made again without
danger.  The exceptions are certain kinds of atomic updates, and certain
situations with deletes.  It's probably best to avoid doing those kinds
of updates, which are described below:

If you're doing atomic updates that increment or decrement a field
value, or atomic updates that add a new value to a multivalued field,
the results will be wrong if that update is repeated.  They would be
correct if the update is replayed from Solr's transaction log, however,
because atomic updates are no longer atomic by the time they reach the
transaction log -- the logged version includes values for every field in
the document, as if the document were built from scratch.
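For reference, that kind of update looks like this in SolrJ (the field
names here are made up):

  import java.util.Collections;
  import org.apache.solr.common.SolrInputDocument;

  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "doc1");
  doc.addField("view_count", Collections.singletonMap("inc", 1));    // increments again if re-sent
  doc.addField("tags", Collections.singletonMap("add", "archived")); // appends a duplicate if re-sent
  solr.add(doc);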

If you explicitly delete a document before replacing it, and those
actions are re-done in the opposite order, the document will end up
missing from the index.  Because Solr handles the deletion automatically
when a document is updated/replaced, explicit deletes are not
recommended in that situation.
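In SolrJ terms, just a sketch, with "replacement" standing in for the
new version of the document:

  // Risky: if these two are replayed in the opposite order, the delete
  // wins and the document disappears.
  // solr.deleteById("doc1");
  // solr.add(replacement);

  // Safer: just index the replacement; Solr removes the old version of
  // the document with the same uniqueKey on its own.
  solr.add(replacement);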

> If we go with SOLR autocommit, is there a way to send a response to MQ
> saying the commit was successful?

If commits are completely automatic (autoCommit, autoSoftCommit, or
commitWithin), there's no way for a program to be sure that they have
completed.
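For example, with commitWithin via SolrJ (the 60-second value is
arbitrary, and doc/solr are whatever you already have in hand):

  import org.apache.solr.client.solrj.response.UpdateResponse;

  UpdateResponse rsp = solr.add(doc, 60000); // commitWithin 60 seconds
  // rsp.getStatus() only tells you the add was accepted; nothing ever
  // comes back for the commit itself.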

The general recommendation for Solr indexing, especially if your
pipeline is multi-threaded, is to simply send your updates, let Solr
handle commits, and rely on the design of Lucene combined with Solr's
transaction logs to keep your data safe.  This approach does mean that
when things go wrong it may be a while before new data is searchable.

Emir's reply is spot on.  Solr is not recommended as a primary data store.

Thanks,
Shawn
