Tim Owen created SOLR-9918:
------------------------------

             Summary: An UpdateRequestProcessor to skip duplicate inserts and ignore updates to missing docs
                 Key: SOLR-9918
                 URL: https://issues.apache.org/jira/browse/SOLR-9918
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: update
            Reporter: Tim Owen


This is an UpdateRequestProcessor and its factory, which we have been using in 
production to handle two common cases that were awkward to achieve with the 
existing update pipeline and processor classes:

* When inserting documents, if some already exist then quietly skip the new 
inserts: do not churn the index by replacing the existing documents, and do 
not throw a noisy exception that breaks the batch of inserts. This is 
analogous to SQL's {{insert if not exists}}. In our use case, multiple 
application instances can (rarely) process the same input, so it is easier for 
us to de-dupe these at Solr insert time than to funnel them into a global 
ordered queue first.
* When applying AtomicUpdate documents, if a document being updated does not 
exist, quietly do nothing: do not create a new partially populated document, 
and do not throw a noisy exception about missing required fields. This is 
analogous to SQL's {{update where id = ..}}. Our use case relies on this 
because we apply updates optimistically and have only best-effort knowledge of 
which documents will exist, so it is easiest to skip the updates (in the same 
way a database would).
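
The two skip rules above can be modelled without Solr. The following is only a minimal sketch of the decision logic, not the code in the attached patch. The class {{SkipExistingSketch}}, its in-memory map standing in for the index, and the convention that a field value of the form {{{"set": value}}} marks an atomic update are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory model of the two skip rules; it does not use any Solr API. */
public class SkipExistingSketch {

    // Stand-in for the index: document id -> stored field values.
    private final Map<String, Map<String, Object>> index = new LinkedHashMap<>();
    private final boolean skipInsertIfExists;
    private final boolean skipUpdateIfMissing;

    public SkipExistingSketch(boolean skipInsertIfExists, boolean skipUpdateIfMissing) {
        this.skipInsertIfExists = skipInsertIfExists;
        this.skipUpdateIfMissing = skipUpdateIfMissing;
    }

    /** Assumption: a document is an atomic update if any field value is a modifier map. */
    static boolean isAtomicUpdate(Map<String, Object> doc) {
        return doc.values().stream().anyMatch(v -> v instanceof Map);
    }

    public Map<String, Object> get(String id) {
        return index.get(id);
    }

    /** Returns true if the document was applied, false if it was quietly skipped. */
    public boolean add(Map<String, Object> doc) {
        String id = (String) doc.get("id");
        boolean exists = index.containsKey(id);

        if (isAtomicUpdate(doc)) {
            if (!exists) {
                if (skipUpdateIfMissing) {
                    return false; // like "update where id = ..": nothing matched, nothing done
                }
                // Stand-in for the noisy failure described in the second bullet.
                throw new IllegalStateException("document " + id + " does not exist");
            }
            Map<String, Object> stored = index.get(id);
            for (Map.Entry<String, Object> field : doc.entrySet()) {
                if (field.getValue() instanceof Map) {
                    Map<?, ?> modifier = (Map<?, ?>) field.getValue();
                    if (!modifier.containsKey("set")) {
                        throw new UnsupportedOperationException("sketch handles only 'set'");
                    }
                    stored.put(field.getKey(), modifier.get("set"));
                }
            }
            return true;
        }

        if (exists && skipInsertIfExists) {
            return false; // like "insert if not exists": keep the existing document untouched
        }
        index.put(id, new LinkedHashMap<>(doc)); // plain insert (or overwrite when not skipping)
        return true;
    }

    public static void main(String[] args) {
        SkipExistingSketch sketch = new SkipExistingSketch(true, true);
        System.out.println(sketch.add(Map.of("id", "1", "title", "First")));
        System.out.println(sketch.add(Map.of("id", "1", "title", "Duplicate")));
        System.out.println(sketch.add(Map.of("id", "2", "title", Map.of("set", "x"))));
        System.out.println(sketch.get("1"));
    }
}
```

In this model a skipped operation simply returns false instead of raising an error, so the rest of a batch is unaffected; the real processor sits in the update chain and must consult the index itself.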

I would have kept this in our own package hierarchy, but it relies on some 
package-scoped methods, and it seems like it could be useful to others if they 
choose to configure it. Some bits of the code were borrowed from 
{{DocBasedVersionConstraintsProcessorFactory}}.

Attached patch has unit tests to confirm the behaviour.

This class can be used by configuring solrconfig.xml as follows:

{noformat}
  <updateRequestProcessorChain name="skipexisting">
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="org.apache.solr.update.processor.SkipExistingDocumentsProcessorFactory">
      <bool name="skipInsertIfExists">true</bool>
      <!-- We will override this per-request -->
      <bool name="skipUpdateIfMissing">false</bool>
    </processor>
    <processor class="solr.DistributedUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
{noformat}

together with an initParams default of:

{noformat}
      <str name="update.chain">skipexisting</str>
{noformat}
