[ https://issues.apache.org/jira/browse/SOLR-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15817007#comment-15817007 ]
ASF subversion and git services commented on SOLR-9918:
-------------------------------------------------------

Commit 2979a1eacd916201548303245f81705da7f9cc36 in lucene-solr's branch refs/heads/branch_6x from koji
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2979a1e ]

SOLR-9918: Add SkipExistingDocumentsProcessor that skips duplicate inserts and ignores updates to missing docs

> An UpdateRequestProcessor to skip duplicate inserts and ignore updates to
> missing docs
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-9918
>                 URL: https://issues.apache.org/jira/browse/SOLR-9918
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: update
>            Reporter: Tim Owen
>            Assignee: Koji Sekiguchi
>         Attachments: SOLR-9918.patch, SOLR-9918.patch
>
>
> This is an UpdateRequestProcessor and Factory that we have been using in
> production, to handle 2 common cases that were awkward to achieve using the
> existing update pipeline and current processor classes:
> * When inserting document(s), if some already exist then quietly skip the new
> document inserts - do not churn the index by replacing the existing documents
> and do not throw a noisy exception that breaks the batch of inserts. By
> analogy with SQL, {{insert if not exists}}. In our use-case, multiple
> application instances can (rarely) process the same input so it's easier for
> us to de-dupe these at Solr insert time than to funnel them into a global
> ordered queue first.
> * When applying AtomicUpdate documents, if a document being updated does not
> exist, quietly do nothing - do not create a new partially-populated document
> and do not throw a noisy exception about missing required fields. By analogy
> with SQL, {{update where id = ..}}. Our use-case relies on this because we
> apply updates optimistically and have best-effort knowledge about what
> documents will exist, so it's easiest to skip the updates (in the same way a
> Database would).
> I would have kept this in our own package hierarchy but it relies on some
> package-scoped methods, and seems like it could be useful to others if they
> choose to configure it. Some bits of the code were borrowed from
> {{DocBasedVersionConstraintsProcessorFactory}}.
> Attached patch has unit tests to confirm the behaviour.
> This class can be used by configuring solrconfig.xml like so..
> {noformat}
>   <updateRequestProcessorChain name="skipexisting">
>     <processor class="solr.LogUpdateProcessorFactory" />
>     <processor class="org.apache.solr.update.processor.SkipExistingDocumentsProcessorFactory">
>       <bool name="skipInsertIfExists">true</bool>
>       <bool name="skipUpdateIfMissing">false</bool> <!-- We will override this per-request -->
>     </processor>
>     <processor class="solr.DistributedUpdateProcessorFactory" />
>     <processor class="solr.RunUpdateProcessorFactory" />
>   </updateRequestProcessorChain>
> {noformat}
> and initParams defaults of
> {noformat}
>   <str name="update.chain">skipexisting</str>
> {noformat}
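
The two rules in the issue description (skip an insert when the document already exists, skip an atomic update when the document is missing) can be illustrated with a small in-memory model. This is only a sketch of the described behaviour and is not the code from the SOLR-9918 patch: the class SkipExistingSketch, the Action enum, and the process method are hypothetical names, and a plain Set of ids stands in for the Solr index. Only the flag names skipInsertIfExists and skipUpdateIfMissing come from the configuration in the issue.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical in-memory model of the two skip rules described in SOLR-9918.
 * It does not use any Solr classes and is not the committed implementation.
 */
public class SkipExistingSketch {

    /** What the sketch decides to do with one incoming document. */
    public enum Action { FORWARD, SKIP }

    // Stands in for the index: the ids of documents that already exist.
    private final Set<String> existingIds = new HashSet<>();
    private final boolean skipInsertIfExists;
    private final boolean skipUpdateIfMissing;

    public SkipExistingSketch(boolean skipInsertIfExists, boolean skipUpdateIfMissing) {
        this.skipInsertIfExists = skipInsertIfExists;
        this.skipUpdateIfMissing = skipUpdateIfMissing;
    }

    /**
     * Decide whether a document is passed on to the rest of the chain.
     *
     * @param id             unique key of the incoming document
     * @param isAtomicUpdate true for an atomic (partial) update, false for a full insert
     */
    public Action process(String id, boolean isAtomicUpdate) {
        boolean exists = existingIds.contains(id);
        if (isAtomicUpdate) {
            // "update where id = ..": quietly drop updates to missing documents.
            if (!exists && skipUpdateIfMissing) {
                return Action.SKIP;
            }
            // What happens downstream to a forwarded update is outside this sketch.
            return Action.FORWARD;
        }
        // "insert if not exists": quietly drop inserts of documents already present.
        if (exists && skipInsertIfExists) {
            return Action.SKIP;
        }
        existingIds.add(id); // a forwarded insert creates (or replaces) the document
        return Action.FORWARD;
    }

    public static void main(String[] args) {
        SkipExistingSketch chain = new SkipExistingSketch(true, true);
        System.out.println(chain.process("doc1", false)); // FORWARD: first insert
        System.out.println(chain.process("doc1", false)); // SKIP: duplicate insert
        System.out.println(chain.process("doc2", true));  // SKIP: update to a missing doc
        System.out.println(chain.process("doc1", true));  // FORWARD: update to an existing doc
    }
}
```

With both flags set to false the model forwards everything, which corresponds to leaving the processor out of the chain.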