[
https://issues.apache.org/jira/browse/SOLR-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280358#comment-13280358
]
Hoss Man commented on SOLR-3473:
--------------------------------
i'm not entirely sure i'm understanding the problems. here's what i think i
understand...
1) if you put dedup prior to distrib, then regardless of how it is configured
it currently runs twice, which is bad - this seems like it is solved by
SOLR-2822
2) if you want to use dedup to generate a sig for the uniqueKey field, then it
really *has* to come before distrib, otherwise forwarding to the leader just
won't work. (again: SOLR-2822 should make this do-able)
3) if you want to use dedup to generate a sig field that is *not* the uniqueKey
field, *AND* you want to use "overwriteDupes=true" then (currently) this needs
to happen _after_ distrib, because otherwise the info about the deletion --
tracked in AddUpdateCommand.updateTerm -- is lost when distrib does the
forward. This seems like something that the distrib processor should deal with
by ensuring it serializes/deserializes all of the key information in the
AddUpdateCommand when sending/receiving a TOLEADER/FROMLEADER request (using
SOLR-2822 vernacular)
3a) it's not enough to ensure that the "updateTerm" is forwarded to all the
replicas in the shard, because other docs in other shards may have the same
term value for the hash. (hence Markus's suggestion about doing a
deleteByQuery -- this should happen in the distrib update processor when
AddUpdateCommand.updateTerm is non-null)
4) something about document cloning ... i still don't really understand this --
not just in terms of dedup, but in general i don't really understand why
SOLR-3215 is an issue assuming we fix SOLR-2822.
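
to make points 3 and 3a concrete, here is a rough, self-contained sketch of the idea. it is not Solr code: AddCmd, the "sketch.updateTerm.*" param names, and planOperations are all hypothetical stand-ins, and the "operations" are just strings. it only illustrates (a) carrying the update term along with the forwarded request instead of dropping it, and (b) turning a non-null term into a delete-by-query that goes to every shard before the add.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; none of these types or names exist in Solr.
final class DedupForwardSketch {

    /** Simplified add command: document fields plus an optional dedup term. */
    static final class AddCmd {
        final Map<String, String> doc;
        final String termField; // null when no dedup term was set
        final String termValue;

        AddCmd(Map<String, String> doc, String termField, String termValue) {
            this.doc = doc;
            this.termField = termField;
            this.termValue = termValue;
        }

        boolean hasUpdateTerm() {
            return termField != null && termValue != null;
        }
    }

    // Assumed request parameter names, invented for this sketch.
    static final String TERM_FIELD_PARAM = "sketch.updateTerm.field";
    static final String TERM_VALUE_PARAM = "sketch.updateTerm.value";

    /** Sender side: copy the term into the params of the forwarded request. */
    static Map<String, String> toForwardParams(AddCmd cmd) {
        Map<String, String> params = new LinkedHashMap<>();
        if (cmd.hasUpdateTerm()) {
            params.put(TERM_FIELD_PARAM, cmd.termField);
            params.put(TERM_VALUE_PARAM, cmd.termValue);
        }
        return params;
    }

    /** Receiver side: rebuild the command from the document and the params. */
    static AddCmd fromForward(Map<String, String> doc, Map<String, String> params) {
        return new AddCmd(doc, params.get(TERM_FIELD_PARAM), params.get(TERM_VALUE_PARAM));
    }

    /** Lists what the distrib step would issue for one add (query escaping omitted). */
    static List<String> planOperations(AddCmd cmd) {
        List<String> ops = new ArrayList<>();
        if (cmd.hasUpdateTerm()) {
            // Duplicates may live in other shards, so the delete goes everywhere.
            ops.add("deleteByQuery(all shards): " + cmd.termField + ":" + cmd.termValue);
        }
        ops.add("add(owning shard): id=" + cmd.doc.get("id"));
        return ops;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "doc-1");
        doc.put("sig", "abc123");
        AddCmd original = new AddCmd(doc, "sig", "abc123");
        AddCmd received = fromForward(doc, toForwardParams(original));
        for (String op : planOperations(received)) {
            System.out.println(op);
        }
    }
}
```

the point of routing the term through the request params in this sketch is that the receiving node can rebuild the same command the sender had, rather than only the document; whether the real fix should use params or some other carrier is an open question here.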
> Distributed deduplication broken
> --------------------------------
>
> Key: SOLR-3473
> URL: https://issues.apache.org/jira/browse/SOLR-3473
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud, update
> Affects Versions: 4.0
> Reporter: Markus Jelsma
> Fix For: 4.0
>
>
> Solr's deduplication via the SignatureUpdateProcessor is broken for
> distributed updates on SolrCloud.
> Mark Miller:
> {quote}
> Looking again at the SignatureUpdateProcessor code, I think that indeed this
> won't currently work with distrib updates. Could you file a JIRA issue for
> that? The problem is that we convert update commands into solr documents -
> and that can cause a loss of info if an update proc modifies the update
> command.
> I think the reason that you see a multiple values error when you try the
> other order is because of the lack of a document clone (the other issue I
> mentioned a few emails back). Addressing that won't solve your issue though -
> we have to come up with a way to propagate the currently lost info on the
> update command.
> {quote}
> Please see the ML thread for the full discussion:
> http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html