Hey Chris,

I found a separate issue while working on CDCR that may be related to your problem. Please see JIRA: *SOLR-12063* <https://issues.apache.org/jira/projects/SOLR/issues/SOLR-12063>. This bug was introduced when we added support for the bidirectional approach, where an extra flag for CDCR is added to the tlog entry.
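To make the failure mode concrete, here is a minimal sketch of how a flag appended to the end of an entry changes what "last element" means. It does not use Solr's classes: TlogEntryDemo, PAYLOAD_IDX, lastElement() and payloadAt() are hypothetical names, and the list layout [operation, version, payload, optional flag] is a simplified assumption for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class TlogEntryDemo {
    // Assumed layout (illustration only): [operation, version, payload, optional boolean flag]
    static final int PAYLOAD_IDX = 2;

    // Fragile access: treats whatever is last in the entry as the payload.
    static Object lastElement(List<Object> entry) {
        return entry.get(entry.size() - 1);
    }

    // Positional access: reads the payload from its fixed index.
    static Object payloadAt(List<Object> entry) {
        return entry.get(PAYLOAD_IDX);
    }

    public static void main(String[] args) {
        byte[] id = "12345".getBytes(StandardCharsets.UTF_8);

        // Entry without the trailing flag: the payload happens to be last.
        List<Object> oldEntry = Arrays.<Object>asList(2, 1000L, id);
        // Entry with a trailing boolean flag: the payload is no longer last.
        List<Object> newEntry = Arrays.<Object>asList(2, 1000L, id, Boolean.FALSE);

        System.out.println(lastElement(oldEntry) instanceof byte[]); // true
        System.out.println(lastElement(newEntry) instanceof byte[]); // false
        System.out.println(payloadAt(newEntry) instanceof byte[]);   // true
    }
}
```

With the flag present, casting the last element to byte[] would fail or pick up the wrong value, while reading from the fixed index keeps working for both layouts.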
This part of the code is the culprit, in *UpdateLog.java*, *RecentUpdates::update()*:

    switch (oper) {
      case UpdateLog.ADD:
      case UpdateLog.UPDATE_INPLACE:
      case UpdateLog.DELETE:
      case UpdateLog.DELETE_BY_QUERY:
        Update update = new Update();
        update.log = oldLog;
        update.pointer = reader.position();
        update.version = version;

        if (oper == UpdateLog.UPDATE_INPLACE && entry.size() == 5) {
          update.previousVersion = (Long) entry.get(UpdateLog.PREV_VERSION_IDX);
        }
        updatesForLog.add(update);
        updates.put(version, update);

        if (oper == UpdateLog.DELETE_BY_QUERY) {
          deleteByQueryList.add(update);
        } else if (oper == UpdateLog.DELETE) {
          deleteList.add(new DeleteUpdate(version, (byte[])entry.get(entry.size()-1)));
        }
        break;

      case UpdateLog.COMMIT:
        break;
      default:
        throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Unknown Operation! " + oper);
    }

The line

    deleteList.add(new DeleteUpdate(version, (byte[])entry.get(entry.size()-1)));

expects the last element of the entry to be the payload. Everywhere else in the project, *pos:[2]* is the index for the payload, and in Solr 7.2 and later the last element in the source code is a *boolean* denoting whether the update is CDCR-forwarded or a typical one.

UpdateLog.java.RecentUpdates is used in CDCR sync and checkpoint operations, so this is a legitimate bug that slipped past the tests I wrote.

The immediate fix patch is uploaded and I am awaiting feedback on it. Meanwhile, if it is possible for you to apply the patch, build the jar and try it out, please do and let us know.

For *SOLR-9394* <https://issues.apache.org/jira/browse/SOLR-9394>, if you can comment on the JIRA and post the sample docs, Solr logs and other relevant information, I can give it a thorough look.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2

On Wed, Mar 7, 2018 at 1:35 AM, Chris Troullis <cptroul...@gmail.com> wrote:

> Hi all,
>
> We recently upgraded to Solr 7.2.0 as we saw that there were some CDCR bug
> fixes and features added that would finally let us be able to make use of
> it (bi-directional syncing was the big one). The first time we tried to
> implement we ran into all kinds of errors, but this time we were able to
> get it mostly working.
>
> The issue we seem to be having now is that any time a document is deleted
> via deleteById from a collection on the primary node, we are flooded with
> "Invalid Number" errors followed by a random sequence of characters when
> CDCR tries to sync the update to the backup site. This happens on all of
> our collections where our id fields are defined as longs (some of them the
> ids are compound keys and are strings).
>
> Here's a sample exception:
>
> org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at http://ip/solr/collection_shard1_replica_n1: Invalid Number: ]
> -s
>         at org.apache.solr.client.solrj.impl.CloudSolrClient.directUpdate(CloudSolrClient.java:549)
>         at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1012)
>         at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:883)
>         at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
>         at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
>         at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
>         at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:945)
>         at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:816)
>         at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
>         at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
>         at org.apache.solr.handler.CdcrReplicator.sendRequest(CdcrReplicator.java:140)
>         at org.apache.solr.handler.CdcrReplicator.run(CdcrReplicator.java:104)
>         at org.apache.solr.handler.CdcrReplicatorScheduler.lambda$null$0(CdcrReplicatorScheduler.java:81)
>         at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
> I'm scratching my head as to the cause of this. It's like it is trying to
> deleteById for the value "]", even though that is not the ID for the
> document that was deleted from the primary. So I don't know if it is
> pulling this from the wrong field somehow or where that value if coming
> from.
>
> I found this issue: https://issues.apache.org/jira/browse/SOLR-9394 which
> looks related, but doesn't look like it has any traction.
>
> Has anyone else experienced this issue with CDCR, or have any ideas as to
> what could be causing this issue?
>
> Thanks,
>
> Chris