Hi all, We are seeing a strange problem on our 2-node solr4 cluster. This problem has resultet in data loss.
We have two servers, varnish01 and varnish02. Zookeeper is running on varnish02, but in a separate jvm. We index directly to varnish02 and we read from varnish01. Data is thus replicated from varnish02 to varnish01. I found this in the varnish01 log: *INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2} status=0 QTime=42 Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2} status=0 QTime=41 Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2} status=0 QTime=33 Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2} status=0 QTime=33 Dec 13, 2012 12:23:39 PM org.apache.solr.common.SolrException log SEVERE: shard update error StdNode: http://varnish02.lynero.net:8000/solr/default1_Norwegian/:org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://varnish02.lynero.net:8000/solr/default1_Norwegian at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:335) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:309) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) Caused by: org.apache.http.NoHttpResponseException: The target server failed to respond at org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:101) at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:252) at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:282) at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:247) at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:216) at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:298) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:647) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:464) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352) ... 11 more Dec 13, 2012 12:23:39 PM org.apache.solr.update.processor.DistributedUpdateProcessor doFinish INFO: try and ask http://varnish02.lynero.net:8000/solr to recover* It looks like it is sending updates from varnish01 to varnish02. I am not sure for what since we only index on varnish02. Updates should never be going from varnish01 to varnish02. Meanwhile on varnish02: *INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=16 Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=15 Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=16 Dec 13, 2012 12:23:42 PM org.apache.solr.handler.admin.CoreAdminHandler handleRequestRecoveryAction INFO: It has been requested that we recover* *Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Danish] webapp=/solr path=/select params={facet=false&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&group.distributed.first=true&facet.limit=1000&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&version=2&df=text&fl=docid&shard.url= varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822111&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&facet.mincount=1&qf=%0a++++++++++text^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=0&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true} status=0 QTime=1 Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Danish] webapp=/solr path=/select/ params={fq=site_guid:(2810678)&q=win} hits=0 status=0 QTime=17 Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Danish] webapp=/solr path=/select params={facet=on&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&group.distributed.second=true&version=2&df=text&fl=docid&shard.url= varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822111&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&qf=%0a++++++++++text^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=0&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true} status=0 QTime=1 Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Danish] webapp=/solr path=/select params={facet=false&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&group.distributed.first=true&facet.limit=1000&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&version=2&df=text&fl=docid&shard.url= varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822138&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&facet.mincount=1&qf=%0a++++++++++text^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=40&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true} status=0 QTime=1 Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Danish] webapp=/solr path=/select params={facet=on&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&group.distributed.second=true&version=2&df=text&fl=docid&shard.url= varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822138&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&group.topgroups.groupby_variant_of_item_guid=2963217&group.topgroups.groupby_variant_of_item_guid=2963223&group.topgroups.groupby_variant_of_item_guid=2963219&group.topgroups.groupby_variant_of_item_guid=2963220&group.topgroups.groupby_variant_of_item_guid=2963221&group.topgroups.groupby_variant_of_item_guid=2963222&group.topgroups.groupby_variant_of_item_guid=2963224&group.topgroups.groupby_variant_of_item_guid=2963218&qf=%0a++++++++++text^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=40&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true} status=0 QTime=1 Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=26 Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=22 Dec 13, 2012 12:23:42 PM org.apache.solr.update.DefaultSolrCoreState doRecovery Dec 13, 2012 12:23:42 PM org.apache.solr.update.DefaultSolrCoreState doRecovery INFO: Running recovery - first canceling any ongoing recovery Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=25 Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=24 Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=20 Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=25 Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=23 Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=21 Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=23 Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from= http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=16 Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy run INFO: Starting recovery process. core=default1_Norwegian recoveringAfterStartup=false Dec 13, 2012 12:23:42 PM org.apache.solr.common.cloud.ZkStateReader updateClusterState INFO: Updating cloud state from ZooKeeper... Dec 13, 2012 12:23:42 PM org.apache.solr.update.processor.LogUpdateProcessor finish* And less than a second later: *Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy doRecovery INFO: Attempting to PeerSync from http://varnish01.lynero.net:8000/solr/default1_Norwegian/core=default1_Norwegian - recoveringAfterStartup=false Dec 13, 2012 12:23:42 PM org.apache.solr.update.PeerSync sync INFO: PeerSync: core=default1_Norwegian url= http://varnish02.lynero.net:8000/solr START replicas=[ http://varnish01.lynero.net:8000/solr/default1_Norwegian/] nUpdates=100 Dec 13, 2012 12:23:42 PM org.apache.solr.update.PeerSync sync WARNING: PeerSync: core=default1_Norwegian url= http://varnish02.lynero.net:8000/solr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy doRecovery INFO: PeerSync Recovery was not successful - trying replication. core=default1_Norwegian Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy doRecovery INFO: Starting Replication Recovery. core=default1_Norwegian Dec 13, 2012 12:23:42 PM org.apache.solr.client.solrj.impl.HttpClientUtil createClient INFO: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false Dec 13, 2012 12:23:42 PM org.apache.solr.common.cloud.ZkStateReader$2 process INFO: A cluster state change has occurred - updating...* State change on varnish01 at the same time: *Dec 13, 2012 12:23:42 PM org.apache.solr.common.cloud.ZkStateReader$2 process INFO: A cluster state change has occurred - updating...* * *And a few seconds later on varnish02, the recovery finishes: * Dec 13, 2012 12:23:48 PM org.apache.solr.cloud.RecoveryStrategy doRecovery INFO: Replication Recovery was successful - registering as Active. core=default1_Norwegian Dec 13, 2012 12:23:48 PM org.apache.solr.cloud.RecoveryStrategy doRecovery INFO: Finished recovery process. core=default1_Norwegian Dec 13, 2012 12:23:48 PM org.apache.solr.core.SolrCore execute INFO: [default1_Danish] webapp=/solr path=/select params={facet=false&sort=item_group_56823_name_int+asc,+variant_of_item_guid+asc&group.distributed.first=true&facet.limit=1000&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&version=2&df=text&fl=docid&shard.url= varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397828395&group.field=groupby_variant_of_item_guid&facet.field=itemgroups_int_mv&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_56823_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&facet.mincount=1&qf=%0a++++++++++text^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=0&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true} status=0 QTime=8 Dec 13, 2012 12:23:48 PM org.apache.solr.common.cloud.ZkStateReader updateClusterState INFO: Updating cloud state from ZooKeeper... * Which is picked up on varnish01: *Dec 13, 2012 12:23:48 PM org.apache.solr.common.cloud.ZkStateReader$2 process INFO: A cluster state change has occurred - updating...* It looks like it replicated successfully, only it didnt. The default1_Norwegian core on varnish01 now has 55.071 docs and the same core on varnish02 has 35.088 docs. I checked the log files for both JVM's and no stop-the-world GC were taking place. There is also nothing in the zookeeper log of interest that I can see. -- Med venlig hilsen / Best regards *John Nielsen* Programmer *MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk