[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499853#comment-14499853 ]

Mark Miller commented on SOLR-4260:
-----------------------------------

This ticket addressed specific issues - please open a new ticket for any further reports.

> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
>                 Key: SOLR-4260
>                 URL: https://issues.apache.org/jira/browse/SOLR-4260
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>         Environment: 5.0.0.2013.01.04.15.31.51
>            Reporter: Markus Jelsma
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 4.6.1, Trunk
>
>         Attachments: 192.168.20.102-replica1.png, 192.168.20.104-replica2.png, SOLR-4260.patch, clusterstate.png, demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using CloudSolrServer, we see inconsistencies between the leader and replica for some shards.
> Each core holds about 330k documents. For some reason 5 out of 10 shards have a small deviation in the number of documents. The leader and replica deviate by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my attention: there were small IDF differences for exactly the same record, causing the record to shift positions in the result set. During those tests no records were indexed. Consecutive catch-all queries also return different numDocs.
> We're running a 10-node test cluster with 10 shards and a replication factor of two, and frequently reindex using a fresh build from trunk. I had not seen this issue for quite some time until a few days ago.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496055#comment-14496055 ]

Hari Sekhon commented on SOLR-4260:
-----------------------------------

I've seen much larger discrepancies between leader and followers on newer versions of Solr than in this ticket - tens to hundreds of thousands of numDocs difference when doing bulk online indexing jobs (hundreds of millions of docs) from Hive.
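One hands-on way to spot this kind of drift (not part of this ticket; the class and method below are hypothetical, and fetching the counts is assumed) is to query each replica core directly with q=*:*&rows=0&distrib=false and compare the numDocs each one reports. A minimal sketch of the comparison step, with the per-replica counts already in hand:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ShardConsistencyCheck {
    /**
     * Given numDocs per replica for each shard (as reported by querying each
     * core with q=*:*&rows=0&distrib=false), return the shards whose replicas
     * disagree. Fetching the counts is deliberately left out of this sketch.
     */
    static List<String> inconsistentShards(Map<String, Map<String, Long>> countsByShard) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, Long>> shard : countsByShard.entrySet()) {
            // More than one distinct count among replicas => out of sync.
            if (shard.getValue().values().stream().distinct().count() > 1) {
                out.add(shard.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Long>> counts = new LinkedHashMap<>();
        Map<String, Long> shard1 = new LinkedHashMap<>();
        shard1.put("replica1", 330_012L);
        shard1.put("replica2", 329_998L); // deviates by 14 docs
        counts.put("shard1", shard1);
        Map<String, Long> shard2 = new LinkedHashMap<>();
        shard2.put("replica1", 330_500L);
        shard2.put("replica2", 330_500L);
        counts.put("shard2", shard2);
        System.out.println(inconsistentShards(counts)); // prints [shard1]
    }
}
```

Any shard this flags is worth a closer look, keeping in mind that uncommitted or still-replaying updates can make counts differ transiently.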
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897716#comment-13897716 ]

Markus Jelsma commented on SOLR-4260:
-------------------------------------

[~markrmil...@gmail.com] I just checked out the shards again: on one cluster, one replica has one document more (or less). They are out of sync again. I can open a new issue, but it's really the same discussion as here. What do you think, reopen or new?
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875755#comment-13875755 ]

Mark Miller commented on SOLR-4260:
-----------------------------------

Thanks everyone. I'll make a new JIRA issue to properly fix this. I'm not sure we should remove this logic - it's a good failsafe - but ideally we don't run out of runners while there are still updates in the queue. Calling blockUntilFinished is not supposed to be required to make sure the queue is emptied.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874841#comment-13874841 ]

Mark Miller commented on SOLR-4260:
-----------------------------------

Thanks Shawn - fixed.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874839#comment-13874839 ]

ASF subversion and git services commented on SOLR-4260:
-------------------------------------------------------

Commit 1559125 from [~markrmil...@gmail.com] in branch 'dev/branches/lucene_solr_4_6'
[ https://svn.apache.org/r1559125 ]

SOLR-4260: Bring back import still used on 4.6 branch.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874835#comment-13874835 ]

Joel Bernstein commented on SOLR-4260:
--------------------------------------

Ok, just had two clean test runs with trunk. The NPE is no longer occurring and the leaders and replicas are in sync. Running through some more stress tests this morning, but so far so good.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874686#comment-13874686 ]

Mikhail Khludnev commented on SOLR-4260:
----------------------------------------

What a great hunt, guys! Thanks a lot!
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874667#comment-13874667 ]

Markus Jelsma commented on SOLR-4260:
-------------------------------------

I believe the whole building now knows I cannot reproduce the problem!
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874434#comment-13874434 ]

Shawn Heisey commented on SOLR-4260:
------------------------------------

This might be old news by now, but I noticed it while updating my test system, so I'm reporting it. The lucene_solr_4_6 branch fails to compile with these fixes committed. One of the changes removes the import for RemoteSolrException from SolrCmdDistributor, but the doRetries method still uses this exception. That method is very different in 4.6 from what it is in branch_4x. Everything's good on branch_4x.

Re-adding the import fixes the problem, but the discrepancy between the two branches needs some investigation. The specific code that fails to compile with the removed import seems to have been initially added to trunk by revision 1545464 (2013/11/25) and removed from trunk by revision 1546670 (2013/11/29). It was then re-added to lucene_solr_4_6 by revision 1554122 (2013/12/29).
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874364#comment-13874364 ]

Mark Miller commented on SOLR-4260:
-----------------------------------

This is a fine fix for SolrCloud, especially for 4.6.1 - but there may be a better general fix hidden still - what seems to happen is that we have docs that enter the queue that don't spawn a runner. The current fix means docs can be added that will sit in the queue until you call blockUntilFinished.
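The race being discussed can be modeled outside Solr. The sketch below is a toy stand-in for the ConcurrentUpdateSolrServer runner pool - the class and method bodies here are illustrative, not the real CUSS internals: docs are queued, runner threads drain the queue and exit when they see it empty, and blockUntilFinished applies the committed failsafe of starting a new Runner whenever the queue is non-empty but no Runners are alive.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the runner-pool race; NOT the real ConcurrentUpdateSolrServer.
public class RunnerModel {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final AtomicInteger runners = new AtomicInteger();
    private final AtomicInteger processed = new AtomicInteger();

    void add(String doc) {
        queue.add(doc);
        // This decision races with a runner that has just seen an empty queue
        // and is exiting: a doc can land in the queue with no runner alive.
        if (runners.get() == 0) spawnRunner();
    }

    private void spawnRunner() {
        runners.incrementAndGet();
        pool.execute(() -> {
            try {
                // Drain until the queue looks empty, then exit.
                while (queue.poll() != null) {
                    processed.incrementAndGet(); // stand-in for sending the doc
                }
            } finally {
                runners.decrementAndGet();
            }
        });
    }

    // The committed failsafe: restart a Runner while work remains, so stranded
    // docs get flushed instead of sitting in the queue indefinitely.
    void blockUntilFinished() throws InterruptedException {
        while (!queue.isEmpty() || runners.get() > 0) {
            if (runners.get() == 0 && !queue.isEmpty()) {
                spawnRunner(); // without this, we could wait here forever
            }
            Thread.sleep(5);
        }
        pool.shutdown();
    }

    int processedCount() { return processed.get(); }

    public static void main(String[] args) throws InterruptedException {
        RunnerModel m = new RunnerModel();
        for (int i = 0; i < 1000; i++) m.add("doc" + i);
        m.blockUntilFinished();
        System.out.println(m.processedCount()); // all docs drained: prints 1000
    }
}
```

Whether the proper fix belongs in blockUntilFinished or on the add path (so a doc can never enter the queue without a runner to drain it) is exactly the open question here.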
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874352#comment-13874352 ]

ASF subversion and git services commented on SOLR-4260:
-------------------------------------------------------

Commit 1558998 from [~markrmil...@gmail.com] in branch 'dev/branches/lucene_solr_4_6'
[ https://svn.apache.org/r1558998 ]

SOLR-4260: If in blockUntilFinished and there are no Runners running and the queue is not empty, start a new Runner.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874351#comment-13874351 ]

Mark Miller commented on SOLR-4260:
-----------------------------------

Committed something for that. As a separate issue, it seems to me that CUSS#shutdown should probably call blockUntilFinished as its first order of business.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874348#comment-13874348 ]

ASF subversion and git services commented on SOLR-4260:
-------------------------------------------------------

Commit 1558997 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1558997 ]

SOLR-4260: If in blockUntilFinished and there are no Runners running and the queue is not empty, start a new Runner.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874345#comment-13874345 ]

ASF subversion and git services commented on SOLR-4260:
-------------------------------------------------------

Commit 1558996 from [~markrmil...@gmail.com] in branch 'dev/trunk'
[ https://svn.apache.org/r1558996 ]

SOLR-4260: If in blockUntilFinished and there are no Runners running and the queue is not empty, start a new Runner.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874340#comment-13874340 ]

Mark Miller commented on SOLR-4260:
-----------------------------------

ChaosMonkeyNothingIsSafeTest is exposing an issue now with ConcurrentUpdateSolrServer - it looks like it's getting stuck in blockUntilFinished because the queue is not empty and no runners are being spawned to empty it. It may be that the NPE that occurred before in this case just kept the docs from being lost 'silently', and this is closer to the actual bug?
-- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874323#comment-13874323 ] ASF subversion and git services commented on SOLR-4260: --- Commit 1558988 from [~markrmil...@gmail.com] in branch 'dev/branches/lucene_solr_4_6' [ https://svn.apache.org/r1558988 ] SOLR-4260: Guard against NPE.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874321#comment-13874321 ] ASF subversion and git services commented on SOLR-4260: --- Commit 1558986 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1558986 ] SOLR-4260: Guard against NPE.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874320#comment-13874320 ] ASF subversion and git services commented on SOLR-4260: --- Commit 1558985 from [~markrmil...@gmail.com] in branch 'dev/trunk' [ https://svn.apache.org/r1558985 ] SOLR-4260: Guard against NPE.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874317#comment-13874317 ] ASF subversion and git services commented on SOLR-4260: --- Commit 1558983 from [~markrmil...@gmail.com] in branch 'dev/branches/lucene_solr_4_6' [ https://svn.apache.org/r1558983 ] SOLR-4260: Add name to CHANGES
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874311#comment-13874311 ] Mark Miller commented on SOLR-4260: --- bq. The conditions in this statement have changed and I think made it possible for the null pointer to appear.
Ah, nice - thanks. I had already made some changes, so I couldn't line up the source lines - I thought you meant the line that was the culprit was the one the NPE came from. I'll take a closer look.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874314#comment-13874314 ] ASF subversion and git services commented on SOLR-4260: --- Commit 1558982 from [~markrmil...@gmail.com] in branch 'dev/branches/lucene_solr_4_6' [ https://svn.apache.org/r1558982 ] SOLR-4260: ConcurrentUpdateSolrServer#blockUntilFinished can return before all previously added updates have finished. This could cause distributed updates meant for replicas to be lost.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874305#comment-13874305 ] ASF subversion and git services commented on SOLR-4260: --- Commit 1558981 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1558981 ] SOLR-4260: Add name to CHANGES
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874301#comment-13874301 ] Mark Miller commented on SOLR-4260: --- Well, this is important for 4.6.1 - given Potter's feedback, in it goes. Please help test and review this, guys - especially around this possible NPE.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874302#comment-13874302 ] ASF subversion and git services commented on SOLR-4260: --- Commit 1558979 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1558979 ] SOLR-4260: ConcurrentUpdateSolrServer#blockUntilFinished can return before all previously added updates have finished. This could cause distributed updates meant for replicas to be lost.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874303#comment-13874303 ] ASF subversion and git services commented on SOLR-4260: --- Commit 1558980 from [~markrmil...@gmail.com] in branch 'dev/trunk' [ https://svn.apache.org/r1558980 ] SOLR-4260: Add name to CHANGES
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874300#comment-13874300 ] ASF subversion and git services commented on SOLR-4260: --- Commit 1558978 from [~markrmil...@gmail.com] in branch 'dev/trunk' [ https://svn.apache.org/r1558978 ] SOLR-4260: ConcurrentUpdateSolrServer#blockUntilFinished can return before all previously added updates have finished. This could cause distributed updates meant for replicas to be lost.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874294#comment-13874294 ] Joel Bernstein commented on SOLR-4260: --- It's actually the runner that is null:
{code}
runner.runnerLock.lock();
{code}
The conditions in this statement have changed, and I think that made it possible for the null pointer to appear:
{code}
if ((runner == null && queue.isEmpty()) || scheduler.isTerminated())
{code}
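[Editor's note] To make Joel's point concrete: with waitForEmptyQueue, the break condition requires both runner == null and an empty queue, so when the queue still has entries but no runner is alive, the loop falls through to runner.runnerLock.lock() on a null runner. The decision table below is an illustrative sketch of that logic (the class and enum names are invented; only the boolean condition mirrors the snippet above):

```java
// Illustrative decision table for the wait loop's break condition; not Solr code.
class WaitLoopLogic {
    enum Action { BREAK, RETRY, WAIT_ON_RUNNER }

    static Action next(boolean runnerAlive, boolean queueEmpty,
                       boolean waitForEmptyQueue, boolean terminated) {
        if (terminated) return Action.BREAK;
        if (waitForEmptyQueue) {
            if (!runnerAlive && queueEmpty) return Action.BREAK;
            if (!runnerAlive) {
                // Docs queued but no runner alive: the unguarded loop
                // dereferences the null runner exactly here (the NPE).
                // A guard must retry (or respawn a runner) instead.
                return Action.RETRY;
            }
        } else if (!runnerAlive) {
            return Action.BREAK;
        }
        // Runner alive: wait on runner.runnerLock.lock()/unlock().
        return Action.WAIT_ON_RUNNER;
    }
}
```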
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874270#comment-13874270 ] Mark Miller commented on SOLR-4260: --- Strange Joel - queue and scheduler are both final and set in the constructor.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874204#comment-13874204 ] Joel Bernstein commented on SOLR-4260: --- In the code snippet below, it looks like runners.peek() can return null and cause the exception:
{code}
public synchronized void blockUntilFinished(boolean waitForEmptyQueue) {
  lock = new CountDownLatch(1);
  try {
    // Wait until no runners are running
    for (;;) {
      Runner runner;
      synchronized (runners) {
        runner = runners.peek();
      }
      if (waitForEmptyQueue) {
        if ((runner == null && queue.isEmpty()) || scheduler.isTerminated())
          break;
      } else {
        if (runner == null || scheduler.isTerminated())
          break;
      }
      runner.runnerLock.lock();
      runner.runnerLock.unlock();
    }
  } finally {
    lock.countDown();
    lock = null;
  }
}
{code}
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874194#comment-13874194 ] Joel Bernstein commented on SOLR-4260: --- I installed the patch and ran it. I'm getting some intermittent null pointers:
{code}
1578995 [qtp433857665-17] ERROR org.apache.solr.servlet.SolrDispatchFilter – null:java.lang.NullPointerException
	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:401)
	at org.apache.solr.update.StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:99)
	at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:69)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:606)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1449)
	at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:179)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1915)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:764)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:203)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
{code}
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874120#comment-13874120 ] Timothy Potter commented on SOLR-4260: -- Did another couple of million docs in an oversharded env. 24 replicas on 6 nodes (m1.mediums so I didn't want to overload them too much) ... still looking good.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874098#comment-13874098 ] Timothy Potter commented on SOLR-4260: -- So far so good, Mark! I applied the patch to the latest rev of branch_4x and have indexed about 3M docs without hitting the issue; before the patch, I would see it within a few minutes. The jury is still out and I'll keep stress testing, but it looks promising. Nice work!
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874031#comment-13874031 ] Mark Miller commented on SOLR-4260: --- That's all I have come up with so far - though I'm not even completely sold on it. Because we are using CUSS with a single thread, all the previous doc adds should have hit the request method and so a Runner should be going for them if necessary. It's all pretty tricky logic to understand clearly though.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873964#comment-13873964 ] Mark Miller commented on SOLR-4260: --- For a long time, I've wanted to try putting in a check that the queue is empty as well for blockUntilFinished when we use it in this case - I just need a test that sees this so I can check if it works :) Without that, it seems there is a window where we can bail before we are done sending everything in the queue. Shutdown doesn't help much, because it can't even wait for the executor to shutdown in this case.
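Mark's suspected window can be modeled in miniature. The sketch below is a hypothetical, self-contained Java model - not the real ConcurrentUpdateSolrServer code, and every class and method name in it is invented for illustration. It shows why a wait that only watches the active-runner count can return while an update is still sitting in the queue, and why the extra queue-empty check Mark proposes closes that window (provided something also restarts a runner to drain the queue).

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the suspected CUSS race; NOT actual Solr code.
public class QueueDrainSketch {
    static final Queue<String> queue = new ConcurrentLinkedQueue<>();
    static final AtomicInteger activeRunners = new AtomicInteger(0);

    // Flawed wait: returns as soon as no runner is active, without
    // checking whether the queue itself has been drained.
    static void blockUntilFinishedFlawed() throws InterruptedException {
        while (activeRunners.get() > 0) {
            Thread.sleep(1);
        }
    }

    // Hardened wait along the lines Mark suggests: also require the
    // queue to be empty. (A real fix must additionally restart a runner
    // when the queue is non-empty, or this loop would spin forever.)
    static void blockUntilFinishedWithQueueCheck() throws InterruptedException {
        while (activeRunners.get() > 0 || !queue.isEmpty()) {
            Thread.sleep(1);
        }
    }

    public static void main(String[] args) throws Exception {
        // A doc is queued just after the last runner exited: no runner
        // is active, so the flawed wait sees nothing to wait for.
        queue.add("doc-182866");
        blockUntilFinishedFlawed();
        System.out.println("flawed wait returned with " + queue.size()
                + " doc(s) still queued");
    }
}
```

The flawed variant returns immediately here even though one document was never sent, which matches the leader logs elsewhere in this thread where blockUntilFinished starts and finishes in the same millisecond while docs are still queued.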
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873951#comment-13873951 ] Timothy Potter commented on SOLR-4260: -- Added some more logging on the leader ... as a bit of context, the replica received doc with ID 41029 and then 41041 and didn't receive 41033 and 41038 in between ... here's the log on the leader of activity between 41029 and 41041.
2014-01-16 16:03:02,523 [updateExecutor-1-thread-1] INFO solrj.impl.ConcurrentUpdateSolrServer - sent docs to [http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1], 41003, 41005, 41007, 41010, 41014, 41015, 41026, 41029
2014-01-16 16:03:02,527 [qtp417447538-16] INFO handler.loader.JavabinLoader - test3_shard3_replica2 add: 41033
2014-01-16 16:03:02,527 [qtp417447538-16] INFO update.processor.DistributedUpdateProcessor - doLocalAdd 41033
2014-01-16 16:03:02,527 [qtp417447538-16] INFO solrj.impl.ConcurrentUpdateSolrServer - test3_shard3_replica2 queued (to: http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1): 41033
2014-01-16 16:03:02,528 [qtp417447538-16] INFO handler.loader.JavabinLoader - test3_shard3_replica2 add: 41038
2014-01-16 16:03:02,528 [qtp417447538-16] INFO update.processor.DistributedUpdateProcessor - doLocalAdd 41038
2014-01-16 16:03:02,528 [qtp417447538-16] INFO solrj.impl.ConcurrentUpdateSolrServer - test3_shard3_replica2 queued (to: http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1): 41038
2014-01-16 16:03:02,559 [qtp417447538-16] INFO solrj.impl.ConcurrentUpdateSolrServer - blockUntilFinished starting http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1
2014-01-16 16:03:02,559 [qtp417447538-16] INFO solrj.impl.ConcurrentUpdateSolrServer - blockUntilFinished is done for http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1
2014-01-16 16:03:02,559 [qtp417447538-16] INFO solrj.impl.ConcurrentUpdateSolrServer - shutting down CUSS for http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1
2014-01-16 16:03:02,559 [qtp417447538-16] INFO solrj.impl.ConcurrentUpdateSolrServer - shut down CUSS for http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1
Not quite sure what this means, but I think your hunch about blockUntilFinished being involved is getting warmer.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873888#comment-13873888 ] Timothy Potter commented on SOLR-4260: -- bq. So strange - it would be a different CUSS instance used for each server right, I was just mentioning that I did check to make sure there wasn't a bug in the routing logic or anything like that ... agreed on the need for a unit test to reproduce this and am working on the same.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873882#comment-13873882 ] Mark Miller commented on SOLR-4260: --- I have various theories, but without a test that fails, it's hard to test out anything - so I've been putting most of my efforts into a unit test that can get this, but it's been surprisingly difficult for me to trigger in a test.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873874#comment-13873874 ] Joel Bernstein commented on SOLR-4260: -- That was a blind alley; a faulty test was causing the effect I described above.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873868#comment-13873868 ] Mark Miller commented on SOLR-4260: --- bq. I've checked the logs on all the other replicas and the docs didn't go there either. So strange - it would be a different CUSS instance used for each server
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873851#comment-13873851 ] Shikhar Bhushan commented on SOLR-4260: --- This may be unrelated - I have not done much digging or looked at the full context, but I was just looking at CUSS out of curiosity. Why do we flush() the OutputStream, but then write() stuff like ending tags? Shouldn't the flush be after all those write()s? https://github.com/apache/lucene-solr/blob/lucene_solr_4_6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.java#L205
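Shikhar's ordering concern can be demonstrated with plain JDK streams. This is a generic sketch of buffered-stream semantics, not the actual CUSS request-writing code: bytes written after the last flush() sit in the buffer until the next flush or close, so trailing tags written post-flush reach the wire later than the body.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Generic demonstration of flush ordering on a buffered stream.
public class FlushOrderSketch {
    public static void main(String[] args) throws IOException {
        // The ByteArrayOutputStream stands in for the network socket.
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        BufferedOutputStream out = new BufferedOutputStream(wire);

        out.write("<docs>".getBytes());
        out.flush();                      // body reaches the "wire" now
        out.write("</docs>".getBytes());  // trailing tag is still buffered

        System.out.println("on the wire after flush: " + wire);
        out.close();                      // close() flushes the remainder
        System.out.println("on the wire after close: " + wire);
    }
}
```

If the real writer does its final write()s after the last flush(), those bytes are only pushed out when the stream is eventually flushed or closed - which would be consistent with documents appearing on the replica only after some later event.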
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873841#comment-13873841 ] Timothy Potter commented on SOLR-4260: -- I was able to reproduce this issue on EC2 without any over-sharding (on the latest rev of branch_4x) ... basically 6 Solr nodes with 3 shards and RF=2, i.e. each replica gets its own Solr instance. Here's the output from my client app that traps the inconsistency:
>> Found 1 shard with mis-matched doc counts. At January 16, 2014 12:18:08 PM MST shard2: {
http://ec2-54-236-245-61.compute-1.amazonaws.com:8985/solr/test_shard2_replica2/ = 62984 LEADER
http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test_shard2_replica1/ = 62980 diff:4
}
Details: shard2
>> finished querying leader, found 62984 documents (62984)
>> finished querying http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test_shard2_replica1/, found 62980 documents
Doc [182866] not found in replica: 182866test-12573452420.926573630.5259114828332452this is a test1457415570117885953
Doc [182859] not found in replica: 182859test9913669090.53117160.10846350752086309this is a test1457415570117885952
Doc [182872] not found in replica: 182872test8245128970.8303660.6560223698806142this is a test1457415570117885954
Doc [182876] not found in replica: 182876test-16578314730.48779650.9214420679315872this is a test1457415570117885955
Sending hard commit after mis-match and then will wait for user to handle it ... <<
So four missing docs: 182866, 182859, 182872, 182876. Now I'm thinking this might be in the ConcurrentUpdateSolrServer logic. I added some detailed logging to show when JavabinLoader unmarshals a doc and when it is offered on the CUSS queue (to be sent to the replica).
On the leader, here's the log around some messages that were lost:
2014-01-16 14:16:37,534 [qtp417447538-17] INFO handler.loader.JavabinLoader - test_shard2_replica2 add: 182857
2014-01-16 14:16:37,534 [qtp417447538-17] INFO solrj.impl.ConcurrentUpdateSolrServer - test_shard2_replica2 queued: 182857
2014-01-16 14:16:37,552 [qtp417447538-17] INFO handler.loader.JavabinLoader - test_shard2_replica2 add: 182859
2014-01-16 14:16:37,552 [qtp417447538-17] INFO solrj.impl.ConcurrentUpdateSolrServer - test_shard2_replica2 queued: 182859
2014-01-16 14:16:37,552 [qtp417447538-17] INFO handler.loader.JavabinLoader - test_shard2_replica2 add: 182866
2014-01-16 14:16:37,552 [qtp417447538-17] INFO solrj.impl.ConcurrentUpdateSolrServer - test_shard2_replica2 queued: 182866
2014-01-16 14:16:37,552 [qtp417447538-17] INFO handler.loader.JavabinLoader - test_shard2_replica2 add: 182872
2014-01-16 14:16:37,552 [qtp417447538-17] INFO solrj.impl.ConcurrentUpdateSolrServer - test_shard2_replica2 queued: 182872
2014-01-16 14:16:37,552 [qtp417447538-17] INFO handler.loader.JavabinLoader - test_shard2_replica2 add: 182876
2014-01-16 14:16:37,552 [qtp417447538-17] INFO solrj.impl.ConcurrentUpdateSolrServer - test_shard2_replica2 queued: 182876
2014-01-16 14:16:37,558 [qtp417447538-17] INFO update.processor.LogUpdateProcessor - [test_shard2_replica2] webapp=/solr path=/update params={wt=javabin&version=2} {add=[182704 (1457415570048679936), 182710 (1457415570049728512), 182711 (1457415570049728513), 182717 (1457415570056019968), 182720 (1457415570056019969), 182722 (1457415570057068544), 182723 (1457415570057068545), 182724 (1457415570058117120), 182730 (1457415570058117121), 182735 (1457415570059165696), ... (61 adds)]} 0 72
2014-01-16 14:16:37,764 [qtp417447538-17] INFO handler.loader.JavabinLoader - test_shard2_replica2 add: 182880
2014-01-16 14:16:37,764 [qtp417447538-17] INFO solrj.impl.ConcurrentUpdateSolrServer - test_shard2_replica2 queued: 182880
As you can see, the leader received the doc with ID 182859 at 2014-01-16 14:16:37,552 and then queued it on the CUSS queue to be sent to the replica. On the replica, the log shows it receiving 182857 and then 182880 ... the 4 missing docs (182866, 182859, 182872, 182876) were definitely queued in CUSS on the leader. I've checked the logs on all the other replicas and the docs didn't go there either.
2014-01-16 14:16:37,292 [qtp417447538-14] INFO handler.loader.JavabinLoader - test_shard2_replica1 add: 182857
2014-01-16 14:16:37,293 [qtp417447538-14] INFO update.processor.LogUpdateProcessor - [test_shard2_replica1] webapp=/solr path=/update params={distrib.from=http://ec2-54-236-245-61.compute-1.amazonaws.com:8985/solr/test_shard2_replica2/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[182841 (1457415570096914432), 182842 (1457415570096914433), 182843 (1457415570096914434), 182844 (1457415570096914435), 182846 (1457415570097963008), 182848 (1457415570097963009), 182850 (1457415570099011584), 182854 (145741557009
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873788#comment-13873788 ] Joel Bernstein commented on SOLR-4260: -- I'm betting it's something in the streaming. This afternoon I'm going to put some debugging in to see if the docs being flushed by the commit were already written to the stream. My bet is that they were, and that the commit is pushing them all the way through to the replica.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873657#comment-13873657 ] Mark Miller commented on SOLR-4260: --- I spent some time a while back trying to find a fault in ConcurrentUpdateSolrServer#blockUntilFinished - didn't uncover anything yet though.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873638#comment-13873638 ] Joel Bernstein commented on SOLR-4260: -- The commit behavior is interesting. I'm seeing docs flushing from the leader to replica following a manual hard commit issued long after indexing has stopped. That means somewhere along the way docs are buffered and waiting for an event to flush them to the replica. I haven't figured out just yet where the buffering is occurring but I'm trying to track it down.
-- This message was sent by Atlassian JIRA (v6.1.5#6160) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
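The behavior Joel describes (docs accepted by the leader sitting in a buffer until a much later commit pushes them to the replica) can be pictured with a toy model. This is purely an illustration of the symptom, not Solr's actual replication code; all names here are made up:

```python
class ReplicaLink:
    """Toy model of a leader->replica pipe where docs wait in a
    buffer until a commit-like event flushes them across."""

    def __init__(self):
        self.buffer = []   # docs accepted by the leader, not yet forwarded
        self.replica = []  # docs the replica has actually received

    def add(self, doc):
        self.buffer.append(doc)

    def commit(self):
        # A manual hard commit flushes everything still buffered.
        self.replica.extend(self.buffer)
        self.buffer.clear()


link = ReplicaLink()
for d in ("doc1", "doc2", "doc3"):
    link.add(d)

# Indexing has stopped, but the replica still lags the leader...
print(len(link.replica))  # -> 0

link.commit()             # ...until a manual hard commit arrives
print(len(link.replica))  # -> 3
```

The open question in the thread is exactly which component plays the role of that buffer.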
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873457#comment-13873457 ] Markus Jelsma commented on SOLR-4260: - Seems autocommit has something to do with triggering the problem, at least in my case.
* 13th build without autocommit: out of sync very soon
* 13th build with autocommit: out of sync after a while
* 6th build without autocommit: out of sync after a while
* 6th build with autocommit: out of sync after many more documents
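For reference, autocommit is configured in Solr's solrconfig.xml. A typical block looks like the following; the interval values are illustrative, not the reporter's actual settings:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flushes the update log to stable storage;
       openSearcher=false avoids reopening searchers on every flush -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: makes new docs visible to searches sooner -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```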
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873334#comment-13873334 ] Markus Jelsma commented on SOLR-4260: - Correction: it happens on a build from the 6th as well, although it doesn't look as bad as when indexing to a build from the 13th.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873303#comment-13873303 ] Markus Jelsma commented on SOLR-4260: - Did something crucial change recently? Since at least the 14th, maybe earlier, when indexing small segments from Nutch in several cycles (a few hundred docs per cycle), some shards get out of sync really quickly! I did lots of tests before that but didn't see it happening before.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872666#comment-13872666 ] Mark Miller commented on SOLR-4260: --- Well, the effects I was seeing related to having a control collection with a core named collection1 and another collection also called collection1, oversharded, and that causes some similar-looking effects. I've addressed that and will see if ramping up my tests can spot anything - so far I cannot replicate it in a test though.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871878#comment-13871878 ] Markus Jelsma commented on SOLR-4260: - Mark, no, each node holds a single JVM and single core.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870984#comment-13870984 ] Mark Miller commented on SOLR-4260: --- [~markus17], are you indexing to an oversharded cluster?
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870868#comment-13870868 ] Markus Jelsma commented on SOLR-4260: - I also think I'm seeing this happening right now with a trunk build from yesterday. I have been slowly indexing a few hundred docs every few minutes for quite some time while fixing a Nutch issue. Looks like I can restart it, because replicas are already out of sync :)
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870859#comment-13870859 ] Mark Miller commented on SOLR-4260: --- FYI, I also had to overshard to see anything.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870865#comment-13870865 ] Markus Jelsma commented on SOLR-4260: - Mark - We use CloudSolrServer and send batches of around 380 documents from Nutch. I am not sure which actual implementation we get back when connecting.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870858#comment-13870858 ] Mark Miller commented on SOLR-4260: --- [~markus17], are you loading docs via the bulk methods or CUSS or what? [~tim.potter], I think I'm seeing your issue. I have not gotten to the bottom of it yet, but if I am seeing the same thing, it seems those docs are being set up to be sent to 0 replicas. Trying to figure out why/how.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870197#comment-13870197 ] Timothy Potter commented on SOLR-4260: -- Oddly enough, just 1 indexing thread on the client side and batches of around 30-40 docs per shard (i.e. I set my batch size so that direct updates send about 30-40 docs per shard to the leaders from the client side).
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870068#comment-13870068 ] Mark Miller commented on SOLR-4260: --- How many threads are you using to load docs? How large are the batches?
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869916#comment-13869916 ] Timothy Potter commented on SOLR-4260: -- Makes sense about not waiting because of the penalty, now that I've had a chance to get into the details of that code. I spent a lot of time on Friday and over the weekend trying to track down the docs getting dropped. Unfortunately I have not been able to find the source of the issue yet. I'm fairly certain the issue happens before docs get submitted to CUSS, meaning that the lost docs never seem to hit the queue in ConcurrentUpdateSolrServer. My original thinking was that, given the complex nature of CUSS, there might be some sort of race condition, but after adding a log of what hit the queue, it seems that the documents that get lost never hit the queue. Not to mention that the actual use of CUSS is mostly single-threaded, because StreamingSolrServers constructs them with a threadCount of 1.
As a side note, one thing I noticed is that direct updates don't necessarily hit the correct core initially when a Solr node hosts more than one shard per collection. In other words, if host X has shard1 and shard3 of collection foo, then some update requests would hit shard1 on host X when they should go to shard3 on the same host; shard1 correctly forwards them on, but it's still an extra hop. Of course that is probably not a big deal in production, as it would be rare to host multiple shards of the same collection on the same Solr host unless you are over-sharding.
In terms of this issue, here's what I'm seeing. Assume a SolrCloud environment where shard1 has replicas on hosts A and B, and A is the current leader:
* The client sends a direct update request to shard1 on host A containing 3 docs (1, 2, 3, for example).
* The batch from the client gets broken up into individual docs (during request parsing).
* Docs 1, 2, 3 get indexed on host A (the leader).
* Docs 1 and 2 get queued into CUSS and sent on to the replica on host B (sometimes in the same request, sometimes in separate requests).
* Doc 3 never makes it and, from what I can tell, never hits the queue.
This may be anecdotal, but from what I can tell it's always docs at the end of a batch and not in the middle. Meaning that I haven't seen a case where 1 and 3 make it and 2 does not ... maybe useful, maybe not. The only other thing I'll mention is that it does seem timing / race condition related, as it's almost impossible to reproduce this on my Mac when running 2 shards across 2 nodes, but much easier to trigger if I ramp up to, say, 8 shards on 2 nodes, i.e. the busier my CPU is, the easier it is to see docs getting dropped.
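Drift like the above can be confirmed from the outside by querying each core directly with `q=*:*&rows=0&distrib=false` (so every core reports only its own index) and comparing numDocs per shard. A minimal sketch of the comparison step, with the HTTP fetch left out so the logic stands alone (all names here are illustrative):

```python
def find_out_of_sync(shard_counts):
    """Given {shard: {core_url: numDocs}}, return the entries for
    shards whose replicas disagree on document count.

    The counts are assumed to come from hitting each core's /select
    handler with q=*:*&rows=0&distrib=false, so each core reports
    its own numFound rather than a distributed result.
    """
    return {
        shard: counts
        for shard, counts in shard_counts.items()
        if len(set(counts.values())) > 1
    }


# Example matching the symptom in this thread: shard1's replicas
# deviate by a handful of docs while shard2's agree.
counts = {
    "shard1": {"http://hostA/solr/shard1": 330412,
               "http://hostB/solr/shard1": 330399},
    "shard2": {"http://hostA/solr/shard2": 330120,
               "http://hostB/solr/shard2": 330120},
}
print(sorted(find_out_of_sync(counts)))  # -> ['shard1']
```

Note that any comparison like this is only meaningful after a hard commit on all replicas, since uncommitted docs won't show up in numDocs.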
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869137#comment-13869137 ] Mark Miller commented on SOLR-4260: --- SOLR-5625: Add to testing for SolrCmdDistributor
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869134#comment-13869134 ] Mark Miller commented on SOLR-4260: --- In this case there is no wait due to the massive penalty it puts on doc-per-request speed.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866031#comment-13866031 ] Timothy Potter commented on SOLR-4260: -- Cuss on CUSS ;-) Thanks, I sometimes forget that the client-side batch gets broken into individual AddUpdateCommands when sending to the replicas.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865984#comment-13865984 ] Yonik Seeley commented on SOLR-4260:
{code}
Basically, there are 34 docs on the leader and only 25 processed in 4 separate batches (from my counting of the logs) on the replica. Why wouldn't it just be one for one? The docs are all roughly the same size ... and what's breaking it up?
{code}
ConcurrentUpdateSolrServer? If another doc doesn't come in quickly enough (250ms by default), it ends the batch. I thought there used to be a doc count limit too or something... but after a quick scan, I'm not seeing it.
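The idle-timeout behavior described above can be sketched with a small, deterministic model. This is illustrative only — it is not ConcurrentUpdateSolrServer's actual code, and the class and method names are invented; the 250ms figure comes from the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IdleTimeoutBatcher {
    /**
     * Toy model of idle-timeout batching: docs[i] arrives delaysMs[i] after
     * docs[i-1]; whenever the inter-arrival gap exceeds idleMs the current
     * batch is closed, the way a queue-draining runner ends a batch when its
     * poll times out.
     */
    static List<List<String>> split(List<String> docs, List<Long> delaysMs, long idleMs) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (i > 0 && delaysMs.get(i) > idleMs && !current.isEmpty()) {
                batches.add(current);          // gap too long: end the batch
                current = new ArrayList<>();
            }
            current.add(docs.get(i));
        }
        if (!current.isEmpty()) batches.add(current);
        return batches;
    }

    public static void main(String[] args) {
        // Doc IDs borrowed from the log excerpts in this thread.
        List<String> docs = Arrays.asList("82900", "82901", "82903", "82904", "82912");
        // Gaps in ms between consecutive docs; the 300ms pauses exceed 250ms.
        List<Long> delays = Arrays.asList(0L, 10L, 300L, 10L, 300L);
        System.out.println(split(docs, delays, 250));
        // → [[82900, 82901], [82903, 82904], [82912]]
    }
}
```

In this model the two 300ms pauses exceed the 250ms idle cutoff, so five adds split into three batches — the same effect that would turn one 34-doc batch on the leader into several sub-batches on the replica.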
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865963#comment-13865963 ] Timothy Potter commented on SOLR-4260: -- Still digging into it ... I'm curious why a batch of 34 adds on the leader gets processed as several sub-batches on the replica? Here's what I'm seeing in the logs around the documents that are missing from the replica. Basically, there are 34 docs on the leader and only 25 processed in 4 separate batches (from my counting of the logs) on the replica. Why wouldn't it just be one for one? The docs are all roughly the same size ... and what's breaking it up? Having trouble seeing that in the logs ;-)

On the leader:

2014-01-08 12:23:21,501 [qtp604104855-17] INFO update.processor.LogUpdateProcessor - [demo_shard1_replica1] webapp=/solr path=/update params={wt=javabin&version=2} {add=[82900 (1456683668174012416), 82901 (1456683668181352448), 82903 (1456683668181352449), 82904 (1456683668181352450), 82912 (1456683668187643904), 82913 (1456683668188692480), 82914 (1456683668188692481), 82916 (1456683668188692482), 82917 (1456683668188692483), 82918 (1456683668188692484), ... (34 adds)]} 0 34

2014-01-08 12:23:21,600 [qtp604104855-17] INFO update.processor.LogUpdateProcessor - [demo_shard1_replica1] webapp=/solr path=/update params={wt=javabin&version=2} {add=[83002 (1456683668280967168), 83005 (1456683668286210048), 83008 (1456683668286210049), 83011 (1456683668286210050), 83012 (1456683668286210051), 83013 (1456683668287258624), 83018 (1456683668287258625), 83019 (1456683668289355776), 83023 (1456683668289355777), 83024 (1456683668289355778), ... (43 adds)]} 0 32

On the replica:

2014-01-08 12:23:21,126 [qtp604104855-22] INFO update.processor.LogUpdateProcessor - [demo_shard1_replica2] webapp=/solr path=/update params={distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[82900 (1456683668174012416), 82901 (1456683668181352448), 82903 (1456683668181352449)]} 0 1

2014-01-08 12:23:21,134 [qtp604104855-22] INFO update.processor.LogUpdateProcessor - [demo_shard1_replica2] webapp=/solr path=/update params={distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[82904 (1456683668181352450), 82912 (1456683668187643904), 82913 (1456683668188692480), 82914 (1456683668188692481), 82916 (1456683668188692482), 82917 (1456683668188692483), 82918 (1456683668188692484), 82919 (1456683668188692485), 82922 (1456683668188692486)]} 0 2

2014-01-08 12:23:21,139 [qtp604104855-22] INFO update.processor.LogUpdateProcessor - [demo_shard1_replica2] webapp=/solr path=/update params={distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[82923 (1456683668188692487), 82926 (1456683668190789632), 82928 (1456683668190789633), 82932 (1456683668190789634), 82939 (1456683668192886784), 82945 (1456683668192886785), 82946 (1456683668192886786), 82947 (1456683668193935360), 82952 (1456683668193935361), 82962 (1456683668193935362), ... (12 adds)]} 0 3

2014-01-08 12:23:21,144 [qtp604104855-22] INFO update.processor.LogUpdateProcessor - [demo_shard1_replica2] webapp=/solr path=/update params={distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[82967 (1456683668199178240)]} 0 0

9 docs missing here

2014-01-08 12:23:21,227 [qtp604104855-22] INFO update.processor.LogUpdateProcessor - [demo_shard1_replica2] webapp=/solr path=/update params={distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[83002 (1456683668280967168), 83005 (1456683668286210048), 83008 (1456683668286210049), 83011 (1456683668286210050), 83012 (1456683668286210051), 83013 (1456683668287258624)]} 0 2
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865769#comment-13865769 ] Mark Miller commented on SOLR-4260: --- No, wait, it could jibe. We only check the last 99 docs on peer sync - if a bunch of docs just didn't show up well before that, it wouldn't be detected by peer sync. I still think SolrCmdDistributor is the first place to look.
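The peer-sync window point above can be illustrated with a toy model. This is hypothetical code, not Solr's actual PeerSync; the 99-version window size is taken from the comment above, and the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class PeerSyncWindow {
    /**
     * Toy model: each replica tracks the version numbers of its updates.
     * Sync only compares the most recent `window` versions from the peer,
     * so a doc lost far enough back in history goes unnoticed.
     */
    static boolean looksInSync(TreeSet<Long> peer, TreeSet<Long> replica, int window) {
        List<Long> recent = new ArrayList<>(peer.descendingSet())
                .subList(0, Math.min(window, peer.size()));
        return replica.containsAll(recent);
    }

    public static void main(String[] args) {
        TreeSet<Long> leader = new TreeSet<>();
        TreeSet<Long> replica = new TreeSet<>();
        for (long v = 1; v <= 1000; v++) {
            leader.add(v);
            if (v != 5) replica.add(v);   // one update was lost long ago
        }
        // The gap is far outside the 99-version window, so sync sees nothing wrong,
        // even though the doc counts differ by one.
        System.out.println(looksInSync(leader, replica, 99));  // → true
        System.out.println(leader.size() - replica.size());    // → 1
    }
}
```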
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865744#comment-13865744 ] Mark Miller commented on SOLR-4260: --- Although that doesn't really jibe with the transaction logs being identical... hmm...
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865742#comment-13865742 ] Mark Miller commented on SOLR-4260: --- I've noticed something like this too - but nothing I could reproduce easily. I imagine it's likely an issue in SolrCmdDistributor.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864141#comment-13864141 ] Markus Jelsma commented on SOLR-4260: - Ok, I followed all the great work here and in related tickets, and yesterday I had the time to rebuild Solr and check for this issue. I hadn't seen it yesterday, but it is right in front of me again, using a fresh build from January 6th.
Leader has Num Docs: 379659
Replica has Num Docs: 379661
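The kind of consistency check being done by hand throughout this thread can be sketched as a small helper (hypothetical, not part of Solr): collect per-replica numDocs for each shard — in practice from each core's /select?q=*:*&rows=0&distrib=false — and flag any shard whose replicas disagree.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReplicaCountCheck {
    /**
     * Given numDocs per replica for each shard (first entry = leader),
     * return the shards whose replicas disagree. The counts would come
     * from non-distributed queries against each core; here they are
     * passed in directly so the check itself is testable.
     */
    static List<String> outOfSyncShards(Map<String, long[]> countsByShard) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, long[]> e : countsByShard.entrySet()) {
            long leaderCount = e.getValue()[0];
            for (long c : e.getValue()) {
                if (c != leaderCount) { bad.add(e.getKey()); break; }
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        Map<String, long[]> counts = new LinkedHashMap<>();
        counts.put("shard1", new long[]{379659, 379661}); // the mismatch reported above
        counts.put("shard2", new long[]{330412, 330412}); // hypothetical healthy shard
        System.out.println(outOfSyncShards(counts)); // → [shard1]
    }
}
```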
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856078#comment-13856078 ] Mark Miller commented on SOLR-4260: --- That's interesting. The logging makes it look like it's not creating its new ephemeral live node for some reason... or the leader is not getting an updated view of the live node...
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856052#comment-13856052 ] Timothy Potter commented on SOLR-4260: -- Thanks Mark, I suspected my test case was a little cherry-picked ... something interesting happened when I also severed the connection between the replica and ZK (i.e. the same test as above, but I also dropped the ZK connection on the replica).

2013-12-23 15:39:57,170 [main-EventThread] INFO common.cloud.ConnectionManager - Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62 name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181 got event WatchedEvent state:Disconnected type:None path:null path:null type:None
2013-12-23 15:39:57,170 [main-EventThread] INFO common.cloud.ConnectionManager - zkClient has disconnected

>>> fixed the connection between replica and ZK here <<<

2013-12-23 15:40:45,579 [main-EventThread] INFO common.cloud.ConnectionManager - Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62 name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181 got event WatchedEvent state:Expired type:None path:null path:null type:None
2013-12-23 15:40:45,579 [main-EventThread] INFO common.cloud.ConnectionManager - Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
2013-12-23 15:40:45,580 [main-EventThread] INFO common.cloud.DefaultConnectionStrategy - Connection expired - starting a new one...
2013-12-23 15:40:45,586 [main-EventThread] INFO common.cloud.ConnectionManager - Waiting for client to connect to ZooKeeper
2013-12-23 15:40:45,595 [main-EventThread] INFO common.cloud.ConnectionManager - Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62 name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181 got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
2013-12-23 15:40:45,595 [main-EventThread] INFO common.cloud.ConnectionManager - Client is connected to ZooKeeper
2013-12-23 15:40:45,595 [main-EventThread] INFO common.cloud.ConnectionManager - Connection with ZooKeeper reestablished.
2013-12-23 15:40:45,596 [main-EventThread] WARN solr.cloud.RecoveryStrategy - Stopping recovery for zkNodeName=core_node3 core=cloud_shard1_replica3
2013-12-23 15:40:45,597 [main-EventThread] INFO solr.cloud.ZkController - publishing core=cloud_shard1_replica3 state=down
2013-12-23 15:40:45,597 [main-EventThread] INFO solr.cloud.ZkController - numShards not found on descriptor - reading it from system property
2013-12-23 15:40:45,905 [qtp2124890785-14] INFO handler.admin.CoreAdminHandler - It has been requested that we recover
2013-12-23 15:40:45,906 [qtp2124890785-14] INFO solr.servlet.SolrDispatchFilter - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=cloud_shard1_replica3&wt=javabin&version=2} status=0 QTime=2
2013-12-23 15:40:45,909 [Thread-17] INFO solr.cloud.ZkController - publishing core=cloud_shard1_replica3 state=recovering
2013-12-23 15:40:45,909 [Thread-17] INFO solr.cloud.ZkController - numShards not found on descriptor - reading it from system property
2013-12-23 15:40:45,920 [Thread-17] INFO solr.update.DefaultSolrCoreState - Running recovery - first canceling any ongoing recovery
2013-12-23 15:40:45,921 [RecoveryThread] INFO solr.cloud.RecoveryStrategy - Starting recovery process. core=cloud_shard1_replica3 recoveringAfterStartup=false
2013-12-23 15:40:45,924 [RecoveryThread] INFO solr.cloud.ZkController - publishing core=cloud_shard1_replica3 state=recovering
2013-12-23 15:40:45,924 [RecoveryThread] INFO solr.cloud.ZkController - numShards not found on descriptor - reading it from system property
2013-12-23 15:40:48,613 [qtp2124890785-15] INFO solr.core.SolrCore - [cloud_shard1_replica3] webapp=/solr path=/select params={q=foo_s:bar&distrib=false&wt=json&rows=0} hits=0 status=0 QTime=1
2013-12-23 15:42:42,770 [qtp2124890785-13] INFO solr.core.SolrCore - [cloud_shard1_replica3] webapp=/solr path=/select params={q=foo_s:bar&distrib=false&wt=json&rows=0} hits=0 status=0 QTime=1
2013-12-23 15:42:45,650 [main-EventThread] ERROR solr.cloud.ZkController - There was a problem making a request to the leader:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I was asked to wait on state down for cloud86:8986_solr but I still do not see the requested state. I see state: recovering live:false
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1434)
at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856013#comment-13856013 ] Mark Miller commented on SOLR-4260: ---
bq. so we kind of punted
The other thing to note is that if you restart the shard or that node or the cluster, you should be able to do it without losing any data. It will recover from the leader when everything else is working correctly.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856009#comment-13856009 ] Mark Miller commented on SOLR-4260: --- Yeah, that's currently expected. We don't expect the case where you can talk to ZooKeeper but not your replicas to be common, so we kind of punted on this scenario for the first phase. Some related JIRA issues: SOLR-5482, SOLR-5450, SOLR-5495.
I think we should do all that, but the key is really, in this case, that we need to pass the order to recover through ZooKeeper to the partitioned-off replica. With an eventually consistent model, it can be off for a short time, but it needs to recover in a timely manner. I think this is the right solution because the replica is sure to either get the information to recover from ZooKeeper or lose its connection to ZooKeeper, in which case it will have to recover anyway.
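The invariant described above can be stated as a tiny decision table (a sketch of the reasoning only, not Solr's implementation; the class, enum, and parameter names are invented): a partitioned replica either still has its ZooKeeper session, and will therefore see a recover order published there, or it has lost the session and must recover on reconnect anyway — so it cannot stay active and out of sync indefinitely.

```java
public class RecoveryDecision {
    enum Action { RECOVER, STAY_ACTIVE }

    /**
     * Either path leads to recovery for an out-of-sync replica:
     * - ZK session lost: recovery is forced when the session is re-established;
     * - ZK session alive: the leader's recover order arrives via ZooKeeper.
     * Only a healthy replica (session alive, no pending order) stays active.
     */
    static Action decide(boolean zkSessionAlive, boolean recoverOrderInZk) {
        if (!zkSessionAlive) return Action.RECOVER;   // must recover on reconnect
        if (recoverOrderInZk) return Action.RECOVER;  // order delivered through ZK
        return Action.STAY_ACTIVE;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true));   // → RECOVER
        System.out.println(decide(false, false)); // → RECOVER
        System.out.println(decide(true, false));  // → STAY_ACTIVE
    }
}
```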
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856002#comment-13856002 ] Timothy Potter commented on SOLR-4260: -- Found another interesting case that may or may not be valid, depending on whether we think HTTP requests between a leader and replica can fail even if the ZooKeeper session on the replica does not drop? Specifically, what I'm seeing is that if an update request between the leader and replica fails, but the replica doesn't lose its session with ZK, then the replica can get out of sync with the leader. In a real network partition, the ZK connection would also likely be lost and the replica would get marked as down. So as long as the HTTP connection timeout between the leader and replica exceeds the ZK client timeout, the replica would probably recover correctly, rendering this test case invalid. So maybe the main question here is whether we think it's possible for HTTP requests between a leader and replica to fail even though the ZooKeeper connection stays alive?

Here are the steps I used to reproduce this case (all using revision 1553150 in branch_4x):

*> STEP 1: Set up a collection named “cloud” containing 1 shard and 2 replicas on hosts: cloud84 (127.0.0.1:8984) and cloud85 (127.0.0.1:8985)*

SOLR_TOP=/home/ec2-user/branch_4x/solr
$SOLR_TOP/cloud84/cloud-scripts/zkcli.sh -zkhost $ZK_HOST -cmd upconfig -confdir $SOLR_TOP/cloud84/solr/cloud/conf -confname cloud
API=http://localhost:8984/solr/admin/collections
curl -v "$API?action=CREATE&name=cloud&replicationFactor=2&numShards=1&collection.configName=cloud"

Replica on cloud84 is elected as the initial leader. /clusterstate.json looks like:

{"cloud":{
    "shards":{"shard1":{
        "range":"80000000-7fffffff",
        "state":"active",
        "replicas":{
          "core_node1":{
            "state":"active",
            "base_url":"http://cloud84:8984/solr",
            "core":"cloud_shard1_replica1",
            "node_name":"cloud84:8984_solr",
            "leader":"true"},
          "core_node2":{
            "state":"active",
            "base_url":"http://cloud85:8985/solr",
            "core":"cloud_shard1_replica2",
            "node_name":"cloud85:8985_solr"}}}},
    "maxShardsPerNode":"1",
    "router":{"name":"compositeId"},
    "replicationFactor":"2"}}

*> STEP 2: Simulate network partition*

sudo iptables -I INPUT 1 -i lo -p tcp --sport 8985 -j DROP; sudo iptables -I INPUT 2 -i lo -p tcp --dport 8985 -j DROP

Various ways to do this, but to keep it simple, I'm just dropping inbound traffic on localhost to port 8985.

*> STEP 3: Send document with ID “doc1” to leader on cloud84*

curl "http://localhost:8984/solr/cloud/update" -H 'Content-type:application/xml' \
  --data-binary '<add><doc><field name="id">doc1</field><field name="foo_s">bar</field></doc></add>'

The update request takes some time because the replica is down but ultimately succeeds on the leader. In the logs on the leader, we have (some stack trace lines removed for clarity):

2013-12-23 10:59:33,688 [updateExecutor-1-thread-1] ERROR solr.update.StreamingSolrServers - error org.apache.http.conn.HttpHostConnectException: Connection to http://cloud85:8985 refused
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
...
at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:232)
...
Caused by: java.net.ConnectException: Connection timed out
...
2013-12-23 10:59:33,695 [qtp1073932139-16] INFO update.processor.LogUpdateProcessor - [cloud_shard1_replica1] webapp=/solr path=/update params={} {add=[doc1 (1455228778490888192)]} 0 63256
2013-12-23 10:59:33,702 [updateExecutor-1-thread-2] INFO update.processor.DistributedUpdateProcessor - try and ask http://cloud85:8985/solr to recover
2013-12-23 10:59:48,718 [updateExecutor-1-thread-2] ERROR update.processor.DistributedUpdateProcessor - http://cloud85:8985/solr: Could not tell a replica to recover:org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://cloud85:8985/solr
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:507)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at org.apache.solr.update.processor.DistributedUpdateProcessor$1.run(DistributedUpdateProcessor.java:657)
...
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to cloud85:8985 timed out
...

Of course these log messages are expected. The key is that the leader accepted the update and now has one doc with ID "doc1".

*> STEP 4: Heal the network partition*

sudo service iptables restart (undoes the DROP rules we added above)

*> STEP 5: Send document with ID “doc2” to leader on cloud84*

curl "http://localhost:8984/solr/cloud/update"
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855273#comment-13855273 ] Mark Miller commented on SOLR-4260:

The SOLR-5552 investigation has also led to SOLR-5569 and SOLR-5568.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846812#comment-13846812 ] Timothy Potter commented on SOLR-4260:

Mark -> https://issues.apache.org/jira/browse/SOLR-5552
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846775#comment-13846775 ] Mark Miller commented on SOLR-4260:

Ah, thanks for the explanation. I think we should roll that specific issue into a new JIRA issue.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846755#comment-13846755 ] Timothy Potter commented on SOLR-4260:

I'm sorry for being unclear; "waiting" was probably the wrong term ... and they definitely continue right on down the path of selecting the wrong leader. Here's what I know so far, which admittedly isn't much:

As cloud85 (the replica before it crashed) is initializing, it enters the wait process in ShardLeaderElectionContext#waitForReplicasToComeUp; this is expected and a good thing. A short time later, cloud84 (the leader before it crashed) begins initializing and gets to the point where it adds itself as a possible leader for the shard (by creating a znode under /collections/cloud/leaders_elect/shard1/election), which lets cloud85 return from waitForReplicasToComeUp and try to determine who should be the leader.

cloud85 then tries to run the SyncStrategy, which can never work in this scenario because the Jetty HTTP listener is not active yet on either node, so all replication work that uses HTTP requests fails on both nodes ... PeerSync treats these failures as indicators that the other replicas in the shard are unavailable (or whatever) and assumes success. Here's the log message:

2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] WARN solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr couldn't connect to http://cloud84:8984/solr/cloud_shard1_replica2/, counting as success

The Jetty HTTP listener doesn't start accepting connections until long after this process has completed and the wrong leader has already been selected.

From what I can see, we seem to have a leader recovery process that is based partly on HTTP requests to the other nodes, but the HTTP listener on those nodes isn't active yet. We need a leader recovery process that doesn't rely on HTTP requests. Perhaps leader recovery for a shard without a current leader needs to work differently than leader election in a shard that has replicas able to respond to HTTP requests? Everything I'm seeing makes perfect sense for leader election when there are active replicas and the current leader fails.

All this aside, I'm not asserting that this is the only cause of the out-of-sync issues reported in this ticket, but it definitely seems like it could happen in a real cluster.
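The failure mode described above can be condensed into a toy simulation. This is my sketch, not Solr code; `peer_sync` and `elect_leader` are invented names that only mimic the observed behavior, where an unreachable peer is "counted as success":

```python
# Toy model of the race: both nodes are restarting, neither HTTP listener is
# up, and the sync step cannot distinguish "peer in sync" from "peer unreachable".

def peer_sync(peer_reachable, peer_has_my_updates):
    """Mimics the logged behavior: couldn't connect -> counting as success."""
    if not peer_reachable:
        return True  # the fatal assumption
    return peer_has_my_updates

def elect_leader(candidate, peers_reachable):
    # The stale replica (cloud85) runs sync first; since no listener is up
    # anywhere, sync "succeeds" and it claims leadership despite missing docs.
    if peer_sync(peers_reachable, peer_has_my_updates=False):
        return candidate
    return None

print(elect_leader("cloud85", peers_reachable=False))  # stale replica wins
```

With `peers_reachable=True` the same election would correctly refuse the stale candidate, which is why the bug only shows up during the startup window before Jetty accepts connections.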
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846694#comment-13846694 ] Mark Miller commented on SOLR-4260:

bq. because they are waiting on each other

That doesn't make sense to me - the wait should last until all the replicas for a shard are up - so what exactly are they both waiting on? If they are both waiting, there should be enough replicas up to continue...
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846593#comment-13846593 ] Timothy Potter commented on SOLR-4260:

Agreed on the wait being necessary (which I actually annotated in the comment above). The crux of the issue is that the replica (cloud85) can't sync with the previous leader (cloud84) because they are waiting on each other, much like a deadlock. Eventually they both give up and one wins; unfortunately, in my test case cloud85 wins, which leaves the shard out-of-sync because the wrong leader is selected in this scenario (cloud84 should have been selected). I'm continuing to dig into this, but I've come to the conclusion that tweaking the waitForReplicasToComeUp process is a dead end - it's working as well as it can.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846572#comment-13846572 ] Mark Miller commented on SOLR-4260:

The other issue is expected as well. It's the safety mechanism - we don't let you just start one node and let it become the leader; ideally you want all replicas to be involved in the election to prevent data loss. You have to be explicit if you want this to work with no wait. It might be nice if we added a startup sys prop that caused it not to wait on first startup.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845895#comment-13845895 ] Timothy Potter commented on SOLR-4260:

OK - cool ... just wanted to make sure that "stale" situation was expected ... The more I dig into ShardLeaderElectionContext's decision-making process, the more I think looking at state won't work, because both replicas are in the "down" state while this is happening. I think some determination of whether the node is "reachable" - so that PeerSync can get good information from it - is what needs to be factored into ShardLeaderElectionContext. Or maybe there is another state: "trying to figure out my role in the world as I come back up" ;-)
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845887#comment-13845887 ] Mark Miller commented on SOLR-4260:

A lot there! I'll respond to most of it later. As far as the stale state goes, that is expected. You cannot tell the state from clusterstate.json alone - it is a mix of clusterstate.json and the live_nodes list. If the live node for anything in clusterstate.json is missing, it's considered not up. This is currently by design - without live_nodes, you don't know the state.
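The rule Mark describes can be written down in a few lines. This is a sketch of the stated design, not Solr's implementation; `effective_state` is an invented helper, and the node names come from the cluster used in this thread:

```python
# A replica's effective state is what clusterstate.json claims only while its
# node has an ephemeral entry under /live_nodes; otherwise it is treated as down.

def effective_state(replica, live_nodes):
    """Combine the clusterstate.json entry with the live_nodes list."""
    return replica["state"] if replica["node_name"] in live_nodes else "down"

replica = {"state": "active", "node_name": "cloud84:8984_solr"}
print(effective_state(replica, live_nodes={"cloud85:8985_solr"}))                      # -> down
print(effective_state(replica, live_nodes={"cloud84:8984_solr", "cloud85:8985_solr"})) # -> active
```

This is why a crashed node can still read "active" in clusterstate.json: the JSON is stale by design, and liveness comes from the ZooKeeper ephemeral node instead.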
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845847#comment-13845847 ] Timothy Potter commented on SOLR-4260:

I don't have a fix yet, but I wanted to post an update here to get some feedback on what I'm seeing ...

I have a simple SolrCloud configuration set up locally: one collection named "cloud" with one shard and replicationFactor=2. Here's what I use to create it:

curl "http://localhost:8984/solr/admin/collections?action=CREATE&name=cloud&replicationFactor=$REPFACT&numShards=1&collection.configName=cloud"

The collection gets distributed on two nodes, cloud84:8984 and cloud85:8985, with cloud84 being assigned the leader. Here's an outline of the process I used to get my collection out-of-sync during indexing:

1) Start indexing docs using CloudSolrServer in SolrJ - direct updates go to the leader, and the replica remains in sync for as long as I let this process run.
2) kill -9 the process for the replica cloud85.
3) Let indexing continue against cloud84 for a few seconds (just to get the leader and replica out-of-sync once I bring the replica back online).
4) kill -9 the process for the leader cloud84 ... indexing halts, of course, as there are no running servers.
5) Start the replica cloud85, but do not start the previous leader cloud84.

Here are some key log messages as cloud85 - the replica - fires up. My annotations of the log messages are prefixed by [TJP >>.

2013-12-11 11:43:22,076 [main-EventThread] INFO common.cloud.ZkStateReader - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 1)
2013-12-11 11:43:23,370 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - Waiting until we see more replicas up for shard shard1: total=2 found=1 timeoutin=139841

[TJP >> This looks good and is expected: cloud85 was not the leader before it died, so it should not immediately assume it is the leader until it sees more replicas.

6) Now start the previous leader cloud84 ... Here are some key log messages from cloud85 as the previous leader cloud84 is coming up:

2013-12-11 11:43:24,085 [main-EventThread] INFO common.cloud.ZkStateReader - Updating live nodes... (2)
2013-12-11 11:43:24,136 [main-EventThread] INFO solr.cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
2013-12-11 11:43:24,137 [Thread-13] INFO common.cloud.ZkStateReader - Updating cloud state from ZooKeeper...
2013-12-11 11:43:24,138 [Thread-13] INFO solr.cloud.Overseer - Update state numShards=1 message={
  "operation":"state",
  "state":"down",
  "base_url":"http://cloud84:8984/solr",
  "core":"cloud_shard1_replica2",
  "roles":null,
  "node_name":"cloud84:8984_solr",
  "shard":"shard1",
  "shard_range":null,
  "shard_state":"active",
  "shard_parent":null,
  "collection":"cloud",
  "numShards":"1",
  "core_node_name":"core_node1"}

[TJP >> The state of cloud84 looks correct, as it is still initializing ...

2013-12-11 11:43:24,140 [main-EventThread] INFO solr.cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
2013-12-11 11:43:24,141 [main-EventThread] INFO common.cloud.ZkStateReader - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 2)
2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - Enough replicas found to continue.

[TJP >> Hmm ... cloud84 is listed in /live_nodes but it isn't "active" yet, or even recovering (see the state above - it's currently "down") ... My thinking here is that ShardLeaderElectionContext needs to take the state of the replica into account before deciding it should continue.

2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - I may be the new leader - try and sync
2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO solr.cloud.SyncStrategy - Sync replicas to http://cloud85:8985/solr/cloud_shard1_replica1/
2013-12-11 11:43:25,880 [coreLoadExecutor-3-thread-1] INFO solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr START replicas=[http://cloud84:8984/solr/cloud_shard1_replica2/] nUpdates=100
2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] WARN solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr couldn't connect to http://cloud84:8984/solr/cloud_shard1_replica2/, counting as success

[TJP >> Whoops! Of course it couldn't connect to cloud84, as it's still initializing ...

2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] INFO solr.update.PeerSync - PeerSync: core=cloud_shard1_re
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845459#comment-13845459 ] Timothy Potter commented on SOLR-4260:

I have some cycles to work on this issue over the next couple of days. I'm starting by trying to reproduce it in my environment. Please let me know of any tasks that I can help out on (beyond the long-wait stuff you mentioned above).
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841896#comment-13841896 ] Mark Miller commented on SOLR-4260:

I've fixed some things since 4.6 - I only had time to focus on the leader-not-going-down case for 4.6, and I spent a bunch more time on this case after 4.6 was released. Unfortunately, I think there are a couple of issues at play here - some of the new changes make existing holes easier to spot, and the chaos monkey tests were accidentally disabled for some time, so small issues may have crept in.

I *think* the remaining issue is mostly around SOLR-5516. We need to come up with a better idea than a really long wait, though - but if someone wants to help test, putting in a long wait and stressing this would be useful to see if it is indeed the main remaining issue. I recently put in a lot of time improving the situation and I need to focus on other things for a bit, but I'll keep coming back to this as I can.
[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841870#comment-13841870 ] Yago Riveiro commented on SOLR-4260: Replicas are still losing docs in Solr 4.6 :(. I'm wondering if we could keep a (version, numDocs) pair to track document-count increments between versions. We could also keep the last 10 tlogs on each replica as backups after they are committed; when replicas go out of sync, a diff would show what is missing, and replaying those transactions would fix the unsynchronized replica without a full recovery, which is probably heavier than computing the diff. It's only an idea, and of course finding the bug must be the priority. This issue compromises Solr's use as "the main" storage: if re-indexing the data is not possible, we can't guarantee that no data is missing, and worse, we lose the data forever :(.
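The diff step in the tlog idea above can be sketched as pure logic. This is a hypothetical helper, not Solr API: it assumes tlog entries have already been reduced to their update version numbers, and the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the proposed diff: given the update versions recorded in the
// leader's and a replica's recent tlogs, find the updates the replica is
// missing, so only those would need to be replayed instead of triggering
// a full recovery.
public class TlogDiff {
    public static List<Long> missingVersions(List<Long> leaderVersions,
                                             List<Long> replicaVersions) {
        Set<Long> seen = new HashSet<>(replicaVersions);
        List<Long> missing = new ArrayList<>();
        for (Long v : leaderVersions) {
            if (!seen.contains(v)) {
                missing.add(v); // replica never applied this update
            }
        }
        return missing;
    }
}
```

The real replay would still have to handle deletes and reordered updates, which is why this stays a sketch of the idea rather than a fix.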
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841162#comment-13841162 ] Markus Jelsma commented on SOLR-4260: - I'm sorry, I've got three replicas having one document less than the leader. We're on a December 3rd build.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836414#comment-13836414 ] Markus Jelsma commented on SOLR-4260: - I'll check it out!
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835883#comment-13835883 ] Mark Miller commented on SOLR-4260: --- [~markus17], hopefully that's SOLR-5516 then.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835436#comment-13835436 ] Mark Miller commented on SOLR-4260: --- What's the exact version / checkout?
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835368#comment-13835368 ] Rafał Kuć commented on SOLR-4260: - Happened to me too: a collection with four shards, each shard having a single replica. The replicas were out of sync.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834760#comment-13834760 ] Markus Jelsma commented on SOLR-4260: - I've got some bad news, it happened again on one of our clusters using a build of November 19th. Three replicas went out of sync.
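Spotting the condition reported above can be reduced to a simple check once the per-replica counts are in hand. This is a hedged sketch, not Solr code: it assumes each core has already been queried directly (e.g. with distrib=false) and the numDocs per replica collected per shard; the shard names are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Sketch: given numDocs per replica for each shard (gathered elsewhere),
// report the shards whose replicas disagree on document count.
public class SyncCheck {
    public static List<String> outOfSyncShards(Map<String, List<Long>> numDocsByShard) {
        return numDocsByShard.entrySet().stream()
                .filter(e -> new TreeSet<>(e.getValue()).size() > 1) // >1 distinct count
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

Note that a matching numDocs is necessary but not sufficient for sync: equal counts can still hide differing documents, which is consistent with the IDF drift described in the issue.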
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826535#comment-13826535 ] Mark Miller commented on SOLR-4260: --- Should probably bring it up on the user list - we need someone like [~rcmuir] to weigh in. I assume it all works the same way - you merge each field to the default impl and then back to what they were.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826533#comment-13826533 ] Yago Riveiro commented on SOLR-4260: I'm using per-field DocValues formats. I don't think this aspect of docValues is explained properly on the wiki: there is no example of how to switch to the default format, do the forceMerge, and switch back to the original implementation. If I can't be sure it will all work, I can't do the upgrade.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826529#comment-13826529 ] Mark Miller commented on SOLR-4260: --- According to the wiki, it depends on the doc values impl you are using - the default one will upgrade fine. Others require that you forceMerge your index to rewrite it with the default and then upgrade, then I guess you can forceMerge back to that impl. Honestly, I have not had a chance to play with doc values yet though.
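The switch-and-merge procedure described above is driven by the schema. The fragment below is a hedged illustration only (the field name is made up, and "Disk" stands in for whatever non-default docValuesFormat is in use); the exact behavior should be verified against the wiki before upgrading.

```xml
<!-- Before: a custom per-field docValues format that the new version
     may not read directly. -->
<fieldType name="string_dv" class="solr.StrField" docValues="true"
           docValuesFormat="Disk"/>

<!-- Step 1: drop docValuesFormat so the default codec applies.
     Step 2: forceMerge (optimize) so every segment is rewritten
             with the default format.
     Step 3: upgrade Solr; optionally restore the custom format
             and forceMerge again. -->
<fieldType name="string_dv" class="solr.StrField" docValues="true"/>
```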
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826356#comment-13826356 ] Yago Riveiro commented on SOLR-4260: Is it safe to upgrade from 4.5.1 to 4.6? I have docValues, I've read that the upgrade is not straightforward, and I can't reindex the data.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826350#comment-13826350 ] Markus Jelsma commented on SOLR-4260: - I updated our machines to include SOLR-5397. Everything works fine now, though it may take quite some time before we can say it is fixed :)
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825107#comment-13825107 ] Mark Miller commented on SOLR-4260: --- Would love it if you guys could try with 4.6 and report back. SOLR-5397 was introduced when we fixed a similar issue, so that has really been an issue for a few releases.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824921#comment-13824921 ] Mark Miller commented on SOLR-4260: --- This could be related to SOLR-5397.
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824299#comment-13824299 ] Mark Miller commented on SOLR-4260: --- Right - but that's just impl, not design. The idea is that, since we add locally first, there is not much reason it should fail on a replica - unless that replica has crashed or lost connectivity or something really bad. In that case, it will have to reconnect to zk and recover or restart and recover. Just in case, as a precaution, we try and tell it to recover - then if it's still got connectivity or it was an intermittent problem, it won't run around acting active. I think I have a note about perhaps doing more retries in background threads for that recovery request, but I've never gotten to it. If you are finding a scenario that eludes that, we should strengthen the impl.
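The "more retries in background threads" note above could look something like the sketch below. This is hypothetical, not the actual DistributedUpdateProcessor code: the real recovery request (a CoreAdmin REQUESTRECOVERY call to the replica) is abstracted as a BooleanSupplier that returns true on success.

```java
import java.util.function.BooleanSupplier;

// Sketch: retry the "please recover" request to a lagging replica with
// exponential backoff instead of giving up after a single attempt.
public class RecoveryRetry {
    public static boolean requestRecoveryWithRetries(BooleanSupplier sendRequest,
                                                     int maxAttempts,
                                                     long initialBackoffMs)
            throws InterruptedException {
        long backoff = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (sendRequest.getAsBoolean()) {
                return true;            // replica acknowledged the recovery request
            }
            if (attempt < maxAttempts) {
                Thread.sleep(backoff);  // simple exponential backoff
                backoff *= 2;
            }
        }
        // Caller would then have to rely on something else (ZK session
        // expiry, leader election) to keep the replica from acting active.
        return false;
    }
}
```

In a real patch this loop would run in a background executor so the update path never blocks on a misbehaving replica.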
[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824295#comment-13824295 ] Jessica Cheng commented on SOLR-4260: -
{quote}
This shouldn't be the case, because those updates will only have been ack'd if each replica received them.
{quote}
That's what I thought too, but that doesn't seem to be the case in the code. If you take a look at DistributedUpdateProcessor.doFinish():
{quote}
// if its a forward, any fail is a problem -
// otherwise we assume things are fine if we got it locally
// until we start allowing min replication param
if (errors.size() > 0) {
  // if one node is a RetryNode, this was a forward request
  if (errors.get(0).req.node instanceof RetryNode) {
    rsp.setException(errors.get(0).e);
  }
  // else
  // for now we don't error - we assume if it was added locally, we
  // succeeded
}
{quote}
It then starts a thread to urge the replica to recover, but if that fails, it just completely gives up.