[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2015-04-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499853#comment-14499853
 ] 

Mark Miller commented on SOLR-4260:
---

This ticket addressed specific issues - please open a new ticket for any 
further reports.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 4.6.1, Trunk
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, SOLR-4260.patch, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using
> CloudSolrServer, we see inconsistencies between the leader and replica for
> some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have
> a small deviation in the number of documents: the leader and replica deviate
> by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my
> attention: there were small IDF differences for exactly the same record,
> causing it to shift positions in the result set. During those tests no
> records were indexed. Consecutive catch-all queries also return different
> values for numDocs.
> We're running a 10-node test cluster with 10 shards and a replication factor
> of two, and we frequently reindex using a fresh build from trunk. I hadn't
> seen this issue for quite some time until a few days ago.




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2015-04-15 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496055#comment-14496055
 ] 

Hari Sekhon commented on SOLR-4260:
---

I've seen much larger discrepancies between the leader and followers on newer
versions of Solr than the ones in this ticket - differences of tens to hundreds
of thousands in numDocs when doing bulk online indexing jobs (hundreds of
millions of docs) from Hive.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-02-11 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897716#comment-13897716
 ] 

Markus Jelsma commented on SOLR-4260:
-

[~markrmil...@gmail.com] I just checked the shards again: on one cluster, one
replica has 1 document more (or less) than the other. They are out of sync
again. I can open a new issue, but it's really the same discussion as here.
What do you think, reopen or new?



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875755#comment-13875755
 ] 

Mark Miller commented on SOLR-4260:
---

Thanks everyone. I'll make a new JIRA issue to properly fix this. I'm not sure
we should remove this logic; it's a good failsafe. But ideally we don't run out
of runners while there are still updates in the queue. Calling
blockUntilFinished is not supposed to be required to make sure the queue is
emptied.
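
For context on that contract, here is a minimal SolrJ 4.x usage sketch; the endpoint, queue size and
thread count are hypothetical values, and only the ConcurrentUpdateSolrServer calls themselves are
the ones under discussion:

{code}
// Minimal sketch of the ConcurrentUpdateSolrServer (CUSS) contract discussed above.
// The URL, queue size and thread count are made-up values, not from this ticket.
import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CussContractSketch {
  public static void main(String[] args) throws SolrServerException, IOException {
    ConcurrentUpdateSolrServer server =
        new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 10, 4);

    for (int i = 0; i < 1000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      server.add(doc); // queued; background Runner threads stream it to Solr
    }

    // Convenience barrier before shutdown. Per the comment above, the queue is
    // supposed to drain even for callers that never reach this line.
    server.blockUntilFinished();
    server.shutdown();
  }
}
{code}

The same class sits behind the leader's update distribution to replicas, which is why a queue that
only drains on blockUntilFinished can show up as missing documents on a replica.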


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874841#comment-13874841
 ] 

Mark Miller commented on SOLR-4260:
---

Thanks Shawn - fixed.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-17 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874839#comment-13874839
 ] 

ASF subversion and git services commented on SOLR-4260:
---

Commit 1559125 from [~markrmil...@gmail.com] in branch 
'dev/branches/lucene_solr_4_6'
[ https://svn.apache.org/r1559125 ]

SOLR-4260: Bring back import still used on 4.6 branch.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-17 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874835#comment-13874835
 ] 

Joel Bernstein commented on SOLR-4260:
--

Ok, just had two clean test runs with trunk. The NPE is no longer occurring and 
the leaders and replicas are in sync. Running through some more stress tests 
this morning, but so far so good.




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-17 Thread Mikhail Khludnev (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874686#comment-13874686
 ] 

Mikhail Khludnev commented on SOLR-4260:


What a great hunt, guys! Thanks a lot!


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-17 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874667#comment-13874667
 ] 

Markus Jelsma commented on SOLR-4260:
-

I believe the whole building now knows I cannot reproduce the problem!


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874434#comment-13874434
 ] 

Shawn Heisey commented on SOLR-4260:


This might be old news by now, but I noticed it while updating my test system, 
so I'm reporting it.

The lucene_solr_4_6 branch fails to compile with these fixes committed.  One of 
the changes removes the import for RemoteSolrException from SolrCmdDistributor, 
but the doRetries method still uses this exception.  That method is very 
different in 4.6 than it is in branch_4x.  Everything's good on branch_4x.  
Re-adding the import fixes the problem, but the discrepancy between the two 
branches needs some investigation.

The specific code that fails to compile with the removed import seems to have 
been initially added to trunk by revision 1545464 (2013/11/25) and removed from 
trunk by revision 1546670 (2013/11/29).  It was then re-added to 
lucene_solr_4_6 by revision 1554122 (2013/12/29).
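
For anyone patching this locally, the change being described is a one-line restore in
SolrCmdDistributor on the 4.6 branch. The exact import path below is an assumption about where
RemoteSolrException lived in 4.x, so verify it against the branch before applying:

{code}
// solr/core/src/java/org/apache/solr/update/SolrCmdDistributor.java (lucene_solr_4_6)
// Assumed path - restore the import that doRetries still references:
import org.apache.solr.client.solrj.impl.HttpSolrServer.RemoteSolrException;
{code}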



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874364#comment-13874364
 ] 

Mark Miller commented on SOLR-4260:
---

This is a fine fix for SolrCloud, especially for 4.6.1, but there may still be
a better general fix hidden here: what seems to happen is that docs enter the
queue without spawning a runner. The current fix means docs can be added that
will sit in the queue until you call blockUntilFinished.
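
As a rough illustration of that sequence, here is a self-contained sketch with made-up names - not
the actual ConcurrentUpdateSolrServer source:

{code}
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustration only: how an add can enqueue a doc without a Runner left to send it.
class RunnerRaceSketch {
  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
  private final Set<Thread> runners = new HashSet<>();

  void add(String update) throws InterruptedException {
    queue.put(update);            // 1. the update enters the queue
    synchronized (runners) {
      // 2. a Runner that is about to exit may still be registered here, so the
      //    set is non-empty and no new Runner is started for this update
      if (runners.isEmpty()) {
        startRunner();
      }
    }
    // 3. if that Runner then exits without draining this update, it sits in the
    //    queue until blockUntilFinished() is called - the limitation noted above
  }

  private void startRunner() {
    Thread runner = new Thread(() -> {
      for (String u; (u = queue.poll()) != null; ) {
        // stream u to the target server ...
      }
      synchronized (runners) {
        runners.remove(Thread.currentThread());
      }
    });
    synchronized (runners) {
      runners.add(runner);
    }
    runner.start();
  }
}
{code}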


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874352#comment-13874352
 ] 

ASF subversion and git services commented on SOLR-4260:
---

Commit 1558998 from [~markrmil...@gmail.com] in branch 
'dev/branches/lucene_solr_4_6'
[ https://svn.apache.org/r1558998 ]

SOLR-4260: If in blockUntilFinished and there are no Runners running and the 
queue is not empty, start a new Runner.
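
In terms of the simplified RunnerRaceSketch illustration in the previous comment (stand-in names,
not the real ConcurrentUpdateSolrServer internals, which also involve a runner lock and an
executor), the behavior this commit describes looks roughly like:

{code}
// Illustration only - a blockUntilFinished() for the RunnerRaceSketch class above.
void blockUntilFinished() throws InterruptedException {
  synchronized (runners) {
    while (!runners.isEmpty() || !queue.isEmpty()) {
      // The committed guard: no live Runners plus a non-empty queue previously left
      // the remaining updates stranded; now a fresh Runner is started to drain them.
      if (runners.isEmpty() && !queue.isEmpty()) {
        startRunner();
      }
      runners.wait(250); // poll until all Runners have exited and the queue is empty
    }
  }
}
{code}

Without such a guard, the method can either return early or sit forever with updates still queued,
which matches the symptoms described elsewhere in this thread.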


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874351#comment-13874351
 ] 

Mark Miller commented on SOLR-4260:
---

Committed something for that.

As a separate issue, it seems to me that CUSS#shutdown should probably call
blockUntilFinished as its first order of business.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874348#comment-13874348
 ] 

ASF subversion and git services commented on SOLR-4260:
---

Commit 1558997 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1558997 ]

SOLR-4260: If in blockUntilFinished and there are no Runners running and the 
queue is not empty, start a new Runner.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874345#comment-13874345
 ] 

ASF subversion and git services commented on SOLR-4260:
---

Commit 1558996 from [~markrmil...@gmail.com] in branch 'dev/trunk'
[ https://svn.apache.org/r1558996 ]

SOLR-4260: If in blockUntilFinished and there are no Runners running and the 
queue is not empty, start a new Runner.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874340#comment-13874340
 ] 

Mark Miller commented on SOLR-4260:
---

ChaosMonkeyNothingIsSafeTest is exposing an issue now with 
ConcurrentUpdateSolrServer - it looks like it's getting stuck in 
blockUntilFinished because the queue is not empty and no runners are being 
spawned to empty it.

It may be that the NPE that occurred before in this case just kept the docs
from being lost 'silently', and that this is closer to the actual bug?


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874323#comment-13874323
 ] 

ASF subversion and git services commented on SOLR-4260:
---

Commit 1558988 from [~markrmil...@gmail.com] in branch 
'dev/branches/lucene_solr_4_6'
[ https://svn.apache.org/r1558988 ]

SOLR-4260: Guard against NPE.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874321#comment-13874321
 ] 

ASF subversion and git services commented on SOLR-4260:
---

Commit 1558986 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1558986 ]

SOLR-4260: Guard against NPE.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874320#comment-13874320
 ] 

ASF subversion and git services commented on SOLR-4260:
---

Commit 1558985 from [~markrmil...@gmail.com] in branch 'dev/trunk'
[ https://svn.apache.org/r1558985 ]

SOLR-4260: Guard against NPE.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874317#comment-13874317
 ] 

ASF subversion and git services commented on SOLR-4260:
---

Commit 1558983 from [~markrmil...@gmail.com] in branch 
'dev/branches/lucene_solr_4_6'
[ https://svn.apache.org/r1558983 ]

SOLR-4260: Add name to CHANGES


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874311#comment-13874311
 ] 

Mark Miller commented on SOLR-4260:
---

bq. The conditions in this statement have changed and I think made it possible 
for the null pointer to appear.

Ah, nice - thanks. I had already made some changes, so I couldn't line up the
source lines - I thought you meant the culprit line was the one the NPE came
from.

I'll take a closer look.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874314#comment-13874314
 ] 

ASF subversion and git services commented on SOLR-4260:
---

Commit 1558982 from [~markrmil...@gmail.com] in branch 
'dev/branches/lucene_solr_4_6'
[ https://svn.apache.org/r1558982 ]

SOLR-4260: ConcurrentUpdateSolrServer#blockUntilFinished can return before all 
previously added updates have finished. This could cause distributed updates 
meant for replicas to be lost.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874305#comment-13874305
 ] 

ASF subversion and git services commented on SOLR-4260:
---

Commit 1558981 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1558981 ]

SOLR-4260: Add name to CHANGES


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874301#comment-13874301
 ] 

Mark Miller commented on SOLR-4260:
---

Well, this is important for 4.6.1 - given Potter's feedback, in it goes. Please
help test and review this, guys, especially around this possible NPE.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874302#comment-13874302
 ] 

ASF subversion and git services commented on SOLR-4260:
---

Commit 1558979 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1558979 ]

SOLR-4260: ConcurrentUpdateSolrServer#blockUntilFinished can return before all 
previously added updates have finished. This could cause distributed updates 
meant for replicas to be lost.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874303#comment-13874303
 ] 

ASF subversion and git services commented on SOLR-4260:
---

Commit 1558980 from [~markrmil...@gmail.com] in branch 'dev/trunk'
[ https://svn.apache.org/r1558980 ]

SOLR-4260: Add name to CHANGES


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874300#comment-13874300
 ] 

ASF subversion and git services commented on SOLR-4260:
---

Commit 1558978 from [~markrmil...@gmail.com] in branch 'dev/trunk'
[ https://svn.apache.org/r1558978 ]

SOLR-4260: ConcurrentUpdateSolrServer#blockUntilFinished can return before all 
previously added updates have finished. This could cause distributed updates 
meant for replicas to be lost.


[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874294#comment-13874294
 ] 

Joel Bernstein commented on SOLR-4260:
--

It's actually the runner that is null:
{code}
runner.runnerLock.lock();
{code}

The conditions in this statement have changed, and I think that made it possible for 
the null pointer to appear.
{code}
if ((runner == null && queue.isEmpty()) || scheduler.isTerminated())
{code}
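
A tiny standalone check of that condition - the booleans here are illustrative stand-ins, not Solr state - shows how the loop can fall through to the lock call with a null runner: when waitForEmptyQueue is true and the queue is not empty, "runner == null && queue.isEmpty()" is false even though runner itself is null.

{code}
// Stand-alone illustration only: 'runner' plays the role of runners.peek();
// the booleans stand in for queue.isEmpty() and scheduler.isTerminated().
public class NullRunnerCondition {
  public static void main(String[] args) {
    Object runner = null;                // runners.peek() returned null
    boolean queueEmpty = false;          // but updates are still queued
    boolean schedulerTerminated = false; // scheduler still running
    boolean waitForEmptyQueue = true;

    boolean breakOut = waitForEmptyQueue
        ? (runner == null && queueEmpty) || schedulerTerminated
        : runner == null || schedulerTerminated;

    // Prints "false": the loop does not break, so the original code goes on to
    // call runner.runnerLock.lock() on a null runner and throws the NullPointerException.
    System.out.println("break out of loop? " + breakOut);
  }
}
{code}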

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, SOLR-4260.patch, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874270#comment-13874270
 ] 

Mark Miller commented on SOLR-4260:
---

Strange Joel - queue and scheduler are both final and set in the constructor.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, SOLR-4260.patch, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874204#comment-13874204
 ] 

Joel Bernstein commented on SOLR-4260:
--

In the code snippet below, it looks like runners.peek() can return null and cause 
the exception:

{code}
 public synchronized void blockUntilFinished(boolean waitForEmptyQueue) {
lock = new CountDownLatch(1);
try {
  // Wait until no runners are running
  for (;;) {
Runner runner;
synchronized (runners) {
  runner = runners.peek();
}
if (waitForEmptyQueue) {
  if ((runner == null && queue.isEmpty()) || scheduler.isTerminated())
break;
} else {
  if (runner == null || scheduler.isTerminated())
break;
}
runner.runnerLock.lock();
runner.runnerLock.unlock();
  }
} finally {
  lock.countDown();
  lock = null;
}
  }
{code}

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, SOLR-4260.patch, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874194#comment-13874194
 ] 

Joel Bernstein commented on SOLR-4260:
--

I installed the patch and ran it. 

I'm getting some intermittent null pointers:

1578995 [qtp433857665-17] ERROR org.apache.solr.servlet.SolrDispatchFilter  – 
null:java.lang.NullPointerException
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:401)
at 
org.apache.solr.update.StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:99)
at 
org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:69)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:606)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1449)
at 
org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:179)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1915)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:764)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:203)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, SOLR-4260.patch, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874120#comment-13874120
 ] 

Timothy Potter commented on SOLR-4260:
--

Did another couple of million docs in an oversharded env: 24 replicas on 6 
nodes (m1.mediums, so I didn't want to overload them too much) ... still looking 
good.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, SOLR-4260.patch, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874098#comment-13874098
 ] 

Timothy Potter commented on SOLR-4260:
--

So far so good, Mark! I applied the patch to the latest rev of branch_4x and have 
indexed about 3M docs without hitting the issue; before the patch, I would see 
this issue within a few minutes. The jury is still out and I'll keep stress 
testing it, but it looks promising. Nice work!

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, SOLR-4260.patch, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874031#comment-13874031
 ] 

Mark Miller commented on SOLR-4260:
---

That's all I have come up with so far - though I'm not even completely sold on 
it. Because we are using CUSS with a single thread, all the previous doc adds 
should have hit the request method and so a Runner should be going for them if 
necessary.

It's all pretty tricky logic to understand clearly though.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, SOLR-4260.patch, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873964#comment-13873964
 ] 

Mark Miller commented on SOLR-4260:
---

For a long time, I've wanted to try putting in a check that the queue is empty 
as well for blockUntilFinished when we use it in this case - I just need a test 
that sees this so I can check if it works :)

Without that, it seems there is a window where we can bail before we are done 
sending everything in the queue. Shutdown doesn't help much, because it can't 
even wait for the executor to shut down in this case.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873951#comment-13873951
 ] 

Timothy Potter commented on SOLR-4260:
--

Added some more logging on the leader ... as a bit of context, the replica 
received doc with ID 41029 and then 41041 and didn't receive 41033 and 41038 in 
between ... here's the log of activity on the leader between 41029 and 41041.

2014-01-16 16:03:02,523 [updateExecutor-1-thread-1] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - sent docs to 
[http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1]
 , 41003, 41
005, 41007, 41010, 41014, 41015, 41026, 41029
2014-01-16 16:03:02,527 [qtp417447538-16] INFO  handler.loader.JavabinLoader  - 
test3_shard3_replica2 add: 41033
2014-01-16 16:03:02,527 [qtp417447538-16] INFO  
update.processor.DistributedUpdateProcessor  - doLocalAdd 41033
2014-01-16 16:03:02,527 [qtp417447538-16] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test3_shard3_replica2 queued (to: 
http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1):
 41033
2014-01-16 16:03:02,528 [qtp417447538-16] INFO  handler.loader.JavabinLoader  - 
test3_shard3_replica2 add: 41038
2014-01-16 16:03:02,528 [qtp417447538-16] INFO  
update.processor.DistributedUpdateProcessor  - doLocalAdd 41038
2014-01-16 16:03:02,528 [qtp417447538-16] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test3_shard3_replica2 queued (to: 
http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1):
 41038
2014-01-16 16:03:02,559 [qtp417447538-16] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - blockUntilFinished starting 
http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1
2014-01-16 16:03:02,559 [qtp417447538-16] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - blockUntilFinished is done for 
http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1
2014-01-16 16:03:02,559 [qtp417447538-16] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - shutting down CUSS for 
http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1
2014-01-16 16:03:02,559 [qtp417447538-16] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - shut down CUSS for 
http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1

Not quite sure what this means, but I think your hunch about 
blockUntilFinished being involved is getting warmer.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873888#comment-13873888
 ] 

Timothy Potter commented on SOLR-4260:
--

bq. So strange - it would be a different CUSS instance used for each server

right, I was just mentioning that I did check to make sure there wasn't a bug 
in the routing logic or anything like that ...

agreed on the need for a unit test to reproduce this and am working on the same.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873882#comment-13873882
 ] 

Mark Miller commented on SOLR-4260:
---

I have various theories, but without a test that fails, it's hard to test out 
anything - so I've been putting most of my efforts into a unit test that can 
catch this, but it's been surprisingly difficult for me to trigger in a test.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873874#comment-13873874
 ] 

Joel Bernstein commented on SOLR-4260:
--

That was a blind alley, a faulty test was causing the effect I described above.


> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873868#comment-13873868
 ] 

Mark Miller commented on SOLR-4260:
---

bq.  I've checked the logs on all the other replicas and the docs didn't go 
there either.

So strange - it would be a different CUSS instance used for each server

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Shikhar Bhushan (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873851#comment-13873851
 ] 

Shikhar Bhushan commented on SOLR-4260:
---

This may be unrelated - I have not done much digging or looked at the full 
context, but I was just looking at CUSS out of curiosity.

Why do we flush() the OutputStream, but then write() things like ending tags 
afterwards? Shouldn't the flush come after all those write()s?

https://github.com/apache/lucene-solr/blob/lucene_solr_4_6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.java#L205
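
For what it's worth, here is a generic illustration of the ordering being suggested - plain java.io, not the actual CUSS code: write everything, including the closing tags, and flush last so the tail of the request is not left sitting in a client-side buffer.

{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Generic write-then-flush ordering; the element names are made up for the example.
public class FlushOrdering {
  static void writeUpdate(OutputStream out, String docsXml) throws IOException {
    out.write("<stream>".getBytes(StandardCharsets.UTF_8));
    out.write(docsXml.getBytes(StandardCharsets.UTF_8));
    out.write("</stream>".getBytes(StandardCharsets.UTF_8)); // closing tag first ...
    out.flush();                                             // ... then flush the whole payload
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    writeUpdate(buf, "<doc id=\"1\"/>");
    System.out.println(buf.toString(StandardCharsets.UTF_8.name()));
  }
}
{code}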

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873841#comment-13873841
 ] 

Timothy Potter commented on SOLR-4260:
--

I was able to reproduce this issue on EC2 without any over-sharding (on the latest 
rev of branch_4x) ... basically 6 Solr nodes with 3 shards and RF=2, i.e. each 
replica gets its own Solr instance. Here's the output from my client app that 
traps the inconsistency:

>>
Found 1 shards with mis-matched doc counts.
At January 16, 2014 12:18:08 PM MST
shard2: {

http://ec2-54-236-245-61.compute-1.amazonaws.com:8985/solr/test_shard2_replica2/
 = 62984 LEADER

http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test_shard2_replica1/ 
= 62980 diff:4
}
Details:
shard2
>> finished querying leader, found 62984 documents (62984)
>> finished querying 
>> http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test_shard2_replica1/,
>>  found 62980 documents
Doc [182866] not found in replica: 182866test-12573452420.926573630.5259114828332452this is a 
test1457415570117885953
Doc [182859] not found in replica: 182859test9913669090.53117160.10846350752086309this is a 
test1457415570117885952
Doc [182872] not found in replica: 182872test8245128970.8303660.6560223698806142this is a 
test1457415570117885954
Doc [182876] not found in replica: 182876test-16578314730.48779650.9214420679315872this is a 
test1457415570117885955
Sending hard commit after mis-match and then will wait for user to handle it ...
<<

So four missing docs: 182866, 182859, 182872, 182876
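
For reference, a minimal sketch of this kind of per-replica count check, assuming SolrJ 4.x - the host names and core URLs below are placeholders, not the ones from my cluster. Each core is queried directly with distrib=false so the counts can be compared without the query fanning out.

{code}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

// Query each replica core of a shard directly (distrib=false) and print numFound.
public class ShardCountCheck {
  public static void main(String[] args) throws Exception {
    String[] shard2Replicas = {
      "http://host-a:8985/solr/test_shard2_replica2",   // leader (placeholder URL)
      "http://host-b:8985/solr/test_shard2_replica1"    // replica (placeholder URL)
    };
    SolrQuery q = new SolrQuery("*:*");
    q.set("distrib", false);   // count only the local core, no fan-out
    q.setRows(0);
    for (String url : shard2Replicas) {
      HttpSolrServer core = new HttpSolrServer(url);
      System.out.println(url + " numFound=" + core.query(q).getResults().getNumFound());
      core.shutdown();
    }
  }
}
{code}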

Now I'm thinking this might be in the ConcurrentUpdateSolrServer logic. I added 
some detailed logging to show when JavabinLoader unmarshals a doc and when it 
is offered on the CUSS queue (to be sent to the replica). On the leader, here's 
the log around some messages that were lost:

2014-01-16 14:16:37,534 [qtp417447538-17] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica2 add: 182857
2014-01-16 14:16:37,534 [qtp417447538-17] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test_shard2_replica2 queued: 182857
/
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica2 add: 182859
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test_shard2_replica2 queued: 182859
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica2 add: 182866
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test_shard2_replica2 queued: 182866
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica2 add: 182872
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test_shard2_replica2 queued: 182872
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica2 add: 182876
2014-01-16 14:16:37,552 [qtp417447538-17] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test_shard2_replica2 queued: 182876
2014-01-16 14:16:37,558 [qtp417447538-17] INFO  
update.processor.LogUpdateProcessor  - [test_shard2_replica2] webapp=/solr 
path=/update params={wt=javabin&version=2} {add=[182704 (1457415570048679936), 
182710 (1457415570049728512), 182711 (1457415570049728513), 182717 
(1457415570056019968), 182720 (1457415570056019969), 182722 
(1457415570057068544), 182723 (1457415570057068545), 182724 
(1457415570058117120), 182730 (1457415570058117121), 182735 
(1457415570059165696), ... (61 adds)]} 0 72
/
2014-01-16 14:16:37,764 [qtp417447538-17] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica2 add: 182880
2014-01-16 14:16:37,764 [qtp417447538-17] INFO  
solrj.impl.ConcurrentUpdateSolrServer  - test_shard2_replica2 queued: 182880
 

As you can see, the leader received doc with ID 182859 at 2014-01-16 
14:16:37,552 and then queued it on the CUSS queue to be sent to the replica. On 
the replica, the log shows it receiving 182857 and then 182880 ... the 4 
missing docs (182866, 182859, 182872, 182876) were definitely queued in CUSS on 
the leader. I've checked the logs on all the other replicas and the docs didn't 
go there either.


2014-01-16 14:16:37,292 [qtp417447538-14] INFO  handler.loader.JavabinLoader  - 
test_shard2_replica1 add: 182857
2014-01-16 14:16:37,293 [qtp417447538-14] INFO  
update.processor.LogUpdateProcessor  - [test_shard2_replica1] webapp=/solr 
path=/update 
params={distrib.from=http://ec2-54-236-245-61.compute-1.amazonaws.com:8985/solr/test_shard2_replica2/&update.distrib=FROMLEADER&wt=javabin&version=2}
 {add=[182841 (1457415570096914432), 182842 (1457415570096914433), 182843 
(1457415570096914434), 182844 (1457415570096914435), 182846 
(1457415570097963008), 182848 (1457415570097963009), 182850 
(1457415570099011584), 182854 (145741557009

[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873788#comment-13873788
 ] 

Joel Bernstein commented on SOLR-4260:
--

I'm betting it's something in the streaming. This afternoon I'm going to put 
some debugging in to see if the docs being flushed by the commit were already 
written to the stream. My bet is that they were, and that the commit is pushing 
them all the way through to the replica. 

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873657#comment-13873657
 ] 

Mark Miller commented on SOLR-4260:
---

I spent some time a while back trying to find a fault in 
ConcurrentUpdateSolrServer#blockUntilFinished - I didn't uncover anything yet though.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873638#comment-13873638
 ] 

Joel Bernstein commented on SOLR-4260:
--

The commit behavior is interesting. I'm seeing docs flushing from the leader to 
the replica following a manual hard commit issued long after indexing has stopped. 
That means that somewhere along the way docs are buffered and waiting for an event 
to flush them to the replica. I haven't figured out just yet where the 
buffering is occurring, but I'm trying to track it down.
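
A small probe in the spirit of this observation, assuming SolrJ 4.x - the ZooKeeper address, collection name, and replica core URL are hypothetical: count a suspect replica core directly, send a hard commit through the normal update chain, and count again. If the counts only converge after the commit, something between the leader and the replica was buffering the missing updates.

{code}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

// Count a single replica core before and after an explicit hard commit.
public class CommitFlushProbe {
  static long count(String coreUrl) throws Exception {
    HttpSolrServer core = new HttpSolrServer(coreUrl);
    SolrQuery q = new SolrQuery("*:*");
    q.set("distrib", false);   // count only this core
    q.setRows(0);
    long n = core.query(q).getResults().getNumFound();
    core.shutdown();
    return n;
  }

  public static void main(String[] args) throws Exception {
    String replica = "http://host-b:8985/solr/test_shard2_replica1"; // hypothetical core URL
    CloudSolrServer cloud = new CloudSolrServer("zkhost1:2181");      // hypothetical ZK address
    cloud.setDefaultCollection("test");                               // hypothetical collection

    System.out.println("replica before commit: " + count(replica));
    cloud.commit(true, true);   // hard commit, wait for flush and a new searcher
    System.out.println("replica after commit:  " + count(replica));

    cloud.shutdown();
  }
}
{code}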

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873457#comment-13873457
 ] 

Markus Jelsma commented on SOLR-4260:
-

Seems autocommit has something to do with triggering the problem, at least in 
my case.

* build from the 13th without autocommit: out of sync very soon
* build from the 13th with autocommit: out of sync after a while
* build from the 6th without autocommit: out of sync after a while
* build from the 6th with autocommit: out of sync after many more documents

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873334#comment-13873334
 ] 

Markus Jelsma commented on SOLR-4260:
-

Correction: it happens on a build from the 6th as well, although it doesn't look 
as bad as when indexing to a build from the 13th.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873303#comment-13873303
 ] 

Markus Jelsma commented on SOLR-4260:
-

Did something crucial change recently? Since at least the 14th, maybe earlier, 
indexing small segments from Nutch in several cycles (a few hundred docs per cycle), 
some shards get out of sync really quickly! I did lots of tests before that but 
didn't see it happening.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872666#comment-13872666
 ] 

Mark Miller commented on SOLR-4260:
---

Well, the effects I was seeing were related to having a control collection with a 
core named collection1 and another collection called collection1. Over-shard on 
top of that, and it causes some similar-looking effects.

I've addressed that and will see if ramping up my tests can spot anything - so 
far I cannot replicate it in a test though.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-15 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871878#comment-13871878
 ] 

Markus Jelsma commented on SOLR-4260:
-

Mark, no, each node holds a single JVM and single core.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-14 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870984#comment-13870984
 ] 

Mark Miller commented on SOLR-4260:
---

[~markus17], are you indexing to an oversharded cluster?




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-14 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870868#comment-13870868
 ] 

Markus Jelsma commented on SOLR-4260:
-

I also think I'm seeing this happening right now with a trunk build from 
yesterday. I have been slowly indexing a few hundred docs every few minutes for 
quite some time while fixing a Nutch issue. Looks like I can restart that job, 
because the replicas are already out of sync :)




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-14 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870859#comment-13870859
 ] 

Mark Miller commented on SOLR-4260:
---

FYI, I also had to overshard to see anything.




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-14 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870865#comment-13870865
 ] 

Markus Jelsma commented on SOLR-4260:
-

Mark - We use CloudSolrServer and send batches of around 380 documents from 
Nutch. I am not sure which actual implementation we get back when connecting.
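
For illustration, a minimal SolrJ sketch of this kind of batched indexing 
through CloudSolrServer (the ZooKeeper ensemble string, collection name, and 
field names below are made-up placeholders, not details of the Nutch job):

{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        // Build one batch of ~380 docs and send it in a single add request.
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 380; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title", "example title " + i);
            batch.add(doc);
        }
        server.add(batch);
        server.commit();
        server.shutdown();
    }
}
{code}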




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-14 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870858#comment-13870858
 ] 

Mark Miller commented on SOLR-4260:
---

[~markus17], are you loading docs via the bulk add methods or CUSS 
(ConcurrentUpdateSolrServer) or something else?

[~tim.potter], I think I'm seeing your issue. I have not gotten to the bottom of 
it yet, but if I am seeing the same thing, it seems those docs are being set up 
to be sent to 0 replicas. Trying to figure out why/how.




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-13 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870197#comment-13870197
 ] 

Timothy Potter commented on SOLR-4260:
--

Oddly enough, just 1 indexing thread on the client side and batches of around 
30-40 docs per shard (i.e. I set my batch size so that direct updates send about 
30-40 docs per shard to the leaders from the client side).
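
To make that batch-size arithmetic concrete, a tiny sketch (the numbers are 
illustrative; with a compositeId router spreading docs roughly evenly, the 
client batch size is roughly the per-shard target times the shard count):

{code}
// Illustrative arithmetic only: choose the client batch size so that each shard
// leader receives roughly the desired number of docs per update request.
public class BatchSizeSketch {
    public static void main(String[] args) {
        int numShards = 10;           // e.g. a 10-shard collection
        int docsPerShardTarget = 35;  // "30-40 docs per shard"
        int clientBatchSize = numShards * docsPerShardTarget;
        System.out.println("client batch size ~= " + clientBatchSize); // ~350 docs
    }
}
{code}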




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-13 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870068#comment-13870068
 ] 

Mark Miller commented on SOLR-4260:
---

How many threads are you using to load docs? How large are the batches?




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-13 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869916#comment-13869916
 ] 

Timothy Potter commented on SOLR-4260:
--

Makes sense about not waiting because of the penalty, now that I've had a chance 
to get into the details of that code.

I spent a lot of time on Friday and over the weekend trying to track down the 
docs getting dropped. Unfortunately I have not been able to track down the 
source of the issue yet. I'm fairly certain the issue happens before docs get 
submitted to CUSS, meaning that the lost docs never seemed to hit the queue in 
ConcurrentUpdateSolrServer. My original thinking was that, given the complex 
nature of CUSS, there might be some sort of race condition, but after adding a 
log of what hit the queue, it seems that the documents that get lost never hit 
the queue. Not to mention that the actual use of CUSS is mostly single-threaded, 
because StreamingSolrServers constructs them with a threadCount of 1.

As a side note, one thing I noticed is that direct updates don't necessarily hit 
the correct core initially when a Solr node hosts more than one shard per 
collection. In other words, if host X has shard1 and shard3 of collection foo, 
then some update requests hit shard1 on host X when they should go to shard3 on 
the same host; shard1 correctly forwards them on, but it's still an extra hop. 
Of course that is probably not a big deal in production, as it would be rare to 
host multiple shards of the same collection on the same Solr host unless you are 
over-sharding.

In terms of this issue, here's what I'm seeing:

- Assume a SolrCloud environment with shard1 having replicas on hosts A and B; A 
is the current leader.
- The client sends a direct update request to shard1 on host A containing 3 docs 
(1, 2, 3), for example.
- The batch from the client gets broken up into individual docs during request 
parsing.
- Docs 1, 2, 3 get indexed on host A (the leader).
- Docs 1 and 2 get queued into CUSS and sent on to the replica on host B 
(sometimes in the same request, sometimes in separate requests).
- Doc 3 never makes it and, from what I can tell, never hits the queue.

This may be anecdotal, but from what I can tell it's always docs at the end of a 
batch and not in the middle; I haven't seen a case where 1 and 3 make it and 2 
does not ... maybe useful, maybe not. The only other thing I'll mention is that 
it does seem timing / race-condition related: it's almost impossible to 
reproduce this on my Mac when running 2 shards across 2 nodes, but much easier 
to trigger if I ramp up to, say, 8 shards on 2 nodes, i.e. the busier my CPU is, 
the easier it is to see docs getting dropped. A quick way to confirm this kind 
of mismatch is sketched below.
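
One way to confirm that kind of drift, as referenced above, is to query each 
core of a shard directly with distrib=false and compare numFound. A minimal 
SolrJ sketch, assuming placeholder core URLs:

{code}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

// Queries each core of a shard directly (distrib=false) and prints numFound,
// which is how the leader/replica drift discussed here is usually spotted.
// The core URLs below are placeholders for a real deployment.
public class ReplicaCountCheck {
    public static void main(String[] args) throws Exception {
        String[] coreUrls = {
            "http://hostA:8983/solr/demo_shard1_replica1",
            "http://hostB:8983/solr/demo_shard1_replica2"
        };
        for (String url : coreUrls) {
            HttpSolrServer server = new HttpSolrServer(url);
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);
            q.set("distrib", "false");  // count only this core, no fan-out
            long numFound = server.query(q).getResults().getNumFound();
            System.out.println(url + " numFound=" + numFound);
            server.shutdown();
        }
    }
}
{code}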






[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869137#comment-13869137
 ] 

Mark Miller commented on SOLR-4260:
---

SOLR-5625: Add to testing for SolrCmdDistributor




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869134#comment-13869134
 ] 

Mark Miller commented on SOLR-4260:
---

In this case there is no wait, due to the massive penalty it puts on 
doc-per-request indexing speed.




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-08 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866031#comment-13866031
 ] 

Timothy Potter commented on SOLR-4260:
--

Cuss on CUSS ;-) Thanks, I sometimes forget that the client-side batch gets 
broken into individual AddUpdateCommands when sending on to the replicas.




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865984#comment-13865984
 ] 

Yonik Seeley commented on SOLR-4260:


{code}
Basically, there are 34 docs on the leader and only 25 processed in 4 separate 
batches (from my counting of the logs) on the replica. Why wouldn't it just be 
one for one? The docs are all roughly the same size ... and what's breaking it 
up? 
{code}

ConcurrentUpdateSolrServer?  If another doc doesn't come in quickly enough 
(250ms by default), it ends the batch.
I thought there used to be a doc count limit too or something... but after a 
quick scan, I'm not seeing it.
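
To illustrate the batch-ending behavior described above, here is a simplified 
drain loop (not Solr's actual ConcurrentUpdateSolrServer code): the runner keeps 
pulling docs off a queue and treats a 250 ms quiet period as the end of the 
current batch.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Simplified illustration of a CUSS-style drain loop: documents are pulled off
// a queue and streamed as one batch until the queue stays empty for the poll
// timeout (250 ms here), at which point the batch ends.
public class BatchDrainSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
        for (int i = 0; i < 5; i++) queue.add("doc" + i);

        List<String> batch = new ArrayList<String>();
        while (true) {
            // Wait up to 250 ms for the next doc; null means the batch is over.
            String doc = queue.poll(250, TimeUnit.MILLISECONDS);
            if (doc == null) break;
            batch.add(doc);
        }
        System.out.println("sent one batch of " + batch.size() + " docs");
    }
}
{code}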




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-08 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865963#comment-13865963
 ] 

Timothy Potter commented on SOLR-4260:
--

Still digging into it ... I'm curious why a batch of 34 adds on the leader gets 
processed as several sub-batches on the replica. Here's what I'm seeing in the 
logs around the documents that are missing from the replica. Basically, there 
are 34 docs on the leader and only 25 processed in 4 separate batches (from my 
counting of the logs) on the replica. Why wouldn't it just be one for one? The 
docs are all roughly the same size ... and what's breaking it up? Having trouble 
seeing that in the logs ;-)

On the leader:

2014-01-08 12:23:21,501 [qtp604104855-17] INFO  
update.processor.LogUpdateProcessor  - 
[demo_shard1_replica1] webapp=/solr path=/update params={wt=javabin&version=2} 
{add=[82900 (1456683668174012416), 82901 (1456683668181352448), 82903 
(1456683668181352449), 82904 (1456683668181352450), 82912 
(1456683668187643904), 82913 (1456683668188692480), 82914 
(1456683668188692481), 82916 (1456683668188692482), 82917 
(1456683668188692483), 82918 (1456683668188692484), ... (34 adds)]} 0 34

2014-01-08 12:23:21,600 [qtp604104855-17] INFO  
update.processor.LogUpdateProcessor  - 
[demo_shard1_replica1] webapp=/solr path=/update params={wt=javabin&version=2} 
{add=[83002 (1456683668280967168), 83005 (1456683668286210048), 83008 
(1456683668286210049), 83011 (1456683668286210050), 83012 
(1456683668286210051), 83013 (1456683668287258624), 83018 
(1456683668287258625), 83019 (1456683668289355776), 83023 
(1456683668289355777), 83024 (1456683668289355778), ... (43 adds)]} 0 32


On the replica:

2014-01-08 12:23:21,126 [qtp604104855-22] INFO  
update.processor.LogUpdateProcessor  - 
[demo_shard1_replica2] webapp=/solr path=/update 
params={distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 
{add=[82900 (1456683668174012416), 82901 (1456683668181352448), 82903 
(1456683668181352449)]} 0 1

2014-01-08 12:23:21,134 [qtp604104855-22] INFO  
update.processor.LogUpdateProcessor  - 
[demo_shard1_replica2] webapp=/solr path=/update 
params={distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 
{add=[82904 (1456683668181352450), 82912 (1456683668187643904), 82913 
(1456683668188692480), 82914 (1456683668188692481), 82916 
(1456683668188692482), 82917 (1456683668188692483), 82918 
(1456683668188692484), 82919 (1456683668188692485), 82922 
(1456683668188692486)]} 0 2

2014-01-08 12:23:21,139 [qtp604104855-22] INFO  
update.processor.LogUpdateProcessor  - 
[demo_shard1_replica2] webapp=/solr path=/update 
params={distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 
{add=[82923 (1456683668188692487), 82926 (1456683668190789632), 82928 
(1456683668190789633), 82932 (1456683668190789634), 82939 
(1456683668192886784), 82945 (1456683668192886785), 82946 
(1456683668192886786), 82947 (1456683668193935360), 82952 
(1456683668193935361), 82962 (1456683668193935362), ... (12 adds)]} 0 3

2014-01-08 12:23:21,144 [qtp604104855-22] INFO  
update.processor.LogUpdateProcessor  - 
[demo_shard1_replica2] webapp=/solr path=/update 
params={distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 
{add=[82967 (1456683668199178240)]} 0 0


 9 Docs Missing here 

2014-01-08 12:23:21,227 [qtp604104855-22] INFO  
update.processor.LogUpdateProcessor  - 
[demo_shard1_replica2] webapp=/solr path=/update 
params={distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 
{add=[83002 (1456683668280967168), 83005 (1456683668286210048), 83008 
(1456683668286210049), 83011 (1456683668286210050), 83012 
(1456683668286210051), 83013 (1456683668287258624)]} 0 2



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865769#comment-13865769
 ] 

Mark Miller commented on SOLR-4260:
---

No, wait, it could jibe. We only check the last 99 docs on peer sync - if a bunch 
of docs just didn't show up well before that, it wouldn't be detected by peer 
sync. I still think SolrCmdDistributor is the first place to look.
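
A toy illustration of why an old gap can slip past a bounded peer-sync check 
(this is not Solr's PeerSync implementation, just the windowing idea): only the 
newest 99 versions are compared, so a doc dropped well before that window is 
never reported missing.

{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: if replicas only compare their most recent N update
// versions, a doc that was dropped long before that window never shows up as a
// missing version, so the check reports the replicas as in sync.
public class PeerSyncWindowSketch {
    static final int WINDOW = 99;

    static boolean looksInSync(List<Long> leaderVersions, List<Long> replicaVersions) {
        Set<Long> replicaHas = new HashSet<Long>(replicaVersions);
        int n = leaderVersions.size();
        // Only the newest WINDOW versions (end of the ascending list) are compared.
        for (int i = Math.max(0, n - WINDOW); i < n; i++) {
            if (!replicaHas.contains(leaderVersions.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Long> leader = new ArrayList<Long>();
        for (long v = 1; v <= 500; v++) leader.add(v);
        List<Long> replica = new ArrayList<Long>(leader);
        replica.remove(Long.valueOf(10));  // an old update that never arrived
        // Prints true: the gap is far outside the 99-version window.
        System.out.println(looksInSync(leader, replica));
    }
}
{code}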




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865744#comment-13865744
 ] 

Mark Miller commented on SOLR-4260:
---

Although that doesn't really jibe with the transaction logs being identical... hmm...




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865742#comment-13865742
 ] 

Mark Miller commented on SOLR-4260:
---

I've noticed something like this too - but nothing I could reproduce easily. I 
imagine it's likely an issue in SolrCmdDistributor.




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2014-01-07 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864141#comment-13864141
 ] 

Markus Jelsma commented on SOLR-4260:
-

Ok, I followed all the great work here and in the related tickets, and yesterday 
I had time to rebuild Solr and check for this issue. I hadn't seen it yesterday, 
but it is right in front of me again, using a fresh build from January 6th.

Leader has Num Docs: 379659
Replica has Num Docs: 379661




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856078#comment-13856078
 ] 

Mark Miller commented on SOLR-4260:
---

That's interesting. The logging makes it look like it's not creating its new 
ephemeral live node for some reason... or the leader is not getting an updated 
view of the live nodes...




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-23 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856052#comment-13856052
 ] 

Timothy Potter commented on SOLR-4260:
--

Thanks Mark, I suspected my test case was a little cherry-picked ... something 
interesting happened when I also severed the connection between the replica and 
ZK (i.e. the same test as above, but I also dropped the ZK connection on the 
replica).

2013-12-23 15:39:57,170 [main-EventThread] INFO  common.cloud.ConnectionManager 
 - Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62 
name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181 
got event WatchedEvent state:Disconnected type:None path:null path:null 
type:None
2013-12-23 15:39:57,170 [main-EventThread] INFO  common.cloud.ConnectionManager 
 - zkClient has disconnected

>>> fixed the connection between replica and ZK here <<<

2013-12-23 15:40:45,579 [main-EventThread] INFO  common.cloud.ConnectionManager 
 - Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62 
name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181 
got event WatchedEvent state:Expired type:None path:null path:null type:None
2013-12-23 15:40:45,579 [main-EventThread] INFO  common.cloud.ConnectionManager 
 - Our previous ZooKeeper session was expired. Attempting to reconnect to 
recover relationship with ZooKeeper...
2013-12-23 15:40:45,580 [main-EventThread] INFO  
common.cloud.DefaultConnectionStrategy  - Connection expired - starting a new 
one...
2013-12-23 15:40:45,586 [main-EventThread] INFO  common.cloud.ConnectionManager 
 - Waiting for client to connect to ZooKeeper
2013-12-23 15:40:45,595 [main-EventThread] INFO  common.cloud.ConnectionManager 
 - Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62 
name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181 
got event WatchedEvent state:SyncConnected type:None path:null path:null 
type:None
2013-12-23 15:40:45,595 [main-EventThread] INFO  common.cloud.ConnectionManager 
 - Client is connected to ZooKeeper
2013-12-23 15:40:45,595 [main-EventThread] INFO  common.cloud.ConnectionManager 
 - Connection with ZooKeeper reestablished.
2013-12-23 15:40:45,596 [main-EventThread] WARN  solr.cloud.RecoveryStrategy  - 
Stopping recovery for zkNodeName=core_node3core=cloud_shard1_replica3
2013-12-23 15:40:45,597 [main-EventThread] INFO  solr.cloud.ZkController  - 
publishing core=cloud_shard1_replica3 state=down
2013-12-23 15:40:45,597 [main-EventThread] INFO  solr.cloud.ZkController  - 
numShards not found on descriptor - reading it from system property
2013-12-23 15:40:45,905 [qtp2124890785-14] INFO  handler.admin.CoreAdminHandler 
 - It has been requested that we recover
2013-12-23 15:40:45,906 [qtp2124890785-14] INFO  
solr.servlet.SolrDispatchFilter  - [admin] webapp=null path=/admin/cores 
params={action=REQUESTRECOVERY&core=cloud_shard1_replica3&wt=javabin&version=2} 
status=0 QTime=2 
2013-12-23 15:40:45,909 [Thread-17] INFO  solr.cloud.ZkController  - publishing 
core=cloud_shard1_replica3 state=recovering
2013-12-23 15:40:45,909 [Thread-17] INFO  solr.cloud.ZkController  - numShards 
not found on descriptor - reading it from system property
2013-12-23 15:40:45,920 [Thread-17] INFO  solr.update.DefaultSolrCoreState  - 
Running recovery - first canceling any ongoing recovery
2013-12-23 15:40:45,921 [RecoveryThread] INFO  solr.cloud.RecoveryStrategy  - 
Starting recovery process.  core=cloud_shard1_replica3 
recoveringAfterStartup=false
2013-12-23 15:40:45,924 [RecoveryThread] INFO  solr.cloud.ZkController  - 
publishing core=cloud_shard1_replica3 state=recovering
2013-12-23 15:40:45,924 [RecoveryThread] INFO  solr.cloud.ZkController  - 
numShards not found on descriptor - reading it from system property
2013-12-23 15:40:48,613 [qtp2124890785-15] INFO  solr.core.SolrCore  - 
[cloud_shard1_replica3] webapp=/solr path=/select 
params={q=foo_s:bar&distrib=false&wt=json&rows=0} hits=0 status=0 QTime=1 
2013-12-23 15:42:42,770 [qtp2124890785-13] INFO  solr.core.SolrCore  - 
[cloud_shard1_replica3] webapp=/solr path=/select 
params={q=foo_s:bar&distrib=false&wt=json&rows=0} hits=0 status=0 QTime=1 
2013-12-23 15:42:45,650 [main-EventThread] ERROR solr.cloud.ZkController  - 
There was a problem making a request to the 
leader:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I 
was asked to wait on state down for cloud86:8986_solr but I still do not see 
the requested state. I see state: recovering live:false
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at 
org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1434)
at 
org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController

[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856013#comment-13856013
 ] 

Mark Miller commented on SOLR-4260:
---

bq. so we kind of punted

The other thing to note is that if you restart the shard or that node or the 
cluster, you should be able to do it without losing any data. It will recover 
from the leader when everything else is working correctly.




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856009#comment-13856009
 ] 

Mark Miller commented on SOLR-4260:
---

Yeah, that's currently expected. We don't expect the case where you can talk to 
ZooKeeper but not your replicas to be common, so we kind of punted on this 
scenario for the first phase.

Some related JIRA issues:

SOLR-5482
SOLR-5450
SOLR-5495   

I think we should do all that, but the key in this case is that we need to pass 
the order to recover through ZooKeeper to the partitioned-off replica. With an 
eventually consistent model, it can be off for a short time, but it needs to 
recover in a timely manner.

I think this is the right solution because the replica is sure to either get the 
information to recover from ZooKeeper or lose its connection to ZooKeeper, in 
which case it will have to recover anyway.
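
As a conceptual sketch of that idea only (not Solr's implementation; the 
ZooKeeper path and method names are made up), the leader records a recovery 
order under a znode that the replica watches, so the replica either sees the 
order or loses its own ZK session and recovers anyway:

{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RecoverOrderSketch {

    // Leader side: it could not reach the replica over HTTP, so it publishes a
    // "please recover" marker in ZooKeeper (parent path assumed to exist).
    static void requestRecovery(ZooKeeper zk, String replicaNode) throws Exception {
        zk.create("/recovery_orders/" + replicaNode, "recover".getBytes("UTF-8"),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Replica side: watch for the order; if it appears, start local recovery.
    static void watchForRecoveryOrder(ZooKeeper zk, final String replicaNode) throws Exception {
        zk.exists("/recovery_orders/" + replicaNode, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getType() == Watcher.Event.EventType.NodeCreated) {
                    System.out.println(replicaNode + ": recovery ordered via ZK");
                    // a real replica would kick off its recovery strategy here
                }
            }
        });
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical ensemble address; leader and replica would normally be separate JVMs.
        ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, new Watcher() {
            public void process(WatchedEvent event) { /* connection events ignored here */ }
        });
        watchForRecoveryOrder(zk, "core_node2");
        requestRecovery(zk, "core_node2");
        Thread.sleep(1000); // give the watch time to fire
        zk.close();
    }
}
{code}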




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-23 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856002#comment-13856002
 ] 

Timothy Potter commented on SOLR-4260:
--

Found another interesting case that may or may not be valid, depending on 
whether we think HTTP requests between a leader and a replica can fail even 
though the ZooKeeper session on the replica does not drop.

Specifically, what I'm seeing is that if an update request between the leader 
and replica fails, but the replica doesn't lose its session with ZK, then the 
replica can get out of sync with the leader. In a real network partition, the ZK 
connection would also likely be lost and the replica would get marked as down. 
So as long as the HTTP connection timeout between the leader and replica exceeds 
the ZK client timeout, the replica would probably recover correctly, rendering 
this test case invalid. So maybe the main question here is whether we think it's 
possible for HTTP requests between a leader and replica to fail even though the 
ZooKeeper connection stays alive.

Here are the steps I used to reproduce this case (all using revision 1553150 on 
branch_4x):

*> STEP 1: Set up a collection named “cloud” containing 1 shard and 2 replicas 
on hosts cloud84 (127.0.0.1:8984) and cloud85 (127.0.0.1:8985)*

SOLR_TOP=/home/ec2-user/branch_4x/solr
$SOLR_TOP/cloud84/cloud-scripts/zkcli.sh -zkhost $ZK_HOST -cmd upconfig 
-confdir $SOLR_TOP/cloud84/solr/cloud/conf -confname cloud
API=http://localhost:8984/solr/admin/collections
curl -v 
"$API?action=CREATE&name=cloud&replicationFactor=2&numShards=1&collection.configName=cloud"

Replica on cloud84 is elected as the initial leader. /clusterstate.json looks 
like:

{"cloud":{
"shards":{"shard1":{
"range":"8000-7fff",
"state":"active",
"replicas":{
  "core_node1":{
"state":"active",
"base_url":"http://cloud84:8984/solr";,
"core":"cloud_shard1_replica1",
"node_name":"cloud84:8984_solr",
"leader":"true"},
  "core_node2":{
"state":"active",
"base_url":"http://cloud85:8985/solr";,
"core":"cloud_shard1_replica2",
"node_name":"cloud85:8985_solr",
"maxShardsPerNode":"1",
"router":{"name":"compositeId"},
"replicationFactor":"2"}}


*> STEP 2: Simulate network partition*

sudo iptables -I INPUT 1 -i lo -p tcp --sport 8985 -j DROP; sudo iptables -I 
INPUT 2 -i lo -p tcp --dport 8985 -j DROP

Various ways to do this, but to keep it simple, I'm just dropping inbound 
traffic on localhost to port 8985.

*> STEP 3: Send document with ID “doc1” to leader on cloud84*

curl "http://localhost:8984/solr/cloud/update"; -H 
'Content-type:application/xml' \
  --data-binary 'doc1bar'

The update request takes some time because the replica is down but ultimately 
succeeds on the leader. In the logs on the leader, we have (some stack trace 
lines removed for clarity):

2013-12-23 10:59:33,688 [updateExecutor-1-thread-1] ERROR 
solr.update.StreamingSolrServers  - error
org.apache.http.conn.HttpHostConnectException: Connection to 
http://cloud85:8985 refused
at 
org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
...
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:232)
...
Caused by: java.net.ConnectException: Connection timed out
...
2013-12-23 10:59:33,695 [qtp1073932139-16] INFO  
update.processor.LogUpdateProcessor  - [cloud_shard1_replica1] webapp=/solr 
path=/update params={} {add=[doc1 (1455228778490888192)]} 0 63256
2013-12-23 10:59:33,702 [updateExecutor-1-thread-2] INFO  
update.processor.DistributedUpdateProcessor  - try and ask 
http://cloud85:8985/solr to recover
2013-12-23 10:59:48,718 [updateExecutor-1-thread-2] ERROR 
update.processor.DistributedUpdateProcessor  - http://cloud85:8985/solr: Could 
not tell a replica to recover:org.apache.solr.client.solrj.SolrServerException: 
IOException occured when talking to server at: http://cloud85:8985/solr
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:507)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor$1.run(DistributedUpdateProcessor.java:657)
...
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to 
cloud85:8985 timed out
...

Of course these log messages are expected. The key is that the leader accepted 
the update and now has one doc with ID "doc1".

*> STEP 4: Heal the network partition*

sudo service iptables restart (undoes the DROP rules we added above)

*> STEP 5: Send document with ID “doc2” to leader on cloud84*

curl "http://localhost:8984/solr/cloud/update";

[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855273#comment-13855273
 ] 

Mark Miller commented on SOLR-4260:
---

SOLR-5552 investigation has also led to SOLR-5569 and SOLR-5568 




[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-12 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846812#comment-13846812
 ] 

Timothy Potter commented on SOLR-4260:
--

Mark -> https://issues.apache.org/jira/browse/SOLR-5552

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846775#comment-13846775
 ] 

Mark Miller commented on SOLR-4260:
---

Ah, thanks for the explanation.  I think we should roll that specific issue 
into a new JIRA issue. 

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-12 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846755#comment-13846755
 ] 

Timothy Potter commented on SOLR-4260:
--

I'm sorry for being unclear; "waiting" was probably the wrong term ... and they 
definitely continue right on down the path of selecting the wrong leader. 

Here's what I know so far, which admittedly isn't much:

As cloud85 (replica before it crashed) is initializing, it enters the wait 
process in ShardLeaderElectionContext#waitForReplicasToComeUp; this is expected 
and a good thing.

A short time later, cloud84 (leader before it crashed) begins initializing and
gets to a point where it adds itself as a possible leader for the shard (by
creating a znode under /collections/cloud/leader_elect/shard1/election), which
lets cloud85 return from waitForReplicasToComeUp and try to determine who
should be the leader.

cloud85 then tries to run the SyncStrategy, which can never work because in 
this scenario the Jetty HTTP listener is not active yet on either node, so all 
replication work that uses HTTP requests fails on both nodes ... PeerSync 
treats these failures as indicators that the other replicas in the shard are 
unavailable (or whatever) and assumes success. Here's the log message:

2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] WARN solr.update.PeerSync 
- PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr couldn't 
connect to http://cloud84:8984/solr/cloud_shard1_replica2/, counting as success

The Jetty HTTP listener doesn't start accepting connections until long after 
this process has completed and already selected the wrong leader.
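
A simple way to watch for that window from the outside is to poll the port
until Jetty actually answers; this is just a sketch, and the ping path assumes
the stock example solrconfig with /admin/ping enabled:

# Prints 000 while the listener refuses or times out, and an HTTP status code
# once Jetty is accepting connections.
for i in $(seq 1 60); do
  curl -s -o /dev/null -w "%{http_code}\n" --max-time 2 \
    "http://cloud84:8984/solr/cloud_shard1_replica2/admin/ping"
  sleep 1
done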

From what I can see, we seem to have a leader recovery process that is based
partly on HTTP requests to the other nodes, but the HTTP listener on those
nodes isn't active yet. We need a leader recovery process that doesn't rely on
HTTP requests. Perhaps leader recovery for a shard w/o a current leader may
need to work differently than leader election in a shard that has replicas
that can respond to HTTP requests? All of what I'm seeing makes perfect sense
for leader election when there are active replicas and the current leader
fails.

All this aside, I'm not asserting that this is the only cause for the 
out-of-sync issues reported in this ticket, but it definitely seems like it 
could happen in a real cluster.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846694#comment-13846694
 ] 

Mark Miller commented on SOLR-4260:
---

bq.  because they are waiting on each other;

That doesn't make sense to me - the wait should be until all the replicas for a 
shard are up - so what exactly are they both waiting on? If they are both 
waiting, there should be enough replicas up to continue...

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-12 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846593#comment-13846593
 ] 

Timothy Potter commented on SOLR-4260:
--

Agreed on the wait being necessary (which I actually annotated in the comment
above). The crux of the issue here is that the replica (cloud85) can't sync
with the previous leader (cloud84) because they are waiting on each other, much
like a deadlock. Eventually, they both give up and one wins; unfortunately, in
my test case cloud85 wins, which leaves the shard out of sync because the wrong
leader is selected in this scenario (cloud84 should have been selected).

I'm continuing to dig into this but have come to the conclusion that tweaking 
the waitForReplicasToComeUp process is a dead end and it's working as well as 
it can.


> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846572#comment-13846572
 ] 

Mark Miller commented on SOLR-4260:
---

The other issue is expected as well. It's the safety mechanism - we don't let
you just start one node and let it become the leader - ideally you want all
replicas to be involved in the election to prevent data loss. You have to be
explicit if you want this to work with no wait. It might be nice if we added a
startup sys prop that caused it not to wait on first startup.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-11 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845895#comment-13845895
 ] 

Timothy Potter commented on SOLR-4260:
--

ok - cool ... just wanted to make sure that "stale" situation was expected ...

the more I dig into ShardLeaderElectionContext's decision-making process, the
more I think looking at state won't work, because both replicas are in the
"down" state while this is happening. I think some determination of whether the
node is "reachable", so that PeerSync can get good information from it, is what
needs to be factored into ShardLeaderElectionContext. Or maybe there is another
state: "trying to figure out my role in the world as I come back up" ;-)

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-11 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845887#comment-13845887
 ] 

Mark Miller commented on SOLR-4260:
---

A lot there! I'll respond to most of it later.

As far as the stale state goes, that is expected. You cannot tell the state
just from clusterstate.json - it is a mix of clusterstate.json and the
live_nodes list. If the live node for anything in clusterstate.json is missing,
it's considered not up. This is just currently by design - without live_nodes,
you don't know the state.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-11 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845847#comment-13845847
 ] 

Timothy Potter commented on SOLR-4260:
--

I don't have a fix yet, but I wanted to post an update here to get some feedback
on what I'm seeing ...

I have a simple SolrCloud configuration setup locally: 1 collection named 
"cloud" with 1 shard and replicationFactor 2, i.e. here's what I use to create 
it:
curl "http://localhost:8984/solr/admin/collections?action=CREATE&name=cloud&replicationFactor=$REPFACT&numShards=1&collection.configName=cloud"

The collection gets distributed on two nodes: cloud84:8984 and cloud85:8985 
with cloud84 being assigned the leader.
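
Leader assignment can also be checked straight from ZooKeeper with the zkcli
script that ships with Solr; the script path and ZooKeeper address below are
assumptions for this local two-node setup:

# Dump the collection state as ZooKeeper sees it ("leader":"true" included).
example/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9984 -cmd get /clusterstate.json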

Here's an outline of the process I used to get my collection out-of-sync during 
indexing:

1) start indexing docs using CloudSolrServer in SolrJ - direct updates go to 
the leader and replica remains in sync for as long as I let this process run
2) kill -9 the process for the replica cloud85
3) let indexing continue against cloud84 for a few seconds (just to get the 
leader and replica out-of-sync once I bring the replica back online)
4) kill -9 the process for the leader cloud84 ... indexing halts of course as 
there are no running servers
5) start the replica cloud85 but do not start the previous leader cloud84

Here are some key log messages as cloud85 - the replica - fires up ... my 
annotations of the log messages are prefixed by [TJP >>

2013-12-11 11:43:22,076 [main-EventThread] INFO  common.cloud.ZkStateReader  - 
A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 1)
2013-12-11 11:43:23,370 [coreLoadExecutor-3-thread-1] INFO  
solr.cloud.ShardLeaderElectionContext  - Waiting until we see more replicas up 
for shard shard1: total=2 found=1 timeoutin=139841

[TJP >> This looks good and is expected because cloud85 was not the leader 
before it died, so it should not immediately assume it is the leader until it 
sees more replicas

6) now start the previous leader cloud84 ...

Here are some key log messages from cloud85 as the previous leader cloud84 is 
coming up ... 

2013-12-11 11:43:24,085 [main-EventThread] INFO  common.cloud.ZkStateReader  - 
Updating live nodes... (2)
2013-12-11 11:43:24,136 [main-EventThread] INFO  solr.cloud.DistributedQueue  - 
LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type 
NodeChildrenChanged
2013-12-11 11:43:24,137 [Thread-13] INFO  common.cloud.ZkStateReader  - 
Updating cloud state from ZooKeeper... 
2013-12-11 11:43:24,138 [Thread-13] INFO  solr.cloud.Overseer  - Update state 
numShards=1 message={
  "operation":"state",
  "state":"down",
  "base_url":"http://cloud84:8984/solr";,
  "core":"cloud_shard1_replica2",
  "roles":null,
  "node_name":"cloud84:8984_solr",
  "shard":"shard1",
  "shard_range":null,
  "shard_state":"active",
  "shard_parent":null,
  "collection":"cloud",
  "numShards":"1",
  "core_node_name":"core_node1"}

[TJP >> state of cloud84 looks correct as it is still initializing ...

2013-12-11 11:43:24,140 [main-EventThread] INFO  solr.cloud.DistributedQueue  - 
LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type 
NodeChildrenChanged
2013-12-11 11:43:24,141 [main-EventThread] INFO  common.cloud.ZkStateReader  - 
A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 2)

2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO  
solr.cloud.ShardLeaderElectionContext  - Enough replicas found to continue.

[TJP >> hmmm ... cloud84 is listed in /live_nodes but it isn't "active" yet or 
even recovering (see state above - it's currently "down") ... My thinking here 
is that the ShardLeaderElectionContext needs to take the state of the replica 
into account before deciding it should continue.


2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO  
solr.cloud.ShardLeaderElectionContext  - I may be the new leader - try and sync
2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO  
solr.cloud.SyncStrategy  - Sync replicas to 
http://cloud85:8985/solr/cloud_shard1_replica1/
2013-12-11 11:43:25,880 [coreLoadExecutor-3-thread-1] INFO  
solr.update.PeerSync  - PeerSync: core=cloud_shard1_replica1 
url=http://cloud85:8985/solr START 
replicas=[http://cloud84:8984/solr/cloud_shard1_replica2/] nUpdates=100
2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] WARN  
solr.update.PeerSync  - PeerSync: core=cloud_shard1_replica1 
url=http://cloud85:8985/solr  couldn't connect to 
http://cloud84:8984/solr/cloud_shard1_replica2/, counting as success


[TJP >> whoops! of course it couldn't connect to cloud84 as it's still 
initializing ...


2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] INFO  
solr.update.PeerSync  - PeerSync: core=cloud_shard1_re

[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-11 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845459#comment-13845459
 ] 

Timothy Potter commented on SOLR-4260:
--

I have some cycles to work on this issue over the next couple of days. I'm 
starting by trying to reproduce it in my environment. Please let me know of any 
tasks that I can help out on (beyond the long wait stuff you mentioned above). 

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841896#comment-13841896
 ] 

Mark Miller commented on SOLR-4260:
---

I've fixed some things since 4.6 - I only had time to focus on the
leader-not-going-down case for 4.6; I spent a bunch more time on this case
after 4.6 was released. Unfortunately, I think there are a couple of issues at
play here - some of the new changes make existing holes easier to spot, and the
chaos monkey tests were accidentally disabled for some time, so small issues
may have crept in.

I *think* the remaining issue is mostly around SOLR-5516. Need to come up with 
a better idea than a really long wait though - but if someone wants to help 
test, putting in a long wait and stressing this would be useful to see if it is 
indeed the main remaining issue.

I recently put in a lot of time improving the situation and I need to focus on
other things for a bit, but I'll keep coming back to this as I can.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-06 Thread Yago Riveiro (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841870#comment-13841870
 ] 

Yago Riveiro commented on SOLR-4260:


Replicas are still losing docs in Solr 4.6 :(.

I'm wondering if we can't have a pair (version, numDocs) to track the
increments of docs between versions. We could also save the last 10 tlogs in
each replica as backups after they are committed and, when replicas go out of
sync, diff them to see what is missing and replay just those transactions -
avoiding an unsynchronized replica and a full recovery, which would probably be
heavier than doing the diff.

It's only an idea, and of course finding the bug must be the priority.

This issue compromises Solr as "the main" storage. If re-indexing the data is
not possible, we can't guarantee that no data is missing, and worse, we lose
the data forever :(.
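
Until something like that exists in Solr itself, a rough manual approximation
is to snapshot numDocs and the index version per core with the CoreAdmin STATUS
call and diff the output between nodes; the two hosts below simply reuse the
cloud84/cloud85 pair from the test setup earlier in this thread:

for url in http://cloud84:8984/solr http://cloud85:8985/solr; do
  echo "== $url"
  curl -s "$url/admin/cores?action=STATUS&wt=json&indent=true" | grep -E '"numDocs"|"version"'
done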


> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-06 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841162#comment-13841162
 ] 

Markus Jelsma commented on SOLR-4260:
-

I'm sorry, I've got three replicas having one document less than the leader.
We're on a December 3rd build.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-12-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836414#comment-13836414
 ] 

Markus Jelsma commented on SOLR-4260:
-

I'll check it out!

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-11-30 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835883#comment-13835883
 ] 

Mark Miller commented on SOLR-4260:
---

[~markus17], hopefully that's SOLR-5516 then.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-11-29 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835436#comment-13835436
 ] 

Mark Miller commented on SOLR-4260:
---

What's the exact version / checkout?

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-11-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835368#comment-13835368
 ] 

Rafał Kuć commented on SOLR-4260:
-

Happened to me too, with a collection with four shards, each having a single
replica. The replicas were out of sync.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-11-28 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834760#comment-13834760
 ] 

Markus Jelsma commented on SOLR-4260:
-

I've got some bad news: it happened again on one of our clusters using a build
of November 19th. Three replicas went out of sync.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-11-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826535#comment-13826535
 ] 

Mark Miller commented on SOLR-4260:
---

Should probably bring it up on the user list - we need someone like [~rcmuir] 
to weigh in. I assume it all works the same way - you merge each field to the 
default impl and then back to what they were.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-11-19 Thread Yago Riveiro (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826533#comment-13826533
 ] 

Yago Riveiro commented on SOLR-4260:


I'm using per-field DocValues formats in my schema.

I think this aspect of docValues isn't explained properly on the wiki. There is
no example of how we can switch to the default implementation, do the
forceMerge, and switch back to the original implementation.

If I can't be sure that everything will work fine, I can't do the upgrade.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-11-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826529#comment-13826529
 ] 

Mark Miller commented on SOLR-4260:
---

According to the wiki, it depends on the doc values impl you are using - the 
default one will upgrade fine. Others require that you forceMerge your index to 
rewrite it with the default and then upgrade, then I guess you can forceMerge 
back to that impl. Honestly, I have not had a chance to play with doc values 
yet though.
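
As a sketch of that forceMerge step (host, port, and core name below are just
placeholders), the optimize flag on the update handler rewrites the whole index
with whatever codec is currently configured:

curl "http://localhost:8983/solr/collection1/update?optimize=true&maxSegments=1"

Per the procedure described above, you'd run it once with the default
docValuesFormat in place before upgrading, then again after switching the
schema back to the custom format.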

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-11-19 Thread Yago Riveiro (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826356#comment-13826356
 ] 

Yago Riveiro commented on SOLR-4260:


Is it safe to upgrade from 4.5.1 to 4.6? I have docValues and I read that the
upgrade is not straightforward in that case, and I can't reindex the data.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-11-19 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826350#comment-13826350
 ] 

Markus Jelsma commented on SOLR-4260:
-

I updated our machines to include SOLR-5397. Everything works fine now, though
it may take quite some time before we can say it is fixed :)

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Assignee: Mark Miller
>Priority: Critical
> Fix For: 5.0
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-11-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825107#comment-13825107
 ] 

Mark Miller commented on SOLR-4260:
---

Would love if you guys could try with 4.6 and report back. SOLR-5397 was 
introduced when we fixed a similar issue, so that has really been an issue for 
a few releases.



> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Priority: Critical
> Fix For: 5.0
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in then number of documents. The leader and slave deviate 
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention, there were small IDF differences for exactly the same record 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch all queries also return different 
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-11-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824921#comment-13824921
 ] 

Mark Miller commented on SOLR-4260:
---

This could be related to SOLR-5397.

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Priority: Critical
> Fix For: 5.0
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in the number of documents. The leader and replica deviate 
> by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention: there were small IDF differences for exactly the same record, 
> causing it to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch-all queries also return different 
> numDocs counts.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-11-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824299#comment-13824299
 ] 

Mark Miller commented on SOLR-4260:
---

Right - but that's just the impl, not the design. The idea is that, since we add 
locally first, there is not much reason the update should fail on a replica - unless 
that replica has crashed, lost connectivity, or hit something equally bad. In that 
case it will have to reconnect to ZooKeeper and recover, or restart and recover. Just 
in case, as a precaution, we try to tell it to recover - then, if it still has 
connectivity or the problem was intermittent, it won't run around acting 
active. I think I have a note about perhaps doing more retries in background 
threads for that recovery request, but I've never gotten to it.

If you are finding a scenario that eludes that, we should strengthen the impl.
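
Not Solr's actual implementation, but to make that "more retries in background 
threads" note concrete, here is a rough sketch of a bounded background retry for the 
recovery request. requestRecovery() is a hypothetical stand-in for whatever admin 
call asks a replica to enter recovery:

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RecoveryNudger {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  public void nudgeReplicaToRecover(final String replicaUrl) {
    executor.submit(() -> {
      // A few bounded retries with backoff instead of giving up after one try.
      for (int attempt = 1; attempt <= 5; attempt++) {
        try {
          requestRecovery(replicaUrl);   // hypothetical admin call
          return;                        // replica acknowledged, we're done
        } catch (Exception e) {
          System.err.println("recovery request to " + replicaUrl
              + " failed (attempt " + attempt + "): " + e.getMessage());
          try {
            TimeUnit.SECONDS.sleep(attempt * 2L);  // simple linear backoff
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return;
          }
        }
      }
      // After exhausting retries the replica is presumably down or partitioned;
      // at that point ZooKeeper session expiry should force it into recovery anyway.
    });
  }

  private void requestRecovery(String replicaUrl) throws Exception {
    // Placeholder: send the core admin "request recovery" command here.
    throw new UnsupportedOperationException("stub");
  }
}
{code}

The point is only that a transient blip on the recovery request itself would get a 
few more chances before the leader gives up and relies on ZooKeeper to catch the bad 
replica.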

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Priority: Critical
> Fix For: 5.0
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in the number of documents. The leader and replica deviate 
> by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention: there were small IDF differences for exactly the same record, 
> causing it to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch-all queries also return different 
> numDocs counts.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4260) Inconsistent numDocs between leader and replica

2013-11-15 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824295#comment-13824295
 ] 

Jessica Cheng commented on SOLR-4260:
-

{quote}
This shouldn't be the case, because those updates will only have been ack'd if 
each replica received them.
{quote}

That's what I thought too, but it doesn't seem to be the case in the code. If you 
take a look at DistributedUpdateProcessor.doFinish():

{code}
// if its a forward, any fail is a problem - 
// otherwise we assume things are fine if we got it locally
// until we start allowing min replication param
if (errors.size() > 0) {
  // if one node is a RetryNode, this was a forward request
  if (errors.get(0).req.node instanceof RetryNode) {
    rsp.setException(errors.get(0).e);
  }
  // else
  // for now we don't error - we assume if it was added locally, we
  // succeeded
}
{code}

It then starts a thread to urge the replica to recover, but if that fails, it 
just completely gives up.
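
To make the alternative concrete, here is a tiny sketch (not Solr code; all names 
are made up) of the kind of "min replication" check that commented-out branch 
alludes to: count how many nodes actually applied the update and surface a failure 
to the client when too few did, instead of assuming success because the leader 
applied it locally:

{code}
class MinReplicationCheck {
  /**
   * totalNodes  = leader plus the replicas the update was sent to
   * failedNodes = replicas whose update request errored
   * minReplicas = how many copies the client requires before success is reported
   */
  static void assertMinReplication(int totalNodes, int failedNodes, int minReplicas) {
    int acked = totalNodes - failedNodes;   // the leader counts as one ack
    if (acked < minReplicas) {
      throw new RuntimeException("update acknowledged by only " + acked + " of "
          + totalNodes + " nodes (required " + minReplicas + ")");
    }
  }
}
{code}

If I remember right, later Solr releases added a min_rf request parameter along 
these lines, which reports the achieved replication factor back to the client so it 
can retry, though it does not undo the update on the nodes that did apply it.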

> Inconsistent numDocs between leader and replica
> ---
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
> Environment: 5.0.0.2013.01.04.15.31.51
>Reporter: Markus Jelsma
>Priority: Critical
> Fix For: 5.0
>
> Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in the number of documents. The leader and replica deviate 
> by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention: there were small IDF differences for exactly the same record, 
> causing it to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch-all queries also return different 
> numDocs counts.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


