[jira] [Commented] (SOLR-8152) Overseer Task Processor/Queue can miss responses, leading to timeouts

2015-10-13 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955911#comment-14955911
 ] 

ASF subversion and git services commented on SOLR-8152:
---

Commit 1708539 from gcha...@apache.org in branch 'dev/trunk'
[ https://svn.apache.org/r1708539 ]

SOLR-8152: Overseer Task Processor/Queue can miss responses, leading to timeouts

> Overseer Task Processor/Queue can miss responses, leading to timeouts
> -
>
> Key: SOLR-8152
> URL: https://issues.apache.org/jira/browse/SOLR-8152
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Reporter: Gregory Chanan
>Assignee: Gregory Chanan
> Fix For: 5.4, Trunk
>
> Attachments: SOLR-8152.patch
>
>
> I noticed some jenkins reports of timeouts in the 
> TestConfigSetsAPIExclusivityTest, which seemed strange given the amount of 
> work to be done is small and the timeout generous at 300 seconds.
> I added some statistics gathering and started beasting the test and sure 
> enough, some tests reported tasks taking slightly more than 300 seconds, 
> while most tests ran with a maximum task run of less than a second.  This 
> suggested something was hanging until the timeout.
> Some investigation lead to this code:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L179-L194
> There appears to be a few issues here:
> {code}
>  String path = createData(dir + "/" + PREFIX, data,
>   CreateMode.PERSISTENT_SEQUENTIAL);
>   String watchID = createData(
>   dir + "/" + response_prefix + path.substring(path.lastIndexOf("-") 
> + 1),
>   null, CreateMode.EPHEMERAL);
>   Object lock = new Object();
>   LatchWatcher watcher = new LatchWatcher(lock);
>   synchronized (lock) {
> if (zookeeper.exists(watchID, watcher, true) != null) {
>   watcher.await(timeout);
> }
>   }
> {code}
> For one, the request object is created before the response object.  If the 
> request is quickly picked up and processed, two things can happen:
> 1) The response is written before the watch is set, which means we wait until 
> the timeout even though the response is ready.  This will still pass the test 
> because the response is available, the client will just wait needlessly.
> 2) The response is attempted to be written before the response node is even 
> created.  The fact that the response node doesn't exist is ignored:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L92-L94
> In this case, the task is processed but the client will actually see a 
> failure because there is no response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8152) Overseer Task Processor/Queue can miss responses, leading to timeouts

2015-10-13 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955908#comment-14955908
 ] 

ASF subversion and git services commented on SOLR-8152:
---

Commit 1708538 from gcha...@apache.org in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1708538 ]

SOLR-8152: Overseer Task Processor/Queue can miss responses, leading to timeouts

> Overseer Task Processor/Queue can miss responses, leading to timeouts
> -
>
> Key: SOLR-8152
> URL: https://issues.apache.org/jira/browse/SOLR-8152
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Reporter: Gregory Chanan
>Assignee: Gregory Chanan
> Attachments: SOLR-8152.patch
>
>
> I noticed some jenkins reports of timeouts in the 
> TestConfigSetsAPIExclusivityTest, which seemed strange given the amount of 
> work to be done is small and the timeout generous at 300 seconds.
> I added some statistics gathering and started beasting the test and sure 
> enough, some tests reported tasks taking slightly more than 300 seconds, 
> while most tests ran with a maximum task run of less than a second.  This 
> suggested something was hanging until the timeout.
> Some investigation lead to this code:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L179-L194
> There appears to be a few issues here:
> {code}
>  String path = createData(dir + "/" + PREFIX, data,
>   CreateMode.PERSISTENT_SEQUENTIAL);
>   String watchID = createData(
>   dir + "/" + response_prefix + path.substring(path.lastIndexOf("-") 
> + 1),
>   null, CreateMode.EPHEMERAL);
>   Object lock = new Object();
>   LatchWatcher watcher = new LatchWatcher(lock);
>   synchronized (lock) {
> if (zookeeper.exists(watchID, watcher, true) != null) {
>   watcher.await(timeout);
> }
>   }
> {code}
> For one, the request object is created before the response object.  If the 
> request is quickly picked up and processed, two things can happen:
> 1) The response is written before the watch is set, which means we wait until 
> the timeout even though the response is ready.  This will still pass the test 
> because the response is available, the client will just wait needlessly.
> 2) The response is attempted to be written before the response node is even 
> created.  The fact that the response node doesn't exist is ignored:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L92-L94
> In this case, the task is processed but the client will actually see a 
> failure because there is no response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8152) Overseer Task Processor/Queue can miss responses, leading to timeouts

2015-10-10 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952153#comment-14952153
 ] 

Shalin Shekhar Mangar commented on SOLR-8152:
-

Okay I understand now. So we first create the response node as an 
EPHEMERAL_SEQUENTIAL and then use its sequence ID to create the persistent 
request node. Sounds good to me. Thanks for explaining.

> Overseer Task Processor/Queue can miss responses, leading to timeouts
> -
>
> Key: SOLR-8152
> URL: https://issues.apache.org/jira/browse/SOLR-8152
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Reporter: Gregory Chanan
>Assignee: Gregory Chanan
> Attachments: SOLR-8152.patch
>
>
> I noticed some jenkins reports of timeouts in the 
> TestConfigSetsAPIExclusivityTest, which seemed strange given the amount of 
> work to be done is small and the timeout generous at 300 seconds.
> I added some statistics gathering and started beasting the test and sure 
> enough, some tests reported tasks taking slightly more than 300 seconds, 
> while most tests ran with a maximum task run of less than a second.  This 
> suggested something was hanging until the timeout.
> Some investigation lead to this code:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L179-L194
> There appears to be a few issues here:
> {code}
>  String path = createData(dir + "/" + PREFIX, data,
>   CreateMode.PERSISTENT_SEQUENTIAL);
>   String watchID = createData(
>   dir + "/" + response_prefix + path.substring(path.lastIndexOf("-") 
> + 1),
>   null, CreateMode.EPHEMERAL);
>   Object lock = new Object();
>   LatchWatcher watcher = new LatchWatcher(lock);
>   synchronized (lock) {
> if (zookeeper.exists(watchID, watcher, true) != null) {
>   watcher.await(timeout);
> }
>   }
> {code}
> For one, the request object is created before the response object.  If the 
> request is quickly picked up and processed, two things can happen:
> 1) The response is written before the watch is set, which means we wait until 
> the timeout even though the response is ready.  This will still pass the test 
> because the response is available, the client will just wait needlessly.
> 2) The response is attempted to be written before the response node is even 
> created.  The fact that the response node doesn't exist is ignored:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L92-L94
> In this case, the task is processed but the client will actually see a 
> failure because there is no response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8152) Overseer Task Processor/Queue can miss responses, leading to timeouts

2015-10-10 Thread Gregory Chanan (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952126#comment-14952126
 ] 

Gregory Chanan commented on SOLR-8152:
--

Thanks for taking a look, Shalin.

bq. Your patch changes the response node from EPHEMERAL to 
EPHEMERAL_SEQUENTIAL. Was that intentional?

Yes, see below.

bq.  Another option could be to create both nodes together in a 'multi' 
operation but this approach also looks fine.

I considered using the 'multi' operation, but it doesn't appear trivial.  The 
issue is that the request and response node share a suffix as determined by 
ZooKeeper's SEQUENTIAL, but you don't know that until you actually create the 
node.  So, to use multi to create the request and response znodes together, 
you'd have to do something like first create a SEQUENTIAL node, note the 
suffix, then use multi to create the request and response nodes together.  This 
seemed to lead to more of a tracking / cleanup issue than not using multi.  The 
fact that the response is EPHEMERAL_SEQUENTIAL rather than EPHEMERAL is because 
it is now created first (so we don't miss the response).

> Overseer Task Processor/Queue can miss responses, leading to timeouts
> -
>
> Key: SOLR-8152
> URL: https://issues.apache.org/jira/browse/SOLR-8152
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Reporter: Gregory Chanan
>Assignee: Gregory Chanan
> Attachments: SOLR-8152.patch
>
>
> I noticed some jenkins reports of timeouts in the 
> TestConfigSetsAPIExclusivityTest, which seemed strange given the amount of 
> work to be done is small and the timeout generous at 300 seconds.
> I added some statistics gathering and started beasting the test and sure 
> enough, some tests reported tasks taking slightly more than 300 seconds, 
> while most tests ran with a maximum task run of less than a second.  This 
> suggested something was hanging until the timeout.
> Some investigation lead to this code:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L179-L194
> There appears to be a few issues here:
> {code}
>  String path = createData(dir + "/" + PREFIX, data,
>   CreateMode.PERSISTENT_SEQUENTIAL);
>   String watchID = createData(
>   dir + "/" + response_prefix + path.substring(path.lastIndexOf("-") 
> + 1),
>   null, CreateMode.EPHEMERAL);
>   Object lock = new Object();
>   LatchWatcher watcher = new LatchWatcher(lock);
>   synchronized (lock) {
> if (zookeeper.exists(watchID, watcher, true) != null) {
>   watcher.await(timeout);
> }
>   }
> {code}
> For one, the request object is created before the response object.  If the 
> request is quickly picked up and processed, two things can happen:
> 1) The response is written before the watch is set, which means we wait until 
> the timeout even though the response is ready.  This will still pass the test 
> because the response is available, the client will just wait needlessly.
> 2) The response is attempted to be written before the response node is even 
> created.  The fact that the response node doesn't exist is ignored:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L92-L94
> In this case, the task is processed but the client will actually see a 
> failure because there is no response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8152) Overseer Task Processor/Queue can miss responses, leading to timeouts

2015-10-09 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951608#comment-14951608
 ] 

Shalin Shekhar Mangar commented on SOLR-8152:
-

Thanks Gregory, good catch! Your patch changes the response node from EPHEMERAL 
to EPHEMERAL_SEQUENTIAL. Was that intentional?

Another option could be to create both nodes together in a 'multi' operation 
but this approach also looks fine.

> Overseer Task Processor/Queue can miss responses, leading to timeouts
> -
>
> Key: SOLR-8152
> URL: https://issues.apache.org/jira/browse/SOLR-8152
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Reporter: Gregory Chanan
>Assignee: Gregory Chanan
> Attachments: SOLR-8152.patch
>
>
> I noticed some jenkins reports of timeouts in the 
> TestConfigSetsAPIExclusivityTest, which seemed strange given the amount of 
> work to be done is small and the timeout generous at 300 seconds.
> I added some statistics gathering and started beasting the test and sure 
> enough, some tests reported tasks taking slightly more than 300 seconds, 
> while most tests ran with a maximum task run of less than a second.  This 
> suggested something was hanging until the timeout.
> Some investigation lead to this code:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L179-L194
> There appears to be a few issues here:
> {code}
>  String path = createData(dir + "/" + PREFIX, data,
>   CreateMode.PERSISTENT_SEQUENTIAL);
>   String watchID = createData(
>   dir + "/" + response_prefix + path.substring(path.lastIndexOf("-") 
> + 1),
>   null, CreateMode.EPHEMERAL);
>   Object lock = new Object();
>   LatchWatcher watcher = new LatchWatcher(lock);
>   synchronized (lock) {
> if (zookeeper.exists(watchID, watcher, true) != null) {
>   watcher.await(timeout);
> }
>   }
> {code}
> For one, the request object is created before the response object.  If the 
> request is quickly picked up and processed, two things can happen:
> 1) The response is written before the watch is set, which means we wait until 
> the timeout even though the response is ready.  This will still pass the test 
> because the response is available, the client will just wait needlessly.
> 2) The response is attempted to be written before the response node is even 
> created.  The fact that the response node doesn't exist is ignored:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L92-L94
> In this case, the task is processed but the client will actually see a 
> failure because there is no response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8152) Overseer Task Processor/Queue can miss responses, leading to timeouts

2015-10-09 Thread Gregory Chanan (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951446#comment-14951446
 ] 

Gregory Chanan commented on SOLR-8152:
--

There are a few ways to solve this, the most straightforward of which appears 
to be:
1) Create the response node first using SEQUENTIAL (to generate the path to the 
request node)
2) Watch the response node, so we can't possible miss the response (because the 
request node isn't even created yet)
3) Create the request mode

At this point, before we wait, we check that the watch didn't already fire 
(otherwise we will wait unnecessarily).

> Overseer Task Processor/Queue can miss responses, leading to timeouts
> -
>
> Key: SOLR-8152
> URL: https://issues.apache.org/jira/browse/SOLR-8152
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Reporter: Gregory Chanan
>Assignee: Gregory Chanan
>
> I noticed some jenkins reports of timeouts in the 
> TestConfigSetsAPIExclusivityTest, which seemed strange given the amount of 
> work to be done is small and the timeout generous at 300 seconds.
> I added some statistics gathering and started beasting the test and sure 
> enough, some tests reported tasks taking slightly more than 300 seconds, 
> while most tests ran with a maximum task run of less than a second.  This 
> suggested something was hanging until the timeout.
> Some investigation lead to this code:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L179-L194
> There appears to be a few issues here:
> {code}
>  String path = createData(dir + "/" + PREFIX, data,
>   CreateMode.PERSISTENT_SEQUENTIAL);
>   String watchID = createData(
>   dir + "/" + response_prefix + path.substring(path.lastIndexOf("-") 
> + 1),
>   null, CreateMode.EPHEMERAL);
>   Object lock = new Object();
>   LatchWatcher watcher = new LatchWatcher(lock);
>   synchronized (lock) {
> if (zookeeper.exists(watchID, watcher, true) != null) {
>   watcher.await(timeout);
> }
>   }
> {code}
> For one, the request object is created before the response object.  If the 
> request is quickly picked up and processed, two things can happen:
> 1) The response is written before the watch is set, which means we wait until 
> the timeout even though the response is ready.  This will still pass the test 
> because the response is available, the client will just wait needlessly.
> 2) The response is attempted to be written before the response node is even 
> created.  The fact that the response node doesn't exist is ignored:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L92-L94
> In this case, the task is processed but the client will actually see a 
> failure because there is no response.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org