[jira] [Commented] (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012407#comment-13012407 ]

Vadim Kisselmann commented on SOLR-1144:

After a few tests, I think I've located the problem. It's probably the Solr caches. If I deactivate the caches in solrconfig.xml, replication works fine. But if any of them is active, replication slows down. Disabling the caches isn't an option for me, since query times get far too long.

replication hang
Key: SOLR-1144
URL: https://issues.apache.org/jira/browse/SOLR-1144
Project: Solr
Issue Type: Bug
Reporter: Yonik Seeley
Assignee: Noble Paul
Fix For: 1.4
Attachments: stacktrace-master.txt, stacktrace-slave-1.txt, stacktrace-slave-2.txt

It seems that replication can sometimes hang. http://www.lucidimagination.com/search/document/403305a3fda18599

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012047#comment-13012047 ]

Vadim Kisselmann commented on SOLR-1144:

I have Solr running on one master and two slaves (load balanced) using Solr 1.4.1 native replication. When the load is low, both slaves replicate from the master at around 100MB/s. After a couple of hours, replication slows down to 100KB/s. So the problem is still there. I tested it with both Jetty and Tomcat. It looks like aggressive JVM options can delay the problem, but it shows up eventually anyway. My index is about 100GB; I use 10GB for the JVM, 24GB total. The slaves poll every 5 minutes.
[jira] Commented: (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884644#action_12884644 ]

Toby Cole commented on SOLR-1144:

Just over a year since it was first spotted, I'm consistently getting the same symptoms as this bug. We've got a single master with two slaves polling it, and both slaves have stalled at exactly the same point in the replication. Here's the relevant section of the replication handler's 'details' response:

Node A
{code:xml}
<str name="numFilesDownloaded">18</str>
<str name="replicationStartTime">Fri Jul 02 10:40:00 BST 2010</str>
<str name="timeElapsed">6683s</str>
<str name="currentFile">_9du.prx</str>
<str name="currentFileSize">8.17 MB</str>
<str name="currentFileSizeDownloaded">8.17 MB</str>
<str name="currentFileSizePercent">100.0</str>
<str name="bytesDownloaded">40.55 MB</str>
<str name="totalPercent">0.0</str>
<str name="timeRemaining">8290722s</str>
<str name="downloadSpeed">6.21 KB</str>
{code}

Node B
{code:xml}
<str name="numFilesDownloaded">18</str>
<str name="replicationStartTime">Fri Jul 02 10:40:00 BST 2010</str>
<str name="timeElapsed">6752s</str>
<str name="currentFile">_9du.prx</str>
<str name="currentFileSize">8.17 MB</str>
<str name="currentFileSizeDownloaded">8.17 MB</str>
<str name="currentFileSizePercent">100.0</str>
<str name="bytesDownloaded">40.55 MB</str>
<str name="totalPercent">0.0</str>
<str name="timeRemaining">8376322s</str>
<str name="downloadSpeed">6.15 KB</str>
{code}
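Editorial note: the figures in the two 'details' responses above are consistent with downloadSpeed being an average over the whole run (bytesDownloaded divided by timeElapsed) rather than a current rate, which would explain why it keeps shrinking while the transfer sits still. The following Java sketch reproduces that arithmetic and shows one possible stall check. The class and method names are hypothetical and are not part of Solr; the stall rule (no new bytes between two polls of the same slave) is an assumption, not something the replication handler does.

```java
import java.util.Locale;

final class ReplicationStallCheck {

    /** Average speed in KB/s, assuming downloadSpeed = bytesDownloaded / timeElapsed. */
    static double averageKbPerSec(double megabytesDownloaded, long elapsedSeconds) {
        return megabytesDownloaded * 1024.0 / elapsedSeconds;
    }

    /** Hypothetical stall rule: time has passed but no new bytes have arrived. */
    static boolean isStalled(long bytesAtFirstPoll, long bytesAtSecondPoll, long secondsBetweenPolls) {
        return secondsBetweenPolls > 0 && bytesAtSecondPoll == bytesAtFirstPoll;
    }

    public static void main(String[] args) {
        // Node A: 40.55 MB after 6683 s; Node B: 40.55 MB after 6752 s.
        System.out.println(String.format(Locale.ROOT, "%.2f KB/s", averageKbPerSec(40.55, 6683)));
        System.out.println(String.format(Locale.ROOT, "%.2f KB/s", averageKbPerSec(40.55, 6752)));
        // Same byte count on two polls taken 60 s apart would count as a stall.
        System.out.println(isStalled(42_520_000L, 42_520_000L, 60));
    }
}
```

The two computed averages round to the 6.21 KB and 6.15 KB values reported by the slaves, so a monitoring script would probably do better to compare bytesDownloaded across polls than to rely on downloadSpeed.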
[jira] Commented: (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884690#action_12884690 ]

Yonik Seeley commented on SOLR-1144:

Thanks for the stack traces Toby! Interesting... seems like the commit in the slave blocked...
{code}
at org.apache.solr.common.util.ConcurrentLRUCache.getLatestAccessedItems(ConcurrentLRUCache.java:276)
{code}
So perhaps another thread locked, but didn't unlock the lock? SOLR-1538 did fix something that could possibly lead to a deadlock, but it's super unlikely (a very small object allocation would have to fail at just the right spot). Still, if this is easy enough to reproduce, could you try Solr 1.4.1 and see if it's fixed? (And if it hangs again, be sure to get stack traces... they are super helpful!)
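Editorial note: the "locked, but didn't unlock" scenario described above can be illustrated with a small Java sketch. This is not the ConcurrentLRUCache code and not the SOLR-1538 patch; the class and method names are hypothetical, and the simulated failure stands in for whatever might go wrong between acquiring and releasing a lock. It only shows why releasing in a finally block matters.

```java
import java.util.concurrent.locks.ReentrantLock;

final class SweepLockSketch {
    private final ReentrantLock sweepLock = new ReentrantLock();

    /** Unsafe pattern: if work throws, unlock() is never reached. */
    void sweepUnsafe(Runnable work) {
        sweepLock.lock();
        work.run();
        sweepLock.unlock();
    }

    /** Safe pattern: the lock is released even when work throws. */
    void sweepSafe(Runnable work) {
        sweepLock.lock();
        try {
            work.run();
        } finally {
            sweepLock.unlock();
        }
    }

    /** True if any thread currently holds the lock. */
    boolean isLocked() {
        return sweepLock.isLocked();
    }

    public static void main(String[] args) {
        Runnable failing = () -> { throw new IllegalStateException("simulated failure"); };

        SweepLockSketch safe = new SweepLockSketch();
        try { safe.sweepSafe(failing); } catch (IllegalStateException expected) { }
        System.out.println("safe pattern leaves lock held: " + safe.isLocked());

        SweepLockSketch unsafe = new SweepLockSketch();
        try { unsafe.sweepUnsafe(failing); } catch (IllegalStateException expected) { }
        System.out.println("unsafe pattern leaves lock held: " + unsafe.isLocked());
    }
}
```

With the unsafe pattern the lock stays held after the failure, so any other thread that later calls lock() would wait forever, which is the kind of hang a thread dump would show as a blocked commit.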
[jira] Commented: (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884693#action_12884693 ]

Toby Cole commented on SOLR-1144:

Oh yes, should have mentioned... we're already on Solr 1.4.1 in production as of yesterday (we don't hang about y'know ;) ).
[jira] Commented: (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884717#action_12884717 ]

Yonik Seeley commented on SOLR-1144:

The odd thing is that the line numbers in the stack traces don't match up for either 1.4.0 or 1.4.1. Specifically, ConcurrentLRUCache.java:276 is in the middle of markAndSweep() in both versions (as opposed to getLatestAccessedItems(), which your stack trace would suggest). Are these stack traces from stock 1.4.0 or 1.4.1? If so, does anyone have a clue why the line numbers would be off?
[jira] Commented: (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884719#action_12884719 ]

Toby Cole commented on SOLR-1144:

I know exactly why the line numbers would be off. I just remembered we're using a custom war package so we can add our own plugins (yes, I know we can use solr.home/lib, but we've not got round to that yet). The only classes we're overriding from Solr are ConcurrentLRUCache and FastLRUCache. This dates from before Solr 1.4, when the cache implementations were slowing faceting right down. I have a feeling that if I remove those overridden classes and use the new (bug-free) ones, the hang may stop. I'll give it a go now; sorry in advance if it was my oversight that caused this bug to reappear.

T
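Editorial note: when stack-trace line numbers don't match the released source, as in the exchange above, one quick check is to ask the JVM where a class was actually loaded from. The Java sketch below uses only standard JDK calls; the class and method names are hypothetical. In a running Solr webapp one would pass the suspect class (for example ConcurrentLRUCache) instead of the placeholder classes used here.

```java
import java.security.CodeSource;

final class ClassOriginSketch {

    /** Returns the jar or directory a class was loaded from, if the JVM records one. */
    static String originOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader (e.g. java.lang.String) have no code source.
        return source == null ? "(bootstrap or unknown)" : String.valueOf(source.getLocation());
    }

    public static void main(String[] args) {
        System.out.println("String loaded from: " + originOf(String.class));
        System.out.println("This class loaded from: " + originOf(ClassOriginSketch.class));
    }
}
```

If the reported location is the custom war rather than the stock Solr jar, the overridden class is the one in use.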
[jira] Commented: (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707405#action_12707405 ]

Yonik Seeley commented on SOLR-1144:

bq. ReplicationHandler does not cause the hang on the master.

The slave is waiting forever, but it *could* be due to a bug on either the master or the slave, and it could be due to the replication handler. It could also be another Solr bug somewhere, or it could be a Tomcat bug. What is apparent is that since there is no replication stack trace on the master, it thinks it finished the file send (either that or got an exception), but the slave is still expecting more for some reason. Perhaps if we used non-persistent connections for replication, the master would close the connection when it thought it had sent everything?
[jira] Commented: (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707421#action_12707421 ]

Noble Paul commented on SOLR-1144:

The master closes the connection once everything is written. If the download of a file is complete, the slave also closes the stream. The fact that the slave continued to wait means the file has not been downloaded completely.
[jira] Commented: (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707425#action_12707425 ]

Yonik Seeley commented on SOLR-1144:

bq. The master closes the connection if everything is written.

Hmmm, that doesn't jibe with the slave hanging on a read though... it seems like the only way read() should block is if there is currently no more data to read and the socket is still open.
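Editorial note: the point about read() above can be demonstrated with plain java.net sockets. The sketch below is not the SnapPuller code and does not show how SOLR-1096 implemented its timeout; the class and method names are hypothetical. It opens a loopback connection, has the "master" side send a few bytes, and then either closes the connection or leaves it open. With a read timeout set via Socket.setSoTimeout, the "slave" side sees end-of-stream in the first case and a SocketTimeoutException in the second, instead of blocking indefinitely.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class ReadTimeoutSketch {

    /** Returns "eof" if the sender closed the connection, "timeout" if read() ran out of time. */
    static String readUntilEofOrTimeout(boolean senderCloses, int timeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket receiver = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket sender = server.accept()) {

            // Without this, read() would block for as long as the socket stays open.
            receiver.setSoTimeout(timeoutMillis);

            OutputStream out = sender.getOutputStream();
            out.write(new byte[] {1, 2, 3});
            out.flush();
            if (senderCloses) {
                sender.close(); // the receiver will see end-of-stream after the 3 bytes
            }

            InputStream in = receiver.getInputStream();
            byte[] buffer = new byte[16];
            try {
                while (in.read(buffer) != -1) {
                    // keep draining until end-of-stream
                }
                return "eof";
            } catch (SocketTimeoutException e) {
                return "timeout";
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("sender closes:     " + readUntilEofOrTimeout(true, 200));
        System.out.println("sender stays open: " + readUntilEofOrTimeout(false, 200));
    }
}
```

A timeout turns a silent hang into a visible failure, but as noted elsewhere in this thread it would not by itself explain why the receiver expects more data than the sender delivered.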
[jira] Commented: (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706868#action_12706868 ]

Noble Paul commented on SOLR-1144:

ReplicationHandler does not cause the hang on the master. On the slave, the SnapPuller was waiting forever, which I hope was fixed by SOLR-1096.
[jira] Commented: (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706199#action_12706199 ]

Yonik Seeley commented on SOLR-1144:

Hmmm, I had trouble finding SOLR-1096 before. But it looks like it was used mainly for adding a timeout. There's still an underlying bug somewhere, right?
[jira] Commented: (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706302#action_12706302 ]

Noble Paul commented on SOLR-1144:

The stack trace is at http://markmail.org/message/ecr6m4rf4iy2d652 . I suspect the following two threads are blocked:
{code}
'NioBlockingSelector.BlockPoller-2' Id=10, RUNNABLE on lock=, total cpu time=5580.ms user time=2120.ms
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at org.apache.tomcat.util.net.NioBlockingSelector$BlockPoller.run(NioBlockingSelector.java:305)

'NioBlockingSelector.BlockPoller-1' Id=9, RUNNABLE on lock=, total cpu time=333280.ms user time=107520.ms
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at org.apache.tomcat.util.net.NioBlockingSelector$BlockPoller.run(NioBlockingSelector.java:305)
{code}
[jira] Commented: (SOLR-1144) replication hang
[ https://issues.apache.org/jira/browse/SOLR-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705891#action_12705891 ]

Noble Paul commented on SOLR-1144:

Isn't this the same as SOLR-1096?