[jira] [Created] (SOLR-5340) Add support for named snapshots
Mike Schrag created SOLR-5340:
------------------------------

             Summary: Add support for named snapshots
                 Key: SOLR-5340
                 URL: https://issues.apache.org/jira/browse/SOLR-5340
             Project: Solr
          Issue Type: Improvement
          Components: SolrCloud
    Affects Versions: 4.5
            Reporter: Mike Schrag

It would be really nice if Solr supported named snapshots. Right now, if you snapshot a SolrCloud cluster, every node potentially records a slightly different timestamp. Correlating those back together to effectively restore the entire cluster to a consistent snapshot is pretty tedious.

--
This message was sent by Atlassian JIRA (v6.1#6144)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
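The tedium described above can be sketched concretely. Solr 4.x names backup directories `snapshot.<timestamp>` (a `yyyyMMddHHmmssSSS`-style suffix; confirm the exact format for your version), so reassembling a consistent cluster-wide snapshot means clustering near-simultaneous timestamps across nodes. This is a minimal sketch with hypothetical node names and directory values, not actual Solr tooling:

```python
from datetime import datetime

# Hypothetical per-node snapshot directory listings; the timestamp
# suffix format is assumed to be yyyyMMddHHmmssSSS.
snapshots = {
    "node1": ["snapshot.20131014120501123", "snapshot.20131015090002456"],
    "node2": ["snapshot.20131014120502987", "snapshot.20131015090004111"],
    "node3": ["snapshot.20131014120459870", "snapshot.20131015090003790"],
}

def parse_ts(name):
    # Strip the "snapshot." prefix and parse the timestamp suffix.
    return datetime.strptime(name.split(".", 1)[1], "%Y%m%d%H%M%S%f")

def correlate(snapshots, tolerance_s=30):
    """Group per-node snapshots into 'cluster snapshots' whose
    timestamps all fall within tolerance_s seconds of each other."""
    events = sorted(
        (parse_ts(name), node, name)
        for node, names in snapshots.items()
        for name in names
    )
    groups, current = [], []
    for ts, node, name in events:
        # Start a new group when this snapshot is too far from the
        # first snapshot in the current group.
        if current and (ts - current[0][0]).total_seconds() > tolerance_s:
            groups.append(current)
            current = []
        current.append((ts, node, name))
    if current:
        groups.append(current)
    return groups
```

A named snapshot would replace all of this guesswork with an exact key shared by every node.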
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752699#comment-13752699 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

I think we tracked this down on our side. We noticed, when testing another part of the system, that we had SYN flood warnings in the system logs. I believe the kernel was blocking traffic to the Solr port once it decided that Hadoop was attacking it. By turning off net.ipv4.tcp_syncookies and increasing net.ipv4.tcp_max_syn_backlog, the problem seems to have gone away. This also explains why I was still able to connect to Solr and insert from another machine even after access from the Hadoop cluster died.

> Highly parallel document insertion hangs SolrCloud
> --------------------------------------------------
>
>                 Key: SOLR-5081
>                 URL: https://issues.apache.org/jira/browse/SOLR-5081
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.3.1
>            Reporter: Mike Schrag
>         Attachments: threads.txt
>
> If I do a highly parallel document load using a Hadoop cluster into an 18
> node solrcloud cluster, I can deadlock solr every time.
> The ulimits on the nodes are:
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 1031181
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 32768
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 515590
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
> The open file count is only around 4000 when this happens.
> If I bounce all the servers, things start working again, which makes me think
> this is Solr and not ZK.
> I'll attach the stack trace from one of the servers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
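The kernel-tunable workaround described in the comment above boils down to two sysctl settings. A sketch of the corresponding `/etc/sysctl.conf` entries follows; the backlog value is illustrative, since the comment does not state the exact figure used. Note that disabling syncookies removes a SYN-flood defense, so this trade-off should be made deliberately on trusted networks only:

```
# /etc/sysctl.conf -- sketch of the workaround described above.
# Backlog value is an assumption; the original report does not give one.
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_max_syn_backlog = 4096
```

Apply with `sysctl -p` (or `sysctl -w` for a one-off change without editing the file).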
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726848#comment-13726848 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

I actually did this exact test when I was in this state originally, and the insert _worked_, which totally confused the situation for me. However, in light of seeing nothing in the traces, it supports the theory that the cluster isn't hung, but rather that I'm somehow not even getting that far in the Hadoop cluster. ZK was my best guess as something that could be an earlier-stage failure, but even that I would have expected to hang the test insert. So I need to do a little more forensics here and see if I can get a better picture of wtf is going on.
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726801#comment-13726801 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

I grabbed more and they all look basically the same as the attached, which is to say, it sort of looks like Solr isn't doing ANYTHING. I'm going to look into whether I'm crushing ZooKeeper, and maybe my requests aren't even getting to Solr.
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723996#comment-13723996 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

I'll kill it again today and grab traces from a few of the nodes.
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723402#comment-13723402 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

1. numShards=20
2. RF=3
3. maxShardsPerNode=1000 (aka "just a big number" .. we overcommit shards in this environment)
4. not very big ... maybe 0.5-1k
5. -Xms10g -Xmx10g -XX:MaxPermSize=1G -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:CMSInitiatingOccupancyFraction=60 -XX:-OmitStackTraceInFastThrow
6. SolrJ + CloudSolrServer. When you say clients, do you mean threads, or actual client JVM instances? Talking more generically in terms of threads, I know it works at around 15-20 threads, but 100 threads makes it go sadfaced
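The load pattern described in the answers above (a Hadoop cluster fanning document batches out over ~100 concurrent senders) can be sketched in a language-neutral way. The actual client is SolrJ's `CloudSolrServer`; the sketch below stubs out the network send entirely, so it only illustrates the concurrency shape of the workload, not Solr's API:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

sent = []
lock = threading.Lock()

def send_batch(batch):
    # Stand-in for CloudSolrServer.add(batch); a real client would POST
    # the documents to the cluster here.
    with lock:
        sent.extend(batch)
    return len(batch)

def parallel_load(docs, batch_size=1, workers=100):
    """Fan a document stream out over `workers` concurrent senders,
    mirroring the ~100-thread load described above."""
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_batch, batches))

total = parallel_load([{"id": str(i)} for i in range(1000)])
```

With a real cluster behind `send_batch`, this is the shape of load that reportedly works at 15-20 workers but wedges the cluster at 100.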
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13721780#comment-13721780 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

No luck :( Whatever this hang is doesn't appear to be the same as that.
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13721781#comment-13721781 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

(that's with the latest SOLR-4816 patch applied)
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13721764#comment-13721764 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

btw, I dropped the Hadoop cluster to doing single-record batches in the run corresponding to that stack dump. I saw a note somewhere (I think from you?) that suggested increasing the semaphore permits, which I was about to test, too. It's not clear what a reasonable value is, but I jacked it up from *16 to *1024 and figured I'd go for broke :)
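The "semaphore permits" mentioned above refer to an internal cap Solr places on concurrent inter-node update requests (the exact class isn't named in this thread). As a toy model only, and not Solr's actual code, the sketch below shows the mechanism being tuned: a counting semaphore bounding how many forwards can be in flight at once, so too few permits can throttle (or, combined with other waits, wedge) a heavily parallel load:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ForwardLimiter:
    """Toy model of a permit-bounded forwarder: at most `permits`
    inter-node update requests may be in flight at once."""
    def __init__(self, permits):
        self._sem = threading.Semaphore(permits)
        self._lock = threading.Lock()
        self._in_flight = 0
        self.max_in_flight = 0  # high-water mark, for observation

    def forward(self, doc):
        with self._sem:  # blocks when all permits are taken
            with self._lock:
                self._in_flight += 1
                self.max_in_flight = max(self.max_in_flight, self._in_flight)
            # ... network I/O to the target replica would happen here ...
            with self._lock:
                self._in_flight -= 1
        return doc

# 100 workers contend for 16 permits, mirroring the tuning discussed above.
limiter = ForwardLimiter(permits=16)
with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(limiter.forward, range(1000)))
```

Raising the permit count (the "*16 to *1024" change above) widens this bottleneck at the cost of more concurrent connections per node.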
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13721761#comment-13721761 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

I attached a jstack of one of them (threads.txt) during this event. Do you want me to reproduce and grab more?
[jira] [Updated] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Schrag updated SOLR-5081:
------------------------------

    Attachment: threads.txt

stack dump of one of the nodes while hung
[jira] [Created] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
Mike Schrag created SOLR-5081:
------------------------------

             Summary: Highly parallel document insertion hangs SolrCloud
                 Key: SOLR-5081
                 URL: https://issues.apache.org/jira/browse/SOLR-5081
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.3.1
            Reporter: Mike Schrag

If I do a highly parallel document load using a Hadoop cluster into an 18 node solrcloud cluster, I can deadlock solr every time.

The ulimits on the nodes are:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1031181
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 515590
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

The open file count is only around 4000 when this happens.

If I bounce all the servers, things start working again, which makes me think this is Solr and not ZK.

I'll attach the stack trace from one of the servers.