[jira] [Commented] (SOLR-3685) Solr Cloud sometimes skipped peersync attempt and replicated instead due to tlog flags not being cleared when no updates were buffered during a previous replication.

2012-09-19 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458679#comment-13458679
 ] 

Markus Jelsma commented on SOLR-3685:
-

{quote}So what we're seeing here is the mmapped nodes use more RES and SHR than 
the NIO node. VIRT is as expected. I'll change another node to NIO and keep 
them running again for the next few days and keep sending documents and firing 
queries.{quote}

There is still an issue with mmap and high RES as opposed to NIOFS, but the 
actual issue reported here is already resolved. I'll open a new issue.

 Solr Cloud sometimes skipped peersync attempt and replicated instead due to 
 tlog flags not being cleared when no updates were buffered during a previous 
 replication.
 -

 Key: SOLR-3685
 URL: https://issues.apache.org/jira/browse/SOLR-3685
 Project: Solr
  Issue Type: Bug
  Components: replication (java), SolrCloud
Affects Versions: 4.0-ALPHA
 Environment: Debian GNU/Linux Squeeze 64bit
 Solr 5.0-SNAPSHOT 1365667M - markus - 2012-07-25 19:09:43
Reporter: Markus Jelsma
Assignee: Yonik Seeley
Priority: Critical
 Fix For: 4.0, 5.0

 Attachments: info.log, oom-killer.log, pmap.log


 There's a serious problem with restarting nodes: old or unused index 
 directories are not cleaned up, replication kicks in unexpectedly, and Java 
 gets killed by the OS due to excessive memory allocation. Since SOLR-1781 was 
 fixed, index directories get cleaned up when a node is restarted cleanly; 
 however, old or unused index directories still pile up if Solr crashes or is 
 killed by the OS, which is what happens here.
 We have a six-node 64-bit Linux test cluster with each node holding two 
 shards. There's 512MB RAM available and no swap. Each index is roughly 27MB, 
 so about 50MB per node; this fits easily and works fine. However, if a node 
 is restarted, Solr will consistently crash because it immediately eats up all 
 RAM. If swap is enabled, Solr will eat an additional few hundred MB right 
 after start-up.
 This cannot be solved by restarting Solr; it will just crash again and leave 
 index directories in place until the disk is full. The only way I can restart 
 a node safely is to delete the index directories and have it replicate from 
 another node. If I then restart the node, it will again crash almost every 
 time.
 I'll attach a log of one of the nodes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3685) Solr Cloud sometimes skipped peersync attempt and replicated instead due to tlog flags not being cleared when no updates were buffered during a previous replication.

2012-09-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457914#comment-13457914
 ] 

Robert Muir commented on SOLR-3685:
---

What's happening with this issue: is it still one? Should it be a 
critical/blocker for 4.0?




[jira] [Commented] (SOLR-3685) Solr Cloud sometimes skipped peersync attempt and replicated instead due to tlog flags not being cleared when no updates were buffered during a previous replication.

2012-08-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436182#comment-13436182
 ] 

Markus Jelsma commented on SOLR-3685:
-

Finally! Two nodes failed again and got killed by the OS. All nodes have a lot 
of off-heap RES memory, sometimes 3x higher than the heap, which is a meager 
256MB.

Got a name suggestion for the memory issue? I'll open one tomorrow and link it 
to this one.




[jira] [Commented] (SOLR-3685) Solr Cloud sometimes skipped peersync attempt and replicated instead due to tlog flags not being cleared when no updates were buffered during a previous replication.

2012-08-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436187#comment-13436187
 ] 

Mark Miller commented on SOLR-3685:
---

Are there any crash dump files? I don't think I've seen a Java process crash 
without one of these.




[jira] [Commented] (SOLR-3685) Solr Cloud sometimes skipped peersync attempt and replicated instead due to tlog flags not being cleared when no updates were buffered during a previous replication.

2012-08-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436194#comment-13436194
 ] 

Markus Jelsma commented on SOLR-3685:
-

One node also got rsyslogd killed, but the other survived. I assume the Linux 
OOM-killer output is what you're referring to?




[jira] [Commented] (SOLR-3685) Solr Cloud sometimes skipped peersync attempt and replicated instead due to tlog flags not being cleared when no updates were buffered during a previous replication.

2012-08-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436209#comment-13436209
 ] 

Yonik Seeley commented on SOLR-3685:


You may also want to try specifying NIOFSDirectoryFactory in solrconfig.xml to 
see if it's related to mmap.
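
For reference, switching the directory implementation is a small 
solrconfig.xml change; a minimal sketch, assuming the stock example config 
layout (only the directoryFactory element shown, everything else unchanged):

{code:xml}
<!-- Use NIOFSDirectory instead of the default factory (which picks
     MMapDirectory on 64-bit JVMs) so the mmap code path can be taken out of
     the picture while debugging memory use. -->
<directoryFactory name="DirectoryFactory" class="solr.NIOFSDirectoryFactory"/>
{code}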




[jira] [Commented] (SOLR-3685) Solr Cloud sometimes skipped peersync attempt and replicated instead due to tlog flags not being cleared when no updates were buffered during a previous replication.

2012-08-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436214#comment-13436214
 ] 

Markus Jelsma commented on SOLR-3685:
-

We didn't think mmap could be the cause, but we nevertheless tried that once on 
a smaller cluster and again saw a lot of memory consumption, after which the 
process got killed.
I can see if I can run one or two of the nodes with NIOFS and let the others 
run with mmap. We don't automatically restart cores, so it should work fine if 
we temporarily change the config in ZooKeeper and restart two nodes.




[jira] [Commented] (SOLR-3685) Solr Cloud sometimes skipped peersync attempt and replicated instead due to tlog flags not being cleared when no updates were buffered during a previous replication.

2012-08-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436243#comment-13436243
 ] 

Uwe Schindler commented on SOLR-3685:
-

Hi,
I also don't think MMap is the reason for this, but it's good that you test it. 
You are saying that this happened with NIOFS, too, so my only guess is:

As noted before (in my last comment), there seems to be something using 
off-heap memory (RES does not include mmap, so if RES rises, it's definitely 
not mmap), but other direct memory. I am not sure which other components in 
Solr might use direct memory. Maybe ZooKeeper? It's hard to find those things 
in external libraries. Can you try limiting -XX:MaxDirectMemorySize to zero 
and see if exceptions occur? It would also be good to have the output of 
pmap <pid>; it shows allocated and mapped memory, and we should look at the 
anonymous mappings and how many there are. pmap is in the procps package.
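
To make the two suggestions above concrete, a rough sketch; start.jar and the 
pgrep pattern are placeholders for however the nodes are actually launched 
here:

{code}
# Cap direct-buffer allocations, as suggested above; any component that
# allocates direct memory should then fail fast with an OutOfMemoryError
# whose stack trace points at the caller, instead of silently growing RES.
java -Xmx256m -XX:MaxDirectMemorySize=0 -jar start.jar

# Dump the process' memory map and pull out the anonymous mappings
# (pmap comes with the procps package on Debian).
pmap -x $(pgrep -f start.jar) | grep -i anon
{code}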
