[jira] [Commented] (AMQ-5082) ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
[ https://issues.apache.org/jira/browse/AMQ-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392618#comment-14392618 ]

anselme dewavrin commented on AMQ-5082:
---------------------------------------

Well, it depends... When do you think v5.12 is planned for release? How do I patch a 5.11? Thank you; we have been using ActiveMQ heavily in production for 20 months and it works well. We could give a testimonial.

Anselme

ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
-------------------------------------------------------------------
Key: AMQ-5082
URL: https://issues.apache.org/jira/browse/AMQ-5082
Project: ActiveMQ
Issue Type: Bug
Components: activemq-leveldb-store
Affects Versions: 5.9.0, 5.10.0
Reporter: Scott Feldstein
Assignee: Christian Posta
Priority: Critical
Fix For: 5.12.0
Attachments: 03-07.tgz, amq_5082_threads.tar.gz, mq-node1-cluster.failure, mq-node2-cluster.failure, mq-node3-cluster.failure, zookeeper.out-cluster.failure

I have a 3-node AMQ cluster and one ZooKeeper node using a replicatedLevelDB persistence adapter.
{code}
<persistenceAdapter>
  <replicatedLevelDB
    directory="${activemq.data}/leveldb"
    replicas="3"
    bind="tcp://0.0.0.0:0"
    zkAddress="zookeep0:2181"
    zkPath="/activemq/leveldb-stores/"/>
</persistenceAdapter>
{code}
After about a day of sitting idle there are cascading failures and the cluster stops listening altogether. I can reproduce this consistently on 5.9 and the latest 5.10 (commit 2360fb859694bacac1e48092e53a56b388e1d2f0). I am going to attach logs from the three MQ nodes and the ZooKeeper logs covering the time when the cluster starts having issues. The cluster stops listening at Mar 4, 2014 4:56:50 AM (within 5 seconds). The OSs are all CentOS 5.9 on one ESX server, so I doubt networking is an issue. If you need more data it should be pretty easy to get whatever is needed, since the failure is consistently reproducible. This bug may be related to AMQ-5026, but it looks different enough to file a separate issue.
[jira] [Commented] (AMQ-5082) ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
[ https://issues.apache.org/jira/browse/AMQ-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392950#comment-14392950 ]

anselme dewavrin commented on AMQ-5082:
---------------------------------------

Thanks, Jim, for these clarifications. Doing it very soon!

Anselme

(Issue: AMQ-5082, https://issues.apache.org/jira/browse/AMQ-5082; full description quoted above.)
[jira] [Commented] (AMQ-5082) ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
[ https://issues.apache.org/jira/browse/AMQ-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390730#comment-14390730 ]

anselme dewavrin commented on AMQ-5082:
---------------------------------------

A big thank you, Jim, we were truly annoyed by this! Christian, will this be included in the main branch of the downloadable binaries? In which version?

Thanks in advance,
Anselme

(Issue: AMQ-5082, https://issues.apache.org/jira/browse/AMQ-5082; full description quoted above.)
[jira] [Commented] (AMQ-5235) erroneous temp percent used
[ https://issues.apache.org/jira/browse/AMQ-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303376#comment-14303376 ]

anselme dewavrin commented on AMQ-5235:
---------------------------------------

Yes, the deletion is lazy; several minutes sometimes. Also, a single message can keep a whole data file (a .log, 100 MB each by default) alive. We reduced this problem by setting logSize to 10 MB. For empty queues that still leave the temp percent non-zero, maybe you should try stopping and restarting your broker?

Anselme

erroneous temp percent used
---------------------------
Key: AMQ-5235
URL: https://issues.apache.org/jira/browse/AMQ-5235
Project: ActiveMQ
Issue Type: Bug
Components: activemq-leveldb-store
Affects Versions: 5.9.0
Environment: debian (quality testing and production)
Reporter: anselme dewavrin

Dear all,
We have an activemq 5.9 configured with 1 GB of tempUsage allowed, purely as a safety margin, because we only use persistent messages (about 6000 messages per day). After several days of use the temp usage increases, and even shows values above the total amount of data on disk. Here it shows 45% of its 1 GB limit for the following files:
{code}
find activemq-data -ls   (inode and block columns omitted; sizes in bytes)
drwxr-xr-x       4096 Jun 19 10:24 activemq-data
-rw-r--r--         24 Jun 16 16:13 activemq-data/store-version.txt
drwxr-xr-x       4096 Jun 16 16:13 activemq-data/dirty.index
-rw-r--r--       2437 Jun 16 12:06 activemq-data/dirty.index/08.sst
-rw-r--r--         16 Jun 16 16:13 activemq-data/dirty.index/CURRENT
-rw-r--r--      80313 Jun 16 16:13 activemq-data/dirty.index/11.sst
-rw-r--r--          0 Jun 16 16:13 activemq-data/dirty.index/LOCK
-rw-r--r--     305206 Jun 16 11:51 activemq-data/dirty.index/05.sst
-rw-r--r--    2097152 Jun 19 11:30 activemq-data/dirty.index/12.log
-rw-r--r--    1048576 Jun 16 16:13 activemq-data/dirty.index/MANIFEST-10
-rw-r--r--          0 Jun 16 16:13 activemq-data/lock
-rw-r--r--  104857600 Jun 19 11:30 activemq-data/00f0faaf.log
-rw-r--r--  104857600 Jun 16 11:50 activemq-data/00385f46.log
drwxr-xr-x       4096 Jun 16 16:13 activemq-data/00f0faaf.index
-rw-r--r--     429264 Jun 16 16:13 activemq-data/00f0faaf.index/09.log
-rw-r--r--       2437 Jun 16 12:06 activemq-data/00f0faaf.index/08.sst
-rw-r--r--        165 Jun 16 16:13 activemq-data/00f0faaf.index/MANIFEST-07
-rw-r--r--         16 Jun 16 16:13 activemq-data/00f0faaf.index/CURRENT
-rw-r--r--     305206 Jun 16 11:51 activemq-data/00f0faaf.index/05.sst
-rw-r--r--  104857600 Jun 12 21:06 activemq-data/.log
drwxr-xr-x       4096 Jun 16 16:13 activemq-data/plist.index
-rw-r--r--         16 Jun 16 16:13 activemq-data/plist.index/CURRENT
-rw-r--r--          0 Jun 16 16:13 activemq-data/plist.index/LOCK
-rw-r--r--    1048576 Jun 16 16:13 activemq-data/plist.index/03.log
-rw-r--r--    1048576 Jun 16 16:13 activemq-data/plist.index/MANIFEST-02
{code}
The problem is that in our production system it once blocked producers with tempUsage at 122%, even though the disk was empty. So we investigated, ran the broker in a debugger, and found how the usage is calculated. It is in the Scala leveldb files: it is not based on what is on disk, but on what the store thinks is on disk. It multiplies the size of one log by the number of logs known to a certain hashmap. I think the entries of the hashmap are not removed when the log files are purged. Could you confirm?

Thanks in advance,
Anselme
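For reference, the 100 MB journal size and the workaround mentioned in the comment correspond to the store's logSize setting; a minimal sketch of such a configuration (directory and value illustrative):
{code}
<persistenceAdapter>
  <!-- logSize is in bytes; 10485760 = 10 MB instead of the 104857600 default -->
  <levelDB directory="${activemq.data}/leveldb" logSize="10485760"/>
</persistenceAdapter>
{code}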
[jira] [Commented] (AMQ-4987) io wait on replicated levelDB slaves
[ https://issues.apache.org/jira/browse/AMQ-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047684#comment-14047684 ]

anselme dewavrin commented on AMQ-4987:
---------------------------------------

Still up and running in production.

io wait on replicated levelDB slaves
------------------------------------
Key: AMQ-4987
URL: https://issues.apache.org/jira/browse/AMQ-4987
Project: ActiveMQ
Issue Type: Test
Components: activemq-leveldb-store
Affects Versions: 5.9.0
Environment: debian VM 2.6.32-5-amd64, jdk7
Reporter: anselme dewavrin
Priority: Minor
Fix For: 5.9.0

Dear all,
I set up a 3-node replicatedLevelDB ActiveMQ cluster on 3 different machines, as explained on the ActiveMQ site (with ZooKeeper etc.). I made a message injector using the PHP STOMP client described here: http://stomp.fusesource.org/documentation/php/book.html
Then I injected persistent messages as fast as possible (about 100 messages/s, each message 10 kB). Everything works fine, but when I measured the servers' activity with vmstat 1 I saw no iowait on the master node, yet 20% on both slaves. I suppose this would hamper scalability. The iowait is corroborated by 3000 bo/s (blocks out) in the vmstat report. The machines are not swapping (paging).
Here is what I tried, without success:
- specifying sync=quorum_mem explicitly
- the JNI implementation of the leveldb store (and verified it is used)
- setting flushDelay to 2000
Does anyone have an idea I could try? Why are the leveldb slaves writing so much to disk?
Many thanks in advance,
Yours, Anselme
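For anyone repeating the measurement above, the relevant columns of the standard procps vmstat output (as documented in its man page) are:
{code}
vmstat 1
#   bo - blocks sent to a block device per second
#   wa - percentage of CPU time spent idle while waiting for I/O
{code}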
[jira] [Commented] (AMQ-5082) ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
[ https://issues.apache.org/jira/browse/AMQ-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041885#comment-14041885 ]

anselme dewavrin commented on AMQ-5082:
---------------------------------------

Dear all,
We had the exact same problem, due to backups at our hosting company that we could not avoid. We worked around it by increasing the tickTime in ZooKeeper (see the sketch below).

Anselme

(Issue: AMQ-5082, https://issues.apache.org/jira/browse/AMQ-5082; full description quoted above.)
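For readers applying the same workaround, a minimal zoo.cfg sketch (values illustrative; 2000 ms is the stock default, and ZooKeeper negotiates session timeouts between 2x and 20x tickTime, so raising it lets sessions survive longer stalls such as backup-induced I/O freezes):
{code}
# zoo.cfg
tickTime=10000   # base time unit in ms (default 2000)
initLimit=10
syncLimit=5
{code}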
[jira] [Commented] (AMQ-5105) leveldb fails to startup because of NoSuchMethodError
[ https://issues.apache.org/jira/browse/AMQ-5105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041890#comment-14041890 ]

anselme dewavrin commented on AMQ-5105:
---------------------------------------

It worked for us, thanks.

Anselme

leveldb fails to startup because of NoSuchMethodError
-----------------------------------------------------
Key: AMQ-5105
URL: https://issues.apache.org/jira/browse/AMQ-5105
Project: ActiveMQ
Issue Type: Bug
Components: activemq-leveldb-store
Affects Versions: 5.10.0
Environment: Any
Reporter: Netlancer
Priority: Minor

leveldb persistence fails to start due to errors such as:
{code}
Caused by: java.lang.NoSuchMethodError: com.google.common.base.Objects.firstNonNull(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
    at com.google.common.cache.CacheBuilder.getKeyStrength(CacheBuilder.java:530)
    at com.google.common.cache.LocalCache.<init>(LocalCache.java:238)
    at com.google.common.cache.LocalCache$LocalLoadingCache.<init>(LocalCache.java:4861)
    at com.google.common.cache.CacheBuilder.build(CacheBuilder.java:803)
    at org.iq80.leveldb.impl.TableCache.<init>(TableCache.java:46)
    at org.iq80.leveldb.impl.DbImpl.<init>(DbImpl.java:155)
    at org.iq80.leveldb.impl.Iq80DBFactory.open(Iq80DBFactory.java:59)
    at org.apache.activemq.leveldb.LevelDBClient$$anonfun$replay_init$2.apply$mcV$sp(LevelDBClient.scala:661)
    at org.apache.activemq.leveldb.LevelDBClient$$anonfun$replay_init$2.apply(LevelDBClient.scala:657)
    at org.apache.activemq.leveldb.LevelDBClient$$anonfun$replay_init$2.apply(LevelDBClient.scala:657)
    at org.apache.activemq.leveldb.LevelDBClient.might_fail(LevelDBClient.scala:549)
{code}
The problem seems to be caused by multiple jars containing the same classes: guava-12.jar and pax-url-aether-1.5.2.jar. The class from pax-url-aether-1.5.2.jar gets loaded, causing leveldb to fail.
[jira] [Comment Edited] (AMQ-5105) leveldb fails to startup because of NoSuchMethodError
[ https://issues.apache.org/jira/browse/AMQ-5105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041890#comment-14041890 ]

anselme dewavrin edited comment on AMQ-5105 at 6/24/14 8:58 AM:
----------------------------------------------------------------

It worked for us, thanks.

Anselme

was (Author: adewavrin): I worked for us, thanks. Anselme

(Issue: AMQ-5105, https://issues.apache.org/jira/browse/AMQ-5105; full description quoted above.)
[jira] [Comment Edited] (AMQ-5082) ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
[ https://issues.apache.org/jira/browse/AMQ-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041885#comment-14041885 ]

anselme dewavrin edited comment on AMQ-5082 at 6/24/14 9:03 AM:
----------------------------------------------------------------

Dear all,
We had the exact same problem, due to backups at our hosting company that we could not avoid. So we can consider that the cluster behaves normally: when unable to communicate with each other, the brokers try to fail over. We increased the tickTime in ZooKeeper and it works perfectly; 4 months in production already.

Anselme

was (Author: adewavrin): Dear all, We had the exact same problem due to backups at our hosting company, that we could not avoid. We worked around by increasing the ticktime in zookeeper. Anselme

(Issue: AMQ-5082, https://issues.apache.org/jira/browse/AMQ-5082; full description quoted above.)
[jira] [Commented] (AMQ-5235) erroneous temp percent used
[ https://issues.apache.org/jira/browse/AMQ-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038587#comment-14038587 ]

anselme dewavrin commented on AMQ-5235:
---------------------------------------

Timothy, thank you for your answer and suggestion. We see the same behavior with the 5.10 snapshot of 20 December that we have in production.

Anselme

(Issue: AMQ-5235, https://issues.apache.org/jira/browse/AMQ-5235; full description and file listing quoted above.)
[jira] [Commented] (AMQ-5235) erroneous temp percent used
[ https://issues.apache.org/jira/browse/AMQ-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038602#comment-14038602 ]

anselme dewavrin commented on AMQ-5235:
---------------------------------------

In fact, I would greatly appreciate it if an author could confirm that tempUsage is computed in LevelDBClient.scala at the line:
{code}
def size: Long = logRefs.size * store.logSize
{code}
Thank you in advance,
Anselme

(Issue: AMQ-5235, https://issues.apache.org/jira/browse/AMQ-5235; full description and file listing quoted above.)
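If that line is indeed the accounting, the suspected failure mode can be illustrated with a self-contained sketch (simplified, hypothetical names mirroring logRefs and logSize; the broker's actual logic lives in LevelDBClient.scala):
{code}
// TempUsageSketch.scala -- illustrative only, not the broker's actual code.
import scala.collection.concurrent.TrieMap

object TempUsageSketch {
  val logSize: Long = 100L * 1024 * 1024       // 100 MB journal logs (default)
  val logRefs = TrieMap[Long, Int]()           // log position -> reference count

  // Usage is derived from the map, not from the files actually on disk:
  def size: Long = logRefs.size * logSize

  def main(args: Array[String]): Unit = {
    logRefs ++= Seq(0L -> 1, 1L -> 1, 2L -> 1) // three journal logs tracked
    println(s"reported usage: ${size / (1024 * 1024)} MB")  // 300 MB
    // If two logs are deleted on disk but their map entries linger, the
    // reported usage stays at 300 MB; correct accounting must remove the
    // entries when the logs are purged:
    logRefs -= 0L
    logRefs -= 1L
    println(s"after cleanup:  ${size / (1024 * 1024)} MB")  // 100 MB
  }
}
{code}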
[jira] [Created] (AMQ-5235) erroneous temp percent used
anselme dewavrin created AMQ-5235:
----------------------------------
Summary: erroneous temp percent used
Key: AMQ-5235
URL: https://issues.apache.org/jira/browse/AMQ-5235
Project: ActiveMQ
Issue Type: Bug
Components: activemq-leveldb-store
Affects Versions: 5.9.0
Environment: debian (quality testing and production)
Reporter: anselme dewavrin

Dear all,
We have an activemq 5.9 configured with 1 GB of tempUsage allowed, purely as a safety margin, because we only use persistent messages (about 6000 messages per day). After several days of use the temp usage increases, and even shows values above the total amount of data on disk. Here it shows 45% of its 1 GB limit (find activemq-data -ls listing as quoted in full above). The problem is that in our production system it once blocked producers with tempUsage at 122%, even though the disk was empty. So we investigated, ran the broker in a debugger, and found how the usage is calculated. It is in the Scala leveldb files: it is not based on what is on disk, but on what the store thinks is on disk. It multiplies the size of one log by the number of logs recorded in a hashmap. I think the entries of the hashmap are not removed when the log files are purged. Could you confirm?

Thanks in advance,
Anselme
[jira] [Updated] (AMQ-5235) erroneous temp percent used
[ https://issues.apache.org/jira/browse/AMQ-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anselme dewavrin updated AMQ-5235:
----------------------------------
Description: (the full text quoted above, beginning "Dear all, We have an activemq 5.9 configured with 1 GB of tempUsage allowed...")

was: (the previous revision; identical as far as this digest shows, which truncates it mid-listing)
[jira] [Resolved] (AMQ-4987) io wait on replicated levelDB slaves
[ https://issues.apache.org/jira/browse/AMQ-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anselme dewavrin resolved AMQ-4987.
-----------------------------------
Resolution: Fixed
Fix Version/s: 5.9.0

This fsync suppression works perfectly.

Anselme

(Issue: AMQ-4987, https://issues.apache.org/jira/browse/AMQ-4987; full description quoted above.)
[jira] [Created] (AMQ-5063) when restarted, slaves do not really sync
anselme dewavrin created AMQ-5063:
----------------------------------
Summary: when restarted, slaves do not really sync
Key: AMQ-5063
URL: https://issues.apache.org/jira/browse/AMQ-5063
Project: ActiveMQ
Issue Type: Bug
Components: activemq-leveldb-store
Affects Versions: 5.9.0
Environment: debian 6, x86-64, jdk1.7
Reporter: anselme dewavrin

Dear All,
I have spent several days on replication tests with 5.9 (and the 5.10 snapshot), using the replicatedLevelDB configuration as explained on the ActiveMQ website (details follow). Everything is replicated well as long as every node is up. But if I stop a slave, inject messages, and then restart it, the logs say it caught up; yet if I then make it become the master, the new messages are not there... Did I misunderstand the goal of replication? Is this normal?
Thank you all,
Anselme
{code}
<persistenceAdapter>
  <replicatedLevelDB
    directory="/usr2/talend/activemq/data"
    sync="quorum_disk"
    weight="2"
    replicas="3"
    bind="tcp://auchanhmi-was02-prod.ope.cloud.mbs:1"
    zkAddress="a-was01-prod.ope.cloud.mbs:2190,a-was02-prod.ope.cloud.mbs:2190,ahmi-was10-prod.ope.cloud.mbs:2190"
    zkPassword="password"
    zkPath="/activemq/leveldb-stores"
    hostname="a-was02-prod.ope.cloud.mbs"
    verifyChecksums="true"
    paranoidChecks="true"/>
</persistenceAdapter>
{code}
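A minimal injector/checker along these lines (plain JMS over the standard ActiveMQ client; the failover: URL, port, and queue name are illustrative) can be used to test whether messages produced while a slave was down survive a later failover to that node:
{code}
// ReplicationCheck.scala -- illustrative sketch; needs activemq-client on the classpath.
import javax.jms.{DeliveryMode, Session}
import org.apache.activemq.ActiveMQConnectionFactory

object ReplicationCheck {
  def main(args: Array[String]): Unit = {
    // The failover transport reconnects to whichever broker is currently master.
    val factory = new ActiveMQConnectionFactory(
      "failover:(tcp://node1:61616,tcp://node2:61616,tcp://node3:61616)")
    val conn = factory.createConnection()
    conn.start()
    val session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE)
    val queue = session.createQueue("replication.test")

    if (args.headOption.contains("send")) {
      // Run this while one slave is stopped.
      val producer = session.createProducer(queue)
      producer.setDeliveryMode(DeliveryMode.PERSISTENT)
      (1 to 100).foreach(i => producer.send(session.createTextMessage(s"msg-$i")))
      println("sent 100 persistent messages")
    } else {
      // Run this after restarting the slave and forcing it to become master.
      val consumer = session.createConsumer(queue)
      var count = 0
      while (consumer.receive(2000) != null) count += 1
      println(s"received $count messages")  // 100 expected if replication caught up
    }
    conn.close()
  }
}
{code}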
[jira] [Comment Edited] (AMQ-4987) io wait on replicated levelDB slaves
[ https://issues.apache.org/jira/browse/AMQ-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903946#comment-13903946 ]

anselme dewavrin edited comment on AMQ-4987 at 2/18/14 12:54 PM:
-----------------------------------------------------------------

After 3 weeks this fsync suppression works perfectly.

Anselme

was (Author: adewavrin): This fsync suppression works perfectly. Anselme

(Issue: AMQ-4987, https://issues.apache.org/jira/browse/AMQ-4987; full description quoted above.)
[jira] [Comment Edited] (AMQ-4987) io wait on replicated levelDB slaves
[ https://issues.apache.org/jira/browse/AMQ-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903946#comment-13903946 ]

anselme dewavrin edited comment on AMQ-4987 at 2/18/14 12:55 PM:
-----------------------------------------------------------------

After 3 weeks this fsync suppression still works perfectly.

Anselme

was (Author: adewavrin): After 3 weeks this fsync suppression works perfectly. Anselme

(Issue: AMQ-4987, https://issues.apache.org/jira/browse/AMQ-4987; full description quoted above.)
[jira] [Updated] (AMQ-5063) when restarted, slaves do not really sync
[ https://issues.apache.org/jira/browse/AMQ-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anselme dewavrin updated AMQ-5063:
----------------------------------
Description: Dear All, I have spent several days on replication tests with 5.9 (and the 5.10 snapshot), using the replicatedLevelDB configuration as explained on the ActiveMQ website (details follow). Everything is replicated well as long as every node is up. But if I stop a slave, inject messages, and then restart it, the logs say it caught up; yet if I then make it become the master, the messages are not there... Did I misunderstand the goal of replication? Is this normal? Thank you all, Anselme
For instance the configuration of the 2nd node is:
{code}
<persistenceAdapter>
  <replicatedLevelDB
    directory="/usr2/talend/activemq/data"
    sync="quorum_disk"
    weight="2"
    replicas="3"
    bind="tcp://awas02:1"
    zkAddress="awas01:2190,awas02:2190,awas10:2190"
    zkPassword="password"
    zkPath="/activemq/leveldb-stores"
    hostname="awas02"
    verifyChecksums="true"
    paranoidChecks="true"/>
</persistenceAdapter>
{code}

was: (the original description, as in the creation notification above, with the unshortened hostnames)

(Issue: AMQ-5063, https://issues.apache.org/jira/browse/AMQ-5063)
[jira] [Updated] (AMQ-4987) io wait on replicated levelDB slaves
[ https://issues.apache.org/jira/browse/AMQ-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anselme dewavrin updated AMQ-4987:
----------------------------------
Description: Dear all, I set up a 3-node replicatedLevelDB ActiveMQ cluster as explained on the ActiveMQ site. I made a message injector using the PHP STOMP client described here: http://stomp.fusesource.org/documentation/php/book.html Then I injected persistent messages as fast as possible (about 600 messages/s). Everything works fine, but when I measured the servers' activity with vmstat 1 I saw no iowait on the master node, yet 20% on both slaves. I suppose this would hamper scalability. The iowait is corroborated by 3000 bo/s (blocks out) in the vmstat report. The machines are not swapping (paging). Here is what I tried, without success: specifying sync=quorum_mem explicitly; the JNI implementation of the leveldb store (and verified it is used); setting flushDelay to 2000. Does anyone have an idea I could try? Why are the leveldb slaves writing so much to disk? Many thanks in advance, Yours, Anselme

was: (the same text, minus the sentence about the 3000 bo/s corroboration and the question about why the slaves write so much)

(Issue: AMQ-4987, https://issues.apache.org/jira/browse/AMQ-4987)
[jira] [Commented] (AMQ-4987) io wait on replicated levelDB slaves
[ https://issues.apache.org/jira/browse/AMQ-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879805#comment-13879805 ]

anselme dewavrin commented on AMQ-4987:
---------------------------------------

Dear All,
The iowait and huge disk activity are due to frequent fsyncs on the slaves. I demonstrated this by preloading a library that disables fsyncs (with LD_PRELOAD=libeatmydata.so); with this trick the iowait disappears. I think the replicated leveldb store is in itself secure enough without fsyncing each update, because the synchronous replication (sync=quorum_mem) spans different machines, which are unlikely to fail within the same second. This is why the many fsyncs on the slaves are not useful, in my opinion. For my purposes I will live with the LD_PRELOAD, but it could be profitable for the community to make the replicated levelDB store evolve.

Anselme

(Issue: AMQ-4987, https://issues.apache.org/jira/browse/AMQ-4987; full description quoted above.)
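The workaround described amounts to something like the following (library path varies by distribution; use with care, since libeatmydata turns fsync into a no-op and a crash can lose the most recent writes):
{code}
# illustrative; adjust the path to wherever libeatmydata is installed
LD_PRELOAD=/usr/lib/libeatmydata.so bin/activemq start
{code}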
[jira] [Updated] (AMQ-4987) io wait on replicated levelDB slaves
[ https://issues.apache.org/jira/browse/AMQ-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anselme dewavrin updated AMQ-4987:
----------------------------------
Description: (the text quoted above, with the injection rate corrected to "about 100 messages/s, each message 10 kB")

was: (the same text with the rate given as "about 600 messages/s")

(Issue: AMQ-4987, https://issues.apache.org/jira/browse/AMQ-4987)
[jira] [Updated] (AMQ-4987) io wait on replicated levelDB slaves
[ https://issues.apache.org/jira/browse/AMQ-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anselme dewavrin updated AMQ-4987:
----------------------------------
Description: (the text quoted above, now specifying that the cluster runs on 3 different machines, as explained on the ActiveMQ site, with ZooKeeper etc.)

was: (the same text without that clarification)

(Issue: AMQ-4987, https://issues.apache.org/jira/browse/AMQ-4987)
[jira] [Created] (AMQ-4987) io wait on replicated levelDB slaves
anselme dewavrin created AMQ-4987:
----------------------------------
Summary: io wait on replicated levelDB slaves
Key: AMQ-4987
URL: https://issues.apache.org/jira/browse/AMQ-4987
Project: ActiveMQ
Issue Type: Test
Components: activemq-leveldb-store
Affects Versions: 5.9.0
Environment: debian VM 2.6.32-5-amd64, jdk7
Reporter: anselme dewavrin
Priority: Minor

Dear all,
I set up a 3-node replicatedLevelDB ActiveMQ cluster as explained on the ActiveMQ site. I made a message injector using the PHP STOMP client described here: http://stomp.fusesource.org/documentation/php/book.html
Then I injected persistent messages as fast as possible (about 600 messages/s). Everything works fine, but when I measured the servers' activity with vmstat 1 I saw no iowait on the master node, yet 20% on both slaves. I suppose this would hamper scalability. The machines are not swapping (paging).
Here is what I tried, without success:
- specifying sync=quorum_mem explicitly
- the JNI implementation of the leveldb store (and verified it is used)
- setting flushDelay to 2000
Does anyone have an idea I could try?
Many thanks in advance,
Yours, Anselme