C* throws OOM error despite use of automatic paging
Hi - We have an ETL application that reads all rows from Cassandra (2.1.2), filters them and stores a small subset in an RDBMS. Our application is using DataStax's Java driver (2.1.4) to fetch data from the C* nodes. Since the Java driver supports automatic paging, I was under the impression that SELECT queries should not cause an OOM error on the C* nodes. However, even with just 16GB of data on each node, the C* nodes start throwing OOM errors as soon as the application starts iterating through the rows of a table. The application code looks something like this:

    Statement stmt = new SimpleStatement("SELECT x,y,z FROM cf").setFetchSize(5000);
    ResultSet rs = session.execute(stmt);
    while (!rs.isExhausted()) {
        Row row = rs.one();
        process(row);
    }

Even after we reduced the page size to 1000, the C* nodes still crash. C* is running on m3.xlarge machines (4 cores, 15GB RAM). We manually increased the heap size to 8GB just to see how much heap C* consumes. Within 10-15 minutes, the heap usage climbs to 7.6GB. That does not make sense. Either automatic paging is not working or we are missing something.

Does anybody have insights as to what could be happening?

Thanks.

Mohammed
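[For reference, the fetch size can also be set once for the whole Cluster rather than per statement. A minimal sketch, assuming DataStax Java driver 2.1; the contact point and the "ks.cf" table are placeholders:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.QueryOptions;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class PagedRead {
        public static void main(String[] args) {
            // Default fetch size for every statement on this Cluster;
            // a per-statement setFetchSize() still overrides it.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1") // placeholder contact point
                    .withQueryOptions(new QueryOptions().setFetchSize(1000))
                    .build();
            Session session = cluster.connect();
            ResultSet rs = session.execute("SELECT x, y, z FROM ks.cf");
            // Iterating the ResultSet fetches the next page on demand,
            // so the client never holds more than ~one page of rows.
            for (Row row : rs) {
                // process(row); // placeholder for application logic
            }
            cluster.close();
        }
    }

Note this only bounds memory on the client; it does not by itself explain coordinator-side heap growth.]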
Re: High read latency after data volume increased
On Thu, Jan 8, 2015 at 6:38 PM, Roni Balthazar wrote:
> We downgraded to 2.1.1, but got the very same result. The read latency is still high, but we figured out that it happens only using a specific keyspace.

Note that downgrading is officially unsupported, but is probably safe between those two versions.

Enable tracing and paste the results for a high-latency query. Also, how much RAM is used for heap?

=Rob
Re: High read latency after data volume increased
Hi Robert,

We downgraded to 2.1.1, but got the very same result. The read latency is still high, but we figured out that it happens only using a specific keyspace. Please see the graphs below... Trying another keyspace with 600+ reads/sec, we are getting an acceptable ~30ms read latency.

Let me know if I need to provide more information.

Thanks,

Roni Balthazar

On Thu, Jan 8, 2015 at 5:23 PM, Robert Coli wrote:
> On Thu, Jan 8, 2015 at 11:14 AM, Roni Balthazar wrote:
>> We are using C* 2.1.2 with 2 DCs. 30 nodes DC1 and 10 nodes DC2.
>
> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>
> 2.1.2 in particular is known to have significant issues. You'd be better off running 2.1.1 ...
>
> =Rob
Re: Rebooted cassandra node timing out all requests but recovers after a while
I will keep an eye out for that if it happens again. Times at this point are synchronized.

On Wed, Jan 7, 2015 at 10:37 PM, Duncan Sands wrote:
> Hi Anand,
>
> On 08/01/15 02:02, Anand Somani wrote:
>> Hi,
>>
>> We have a 3 node cluster (on VMs), e.g. host1, host2, host3. One of the VMs (host1) rebooted, and when host1 came up it would see the others as down, and the others (host2 and host3) saw it as down. So we restarted host2 and now the ring seems fine (everybody sees everybody as up).
>>
>> But now the clients time out talking to host1. We have not figured out what is causing it. There is nothing in the logs that indicates a problem. Looking for indicators/help on what debug/tracing to turn on to find out what could be causing it.
>>
>> This happens only when a VM reboots (not otherwise); also it seems to have recovered itself after some hours!! (or restarts) - not sure which.
>>
>> This is 1.2.15; we are using SSL and Cassandra authorizers.
>
> Perhaps time is not synchronized between the nodes to begin with, and eventually becomes synchronized.
>
> Ciao, Duncan.
Re: High read latency after data volume increased
On Thu, Jan 8, 2015 at 11:14 AM, Roni Balthazar wrote:
> We are using C* 2.1.2 with 2 DCs. 30 nodes DC1 and 10 nodes DC2.

https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

2.1.2 in particular is known to have significant issues. You'd be better off running 2.1.1 ...

=Rob
High read latency after data volume increased
Hi there,

We are using C* 2.1.2 with 2 DCs: 30 nodes in DC1 and 10 nodes in DC2. While our data volume is increasing (34 TB now), we are running into some problems:

1) Read latency is around 1000 ms when running 600 reads/sec (DC1, CL.LOCAL_ONE). At the same time the load average is about 20-30 on all DC1 nodes (8-core CPU, 32 GB RAM), and C* starts timing out connections. In this scenario OpsCenter has issues as well: it resets all graph layouts back to the default on every refresh, and it doesn't return to normal after the load decreases. I only managed to restore OpsCenter's normal behavior by reinstalling it. For reference, we are using SATA HDDs on all nodes; running hdparm to check disk performance under this load, some nodes report very low read rates (under 10 MB/sec) while others are above 100 MB/sec. Under low load average this rate is above 250 MB/sec.

2) Repair takes at least 4-5 days to complete. The last repair was 20 days ago. Running repair under high load is bringing some nodes down with the exception: "JVMStabilityInspector.java:94 - JVM state determined to be unstable. Exiting forcefully due to: java.lang.OutOfMemoryError: Java heap space"

Any hints?

Regards,

Roni Balthazar
Re: incremental repairs
On Thu, Jan 8, 2015 at 12:28 AM, Marcus Eriksson wrote:
> But, if you are running 2.1 in production, I would recommend that you wait until 2.1.3 is out. https://issues.apache.org/jira/browse/CASSANDRA-8316 fixes a bunch of issues with incremental repairs.

There are other serious issues with 2.1.2; I +1 recommend no one run it in production. :D

=Rob
Re: incremental repairs
Hi Marcus,

thanks a lot for those pointers. Now further testing can begin - and I'll wait for 2.1.3. Right now repair times in production are really painful; maybe that will get better. At least I hope so :-)
Updated JMX metrics overview
I am adding some JMX metrics to our monitoring tool. Is there a good overview of all existing JMX metrics and their descriptions? http://wiki.apache.org/cassandra/Metrics has only a few metrics. In particular I have questions about these now:

- org.apache.cassandra.db type=StorageProxy TotalHints - is this the number of hints since the node was started, or a lifetime value?
- org.apache.cassandra.db type=StorageProxy ReadRepairRepairedBackground - is this the total number of background read repair requests that the node received since restart?
- org.apache.cassandra.db type=StorageProxy ReadRepairRepairedBlocking - same as above, but I assume that's the number of read repairs that have blocked a query (when does that even happen - aren't read repairs run in the background?)
- org.apache.cassandra.metrics type=Storage name=Exceptions Count - is this the total number of unhandled exceptions since node restart?
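[For what it's worth, all of these are plain JMX MBean attributes, so their values live in the node's JVM and reset whenever the node restarts - they are not lifetime values. A minimal sketch for reading the StorageProxy counters named above with the standard javax.management API; the host and the default JMX port 7199 are assumptions, and no JMX authentication is configured:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class StorageProxyMetrics {
        public static void main(String[] args) throws Exception {
            // Cassandra's JMX listener; 7199 is the default port.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName proxy =
                        new ObjectName("org.apache.cassandra.db:type=StorageProxy");
                // Cumulative counters since the JVM started; they do not
                // survive a node restart.
                for (String attr : new String[] {"TotalHints",
                        "ReadRepairRepairedBackground", "ReadRepairRepairedBlocking"}) {
                    System.out.println(attr + " = " + mbs.getAttribute(proxy, attr));
                }
            } finally {
                connector.close();
            }
        }
    }
]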
Re: How to bulkload into a specific data center?
Just noticed you'd sent this to the dev list as well; this is a question for the user list only, so please do not send questions of this type to the developer list.

On Thu, Jan 8, 2015 at 8:33 AM, Ryan Svihla wrote:
> The nature of replication factor is such that writes will go wherever there is replication. If you want responses to be faster, and not involve the REST data center in the Spark job's response path, I suggest using a CQL driver and the LOCAL_ONE or LOCAL_QUORUM consistency level (look at the Spark Cassandra connector here: https://github.com/datastax/spark-cassandra-connector ). While write traffic will still be replicated to the REST service data center, because you do want those results available, you will not be waiting on the remote data center to respond "successful".
>
> Final point: bulk loading sends a copy per replica across the wire, so let's say you have RF3 in each data center - that means bulk loading will send out 6 copies from that client at once, whereas normal mutations via Thrift or CQL writes go out to the other data center as 1 copy, which that node then forwards on to the other replicas. This means cross data center traffic in this case would be 3x more with the bulk loader than with a traditional CQL or Thrift based client.
>
> On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang wrote:
>> I set up two virtual data centers, one for analytics and one for REST service. The analytics data center sits on top of a Hadoop cluster. I want to bulk load my ETL results into the analytics data center so that the REST service won't have the heavy load. I'm using CQLTableInputFormat in my Spark application, and I gave the nodes in the analytics data center as the initial addresses.
>>
>> However, I found my jobs were connecting to the REST service data center.
>>
>> How can I specify the data center?

--
Thanks,
Ryan Svihla
Re: How to bulkload into a specific data center?
The nature of replication factor is such that writes will go wherever there is replication. If you want responses to be faster, and not involve the REST data center in the Spark job's response path, I suggest using a CQL driver and the LOCAL_ONE or LOCAL_QUORUM consistency level (look at the Spark Cassandra connector here: https://github.com/datastax/spark-cassandra-connector ). While write traffic will still be replicated to the REST service data center, because you do want those results available, you will not be waiting on the remote data center to respond "successful".

Final point: bulk loading sends a copy per replica across the wire, so let's say you have RF3 in each data center - that means bulk loading will send out 6 copies from that client at once, whereas normal mutations via Thrift or CQL writes go out to the other data center as 1 copy, which that node then forwards on to the other replicas. This means cross data center traffic in this case would be 3x more with the bulk loader than with a traditional CQL or Thrift based client.

On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang wrote:
> I set up two virtual data centers, one for analytics and one for REST service. The analytics data center sits on top of a Hadoop cluster. I want to bulk load my ETL results into the analytics data center so that the REST service won't have the heavy load. I'm using CQLTableInputFormat in my Spark application, and I gave the nodes in the analytics data center as the initial addresses.
>
> However, I found my jobs were connecting to the REST service data center.
>
> How can I specify the data center?

--
Thanks,
Ryan Svihla
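[As a concrete illustration of the LOCAL_* suggestion above: a minimal sketch, assuming DataStax Java driver 2.1. The DC name "analytics", the contact point, and the table are placeholders for whatever the actual NetworkTopologyStrategy setup uses:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

    public class AnalyticsDcWriter {
        public static void main(String[] args) {
            // Route requests only to coordinators in the "analytics" DC.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1") // placeholder: node in analytics DC
                    .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("analytics"))
                    .build();
            Session session = cluster.connect();
            // LOCAL_QUORUM acks once a quorum of *local* replicas respond;
            // replication to the REST DC still happens, but we don't wait on it.
            Statement insert = new SimpleStatement(
                    "INSERT INTO ks.results (id, value) VALUES (1, 'x')")
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            session.execute(insert);
            cluster.close();
        }
    }
]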
Re: Why does C* repeatedly compact the same tables over and over?
> Are you using Leveled compaction strategy?

And if you're using Date Tiered compaction strategy on a table that isn't time-series data - for example, where deletes happen - you'll find it compacting over and over.

~mck
Re: Why does C* repeatedly compact the same tables over and over?
Are you using Leveled compaction strategy?

If you fall behind on compaction in leveled (and you will during bootstrap), by default Cassandra will fall back to size tiered compaction in level 0. This will create SSTables larger than sstable_size_in_mb, and those will be recompacted away into level 1. When level 1 gets full, those will be recompacted away into level 2. If level 2 gets full, into level 3, and so on.

Leveled compaction is more I/O intensive than size tiered compaction for reasons such as this. The same data can be compacted numerous times before a node settles down. The upside is that it puts a practical upper bound on the number of SSTables which can be involved in a read (not more than one per level, not counting false positives from bloom filters). Leveled compaction is essentially trading increased I/O at write time (including and especially bootstrap or repair) for decreased I/O at read time.

On Thu, Jan 8, 2015 at 6:12 AM, Robert Wille wrote:
> After bootstrapping a node, the node repeatedly compacts the same tables over and over, even though my cluster is completely idle. I've noticed the same behavior after extended periods of heavy writes. I realize that during bootstrapping (or extended periods of heavy writes) compaction could get seriously behind, but once a table has been compacted, I don't see the need to recompact it dozens more times.
>
> Possibly related, I often see OpsCenter report that nodes have a large number of pending tasks when the Pending column of the Thread Pool Stats doesn't reflect that.
>
> Robert
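[Not from the original thread, but for anyone experimenting: sstable_size_in_mb is set per table as part of the compaction options. A sketch of the ALTER issued through the same Java driver used elsewhere in this digest; the keyspace/table names and the 160MB target are made up:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class SetLeveledCompaction {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            // A larger target size means fewer SSTables per level, at the
            // cost of bigger individual compactions.
            session.execute("ALTER TABLE ks.cf WITH compaction = "
                    + "{'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160}");
            cluster.close();
        }
    }
]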
Why does C* repeatedly compact the same tables over and over?
After bootstrapping a node, the node repeatedly compacts the same tables over and over, even though my cluster is completely idle. I've noticed the same behavior after extended periods of heavy writes. I realize that during bootstrapping (or extended periods of heavy writes) compaction could get seriously behind, but once a table has been compacted, I don't see the need to recompact it dozens more times.

Possibly related, I often see OpsCenter report that nodes have a large number of pending tasks when the Pending column of the Thread Pool Stats doesn't reflect that.

Robert
User audit in Cassandra
Hi,

Is there a way to enable user auditing or tracing if we have enabled PasswordAuthenticator in cassandra.yaml and set up the users as well? I noticed there are keyspaces system_auth and system_traces, but there is no way to find out which user initiated which session. Is there any way to find out? Also, is it recommended to enable tracing in production, e.g. to know how many sessions a user started?

Thanks

Ajay
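[Tracing can at least be turned on per query from the client side. A minimal sketch, assuming DataStax Java driver 2.1; the contact point and query are placeholders. This records a session in system_traces, but, as observed above, the trace does not tell you which authenticated user initiated it:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.QueryTrace;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class TracedQuery {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            // enableTracing() asks the coordinator to record this query's
            // execution in the system_traces keyspace.
            ResultSet rs = session.execute(
                    new SimpleStatement("SELECT * FROM ks.cf LIMIT 10").enableTracing());
            QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
            System.out.println("Trace session: " + trace.getTraceId()
                    + ", coordinator: " + trace.getCoordinator());
            cluster.close();
        }
    }
]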
Re: incremental repairs
Yes, you should reenable autocompaction.

/Marcus

On Thu, Jan 8, 2015 at 10:33 AM, Roland Etzenhammer <r.etzenham...@t-online.de> wrote:
> Hi Marcus,
>
> thanks for that quick reply. I did also look at:
>
> http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_repair_nodes_c.html
>
> which describes the same process. It's for 2.1.x, so I see that 2.1.2+ is not covered there. I did upgrade my test cluster to 2.1.2, and with your hint I took a look at sstablemetadata on a non-"migrated" node - there are indeed "Repaired at" entries on some sstables already. So if I got this right, in 2.1.2+ there is nothing to do to switch to incremental repairs (apart from running the repairs themselves).
>
> But one thing I see during testing is that there are many sstables with small sizes:
>
> - in total there are 5521 sstables on one node
> - 115 sstables are bigger than 1MB
> - 4949 sstables are smaller than 10kB
>
> I don't know where they came from - I found one piece of information saying this happened when Cassandra was low on heap, which happened to me while running tests (the suggested solution is to trigger compaction via JMX).
>
> Question for me: I did disable autocompaction on some nodes of our test cluster as the blog and docs said. Should/can I reenable autocompaction again with incremental repairs?
>
> Cheers,
> Roland
Re: incremental repairs
Hi Marcus,

thanks for that quick reply. I did also look at:

http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_repair_nodes_c.html

which describes the same process. It's for 2.1.x, so I see that 2.1.2+ is not covered there. I did upgrade my test cluster to 2.1.2, and with your hint I took a look at sstablemetadata on a non-"migrated" node - there are indeed "Repaired at" entries on some sstables already. So if I got this right, in 2.1.2+ there is nothing to do to switch to incremental repairs (apart from running the repairs themselves).

But one thing I see during testing is that there are many sstables with small sizes:

- in total there are 5521 sstables on one node
- 115 sstables are bigger than 1MB
- 4949 sstables are smaller than 10kB

I don't know where they came from - I found one piece of information saying this happened when Cassandra was low on heap, which happened to me while running tests (the suggested solution is to trigger compaction via JMX).

Question for me: I did disable autocompaction on some nodes of our test cluster as the blog and docs said. Should/can I reenable autocompaction again with incremental repairs?

Cheers,
Roland
Re: incremental repairs
If you are on 2.1.2+ (or using STCS) you don't need those steps (we should probably update the blog post). Now we keep separate levelings for the repaired/unrepaired data and move the sstables over after the first incremental repair.

But, if you are running 2.1 in production, I would recommend that you wait until 2.1.3 is out; https://issues.apache.org/jira/browse/CASSANDRA-8316 fixes a bunch of issues with incremental repairs.

-pr is sufficient; the same rules apply as before - if you run -pr you need to repair every node.

/Marcus

On Thu, Jan 8, 2015 at 9:16 AM, Roland Etzenhammer <r.etzenham...@t-online.de> wrote:
> Hi,
>
> I am currently trying to migrate my test cluster to incremental repairs. These are the steps I'm doing on every node:
>
> - touch marker
> - nodetool disableautocompaction
> - nodetool repair
> - cassandra stop
> - find all *Data*.db files older than the marker
> - invoke sstablerepairedset on those
> - cassandra start
>
> This is essentially what http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1 says. After all nodes have been migrated this way, I think I need to run my regular repairs more often, and they should be faster afterwards. But do I need to run "nodetool repair" or is "nodetool repair -pr" sufficient?
>
> And do I need to reenable autocompaction? Or do I need to compact myself?
>
> Thanks for any input,
> Roland
incremental repairs
Hi,

I am currently trying to migrate my test cluster to incremental repairs. These are the steps I'm doing on every node:

- touch marker
- nodetool disableautocompaction
- nodetool repair
- cassandra stop
- find all *Data*.db files older than the marker
- invoke sstablerepairedset on those
- cassandra start

This is essentially what http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1 says. After all nodes have been migrated this way, I think I need to run my regular repairs more often, and they should be faster afterwards. But do I need to run "nodetool repair" or is "nodetool repair -pr" sufficient?

And do I need to reenable autocompaction? Or do I need to compact myself?

Thanks for any input,
Roland