Re: consistency ONE and null
As I understand it, read repair is a background task triggered by the read request, but once the consistency requirement has been met you will be given a response. The coordinator at CL.ONE is allowed to return your response once it has one response (empty or not) from any replica. If the first response is empty, you get null.

- Stephen

--- Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using Swype to type on the screen.

On 7 Apr 2011 00:10, Jonathan Colby jonathan.co...@gmail.com wrote:
Let's say you have RF of 3 and a write was written to 2 nodes; 1 was not written because the node had a network hiccup (but came back online again). My question is: if you are reading a key with a CL of ONE and you happen to land on the node that didn't get the write, will the read fail immediately? Or would read repair check the other replicas and fetch the correct data from the other node(s)?
Secondly, is read repair done according to the consistency level, or is read repair an independent configuration setting that can be turned on/off?
There was a recent thread about a different variation of my question, but it went into very technical details, so I didn't want to hijack that thread.
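As a minimal illustration of that behaviour (a sketch only; the keyspace and column family names are made up, and it assumes pycassa's 0.7-era API, which exports ConsistencyLevel and NotFoundException at the package level):

import pycassa

pool = pycassa.connect('Keyspace1', ['localhost:9160'])   # hypothetical keyspace/servers
cf = pycassa.ColumnFamily(pool, 'Standard1',
                          read_consistency_level=pycassa.ConsistencyLevel.ONE)
try:
    # the coordinator answers with the first replica response it gets back
    print cf.get('some_key')
except pycassa.NotFoundException:
    # an empty first response surfaces as "not found" (null),
    # even if another replica does hold the write
    print 'no data returned'

Read repair may still run in the background afterwards, so a later read at CL.ONE against the same replica can succeed.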
Re: consistency ONE and null
Also, there is a configuration parameter that controls the probability of any read request triggering a read repair.

- Stephen

--- Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using Swype to type on the screen.

On 7 Apr 2011 07:35, Stephen Connolly stephen.alan.conno...@gmail.com wrote:
As I understand it, read repair is a background task triggered by the read request, but once the consistency requirement has been met you will be given a response. The coordinator at CL.ONE is allowed to return your response once it has one response (empty or not) from any replica. If the first response is empty, you get null.
- Stephen

On 7 Apr 2011 00:10, Jonathan Colby jonathan.co...@gmail.com wrote:
Let's say you have RF of 3 and a write was written to 2 nodes; 1 was not written because the node had a network hiccup (but came back online again). My question is: if you are reading a key with a CL of ONE and you happen to land on the node that didn't get the write, will the read fail immediately? Or would read repair check the other replicas and fetch the correct data from the other node(s)?
Secondly, is read repair done according to the consistency level, or is read repair an independent configuration setting that can be turned on/off?
There was a recent thread about a different variation of my question, but it went into very technical details, so I didn't want to hijack that thread.
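The parameter in question is the per-column-family read_repair_chance (it defaults to 1.0, i.e. every read triggers read repair of the other replicas in the background). As a sketch of lowering it in the 0.7 CLI (the column family name here is just an example):

update column family Standard1 with read_repair_chance = 0.1;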
problem with large batch mutation set
Hi,

I am using the thrift client batch_mutate method with Cassandra 0.7.0 on Ubuntu 10.10. When the size of the mutations gets too large, the client fails with the following exception:

Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
	at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
	at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:157)
	at org.apache.cassandra.thrift.Cassandra$Client.send_batch_mutate(Cassandra.java:901)
	at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:889)
	at com.cxense.cxad.core.persistence.cassandra.store.BatchMutationTask.apply(BatchMutationTask.java:78)
	at com.cxense.cxad.core.persistence.cassandra.store.BatchMutationTask.apply(BatchMutationTask.java:30)
	at com.cxense.cassandra.conn.DefaultCassandraConnectionTemplate.execute(DefaultCassandraConnectionTemplate.java:316)
	at com.cxense.cassandra.conn.DefaultCassandraConnectionTemplate.execute(DefaultCassandraConnectionTemplate.java:257)
	at com.cxense.cxad.core.persistence.cassandra.store.AbstractCassandraStore.writeMutations(AbstractCassandraStore.java:492)
	... 39 more
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
	at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
	... 47 more

It took a while for me to discover that this obscure error message was the result of the thrift message exceeding the maximum frame size specified for the Cassandra server (default 15MB) [using TFastFramedTransport with a max frame size of 15728640 bytes]. The poor error message looks like a bug in the Thrift server code, which assumes that any transport exception is a connection failure that should drop the connection.

My main problem is how to ensure that this does not occur again in running code. I could configure the server with a larger frame size, but this size is effectively arbitrary, so there is no guarantee that our code would not occasionally send a mutation that is still too large. My current work-around is to break the mutation list into multiple parts, but to do this correctly I need to track the size of each mutation, which is fairly messy.

Is there some way to configure Thrift or Cassandra to deal with messages that are larger than the max frame size (at either client or server)?

Thanks,
Ross
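For what it's worth, the splitting work-around can be kept fairly tidy by estimating sizes while the batch is built rather than afterwards. A rough sketch in Python against the raw thrift client (estimate_size is a hypothetical helper, and the 50% safety margin is arbitrary):

FRAME_LIMIT = 15 * 1024 * 1024   # server's max frame size
BUDGET = FRAME_LIMIT / 2         # stay well below the limit

def send_in_chunks(client, mutation_map, consistency):
    # mutation_map is the usual {key: {column_family: [Mutation, ...]}} structure
    chunk, chunk_size = {}, 0
    for key, cf_map in mutation_map.items():
        size = estimate_size(key, cf_map)   # e.g. sum of column name/value lengths plus some overhead
        if chunk and chunk_size + size > BUDGET:
            client.batch_mutate(chunk, consistency)
            chunk, chunk_size = {}, 0
        chunk[key] = cf_map
        chunk_size += size
    if chunk:
        client.batch_mutate(chunk, consistency)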
Re: consistency ONE and null
That makes sense. Thanks!

On Apr 7, 2011, at 8:36 AM, Stephen Connolly wrote:
Also, there is a configuration parameter that controls the probability of any read request triggering a read repair.
- Stephen

On 7 Apr 2011 07:35, Stephen Connolly stephen.alan.conno...@gmail.com wrote:
As I understand it, read repair is a background task triggered by the read request, but once the consistency requirement has been met you will be given a response. The coordinator at CL.ONE is allowed to return your response once it has one response (empty or not) from any replica. If the first response is empty, you get null.
- Stephen

On 7 Apr 2011 00:10, Jonathan Colby jonathan.co...@gmail.com wrote:
Let's say you have RF of 3 and a write was written to 2 nodes; 1 was not written because the node had a network hiccup (but came back online again). My question is: if you are reading a key with a CL of ONE and you happen to land on the node that didn't get the write, will the read fail immediately? Or would read repair check the other replicas and fetch the correct data from the other node(s)?
Secondly, is read repair done according to the consistency level, or is read repair an independent configuration setting that can be turned on/off?
There was a recent thread about a different variation of my question, but it went into very technical details, so I didn't want to hijack that thread.
Re: RE: batch_mutate failed: out of sequence response
On Wed, 2011-04-06 at 21:04 -0500, Jonathan Ellis wrote:
out of sequence response is thrift's way of saying I got a response for request Y when I expected request X. my money is on using a single connection from multiple threads. don't do that.

I'm not using thrift directly, and my application is single-threaded, so I guess this is Pelops' fault somehow. Since I managed to tame memory consumption the problem has not appeared again, but it always happened during a stop-the-world GC. Could it be that the message was sent instead of being dropped by the server when the client assumed it had timed out?
Re: Error messages after rolling updating cassandra from 0.7.0 to 0.7.2
Thanks to both of you. I'll look into it from another angle, including non-Cassandra processes on the servers. My guess is that Nagios, iptables or something else is causing it.

Kazuo

(11/04/05 22:55), Sasha Dolgy wrote:
I've been seeing this EOF in my system.log file occasionally as well. It doesn't seem to be causing harm:

ERROR [Thread-22] 2011-04-05 20:37:22,562 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[Thread-22,5,main]
java.io.IOError: java.io.EOFException
	at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:73)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:375)
	at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:61)

Firewall rules prevent anything but cassandra instances accessing cassandra instances on port 7000 ... This is with 0.7.4.
-sd

On Tue, Apr 5, 2011 at 3:47 PM, Jonathan Ellis jbel...@gmail.com wrote:
Oops, I saw EOFException and jumped to scrub. But your EOF is coming from TCP. Something (almost certainly a non-cassandra process) is connecting to the internal Cassandra communication port (the one that defaults to 7000) and disconnecting.

On Mon, Apr 4, 2011 at 4:14 AM, Kazuo YAGI ky...@zynga.co.jp wrote:
> Solution: upgrade to 0.7.4, run scrub
Although I upgraded all my cassandra nodes from 0.7.0 to 0.7.4 and ran nodetool scrub on all keyspaces, the EOFException error messages didn't go away. Do you have any ideas on how to deal with it next? Besides, it would be really useful to know whether or not these error messages are ignorable, because our application has been working well before and after upgrading.
Thanks, Kazuo
Re: LB scenario
I would look at your client and see if it can handle multiple pools. It would try to connect to the first pool; if that fails, retry the other nodes in that pool, and then move on to the next pool. I think Hector has grabbed some of the performance-detection features from the DynamicSnitch, so it may be a good place to start looking.

Hope that helps.
Aaron

On 7 Apr 2011, at 03:26, A J wrote:
I have done some more research. My question now is:
1. From my tests I see that it matters a lot whether the coordinator node is local (geographically) to the client or not. I will have a cassandra cluster where nodes will be distributed across the globe, say 3 on the US east coast, 3 on the US west coast and 3 in Europe. Now, if I can help it, I would like to have most of the traffic from my California clients handled by the west-coast nodes. But in case the west-coast nodes are down or slow, the coordinator node can be elsewhere. What is the best strategy to give different weight to different nodes, where some nodes are preferred over the others?
Thanks.

On Tue, Apr 5, 2011 at 2:23 PM, Peter Schuller peter.schul...@infidyne.com wrote:
> Can someone comment on this ? Or is the question too vague ?
Honestly yeah I couldn't figure out what you were asking ;) What specifically about the diagram are you trying to convey?
-- / Peter Schuller
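As an illustration of the multiple-pool idea (a hand-rolled sketch, not a feature of any particular client; the host lists and helper are made up), the client can try the geographically local pool first and only fall back to remote coordinators when that fails:

import pycassa

POOLS = [
    ['west1:9160', 'west2:9160', 'west3:9160'],   # preferred: US west coast
    ['east1:9160', 'east2:9160', 'east3:9160'],   # fallback: US east coast
    ['eu1:9160', 'eu2:9160', 'eu3:9160'],         # last resort: Europe
]

def get_column_family(keyspace, cf_name):
    # return a ColumnFamily backed by the closest pool that accepts connections
    last_error = None
    for servers in POOLS:
        try:
            pool = pycassa.connect(keyspace, servers)
            return pycassa.ColumnFamily(pool, cf_name)
        except Exception, e:    # narrow this in real code
            last_error = e
    raise last_error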
Re: Flush / Snapshot Triggering Full GCs, Leaving Ring
2011/4/7 Jonathan Ellis jbel...@gmail.com:
Hypothesis: it's probably the flush causing the CMS, not the snapshot linking.
Confirmation possibility #1: Add a logger.warn to CLibrary.createHardLinkWithExec -- with JNA enabled it shouldn't be called, but let's rule it out.
Confirmation possibility #2: Force some flushes w/o snapshot.
Either way: concurrent mode failure is the easy GC problem. Hopefully you really are seeing mostly that -- this means the JVM didn't start CMS early enough, so it ran out of space before it could finish the concurrent collection, and it falls back to stop-the-world. The fix is a combination of reducing XX:CMSInitiatingOccupancyFraction and (possibly) increasing heap capacity if your heap is simply too full too much of the time. You can also mitigate it by increasing the phi threshold for the failure detector, so the node doing the GC doesn't mark everyone else as dead.
(Eventually your heap will fragment and you will see STW collections due to promotion failed, but you should see that much less frequently. GC tuning to reduce fragmentation may be possible based on your workload, but that's out of scope here, and in any case the real fix for that is https://issues.apache.org/jira/browse/CASSANDRA-2252.)

Jonathan, do you have plans to backport this to the 0.7 branch? (It's very hard to tune CMS, and if people are novices in Java this task becomes much harder.)
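For reference (a sketch of the knobs mentioned above, with illustrative values rather than recommendations), the CMS trigger is a JVM option in conf/cassandra-env.sh and the failure-detector sensitivity is set in cassandra.yaml:

# conf/cassandra-env.sh -- start CMS earlier so it can finish before the heap fills
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=60"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

# conf/cassandra.yaml -- tolerate longer GC pauses before marking peers down
phi_convict_threshold: 10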
Reappearing nodes
Hi all, dead nodes which I removed via nodetool's removetoken are reappearing in the ring after a few days. Is this normal? Is there a way to remove them for good? Thanks, Viliam
Re: CL.ONE reads / RR / badness_threshold interaction
Nice explanation. I wanted to add that the importance of being the first node in the ordered node list, even for CL ONE, is that this is the node the data request is sent to, and it has to return before the CL is considered satisfied. E.g. at CL ONE with RR running, the read is sent to all 5 replicas; if 3 digest requests have returned, the coordinator will still be blocking, waiting for the one data request.

Thanks
Aaron

On 7 Apr 2011, at 08:13, Peter Schuller wrote:
Ok, I took this opportunity to look a bit more at this part of the code. My reading of StorageProxy.fetchRows() and related is as follows, but please allow for others to say I'm wrong/missing something (and sorry, this is more a stream of consciousness that is probably more useful to me for learning the code than in answer to your question, but it's probably better to send it than write it and then just discard the e-mail - maybe someone is helped ;)):

The endpoints obtained are sorted by the snitch's sortByProximity, against the local address. If the closest (as determined by that sorting) is the local address, the request is added directly to the local READ stage. In the case of the SimpleSnitch, sortByProximity is a no-op, so the proximity-sorted order should be the ring order. As per the comments in the SimpleSnitch, the intent is to allow non-read-repaired reads to prefer a single endpoint, which improves cache locality. So my understanding is that in the case of the SimpleSnitch, ignoring any effect of the dynamic snitch, you will *not* always read from the local node, because the closest node (since ring order is used) is just whatever is the first node on the ring in the replica set.

In the case of the NetworkTopologyStrategy, it inherits the implementation in AbstractNetworkTopologySnitch, which sorts by AbstractNetworkTopologySnitch.compareEndPoints(), which:
(1) Always prefers itself to any other node. So "myself" is always closest, no matter what.
(2) Else, always prefers a node in the same rack to a node in a different rack.
(3) Else, always prefers a node in the same DC to a node in a different DC.
So in the NTS case, I believe, *disregarding the dynamic snitch*, that you would in fact always read from the coordinator node if that node happens to be part of the replica set for the row. (There is no tie-breaking if neither 1, 2 nor 3 above gives a precedence, and it is sorted with Collections.sort(), which guarantees that the sort is stable. So for nodes where rack/DC awareness does not come into play, it should result in the ring order as with the SimpleSnitch.)

Now; so far this only determines the order of endpoints after proximity sorting. fetchRows() will route to itself directly, without messaging, if the closest node is itself. This determines from which node we read the *data* (not digest).

Moving back to endpoint selection: after sorting by proximity, the list is actually filtered by getReadCallback. This is what determines how many nodes will receive a request. If read repair doesn't happen, it'll be whatever is implied by the consistency level (so only one for CL.ONE). If read repair does happen, all endpoints are included and so none is filtered out.

Moving back out into fetchRows(), we're now past the sending or local scheduling of the data read. It then loops over the remainder (1 through last of handler.endpoints), submitting digest read messages to each endpoint (either local or remote).
We've now determined (1) which node to send the data request to, and (2) which nodes, if any, to send digest reads to (regardless of whether that is due to read repair or consistency level requirements).

Now fetchRows() proceeds to iterate over all the ReadCallbacks, calling get() on each. This is where digest mismatch exceptions are raised if relevant. CL.ONE seems special-cased in the sense that if the number of responses to block/wait for is exactly 1, the data is returned without resolving to check for digest mismatches (once responses come back later on, read repair is triggered by ReadCallback.maybeResolveForRepair). In the case of CL ONE, a digest mismatch can be raised immediately, in which case fetchRows() triggers read repair.

Now: case (C) as I have described it does not, however, allow for any notion of 'pinning' as mentioned for dynamic_snitch_badness_threshold:

# if set greater than zero and read_repair_chance is 1.0, this will allow
# 'pinning' of replicas to hosts in order to increase cache capacity.
# The badness threshold will control how much worse the pinned host has to be
# before the dynamic snitch will prefer other replicas over it. This is
# expressed as a double which represents a percentage. Thus, a value of
# 0.2 means Cassandra would continue to prefer the static snitch values
# until the pinned host was 20% worse than the fastest.

If you look at DynamicEndpointSnitch.sortByProximity(), it branches into
Re: CL.ONE reads / RR / badness_threshold interaction
Peter, thank you for the extremely detailed reply. To now answer my own question, the critical points that differ from what I said earlier are: CL.ONE does prefer *one* node (which one depends on the snitch), and RR uses digests (which are not mentioned on the wiki page [1]) instead of comparing full data responses.

Totally tangential, but in the case of CL.ONE with narrow rows, making the request to all replicas and taking the fastest would probably be better, though having things work both ways depending on row size sounds painfully complicated. (As Aaron points out, this is not how things work now.) I am assuming that RR digests save on bandwidth, but to generate the digest on a row cache miss the same number of disk seeks is required (my nemesis is disk IO).

So to increase pinny-ness I'll further reduce RR chance and set a badness threshold. Thanks all.

[1] http://wiki.apache.org/cassandra/ReadRepair
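For reference (illustrative values only), both knobs from that plan are straightforward to set. The badness threshold lives in cassandra.yaml:

# cassandra.yaml
dynamic_snitch: true
dynamic_snitch_badness_threshold: 0.2

and read repair chance is a per-column-family attribute, e.g. in the 0.7 CLI (column family name is an example):

update column family MyCF with read_repair_chance = 0.1;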
Re: Stress tests failed with secondary index
Can you turn the logging up to DEBUG level and look for a message from CassandraServer that says ... timed out ? Also check the thread pool stats with nodetool tpstats to see if the node is keeping up.
Aaron

On 7 Apr 2011, at 13:43, Sheng Chen wrote:
Thank you Aaron. It does not seem to be an overload problem. I have 16 cores and 48G of RAM on the single node, and I reduced the concurrent threads to 1. Still, it just suddenly dies of a timeout, while the CPU, RAM and disk load are below 10% and the write latency is about 0.5ms for the past 10 minutes, which is really fast. No logs of dropped messages are found.

2011/4/7 aaron morton aa...@thelastpickle.com:
TimedOutException means that fewer than CL nodes responded to the coordinator before the rpc_timeout. So it was overloaded. Which makes sense when you say it only happens with secondary indexes.
Consider things like:
- reducing the throughput
- reducing the number of clients
- ensuring the clients are connecting to all nodes in the cluster.
You will probably find some logs about dropped messages on some nodes.
Aaron

On 6 Apr 2011, at 20:39, Sheng Chen wrote:
I used the py_stress module to insert 10m rows of test data with a secondary index. I got the following exceptions.

# python stress.py -d xxx -o insert -n 1000 -c 5 -s 34 -C 5 -x keys
total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
265322,26532,26541,0.00186140829433,10
630300,36497,36502,0.00129331431204,20
986781,35648,35640,0.0013310986218,30
1332190,34540,34534,0.00135942295893,40
1473578,14138,14138,0.00142941070007,50
Process Inserter-38:
Traceback (most recent call last):
  File /usr/lib64/python2.4/site-packages/multiprocessing/process.py, line 237, in _bootstrap
    self.run()
  File stress.py, line 242, in run
    self.cclient.batch_mutate(cfmap, consistency)
  File /root/apache-cassandra-0.7.4-src/interface/thrift/gen-py/cassandra/Cassandra.py, line 784, in batch_mutate
TimedOutException: TimedOutException(args=())
    self.run()
  File stress.py, line 242, in run
    self.recv_batch_mutate()
  File /root/apache-cassandra-0.7.4-src/interface/thrift/gen-py/cassandra/Cassandra.py, line 810, in recv_batch_mutate
    raise result.te

Tests without the secondary index are OK at about 40k ops/sec. There is a `GC for ParNew` of about 200ms taking place every second. Does it matter? The same GC of about 400ms happens every 2 seconds, which does not hurt the inserts without the secondary index.
Thanks in advance for any advice.
Sheng
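For the record, the two checks suggested above look roughly like this on a stock 0.7 install (paths and host are examples):

# conf/log4j-server.properties -- raise logging to DEBUG
log4j.rootLogger=DEBUG,stdout,R

# watch the thread pools for pending/dropped work
nodetool -h localhost tpstats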
Re: Reappearing nodes
Sounds like this:
http://comments.gmane.org/gmane.comp.db.cassandra.user/14498
https://issues.apache.org/jira/browse/CASSANDRA-2371

Aaron

On 7 Apr 2011, at 22:36, Viliam Holub wrote:
Hi all, dead nodes which I removed via nodetool's removetoken are reappearing in the ring after a few days. Is this normal? Is there a way to remove them for good? Thanks, Viliam
reoccurring exceptions seen
These types of exceptions are seen sporadically in our cassandra logs. They occur especially after running a repair with nodetool. I assume there are a few corrupt rows. Is this cause for panic? Will a repair fix this, or is it best to do a decommission + bootstrap via a move, for example? Or would a scrub help here?

ERROR [CompactionExecutor:1] 2011-04-07 15:51:12,093 PrecompactedRow.java (line 82) Skipping row DecoratedKey(36813508603227779893025154359070714012, 32326437643439642d623566332d346433392d613334622d343738643433633130383633) in /var/lib/cassandra/data/DFS/main-f-164-Data.db
java.io.EOFException
	at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383)
	at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
	at org.apache.cassandra.io.util.BufferedRandomAccessFile.readBytes(BufferedRandomAccessFile.java:268)
	at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:310)
	at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:267)
	at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:76)
	at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
	at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
	at org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:176)
	at org.apache.cassandra.io.PrecompactedRow.init(PrecompactedRow.java:78)
	at org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:139)
	at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:108)
	at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)
	at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
	at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
	at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
	at org.apache.cassandra.db.CompactionManager.doValidationCompaction(CompactionManager.java:803)
	at org.apache.cassandra.db.CompactionManager.access$800(CompactionManager.java:56)
	at org.apache.cassandra.db.CompactionManager$6.call(CompactionManager.java:358)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
ERROR [CompactionExecutor:1] 2011-04-07 15:51:26,356
INFO [MigrationStage:1] 2011-03-11 17:20:10,900 Migration.java (line 136) Applying migration 6f6e2a6c-4bfb-11e0-a3ae-87e4c47e8541 Add keyspace: DFS rep factor:2 rep strategy:NetworkTopologyStrategy{org.apache.cassandra.config.CFMetaData@2a4bd173[cfId=1000,tableName=DFS,cfName=main,cfType=Standard,comparator=org.apache.cassandra.db.marshal.BytesType@c16c2c0,subcolumncomparator=null,c...skipping...
	at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
	at org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:176)
	at org.apache.cassandra.io.PrecompactedRow.init(PrecompactedRow.java:78)
	at org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:139)
	at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:108)
	at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)
	at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
	at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
	at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
	at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:449)
	at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:124)
	at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:94)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
Re: Flush / Snapshot Triggering Full GCs, Leaving Ring
No, 2252 is not suitable for backporting to 0.7.

On Thu, Apr 7, 2011 at 7:33 AM, ruslan usifov ruslan.usi...@gmail.com wrote:
2011/4/7 Jonathan Ellis jbel...@gmail.com:
Hypothesis: it's probably the flush causing the CMS, not the snapshot linking.
Confirmation possibility #1: Add a logger.warn to CLibrary.createHardLinkWithExec -- with JNA enabled it shouldn't be called, but let's rule it out.
Confirmation possibility #2: Force some flushes w/o snapshot.
Either way: concurrent mode failure is the easy GC problem. Hopefully you really are seeing mostly that -- this means the JVM didn't start CMS early enough, so it ran out of space before it could finish the concurrent collection, and it falls back to stop-the-world. The fix is a combination of reducing XX:CMSInitiatingOccupancyFraction and (possibly) increasing heap capacity if your heap is simply too full too much of the time. You can also mitigate it by increasing the phi threshold for the failure detector, so the node doing the GC doesn't mark everyone else as dead.
(Eventually your heap will fragment and you will see STW collections due to promotion failed, but you should see that much less frequently. GC tuning to reduce fragmentation may be possible based on your workload, but that's out of scope here, and in any case the real fix for that is https://issues.apache.org/jira/browse/CASSANDRA-2252.)

Jonathan, do you have plans to backport this to the 0.7 branch? (It's very hard to tune CMS, and if people are novices in Java this task becomes much harder.)

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
codeigniter+phpcassa
For anyone using CodeIgniter and interested: I've written a little library to integrate CodeIgniter with phpcassa and, consequently, Cassandra. It provides you with access to a CodeIgniter $this->db instance that only has the library's methods and phpcassa's.

Follow-up tutorial: http://crlog.info/2011/04/07/apache-cassandra-phpcassa-code-igniter-large-scale-php-app-in-5-minutes/
Library download: https://github.com/zcourts/cassandraci
Secondary Index Updates Break CLI and Client Code Reading
Creating an index, validator, and default validator and then renaming/dropping the index later results in read errors. Is there an easy way around this problem without having to keep an invalid definition for a column that will get deleted or expired?

1) create a secondary index on a column with a validator and a default validator
2) insert a row
3) read and verify the row
4) update the CF/index/name/validator
5) read the CF and get an error (CLI or Pycassa)

CLI commands to create the row and CF/index:

create column family cf_testing with comparator=UTF8Type and default_validation_class=UTF8Type and column_metadata=[{column_name: colour, validation_class: LongType, index_type: KEYS}];
set cf_testing['key']['colour']='1234';
list cf_testing;
update column family cf_testing with comparator=UTF8Type and default_validation_class=UTF8Type and column_metadata=[{column_name: color, validation_class: LongType, index_type: KEYS}];

ERROR from the CLI:

list cf_testing;
Using default limit of 100
---
RowKey: key
invalid UTF8 bytes 04d2

Here is the Pycassa client code that shows this error too (badindex.py):

#!/usr/local/bin/python2.7
import pycassa
import uuid
import sys

def main():
    try:
        keyspace = 'badindex'
        serverPoolList = ['localhost:9160']
        pool = pycassa.connect(keyspace, serverPoolList)
    except:
        print "couldn't get a connection"
        sys.exit()
    cfname = 'cf_testing'
    cf = pycassa.ColumnFamily(pool, cfname)
    results = cf.get_range(start='key', finish='key', row_count=1)
    for key, columns in results:
        print key, '=', columns

if __name__ == '__main__':
    main()
Re: problem with large batch mutation set
On Wed, Apr 6, 2011 at 11:49 PM, Ross Black ross.w.bl...@gmail.com wrote:
Hi,
I am using the thrift client batch_mutate method with Cassandra 0.7.0 on Ubuntu 10.10. When the size of the mutations gets too large, the client fails with the following exception:

Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
	at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
	at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:157)
	at org.apache.cassandra.thrift.Cassandra$Client.send_batch_mutate(Cassandra.java:901)
	at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:889)
	at com.cxense.cxad.core.persistence.cassandra.store.BatchMutationTask.apply(BatchMutationTask.java:78)
	at com.cxense.cxad.core.persistence.cassandra.store.BatchMutationTask.apply(BatchMutationTask.java:30)
	at com.cxense.cassandra.conn.DefaultCassandraConnectionTemplate.execute(DefaultCassandraConnectionTemplate.java:316)
	at com.cxense.cassandra.conn.DefaultCassandraConnectionTemplate.execute(DefaultCassandraConnectionTemplate.java:257)
	at com.cxense.cxad.core.persistence.cassandra.store.AbstractCassandraStore.writeMutations(AbstractCassandraStore.java:492)
	... 39 more
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
	at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
	... 47 more

It took a while for me to discover that this obscure error message was the result of the thrift message exceeding the maximum frame size specified for the Cassandra server (default 15MB) [using TFastFramedTransport with a max frame size of 15728640 bytes]. The poor error message looks like a bug in the Thrift server code, which assumes that any transport exception is a connection failure that should drop the connection.
My main problem is how to ensure that this does not occur again in running code. I could configure the server with a larger frame size, but this size is effectively arbitrary, so there is no guarantee that our code would not occasionally send a mutation that is still too large. My current work-around is to break the mutation list into multiple parts, but to do this correctly I need to track the size of each mutation, which is fairly messy.
Is there some way to configure Thrift or Cassandra to deal with messages that are larger than the max frame size (at either client or server)?

The only way to do that is to set the frame size higher. Messages cannot be bigger than the maximum frame size.
-ryan
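For reference (option names as I recall them from the 0.7 cassandra.yaml; treat this as a sketch and check your own config), the relevant server-side settings are:

# cassandra.yaml
thrift_framed_transport_size_in_mb: 15    # frame size for the framed transport
thrift_max_message_length_in_mb: 16       # must be at least the frame size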
Re: Secondary Index keeping track of column names
You could simulate it, though. Just add a meta column with a boolean value indicating whether the referred-to column is in the row or not, then add an index on that meta column and query for it. E.g.:
Row a: (c=1234), (has_c=yes)
Query: list rows in the CF where has_c=yes

On 06.04.2011 at 18:52, Jonathan Ellis jbel...@gmail.com wrote:
No, 0.7 indexes handle equality queries; you're basically asking for an IS NOT NULL query.

On Wed, Apr 6, 2011 at 11:23 AM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote:
In 0.7.X is there a way to have an automatic secondary index which keeps track of which keys contain a certain column? Right now we are keeping track of this manually so we can quickly get all of the rows which contain a given column; it would be nice if it was automatic.
-Jeremiah

Jeremiah Jordan
Application Developer
Morningstar, Inc.
Morningstar. Illuminating investing worldwide.
+1 312 696-6128 voice
jeremiah.jor...@morningstar.com
www.morningstar.com
This e-mail contains privileged and confidential information and is intended only for the use of the person(s) named above. Any dissemination, distribution, or duplication of this communication without prior written consent from Morningstar is strictly prohibited. If you have received this message in error, please contact the sender immediately and delete the materials from any computer.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
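A rough CLI illustration of that meta-column workaround (column family and column names are made up; syntax as in the 0.7 CLI, where indexed columns support equality queries via get ... where):

create column family docs with comparator=UTF8Type and default_validation_class=UTF8Type
    and column_metadata=[{column_name: has_c, validation_class: UTF8Type, index_type: KEYS}];
set docs['a']['c'] = '1234';
set docs['a']['has_c'] = 'yes';
get docs where has_c = 'yes';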
Re: Flush / Snapshot Triggering Full GCs, Leaving Ring
I'll capture what we're seeing here for anyone else who may look into this in more detail later.

Our standard heap growth is ~300K between collections, with regular ParNew collections happening on average about every 4 seconds. All very healthy. The memtable flush (where we see almost all our CMS activity) seems to have some balloon effect that, despite a 64MB memtable size, causes over 512MB of heap to be consumed in half a second. In addition to the hefty amount of garbage it causes, due to the MaxTenuringThreshold=1 setting most of that garbage seems to spill immediately into the tenured generation, which quickly fills and triggers a CMS. The rate of garbage overflowing to tenured seems to outstrip the speed of the concurrent mark worker, which is almost always interrupted, failing the concurrent collection. However, the tenured collection is usually hugely effective, recovering over half the total heap.

Two questions for the group then:
1) Does this seem like a sane amount of garbage (512MB) to generate when flushing a 64MB table to disk?
2) Is this possibly a case of MaxTenuringThreshold=1 working against cassandra? The flush seems to create a lot of garbage very quickly, such that normal CMS isn't even possible. I'm sure there was a reason to introduce this setting but I'm not sure it's universally beneficial. Is there any history on the decision to opt for immediate promotion rather than using an adaptable number of survivor generations?
Columns values(integer) need frequent updates/ increments
Hi, I am working on a questions-and-answers web app using Cassandra (consider it very similar to the StackOverflow sites). I need to build the reputation system for users of the application: a user's reputation increases when s/he answers somebody's question correctly. So if I keep the reputation scores of users as column values, these columns are updated very frequently, and I end up with several versions of a single column, which I guess is very bad. Similarly for the questions, the number of up-votes will increase very frequently, and hence again I'll get several versions of the same column.

How should I try to minimize this ill effect?

What I thought of: use a separate CF for the reputation system, so that the memtable holds most of the columns containing users' reputation scores. Frequent updates then update the column in the memtable, which means easier reads as well as updates. These reputation columns are small anyway and do not explode in number (each update only replaces another column).
Re: Flush / Snapshot Triggering Full GCs, Leaving Ring
On Thu, Apr 7, 2011 at 2:27 PM, Erik Onnen eon...@gmail.com wrote:
1) Does this seem like a sane amount of garbage (512MB) to generate when flushing a 64MB table to disk?

Sort of -- that's just about exactly the amount of space you'd expect 64MB of serialized data to take in memory. (Not very efficient, I know.) So you would expect that much to become available to GC after a flush.
Also, flush creates a buffer equal to in_memory_compaction_limit, so that will also generate a spike. I think you upgraded from 0.6 -- if the converter turned the row size warning limit into i_m_c_l then it could be much larger.
Otherwise, I'm not sure why flush would consume that much *extra*. Smells like something unexpected in the flush code to me. I don't see anything obvious though. SSTableWriter serializes directly to the output stream without (m)any other allocations.

2) Is this possibly a case of the MaxTenuringThreshold=1 working against cassandra? The flush seems to create a lot of garbage very quickly such that normal CMS isn't even possible. I'm sure there was a reason to introduce this setting but I'm not sure it's universally beneficial. Is there any history on the decision to opt for immediate promotion rather than using an adaptable number of survivor generations?

The history is that, way back in the early days, we used to max it out the other way (MTT=128), but observed behavior is that objects that survive one new-gen collection are very likely to survive forever. This fits with what we expect theoretically: read requests and ephemera from write requests live for a small number of ms, but memtable data is not GC-able until flush. (Row cache data, of course, is effectively unbounded in tenure.) Keeping long-lived data in a survivor space just makes new-gen collections take longer, since you are copying that data back and forth over and over. (We have advised some read-heavy customers to ramp up to MTT=16, so it's not a hard-and-fast rule, but it still feels like a reasonable starting point to me.)

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
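For context (illustrative flags matching, as far as I recall, the defaults 0.7 ships with; not a tuning recommendation), the settings being debated live in conf/cassandra-env.sh:

# conf/cassandra-env.sh
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"   # read-heavy nodes sometimes raise this, e.g. to 16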
Re: RE: batch_mutate failed: out of sequence response
Pelops uses a single connection per operation, taken from a pool that is backed by Apache Commons Pool (assuming you're using Cassandra 0.7). I'm not saying it's perfect, but it's NOT sharing a connection over multiple threads.

Dan Hendry mentioned that he sees these errors. Is he also using Pelops? From his comment about retrying I'd assume not...

--
Dan Washusen

On Thursday, 7 April 2011 at 7:39 PM, Héctor Izquierdo Seliva wrote:
On Wed, 2011-04-06 at 21:04 -0500, Jonathan Ellis wrote:
out of sequence response is thrift's way of saying I got a response for request Y when I expected request X. my money is on using a single connection from multiple threads. don't do that.

I'm not using thrift directly, and my application is single-threaded, so I guess this is Pelops' fault somehow. Since I managed to tame memory consumption the problem has not appeared again, but it always happened during a stop-the-world GC. Could it be that the message was sent instead of being dropped by the server when the client assumed it had timed out?
RE: Abnormal memory consumption
Connect with jconsole and watch the memory consumption graph. Click the force GC button and watch what the low point is; that is how much memory is being used for persistent stuff, and the rest is garbage generated while satisfying queries. Run a query and watch how the graph spikes up; that is how much is needed for the query.

Like others have said, Cassandra isn't using 600MB of RAM - the Java Virtual Machine is using 600MB of RAM, because your settings told it it could. The JVM will use as much memory as your settings allow it to. If you really are putting that little data into your test server, you should be able to tune everything down to only 256MB easily (I do this for test instances of Cassandra that I spin up to run some tests on), maybe further.

-Jeremiah

From: openvictor Open [mailto:openvic...@gmail.com]
Sent: Wednesday, April 06, 2011 7:59 PM
To: user@cassandra.apache.org
Subject: Re: Abnormal memory consumption

Hello Paul,
Thank you for the tip. The random port attribution policy of JMX was really driving me mad! Good to know there is a solution to that problem.
Concerning the rest of the conversation, my only concern is that as an administrator and a student it is hard to constantly watch Cassandra instances so that they don't crash. As much as I love the principle of Cassandra, being constantly afraid of memory consumption is an issue in my opinion. That being said, I took a new 16 GB server today, but I don't want Cassandra to eat up everything if it is not needed, because Cassandra will have some neighbors such as Tomcat and Solr on this server.
And for me it is very weird that on my small instance, where I put a lot of constraints like throughput_memtableInMb = 6, Cassandra uses 600 MB of RAM for 6 MB of data. It seems a little bit of an overkill to me, and so far I have failed to find any information on what this massive overhead can be.
Thank you for your answers and for taking the time to answer my questions.

2011/4/6 Paul Choi paulc...@plaxo.com:
You can use JMX over ssh by doing this:
http://blog.reactive.org/2011/02/connecting-to-cassandra-jmx-via-ssh.html
Basically, you use SSH -D to do dynamic application port forwarding.
In terms of scaling, you'll be able to afford 120GB RAM/node in 3 years if you're successful. Or a machine with much less RAM and flash-based storage. :) Seriously though, the formula in the tuning guidelines is a guideline. You can probably get acceptable performance with much less. If not, you can shard your app such that you host a few CFs per cluster. I doubt you'll need to, though.

From: openvictor Open openvic...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Mon, 4 Apr 2011 18:24:25 -0400
To: user@cassandra.apache.org
Subject: Re: Abnormal memory consumption

Okay, I see. But isn't there a big issue for scaling here? Imagine that I am the developer of a certain very successful website. At year 1 I need 20 CFs; I might need 8 GB of RAM. At year 2 I need 50 CFs because I added functionality to my wonderful website - will I need 20 GB of RAM? And if at year three I have 300 column families, will I need 120 GB of RAM per node? Or did I miss something about memory consumption?
Thank you very much,
Victor

2011/4/4 Peter Schuller peter.schul...@infidyne.com:
> And about production: is 7 GB of RAM sufficient? Or is 11 GB the minimum?
> Thank you for your inputs for the JVM, I'll try to tune that.
Production mem reqs are mostly dependent on memtable thresholds:
http://www.datastax.com/docs/0.7/operations/tuning
If you enable key caching or row caching, you will have to adjust accordingly as well.
-- / Peter Schuller
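As a concrete sketch of the "tune it down for a test box" suggestion (values are illustrative; in 0.7 the heap is set by overriding the variables in conf/cassandra-env.sh):

# conf/cassandra-env.sh -- cap the JVM heap for a small test instance
MAX_HEAP_SIZE="256M"
HEAP_NEWSIZE="64M"

The per-column-family memtable thresholds, the other big consumer mentioned in the tuning doc linked above, can then be lowered as well.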
weird error when connecting to cassandra mbean proxy
Hi All,
I have written code for connecting to the MBean server running on a cassandra node. I get the following error:

Exception in thread main java.lang.reflect.UndeclaredThrowableException
	at $Proxy1.getReadOperations(Unknown Source)
	at com.smeet.cassandra.CassandraJmxHttpServerMy.init(CassandraJmxHttpServerMy.java:72)
	at com.smeet.cassandra.CassandraJmxHttpServerMy.main(CassandraJmxHttpServerMy.java:77)
Caused by: javax.management.InstanceNotFoundException: org.apache.cassandra.service:type=StorageProxy
	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1118)
	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:679)
	at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:672)
	at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
	at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:90)

I have attached the code file. Cassandra is running on the port I am trying to connect to.
Please suggest.
Thanks,
Anurag

(Attachment: CassandraJmxHttpServerMy.java)
Re: weird error when connecting to cassandra mbean proxy
The correct object name is org.apache.cassandra.db:type=StorageProxy

-Naren

On Thu, Apr 7, 2011 at 4:36 PM, Anurag Gujral anurag.guj...@gmail.com wrote:
Hi All, I have written code for connecting to the MBean server running on a cassandra node. I get the following error:

Exception in thread main java.lang.reflect.UndeclaredThrowableException
	at $Proxy1.getReadOperations(Unknown Source)
	at com.smeet.cassandra.CassandraJmxHttpServerMy.init(CassandraJmxHttpServerMy.java:72)
	at com.smeet.cassandra.CassandraJmxHttpServerMy.main(CassandraJmxHttpServerMy.java:77)
Caused by: javax.management.InstanceNotFoundException: org.apache.cassandra.service:type=StorageProxy
	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1118)
	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:679)
	at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:672)
	at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
	at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:90)

I have attached the code file. Cassandra is running on the port I am trying to connect to.
Please suggest.
Thanks, Anurag

--
Narendra Sharma
Solution Architect
http://www.persistentsys.com
http://narendrasharma.blogspot.com/
Re: Secondary Index Updates Break CLI and Client Code Reading :: DebugLog Attached
Addressed on the issue you created, https://issues.apache.org/jira/browse/CASSANDRA-2436.

On Thu, Apr 7, 2011 at 12:19 PM, Fryar, Dexter dexter.fr...@hp.com wrote:
I have also attached the debug log with each step. I've even tried going back and updating the CF with the old index, to no avail. You can insert/write all you want, but reads will fail if you come across a row that hit one of these cases.

log4j-server.properties
log4j.rootLogger=DEBUG,stdout,R

-----Original Message-----
From: Fryar, Dexter
Sent: Thursday, April 07, 2011 11:19 AM
To: user@cassandra.apache.org
Subject: Secondary Index Updates Break CLI and Client Code Reading

Creating an index, validator, and default validator and then renaming/dropping the index later results in read errors. Is there an easy way around this problem without having to keep an invalid definition for a column that will get deleted or expired?

1) create a secondary index on a column with a validator and a default validator
2) insert a row
3) read and verify the row
4) update the CF/index/name/validator
5) read the CF and get an error (CLI or Pycassa)

CLI commands to create the row and CF/index:

create column family cf_testing with comparator=UTF8Type and default_validation_class=UTF8Type and column_metadata=[{column_name: colour, validation_class: LongType, index_type: KEYS}];
set cf_testing['key']['colour']='1234';
list cf_testing;
update column family cf_testing with comparator=UTF8Type and default_validation_class=UTF8Type and column_metadata=[{column_name: color, validation_class: LongType, index_type: KEYS}];

ERROR from the CLI:

list cf_testing;
Using default limit of 100
---
RowKey: key
invalid UTF8 bytes 04d2

Here is the Pycassa client code that shows this error too (badindex.py):

#!/usr/local/bin/python2.7
import pycassa
import uuid
import sys

def main():
    try:
        keyspace = 'badindex'
        serverPoolList = ['localhost:9160']
        pool = pycassa.connect(keyspace, serverPoolList)
    except:
        print "couldn't get a connection"
        sys.exit()
    cfname = 'cf_testing'
    cf = pycassa.ColumnFamily(pool, cfname)
    results = cf.get_range(start='key', finish='key', row_count=1)
    for key, columns in results:
        print key, '=', columns

if __name__ == '__main__':
    main()

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
CLI does not list data after upgrading to 0.7.4
Hello,
I just upgraded a 1-node setup from rc2 to 0.7.4 and ran scrub without any errors. Now 'list CF' in the CLI does not return any data, as follows:

list User;
Using default limit of 100
Input length = 1

I don't see any errors or exceptions in the log. If I run the CLI from 0.7.0 against the 0.7.4 server, I do get data.
Thanks