Re: Concurrency Control
It's the timestamps provided in the columns that do concurrency control/conflict resolution. Basically, the newer timestamp wins. For counters I think there is no such mechanism (i.e. counter updates are not idempotent).

From https://wiki.apache.org/cassandra/DataModel :

All values are supplied by the client, including the 'timestamp'. This means that clocks on the clients should be synchronized (in the Cassandra server environment it is useful also), as these timestamps are used for conflict resolution. In many cases the 'timestamp' is not used in client applications, and it becomes convenient to think of a column as a name/value pair. For the remainder of this document, 'timestamps' will be elided for readability. It is also worth noting the name and value are binary values, although in many applications they are UTF8 serialized strings. Timestamps can be anything you like, but microseconds since 1970 is a convention. Whatever you use, it must be consistent across the application, otherwise earlier changes may overwrite newer ones.

2012/5/28 Helen live42...@gmx.ch:

Hi, what kind of concurrency control method is used in Cassandra? I found out so far that it's not done with the MVCC method and that no vector clocks are being used. Thanks, Helen

-- Filipe Gonçalves
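The last-write-wins rule Filipe describes can be sketched in a few lines. This is a toy model for illustration only, not Cassandra's actual code; the Column class and the values are made up:

```python
import time

class Column:
    """Toy model of Cassandra's per-column conflict resolution:
    the write carrying the highest client-supplied timestamp wins."""
    def __init__(self):
        self.value = None
        self.timestamp = -1

    def write(self, value, timestamp=None):
        # Clients supply the timestamp; microseconds since 1970 is the convention.
        if timestamp is None:
            timestamp = int(time.time() * 1_000_000)
        # A write with an older (or equal) timestamp loses and is silently dropped.
        if timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp

col = Column()
col.write("first", timestamp=100)
col.write("late arrival", timestamp=50)   # stale client clock: ignored
col.write("second", timestamp=200)
print(col.value)  # -> second
```

This also shows why unsynchronized client clocks are dangerous: the "late arrival" write is lost even though it was issued last in real time.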
Re: Retrieving old data version for a given row
I have further questions:
- Is there any other way to extract the content of an SSTable, for example by writing a Java program, instead of using sstable2json?
- I tried to get tombstones using the Thrift API, but it seems not to be possible, is that right? When I try, the program throws an exception.

Thanks in advance. Regards, Felipe Mathias Schmidt (Computer Science UFRGS, RS, Brazil)

2012/5/24 aaron morton aa...@thelastpickle.com:

Ok... it's really strange to me that Cassandra doesn't support data versioning, because all the other key-value databases support it (at least those that I know).

You can design it into your data model if you need it.

I have one remaining question: in the case that I have more than 1 SSTable on disk for the same column but with different data versions, is it possible to make a query to get the old version instead of the newest one?

No. There is only ever 1 value for a column. The older copies of the column in the SSTables are artefacts of the immutable on-disk structures. If you want to see what's inside an SSTable use bin/sstable2json.

Cheers
- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 24/05/2012, at 9:42 PM, Felipe Schmidt wrote:

Ok... it's really strange to me that Cassandra doesn't support data versioning, because all the other key-value databases support it (at least those that I know). I have one remaining question: in the case that I have more than 1 SSTable on disk for the same column but with different data versions, is it possible to make a query to get the old version instead of the newest one? Regards, Felipe Mathias Schmidt (Computer Science UFRGS, RS, Brazil)

2012/5/16 Dave Brosius dbros...@mebigfatguy.com:

You're in for a world of hurt going down that rabbit hole.
If you truly want version data then you should think about changing your keying to perhaps be a composite key, where the key is of the form NaturalKey/VersionId. Or, if you want the versioning at the column level, use composite columns with a ColumnName/VersionId format.

On 05/16/2012 10:16 AM, Felipe Schmidt wrote:

That was very helpful, thank you very much! I still have some questions:
- Is it possible to make Cassandra keep old value data after flushing? The same question for the memtable, before flushing. It seems to me that when I update some tuple, the old data will be overwritten in the memtable, even before flushing.
- Is it possible to scan values from the memtable, maybe using the so-called Thrift API? Using the client API I can just see the newest data version, I can't see what's really happening with the memtable.

I ask that because what I'll try to do is Change Data Capture for Cassandra, and the answers will define what kind of approaches I'm able to use. Thanks in advance. Regards, Felipe Mathias Schmidt (Computer Science UFRGS, RS, Brazil)

2012/5/14 aaron morton aa...@thelastpickle.com:

Cassandra does not provide access to multiple versions of the same column. It is essentially an implementation detail. All mutations are written to the commit log in a binary format, see o.a.c.db.RowMutation.getSerializedBuffer() (if you want to tail it for analysis you may want to change commitlog_sync in cassandra.yaml). Here is a post about looking at multiple versions of columns in an sstable: http://thelastpickle.com/2011/05/15/Deletes-and-Tombstones/ Remember that not all versions of a column are written to disk (see http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/). Also compaction will compress multiple versions of the same column from multiple files into a single version in a single file. Hope that helps.
- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 14/05/2012, at 9:50 PM, Felipe Schmidt wrote:

Yes, I need this information just for academic purposes. So, to read old data values, I tried to open the commit log using tail -f and also the log file viewer of Ubuntu, but I cannot see much information inside the log! Is there any other way to open this log? I didn't find any Cassandra API for this purpose. Thanks everybody in advance. Regards, Felipe Mathias Schmidt (Computer Science UFRGS, RS, Brazil)

2012/5/14 zhangcheng2 zhangche...@software.ict.ac.cn:

After compaction, the old version data will be gone!

zhangcheng2

From: Felipe Schmidt Date: 2012-05-14 05:33 To: user Subject: Retrieving old data version for a given row

I'm trying to retrieve an old data version for some row but it seems not to be possible. I'm a beginner with Cassandra and the only approach I know is looking at the SSTables in the storage folder, but if I insert some column and right after insert another value into the same row, after flushing, I only get the last value. Is there any way to get the old data
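Dave's NaturalKey/VersionId suggestion can be sketched with a plain dict standing in for a column family. Purely illustrative; the key names and helper functions are made up:

```python
store = {}  # (natural_key, version_id) -> value, standing in for a column family

def put(natural_key, version_id, value):
    # Each version gets its own row key, so old versions are never overwritten.
    store[(natural_key, version_id)] = value

def latest(natural_key):
    # Collect the version ids seen for this natural key and return the newest value.
    versions = [ver for (nk, ver) in store if nk == natural_key]
    return store[(natural_key, max(versions))] if versions else None

put("doc42", 1, "draft")
put("doc42", 2, "final")
print(latest("doc42"))          # -> final
print(store[("doc42", 1)])      # old versions stay addressable -> draft
```

The same shape works at the column level with composite column names (ColumnName/VersionId), where a slice query returns all versions of one logical column.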
Moving to 1.1
I haven't tracked the mailing list since 1.1-rc came out, and now I have several questions.

1) We want to upgrade from 1.0.9. How stable is 1.1? I mean working under high load, running compactions and clean-ups? Is it faster than 1.0.9?

2) If I want to use Hector as a Cassandra client, which version is better for 1.1? Is it ok to use 0.8.0-3? We're kind of stuck on this Hector release because newer versions support serialization of Doubles (and some other types, but doubles are 50% of our data). So we can't read old data: double values were serialized as objects and can't be deserialized as doubles. We can override the default serializer with its older version and keep working with serialized objects... but it looks rather stupid. Did anyone run into such a problem? And I didn't find any change lists for Hector, so such backward incompatibility was quite a surprise. Anybody know of some other breaking changes since 0.8.0-3?

3) Java 7 is now recommended for use by Oracle. We have several developers who have been running local Cassandra instances on it for a while without problems. Anybody tried it in production? Some time ago Java 7 wasn't recommended for use with Cassandra; what's the situation now?

p.s. sorry for my 'english'

Thanks, Sergey B.
Renaming a keyspace in 1.1
Is it possible? How?
Re: commitlog_sync_batch_window_in_ms change in 0.7
Thank you all. We're planning to move soon to a more advanced version. But for now I have a lot of data on my 0.7 cluster which I don't want to lose, just because of some schema error on restart etc. I don't mind losing any writes during the shutdown; however, losing ALL the data would require me to run the setup script for my experiments for several days, something I obviously want to avoid.

On Wed, May 30, 2012 at 8:29 AM, Pierre Chalamet pie...@chalamet.net wrote:

You'd better use version 1.0.9 (using this one in production) or 1.0.10. 1.1 is still a bit young to be ready for prod, unfortunately.

--Original Message-- From: Rob Coli To: user@cassandra.apache.org To: osish...@gmail.com ReplyTo: user@cassandra.apache.org Subject: Re: commitlog_sync_batch_window_in_ms change in 0.7 Sent: May 30, 2012 03:12

On Mon, May 28, 2012 at 6:53 AM, osishkin osishkin osish...@gmail.com wrote: I'm experimenting with Cassandra 0.7 for some time now.

I feel obligated to recommend that you upgrade to Cassandra 1.1. Cassandra 0.7 is better than 0.6, but I definitely still wouldn't be experimenting with this old version in 2012.

=Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb

- Pierre
tokens and RF for multiple phases of deployment
Hi all, We are planning to deploy a small cluster with 4 nodes in one DC first, then expand that to 8 nodes, then add another DC with 8 nodes for failover (not active-active), so all the traffic will go to the 1st cluster and switch to the 2nd cluster if the whole 1st cluster is down or under maintenance.

Could you provide some guidance on how to assign the tokens in these growing deployment phases? I looked at some docs but am not very clear on how to assign tokens in the failover case. Also, if we use the same RF (3) in both DCs, and use EACH_QUORUM for writes and LOCAL_QUORUM for reads, can the read also reach the 2nd cluster? We'd like to keep both writes and reads on the same cluster.

Thanks in advance, Chong
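For the token part of the question, a common recipe with RandomPartitioner is to space tokens evenly within each DC and offset the second DC's tokens by a small constant so tokens stay globally unique. A sketch of the arithmetic (assumes RandomPartitioner's 2^127 token space; the offset-by-one trick is a common convention, not the only valid one):

```python
RING = 2 ** 127  # RandomPartitioner token space

def tokens(node_count, offset=0):
    # Evenly spaced initial tokens for one DC. NTS effectively treats each DC
    # as its own ring, so each DC can be spaced independently; the offset just
    # keeps tokens globally unique across DCs.
    return [(i * RING // node_count + offset) % RING for i in range(node_count)]

dc1_phase1 = tokens(4)            # phase 1: 4 nodes in DC1
dc1_phase2 = tokens(8)            # phase 2: 8 nodes; the original 4 tokens are
                                  # reused, so only new nodes need to move
dc2 = tokens(8, offset=1)         # phase 3: failover DC, offset to avoid collisions

assert set(dc1_phase1) <= set(dc1_phase2)   # doubling keeps existing tokens
assert not set(dc1_phase2) & set(dc2)       # no token collisions across DCs
```

Note that doubling from 4 to 8 nodes is convenient precisely because the old tokens survive: each new node bisects an existing range.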
Re: Moving to 1.1
1) Stable is a hard word to define. History shows it is better to let anything .0 burn in a bit. If you are pre-production it probably does not matter; otherwise I would say play it safe. Wait for a .1 or .2, or until the .0 has been in the wild for a few weeks.

2) I worked on one of the patches to get Hector working with 1.1; there is a specific release, especially for those creating meta-data.

3) We are slowly migrating our environment to Java 1.7. The only issue we have run into is https://issues.apache.org/jira/browse/CASSANDRA-4275 which is just a setting to tune. Anecdotally I see what could be better performance with 1.7 (but I also did a kernel update), so I would not call it essential.

Edward

On Wed, May 30, 2012 at 7:08 AM, Vanger disc...@gmail.com wrote:

I haven't tracked the mailing list since 1.1-rc came out, and now I have several questions. 1) We want to upgrade from 1.0.9. How stable is 1.1? I mean working under high load, running compactions and clean-ups? Is it faster than 1.0.9? 2) If I want to use Hector as a Cassandra client, which version is better for 1.1? Is it ok to use 0.8.0-3? We're kind of stuck on this Hector release because newer versions support serialization of Doubles (and some other types, but doubles are 50% of our data). So we can't read old data: double values were serialized as objects and can't be deserialized as doubles. We can override the default serializer with its older version and keep working with serialized objects... but it looks rather stupid. Did anyone run into such a problem? And I didn't find any change lists for Hector, so such backward incompatibility was quite a surprise. Anybody know of some other breaking changes since 0.8.0-3? 3) Java 7 is now recommended for use by Oracle. We have several developers running local Cassandra instances on it for a while without problems. Anybody tried it in production? Some time ago Java 7 wasn't recommended for use with Cassandra; what's the situation now? p.s. sorry for my 'english' Thanks, Sergey B.
Cassandra 1.1.1 release?
Anyone have a rough idea of when Cassandra 1.1.1 is likely to be released? -Roland
Re: Replication factor
Ah. The lack of page cache hits after compaction makes sense. But I don't think the drastic effect it appears to have is expected. Do you have an idea of how much slower local reads get?

If you are selecting coordinators based on token ranges, the DS is not as much use. It still has some utility, as the digest reads will be happening on other nodes and it should help with selecting them.

Thanks for the extra info. Aaron

- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 25/05/2012, at 1:24 AM, Viktor Jevdokimov wrote:

All data is in the page cache. No repairs. Compactions not hitting disk for reads. CPU 50%. ParNew GC 100 ms on average. After one compaction completes, the new sstable is not in the page cache, and there may be a disk usage spike before the data is cached, so local reads get slower for a moment compared with other nodes. Redirecting almost all requests to other nodes finally ends up with a huge latency spike on almost all nodes, especially when ParNew GC may spike on one node (200ms). We call it a “cluster hiccup”, when incoming and outgoing network traffic drops for a moment. And such hiccups happen several times an hour, a few seconds long. Playing with the badness threshold did not give much better results, but turning the DS off completely fixed all problems with latencies, node spikes, cluster hiccups and network traffic drops. In our case, our client is selecting endpoints for a key by calculating a token, so we always hit a replica.

Best regards / Pagarbiai Viktor Jevdokimov Senior Developer Email: viktor.jevdoki...@adform.com Phone: +370 5 212 3063, Fax +370 5 261 0453 J. Jasinskio 16C, LT-01112 Vilnius, Lithuania Follow us on Twitter: @adforminsider What is Adform: watch this short video

Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential.
If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.

From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Thursday, May 24, 2012 13:00 To: user@cassandra.apache.org Subject: Re: Replication factor

Your experience is that when using CL ONE the Dynamic Snitch moves local reads off to other nodes, and this causes spikes in read latency? Did you notice what was happening on the node for the DS to think it was so slow? Was compaction or repair going on? Have you played with the badness threshold https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L472 ?

Cheers
- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 24/05/2012, at 5:28 PM, Viktor Jevdokimov wrote:

Depends on the use case. For ours we have a different experience and statistics: turning the dynamic snitch off makes overall latency and spikes much, much lower.

Best regards / Pagarbiai Viktor Jevdokimov Senior Developer
From: Brandon Williams [mailto:dri...@gmail.com] Sent: Thursday, May 24, 2012 02:35 To: user@cassandra.apache.org Subject: Re: Replication factor On Wed, May 23, 2012 at 5:51 AM, Viktor Jevdokimov viktor.jevdoki...@adform.com wrote: When RF == number of nodes, and you read at CL ONE you will always be reading locally. “always be reading locally” – only if Dynamic Snitch is “off”. With dynamic snitch “on” request may be redirected to other node, which may introduce latency spikes. Actually it's preventing spikes, since if it won't read locally that means the local replica is in worse shape than the rest (compacting, repairing, etc.) -Brandon
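The badness threshold Aaron mentions works roughly like this: the dynamic snitch scores each replica by recent latency and only routes away from the preferred (e.g. local) replica when that replica's score is worse than the best alternative by more than the threshold. A simplified sketch of the comparison; the real logic in o.a.c.locator.DynamicEndpointSnitch is more involved, and the scores and host names here are invented:

```python
def choose_replica(scores, preferred, badness_threshold=0.1):
    """Pick the replica to read from. Lower score = faster.
    Stick with the preferred (e.g. local) replica unless it is more than
    badness_threshold (a ratio) worse than the best alternative."""
    best = min(scores, key=scores.get)
    if scores[preferred] > scores[best] * (1 + badness_threshold):
        return best       # preferred node is "bad enough": read elsewhere
    return preferred      # otherwise keep reads pinned to the preferred node

scores = {"local": 1.05, "peer-a": 1.0, "peer-b": 1.3}
print(choose_replica(scores, "local"))   # within 10% of the best: stay local

scores["local"] = 2.0                    # e.g. post-compaction page cache misses
print(choose_replica(scores, "local"))   # now reads move to the fastest peer
```

This is why Viktor's symptom (reads stampeding off the freshly compacted node) and Brandon's point (the snitch only moves reads when the local replica really is slower) are both consistent with the same mechanism; the threshold tunes how sticky the preferred replica is.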
Re: what about an hybrid partitioner for CF with composite row key ?
* with the RP: for one ui action, many nodes may be requested, but it's simpler to balance the cluster

Many nodes good. You will have increased availability if the data is more widely distributed.

one sweeter(?) partitioner would be a partitioner that would distribute a row according only to the first part of its key (= according to folder id only).

It would still be unbalanced.

Is it doable to implement such a partitioner?

Sort of, but it's not a good idea. The token created by the partitioner is just some bytes, so technically they can be anything.

Cheers
- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 25/05/2012, at 1:47 AM, DE VITO Dominique wrote:

Hi, We have defined a CF with a composite row key that looks like (folder id, doc id). For our app, one very common pattern is accessing, through one ui action, a bunch of data with the following row keys: (id, id_1), (id, id_2), (id, id_3)... So, multiple rows are accessed, but all row keys have the same 1st part, the folder id.

* with the BOP: for one ui action, only a single node is requested (on average), but it's much harder to balance the cluster nodes
* with the RP: for one ui action, many nodes may be requested, but it's simpler to balance the cluster

one sweeter(?) partitioner would be a partitioner that would distribute a row according only to the first part of its key (= according to folder id only). Is it doable to implement such a partitioner?

Thanks. Regards, Dominique
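The partitioner Dominique asks about, one that tokenizes only the first component of a composite key, can be sketched like this. Illustrative only: a real implementation would implement o.a.c.dht.IPartitioner in Java, and as Aaron notes it would still leave the cluster unbalanced across hot folders:

```python
import hashlib

def token(row_key):
    # Hash only the first key component, so (folder, doc_1), (folder, doc_2), ...
    # all map to the same token and therefore to the same replicas.
    folder_id, _doc_id = row_key
    digest = hashlib.md5(folder_id.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** 127)

t1 = token(("folder9", "doc_1"))
t2 = token(("folder9", "doc_2"))
assert t1 == t2  # same folder -> same token -> one node serves the whole ui action
```

The trade-off is exactly the one in the thread: a single node answers each ui action (the BOP-like benefit), but a folder with many documents becomes one unsplittable hot row range (the BOP-like imbalance).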
Re: Moving to 1.1
On Wed, May 30, 2012 at 4:08 AM, Vanger disc...@gmail.com wrote:

3) Java 7 is now recommended for use by Oracle. We have several developers running local Cassandra instances on it for a while without problems. Anybody tried it in production? Some time ago Java 7 wasn't recommended for use with Cassandra; what's the situation now?

I have a variation of this question, which goes: now that OpenJDK is the official Java reference implementation, are there plans to make Cassandra support it? https://blogs.oracle.com/henrik/entry/moving_to_openjdk_as_the

Cassandra has (had?) a slightly passive-aggressive log message where it refers to any JDK other than Sun's as buggy and suggests that you should upgrade to the Sun JDK. I'm fine with using whatever JDK is technically best, but within the enterprise, using something other than the official reference implementation can be a tough sell. Wondering if people have a view as to the importance and/or feasibility of making OpenJDK supported.

=Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Confusion regarding the terms replica and replication factor
First, note that replication is done at the row level, not at the node level. That line should look more like:

placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {DC1: 1, DC2: 1, DC3: 1}

This means that each row will have one copy in each DC, and within each DC its placement will be according to the partitioner, so it could be on any of the nodes in that DC. So, don't think of it as nodes replicating, but rather as how nodes should store a copy of each row in each DC. Also, replication does not relate to the seed nodes. Seed nodes allow the nodes to find each other initially, but are not special otherwise - any node can be used as a seed node.

So if you had a strategy like:

placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {DC1: 3, DC2: 2, DC3: 1}

each row would exist on 3 of the 4 nodes in DC1, 2 of the 4 nodes in DC2, and on one of the nodes in DC3. Again, with the placement in each DC due to the partitioner, based on the row key.

Jeff

On May 29, 2012, at 11:25 PM, David Fischer wrote:

Ok now i am confused :), ok if i have the following placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {DC1:R1,DC2:R1,DC3:R1} this means in each of my datacenters i will have one full replica that can also be a seed node? if i have 3 nodes in addition to the DC replicas, with normal token calculations a key can be in any datacenter plus on each of the replicas, right? It will show 12 nodes total in its ring

On Thu, May 24, 2012 at 2:39 AM, aaron morton aa...@thelastpickle.com wrote:

This is partly historical. NTS (as it is now) has not always existed and was not always the default. In days gone by, it used to be a fella could run a mighty fine key-value store using just the Simple Replication Strategy. A different way to visualise it is a single ring with a Z axis for the DCs. When you look at the ring from the top you can see all the nodes. When you look at it from the side you can see the nodes are on levels that correspond to their DC.
Simple Strategy looks at the ring from the top. NTS works through the layers of the ring.

If the hierarchy is Cluster - DataCenter - Node, why exactly do we need globally unique node tokens even though nodes are at the lowest level in the hierarchy?

Nodes having a DC is a feature of *some* snitches and utilised by *some* of the replication strategies (and by the messaging system for network efficiency). For background, the mapping from row tokens to nodes is based on http://en.wikipedia.org/wiki/Consistent_hashing

Hope that helps.

- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 24/05/2012, at 1:07 AM, java jalwa wrote:

Thanks Aaron. That makes things clear. So I guess the 0 - 2^127 range for tokens corresponds to a cluster-level, top-level ring, and then you add some logic on top of that with NTS to logically segment that range into sub-rings as per the notion of data centers defined in NTS. What's the advantage of having a single top-level ring? Intuitively it seems like each replication group could have a separate ring, so that the same tokens could be assigned to nodes in different DCs. If the hierarchy is Cluster - DataCenter - Node, why exactly do we need globally unique node tokens even though nodes are at the lowest level in the hierarchy? Thanks again.

On Wed, May 23, 2012 at 3:14 AM, aaron morton aa...@thelastpickle.com wrote:

Now if a row key hash is mapped to a range owned by a node in DC3, will the node in DC3 still store the key as determined by the partitioner and then walk the ring and store 2 replicas each in DC1 and DC2?

No, only nodes in the DCs specified in the NTS configuration will be replicas.

Or will the co-ordinator node be aware of the replica placement strategy, and override the partitioner's decision and walk the ring until it first encounters a node in DC1 or DC2, and then place the remaining replicas?

The NTS considers each DC to have its own ring.
This can make token selection in a multi DC environment confusing at times. There is something in the DS docs about it. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/05/2012, at 3:16 PM, java jalwa wrote: Hi all, I am a bit confused regarding the terms replica and replication factor. Assume that I am using RandomPartitioner and NetworkTopologyStrategy for replica placement. From what I understand, with a RandomPartitioner, a row key will always be hashed and be stored on the node that owns the range to which the key is mapped. http://www.datastax.com/docs/1.0/cluster_architecture/replication#networktopologystrategy. The example here, talks about having 2 data centers and a replication factor of 4 with 2 replicas in each datacenter, so the strategy is configured as DC1:2 and DC2:2. Now suppose I add another
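Aaron's point that NTS treats each DC as its own ring can be sketched as: keep one global ring ordered by token, then for each DC walk the ring clockwise from the row's token, collecting nodes only from that DC. A simplification that ignores racks; tokens, node names and DCs are invented:

```python
from bisect import bisect_right

# (token, node, dc) - one global ring; node tokens are globally unique
ring = sorted([(0, "n1", "DC1"), (10, "n2", "DC2"), (20, "n3", "DC1"),
               (30, "n4", "DC2"), (40, "n5", "DC1"), (50, "n6", "DC2")])

def replicas(row_token, rf_per_dc):
    # Find the first node at or after the row's token, then walk the whole
    # ring once, filling each DC's replica list independently.
    start = bisect_right([t for t, _, _ in ring], row_token) % len(ring)
    found = {dc: [] for dc in rf_per_dc}
    for i in range(len(ring)):
        _, node, dc = ring[(start + i) % len(ring)]
        if dc in found and len(found[dc]) < rf_per_dc[dc]:
            found[dc].append(node)
    return found

print(replicas(12, {"DC1": 2, "DC2": 1}))
# -> {'DC1': ['n3', 'n5'], 'DC2': ['n4']}
```

This is why tokens must be globally unique (they live on one ring) even though each DC's replica walk only ever "sees" its own nodes, which is the sub-ring behaviour java jalwa was asking about.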
Re: Confusion regarding the terms replica and replication factor
Thanks! My misunderstanding was that the snitch names are broken up as DC1:RAC1, and strategy_options takes only the first part of the snitch names?

On Wed, May 30, 2012 at 12:14 PM, Jeff Williams je...@wherethebitsroam.com wrote:

First, note that replication is done at the row level, not at the node level. That line should look more like: placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {DC1: 1, DC2: 1, DC3: 1}. This means that each row will have one copy in each DC, and within each DC its placement will be according to the partitioner, so it could be on any of the nodes in that DC. So, don't think of it as nodes replicating, but rather as how nodes should store a copy of each row in each DC. Also, replication does not relate to the seed nodes. Seed nodes allow the nodes to find each other initially, but are not special otherwise - any node can be used as a seed node. So if you had a strategy like: placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {DC1: 3, DC2: 2, DC3: 1}, each row would exist on 3 of the 4 nodes in DC1, 2 of the 4 nodes in DC2, and on one of the nodes in DC3. Again, with the placement in each DC due to the partitioner, based on the row key.

Jeff
Re: commitlog_sync_batch_window_in_ms change in 0.7
On Tue, May 29, 2012 at 10:29 PM, Pierre Chalamet pie...@chalamet.net wrote: You'd better use version 1.0.9 (using this one in production) or 1.0.10. 1.1 is still a bit young to be ready for prod unfortunately. OP described himself as experimenting which I inferred to mean not-production. I agree with others, 1.0.x is what I'd currently recommend for production. :) =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Confusion regarding the terms replica and replication factor
You can avoid the confusion by using the term natural endpoints. For example, with a replication factor of 3 natural endpoints for key x are node1, node2, node11. The snitch does use the datacenter and the rack but almost all deployments use a single rack per DC, because when you have more then one rack in a data center the NTS snitch has some logic to spread the data between racks. (most people do not want this behavior) On Wed, May 30, 2012 at 3:57 PM, David Fischer fischer@gmail.com wrote: Thanks! My missunderstanding was the snitch names are broken up by DC1:RAC1 and the strategy_options takes only the first part of the snitch names? On Wed, May 30, 2012 at 12:14 PM, Jeff Williams je...@wherethebitsroam.com wrote: First, note that replication is done at the row level, not at the node level. That line should look more like: placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {DC1: 1,DC2: 1,DC3: 1 } This means that each row will have one copy in each DC and within each DC it's placement will be according to the partitioner, so could be on any of the nodes in the each DC. So, don't think of it as nodes replicating, but rather as how nodes should store a copy of each row in each DC. Also, replication does not relate the the seed nodes. Seed nodes allow the nodes to find each other initially, but are not special otherwise - any node can be used as a seed node. So if you had a strategy like: placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {DC1: 3,DC2: 2,DC3: 1 } Each row would exist on 3 of 4 nodes in DC1, 2 of 4 nodes in DC2 and on one of the nodes in DC3. Again, with the placement in each DC due to the partitioner, based on the row key. 
Jeff On May 29, 2012, at 11:25 PM, David Fischer wrote: Ok now i am confused :), ok if i have the following placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {DC1:R1,DC2:R1,DC3:R1 } this means in each of my datacenters i will have one full replica that also can be seed node? if i have 3 node in addition to the DC replica's with normal token calculations a key can be in any datacenter plus on each of the replicas right? It will show 12 nodes total in its ring On Thu, May 24, 2012 at 2:39 AM, aaron morton aa...@thelastpickle.com wrote: This is partly historical. NTS (as it is now) has not always existed and was not always the default. In days gone by used to be a fella could run a mighty fine key-value store using just a Simple Replication Strategy. A different way to visualise it is a single ring with a Z axis for the DC's. When you look at the ring from the top you can see all the nodes. When you look at it from the side you can see the nodes are on levels that correspond to their DC. Simple Strategy looks at the ring from the top. NTS works through the layers of the ring. If the hierarchy is Cluster - DataCenter - Node, why exactly do we need globally unique node tokens even though nodes are at the lowest level in the hierarchy. Nodes having a DC is a feature of *some* snitches and utilised by the *some* of the replication strategies (and by the messaging system for network efficiency). For background, mapping from row tokens to nodes is based on http://en.wikipedia.org/wiki/Consistent_hashing Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 24/05/2012, at 1:07 AM, java jalwa wrote: Thanks Aaron. That makes things clear. So I guess the 0 - 2^127 range for tokens corresponds to a cluster -level top-level ring. and then you add some logic on top of that with NTS to logically segment that range into sub-rings as per the notion of data clusters defined in NTS. 
What's the advantage of having a single top-level ring ? intuitively it seems like each replication group could have a separate ring, so that the same tokens can be assigned to nodes in different DCs. If the hierarchy is Cluster - DataCenter - Node, why exactly do we need globally unique node tokens even though nodes are at the lowest level in the hierarchy. Thanks again. On Wed, May 23, 2012 at 3:14 AM, aaron morton aa...@thelastpickle.com wrote: Now if a row key hash is mapped to a range owned by a node in DC3, will the Node in DC3 still store the key as determined by the partitioner and then walk the ring and store 2 replicas each in DC1 and DC2 ? No, only nodes in the DC's specified in the NTS configuration will be replicas. Or will the co-ordinator node be aware of the replica placement strategy, and override the partitioner's decision and walk the ring until it first encounters a node in DC1 or DC2 ? and then place the remaining replicas ? The NTS considers each DC to have its own ring. This can make token selection in a multi DC environment confusing at times. There is something in the DataStax docs about it. Cheers - Aaron Morton
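The per-DC ring walk that Aaron describes can be sketched in a few lines. This is a simplified illustration of NetworkTopologyStrategy placement, not Cassandra's actual implementation: each DC is treated as its own logical ring, and starting from the row token we walk the full ring clockwise, taking the first RF distinct nodes that belong to each DC (single-rack DCs assumed; node names, tokens and DC names are made up for the example).

```python
# Simplified sketch of NetworkTopologyStrategy replica placement:
# walk the ring clockwise from the row token and, per data center,
# collect the first rf distinct nodes belonging to that DC.

def nts_endpoints(row_token, ring, rf_per_dc):
    """ring: sorted list of (token, node, dc); rf_per_dc: {dc: rf}."""
    # start at the first node whose token is >= the row token,
    # wrapping around to the start of the ring if necessary
    start = next((i for i, (t, _, _) in enumerate(ring) if t >= row_token), 0)
    ordered = ring[start:] + ring[:start]
    replicas = {dc: [] for dc in rf_per_dc}
    for _, node, dc in ordered:
        if dc in replicas and len(replicas[dc]) < rf_per_dc[dc]:
            replicas[dc].append(node)
    return replicas

ring = sorted([
    (0, "node1", "DC1"), (10, "node2", "DC2"), (20, "node3", "DC1"),
    (30, "node4", "DC2"), (40, "node5", "DC1"), (50, "node6", "DC2"),
])
# a row whose token is 35 lands between node4 and node5:
print(nts_endpoints(35, ring, {"DC1": 2, "DC2": 1}))
```

Note how only nodes in the DCs named in `rf_per_dc` are ever chosen, matching Aaron's answer that only DCs in the NTS configuration get replicas.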
Re: Confusion regarding the terms replica and replication factor
On May 30, 2012, at 10:32 PM, Edward Capriolo wrote: The snitch does use the datacenter and the rack but almost all deployments use a single rack per DC, because when you have more than one rack in a data center the NTS snitch has some logic to spread the data between racks. (most people do not want this behavior) Out of curiosity, why would most people not want this behaviour? It seems like a good idea from an availability perspective. Jeff
Re: unknown exception with hector
i'm not sure if using framed transport is an option with hector. http://hector-client.github.com/hector//source/content/API/core/0.8.0-2/me/prettyprint/cassandra/service/CassandraHostConfigurator.html#setUseThriftFramedTransport(boolean) what should i be looking for in the logs to find the cause of these dropped reads? These look like transport errors. If something is happening on the server side it will be logged at ERROR level. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 25/05/2012, at 1:00 PM, Deno Vichas wrote: i'm not sure if using framed transport is an option with hector. what should i be looking for in the logs to find the cause of these dropped reads? thanks, On 5/24/2012 3:04 AM, aaron morton wrote: Dropped read messages occur when the node could not process a read task within rpc_timeout. It generally means the cluster has been overwhelmed at some point: too many requests, too much GC, compaction hurting, etc. Check the server side logs for errors but I doubt it is related to the call stack below. Have you confirmed that you are using framed transport on the client? Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 24/05/2012, at 5:52 PM, Deno Vichas wrote: i've noticed that my nodes seem to have a large (?, not really sure what acceptable numbers are) dropped read count from tpstats. could they be related? On 5/23/2012 2:55 AM, aaron morton wrote: Not sure, but at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132) Looks like the client is not using framed transport. The server defaults to framed. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/05/2012, at 5:35 AM, Deno Vichas wrote: could somebody clue me in to the cause of this exception? i see these randomly. 
AnalyzerService-2 2012-05-22 13:28:00,385 :: WARN cassandra.connection.HConnectionManager - Exception: me.prettyprint.hector.api.exceptions.HectorTransportException: org.apache.thrift.transport.TTransportException at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:39) at me.prettyprint.cassandra.service.KeyspaceServiceImpl$23.execute(KeyspaceServiceImpl.java:851) at me.prettyprint.cassandra.service.KeyspaceServiceImpl$23.execute(KeyspaceServiceImpl.java:840) at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:99) at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:243) at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131) at me.prettyprint.cassandra.service.KeyspaceServiceImpl.getColumn(KeyspaceServiceImpl.java:857) at me.prettyprint.cassandra.model.thrift.ThriftColumnQuery$1.doInKeyspace(ThriftColumnQuery.java:57) at me.prettyprint.cassandra.model.thrift.ThriftColumnQuery$1.doInKeyspace(ThriftColumnQuery.java:52) at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20) at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85) at me.prettyprint.cassandra.model.thrift.ThriftColumnQuery.execute(ThriftColumnQuery.java:51) at com.stocktouch.dao.StockDaoImpl.getHistorical(StockDaoImpl.java:365) at com.stocktouch.dao.StockDaoImpl.getHistoricalQuote(StockDaoImpl.java:433) at com.stocktouch.service.StockHistoryServiceImpl.getHistoricalQuote(StockHistoryServiceImpl.java:480) at com.stocktouch.service.AnalyzerServiceImpl.getClose(AnalyzerServiceImpl.java:180) at com.stocktouch.service.AnalyzerServiceImpl.calcClosingPrices(AnalyzerServiceImpl.java:90) at com.stocktouch.service.AnalyzerServiceImpl.nightlyRollup(AnalyzerServiceImpl.java:66) at 
com.stocktouch.service.AnalyzerServiceImpl$2.run(AnalyzerServiceImpl.java:55) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.thrift.transport.TTransportException at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129) at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at
Re: about multitenant datamodel
- Do a lot of keyspaces cause some problems? (If I have 1,000 users, cassandra creates 1,000 keyspaces…) It's not the keyspaces, but the number of column families. Without storing any data each CF uses about 1MB of ram. When they start storing and reading data they use more. IMHO a model that allows external users to create CF's is a bad one. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 25/05/2012, at 12:52 PM, Toru Inoko wrote: Hi, all. I'm designing a data api service (like cassandra.io but not using a dedicated server for each user) on cassandra 1.1, on which users can run DML/DDL methods like cql. The following are the APIs which users can use (almost the same as the cassandra api). - create/read/delete ColumnFamilies/Rows/Columns Now I'm thinking about a multitenant datamodel on that. My data model is like the following. I'm going to prepare a keyspace for each user as a user's tenant space. | keyspace1 | --- | column family | |(for user1)| | ... | keyspace2 | --- | column family | |(for user2)| | ... The following are my questions: - Is this data model good for multitenancy? - Do a lot of keyspaces cause some problems? (If I have 1,000 users, cassandra creates 1,000 keyspaces...) please, help. thank you in advance. Toru Inoko.
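Since each column family costs roughly 1MB of RAM even when empty, a commonly suggested alternative to a keyspace (and CF set) per user is a single shared column family with tenant-prefixed row keys, so the per-CF overhead is paid once. A minimal sketch of the idea in plain Python (the dict stands in for a column family; this is not a Cassandra client, and the class and separator are illustrative):

```python
# Sketch: multitenancy via key prefixing in one shared column family,
# instead of one keyspace/CF per tenant. Tenant isolation is enforced
# by the key prefix applied on every read and write.

class SharedTenantStore:
    SEP = ":"  # assumes tenant ids never contain the separator

    def __init__(self):
        self.rows = {}  # stand-in for a single shared column family

    def _key(self, tenant, row_key):
        return f"{tenant}{self.SEP}{row_key}"

    def put(self, tenant, row_key, columns):
        self.rows[self._key(tenant, row_key)] = columns

    def get(self, tenant, row_key):
        return self.rows.get(self._key(tenant, row_key))

store = SharedTenantStore()
store.put("user1", "profile", {"name": "Toru"})
store.put("user2", "profile", {"name": "Aaron"})
print(store.get("user1", "profile"))
```

With 1,000 tenants this keeps the CF count constant; the trade-off is that per-tenant schema changes and per-tenant deletion become application-level concerns.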
Re: High CPU load on Cassandra Node
Further, I need to understand whether cassandra does internal reads/writes using thrift over an rpc connection (port 9160) or over port 7000 as for inter-node communication. Maybe that also could be a reason for so many connections on 9160. It uses 7000. What I could see from Ganglia is high CPU load on this server, and also the number of TCP connections on port 9160 is around 600+ all the time. The distribution of these connections says that we have around 90 odd connections from this machine to the machines in each other DC. For port 7000 it's around 45. Could these be hadoop tasks that are still running ? Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 25/05/2012, at 4:51 PM, Shubham Srivastava wrote: I have a multiDC ring with 6 nodes in each DC. I have a single node which runs some jobs (including Hadoop Map-Reduce with PIG) every 15 minutes. Lately there has been high CPU load and memory issues on this node. What I could see from Ganglia is high CPU load on this server, and also the number of TCP connections on port 9160 is around 600+ all the time. The distribution of these connections says that we have around 90 odd connections from this machine to the machines in each other DC. For port 7000 it's around 45. Further, I need to understand whether cassandra does internal reads/writes using thrift over an rpc connection (port 9160) or over port 7000 as for inter-node communication. Maybe that also could be a reason for so many connections on 9160. I have an 8-core machine with 14GB RAM and 8GB heap. rpc min and max threads are default and so are the other rpc based properties. RF: 3 in each DC, Read/Write CL: 1 and Read Repair Chance = 0.1. cassandra version is 0.8.6 Regards, Shubham
Re: Schema changes not getting picked up from different process
What clients are the scripts using ? This sounds like something that should be handled in the client. I would worry about holding a long running connection to a single node. There are several situations where the correct behaviour for a client is to kill a connection and connect to another node. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 26/05/2012, at 12:11 AM, Victor Blaga wrote: Hi Dave, Thank you for your answer. 2012/5/25 Dave Brosius dbros...@mebigfatguy.com What version are you using? I am using version 1.1.0 It might be related to https://issues.apache.org/jira/browse/CASSANDRA-4052 Indeed, the issue you suggested goes in the direction of my problem. However, things are a little bit more complex. I used the cassandra-cli just for this example, although I'm getting this behavior from other clients (I'm using python and ruby scripts). Basically I'm modifying the schema through the ruby script and I'm trying to query and insert data through the python script. Both of the scripts are meant to run forever (sort of daemons) and thus they establish a connection to Cassandra once at start, which is kept alive. I can see from the comments on the issue that keeping a long-lived connection to the Cluster might not be ideal and it would probably be better to reconnect upon executing a set of queries.
Re: Frequent exception with Cassandra 1.0.9
Still getting this ? Was there some more to the message ? Here's an example from the internets http://pastebin.com/WdD7181x it may be an issue with the JVM on windows. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 26/05/2012, at 6:07 AM, Dwight Smith wrote: I am running embedded Cassandra version 1.0.9 on Windows 2008 Server and frequently encounter the following exception: Stack: [0x7dc6,0x7dcb], sp=0x7dcaf0b0, free space=316k Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) j java.io.WinNTFileSystem.getSpace0(Ljava/io/File;I)J+0 j java.io.WinNTFileSystem.getSpace(Ljava/io/File;I)J+10 j java.io.File.getUsableSpace()J+34 j org.apache.cassandra.config.DatabaseDescriptor.getDataFileLocationForTable(Ljava/lang/String;JZ)Ljava/lang/String;+44 j org.apache.cassandra.db.Table.getDataFileLocation(JZ)Ljava/lang/String;+6 j org.apache.cassandra.db.Table.getDataFileLocation(J)Ljava/lang/String;+3 j org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(JLjava/lang/String;)Ljava/lang/String;+5 j org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(JJLorg/apache/cassandra/db/commitlog/ReplayPosition;)Lorg/apache/cassandra/io/sstable/SSTableWriter;+18 J org.apache.cassandra.db.Memtable.writeSortedContents(Lorg/apache/cassandra/db/commitlog/ReplayPosition;)Lorg/apache/cassandra/io/sstable/SSTableReader; j org.apache.cassandra.db.Memtable.access$400(Lorg/apache/cassandra/db/Memtable;Lorg/apache/cassandra/db/commitlog/ReplayPosition;)Lorg/apache/cassandra/io/sstable/SSTableReader;+2 j org.apache.cassandra.db.Memtable$4.runMayThrow()V+36 j org.apache.cassandra.utils.WrappedRunnable.run()V+9 J java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V J java.util.concurrent.ThreadPoolExecutor$Worker.run()V j java.lang.Thread.run()V+11 v ~StubRoutines::call_stub Java info: java version 1.6.0_30 Java(TM) SE Runtime Environment (build 1.6.0_30-b12) Java HotSpot(TM) 64-Bit Server VM (build 20.5-b03, mixed mode)
Re: will compaction delete empty rows after all columns expired?
Minor compaction will remove the tombstones if the row only exists in the sstables being compacted. Are these very wide rows that are constantly written to ? Cheers p.s. cassandra 1.0 really does rock. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 26/05/2012, at 6:21 AM, Curt Allred wrote: This is an old thread from December 27, 2011. I interpret the yes answer to mean you do not have to explicitly delete an empty row after all of its columns have been deleted; the empty row (i.e. row key) will automatically be deleted eventually (after gc_grace). Is that true? I am not seeing that behavior on our v0.7.9 ring. We are accumulating a large number of old empty rows. They are taking a lot of space because the row keys are big, exploding the data size by 10x. I have read conflicting information on blogs and in the cassandra docs. Someone mentioned that there are both row tombstones and column tombstones, implying that you have to explicitly delete empty rows. Is that correct? My basic question is... how do I delete all these empty row keys? - From: Feng Qu Sent: Tuesday, December 27, 2011 11:09 AM Compaction should delete empty rows once gc_grace_seconds is passed, right? - From: Peter Schuller Yes. But just to be extra clear: Data will not actually be removed until the row in question participates in compaction. Compactions will not be actively triggered by Cassandra for tombstone processing reasons.
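The purge rule in the answers above has two conditions: gc_grace_seconds must have elapsed since the deletion, and (for a minor compaction) the row must not also exist in any sstable outside the compaction, otherwise dropping the tombstone could resurrect older data. A sketch of that logic (illustrative only, not Cassandra's code; timestamps are in seconds):

```python
# Sketch of the tombstone-purge rule: a row tombstone may be dropped
# during a compaction only if (a) gc_grace_seconds have elapsed since
# the deletion and (b) every sstable containing the row is part of
# this compaction (otherwise the delete could be "resurrected" by a
# copy of the row in an untouched sstable).

def can_purge_tombstone(deleted_at, now, gc_grace_seconds,
                        sstables_with_row, sstables_in_compaction):
    grace_passed = now - deleted_at >= gc_grace_seconds
    row_fully_included = set(sstables_with_row) <= set(sstables_in_compaction)
    return grace_passed and row_fully_included

# gc_grace (default 864000s = 10 days) has passed, but the row also
# lives in sstable "c", which is not part of this minor compaction,
# so the tombstone must be kept:
print(can_purge_tombstone(0, 900000, 864000, ["a", "c"], ["a", "b"]))
```

This is why old empty rows can linger for a long time under minor compactions: wide, constantly rewritten rows tend to exist in several sstables at once, so condition (b) rarely holds.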
Re: cassandra read latency help
80 ms per request sounds high. I'm doing some guessing here; i am guessing memory usage is the problem. * I assume you are no longer seeing excessive GC activity. * The key cache will not get used when you hit the row cache. I would disable the row cache if you have a random workload, which it looks like you do. * 500 million is a lot of keys to have on a single node. At the default index sample of every 128 keys it will have about 4 million samples, which is probably taking up a lot of memory. Is this testing a real world scenario or an abstract benchmark ? IMHO you will get more insight from testing something that resembles your application. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 26/05/2012, at 8:48 PM, Gurpreet Singh wrote: Hi Aaron, Here is the latest on this.. i switched to a node with 6 disks and running some read tests, and i am seeing something weird. setup: 1 node, cassandra 1.0.9, 8 cpu, 16 gig RAM, 6 7200 rpm SATA data disks striped 512 kb, commitlog mirrored. 1 keyspace with just 1 column family random partitioner total number of keys: 500 million (the keys are just longs from 1 to 500 million) avg key size: 8 bytes bloom filter size: 1 gig total disk usage: 70 gigs compacted 1 sstable mean compacted row size: 149 bytes heap size: 8 gigs keycache size: 2 million (takes around 2 gigs in RAM) rowcache size: 1 million (off-heap) memtable_total_space_mb : 2 gigs test: Trying to do 5 reads per second. Each read is a multigetslice query for just 1 key, 2 columns. observations: row cache hit rate: 0.4 key cache hit rate: 0.0 (this will increase later on as the system moves to steady state) cfstats - 80 ms iostat (every 5 seconds): r/s : 400 %util: 20% (all disks are at equal utilization) await: 65-70 ms (for each disk) svctm : 2.11 ms (for each disk) r-kB/s - 35000 why this is weird is because.. 5 reads per second is causing a latency of 80 ms per request (according to cfstats). isn't this too high? 
35 MB/s is being read from the disk. That is again very weird. This number is way too high; the avg row size is just 149 bytes. Even index reads should not cause this much data to be read from the disk. what i understand is that each read request translates to 2 disk accesses (because there is only 1 sstable): 1 for the index, 1 for the data. At such a low reads/second, why is the latency so high? would appreciate help debugging this issue. Thanks Gurpreet On Tue, May 22, 2012 at 2:46 AM, aaron morton aa...@thelastpickle.com wrote: With heap size = 4 gigs I would check for GC activity in the logs and consider setting it to 8 given you have 16 GB. You can also check if the IO system is saturated (http://spyced.blogspot.co.nz/2010/01/linux-performance-basics.html) Also take a look at nodetool cfhistograms perhaps to see how many sstables are involved. I would start by looking at the latency reported on the server, then work back to the client…. I may have missed it in the email but what recent latency for the CF is reported by nodetool cfstats ? That's the latency for a single request on a single read thread. The default settings give you 32 read threads. If you know the latency for a single request, and you know you have 32 concurrent read threads, you can get an idea of the max throughput for a single node. Once you get above that throughput the latency for a request will start to include wait time. It's a bit more complicated, because when you request 40 rows that turns into 40 read tasks. So if two clients send a request for 40 rows at the same time there will be 80 read tasks to be processed by 32 threads. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 20/05/2012, at 4:10 PM, Radim Kolar wrote: On 19.5.2012 0:09, Gurpreet Singh wrote: Thanks Radim. Radim, actually 100 reads per second is achievable even with 2 disks. it will become worse as rows get fragmented. 
But achieving them with a really low avg latency per key is the issue. I am wondering if anyone has played with index_interval, and how much of a difference would it make to reads on reducing the index_interval. close to zero. but try it yourself too and post your findings.
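Aaron's "about 4 million samples" estimate, and the reason index_interval matters here, comes out of quick arithmetic. A sketch of the calculation (the ~40 bytes of per-sample JVM overhead is an assumed illustrative figure, not a measured Cassandra number):

```python
# Back-of-envelope check of the index sample estimate above: with
# index_interval = 128 (the cassandra.yaml default at the time), one
# key in every 128 is sampled into memory. Raising index_interval
# shrinks this memory cost at the price of a longer index scan per read.

total_keys = 500_000_000
index_interval = 128
avg_key_size = 8              # bytes, from the test setup above
per_sample_overhead = 40      # assumed JVM/object overhead, illustrative

samples = total_keys // index_interval
approx_bytes = samples * (avg_key_size + per_sample_overhead)
print(f"{samples:,} samples, ~{approx_bytes / 1024**2:.0f} MB")
```

Even with these rough numbers the samples alone sit in the hundreds of megabytes on an 8GB heap, alongside a 1GB bloom filter and a 2GB key cache, which supports the "memory usage is the problem" guess.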
Re: TimedOutException caused by Stop the world activity
The cluster is running into GC problems and this is slowing it down under the stress test. When it slows down, one or more of the nodes is failing to perform the write within rpc_timeout. This causes the coordinator of the write to raise the TimedOutException. Your options are: * allocate more memory * ease back on the stress test * work at CL QUORUM so that one node failing does not result in the error. see also http://wiki.apache.org/cassandra/FAQ#slows_down_after_lotso_inserts Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 28/05/2012, at 12:59 PM, Jason Tang wrote: Hi My system is a 4 node 64 bit cassandra cluster, 6GB per node, default configuration (which means 1/3 heap for memtables), replication factor 3, write all, read one. When I run stress load testing, I got this TimedOutException, some operations failed, and all traffic hung for a while. And when I had a 1GB memory 32 bit cassandra in standalone mode, I didn't see Stop-the-world behavior so frequently. So I wonder what kind of operation will hang the cassandra system, and how to collect information for tuning. From the system log and documentation, I guess there are three types of operations: 1) Flush memtable when it meets max size 2) Compact SSTables (why?) 3) Java GC system.log: INFO [main] 2012-05-25 16:12:17,054 ColumnFamilyStore.java (line 688) Enqueuing flush of Memtable-LocationInfo@1229893321(53/66 serialized/live bytes, 2 ops) INFO [FlushWriter:1] 2012-05-25 16:12:17,054 Memtable.java (line 239) Writing Memtable-LocationInfo@1229893321(53/66 serialized/live bytes, 2 ops) INFO [FlushWriter:1] 2012-05-25 16:12:17,166 Memtable.java (line 275) Completed flushing /var/proclog/raw/cassandra/data/system/LocationInfo-hb-2-Data.db (163 bytes) ... 
INFO [CompactionExecutor:441] 2012-05-28 08:02:55,345 CompactionTask.java (line 112) Compacting [SSTableReader(path='/var/proclog/raw/cassandra/data/myks/queue-hb-41-Data.db'), SSTableReader(path='/var/proclog/raw/cassandra/data/ myks /queue-hb-32-Data.db'), SSTableReader(path='/var/proclog/raw/cassandra/data/ myks /queue-hb-37-Data.db'), SSTableReader(path='/var/proclog/raw/cassandra/data/ myks /queue-hb-53-Data.db')] ... WARN [ScheduledTasks:1] 2012-05-28 08:02:26,619 GCInspector.java (line 146) Heap is 0.7993011015621736 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically INFO [ScheduledTasks:1] 2012-05-28 08:02:54,980 GCInspector.java (line 123) GC for ConcurrentMarkSweep: 728 ms for 2 collections, 3594946600 used; max is 6274678784 INFO [ScheduledTasks:1] 2012-05-28 08:41:34,030 GCInspector.java (line 123) GC for ParNew: 1668 ms for 1 collections, 4171503448 used; max is 6274678784 INFO [ScheduledTasks:1] 2012-05-28 08:41:48,978 GCInspector.java (line 123) GC for ParNew: 1087 ms for 1 collections, 2623067496 used; max is 6274678784 INFO [ScheduledTasks:1] 2012-05-28 08:41:48,987 GCInspector.java (line 123) GC for ConcurrentMarkSweep: 3198 ms for 3 collections, 2623361280 used; max is 6274678784 Timeout Exception: Caused by: org.apache.cassandra.thrift.TimedOutException: null at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:19495) ~[na:na] at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:1035) ~[na:na] at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:1009) ~[na:na] at me.prettyprint.cassandra.service.KeyspaceServiceImpl$1.execute(KeyspaceServiceImpl.java:95) ~[na:na] ... 64 common frames omitted BRs //Tang Weiqiang
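Aaron's suggestion to write at QUORUM rather than ALL comes down to simple arithmetic: QUORUM needs acknowledgements from only a majority of replicas, so with RF=3 one node stuck in a stop-the-world GC pause no longer fails the write. A sketch of the standard consistency-level arithmetic:

```python
# Sketch of why CL.QUORUM tolerates one failing node where CL.ALL
# does not: number of replica acknowledgements each level requires.

def required_acks(consistency, rf):
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[consistency]

def write_succeeds(consistency, rf, healthy_replicas):
    return healthy_replicas >= required_acks(consistency, rf)

rf = 3
# one of the three replicas is paused in a long GC:
print(write_succeeds("ALL", rf, 2))     # ALL needs all 3 acks
print(write_succeeds("QUORUM", rf, 2))  # QUORUM needs only 2
```

With write QUORUM and read QUORUM the overlap (2 + 2 > 3) still guarantees a read sees the latest write, so the original write-all/read-one guarantee is preserved in a more failure-tolerant form.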
Re: Snapshot failing on JSON files in 1.1.0
CASSANDRA-4230 is a bug in 1.1 I am not aware of issues using snapshot on 1.0.9. But errno 0 is a bit odd. On the server side there should be a log message at ERROR level that contains the string Unable to create hard link and the error message. What does that say ? Can you also include the OS version. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 28/05/2012, at 9:27 PM, Alain RODRIGUEZ wrote: I have the same error with the last Datastax AMI (1.0.9). Is that the same bug ? Requested snapshot for: cassa_teads Exception in thread main java.io.IOError: java.io.IOException: Unable to create hard link from /raid0/cassandra/data/cassa_teads/stats_product-hc-233-Index.db to /raid0/cassandra/data/cassa_teads/snapshots/20120528/stats_product-hc-233-Index.db (errno 0) at org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1433) at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1462) at org.apache.cassandra.db.Table.snapshot(Table.java:210) at org.apache.cassandra.service.StorageService.takeSnapshot(StorageService.java:1710) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27) at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208) at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120) at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836) at 
com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761) at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427) at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72) at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265) at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360) at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788) at sun.reflect.GeneratedMethodAccessor50.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303) at sun.rmi.transport.Transport$1.run(Transport.java:159) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.Transport.serviceCall(Transport.java:155) at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: Unable to create hard link from /raid0/cassandra/data/cassa_teads/stats_product-hc-233-Index.db to /raid0/cassandra/data/cassa_teads/snapshots/20120528/stats_product-hc-233-Index.db (errno 0) at org.apache.cassandra.utils.CLibrary.createHardLink(CLibrary.java:158) at org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:857) at org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1412) ... 
32 more. Can we do a snapshot manually (like flushing and then copying all the files into the snapshot folder)? Alain 2012/5/19 Jonathan Ellis jbel...@gmail.com: When these bugs are fixed: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=truejqlQuery=project+%3D+CASSANDRA+AND+fixVersion+%3D+%221.1.1%22+AND+resolution+%3D+Unresolved+ORDER+BY+due+ASC%2C+priority+DESC%2C+created+ASCmode=hide On Wed, May 16, 2012 at 6:35 PM, Bryan Fernandez bfernande...@gmail.com wrote: Does anyone know when 1.1.1 will be released? Thanks. On Tue, May 15, 2012 at 5:40 PM, Brandon Williams dri...@gmail.com wrote: Probably https://issues.apache.org/jira/browse/CASSANDRA-4230 On Tue, May 15, 2012 at 4:08 PM, Bryan
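A manual snapshot along the lines Alain asks about is essentially a flush followed by hard links, which is what the failing createHardLink call in the stack trace is doing. A hedged sketch using Python's os.link (paths are illustrative; run nodetool flush first so memtable data reaches disk, and note this mimics the idea rather than reproducing nodetool snapshot exactly):

```python
# Sketch of a "manual snapshot": hard-link every sstable file in a
# keyspace data directory into a snapshots/<tag> subdirectory. Hard
# links are cheap and sstables are immutable once written, which is
# why linking is safe while Cassandra is running (flush first so
# in-memory data is on disk).

import os

def manual_snapshot(ks_dir, tag):
    snap_dir = os.path.join(ks_dir, "snapshots", tag)
    os.makedirs(snap_dir, exist_ok=True)
    for name in os.listdir(ks_dir):
        src = os.path.join(ks_dir, name)
        if os.path.isfile(src):  # skip the snapshots/ subdirectory itself
            os.link(src, os.path.join(snap_dir, name))
    return snap_dir

# illustrative invocation matching the paths in the stack trace above:
# manual_snapshot("/raid0/cassandra/data/cassa_teads", "20120528")
```

Copying the files instead of hard-linking also works and sidesteps the errno 0 link failure, at the cost of disk space and I/O.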
Re: Doubts regarding compaction
Also, I want to make sure: can major compactions only be done manually ? Major compactions are the ones you run using nodetool. Is the author referring to this time period as no minor compactions being triggered automatically ? The minor compactions will be triggered less frequently, because you will need to run 4 compactions at the first size tier before one runs at the next. And so on, up to the point where you need to get another 3 files roughly the same size as the one you got from the major compaction. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 29/05/2012, at 3:41 AM, Rajat Mathur wrote: http://www.datastax.com/docs/1.0/operations/tuning On this page at the end, there's a note about Major Compaction which says, Also, once you run a major compaction, automatic minor compactions are no longer triggered frequently... Could anybody give an explanation for that? Because as far as I think, once a major compaction takes place, let's say there would be no compactions till N (default value 4) SSTables of the same size (size of a memtable, to be precise) are formed; then automatically minor compactions would start. Is the author referring to this time period as no minor compactions being triggered automatically ? Also, I want to make sure, can major compactions only be done manually ? -- Rajat Mathur
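The effect Aaron describes, where the single giant sstable left by a major compaction never finds similar-sized peers, shows up in a toy model of size-tiered compaction (bucketing by power-of-two size class with a threshold of 4 here; Cassandra's real bucketing is more nuanced, so treat this as an illustration):

```python
# Toy model of size-tiered compaction: sstables are grouped into
# buckets of roughly similar size (here: same power-of-two size
# class, sizes in MB), and a bucket compacts once it holds
# min_threshold (default 4) sstables.

import math

def size_class(size):
    return int(math.log2(size))

def compactable_buckets(sstable_sizes, min_threshold=4):
    buckets = {}
    for size in sstable_sizes:
        buckets.setdefault(size_class(size), []).append(size)
    return [b for b in buckets.values() if len(b) >= min_threshold]

# Four fresh ~32MB sstables trigger a minor compaction...
print(compactable_buckets([32, 33, 34, 35]))
# ...but after a major compaction, the one 4096MB sstable sits alone
# in its bucket, and nothing at that tier triggers until three more
# sstables of comparable size accumulate:
print(compactable_buckets([4096, 32, 33, 34]))
```

So minor compactions still run on the small fresh sstables; it is the tombstone- and overwrite-purging merges *involving the big sstable* that become rare, which is why routinely forcing major compactions is discouraged.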
Re: About Composite range queries
Composite Columns compare each part in turn, so the values are ordered as you've shown them. However the rows are not ordered according to key value. They are ordered using the random token generated by the partitioner, see http://wiki.apache.org/cassandra/FAQ#range_rp What is the real advantage compared to super column families? They are faster. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 29/05/2012, at 10:08 PM, Cyril Auburtin wrote: How is it done in Cassandra to be able to range query on a composite key? key1 = (A:A:C), (A:B:C), (A:C:C), (A:D:C), (B:A:C) like get_range(key1, start_column=(A,), end_column=(A, C)); will return [ (A:B:C), (A:C:C) ] (in pycassa) I mean does the composite implementation add much overhead to make it work? Does it need to add other column families to be able to range query between composite simple keys (first, second and third part of the composite)? What is the real advantage compared to super column families? key1 = A: (A,C), (B,C), (C,C), (D,C) , B: (A,C) thx
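The part-by-part comparison Aaron describes is exactly lexicographic tuple ordering, which Python models directly. This sketch shows only the ordering semantics, with exclusive slice bounds for simplicity; real clients like pycassa expose their own inclusive/exclusive slice rules and a byte-level encoding, so it is not a reimplementation of get_range:

```python
# Composite column names compare component by component, like Python
# tuples. Because the names within a row are kept sorted this way, a
# (start, end) slice is a contiguous range -- which is why composite
# range queries work within a row even under the RandomPartitioner.

columns = sorted([("A", "A", "C"), ("A", "B", "C"), ("A", "C", "C"),
                  ("A", "D", "C"), ("B", "A", "C")])

def slice_columns(cols, start, end):
    # start/end are composite prefixes; a shorter tuple sorts before
    # any longer tuple it prefixes, so ("A",) precedes ("A", "A", "C")
    return [c for c in cols if start < c < end]

print(slice_columns(columns, ("A",), ("A", "C")))
```

No extra column families are needed for this: the sort order of the column names themselves carries the range-query capability, which is the overhead-free answer to Cyril's question.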
Re: All host pools Marked Down
I would remove the load balancer from the equation. Compactions do not stop the world; they may degrade performance for a while but that's about it. Look in the logs on the servers: are the nodes logging that other nodes are going DOWN ? Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 30/05/2012, at 2:25 AM, cem wrote: It should retry but it doesn't. It is also clear that it delegates the retry to the client ("Retry burden pushed out to client"); you can also check the Hector code. I wrote a separate service that retries when this exception occurs. I think you have a problem with your load balancer. Try to connect with telnet. Cem. On Tue, May 29, 2012 at 3:06 PM, Shubham Srivastava shubham.srivast...@makemytrip.com wrote: My webapp connects to the LoadBalancer IP which has the actual nodes in its pool. If there is by any chance a connection break, then will hector not retry to re-establish the connection? I guess it should retry every XX seconds based on retryDownedHostsDelayInSeconds. Regards, Shubham From: cem [cayiro...@gmail.com] Sent: Tuesday, May 29, 2012 6:13 PM To: user@cassandra.apache.org Subject: Re: All host pools Marked Down Since all hosts seem to be down, Hector will not retry. There should be at least one node up in a cluster. Make sure that you have a proper connection from your webapps to your cluster. Cem. On Tue, May 29, 2012 at 1:46 PM, Shubham Srivastava shubham.srivast...@makemytrip.com wrote: Any takers on this. Hitting us badly right now. Regards, Shubham From: Shubham Srivastava Sent: Tuesday, May 29, 2012 12:55 PM To: user@cassandra.apache.org Subject: All host pools Marked Down I am getting this exception a lot of times: me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client. What this causes is no data reads/writes from the ring from my WebApp. I have retries as 3 and can see that the max retries of 3 get exhausted with the same error as above. 
Checked cfstats and tpstats; nothing seems to be a problem. However, through the logs I see a lot of time taken in compactions, like the below INFO [CompactionExecutor:73] 2012-05-29 11:03:01,605 CompactionManager.java (line 608) Compacted to /opt/cassandra-data/data/LH/UserPrefrences-tmp-g-8906-Data.db. 36,986,932 to 36,961,554 (~99% of original) bytes for 132,743 keys. Time: 112,910ms. The time taken here seems pretty high. Will this cause a pause or read timeout etc.? I have the connection from my web app through a hardware load balancer. Cassandra version is 0.8.6 with a multi-DC ring of 6 nodes each in one DC. CL: 1 and RF: 3. Memory: 8GB heap - 14GB server memory with 8-core CPU. How do I move ahead in this. Shubham Srivastava | Technical Lead - Technology Development +91 124 4910 548 | MakeMyTrip.com, 243 SP Infocity, Udyog Vihar Phase 1, Gurgaon, Haryana - 122 016, India
Re: Nodetool talking to an old IP address (and timing out)
nodetool passes the host name unmodified to the JMX library to connect to the host. The JMX server will, by default, bind to the IP address of the machine. If the host name was wrong, I would guess the JMX service failed to bind.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 30/05/2012, at 6:39 AM, Douglas Muth wrote:

8 hours, 1 cup of coffee, and 4 Advil later, and I think I got to the bottom of this. Not having much of a Java or JMX background, I'll try to explain it the best that I can.

To recap, my machine originally had the IP address 10.244.207.16. Then I shut down/restarted that EC2 instance, and it had the IP 10.84.117.110. During this, Cassandra was fine -- I could still connect to 127.0.0.1 with cqlsh and the Helenus node.js module. Things got weird only when I tried to use nodetool to manage the instance.

As best I can tell, here's the algorithm that nodetool uses when connecting to a Cassandra instance:

Step 1) Connect to the hostname and port specified on the command line. localhost and 7199 are the defaults.
Step 2) Cassandra, at boot time, notes the hostname of the machine, and tells nodetool "go connect to this hostname instead!"

After further investigation, it seems that after my instance was rebooted, the file /etc/hostname was not updated. It still had the value ip-10-244-207-16.ec2.internal in it. This means that any attempt to connect to Cassandra involved Cassandra telling nodetool, "Hey, go talk to 10.244.207.16 instead." And that's where things went wrong.

The permanent fix for this was to change the hostname to localhost and restart Cassandra. The fact that Cassandra notes the hostname at startup was one thing that made this so difficult to track down. I did not see the old IP anywhere in the Cassandra configuration (or in logfile output), so I did not think there was anything abnormal happening in the instance.
While I'm sure there's a good reason for this sort of behavior, it is very confusing to a Cassandra newbie such as myself, and I'll bet others have been affected by this as well. In the future, I think some sort of logging of this logic, or perhaps a --verbose mode for nodetool, would be a really good idea. What do other folks think?

-- Doug
http://twitter.com/dmuth

On Tue, May 29, 2012 at 12:08 PM, Douglas Muth doug.m...@gmail.com wrote:

I'm afraid that did not work. I'm running JMX on port 7199 (the default) and I verified that the port is open and accepting connections. [snip]
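For anyone else bitten by this: the redirect described above is standard JMX/RMI behaviour — the server hands the client back a stub containing whatever hostname it thinks it has. An alternative to editing /etc/hostname is to pin the address the RMI stub advertises via conf/cassandra-env.sh. A sketch (the address to use obviously depends on your setup; this assumes you only manage the node locally):

```shell
# conf/cassandra-env.sh: pin the hostname embedded in the RMI stub so
# nodetool is not redirected to a stale EC2-internal name after a reboot.
JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=127.0.0.1"
```

After restarting Cassandra, `nodetool -h 127.0.0.1 -p 7199 ring` should connect without being bounced to the old address.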
RE: will compaction delete empty rows after all columns expired?
No, these were not wide rows. They are rows that formerly had one or two columns. The columns are deleted, but the empty rows don't go away, even after gc_grace_secs.

So if I understand... the empty row will only be removed after gc_grace if enough compactions have occurred so that all the column tombstones for the empty row are in a single SSTable file?

From: aaron morton [mailto:aa...@thelastpickle.com]

Minor compaction will remove the tombstones if the row only exists in the sstable being compacted. Are these very wide rows that are constantly written to?

Cheers
p.s. cassandra 1.0 really does rock.
Re: will compaction delete empty rows after all columns expired?
On Thu, May 31, 2012 at 9:31 AM, Curt Allred c...@mediosystems.com wrote:

No, these were not wide rows. They are rows that formerly had one or two columns. The columns are deleted, but the empty rows don't go away, even after gc_grace_secs.

The empty row goes away only during a compaction after gc_grace_secs has passed. You can set gc_grace_secs to a small value and force a major compaction after the row has expired. Then check whether the row still exists.

So if I understand... the empty row will only be removed after gc_grace if enough compactions have occurred so that all the column tombstones for the empty row are in a single SSTable file?

From: aaron morton [mailto:aa...@thelastpickle.com]

Minor compaction will remove the tombstones if the row only exists in the sstable being compacted. Are these very wide rows that are constantly written to?

Cheers
p.s. cassandra 1.0 really does rock.
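A concrete way to carry out the suggestion above with the 1.0-era tooling — lower gc_grace, then force a major compaction. The keyspace and column family names here are made up for the example:

```shell
# Lower gc_grace for the column family via cassandra-cli (value in
# seconds), then force a major compaction with nodetool so tombstones
# older than gc_grace can actually be purged.
echo "use MyKeyspace; update column family UserPrefs with gc_grace = 3600;" \
  | cassandra-cli -h 127.0.0.1
nodetool -h 127.0.0.1 compact MyKeyspace UserPrefs
```

Note the usual caveat: gc_grace exists to let deletes propagate, so a small value is only safe if repairs run (or all replicas are reachable) well within that window.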
Re: Confusion regarding the terms replica and replication factor
http://answers.oreilly.com/topic/2408-replica-placement-strategies-when-using-cassandra/

As mentioned there, it does this: "The Network Topology Strategy places some replicas in another data center and the remainder in other racks in the first data center, as specified." Which is not what most would expect.

Assume your largish cluster is, say, 40 nodes in a data center. Three copies get you quorum, and a 48-port switch is pretty common. Switches can even be stacked, sometimes 3 or 4 units, so now 48x4 switch ports really make up one rack.

On Wed, May 30, 2012 at 4:37 PM, Jeff Williams je...@wherethebitsroam.com wrote:

On May 30, 2012, at 10:32 PM, Edward Capriolo wrote:

The snitch does use the datacenter and the rack, but almost all deployments use a single rack per DC, because when you have more than one rack in a data center the NTS snitch has some logic to spread the data between racks (most people do not want this behavior).

Out of curiosity, why would most people not want this behaviour? It seems like a good idea from an availability perspective.

Jeff
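For reference, the per-DC replica counts that NetworkTopologyStrategy works from are set in strategy_options. A 1.0-era cassandra-cli sketch — the keyspace name and data center name are illustrative, and the DC name must match what your snitch reports:

```shell
# cassandra-cli: keep 3 replicas in DC1; NTS then spreads those replicas
# across the racks the snitch reports within DC1 (the rack-aware
# behavior discussed above).
create keyspace MyKS
  with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = {DC1 : 3};
```

With a single rack per DC, the rack-spreading logic is a no-op, which is why many deployments configure it that way.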