random partitioner and key scan
Hi, I know that RandomPartitioner takes the MD5 of a key, and that MD5 hash is then used for both key distribution AND key ordering. I was just wondering whether it's possible to have RandomPartitioner handle only key distribution, with OrderedPartitioner handling per-node key ordering. That would solve the often-requested key scan feature. Regards, Patrik
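A rough sketch of the coupling Patrik describes, assuming the usual MD5-to-token derivation (this is an illustration, not Cassandra's actual source, and the class name is invented): the same MD5-derived token decides both which node owns a key and the order in which keys are stored, so on-disk "key order" is hash order rather than natural key order.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Illustration only: a RandomPartitioner-style token is the absolute value of
// the MD5 digest of the key, interpreted as a big integer in [0, 2^127).
// The token drives placement AND storage order, which is why range scans over
// raw keys are not possible with this scheme.
public class TokenSketch {
    static BigInteger token(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(digest).abs();
    }

    public static void main(String[] args) throws Exception {
        // "user:1" and "user:2" are adjacent as raw keys, but their tokens
        // (and therefore their placement and ordering) are unrelated.
        System.out.println(token("user:1"));
        System.out.println(token("user:2"));
    }
}
```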
Re: Tripling size of a cluster
Hi again, we have now moved all nodes to their correct positions in the ring, but we can see a higher load on 2 nodes than on the other nodes:

...
node01-05 rack1 Up Normal 244.65 GB 6,67% 102084710076281539039012382229530463432
node02-13 rack2 Up Normal 240.26 GB 6,67% 107756082858297180096735292353393266961
node01-13 rack1 Up Normal 243.75 GB 6,67% 113427455640312821154458202477256070485
node02-05 rack2 Up Normal 249.31 GB 6,67% 119098828422328462212181112601118874004
node01-14 rack1 Up Normal 244.95 GB 6,67% 124770201204344103269904022724981677533
node02-14 rack2 Up Normal 392.7 GB 6,67% 130441573986359744327626932848844481058
node01-06 rack1 Up Normal 249.3 GB 6,67% 136112946768375385385349842972707284576
node02-15 rack2 Up Normal 286.82 GB 6,67% 141784319550391026443072753096570088106
node01-15 rack1 Up Normal 245.21 GB 6,67% 147455692332406667500795663220432891630
node02-06 rack2 Up Normal 244.9 GB 6,67% 153127065114422308558518573344295695148
...

The overloaded nodes:
* node01-15 => 286.82 GB
* node02-14 => 392.7 GB

The average load on all other nodes is around 245 GB, and the nodetool cleanup command was invoked on the problematic nodes after the move operation... Why has this happened? And how can we rebalance the cluster?

On 06.07.2012 20:15, aaron morton wrote:
If you have the time, yes, I would wait for the bootstrap to finish. It will make your life easier.
good luck.
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 6/07/2012, at 7:12 PM, Mariusz Dymarek wrote:
Hi, we're in the middle of extending our cluster from 10 to 30 nodes; we're running Cassandra 1.1.1... We've generated initial tokens for the new nodes:

"0": 0, # existing: node01-01
"1": 5671372782015641057722910123862803524, # new: node02-07
"2": 11342745564031282115445820247725607048, # new: node01-07
"3": 17014118346046923173168730371588410572, # existing: node02-01
"4": 22685491128062564230891640495451214097, # new: node01-08
"5": 28356863910078205288614550619314017621, # new: node02-08
"6": 34028236692093846346337460743176821145, # existing: node01-02
"7": 39699609474109487404060370867039624669, # new: node02-09
"8": 45370982256125128461783280990902428194, # new: node01-09
"9": 51042355038140769519506191114765231718, # existing: node02-02
"10": 56713727820156410577229101238628035242, # new: node01-10
"11": 62385100602172051634952011362490838766, # new: node02-10
"12": 68056473384187692692674921486353642291, # existing: node01-03
"13": 7372784616620750397831610216445815, # new: node02-11
"14": 79399218948218974808120741734079249339, # new: node01-11
"15": 85070591730234615865843651857942052864, # existing: node02-03
"16": 90741964512250256923566561981804856388, # new: node01-12
"17": 96413337294265897981289472105667659912, # new: node02-12
"18": 102084710076281539039012382229530463436, # existing: node01-05
"19": 107756082858297180096735292353393266961, # new: node02-13
"20": 113427455640312821154458202477256070485, # new: node01-13
"21": 119098828422328462212181112601118874009, # existing: node02-05
"22": 124770201204344103269904022724981677533, # new: node01-14
"23": 130441573986359744327626932848844481058, # new: node02-14
"24": 136112946768375385385349842972707284582, # existing: node01-06
"25": 141784319550391026443072753096570088106, # new: node02-15
"26": 147455692332406667500795663220432891630, # new: node01-15
"27": 153127065114422308558518573344295695155, # existing: node02-06
"28": 158798437896437949616241483468158498679, # new: node01-16
"29": 164469810678453590673964393592021302203 # new: node02-16

Then we started to bootstrap the new nodes, but due to a copy-and-paste mistake:
* node node01-14 was started with 130441573986359744327626932848844481058 as its initial token (so node01-14 has the initial_token that should belong to node02-14); it should have had 124770201204344103269904022724981677533 as its initial_token
* node node02-14 was started with 136112946768375385385349842972707284582 as its initial token, so it has the token of the existing node node01-06. However, we used a different program to generate the previous initial_tokens, and the actual token of node01-06 in the ring is 136112946768375385385349842972707284576.

Summing up, we currently have this situation in the ring:

node02-05 rack2 Up Normal 596.31 GB 6.67% 119098828422328462212181112601118874004
node01-14 rack1 Up Joining 242.92 KB 0.00% 130441573986359744327626932848844481058
node01-06 rack1 Up Normal 585.5 GB 13.33% 136112946768375385385349842972707284576
node02-14 rack2 Up Joining 113.17 KB 0.00% 136112946768375385385349842972707284582
node02-15 rack2 Up Joining 178.05 KB 0.00% 141784319550391026443072753096570088106
node01-15 rack1 Up Joining 191.7 GB 0.00% 147455692332406667500795663220432891630
node02-06 rack2 Up Normal 597.69 GB 20.00% 153127065114422308558518573344295695148

We would like to get back to our original configuration. Is it safe to wait for all new nodes to finish bootstrapping and after that inv
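The token lists above are evenly spaced points on the RandomPartitioner ring. A minimal sketch of how such a list can be generated, assuming the common formula token_i = i * 2^127 / N (the class name is ours, and this is not necessarily the tool the poster used):

```java
import java.math.BigInteger;

// Minimal sketch: evenly spaced RandomPartitioner initial tokens for an
// N-node ring, token_i = i * 2^127 / N. With nodes = 30 this reproduces the
// kind of list pasted above (e.g. token 1 = 5671372782015641057722910123862803524).
public class InitialTokens {
    public static void main(String[] args) {
        int nodes = 30;
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        for (int i = 0; i < nodes; i++) {
            BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(nodes));
            System.out.println("\"" + i + "\": " + token);
        }
    }
}
```

Small differences between tools (as seen for node01-05's existing token) come from rounding choices in this integer division.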
Re: An experiment using Spring Data w/ Cassandra (initially via JPA/Kundera)
Thanks. The team is working on extending support for SimpleJPARepository (including an implementation for ManagedType). -Vivek

On Thu, Jul 19, 2012 at 9:06 AM, Roshan wrote:
> Hi Brian
>
> This is basically wonderful news for me, because we are using lots of
> Spring support in the project. Good luck and keep posting.
>
> Cheers
>
> /Roshan.
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/An-experiment-using-Spring-Data-w-Cassandra-initially-via-JPA-Kundera-tp7581319p7581320.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>
Cassandra startup times
Good evening, I am interested in improving the startup time of our Cassandra cluster. We have a 3-node cluster (replication factor of 3) in which our application requires quorum reads and writes to function. Each machine is well specced, with 24 GB of RAM, 10 cores, JNA enabled, etc. On each server our keyspace files are so far around 90 GB (stored on NFS, although I am not seeing signs that we have much network I/O). This size will grow in future. Our startup time for one server at the moment is greater than half an hour (45 to 50 minutes even), which is putting a risk factor on the resilience of our service. I have tried versions from 1.09 to the latest 1.12. I do not see much system utilization while starting either. I came across an article suggesting increased startup speed in 1.2, although when I set it up, it did not seem to be any faster at all (if not slower). I was observing what was happening during startup and I noticed (via strace) that Cassandra was doing lots of 8-byte reads from:
/var/lib/cassandra/data/XX/YY/XXX-YYY-hc-1871-CompressionInfo.db
/var/lib/cassandra/data/XX/YY/XXX-YYY-hc-1874-CompressionInfo.db
Also... is there some way I can change the 8-byte reads to something greater? 8-byte reads across NFS are terribly inefficient (and, I am guessing, the cause of our terribly slow startup times). Regards, -- -Ben
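The cost Ben is describing is easy to reproduce outside Cassandra. A self-contained illustration (not Cassandra code; the file to read is taken from the command line) comparing unbuffered 8-byte reads with buffered reads of the same file; over NFS each unbuffered read can become a separate round trip:

```java
import java.io.*;

// Illustration of why lots of 8-byte reads are slow: each readLong() on a raw
// FileInputStream issues a tiny read syscall, while wrapping the stream in a
// BufferedInputStream coalesces them into large sequential reads, which
// matters a great deal on NFS.
public class SmallReadDemo {
    static long sum(InputStream in, long count) throws IOException {
        DataInputStream data = new DataInputStream(in);
        long sum = 0;
        for (long i = 0; i < count; i++) {
            sum += data.readLong();          // 8 bytes per call
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        File f = new File(args[0]);
        long longs = f.length() / 8;

        long t0 = System.nanoTime();
        sum(new FileInputStream(f), longs);                                   // one syscall per 8 bytes
        long t1 = System.nanoTime();
        sum(new BufferedInputStream(new FileInputStream(f), 1 << 16), longs); // 64 KB reads
        long t2 = System.nanoTime();

        System.out.printf("unbuffered: %d ms, buffered: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```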
Re: An experiment using Spring Data w/ Cassandra (initially via JPA/Kundera)
Hi Brian

This is basically wonderful news for me, because we are using lots of Spring support in the project. Good luck and keep posting.

Cheers

/Roshan.

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/An-experiment-using-Spring-Data-w-Cassandra-initially-via-JPA-Kundera-tp7581319p7581320.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
An experiment using Spring Data w/ Cassandra (initially via JPA/Kundera)
This is just an FYI. I experimented w/ Spring Data JPA w/ Cassandra leveraging Kundera. It sort of worked: https://github.com/boneill42/spring-data-jpa-cassandra http://brianoneill.blogspot.com/2012/07/spring-data-w-cassandra-using-jpa.html I'm now working on a pure Spring Data adapter using Astyanax: https://github.com/boneill42/spring-data-cassandra I'll keep you posted. (Thanks to all those that helped out w/ advice) -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
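For readers who haven't seen Kundera, the approach boils down to annotating a plain entity with standard JPA annotations and pointing the persistence unit at Cassandra. A hypothetical sketch (entity, field, and persistence-unit names are invented, not taken from Brian's repos):

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// Hypothetical JPA entity of the kind Kundera maps onto a Cassandra column
// family. The schema attribute is assumed to follow Kundera's
// "keyspace@persistence-unit" convention; names here are illustrative only.
@Entity
@Table(name = "users", schema = "demo@cassandra_pu")
public class User {

    @Id
    @Column(name = "user_id")
    private String userId;

    @Column(name = "email")
    private String email;

    // A no-arg constructor is required, as with any JPA provider.
    public User() {}

    public User(String userId, String email) {
        this.userId = userId;
        this.email = email;
    }

    public String getUserId() { return userId; }
    public String getEmail()  { return email; }
}
```

With Spring Data JPA layered on top, a repository interface extending JpaRepository<User, String> would be the only data-access code needed, which is presumably what the SimpleJPARepository work mentioned in this thread targets.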
Can't change replication factor in Cassandra 1.1.2
Hi folks, I have an interesting problem in Cassandra 1.1.2; a Google search wasn't much help, so I thought I'd ask here. Essentially, I have a "problem keyspace" in my 2-node cluster that keeps me from changing the replication factor on a specific keyspace. It's probably easier to show what I'm seeing in cassandra-cli:

[default@foobar] update keyspace test1 with strategy_options = {replication_factor:1};
2d5f0d16-bb4b-3d75-a084-911fe39f7629
Waiting for schema agreement...
... schemas agree across the cluster
[default@foobar] update keyspace test1 with strategy_options = {replication_factor:1};
7745dd06-ee5d-3e74-8734-7cdc18871e67
Waiting for schema agreement...
... schemas agree across the cluster

Even though keyspace "test1" had a replication_factor of 1 to start with, each of the above UPDATE KEYSPACE commands caused a new UUID to be generated for the schema, which I assume is normal and expected. Then I try it with the problem keyspace:

[default@foobar] update keyspace foobar with strategy_options = {replication_factor:1};
7745dd06-ee5d-3e74-8734-7cdc18871e67
Waiting for schema agreement...
... schemas agree across the cluster

Note that the UUID did not change, and the replication_factor in the underlying database did not change either. The funny thing is that foobar had a replication_factor of 1 yesterday; then I brought my second node online and changed the replication_factor to 2 without incident. I only ran into issues when I tried changing it back to 1. I tried running "nodetool clean" on both nodes, but the problem persists. Any suggestions? Thanks, -- Doug -- http://twitter.com/dmuth
Re: Composite Column Expiration Behavior
I answered my own question with a test:

Using default limit of 100
---
RowKey: test
=> (column=89b81b00-d0f3-11e1-8d4c-000c29d2a972:A, value=, timestamp=1342628020428000, ttl=10)
=> (column=89b81b00-d0f3-11e1-8d4c-000c29d2a972:B, value=, timestamp=1342628020428000, ttl=30)

1 Row Returned.
Elapsed time: 4 msec(s).

[default@context] list context_session_views;
Using default limit of 100
---
RowKey: test
=> (column=89b81b00-d0f3-11e1-8d4c-000c29d2a972:B, value=, timestamp=1342628020428000, ttl=30)

1 Row Returned.
Elapsed time: 3 msec(s).

On Wed, Jul 18, 2012 at 11:06 AM, rohit bhatia wrote:
> Hi,
>
> I don't think that composite columns have "parent columns". Your point
> might be true for supercolumns, but each composite column is probably
> independent.
>
> On Wed, Jul 18, 2012 at 9:14 PM, Thomas Van de Velde wrote:
> > Hi there,
> >
> > I am trying to understand the expiration behavior of composite columns.
> > Assume I have two entries that both have the same parent column name, but
> > each one has a different ttl. Would expiration be applied at the parent
> > column level (taking into account ttls set per column under the parent and
> > expiring all of the child columns when the most recent ttl is met), or is
> > each child entry expired independently?
> >
> > Would this be correct?
> >
> > A:B->ttl=5
> > A:C->ttl=10
> >
> > t+5: Nothing gets expired (because A:C's expiration has not yet been
> > reached)
> > t+10: Both A:B and A:C are expired
> >
> > Thanks,
> > Thomas
Re: Composite Column Expiration Behavior
Hi,

I don't think that composite columns have "parent columns". Your point might be true for supercolumns, but each composite column is probably independent.

On Wed, Jul 18, 2012 at 9:14 PM, Thomas Van de Velde wrote:
> Hi there,
>
> I am trying to understand the expiration behavior of composite columns.
> Assume I have two entries that both have the same parent column name, but
> each one has a different ttl. Would expiration be applied at the parent
> column level (taking into account ttls set per column under the parent and
> expiring all of the child columns when the most recent ttl is met), or is
> each child entry expired independently?
>
> Would this be correct?
>
> A:B->ttl=5
> A:C->ttl=10
>
> t+5: Nothing gets expired (because A:C's expiration has not yet been
> reached)
> t+10: Both A:B and A:C are expired
>
> Thanks,
> Thomas
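As the test earlier in this thread shows, each composite column carries its own TTL. A toy model of that behaviour (our own illustration, not Cassandra internals): expiry is tracked per full composite name, so A:B and A:C disappear at different times.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Toy model (not Cassandra internals) of per-column TTLs on composite names:
// each full composite name (e.g. "A:B", "A:C") carries its own expiry time,
// so columns sharing the "A" prefix expire independently of one another.
public class CompositeTtlModel {
    static final Map<String, Long> expiresAtMillis = new HashMap<String, Long>();

    static void insert(String compositeName, int ttlSeconds, long nowMillis) {
        expiresAtMillis.put(compositeName, nowMillis + ttlSeconds * 1000L);
    }

    static void sweep(long nowMillis) {
        // Remove only the individual columns whose own TTL has elapsed.
        for (Iterator<Map.Entry<String, Long>> it = expiresAtMillis.entrySet().iterator(); it.hasNext();) {
            if (it.next().getValue() <= nowMillis) {
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        long t0 = 0;
        insert("A:B", 5, t0);
        insert("A:C", 10, t0);

        sweep(t0 + 5_000);
        System.out.println(expiresAtMillis.keySet()); // [A:C], A:B is already gone
        sweep(t0 + 10_000);
        System.out.println(expiresAtMillis.keySet()); // []
    }
}
```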
Re: Batch update efficiency with composite key
Cassandra doesn't do reads before writes; it just places the updates in memtables. In effect, updates are the same as inserts. Batches certainly help with network latency, and with some minor amount of code repetition on the server side.

----- Original Message -----
From: "Leonid Ilyevsky" <lilyev...@mooncapital.com>
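A toy sketch of the write path described above (not Cassandra's code): an "update" is just a put into an in-memory, per-column sorted map, with no read of the existing value, which is why updates and inserts cost the same.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Toy sketch of the point above: a write just drops the new (value, timestamp)
// into a sorted in-memory map keyed by row:column. Nothing is read first, so
// an "update" costs the same as an insert; the newest timestamp simply wins
// when the cell is read or compacted later.
public class MemtableSketch {
    static final class Cell {
        final String value;
        final long timestamp;
        Cell(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    private final ConcurrentSkipListMap<String, Cell> memtable = new ConcurrentSkipListMap<String, Cell>();

    void write(String rowKey, String columnName, String value, long timestamp) {
        // Blind write: keep whichever cell has the newer timestamp, never read from disk.
        memtable.merge(rowKey + ":" + columnName, new Cell(value, timestamp),
                (oldCell, newCell) -> newCell.timestamp >= oldCell.timestamp ? newCell : oldCell);
    }

    public static void main(String[] args) {
        MemtableSketch sketch = new MemtableSketch();
        sketch.write("row1", "colA", "v1", 1000L);   // "insert"
        sketch.write("row1", "colA", "v2", 2000L);   // "update": same code path, no read of v1 needed
        System.out.println(sketch.memtable.get("row1:colA").value); // v2
    }
}
```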
Re: Cassandra Evaluation/ Benchmarking: Throughput not scaling as expected neither latency showing good numbers
On 2012.07.18. 7:13, Code Box wrote:
> The cassandra stress tool gives me values around 2.5 milliseconds for
> writing. The problem with the Cassandra Stress Tool is that it just gives
> the average latency numbers, and the average latency numbers that I am
> getting are comparable in some cases. It is the 95th percentile and 99th
> percentile numbers that are bad. So it means that 95% of requests are
> really bad and the rest 5% are really good, which makes the average go down.

No, the opposite is true: 95% of the requests are fast, and 5% are slow. In the case of the 99th percentile, 99% are fast and 1% are slow. Unless, that is, you order your samples in the opposite of the usual direction.
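A small illustration of what the percentile actually reports (our own example, not from the thread): sort the latency samples ascending and take the value at the 95% rank; only the slowest 5% of requests sit above it, while a single outlier can still drag the average up.

```java
import java.util.Arrays;

// Nearest-rank percentile over latency samples: after sorting ascending,
// 95% of samples are at or below the reported p95 value; only the slow tail
// sits above it. The average, by contrast, is pulled up by outliers.
public class PercentileDemo {
    static double percentile(double[] samplesMillis, double p) {
        double[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(index, 0)];
    }

    public static void main(String[] args) {
        // 20 latency samples in ms: mostly fast, with one slow outlier.
        double[] latencies = {2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 120};
        System.out.println("avg = " + Arrays.stream(latencies).average().getAsDouble()); // 9.25 ms, dragged up by the 120 ms outlier
        System.out.println("p95 = " + percentile(latencies, 95)); // 6.0 ms: 95% of requests were this fast or faster
        System.out.println("p99 = " + percentile(latencies, 99)); // 120.0 ms: with 20 samples, the slowest one
    }
}
```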
Batch update efficiency with composite key
I have a question about the efficiency of updates to a CF with a composite key. Let's say I have 100 logical rows to update, and they all belong to the same physical wide row. In my naïve understanding (correct me if I am wrong), in order to update a logical row, Cassandra has to retrieve the whole physical row, add columns to it, and put it back. So I put all my 100 updates in a batch and send it over. Would Cassandra be smart enough to recognize that they all belong to one physical row, retrieve it once, do all the updates, and put it back once? Is my batch thing even relevant in this case? What happens if I just send the updates one by one? I want to understand why I should use batches. I don't really care about one timestamp for all records; I only care about efficiency. So I thought I would at least save on the number of remote calls, but I also wonder what happens on the Cassandra side.

This email, along with any attachments, is confidential and may be legally privileged or otherwise protected from disclosure. Any unauthorized dissemination, copying or use of the contents of this email is strictly prohibited and may be in violation of law. If you are not the intended recipient, any disclosure, copying, forwarding or distribution of this email is strictly prohibited and this email and any attachments should be deleted immediately. This email and any attachments do not constitute an offer to sell or a solicitation of an offer to purchase any interest in any investment vehicle sponsored by Moon Capital Management LP ("Moon Capital"). Moon Capital does not provide legal, accounting or tax advice. Any statement regarding legal, accounting or tax matters was not intended or written to be relied upon by any person as advice. Moon Capital does not waive confidentiality or privilege as a result of this email.
Re: Cassandra Evaluation/ Benchmarking: Throughput not scaling as expected neither latency showing good numbers
What kind of client are you using in YCSB? If you want to improve latency, try distributing the requests among the nodes instead of stressing a single node, and try host connection pooling instead of creating a connection for each request. Check out high-level clients like Hector or Astyanax if you are not already using them; some clients have ring-aware request handling. You have a 3-node cluster and are using an RF of three, which means every node will get the data. What CL are you using for writes? Latency increases with stronger CL. If you want to increase throughput, try increasing the number of clients. Of course, that doesn't mean throughput will always increase; my observation was that it increases and then, after a certain number of clients, decreases again.

Regards,
Manoj Mainali

On Wednesday, July 18, 2012, Code Box wrote:
> The cassandra stress tool gives me values around 2.5 milliseconds for
> writing. The problem with the Cassandra Stress Tool is that it just gives
> the average latency numbers, and the average latency numbers that I am
> getting are comparable in some cases. It is the 95th percentile and 99th
> percentile numbers that are bad. So it means that 95% of requests are
> really bad and the rest 5% are really good, which makes the average go
> down. I want to make sure that the 95% and 99% values are in one-digit
> milliseconds. I want them to be single digit because I have seen people
> getting those numbers.
>
> This is my conclusion till now with all the investigations:
>
> A three-node cluster with a replication factor of 3 gets me around 10 ms
> for 100% writes with consistency equal to ONE. The reads are really bad;
> they are around 65 ms.
>
> I thought that the network was the issue, so I moved the client onto a
> local machine. The client on the local machine with a one-node cluster
> again gives me good average write latencies, but the 99%ile and 95%ile are
> bad. I am getting around 10 ms for writes and 25 ms for reads.
>
> Network bandwidth between the client and server is 1 Gigabit/second. I was
> able to generate at most 25K requests, so it could be that the client is
> the bottleneck. I am using YCSB. Maybe I should change my client to some
> other.
>
> Throughput that I got from a client at the maximum was 35K local and 17K
> remote.
>
> I can try these things now:
>
> Use a different client and see what numbers I get for 99% and 95%. I am
> not sure if there is any client that gives me this detail, or whether I
> have to write one of my own.
>
> Tweak some hard disk settings, raid0 and xfs / ext4, and see if that helps.
>
> It could be a possibility that from cassandra 0.8 to 1.1 the 95% and 99%
> numbers have gone down. The throughput numbers have also gone down.
>
> Is there any other client that I can use except the cassandra stress tool
> and YCSB, and whatever numbers I have got, are they good?
>
> --Akshat Vig.
>
> On Tue, Jul 17, 2012 at 9:22 PM, aaron morton wrote:
>
> I would benchmark a default installation, then start tweaking. That way
> you can see if your changes result in improvements.
>
> To simplify things further, try using the tools/stress utility in the
> cassandra source distribution first. It's pretty simple to use.
>
> Add clients until you see the latency increase and tasks start to back up
> in nodetool tpstats. If you see it report dropped messages, it is
> overloaded.
>
> Hope that helps.
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 18/07/2012, at 4:48 AM, Code Box wrote:
>
> Thanks a lot for your reply guys. I was trying fsync = batch and window
> = 0 ms to see if the disk utilization is happening fully on my drive. I
> checked the numbers using iostat; the numbers were around 60% and the CPU
> usage was also not too high.
>
> Configuration of my setup:
>
> I have three m1.xlarge hosts, each having 15 GB RAM and 4 CPUs. Each has 8
> EC2 Compute Units.
> I have kept the replication factor equal to 3. The typical write size is
> 1 KB.
>
> I tried adding different nodes, each with 200 threads, and the throughput
> got split into two. If I do it from a single host with fsync set to
> Periodic and a window size equal to 1000 ms, using two nodes I am getting
> these numbers:
>
> [OVERALL], Throughput(ops/sec), 4771
> [INSERT], AverageLatency(us), 18747
> [INSERT], MinLatency(us), 1470
> [INSERT], MaxLatency(us), 446413
> [INSERT], 95thPercentileLatency(ms), 55
> [INSERT], 99thPercentileLatency(ms), 167
>
> [OVERALL], Throughput(ops/sec), 4678
> [INSERT], AverageLatency(us), 22015
> [INSERT], MinLatency(us), 1439
> [INSERT], MaxLatency(us), 466149
> [INSERT], 95thPercentileLatency(ms), 62
> [INSERT], 99thPercentileLatency(ms), 171
>
> Is there something I am doing wrong in my Cassandra setup? What is the best
> setup for Cassandra to get high throughput and good write latency numbers?
>
> On Tue, Jul 17, 2012 at 7:02 AM, Sylvain Lebresne
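A minimal sketch of the "distribute requests among nodes" advice from Manoj's reply (hypothetical class, not tied to any particular client library): a trivial round-robin host picker that a benchmark client could use instead of pinning every request to one coordinator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical round-robin coordinator selection for a benchmark client.
// Real clients (Hector, Astyanax) ship their own pooling and ring-aware
// routing; this only shows the idea of spreading load across the ring.
public class RoundRobinHosts {
    private final List<String> hosts;
    private final AtomicLong counter = new AtomicLong();

    public RoundRobinHosts(List<String> hosts) {
        this.hosts = new ArrayList<String>(hosts);
    }

    public String next() {
        // Each request goes to the next node instead of always hammering the same one.
        return hosts.get((int) (counter.getAndIncrement() % hosts.size()));
    }

    public static void main(String[] args) {
        RoundRobinHosts picker = new RoundRobinHosts(
                Arrays.asList("node01:9160", "node02:9160", "node03:9160"));
        for (int i = 0; i < 6; i++) {
            System.out.println("request " + i + " -> " + picker.next());
        }
    }
}
```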