Re: SSTable format
> > It depends on what partitioner you use. You should be using the
> > RandomPartitioner, and if so, the rows are sorted by the hash of the row
> > key. There are partitioners that sort based on the raw key value, but
> > these partitioners shouldn't be used as they have problems due to uneven
> > partitioning of data.

Any reason row keys are not stored by their raw keys on a given node for RP? I understand the partitioning across nodes should be randomized, but on a given node, why are they sorted by the hash of their keys and not just by the raw keys? What are we gaining by 'decorating' the keys with a random number? (ref. section 3 in http://wiki.apache.org/cassandra/ArchitectureSSTable)

-Thanks,
Prasenjit
Re: SSTable format
While in memory Cassandra calls it a MemTable, but yes, SSTables are write-once, and later combined with others into new ones through compaction.

On 07/13/2012 09:54 PM, Michael Theroux wrote:

> Thanks for the information. So is the SSTable essentially kept in memory,
> then sorted and written to disk on flush? After that point, an SSTable is
> not modified, but can be written to another SSTable through compaction?
>
> -Mike
>
> On Jul 13, 2012, at 8:22 PM, Rob Coli wrote:
>
>> On Fri, Jul 13, 2012 at 5:18 PM, Dave Brosius wrote:
>>
>>> It depends on what partitioner you use. You should be using the
>>> RandomPartitioner, and if so, the rows are sorted by the hash of the row
>>> key. There are partitioners that sort based on the raw key value, but
>>> these partitioners shouldn't be used as they have problems due to uneven
>>> partitioning of data.
>>
>> The formal way this works in the code is that SSTables are ordered by
>> "decorated" row key, where "decoration" is only a transformation when you
>> are not using OrderedPartitioner. FWIW, in case you see that
>> "DecoratedKey" syntax while reading code..
>>
>> =Rob
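As a rough sketch of that write path (toy Python, not Cassandra's actual code; `MemTable`, `flush`, and `compact` are illustrative names only):

```python
class MemTable:
    """Toy in-memory write buffer: row key -> {column: (timestamp, value)}."""
    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value, timestamp):
        # Updates happen in memory only; nothing on disk is touched.
        self.rows.setdefault(row_key, {})[column] = (timestamp, value)

    def flush(self):
        """Sort once at flush time and emit an immutable SSTable-like
        list of (row_key, columns) pairs in row-key order."""
        return [(key, self.rows[key]) for key in sorted(self.rows)]

def compact(*sstables):
    """Merge several sorted, immutable tables into one new sorted table,
    keeping the highest-timestamp version of each column."""
    merged = {}
    for sstable in sstables:
        for key, cols in sstable:
            row = merged.setdefault(key, {})
            for col, (ts, val) in cols.items():
                if col not in row or row[col][0] < ts:
                    row[col] = (ts, val)
    # The output is again a brand-new sorted, immutable table.
    return [(key, merged[key]) for key in sorted(merged)]
```

The point is that nothing is ever updated in place: sorting happens once, in memory, at flush time, and later "updates" are reconciled by compaction writing a brand-new file.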
Re: SSTable format
Thanks for the information. So is the SSTable essentially kept in memory, then sorted and written to disk on flush? After that point, an SSTable is not modified, but can be written to another SSTable through compaction?

-Mike

On Jul 13, 2012, at 8:22 PM, Rob Coli wrote:

> On Fri, Jul 13, 2012 at 5:18 PM, Dave Brosius wrote:
>
>> It depends on what partitioner you use. You should be using the
>> RandomPartitioner, and if so, the rows are sorted by the hash of the row
>> key. There are partitioners that sort based on the raw key value, but
>> these partitioners shouldn't be used as they have problems due to uneven
>> partitioning of data.
>
> The formal way this works in the code is that SSTables are ordered by
> "decorated" row key, where "decoration" is only a transformation when you
> are not using OrderedPartitioner. FWIW, in case you see that
> "DecoratedKey" syntax while reading code..
>
> =Rob
Re: SSTable format
On Fri, Jul 13, 2012 at 5:18 PM, Dave Brosius wrote:

> It depends on what partitioner you use. You should be using the
> RandomPartitioner, and if so, the rows are sorted by the hash of the row
> key. There are partitioners that sort based on the raw key value, but
> these partitioners shouldn't be used as they have problems due to uneven
> partitioning of data.

The formal way this works in the code is that SSTables are ordered by "decorated" row key, where "decoration" is only a transformation when you are not using OrderedPartitioner. FWIW, in case you see that "DecoratedKey" syntax while reading code..

=Rob

--
=Robert Coli
AIM&gt;ALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb
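The "decorated key" idea can be sketched in a few lines (a rough model, not Cassandra's actual classes; `decorate` and `md5_token` are names made up for this example — the real RandomPartitioner does use an MD5-based token, though the exact representation differs):

```python
import hashlib

def md5_token(raw_key: bytes) -> int:
    """A RandomPartitioner-style token: a large integer hash of the raw key."""
    return int.from_bytes(hashlib.md5(raw_key).digest(), "big")

def decorate(raw_key: bytes, ordered: bool = False):
    """A 'decorated' key sorts by its token. With an order-preserving
    partitioner the 'token' is effectively the raw key itself, so
    decoration is a no-op for ordering purposes."""
    token = raw_key if ordered else md5_token(raw_key)
    return (token, raw_key)

keys = [b"apple", b"banana", b"cherry"]
# Rows in an SSTable are stored in decorated-key order:
random_order = [k for _, k in sorted(decorate(k) for k in keys)]
byte_order = [k for _, k in sorted(decorate(k, ordered=True) for k in keys)]
# byte_order is alphabetical; random_order depends on the MD5 hashes,
# which is exactly what spreads keys evenly around the ring.
```

Sorting by token rather than raw key means on-disk order matches ring order, so a node's data for a token range is one contiguous span of its SSTables.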
Re: SSTable format
On 07/13/2012 08:00 PM, Michael Theroux wrote:

> Hello,
>
> I've been trying to understand in greater detail how SSTables are stored,
> and how information is transferred between Cassandra nodes, especially
> when a new node is joining a cluster. Specifically, is information stored
> to SSTables ordered by row keys? Some of the articles I've read suggest
> this is the case (although it's a little vague whether they actually mean
> that the columns are stored in order, rather than the row keys). However,
> if data is stored in row-key order, how is this achieved, as SSTables are
> immutable?
>
> Thanks for any insights,
> -Mike

It depends on what partitioner you use. You should be using the RandomPartitioner, and if so, the rows are sorted by the hash of the row key. There are partitioners that sort based on the raw key value, but these partitioners shouldn't be used as they have problems due to uneven partitioning of data.

As for how this is done, remember an SSTable doesn't hold all the data for a column family. Not only does the data for a column family exist on multiple servers, there are usually multiple SSTable files on disk that represent data from one column family on one machine. So at the time an SSTable is written, the rows that are to be put in it are sorted, and written in sorted order. In fact, the same row key may be written to multiple SSTables, one SSTable holding one set of columns for the key, another SSTable holding other columns for the same key. On a query for some row by key, Cassandra is responsible for finding which SSTables (potentially several) hold the columns and merging the results.
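The read-time merge described above — the same row key living in several SSTables, with the newest column version winning — can be sketched like this (toy structures, not Cassandra's APIs; column values carry timestamps for reconciliation):

```python
def read_row(row_key, sstables):
    """Collect the columns for one row from every SSTable that has it,
    keeping the newest (highest-timestamp) value per column name."""
    result = {}
    for sstable in sstables:  # each sstable: {row_key: {col: (ts, val)}}
        for col, (ts, val) in sstable.get(row_key, {}).items():
            if col not in result or result[col][0] < ts:
                result[col] = (ts, val)
    # Strip the timestamps for the caller.
    return {col: val for col, (ts, val) in result.items()}

# The same row key in two sstables, with different and overlapping columns:
sst_a = {"user:42": {"name": (10, "mike"), "city": (10, "boston")}}
sst_b = {"user:42": {"city": (20, "nyc"), "email": (20, "m@x.com")}}
print(read_row("user:42", [sst_a, sst_b]))
# → {'name': 'mike', 'city': 'nyc', 'email': 'm@x.com'}
```

Note how `city` comes from the newer SSTable while `name` survives from the older one; this per-column reconciliation is why immutable files can still present an up-to-date row.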
SSTable format
Hello,

I've been trying to understand in greater detail how SSTables are stored, and how information is transferred between Cassandra nodes, especially when a new node is joining a cluster. Specifically, is information stored to SSTables ordered by row keys? Some of the articles I've read suggest this is the case (although it's a little vague whether they actually mean that the columns are stored in order, rather than the row keys). However, if data is stored in row-key order, how is this achieved, as SSTables are immutable?

Thanks for any insights,
-Mike
Re: Increased replication factor not evident in CLI
I was able to apply the patch in the cited bug report to the public source for version 1.1.2. It seemed pretty straightforward; six lines in MigrationManager.java were switched from System.currentTimeMillis() to FBUtilities.timestampMicros(). I then re-built the project by running 'ant artifacts' in the cassandra root.

After I was up and running with the new version, I attempted to increase the replication factor, and then the compression options. Unfortunately, the new patch did not seem to help in my case. Neither of the schema attributes would change. Running a "describe cluster" shows that all node schemas are consistent.

Are there any other ways that I could potentially force Cassandra to accept these changes?

- .Dustin

On Jul 13, 2012, at 10:02 AM, Dustin Wenz wrote:

> It sounds plausible that is what we are running into. All of our nodes
> report a replication factor of 2 (both using describe, and show schema),
> even though the cluster reported that all schemas agree after I issued the
> change to 4.
>
> If this is related to the bug that you filed, it might also explain why
> I've had difficulty changing the compression options on this same cluster.
> I issue an update command, schemas agree, but yet the change is not
> evident.
>
> - .Dustin
>
> On Jul 12, 2012, at 7:56 PM, Michael Theroux wrote:
>
>> Sounds a lot like a bug that I hit that was filed and fixed recently:
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-4432
>>
>> -Mike
>>
>> On Jul 12, 2012, at 8:16 PM, Edward Capriolo wrote:
>>
>>> Possibly the bug with nanotime causing Cassandra to think the change
>>> happened in the past. Talked about on-list in the past few days.
>>>
>>> On Thursday, July 12, 2012, aaron morton wrote:
>>>
>>>> Do multiple nodes say the RF is 2? Can you show the output from the
>>>> CLI? Do show schema and show keyspace say the same thing?
>>>>
>>>> Cheers
>>>>
>>>> -
>>>> Aaron Morton
>>>> Freelance Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 13/07/2012, at 7:39 AM, Dustin Wenz wrote:
>>>>
>>>>> We recently increased the replication factor of a keyspace in our
>>>>> cassandra 1.1.1 cluster from 2 to 4. This was done by setting the
>>>>> replication factor to 4 in cassandra-cli, and then running a repair on
>>>>> each node. Everything seems to have worked; the commands completed
>>>>> successfully and disk usage increased significantly. However, if I
>>>>> perform a describe on the keyspace, it still shows
>>>>> replication_factor:2. So, it appears that the replication factor might
>>>>> be 4, but it reports as 2. I'm not entirely sure how to confirm one or
>>>>> the other.
>>>>>
>>>>> Since then, I've stopped and restarted the cluster, and even ran an
>>>>> upgradesstables on each node. The replication factor still doesn't
>>>>> report as I would expect. Am I missing something here?
>>>>>
>>>>> - .Dustin
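For background on why the millisecond-to-microsecond switch in that patch matters: schema changes are reconciled last-write-wins by timestamp, so two changes landing within the same millisecond can tie, and the later one may be silently dropped. A minimal illustration of that failure mode (hypothetical numbers and helper, not the actual CASSANDRA-4432 code):

```python
def last_write_wins(current, incoming):
    """Apply an update only if its timestamp is strictly newer;
    ties keep the existing value."""
    return incoming if incoming[0] > current[0] else current

# Two schema changes issued 500 microseconds apart:
t1_us, t2_us = 1342137600000000, 1342137600000500

# At millisecond precision the timestamps collide, so the second change
# loses the tie and is ignored:
ms = last_write_wins((t1_us // 1000, "RF=2"), (t2_us // 1000, "RF=4"))
# At microsecond precision the second change is strictly newer and wins:
us = last_write_wins((t1_us, "RF=2"), (t2_us, "RF=4"))
print(ms[1], us[1])  # → RF=2 RF=4
```

Finer-grained timestamps don't eliminate ties entirely, but they make rapid back-to-back schema changes far less likely to collide.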
2012 Cassandra MVP nominations
DataStax would like to recognize individuals who go above and beyond in their contributions to Apache Cassandra. To formalize this a little, we're creating an MVP program, the first round of which will be announced at the Cassandra Summit [1] in August.

To make this program a success, we need your help: nominate either yourself or someone else you think merits consideration. We're looking for people who take the initiative organizing user groups, who explain Cassandra in talks, blogs, Twitter, or other forums, or who answer questions on the mailing list, IRC, StackOverflow, etc.

Please take five minutes and submit your nomination today at [2]. Nominations will be open throughout the next week. Those selected will be notified in advance.

[1] http://www.datastax.com/events/cassandrasummit2012
[2] http://www.surveymonkey.com/s/WVBZGHR

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Cassandra Summit 2012
Hi all,

The 2012 Cassandra Summit will be in San Jose on August 8. The 2011 Summit sold out with almost 500 attendees; this year we found a bigger venue to accommodate 700+. It's fantastic to see the Cassandra community grow like this!

The 2012 Summit will have *four* talk tracks, plus the popular "Ask the Experts" breakout room where DataStax engineers will take any question, all day. Accepted talks are posted at http://www.datastax.com/events/cassandrasummit2012#Sessions, and speaker bios at http://www.datastax.com/events/cassandrasummit2012#Speakers. More abstracts will be posted as they are confirmed.

Learn more and register at http://www.datastax.com/events/cassandrasummit2012. Use the "cassandra-list-20" code when registering and save 20%!

P.S. Brandon Williams and I will be conducting a developer training course immediately before the Summit. More information at http://www.datastax.com/services/training

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Re: How to speed up data loading
Any chance your server has been running for the last two weeks with the leap second bug?

http://www.datastax.com/dev/blog/linux-cassandra-and-saturdays-leap-second-problem

-Tupshin

On Jul 12, 2012 1:43 PM, "Leonid Ilyevsky" wrote:

> I am loading a large set of data into a CF with a composite key. The load
> is going pretty slowly, hundreds or even thousands of times slower than it
> would in an RDBMS.
>
> I have a choice of how granular my physical key (the first component of
> the primary key) is; this way I can balance between smaller rows with too
> many keys vs. wide rows with fewer keys. What are the guidelines about
> this? How does the width of the physical row affect the speed of the load?
>
> I see that Cassandra is doing a lot of processing behind the scenes; even
> when I kill the client, the server keeps consuming a lot of CPU for a long
> time.
>
> What else should I look at? Anything in configuration?
>
> --
> This email, along with any attachments, is confidential and may be legally
> privileged or otherwise protected from disclosure. Any unauthorized
> dissemination, copying or use of the contents of this email is strictly
> prohibited and may be in violation of law. If you are not the intended
> recipient, any disclosure, copying, forwarding or distribution of this
> email is strictly prohibited and this email and any attachments should be
> deleted immediately. This email and any attachments do not constitute an
> offer to sell or a solicitation of an offer to purchase any interest in
> any investment vehicle sponsored by Moon Capital Management LP ("Moon
> Capital"). Moon Capital does not provide legal, accounting or tax advice.
> Any statement regarding legal, accounting or tax matters was not intended
> or written to be relied upon by any person as advice. Moon Capital does
> not waive confidentiality or privilege as a result of this email.
Re: Increased replication factor not evident in CLI
It sounds plausible that is what we are running into. All of our nodes report a replication factor of 2 (both using describe, and show schema), even though the cluster reported that all schemas agree after I issued the change to 4.

If this is related to the bug that you filed, it might also explain why I've had difficulty changing the compression options on this same cluster. I issue an update command, schemas agree, but yet the change is not evident.

- .Dustin

On Jul 12, 2012, at 7:56 PM, Michael Theroux wrote:

> Sounds a lot like a bug that I hit that was filed and fixed recently:
>
> https://issues.apache.org/jira/browse/CASSANDRA-4432
>
> -Mike
>
> On Jul 12, 2012, at 8:16 PM, Edward Capriolo wrote:
>
>> Possibly the bug with nanotime causing Cassandra to think the change
>> happened in the past. Talked about on-list in the past few days.
>>
>> On Thursday, July 12, 2012, aaron morton wrote:
>>
>>> Do multiple nodes say the RF is 2? Can you show the output from the
>>> CLI? Do show schema and show keyspace say the same thing?
>>>
>>> Cheers
>>>
>>> -
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 13/07/2012, at 7:39 AM, Dustin Wenz wrote:
>>>
>>>> We recently increased the replication factor of a keyspace in our
>>>> cassandra 1.1.1 cluster from 2 to 4. This was done by setting the
>>>> replication factor to 4 in cassandra-cli, and then running a repair on
>>>> each node.
>>>>
>>>> Everything seems to have worked; the commands completed successfully
>>>> and disk usage increased significantly. However, if I perform a
>>>> describe on the keyspace, it still shows replication_factor:2. So, it
>>>> appears that the replication factor might be 4, but it reports as 2.
>>>> I'm not entirely sure how to confirm one or the other.
>>>>
>>>> Since then, I've stopped and restarted the cluster, and even ran an
>>>> upgradesstables on each node. The replication factor still doesn't
>>>> report as I would expect. Am I missing something here?
>>>>
>>>> - .Dustin
Re: Cassandra and Tableau
Thank you Aaron and Brian. We're currently investigating several options. The Hadoop + Hive combo also seems a good choice, as our input files are flat. I'll keep you up to date about our final decision.

- Robin

2012/7/6 aaron morton:

> Here are two links I've noticed in my travels; I have not looked into
> what they offer.
>
> http://www.pentaho.com/big-data/nosql/cassandra/
> http://www.jaspersoft.com/bigdata
>
> Cheers
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 7/07/2012, at 3:03 AM, Brian O'Neill wrote:
>
>> Robin,
>>
>> We have the same issue right now. We use Tableau for all of our
>> reporting needs, but we couldn't find any acceptable bridge between it
>> and Cassandra. We ended up using cassandra-triggers to replicate the
>> data to Oracle:
>> https://github.com/hmsonline/cassandra-triggers/
>>
>> Let us know if you get things set up with a direct connection. We'd be
>> *very* interested in helping out if you find a way to do it.
>>
>> -brian
>>
>> On Fri, Jul 6, 2012 at 5:31 AM, Robin Verlangen wrote:
>>
>>> Hi there,
>>>
>>> Is there anyone out there who's using Tableau in combination with a
>>> Cassandra cluster? There seems to be no standard solution to connect;
>>> at least I couldn't find one. Does anyone know how to tackle this
>>> problem?
>>>
>>> With kind regards,
>>>
>>> Robin Verlangen
>>> Software engineer
>>>
>>> W http://www.robinverlangen.nl
>>> E ro...@us2.nl
>>>
>>> Disclaimer: The information contained in this message and attachments
>>> is intended solely for the attention and use of the named addressee
>>> and may be confidential. If you are not the intended recipient, you
>>> are reminded that the information remains the property of the sender.
>>> You must not use, disclose, distribute, copy, print or rely on this
>>> e-mail. If you have received this message in error, please contact the
>>> sender immediately and irrevocably delete this message and any copies.
>> --
>> Brian ONeill
>> Lead Architect, Health Market Science (http://healthmarketscience.com)
>> mobile: 215.588.6024
>> blog: http://weblogs.java.net/blog/boneill42/
>> blog: http://brianoneill.blogspot.com/

--
With kind regards,

Robin Verlangen
Software engineer

W http://www.robinverlangen.nl
E ro...@us2.nl
Never ending manual repair after adding second DC
Hello everyone,

I'm facing quite a weird problem with Cassandra since we've added a secondary DC to our cluster, and have totally run out of ideas; this email is a call for help/advice!

The history looks like:
- we used to have 4 nodes in a single DC
- running Cassandra 0.8.7
- RF:3
- around 50GB of data on each node
- RandomPartitioner and SimpleSnitch

All was working fine for over 9 months. A few weeks ago we decided we wanted to add another 4 nodes in a second DC and join them to the cluster. Prior to doing that, we upgraded Cassandra to 1.0.9, to push that out of the doors before the multi-DC work. After the upgrade, we left it working for over a week and it was all good; no issues. Then we added 4 additional nodes in another DC, bringing the cluster to 8 nodes in total spread across two DCs, so now we have:
- 8 nodes across 2 DCs, 4 in each DC
- a 100Mbps low-latency connection (sub 5ms) running over a Cisco ASA Site-to-Site VPN (which is ikev1 based)
- DC1:3,DC2:3 RFs
- RandomPartitioner and PropertyFileSnitch

nodetool ring now looks as follows:

$ nodetool -h localhost ring
Address         DC   Rack  Status  State   Load      Owns    Token
                                                            148873535527910577765226390751398592512
192.168.81.2    DC1  RC1   Up      Normal  37.9 GB   12.50%  0
192.168.81.3    DC1  RC1   Up      Normal  35.32 GB  12.50%  21267647932558653966460912964485513216
192.168.81.4    DC1  RC1   Up      Normal  39.51 GB  12.50%  42535295865117307932921825928971026432
192.168.81.5    DC1  RC1   Up      Normal  19.42 GB  12.50%  63802943797675961899382738893456539648
192.168.94.178  DC2  RC1   Up      Normal  40.72 GB  12.50%  85070591730234615865843651857942052864
192.168.94.179  DC2  RC1   Up      Normal  30.42 GB  12.50%  106338239662793269832304564822427566080
192.168.94.180  DC2  RC1   Up      Normal  30.94 GB  12.50%  127605887595351923798765477786913079296
192.168.94.181  DC2  RC1   Up      Normal  12.75 GB  12.50%  148873535527910577765226390751398592512

(Please ignore the fact that the nodes are not interleaved; they should be, however there's been a hiccup during the implementation phase. Unless *this* is the problem!)
Now, the problem: over 7 out of 10 manual repairs do not finish. They usually get stuck, showing 3 different symptoms:

1) Say node 192.168.81.2 runs a manual repair. It requests merkle trees from 192.168.81.2, 192.168.81.3, 192.168.81.5, 192.168.94.178, 192.168.94.179 and 192.168.94.181. It receives them from all of those except 192.168.94.181. 192.168.94.181's logs say that it has sent the merkle tree back, but it's never received by 192.168.81.2.

2) The same scenario, except 192.168.94.181's logs say *nothing* about a merkle tree being sent, and compactionstats doesn't show the tree being validated (generated) either.

3) The merkle trees are delivered, and nodes start sending data across to sync themselves. On certain occasions they get "stuck" streaming files between each other at 100% and won't move forward.

Now the interesting bit: the nodes that get stuck are always placed in different DCs!

Pretty much every scenario points towards a connectivity problem; however, we also have a few PostgreSQL replication streams, some other traffic, and quite a lot of monitoring happening over this connection, and none of those are affected in any way. Also, if random packets were being lost, I'd expect TCP to correct that (re-transmit them).

It doesn't matter whether it's a full manual repair or just a -pr repair; both end up pretty much the same. Has anyone come across this kind of issue before, or have any ideas how else I could investigate this?
The issue is pressing me massively, as this is our live cluster and I have to run repairs by hand (usually multiple times before one finally goes through) every single day… And I'm also not sure whether the cluster is being affected in some other way, BTW.

I've gone through the Jira issues and considered upgrading to 1.1.x, but I can't see anything that even looks like what is happening to my cluster. If any further information, like logs or configuration files, is needed, please let me know.

Any information, suggestions, advice - greatly appreciated.

Kind regards,
Bart
Re: Using a node in separate cluster without decommissioning.
Hi,

Just wanted to say that it worked. I also made sure to modify the thrift rpc_port and storage port so that the two clusters don't interfere. Thanks for the suggestion.

Thanks,
Rohit

On Thu, Jul 12, 2012 at 10:01 AM, aaron morton wrote:

>> Since replication factor is 2 in first cluster, I won't lose any data.
>
> Assuming you have been running repair or working at CL QUORUM (which is
> the same as CL ALL for RF 2).
>
>> Is it advisable and safe to go ahead?
>
> Um, so the plan is to turn off 2 nodes in the first cluster, re-task them
> into the new cluster and then reverse the process?
>
> If you simply turn two nodes off in the first cluster you will have
> reduced the availability for a portion of the ring. 25% of the keys will
> now have at best 1 node they can be stored on. If a node is having any
> sort of problems, and it is a replica for one of the down nodes, the
> cluster will appear down for 12.5% of the keyspace. If you work at QUORUM
> you will not have enough nodes available to write / read 25% of the keys.
>
> If you decommission the nodes, you will still have 2 replicas available
> for each key range. This is the path I would recommend.
>
> If you _really_ need to do it, what you suggest will probably work. Some
> tips:
>
> * Do safe shutdowns - nodetool disablegossip, disablethrift, drain.
> * Don't forget to copy the yaml file.
> * In the first cluster, the other nodes will collect hints for the first
>   hour the nodes are down. You are not going to want these, so disable HH.
> * Get the nodes back into the first cluster before gc_grace_seconds
>   expires.
> * Bring them back and repair them.
> * When you bring them back, reading at CL ONE will give inconsistent
>   results. Reading at QUORUM may result in a lot of repair activity.
>
> Hope that helps.
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 11/07/2012, at 6:35 AM, rohit bhatia wrote:
>
>> Hi,
>>
>> I want to take out 2 nodes from an 8-node cluster and use them in
>> another cluster, but can't afford the overhead of streaming the data and
>> rebalancing the cluster. Since the replication factor is 2 in the first
>> cluster, I won't lose any data.
>>
>> I'm planning to save my commit_log and data directories and bootstrap
>> the node in the second cluster. Afterwards I'll just replace both
>> directories and join the node back to the original cluster. This should
>> work since Cassandra saves all the cluster and schema info in the system
>> keyspace.
>>
>> Is it advisable and safe to go ahead?
>>
>> Thanks,
>> Rohit