Java GC pauses, reality check
Hello! From what I understand, Java GC pauses are pretty much a fact of life, but you can tune the JVM to reduce their frequency and length. When using Cassandra, how frequent or long have these pauses been known to be? Even with tuning, is it safe to assume they cannot be eliminated? Would a 20-30 second pause be out of the ordinary? Thanks.
what operations don't update materialized views?
Hi, Are there any operations that skip updating the materialized views?
store individual inventory items in a table, how to assign them correctly
Say I have 100 products in inventory. Instead of keeping a counter, I want to create 100 rows, one per inventory item. When someone purchases a product, how can I correctly assign that customer a product from inventory without any race conditions? Thanks.
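One common way to avoid handing the same item to two buyers is a conditional (compare-and-set) write: an item is claimed only if it has no owner yet. In Cassandra 2.0 and later this is roughly what a lightweight transaction such as `UPDATE ... IF owner = null` is for. The sketch below only simulates the idea in memory; the class `InventoryClaims` and its methods are hypothetical names for illustration, not a Cassandra API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for an "inventory item -> owner" table (hypothetical names).
final class InventoryClaims {
    // itemId -> customerId; an absent key means the item is still unsold.
    private final ConcurrentHashMap<String, String> ownerByItem = new ConcurrentHashMap<>();
    private final List<String> itemIds;

    InventoryClaims(List<String> itemIds) {
        this.itemIds = new ArrayList<>(itemIds);
    }

    // Try each item in turn. putIfAbsent is atomic, so only one caller can win a given item.
    Optional<String> claimFor(String customerId) {
        for (String itemId : itemIds) {
            if (ownerByItem.putIfAbsent(itemId, customerId) == null) {
                return Optional.of(itemId);
            }
        }
        return Optional.empty(); // every item already has an owner
    }

    String ownerOf(String itemId) {
        return ownerByItem.get(itemId);
    }
}
```

A losing caller simply moves on to the next item, so two customers never end up with the same row, at the cost of a retry when claims collide.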
RE: wide rows
Hi, Can someone clarify how you would model a "wide" row Cassandra table? From what I understand, a wide row table is one where you keep appending columns to a given row. The other way to model a table would be the "regular" style, where each row contains the data, so a SELECT returns multiple rows, as opposed to a wide row, where you get a single row but a subset of its columns. Can someone show a simple data model that compares both styles? Thanks.
understanding partitions and # of nodes
Hello, If you have a 10 node cluster, how does having 10 partitions or 100 partitions change how Cassandra will perform? With 10 partitions you will have 1 partition per node. With 100 partitions you will have 10 partitions per node. With 100 partitions I guess it helps because when you add more nodes to your cluster, the data can be redistributed since you have more nodes. What else should I consider? Thanks.
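The "1 partition per node" arithmetic assumes partitions are dealt out evenly, but placement is by hash of the partition key, so small counts tend to land unevenly. A minimal sketch of that effect, assuming an MD5 hash reduced modulo the node count (a simplification: a real cluster maps tokens onto ring ranges, and `PartitionSpread` and the `partition-N` keys are made up for illustration):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class PartitionSpread {
    // Hash a partition key with MD5 and map it to one of nodeCount buckets.
    static int nodeFor(String partitionKey, int nodeCount) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            BigInteger token = new BigInteger(1, digest); // non-negative 128-bit value
            return token.mod(BigInteger.valueOf(nodeCount)).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    // Count how many of partitionCount synthetic keys land on each node.
    static int[] spread(int partitionCount, int nodeCount) {
        int[] perNode = new int[nodeCount];
        for (int i = 0; i < partitionCount; i++) {
            perNode[nodeFor("partition-" + i, nodeCount)]++;
        }
        return perNode;
    }
}
```

Printing `java.util.Arrays.toString(PartitionSpread.spread(10, 10))` next to `spread(100, 10)` and `spread(100000, 10)` gives a feel for how the per-node share evens out as the number of partitions grows relative to the number of nodes.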
RE: no more zookeeper?
Does C* no longer use ZooKeeper? I don't see a reference to it in https://github.com/apache/cassandra/blob/trunk/build.xml If not, what replaced it?
Re: no more zookeeper?
Sorry guys, I am confusing things with HBase. But Nate's JIRA sure looks interesting, thanks. On Tue, Jan 28, 2014 at 12:25 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Some people had done some custom Cassandra ZooKeeper integration back in the day. Triggers, there is some reference in the original facebook thrown over the wall to zk. No official release has ever used zk directly, though people have suggested it. On Tue, Jan 28, 2014 at 12:08 PM, Andrey Ilinykh ailin...@gmail.com wrote: Why would Cassandra use ZooKeeper? On Tue, Jan 28, 2014 at 7:18 AM, S Ahmed sahmed1...@gmail.com wrote: Does C* no longer use ZooKeeper? I don't see a reference to it in https://github.com/apache/cassandra/blob/trunk/build.xml If not, what replaced it?
Re: Which of these VPS configurations would perform better for Cassandra ?
From what I understood, tons of people are running things on EC2, but it could be that the instance sizes are large enough to compare to a dedicated server (especially if you go with SSD, it is like 1K/month!) On Tue, Aug 6, 2013 at 3:54 AM, Aaron Morton aa...@thelastpickle.com wrote: how many nodes to start with(2 ok?) ? I'd recommend 3, that will give you some redundancy, see http://thelastpickle.com/2011/06/13/Down-For-Me/ Cheers - Aaron Morton Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 5/08/2013, at 1:41 AM, Rajkumar Gupta rajkumar@gmail.com wrote: okay, so what would be a workable VPS configuration to start with, and at minimum how many nodes to start with(2 ok?) ? Seriously I cannot afford the tensions of colocation setup. My hosting provider provides SSD drives with KVM virtualization.
Re: funnel analytics, how to query for reports etc.
Thanks Aaron. Too bad Rainbird isn't open sourced yet! On Tue, Jul 23, 2013 at 4:48 AM, aaron morton aa...@thelastpickle.com wrote: For background on rollup analytics: Twitter Rainbird http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011 Acunu http://www.acunu.com/ Cheers - Aaron Morton Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 22/07/2013, at 1:03 AM, Vladimir Prudnikov v.prudni...@gmail.com wrote: This can be done easily. Use a normal column family to store the sequence of events, where the key is a session ID identifying one user interaction with a website, column names are TimeUUID values, and the column value is the id of the event (do not write something like "user added product to shopping cart"; use something shorter identifying this event). Then you can use a counter column family to store counters. You can count anything: number of sessions, total number of events, number of particular events, etc. One row per day, for example. Then you can retrieve this row and calculate all the required percentages. On Sun, Jul 21, 2013 at 1:05 AM, S Ahmed sahmed1...@gmail.com wrote: Would cassandra be a good choice for creating a funnel analytics type product similar to mixpanel? e.g. You create a set of events and store them in cassandra for things like: event#1 user visited product page event#2 user added product to shopping cart event#3 user clicked on checkout page event#4 user filled out cc information event#5 user purchased product Now in my web application I track each user and store the events somehow in cassandra (in some column family etc) Now how will I pull a report that produces results like: 70% of people added to shopping cart 20% checkout page 10% filled out cc information 4% purchased the product And this is for a Saas, so this report would be for thousands of customers in theory. -- Vladimir Prudnikov
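The last step Vladimir describes, reading one day's counter row and turning it into percentages, can be sketched as below. This assumes the counters arrive as an ordered map of step name to count and that every percentage is taken relative to the first funnel step; `FunnelReport` and the step names are illustrative, not part of any Cassandra client.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FunnelReport {
    // Convert per-step counters into a percentage of the first step's count.
    // The input map must iterate in funnel order (e.g. a LinkedHashMap).
    static Map<String, Double> percentages(Map<String, Long> countsInFunnelOrder) {
        Map<String, Double> result = new LinkedHashMap<>();
        Long base = null;
        for (Map.Entry<String, Long> e : countsInFunnelOrder.entrySet()) {
            if (base == null) {
                base = e.getValue(); // the first step is the 100% baseline
            }
            double pct = base == 0 ? 0.0 : 100.0 * e.getValue() / base;
            result.put(e.getKey(), pct);
        }
        return result;
    }
}
```

With counters of 1000 page visits, 700 cart adds and 200 checkouts, this yields 100%, 70% and 20%, matching the shape of the report in the original question.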
high write load, with lots of updates, considerations? tombstoned data coming back to life
I was watching some videos from the C* summit 2013 and I recall many people saying that if you can come up with a design where you don't perform updates on rows, that would make things easier (I believe it was because there would be less compaction). When building an analytics (time series) app on top of C*, based on Twitter's Rainbird design ( http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011 ), this means there will be lots and lots of counters. With lots of counters (updates), admin-wise, what are some things to consider? Could old tombstoned data somehow come back to life? I forget what scenario brings back old data (kinda scary!).
funnel analytics, how to query for reports etc.
Would cassandra be a good choice for creating a funnel analytics type product similar to mixpanel? e.g. You create a set of events and store them in cassandra for things like: event#1 user visited product page event#2 user added product to shopping cart event#3 user clicked on checkout page event#4 user filled out cc information event#5 user purchased product Now in my web application I track each user and store the events somehow in cassandra (in some column family etc) Now how will I pull a report that produces results like: 70% of people added to shopping cart 20% checkout page 10% filled out cc information 4% purchased the product And this is for a Saas, so this report would be for thousands of customers in theory.
is there a key to sstable index file?
Since SSTables are immutable, and they are ordered, does this mean that there is an index of the key ranges that each SSTable holds, so that a key could map to 1 or more SSTables that have to be scanned, with the latest value then chosen? e.g. Say I write a value abc to CF1. This gets stored in an sstable. Then I write def to CF1, and this gets stored in another sstable eventually. Now when I go to fetch the value, it has to scan 2 sstables and then figure out which is the latest entry, correct? So is there an index of keys to sstables, and can there be 1 or more sstables per key? (This is assuming compaction hasn't occurred yet.)
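The read described in the question, consulting every sstable that holds the key and keeping the newest entry, can be sketched as follows. This is a toy model: each `Map` stands in for one sstable, `SSTableReadSketch` and `Cell` are made-up names, and a real node narrows the candidates with per-sstable structures (such as the -Filter.db and -Index.db files) rather than probing every table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SSTableReadSketch {
    // A stored value plus the write timestamp used to pick the winner.
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Each map stands in for one sstable: key -> cell.
    private final List<Map<String, Cell>> sstables = new ArrayList<>();

    // Flushing adds a new table; tables already written are never modified.
    void flush(Map<String, Cell> memtable) {
        sstables.add(new HashMap<>(memtable));
    }

    // Check every table that holds the key and keep the cell with the newest timestamp.
    String read(String key) {
        Cell newest = null;
        for (Map<String, Cell> table : sstables) {
            Cell c = table.get(key);
            if (c != null && (newest == null || c.timestamp > newest.timestamp)) {
                newest = c;
            }
        }
        return newest == null ? null : newest.value;
    }
}
```

Writing abc and then def for the same key leaves two tables holding that key until compaction merges them, and the read returns def because it carries the later timestamp.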
Re: does anyone store large values in cassandra e.g. 100kb?
So was the point of breaking into 36 parts to bring each row to the 64 or 128mb threshold? On Tue, Jul 9, 2013 at 3:18 AM, Theo Hultberg t...@iconara.net wrote: We store objects that are a couple of tens of K, sometimes 100K, and we store quite a few of these per row, sometimes hundreds of thousands. One problem we encountered early was that these rows would become so big that C* couldn't compact the rows in-memory and had to revert to slow two-pass compactions where it spills partially compacted rows to disk. we solved that in two ways, first by increasing in_memory_compaction_limit_in_mb from 64 to 128, and although it helped a little bit we quickly realized didn't have much effect because most of the time was taken up by really huge rows many times larger than that. We ended up implementing a simple sharding scheme where each row is actually 36 rows that each contain 1/36 of the range (we take the first letter in the column key and stick that on the row key on writes, and on reads we read all 36 rows -- 36 because there are 36 letters and numbers in the ascii alphabet and our column keys happen to distribute over that quite nicely). Cassandra works well with semi-large objects, and it works well with wide rows, but you have to be careful about the combination where rows get larger than 64 Mb. T# On Mon, Jul 8, 2013 at 8:13 PM, S Ahmed sahmed1...@gmail.com wrote: Hi Peter, Can you describe your environment, # of documents and what kind of usage pattern you have? On Mon, Jul 8, 2013 at 2:06 PM, Peter Lin wool...@gmail.com wrote: I regularly store word and pdf docs in cassandra without any issues. On Mon, Jul 8, 2013 at 1:46 PM, S Ahmed sahmed1...@gmail.com wrote: I'm guessing that most people use cassandra to store relatively smaller payloads like 1-5kb in size. Is there anyone using it to store say 100kb (1/10 of a megabyte) and if so, was there any tweaking or gotchas that you ran into?
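Theo's sharding scheme (prefix character of the column key appended to the row key on writes, all 36 shard rows read back on reads) might look roughly like this. The `:` separator, the lower-casing, and the `RowSharding` class are assumptions made for the sketch; the original message does not say how the shard character is attached.

```java
import java.util.ArrayList;
import java.util.List;

final class RowSharding {
    // 10 digits + 26 letters = 36 shard rows per logical row.
    private static final String SHARD_CHARS = "0123456789abcdefghijklmnopqrstuvwxyz";

    // Write path: the first character of the column key picks the shard row.
    static String shardedRowKey(String rowKey, String columnKey) {
        char first = Character.toLowerCase(columnKey.charAt(0));
        if (SHARD_CHARS.indexOf(first) < 0) {
            throw new IllegalArgumentException(
                    "column key must start with an ASCII letter or digit: " + columnKey);
        }
        return rowKey + ":" + first;
    }

    // Read path: the full logical row is the union of all 36 shard rows.
    static List<String> allShardRowKeys(String rowKey) {
        List<String> keys = new ArrayList<>(SHARD_CHARS.length());
        for (char c : SHARD_CHARS.toCharArray()) {
            keys.add(rowKey + ":" + c);
        }
        return keys;
    }
}
```

The scheme only balances well when column keys are spread fairly evenly over their first character, which Theo notes happened to be true for their data.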
does anyone store large values in cassandra e.g. 100kb?
I'm guessing that most people use cassandra to store relatively smaller payloads like 1-5kb in size. Is there anyone using it to store say 100kb (1/10 of a megabyte) and if so, was there any tweaking or gotchas that you ran into?
Re: does anyone store large values in cassandra e.g. 100kb?
Hi Peter, Can you describe your environment, # of documents and what kind of usage pattern you have? On Mon, Jul 8, 2013 at 2:06 PM, Peter Lin wool...@gmail.com wrote: I regularly store word and pdf docs in cassandra without any issues. On Mon, Jul 8, 2013 at 1:46 PM, S Ahmed sahmed1...@gmail.com wrote: I'm guessing that most people use cassandra to store relatively smaller payloads like 1-5kb in size. Is there anyone using it to store say 100kb (1/10 of a megabyte) and if so, was there any tweaking or gotchas that you ran into?
videos of 2013 summit
Hi, Are the videos online anywhere for the 2013 summit?
how to debug/trace
How can you possibly trace a read/write in Cassandra's codebase when it uses so many thread pools/executors? I'm just getting into threads, so I'm not too familiar with how one can trace things in debug mode in IntelliJ when various thread pools are processing things etc.
java lib used in cli to provide auto-completion
Hi folks, I'm curious what java lib is used to provide auto-completion in the cli? Or is it all custom code?
linux flavor?
Is there a particular Linux flavor that plays best with Cassandra? I believe the file system plays a big role also, any comments in this regard? Thanks.
Re: indexing rows ordered by int
So when using Redis, how do you go about updating the index? Do you serialize changes to the index i.e. when someone votes, you then update the index? Little confused as to how to go about updating a huge index. Say you have 1 million stores, and you want to order by the top votes, how would you maintain such an index since they are being constantly voted on. On Sun, Aug 15, 2010 at 10:48 PM, Chris Goffinet c...@chrisgoffinet.comwrote: Digg is using redis for such a feature as well. We use it on the MyNews - Top in 24 hours. Since we need timestamp ordering + sorting by how many friends touch a story. -Chris On Aug 15, 2010, at 7:34 PM, Benjamin Black wrote: http://code.google.com/p/redis/ On Sat, Aug 14, 2010 at 11:51 PM, S Ahmed sahmed1...@gmail.com wrote: For CF that I need to perform range scans on, I create separate CF that have custom ordering. Say a CF holds comments on a story (like comments on a reddit or digg story post) So if I need to order comments by votes, it seems I have to re-index every time someone votes on a comment (or batch it every x minutes). Right now I think I have to pull all the comments into memory, then sort by votes, then re-write the index. Are there any best-practises for this type of index?
indexing rows ordered by int
For CFs that I need to perform range scans on, I create separate CFs that have custom ordering. Say a CF holds comments on a story (like comments on a reddit or digg story post). So if I need to order comments by votes, it seems I have to re-index every time someone votes on a comment (or batch it every x minutes). Right now I think I have to pull all the comments into memory, then sort by votes, then re-write the index. Are there any best practices for this type of index?
why does it take 60-90 seconds for a new node to get up?
Why is it that, if you set AutoBootStrap = false, it takes 60-90 seconds for the node to announce itself? I just want to understand what is going on during that time, and why that specific timeframe (if there is a reason).
Re: Question on nodetool ring
That's the token range, so node#1 is from 1600.. to 429.. and node#2 is from 429... to 1600... Hopefully others can chime in to confirm. On Mon, Aug 9, 2010 at 12:30 PM, Mark static.void@gmail.com wrote: I'm running a 2 node cluster and when I run nodetool ring I get the following output:
Address     Status  State   Load      Token
                                      160032583171087979418578389981025646900
127.0.0.1   Up      Normal  42.28 MB  42909338385373526599163667549814010691
127.0.0.2   Up      Normal  42.26 MB  160032583171087979418578389981025646900
The columns/values are pretty much self explanatory except for the first line. What is this value? Thanks
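To make the range reading concrete: each node is responsible for tokens greater than the previous node's token and up to and including its own, and anything past the highest token wraps around to the node with the lowest token. A small sketch using the two tokens from the output above (`TokenRing` is an illustrative class, not Cassandra code):

```java
import java.math.BigInteger;
import java.util.Map;
import java.util.TreeMap;

final class TokenRing {
    private final TreeMap<BigInteger, String> nodeByToken = new TreeMap<>();

    void addNode(String address, String token) {
        nodeByToken.put(new BigInteger(token), address);
    }

    // A node owns the range (previous token, its own token]; tokens above the
    // highest node token wrap around to the node with the lowest token.
    String ownerOf(BigInteger keyToken) {
        if (nodeByToken.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<BigInteger, String> e = nodeByToken.ceilingEntry(keyToken);
        return e != null ? e.getValue() : nodeByToken.firstEntry().getValue();
    }
}
```

With 127.0.0.1 at 429... and 127.0.0.2 at 1600..., a token just above 1600... or just above zero belongs to 127.0.0.1, which is why its range is described as running "from 1600.. to 429..".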
Re: Question on nodetool ring
b/c node#1 has a start and end range, so you can see the boundaries for each node by looking at the last column. On Mon, Aug 9, 2010 at 4:12 PM, Mark static.void@gmail.com wrote: On 8/9/10 12:51 PM, S Ahmed wrote: that's the token range so node#1 is from 1600.. to 429.. node#2 is from 429... to 1600... hopefully others can chime into confirm. On Mon, Aug 9, 2010 at 12:30 PM, Mark static.void@gmail.com wrote: I'm running a 2 node cluster and when I run nodetool ring I get the following output Address Status State LoadToken 160032583171087979418578389981025646900 127.0.0.1 Up Normal 42.28 MB 42909338385373526599163667549814010691 127.0.0.2 Up Normal 42.26 MB 160032583171087979418578389981025646900 The columns/values are pretty much self explanatory except for the first line. What is this value? Thanks I was just wondering why the 160032583171087979418578389981025646900 token is on a line by itself and listed under 127.0.0.2.
Re: Growing commit log directory.
if your commit logs are not getting cleared, doesn't that indicate your load is more than your servers can handle? On Mon, Aug 9, 2010 at 4:50 PM, Edward Capriolo edlinuxg...@gmail.comwrote: I have a 16 node 6.3 cluster and two nodes from my cluster are giving me major headaches. 10.71.71.56 Up 58.19 GB 10827166220211678382926910108067277| ^ 10.71.71.61 Down 67.77 GB 123739042516704895804863493611552076888v | 10.71.71.66 Up 43.51 GB 127605887595351923798765477786913079296| ^ 10.71.71.59 Down 90.22 GB 139206422831293007780471430312996086499v | 10.71.71.65 Up 22.97 GB 148873535527910577765226390751398592512| ^ The symptoms I am seeing are nodes 61 and nodes 59 have huge 6 GB + commit log directories. They keep growing, along with memory usage, eventually the logs start showing GCInspection errors and then the nodes will go OOM INFO 14:20:01,296 Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1281378001296.log INFO 14:20:02,199 GC for ParNew: 327 ms, 57545496 reclaimed leaving 7955651792 used; max is 9773776896 INFO 14:20:03,201 GC for ParNew: 443 ms, 45124504 reclaimed leaving 8137412920 used; max is 9773776896 INFO 14:20:04,314 GC for ParNew: 438 ms, 54158832 reclaimed leaving 8310139720 used; max is 9773776896 INFO 14:20:05,547 GC for ParNew: 409 ms, 56888760 reclaimed leaving 8480136592 used; max is 9773776896 INFO 14:20:06,900 GC for ParNew: 441 ms, 58149704 reclaimed leaving 8648872520 used; max is 9773776896 INFO 14:20:08,904 GC for ParNew: 462 ms, 59185992 reclaimed leaving 8816581312 used; max is 9773776896 INFO 14:20:09,973 GC for ParNew: 460 ms, 57403840 reclaimed leaving 8986063136 used; max is 9773776896 INFO 14:20:11,976 GC for ParNew: 447 ms, 59814376 reclaimed leaving 9153134392 used; max is 9773776896 INFO 14:20:13,150 GC for ParNew: 441 ms, 61879728 reclaimed leaving 9318140296 used; max is 9773776896 java.lang.OutOfMemoryError: Java heap space Dumping heap to java_pid10913.hprof ... 
INFO 14:22:30,620 InetAddress /10.71.71.66 is now dead. INFO 14:22:30,621 InetAddress /10.71.71.65 is now dead. INFO 14:22:30,621 GC for ConcurrentMarkSweep: 44862 ms, 261200 reclaimed leaving 9334753480 used; max is 9773776896 INFO 14:22:30,621 InetAddress /10.71.71.64 is now dead. Heap dump file created [12730501093 bytes in 253.445 secs] ERROR 14:28:08,945 Uncaught exception in thread Thread[Thread-2288,5,main] java.lang.OutOfMemoryError: Java heap space at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71) ERROR 14:28:08,948 Uncaught exception in thread Thread[Thread-2281,5,main] java.lang.OutOfMemoryError: Java heap space at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71) INFO 14:28:09,017 GC for ConcurrentMarkSweep: 33737 ms, 85880 reclaimed leaving 9335215296 used; max is 9773776896 Does anyone have any ideas what is going on?
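The GC lines in this log already tell most of the story: the "used" figure climbs toward "max" and the ConcurrentMarkSweep runs reclaim almost nothing. A small helper for pulling those numbers out of such lines, assuming exactly the message format shown above (`GcLogLine` is a hypothetical name, and the pattern is inferred from these log lines rather than taken from Cassandra's source):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GcLogLine {
    // Groups: 1 = collector, 2 = pause ms, 3 = bytes reclaimed, 4 = bytes used, 5 = max heap bytes.
    private static final Pattern GC = Pattern.compile(
            "GC for (\\w+): (\\d+) ms, (\\d+) reclaimed leaving (\\d+) used; max is (\\d+)");

    // Heap occupancy after the collection as a percentage, or -1 if this is not a GC line.
    static double occupancyPercent(String line) {
        Matcher m = GC.matcher(line);
        if (!m.find()) {
            return -1;
        }
        long used = Long.parseLong(m.group(4));
        long max = Long.parseLong(m.group(5));
        return 100.0 * used / max;
    }

    // Pause length in milliseconds, or -1 if this is not a GC line.
    static long pauseMillis(String line) {
        Matcher m = GC.matcher(line);
        return m.find() ? Long.parseLong(m.group(2)) : -1;
    }
}
```

Run over the lines above, the 44862 ms ConcurrentMarkSweep entry comes out at roughly 95% heap occupancy after collection, i.e. a long pause that freed next to nothing.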
explanation of generated files and ops
In /var/lib/cassandra there is:
/data/system
    LocationInfo-4-Data.db
    LocationInfo-4-Filter.db
    LocationInfo-4-Index.db
    ..
    ..
/data/Keyspace1/
    Standard2-2-Data.db
    Standard2-2-Filter.db
    Standard2-2-Index.db
/commitlog
    CommitLog-timestamp.log
and in /var/log/cassandra:
    system.log
Is this pretty much all the files that Cassandra generates? (Have I missed any?) Are there any common administrative tasks that admins might need to perform on these files? What exactly is stored in the -Filter.db files?
cassandra summit, making videos?
Will there be videos of the session at the Cassandra Summit in SF? I am really interested in the Cassandra codebase/internals seminar.
Re: Estimated release for Cassandra 0.6.4
So is it a good estimate to give about 1 month per +.1 release? i.e. 7.0 should be around October/November? (btw great work, keep it up!) On Wed, Jul 21, 2010 at 12:15 AM, CassUser CassUser cassu...@gmail.comwrote: Thanks Eric. On Tue, Jul 20, 2010 at 8:14 PM, Eric Evans eev...@rackspace.com wrote: On Tue, 2010-07-20 at 13:53 -0700, CassUser CassUser wrote: Is there a release date (or approximate date) for cassandra 0.6.4. We are mainly concerned about the Cassandra-1042 patch. The reason we don't simply apply the patch is because since we are shipping a product which interacts with the cassandra server (and the patch is server side), the customer would feel better if it was in a stable release. Just trying to get an idea from the Cassandra guys on their plans :) We've been working on a monthly cadence for stable releases, so sometime in the next couple of weeks. -- Eric Evans eev...@rackspace.com
Re: Cassandra benchmarking on Rackspace Cloud
I'm reading this thread and I am a little lost. What should the expected behavior be? Should it maintain 53K regardless of nodes?
nodes  reads/sec
1      53,000
2      37,000
4      37,000
I ran this test previously on the cloud, with similar results:
nodes  reads/sec
1      24,000
2      21,000
3      21,000
4      21,000
5      21,000
6      21,000
On Mon, Jul 19, 2010 at 2:02 PM, David Schoonover david.schoono...@gmail.com wrote: Multiple client processes, or multiple client machines? I ran it with both one and two client machines making requests, and ensured the sum of the request threads across the clients was 50. That was on the cloud. I am re-running the multi-host test against the 4-node cluster on dedicated hardware now to ensure that result was not an artifact of the cloud. David Schoonover On Jul 19, 2010, at 1:38 PM, Jonathan Ellis wrote: On Mon, Jul 19, 2010 at 12:30 PM, David Schoonover david.schoono...@gmail.com wrote: How many physical client machines are running stress.py? One with 50 threads; it is remote from the cluster but within the same DC in both cases. I also ran the test with multiple clients and saw similar results when summing the reqs/sec. Multiple client processes, or multiple client machines? -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Newbie to cassandra
Read the wiki, read about NoSQL in general. Download and install it, play with it. Browse the source code. Read the Bigtable paper by Google and the Dynamo paper by Amazon. On Sun, Jul 18, 2010 at 2:46 PM, sonia gehlot sonia.geh...@gmail.com wrote: Hi everyone, I am new to Cassandra and wanted to try and start learning Cassandra. I have a database background. I am fully exposed to and have full command of Netezza, Oracle, MySQL, Sybase, SQL etc, basically all the relational databases. As Cassandra is gaining popularity day by day for its amazing features, I also got tempted towards it and wanted to take a deep dive into it. Please help me by guiding me in the right direction. How can I start working with Cassandra? Any help is appreciated. Thanks in advance. Sonia
Re: key types and grouping related rows together
Well I'm not talking about a specific column family here, as ALL my column families will have content that is specific to a certain website, so I need a strategy that I will use on almost all my column families. On Wed, Jul 14, 2010 at 9:20 PM, Schubert Zhang zson...@gmail.com wrote: for your apps, how about this schema: key: website1123 columnName: UserID ... On Thu, Jul 15, 2010 at 6:13 AM, Aaron Morton aa...@thelastpickle.comwrote: The key structure you have should group the keys based on the website There are some differences between range queries with RP and OPP this article may help http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/ Aaron On 15 Jul, 2010,at 08:44 AM, S Ahmed sahmed1...@gmail.com wrote: Where is the link that describes the various key types and their impact on sorting? (I believe I read it before, can't seem to find it now). So my application supports multi-tenants, so I need the keys to represent things like: website1123 + contentID or website3454 + userID And for range queries, these keys have to be grouped together obviously. What key type would be best suited for this? I might have to create a CF that maps the website and its key prefix?
Re: key types and grouping related rows together
Do you think a composite key using a key type of Bytes would work? How many bytes can it be?

    // Bytes appears to be HBase's org.apache.hadoop.hbase.util.Bytes
    // (an assumption; the original message does not say).
    // Builds a row key: 4 bytes of website id followed by 8 bytes of timestamp.
    public static byte[] createRowKey(int websiteid, long stamp) throws Exception {
        byte[] websiteidBytes = Bytes.toBytes(websiteid);
        byte[] stampBytes = Bytes.toBytes(stamp);
        return Bytes.add(websiteidBytes, stampBytes);
    }

So say this key is used in a ColumnFamily that stores Articles for all websites. Using a key like this would allow me to get a range of articles written, ordered by date, for a specific website, correct? On Thu, Jul 15, 2010 at 9:38 AM, S Ahmed sahmed1...@gmail.com wrote: Well I'm not talking about a specific column family here, as ALL my column families will have content that is specific to a certain website, so I need a strategy that I will use on almost all my column families. On Wed, Jul 14, 2010 at 9:20 PM, Schubert Zhang zson...@gmail.com wrote: for your apps, how about this schema: key: website1123 columnName: UserID ... On Thu, Jul 15, 2010 at 6:13 AM, Aaron Morton aa...@thelastpickle.com wrote: The key structure you have should group the keys based on the website There are some differences between range queries with RP and OPP this article may help http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/ Aaron On 15 Jul, 2010, at 08:44 AM, S Ahmed sahmed1...@gmail.com wrote: Where is the link that describes the various key types and their impact on sorting? (I believe I read it before, can't seem to find it now). So my application supports multi-tenants, so I need the keys to represent things like: website1123 + contentID or website3454 + userID And for range queries, these keys have to be grouped together obviously. What key type would be best suited for this? I might have to create a CF that maps the website and its key prefix?
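For reference, the same composite key can be built with only the JDK, which also makes it easy to check the ordering claim. This is a sketch under two assumptions: the store compares keys as unsigned bytes, and ids and timestamps are non-negative (negative values would sort after positive ones in this encoding). `CompositeRowKey` is an illustrative name.

```java
import java.nio.ByteBuffer;

final class CompositeRowKey {
    // 4-byte website id followed by an 8-byte timestamp, both big-endian
    // (ByteBuffer's default byte order).
    static byte[] create(int websiteId, long timestamp) {
        return ByteBuffer.allocate(Integer.BYTES + Long.BYTES)
                .putInt(websiteId)
                .putLong(timestamp)
                .array();
    }

    // Unsigned lexicographic comparison, the order a byte-ordered key space would use.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

Under those assumptions all keys for one website are contiguous and ordered by timestamp, so a range from `create(id, start)` to `create(id, end)` covers exactly that website's articles in that time window.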
Re: key types and grouping related rows together
Benjamin, Ah, thanks for clarifying that. key sorting is changing in .7 I believe to support a binary array? On Thu, Jul 15, 2010 at 3:26 PM, Benjamin Black b...@b3k.us wrote: Keys are always sorted (in 0.6) as UTF8 strings. The CompareWith applies to _columns_ within rows, _not_ to row keys. On Wed, Jul 14, 2010 at 1:44 PM, S Ahmed sahmed1...@gmail.com wrote: Where is the link that describes the various key types and their impact on sorting? (I believe I read it before, can't seem to find it now). So my application supports multi-tenants, so I need the keys to represent things like: website1123 + contentID or website3454 + userID And for range queries, these keys have to be grouped together obviously. What key type would be best suited for this? I might have to create a CF that maps the website and its key prefix?
Re: key types and grouping related rows together
Given a CF like:
Articles : {
    key1 : { title: some title, body: this is my article body... },
    key2 : { title: some title, body: this is my article body... }
}
Now these articles could be for different websites e.g. www.website1.com, www.website2.com If I want to get the latest 10 articles for a given website, how would I formulate my key to achieve this? I basically need to understand how to handle multi-tenancy, b/c I will need to do this for almost all my CFs. I'm a little stuck here so guidance would be great! On Thu, Jul 15, 2010 at 4:01 PM, S Ahmed sahmed1...@gmail.com wrote: Benjamin, Ah, thanks for clarifying that. Key sorting is changing in .7 I believe to support a binary array? On Thu, Jul 15, 2010 at 3:26 PM, Benjamin Black b...@b3k.us wrote: Keys are always sorted (in 0.6) as UTF8 strings. The CompareWith applies to _columns_ within rows, _not_ to row keys. On Wed, Jul 14, 2010 at 1:44 PM, S Ahmed sahmed1...@gmail.com wrote: Where is the link that describes the various key types and their impact on sorting? (I believe I read it before, can't seem to find it now). So my application supports multi-tenants, so I need the keys to represent things like: website1123 + contentID or website3454 + userID And for range queries, these keys have to be grouped together obviously. What key type would be best suited for this? I might have to create a CF that maps the website and its key prefix?
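Since keys in 0.6 sort as UTF8 strings, one possible key layout for "latest 10 articles per website" is the website prefix followed by an inverted, zero-padded timestamp, so that newer articles sort first within a tenant. The sketch below uses a `TreeMap` as a stand-in for an ordered key space (which in Cassandra would need an order-preserving partitioner); `TenantArticleKeys`, the `:` separator and the 19-digit padding are choices made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class TenantArticleKeys {
    // Keys sort as strings, so the timestamp is inverted and zero-padded:
    // newer articles get smaller suffixes and come first in a range scan.
    // Assumes a non-negative timestamp.
    static String key(String website, long timestampMillis) {
        return website + ":" + String.format(Locale.ROOT, "%019d", Long.MAX_VALUE - timestampMillis);
    }

    // Scan keys starting at the tenant prefix; stop after `limit` rows
    // or as soon as a key no longer carries the prefix.
    static List<String> latest(TreeMap<String, String> articlesByKey, String website, int limit) {
        String prefix = website + ":";
        List<String> titles = new ArrayList<>();
        for (Map.Entry<String, String> e : articlesByKey.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix) || titles.size() == limit) {
                break;
            }
            titles.add(e.getValue());
        }
        return titles;
    }
}
```

The same prefix convention can be reused across column families, which addresses the multi-tenancy part of the question: every range scan starts at the tenant's prefix and stops when the prefix changes.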
Re: NYC Cassandra training
How will we load the VM on our machines? Do we download it ? Is it running Ubuntu? On Wed, Jul 14, 2010 at 11:11 AM, Jonathan Ellis jbel...@gmail.com wrote: Turns out we can get a list from Eventbrite: http://www.eventbrite.com/org/474011012?s=1926097 On Tue, Jul 13, 2010 at 3:09 PM, Jonathan Ellis jbel...@gmail.com wrote: On Fri, Jul 9, 2010 at 9:36 AM, Jeremy Dunck jdu...@gmail.com wrote: On Fri, Jul 2, 2010 at 1:08 PM, Jonathan Ellis jbel...@gmail.com wrote: Riptano's one day Cassandra training is coming to NYC in August, our first public session on the East coast: http://www.eventbrite.com/event/749518831 Is there a calendar where you're listing this stuff, or is it just tweets and mail messages about individual events at this point? We are working on getting a calendar up on our web site, but for now it is just the mailing list here. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
key types and grouping related rows together
Where is the link that describes the various key types and their impact on sorting? (I believe I read it before, can't seem to find it now). So my application supports multi-tenants, so I need the keys to represent things like: website1123 + contentID or website3454 + userID And for range queries, these keys have to be grouped together obviously. What key type would be best suited for this? I might have to create a CF that maps the website and its key prefix?
Re: advice, is cassandra suitable for a multi-tenancy vBulletin type application?
The only issue I see (please correct me if I am wrong) is that you now have single points of failure in the system, i.e. Redis etc. On Tue, Jul 13, 2010 at 3:33 AM, Sandeep Kalidindi at PaGaLGuY.com sandeep.kalidi...@pagalguy.com wrote: @michael - benjamin answered your question. Thing is, if you use MySQL just for indices you are not at all using the benefits of the whole relational database engine (which is fine) but then are inheriting all its disadvantages. You can use MySQL for storing indices and then write your own sharding layer on top and then make sure network partitions are taken care of and then.. oh wait, you are already starting to create a poor man's Cassandra on top of MySQL. Why not just use Cassandra? One valid argument can be that MySQL is solid in stability whereas Cassandra has yet to prove it is rock solid. But then the 0.7 release looks awesome. There are some really wonderful people developing Cassandra and they are here to answer most of your questions, and if you still need help there is Riptano (and jonathan ellis is one hell of a person to discuss your infra issues with). Cheers, Deepu. On Tue, Jul 13, 2010 at 12:17 PM, Benjamin Black b...@b3k.us wrote: On Mon, Jul 12, 2010 at 11:35 PM, Michael Dürgner mich...@duergner.de wrote: The thing about slow on joins is true (we experience that ourselves) but still I wonder myself, why you use cassandra for the indices. Can't you just store them in MySQL although? ...and then shard and shard and shard to deal with hundreds of millions or billions of rows? That's usually the trade-off. Both can be made to work, but neither is free. b
Re: advice, is cassandra suitable for a multi-tenancy vBulletin type application?
Very interesting! What kind of integration do you have between vB and Cassandra? It's not a port then? On Mon, Jul 12, 2010 at 3:34 AM, Sandeep Kalidindi at PaGaLGuY.com sandeep.kalidi...@pagalguy.com wrote: We were one of the vBulletin customers and our forums have been facing some bad scaling issues. We coded our forum software to work with Cassandra. We are still testing for bugs and might go live in a couple of weeks. You can ask any specific questions about vBulletin and Cassandra and I will answer to the best of my knowledge. In our case a combination of Cassandra and Redis took care of most of the functionality that vBulletin offers and much more. Cheers, Deepu. On Mon, Jul 12, 2010 at 9:58 AM, Paul Prescod pres...@gmail.com wrote: On Sun, Jul 11, 2010 at 8:39 AM, S Ahmed sahmed1...@gmail.com wrote: I want to build a vBulletin type application (forums, threads, posts, user management, etc). Support multi-tenancy for a Saas type environment. Would Cassandra be suitable for this type of application? Thanks in advance. Most likely, it is technically a fine fit. But Cassandra is very early stage software, so you should expect that the documentation will not always be clear and things will change from version to version. If you are not extremely self-reliant, you may find it a frustrating experience. Unless you are confident you will have trouble scaling traditional technologies, it might not make business sense. Paul Prescod
Re: server needs thrift to run also?
Confused: why does the installation guide say to build and make it then? http://github.com/ericflo/twissandra Twissandra is for 0.6.1, is that why? I.e. it was embedded in a later version? On Mon, Jul 12, 2010 at 4:46 PM, Stu Hood stu.h...@rackspace.com wrote: The Thrift server is embedded in Cassandra, and starts by default. Look for references to Thrift on: http://wiki.apache.org/cassandra/GettingStarted Thanks, Stu -Original Message- From: S Ahmed sahmed1...@gmail.com Sent: Monday, July 12, 2010 3:43pm To: user@cassandra.apache.org Subject: server needs thrift to run also? I'm trying to follow along the twissandra installation instructions. So to get it running I have to install Thrift. So thrift runs as another service? So communication is done via thrift, which then communicates to Cassandra on another port?
Re: server needs thrift to run also?
Ok, I guess I have to read up on exactly what is going on here. I figured I could download twissandra, fire up cassandra and run the app! I thought all you needed was the python driver which comes with twissandra. Let me read more about Thrift and generating client code etc. Thanks! On Mon, Jul 12, 2010 at 5:04 PM, Michael Pearson mjpear...@gmail.com wrote: Twissandra is packaged with pycassa + correct generated thrift transports under /deps already, so really just need the thrift binary to build from a cassandra.thrift API newer than what's currently supported by the bundled pycassa. -michael On Mon, Jul 12, 2010 at 1:55 PM, Stu Hood stu.h...@rackspace.com wrote: You'll need Thrift installed to generate the _client_ code: the server code is embedded within Cassandra. -Original Message- From: S Ahmed sahmed1...@gmail.com Sent: Monday, July 12, 2010 3:49pm To: user@cassandra.apache.org Subject: Re: server needs thrift to run also? Confused: why does the installation guide say to build and make it then? http://github.com/ericflo/twissandra Twissandra is for 0.6.1, is that why? I.e. it was embedded in a later version? On Mon, Jul 12, 2010 at 4:46 PM, Stu Hood stu.h...@rackspace.com wrote: The Thrift server is embedded in Cassandra, and starts by default. Look for references to Thrift on: http://wiki.apache.org/cassandra/GettingStarted Thanks, Stu -Original Message- From: S Ahmed sahmed1...@gmail.com Sent: Monday, July 12, 2010 3:43pm To: user@cassandra.apache.org Subject: server needs thrift to run also? I'm trying to follow along the twissandra installation instructions. So to get it running I have to install Thrift. So thrift runs as another service? So communication is done via thrift, which then communicates to Cassandra on another port?
Re: advice, is cassandra suitable for a multi-tanency vBulletin type application?
What sort of traffic levels made you port the application to Cassandra? Very interested in seeing this go live. What sort of server setup are you looking at using? On Mon, Jul 12, 2010 at 4:39 PM, Sandeep Kalidindi at PaGaLGuY.com sandeep.kalidi...@pagalguy.com wrote: No, we re-coded from scratch with most of the needed functionality. Cheers, Deepu. On Mon, Jul 12, 2010 at 7:49 PM, S Ahmed sahmed1...@gmail.com wrote: Very interesting! What kind of integration do you have between vB and Cassandra? It's not a port then? On Mon, Jul 12, 2010 at 3:34 AM, Sandeep Kalidindi at PaGaLGuY.com sandeep.kalidi...@pagalguy.com wrote: we were one of the vbulletin customers and our forums have been facing some bad scaling issues. we coded our forum software to work with cassandra. we are still testing for bugs and might go live in a couple of weeks. You can ask any specific questions about vbulletin and cassandra and i will answer to the best of my knowledge. In our case a combination of cassandra and redis took care of most of the functionality that vbulletin offers and much more. Cheers, Deepu. On Mon, Jul 12, 2010 at 9:58 AM, Paul Prescod pres...@gmail.com wrote: On Sun, Jul 11, 2010 at 8:39 AM, S Ahmed sahmed1...@gmail.com wrote: I want to build a vBulletin type application (forums, threads, posts, user management, etc). Support multi-tenancy for a Saas type environment. Would Cassandra be suitable for this type of application? Thanks in advance. Most likely, it is technically a fine fit. But Cassandra is very early stage software, so you should expect that the documentation will not always be clear and things will change from version to version. If you are not extremely self-reliant, you may find it a frustrating experience. Unless you are confident you will have trouble scaling traditional technologies, it might not make business sense. Paul Prescod
advice, is cassandra suitable for a multi-tanency vBulletin type application?
I want to build a vBulletin type application (forums, threads, posts, user management, etc). Support multi-tenancy for a Saas type environment. Would Cassandra be suitable for this type of application? Thanks in advance.
Re: NYC Cassandra training
My previous reply seemed to have bounced. Will there be a training day before/after the Cassandra Summit (in SF on the 10th)? On Fri, Jul 2, 2010 at 2:08 PM, Jonathan Ellis jbel...@gmail.com wrote: Riptano's one day Cassandra training is coming to NYC in August, our first public session on the East coast: http://www.eventbrite.com/event/749518831 We have also nailed down our next locations, although registration is not yet open: Denver in September and Seattle in October. See you there! -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Digg 4 Preview on TWiT
Agreed, what exactly did they replace it with. On Sun, Jul 4, 2010 at 8:14 AM, Bill de hÓra b...@dehora.net wrote: On Mon, 2010-06-28 at 11:51 -0500, Eric Evans wrote: On Mon, 2010-06-28 at 07:53 -0700, Kochheiser,Todd W - TOK-DITT-1 wrote: On a related but separate note: While I am fairly new to Cassandra and have only been following the mailing lists for a few months, the conversation with Kevin Rose on TWiT made me curious if the versions of Cassandra that Digg, Twitter, and Facebook are using may end up being forks of the Apache project or old versions. Facebook and Apache have diverged (technically we're the fork). To the best of my knowledge, this has always been the case. This person's understanding is that Facebook 'no longer contributes to nor uses Cassandra.': http://redmonk.com/sogrady/2010/05/17/beyond-cassandra/ I assume it's accurate - policy reasons wouldn't interest me as much as technical ones. Bill
Re: facebook search index super column, do I have this correct?
Actually, I think in the video they said they store each messageID as a separate column, so that they can do range queries, correct? So it would be: aloha: { message1: 2343, message2: 9590002, } On Thu, Jul 1, 2010 at 6:25 PM, S Ahmed sahmed1...@gmail.com wrote: So trying to map how facebook implemented a CF of type Super to index message terms. Is this json representation correct? MessageIndex = { userid1 : { aloha : { messageIdList: 234,2343234,23423434,234255,345345,2342,532432}, clown : { messageIdList: 632, 2342, 23452, 234234, 234234}, .. .. .. }, userid2 : { eating : { messageIdList: 234,2343234,23423434,234255,345345,2342,532432}, studying : { messageIdList: 632, 2342, 23452, 234234, 234234}, .. .. .. } } So if a user searches for the term clown, you perform a lookup in the CF named MessageIndex for the row of the currently logged in user by UserID (which is the key), and then look for a CF with the term clown and return the value. Is this a proper representation and am I using the correct terminology?
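To make the "one column per message ID" idea concrete, here is a minimal in-memory sketch in Python. It is not a Cassandra client and not Facebook's actual schema: the nested dicts only stand in for row key, super column and sub-columns, and the helper names (add_message, search) are made up for illustration. It also assumes the message ID itself is used as the column name, which is what would make a range query over IDs possible.

```python
from bisect import bisect_left, bisect_right

# In-memory stand-in for a super column family (not a Cassandra client):
# row key (user id) -> super column (term) -> {message id: empty value}
message_index = {}

def add_message(index, user_id, term, message_id):
    """Record that `term` occurs in `message_id` for this user."""
    index.setdefault(user_id, {}).setdefault(term, {})[message_id] = b""

def search(index, user_id, term, start=None, finish=None):
    """Return the user's message ids for `term`, optionally limited to [start, finish]."""
    names = sorted(index.get(user_id, {}).get(term, {}))  # columns sort by name
    lo = 0 if start is None else bisect_left(names, start)
    hi = len(names) if finish is None else bisect_right(names, finish)
    return names[lo:hi]

for message_id in (632, 2342, 23452, 234234):
    add_message(message_index, "userid1", "clown", message_id)
for message_id in (2343, 9590002):
    add_message(message_index, "userid1", "aloha", message_id)

print(search(message_index, "userid1", "clown"))
print(search(message_index, "userid1", "clown", start=1000, finish=30000))
```

With a single messageIdList value, the whole list has to be read and rewritten to add one ID; with one column per ID, adding a message is a single column write and a slice over column names returns only part of the list.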
Pelops 'up and running' post question + WTF is a SuperColumn = really confused.
https://ria101.wordpress.com/2010/06/11/pelops-the-beautiful-cassandra-database-client-for-java So using the code snipped below, I want to create a json representation of the CF (super). /** * Write multiple sub-column values to a super column... * @param rowKeyThe key of the row to modify * @param colFamily The name of the super column family to operate on * @param colName The name of the super column * @param subColumnsA list of the sub-columns to write */ mutator. writeSubColumns( userId, L1Tickets, UuidHelper.newTimeUuidBytes(), // using a UUID value that sorts by time mutator.newColumnList( mutator.newColumn(category, videoPhone), mutator.newColumn(reportType, POOR_PICTURE), mutator.newColumn(createdDate, NumberHelper.toBytes(System.currentTimeMillis())), mutator.newColumn(capture, jpegBytes), mutator.newColumn(comment) )); Can someone show me what it would look like? This is what I have so far SupportTickets = { userId : { L1Tickets : { } } } But from what I understood, a CF of type super looks like ( http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model) : AddressBook = { // this is a ColumnFamily of type Super phatduckk: {// this is the key to this row inside the Super CF // the key here is the name of the owner of the address book // now we have an infinite # of super columns in this row // the keys inside the row are the names for the SuperColumns // each of these SuperColumns is an address book entry friend1: {street: 8th street, zip: 90210, city: Beverley Hills, state: CA}, // this is the address book entry for John in phatduckk's address book John: {street: Howard street, zip: 94404, city: FC, state: CA}, Kim: {street: X street, zip: 87876, city: Balls, state: VA}, Tod: {street: Jerry street, zip: 54556, city: Cartoon, state: CO}, Bob: {street: Q Blvd, zip: 24252, city: Nowhere, state: MN}, ... 
// we can have an infinite # of ScuperColumns (aka address book entries) }, // end row ieure: { // this is the key to another row in the Super CF // all the address book entries for ieure joey: {street: A ave, zip: 55485, city: Hell, state: NV}, William: {street: Armpit Dr, zip: 93301, city: Bakersfield, state: CA}, }, } The Pelop's code snippet seems to be adding an additional inner layer to this to me, confused!
Re: Pelops 'up and running' post question + WTF is a SuperColumn = really confused.
ok now that makes sense, thanks a bundle. On Fri, Jul 2, 2010 at 5:49 PM, Dan Washusen d...@reactive.org wrote: L1Tickets = { // column family userId: { // row key 42C120DF-D44A-44E4-9BDC-2B5439A5C7B4: { category: videoPhone, reportType: POOR_PICTURE, ...}, 99B60047-382A-4237-82CE-AE53A74FB747: { category: somethingElse, reportType: FOO, ...} } } On 3 July 2010 02:29, S Ahmed sahmed1...@gmail.com wrote: https://ria101.wordpress.com/2010/06/11/pelops-the-beautiful-cassandra-database-client-for-java So using the code snipped below, I want to create a json representation of the CF (super). /** * Write multiple sub-column values to a super column... * @param rowKeyThe key of the row to modify * @param colFamily The name of the super column family to operate on * @param colName The name of the super column * @param subColumnsA list of the sub-columns to write */ mutator. writeSubColumns( userId, L1Tickets, UuidHelper.newTimeUuidBytes(), // using a UUID value that sorts by time mutator.newColumnList( mutator.newColumn(category, videoPhone), mutator.newColumn(reportType, POOR_PICTURE), mutator.newColumn(createdDate, NumberHelper.toBytes(System.currentTimeMillis())), mutator.newColumn(capture, jpegBytes), mutator.newColumn(comment) )); Can someone show me what it would look like? 
This is what I have so far SupportTickets = { userId : { L1Tickets : { } } } But from what I understood, a CF of type super looks like ( http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model) : AddressBook = { // this is a ColumnFamily of type Super phatduckk: {// this is the key to this row inside the Super CF // the key here is the name of the owner of the address book // now we have an infinite # of super columns in this row // the keys inside the row are the names for the SuperColumns // each of these SuperColumns is an address book entry friend1: {street: 8th street, zip: 90210, city: Beverley Hills, state: CA}, // this is the address book entry for John in phatduckk's address book John: {street: Howard street, zip: 94404, city: FC, state: CA}, Kim: {street: X street, zip: 87876, city: Balls, state: VA}, Tod: {street: Jerry street, zip: 54556, city: Cartoon, state: CO}, Bob: {street: Q Blvd, zip: 24252, city: Nowhere, state: MN}, ... // we can have an infinite # of ScuperColumns (aka address book entries) }, // end row ieure: { // this is the key to another row in the Super CF // all the address book entries for ieure joey: {street: A ave, zip: 55485, city: Hell, state: NV}, William: {street: Armpit Dr, zip: 93301, city: Bakersfield, state: CA}, }, } The Pelop's code snippet seems to be adding an additional inner layer to this to me, confused!
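For anyone else confused by the extra layer, here is a rough in-memory sketch of the layout Dan describes, written in Python rather than Java. It does not call Pelops or Cassandra; write_sub_columns below is a hypothetical stand-in that only mirrors the argument order of the snippet (row key, column family, super column name, sub-columns), and the time-based UUID from Python's uuid.uuid1() plays the role of UuidHelper.newTimeUuidBytes().

```python
import time
import uuid

# column family name -> row key -> super column name -> sub-columns
column_families = {"L1Tickets": {}}

def write_sub_columns(row_key, col_family, super_col_name, sub_columns):
    """Write several sub-columns into one super column of one row."""
    row = column_families[col_family].setdefault(row_key, {})
    row.setdefault(super_col_name, {}).update(sub_columns)

def tickets_oldest_first(row_key):
    """Return a user's tickets ordered by the timestamp inside each time UUID."""
    row = column_families["L1Tickets"].get(row_key, {})
    return [row[name] for name in sorted(row, key=lambda u: u.time)]

# Each ticket is one super column whose name is a fresh time-based UUID.
write_sub_columns("userId", "L1Tickets", uuid.uuid1(), {
    "category": "videoPhone",
    "reportType": "POOR_PICTURE",
    "createdDate": int(time.time() * 1000),
})
write_sub_columns("userId", "L1Tickets", uuid.uuid1(), {
    "category": "somethingElse",
    "reportType": "FOO",
})

for ticket in tickets_oldest_first("userId"):
    print(ticket["category"], ticket["reportType"])
```

So L1Tickets is the column family itself (the AddressBook level in the other example), userId is the row key, and each UUID is a super column, just as each friend's name is in the address book.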
vector maps and counts
(I realize the ability to get/set a count consistently is coming in an upcoming release.) Can someone give me a high-level overview of the design of the vector map solution? Is the actual count value stored in the CF row or is it stored separately?
where is the video just before this one by Avinash?
In this video: http://vimeo.com/5185526 Avinash mentions that the previous presenter covered a lot of what he was going to cover. Does anyone have a link to that presentation?
facebook search index super column, do I have this correct?
So trying to map how facebook implemented a CF of type Super to index message terms. Is this json representation correct? MessageIndex = { userid1 : { aloha : { messageIdList: 234,2343234,23423434,234255,345345,2342,532432}, clown : { messageIdList: 632, 2342, 23452, 234234, 234234}, .. .. .. }, userid2 : { eating : { messageIdList: 234,2343234,23423434,234255,345345,2342,532432}, studying : { messageIdList: 632, 2342, 23452, 234234, 234234}, .. .. .. } } So if a user searches for the term clown, you perform a lookup in the CF named MessageIndex for the row of the currently logged in user by UserID (which is the key), and then look for a CF with the term clown and return the value. Is this a proper representation and am I using the correct terminology?
Re: forum application data model conversion
Any thoughts? On Tue, Jun 22, 2010 at 2:13 PM, S Ahmed sahmed1...@gmail.com wrote: Converting a Forum application to cassandra's data model. Tables: Posts [postID, threadID, userID, subject, body, created, lastmodified] So this table contains the actual question subject and body. When a user logs in, they want to see a list of their questions, ordered by the last-modified date (to see if people responded to their question). How would you best do this in Cassandra, seeing as the question/answer text is stored in another table? I know you could make a CF like: userID { postID1, postID2, ...} and somehow order by last-modified, but then on the actual web page you would first have to query for the postIDs owned by the user, ordered by last-modified. THEN you would have to fetch the post data from the posts collection. Is this the only way? I mean other than repeating the post subject+body in the user-to-postID index CF.
forum application data model conversion
Converting a Forum application to cassandra's data model. Tables: Posts [postID, threadID, userID, subject, body, created, lastmodified] So this table contains the actual question subject and body. When a user logs in, they want to see a list of their questions, ordered by the last-modified date (to see if people responded to their question). How would you best do this in Cassandra, seeing as the question/answer text is stored in another table? I know you could make a CF like: userID { postID1, postID2, ...} and somehow order by last-modified, but then on the actual web page you would first have to query for the postIDs owned by the user, ordered by last-modified. THEN you would have to fetch the post data from the posts collection. Is this the only way? I mean other than repeating the post subject+body in the user-to-postID index CF.
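A small Python sketch of the two-step approach described above, using plain dicts in place of the two column families. Nothing here talks to Cassandra; the names (posts, user_posts, save_post, recent_posts) are invented for the example, and it assumes the index column name is the pair (lastmodified, postID) so that columns sort by modification time.

```python
posts = {}       # postID -> post columns (the main Posts CF)
user_posts = {}  # userID -> {(lastmodified, postID): ""} (the index CF)

def save_post(post_id, user_id, subject, body, lastmodified):
    """Insert or update a post and keep the per-user index in step."""
    old = posts.get(post_id)
    if old is not None:
        # Drop the stale index column, otherwise the post would be listed twice.
        user_posts.get(old["userID"], {}).pop((old["lastmodified"], post_id), None)
    posts[post_id] = {"userID": user_id, "subject": subject,
                      "body": body, "lastmodified": lastmodified}
    user_posts.setdefault(user_id, {})[(lastmodified, post_id)] = ""

def recent_posts(user_id, limit=10):
    """Step 1: slice the index, newest first. Step 2: fetch each post by key."""
    names = sorted(user_posts.get(user_id, {}), reverse=True)[:limit]
    return [posts[post_id] for _, post_id in names]

save_post("p1", "u1", "First question", "body 1", lastmodified=100)
save_post("p2", "u1", "Second question", "body 2", lastmodified=200)
save_post("p1", "u1", "First question", "body 1 (edited)", lastmodified=300)

for post in recent_posts("u1"):
    print(post["lastmodified"], post["subject"])
```

The cost of keeping the index thin is the second fetch plus the delete-and-reinsert on every modification; copying subject and body into the index column value removes the second fetch at the price of writing the text twice.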
django or pylons
Seeing as I will be using a different ORM, would it make more sense to use pylons over django? From what I understand, pylons assumes less as compared to django.
CF that is like a non-clustered index, are key lookups that fast?
If you store only the key mappings in a column family, for custom ordering of rows etc. for things like: friends = { user_id : { friendid1, friendid2, } } or topForumPosts = { forum_id1 : { post2343, post32343, post32223, ...} } Now on the friends page or on the top_forum_posts page you will get back a list of post_ids, and you will then have to perform lookups on the main 'posts' CF to get the actual data. So if a page is displaying 10, 25, or 50 posts you will have 10, 25 or 50 key-based lookups for each page view. Is this the suggested way? I.e. a lookup based on a slice to get a list of post_ids, then a separate call to actually fetch the data for the given entity. Or is cassandra so fast that 50 key-based calls is no reason to worry?
Re: CF that is like a non-clustered index, are key lookups that fast?
Well, it won't be a range, it will be random key lookups. On Tue, Jun 15, 2010 at 8:44 AM, Gary Dusbabek gdusba...@gmail.com wrote: On Tue, Jun 15, 2010 at 04:29, S Ahmed sahmed1...@gmail.com wrote: If you store only the key mappings in a column family, for custom ordering of rows etc. for things like: friends = { user_id : { friendid1, friendid2, } } or topForumPosts = { forum_id1 : { post2343, post32343, post32223, ...} } Now on the friends page or on the top_forum_posts page you will get back a list of post_ids, and you will then have to perform lookups on the main 'posts' CF to get the actual data. So if a page is displaying 10, 25, or 50 posts you will have 10, 25 or 50 key-based lookups for each page view. Is this the suggested way? I.e. a lookup based on a slice to get a list of post_ids, then a separate call to actually fetch the data for the given entity. Or is cassandra so fast that 50 key-based calls is no reason to worry? You should look at using either multi_get_slice or get_range_slices. You'll save on network trips and the amount of work required of the cluster. Gary.
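To illustrate the "save on network trips" point for random key lookups, here is a toy Python sketch. FakeCluster is purely hypothetical: it keeps rows in a dict and counts calls, and its get and multiget methods only stand in for one-key and many-key requests; they are not the Thrift API.

```python
class FakeCluster:
    """Toy stand-in for a cluster that counts client round trips."""

    def __init__(self, rows):
        self.rows = rows
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1          # one request per key
        return self.rows.get(key)

    def multiget(self, keys):
        self.round_trips += 1          # one request for the whole batch
        return {k: self.rows[k] for k in keys if k in self.rows}

posts = {f"post{i}": {"subject": f"subject {i}"} for i in range(100)}
top_forum_posts = {"forum_id1": ["post23", "post7", "post42"]}  # custom order

def page_one_by_one(cluster, forum_id):
    return [cluster.get(post_id) for post_id in top_forum_posts[forum_id]]

def page_batched(cluster, forum_id):
    ids = top_forum_posts[forum_id]
    found = cluster.multiget(ids)
    # The batch result is keyed by id, so re-apply the index order here.
    return [found[post_id] for post_id in ids if post_id in found]

slow, fast = FakeCluster(posts), FakeCluster(posts)
print(page_one_by_one(slow, "forum_id1"), slow.round_trips)
print(page_batched(fast, "forum_id1"), fast.round_trips)
```

Both functions return the same posts in the same order; the batched version does it in one request instead of one per post, which is the saving a multi-key call is meant to give even when the keys are not a contiguous range.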
using cassandra w/django
When using cassandra with django, can you still use the rapid development features of django, or are you basically just using the framework, with the models and ORM features up to you to implement since you are using cassandra?
Re: using cassandra w/django
I see, well I am new to python + django so I wasn't sure what I really meant :) So basically I am using django for its framework related features, but excluding the ORM/autogen admin pages. That's reasonable and understandable, thanks. On Fri, Jun 11, 2010 at 10:38 PM, Jeremy Dunck jdu...@gmail.com wrote: There's no direct support for cassandra in django, but there are a couple starts. http://www.allbuttonspressed.com/projects/django-nonrel http://github.com/enki/tragedy http://code.djangoproject.com/wiki/SummerOfCode2010 All of the features which Django has and which build on the ORM are out, of course. The GSoC project is trying to provide some nonrel features through the ORM. I think the general understanding of what people mean when they say "does Django work with nosql-X" is "does the Django admin work with nosql-X". The GSoC might get there, but it's pretty ambitious. On Fri, Jun 11, 2010 at 9:18 PM, S Ahmed sahmed1...@gmail.com wrote: When using cassandra with django, can you still use the rapid development features of django, or are you basically just using the framework, with the models and ORM features up to you to implement since you are using cassandra?
Re: Cassandra training Jun 18 in SF
Nice! Would it be possible to give more than 2 weeks notice for the following events? Preferably a month; it's not that easy to get off work etc. On Fri, Jun 4, 2010 at 4:22 AM, Oleg Anastasjev olega...@gmail.com wrote: Jonathan Ellis jbellis at gmail.com writes: This will be Riptano's 6th training session (including the four we've done that were on-site with a specific customer), and in my humble opinion the material's really solid at this point. We are actively working on lining up other locations. Do you have plans for training sessions in Europe ?
Re: Problems running Cassandra 0.6.1 on large EC2 instances.
Curious, how did things turn out? On Tue, May 18, 2010 at 1:38 PM, Curt Bererton c...@zipzapplay.com wrote: We only have a few CFs (6 or 7). I've increased the MemtableThroughputInMB and MemtableOperationsInMillions as per your suggestions. Do we really need a swap file though? I suppose it can't hurt, but with my problem in particular we weren't maxing out main memory. We'll be running another test today and see if the settings changes proposed so far fix our problem ( I hope so ). Best, Curt On Tue, May 18, 2010 at 5:59 AM, Lee Parker l...@socialagency.com wrote: How many different CFs do you have? If you only have a few, I would highly recommend increasing the MemtableThroughputInMB and MemtableOperationsInMillions. We only have two CFs and I have it set at 256MB and 2.5m. Since most of our columns are relatively small, these values are practically equivalent to each other. I would also recommend dropping your heap space to 6G and adding a swap file. In our case, the large EC2 instances didn't have any swap setup by default. Lee Parker
is it possible to trace/debug cassandra?
Would it be possible to put cassandra in debug mode, so I could actually step through, line by line, the execution flow of operations I execute against it? If yes, any help would be great.
Re: Cassandra training on May 21 in Palo Alto
Jonathan, Curious how many people have signed up? I hope you will do another one soon! On Tue, May 11, 2010 at 12:42 PM, Vick Khera vi...@khera.org wrote: On Fri, May 7, 2010 at 6:56 AM, Matt Revelle mreve...@gmail.com wrote: Reston, VA is a good spot in the DC metro area for tech events. +1
Re: zookeeper, how do you feed the pets?
Yes, counts will be a big part of the project (user points). Ok, I'll wait for that vector implementation then (I think that is what it was called). Thanks! On Sun, May 16, 2010 at 10:10 PM, Chris Goffinet c...@chrisgoffinet.com wrote: If you are running multiple datacenters, intend to have a lot of writes for counters, I highly advise against it. We got rid of ZK because of that. -Chris On May 16, 2010, at 7:04 PM, S Ahmed wrote: Can someone quickly go over how you go about using zookeeper if you want to store counts and have those counts be accurate? e.g. in digg's case I believe they are using zookeeper so they can keep track of diggs for a particular digg story. Is it a backend change only, so that the storing API calls are unaffected? Is it a config issue? What are the ramifications of using this addon; are writes slower because you have to wait for the write to propagate to all the servers?
is cassandra really a 'handsoff' solution once setup?
I realize cassandra might be a little tricky to set up at first due to lack of docs etc. Once it is up and running/humming, is it a hands-off solution or does it require hand-holding/monitoring? I recall Joe Stump's blog post stating that it doesn't require an admin (or something to that effect when comparing to a sql server box). For those with live apps, how has it been? (fb/digg/twitter people, would love your experiences)
what/how do you guys monitor slow nodes?
If you have 3-4 nodes, how do you monitor the performance of each node?
Re: Cassandra training on May 21 in Palo Alto
I guess the hard part would be recording something so long (9-5pm). A video that is split between the screen (say powerpoint) and the linux console would be perfect :) On Fri, May 7, 2010 at 11:24 AM, Todd Burruss bburr...@real.com wrote: +1 -Original Message- *From:* S Ahmed [sahmed1...@gmail.com] *Received:* 5/7/10 7:09 AM *To:* user@cassandra.apache.org [u...@cassandra.apache.org] *Subject:* Re: Cassandra training on May 21 in Palo Alto It would be great if you could make a video of this event. Yes, it won't be like being there 1-1, but it sure would help get up to speed. On Fri, May 7, 2010 at 6:56 AM, Matt Revelle mreve...@gmail.com wrote: Reston, VA is a good spot in the DC metro area for tech events. The recent Pragmatic Programmer Clojure class sold out and already has two more return visits planned. On May 7, 2010, at 6:42 AM, S Ahmed sahmed1...@gmail.com wrote: toronto :) If not toronto, Virginia. On Thu, May 6, 2010 at 5:28 PM, Jonathan Ellis jbel...@gmail.com wrote: We're planning that now. Where would you like to see one? On Thu, May 6, 2010 at 2:40 PM, S Ahmed sahmed1...@gmail.com wrote: Do you have rough ideas when you would be doing the next one? Maybe in 1 or 2 months or much later? On Tue, May 4, 2010 at 8:50 PM, Jonathan Ellis jbel...@gmail.com wrote: Yes, although when and where are TBD. On Tue, May 4, 2010 at 7:38 PM, Mark Greene green...@gmail.com wrote: Jonathan, Awesome! Any plans to offer this training again in the future for those of us who can't make it this time around? -Mark On Tue, May 4, 2010 at 5:07 PM, Jonathan Ellis jbel...@gmail.com wrote: I'll be running a day-long Cassandra training class on Friday, May 21.
I'll cover - Installation and configuration - Application design - Basics of Cassandra internals - Operations - Tuning and troubleshooting Details at http://riptanobayarea20100521.eventbrite.com/ -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Cassandra training on May 21 in Palo Alto
toronto :) If not toronto, Virginia. On Thu, May 6, 2010 at 5:28 PM, Jonathan Ellis jbel...@gmail.com wrote: We're planning that now. Where would you like to see one? On Thu, May 6, 2010 at 2:40 PM, S Ahmed sahmed1...@gmail.com wrote: Do you have rough ideas when you would be doing the next one? Maybe in 1 or 2 months or much later? On Tue, May 4, 2010 at 8:50 PM, Jonathan Ellis jbel...@gmail.com wrote: Yes, although when and where are TBD. On Tue, May 4, 2010 at 7:38 PM, Mark Greene green...@gmail.com wrote: Jonathan, Awesome! Any plans to offer this training again in the future for those of us who can't make it this time around? -Mark On Tue, May 4, 2010 at 5:07 PM, Jonathan Ellis jbel...@gmail.com wrote: I'll be running a day-long Cassandra training class on Friday, May 21. I'll cover - Installation and configuration - Application design - Basics of Cassandra internals - Operations - Tuning and troubleshooting Details at http://riptanobayarea20100521.eventbrite.com/ -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Cassandra training on May 21 in Palo Alto
It would be great if you could make a video of this event. Yes, it won't be like being there 1-1, but it sure would help get up to speed. On Fri, May 7, 2010 at 6:56 AM, Matt Revelle mreve...@gmail.com wrote: Reston, VA is a good spot in the DC metro area for tech events. The recent Pragmatic Programmer Clojure class sold out and already has two more return visits planned. On May 7, 2010, at 6:42 AM, S Ahmed sahmed1...@gmail.com wrote: toronto :) If not toronto, Virginia. On Thu, May 6, 2010 at 5:28 PM, Jonathan Ellis jbel...@gmail.com wrote: We're planning that now. Where would you like to see one? On Thu, May 6, 2010 at 2:40 PM, S Ahmed sahmed1...@gmail.com wrote: Do you have rough ideas when you would be doing the next one? Maybe in 1 or 2 months or much later? On Tue, May 4, 2010 at 8:50 PM, Jonathan Ellis jbel...@gmail.com wrote: Yes, although when and where are TBD. On Tue, May 4, 2010 at 7:38 PM, Mark Greene green...@gmail.com wrote: Jonathan, Awesome! Any plans to offer this training again in the future for those of us who can't make it this time around? -Mark On Tue, May 4, 2010 at 5:07 PM, Jonathan Ellis jbel...@gmail.com wrote: I'll be running a day-long Cassandra training class on Friday, May 21.
I'll cover - Installation and configuration - Application design - Basics of Cassandra internals - Operations - Tuning and troubleshooting Details at http://riptanobayarea20100521.eventbrite.com/ -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Cassandra training on May 21 in Palo Alto
Do you have rough ideas when you would be doing the next one? Maybe in 1 or 2 months or much later? On Tue, May 4, 2010 at 8:50 PM, Jonathan Ellis jbel...@gmail.com wrote: Yes, although when and where are TBD. On Tue, May 4, 2010 at 7:38 PM, Mark Greene green...@gmail.com wrote: Jonathan, Awesome! Any plans to offer this training again in the future for those of us who can't make it this time around? -Mark On Tue, May 4, 2010 at 5:07 PM, Jonathan Ellis jbel...@gmail.com wrote: I'll be running a day-long Cassandra training class on Friday, May 21. I'll cover - Installation and configuration - Application design - Basics of Cassandra internals - Operations - Tuning and troubleshooting Details at http://riptanobayarea20100521.eventbrite.com/ -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Is Hector a wrapper around thrift?
Just trying to get my head wrapped around everything here, so bear with me :) So Thrift can spit out generated code for any language, be it C#, Java or python etc. Hector is a higher level wrapper around the Java code generated by Thrift. Do I have this right? And Hector is probably the most worked on higher level wrapper? (Is there anything similar in python?)
Re: getting cassandra setup on windows 7
Great, that worked, thanks! On Fri, Apr 23, 2010 at 2:28 PM, Mark Greene green...@gmail.com wrote: Try the cassandra-with-fixes.bat file (https://issues.apache.org/jira/secure/attachment/12442349/cassandra-with-fixes.bat) attached to the issue. I had the same issue and that bat file got cassandra to start. It still throws another error complaining about the log4j.properties. On Fri, Apr 23, 2010 at 1:59 PM, S Ahmed sahmed1...@gmail.com wrote: Any insights? Much appreciated! On Thu, Apr 22, 2010 at 11:13 PM, S Ahmed sahmed1...@gmail.com wrote: I was just reading that, thanks. What does he mean when he says: "This appears to be related to data storage paths I set, because if I switch the paths back to the default UNIX paths. Everything runs fine" On Thu, Apr 22, 2010 at 11:07 PM, Jonathan Ellis jbel...@gmail.com wrote: https://issues.apache.org/jira/browse/CASSANDRA-948 On Thu, Apr 22, 2010 at 10:03 PM, S Ahmed sahmed1...@gmail.com wrote: Ok so I found the config section: <CommitLogDirectory>E:\java\cassandra\apache-cassandra-0.6.1-bin\apache-cassandra-0.6.1\commitlog</CommitLogDirectory> <DataFileDirectories> <DataFileDirectory>E:\java\cassandra\apache-cassandra-0.6.1-bin\apache-cassandra-0.6.1\data</DataFileDirectory> </DataFileDirectories> Now when I run: bin/cassandra I get: Starting cassandra server listening for transport dt_socket at address: exception in thread main java.lang.noclassDefFoundError: org/apache/cassthreft/cassandraDaemon could not find the main class: org.apache.cassandra.threif.cassandraDaemon... On Thu, Apr 22, 2010 at 10:53 PM, S Ahmed sahmed1...@gmail.com wrote: So I uncompressed the .tar, in the readme it says: * tar -zxvf cassandra-$VERSION.tgz * cd cassandra-$VERSION * sudo mkdir -p /var/log/cassandra * sudo chown -R `whoami` /var/log/cassandra * sudo mkdir -p /var/lib/cassandra * sudo chown -R `whoami` /var/lib/cassandra My cassandra is at: c:\java\cassandra\apache-cassandra-0.6.1/ So I have to create 2 folders log and lib?
Is there a setting in a config file that I edit?
value size, is there a suggested limit?
Is there a suggested maximum size for the value of a given key? E.g., could I convert a document to bytes and store it as the value for a key? If yes, which I presume is the case, what if the file is 10 MB? Or 100 MB?
Re: getting cassandra setup on windows 7
Any insights? Much appreciated! On Thu, Apr 22, 2010 at 11:13 PM, S Ahmed sahmed1...@gmail.com wrote: I was just reading that, thanks. What does he mean when he says: "This appears to be related to data storage paths I set, because if I switch the paths back to the default UNIX paths. Everything runs fine"? On Thu, Apr 22, 2010 at 11:07 PM, Jonathan Ellis jbel...@gmail.com wrote: https://issues.apache.org/jira/browse/CASSANDRA-948 On Thu, Apr 22, 2010 at 10:03 PM, S Ahmed sahmed1...@gmail.com wrote: Ok so I found the config section: <CommitLogDirectory>E:\java\cassandra\apache-cassandra-0.6.1-bin\apache-cassandra-0.6.1\commitlog</CommitLogDirectory> <DataFileDirectories> <DataFileDirectory>E:\java\cassandra\apache-cassandra-0.6.1-bin\apache-cassandra-0.6.1\data</DataFileDirectory> </DataFileDirectories> Now when I run: bin/cassandra I get: Starting cassandra server listening for transport dt_socket at address: exception in thread main java.lang.noclassDefFoundError: org/apache/cassthreft/cassandraDaemon could not find the main class: org.apache.cassandra.threif.cassandraDaemon... On Thu, Apr 22, 2010 at 10:53 PM, S Ahmed sahmed1...@gmail.com wrote: So I uncompressed the .tar, in the readme it says: * tar -zxvf cassandra-$VERSION.tgz * cd cassandra-$VERSION * sudo mkdir -p /var/log/cassandra * sudo chown -R `whoami` /var/log/cassandra * sudo mkdir -p /var/lib/cassandra * sudo chown -R `whoami` /var/lib/cassandra My cassandra is at: c:\java\cassandra\apache-cassandra-0.6.1/ So do I have to create 2 folders, log and lib? Is there a setting in a config file that I edit?
Re: cassandra instability
If Digg uses PHP with Cassandra, can the library really be that old? Or are they using their own custom PHP Cassandra client? (Probably, but just making sure.) On Fri, Apr 16, 2010 at 2:13 PM, Jonathan Ellis jbel...@gmail.com wrote: On Fri, Apr 16, 2010 at 12:50 PM, Lee Parker l...@socialagency.com wrote: "Each time I start it up, it will work fine for about 1 hour and then it will crash the servers. The error message on the servers is usually an out of memory error." Sounds like http://wiki.apache.org/cassandra/FAQ#slows_down_after_lotso_inserts to me. "I will get several time out errors on the clients" Symptomatic of running out of memory. "and occasionally get an error telling me that I was missing the timestamp." This is an entirely different problem. Your client is sending garbage, plain and simple. Why that is, I don't know. The PHP Thrift binding is virtually unmaintained, so it could be a bug there, but Digg uses PHP against Cassandra extensively and hasn't hit this to my knowledge. As I said in another thread, I wouldn't rule out bad hardware. "The timestamp error is accompanied by a server crashing if I use framed transport instead of buffered." Thrift is fragile when the client sends it garbage. (https://issues.apache.org/jira/browse/THRIFT-601) "One of the reasons we were trying cassandra was to scale out with smaller nodes rather than having to run larger instances for mysql." 2 x 1GB isn't a whole lot to do a bulk load with. You may have to throttle your clients to fix the OOM completely. -Jonathan
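The suggestion above to throttle clients during a bulk load can be sketched roughly as follows. This is a minimal illustration in Python rather than the PHP client discussed in the thread; `Throttle`, `bulk_load`, and the `insert` callback are hypothetical names, and the operations-per-second rate is an assumed tuning knob, not a value anyone in the thread recommended.

```python
import time


class Throttle:
    """Pace calls so that at most `rate` operations start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate      # minimum spacing between operations
        self.clock = clock              # injectable for testing
        self.sleep = sleep              # injectable for testing
        self.next_allowed = clock()     # earliest time the next op may start

    def wait(self):
        now = self.clock()
        if now < self.next_allowed:
            # Too early: pause until the next slot opens.
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


def bulk_load(rows, insert, throttle):
    """Send each (key, value) pair through `insert`, pacing the writes.

    `insert` stands in for whatever client call performs the write;
    it is a placeholder, not a real Cassandra API.
    """
    count = 0
    for key, value in rows:
        throttle.wait()
        insert(key, value)
        count += 1
    return count
```

A loader would call something like `bulk_load(rows, my_insert, Throttle(rate=200))` and lower the rate until the servers stop running out of memory; the right number depends on the hardware and has to be found by experiment.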
security, firewall level only?
Is security, in terms of remote clients connecting to a Cassandra node, handled purely at the hardware/firewall level? I.e., there is no username/password as in MySQL or SQL Server, correct? Or are there permissions at the column family level per user?
Just to be clear, cassandra is web framework agnostic b/c of Thrift?
Just want to be clear: is it true that it really makes no difference whether my web application is asp.net or Java or Python, since the way we communicate with Cassandra is via the Thrift-generated interface? Obviously if you run asp.net on Windows, it is probably a VERY good idea to be running Cassandra on a Linux box.
Re: Just to be clear, cassandra is web framework agnostic b/c of Thrift?
Interesting, I'm just finding Windows to be a pain, particularly starting up Java apps. (I guess I just need to learn!) How exactly would you start up Cassandra on a Windows machine? I.e., when the server reboots, how will it run the java -jar cassandra? On Sun, Apr 18, 2010 at 7:35 PM, Joe Stump j...@joestump.net wrote: On Apr 18, 2010, at 5:33 PM, S Ahmed wrote: Obviously if you run asp.net on windows, it is probably a VERY good idea to be running cassandra on a linux box. Actually, I'm not sure this is true. A few people have found Windows performs fairly well with Cassandra, if I recall correctly. Obviously, all of the testing and most of the bigger users are running on Linux though. --Joe
if cassandra isn't ideal for keep track of counts, how does digg count diggs?
From what I read in another thread, Cassandra isn't 'ideal' for keeping track of counts. For example, I would understand this to mean keeping track of which stories were dugg. If this is true, how would a site like Digg keep track of the 'dugg' counter? Also, I am assuming that with eventual consistency the number *may* not be 100% accurate. If you wanted it to be accurate, would you just use the quorum flag? (I believe quorum is to ensure all writes are written to disk)
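On the quorum question above: in Cassandra's consistency model, QUORUM refers to a majority of the replicas for a key acknowledging an operation, not to writes being flushed to disk. A read is guaranteed to overlap the latest acknowledged write when the number of replicas written plus the number read exceeds the replication factor. The arithmetic can be sketched as follows; the function names are hypothetical and only illustrate the rule, they are not part of any Cassandra client.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must share at least one replica with
    every write set, i.e. W + R > RF."""
    return write_replicas + read_replicas > replication_factor


# With a replication factor of 3, quorum is 2 replicas.
rf = 3
q = quorum(rf)
print(q, read_overlaps_write(rf, q, q))   # quorum writes + quorum reads overlap
print(read_overlaps_write(rf, 1, 1))      # writing and reading one replica may not
```

Note that this overlap only covers a single read and write; it does not by itself make a read-modify-write counter increment safe when two clients update the same count concurrently.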
Re: if cassandra isn't ideal for keep track of counts, how does digg count diggs?
Chris, when you say patch, does that mean for Cassandra or your own internal codebase? Sounds interesting, thanks! On Tue, Apr 6, 2010 at 12:54 PM, Chris Goffinet goffi...@digg.com wrote: That's not true. We have been using the ZooKeeper work we posted on jira. That's what we are using internally and have been for months. We are now just wrapping up our vector clocks + distributed counter patch so we can begin transitioning away from the ZooKeeper approach because there are problems with it long-term. -Chris On Apr 6, 2010, at 9:50 AM, Ryan King wrote: They don't use cassandra for it yet. -ryan On Tue, Apr 6, 2010 at 9:00 AM, S Ahmed sahmed1...@gmail.com wrote: From what I read in another thread, Cassandra isn't 'ideal' for keeping track of counts. For example, I would understand this to mean keeping track of which stories were dugg. If this is true, how would a site like Digg keep track of the 'dugg' counter? Also, I am assuming that with eventual consistency the number *may* not be 100% accurate. If you wanted it to be accurate, would you just use the quorum flag? (I believe quorum is to ensure all writes are written to disk)