Heterogeneous cluster and vnodes

2014-08-29 Thread Jens Rantil
Hey, I have a few VM host (bare metal) machines with varying amounts of free hard drive space on them. For simplicity, let’s say I have three machines like so:  * Machine 1:   - Harddrive 1: 150 GB available.  * Machine 2:   - Harddrive 1: 150 GB available.   - Harddrive 2: 150 GB available.  *

Re: Commitlog files are not being deleted

2014-08-29 Thread Pavel Kogan
Thanks Robert. On Thu, Aug 28, 2014 at 6:32 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Aug 28, 2014 at 3:31 PM, Pavel Kogan pavel.ko...@cortica.com wrote: Shouldn't all commitlog files be auto deleted after replaying, for example after node restart? Using Cassandra 2.0.8 No,

RE: How often are JMX Cassandra metrics reset?

2014-08-29 Thread Donald Smith
Thanks, Chris. 75thPercentile is clearly NOT lifetime: its value jumps around. However, I can tell that Max is lifetime; it's been showing the exact same value for days, on various nodes. Hence my doubts. From: Chris Lohfink [mailto:clohf...@blackbirdit.com] Sent: Thursday, August 28, 2014 3:56

Re: Too many SSTables after rebalancing cluster (LCS)

2014-08-29 Thread Paulo Ricardo Motta Gomes
Deleting the json manifest worked like a charm. After 2 days of compactions I've got 50GB extra space! :) Just a quick addendum, after deleting the json metadata file, I needed to restart the node, otherwise it just reloads the file from memory. Version: 1.2.16 On Wed, Aug 27, 2014 at 8:13 PM,

Re: How often are JMX Cassandra metrics reset?

2014-08-29 Thread Robert Coli
On Thu, Aug 28, 2014 at 3:39 PM, Donald Smith donald.sm...@audiencescience.com wrote: Maybe there’s a way to reset lifetime metrics to zero. No. [1] =Rob [1] At least, they never have before and neither driftx nor I believe they have been created.

Data partitioning and composite partition key

2014-08-29 Thread Drew Kutcharian
Hey Guys, AFAIK, currently Cassandra partitions (thrift) rows using the row key, basically uses the hash(row_key) to decide what node that row needs to be stored on. Now there are times when there is a need to shard a wide row, say storing events per sensor, so you’d have sensorId-datetime row
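The hash(row_key) placement described in this message can be illustrated with a minimal sketch. It is not Cassandra's implementation: MD5 from the standard library stands in for the partitioner's hash, each node owns a single token derived from its name, and `token`, `build_ring`, `node_for` and the node names are hypothetical.

```python
import hashlib
from bisect import bisect_left


def token(row_key: str) -> int:
    """Map a row key to an integer token (MD5 is a stand-in hash, an assumption)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def build_ring(nodes):
    """Give each node one token (hypothetically derived from its name), sorted by token."""
    return sorted((token(name), name) for name in nodes)


def node_for(row_key: str, ring) -> str:
    """Return the first node whose token is >= the key's token, wrapping around the ring."""
    tokens = [t for t, _ in ring]
    index = bisect_left(tokens, token(row_key))
    return ring[index % len(ring)][1]


ring = build_ring(["node-a", "node-b", "node-c"])
# Two buckets of the same sensor are hashed independently,
# so nothing guarantees they land on the same node.
owner_day1 = node_for("sensor42-2014-08-28", ring)
owner_day2 = node_for("sensor42-2014-08-29", ring)
```

Because the whole row key is hashed, the sharded rows of one sensor are scattered across the ring, which is the behaviour this thread discusses.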

Re: Data partitioning and composite partition key

2014-08-29 Thread Jack Krupansky
With CQL3, you, the developer, get to decide whether to place a primary key column in the partition key or as a clustering column. So, make sensorID the partition key and datetime as a clustering column. -- Jack Krupansky From: Drew Kutcharian Sent: Friday, August 29, 2014 6:48 PM To:

Re: Data partitioning and composite partition key

2014-08-29 Thread Drew Kutcharian
Hi Jack, I think you missed the point of my email, which was trying to avoid the problem of having very wide rows :) In the notation of sensorId-datetime, the datetime is a datetime bucket, say a day. The CQL rows would still be keyed by the actual time of the event. So you’d end up having
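The bucketing scheme described here can be sketched as follows, assuming a day-sized bucket. The partition key is the (sensor id, day) pair and rows inside a partition stay ordered by the actual event time. `partition_key`, `group_events` and the sample data are hypothetical, and the dictionary only models the layout; it is not driver or server code.

```python
from datetime import datetime


def partition_key(sensor_id: str, event_time: datetime) -> tuple:
    """Composite partition key: the sensor id plus a day bucket taken from the event time."""
    return (sensor_id, event_time.strftime("%Y-%m-%d"))


def group_events(events):
    """Group (sensor_id, event_time, value) tuples into partitions keyed by (sensor, day).

    Within each partition, rows are sorted by event time, mirroring a clustering column.
    """
    partitions = {}
    for sensor_id, event_time, value in events:
        key = partition_key(sensor_id, event_time)
        partitions.setdefault(key, []).append((event_time, value))
    for rows in partitions.values():
        rows.sort(key=lambda row: row[0])
    return partitions


events = [
    ("sensor42", datetime(2014, 8, 29, 10, 0), 1.0),
    ("sensor42", datetime(2014, 8, 29, 9, 0), 2.0),
    ("sensor42", datetime(2014, 8, 30, 0, 5), 3.0),
]
partitions = group_events(events)
```

With this layout a sensor's data is spread over one partition per day, so no single partition grows without bound.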

Re: Data partitioning and composite partition key

2014-08-29 Thread Robert Coli
On Fri, Aug 29, 2014 at 3:48 PM, Drew Kutcharian d...@venarc.com wrote: AFAIK, currently Cassandra partitions (thrift) rows using the row key, basically uses the hash(row_key) to decide what node that row needs to be stored on. Now there are times when there is a need to shard a wide row, say

Re: Data partitioning and composite partition key

2014-08-29 Thread Drew Kutcharian
Hi Rob, I agree that one should not mess around with the default partitioner. But there might be value in improving the Murmur3 partitioner to be “Composite Aware”. Since we can have composites in row keys now, why not be able to use only a part of the row key for partitioning? Makes sense? I

Re: Data partitioning and composite partition key

2014-08-29 Thread Jack Krupansky
Okay, but what benefit do you think you get from having the partitions on the same node – since they would be separate partitions anyway? I mean, what exactly do you think you’re going to do with them, that wouldn’t be a whole lot more performant by being able to process data in parallel from

Re: Data partitioning and composite partition key

2014-08-29 Thread Drew Kutcharian
Mainly lower latency and network overhead in multi-get requests (WHERE IN (….)). The coordinator needs to connect only to one node vs potentially all the nodes in the cluster. On Aug 29, 2014, at 5:23 PM, Jack Krupansky j...@basetechnology.com wrote: Okay, but what benefit do you think you

Re: Data partitioning and composite partition key

2014-08-29 Thread Jack Krupansky
But you already said that you have “very wide rows”, so pulling massive amounts of data off a single node is very likely to completely dwarf the connect time. Again, doing the gets in parallel from multiple nodes, with parallel requests, would be so much more performant. How many nodes are we

Rebuilding a cassandra seed node with the same tokens and same IP address

2014-08-29 Thread Donald Smith
One of our nodes is getting an increasing number of pending compactions due, we think, to https://issues.apache.org/jira/browse/CASSANDRA-7145 , which is fixed in future version 2.0.11 . (We had the same error a month ago, but at that time we were in pre-production and could just clean the

Machine Learning With Cassandra

2014-08-29 Thread Adaryl Bob Wakefield, MBA
I’m planning to speak at a local meet-up and I need to know if what I have in my head is even possible. I want to give an example of working with data in Cassandra. I have data coming in through Kafka and Storm and I’m saving it off to Cassandra (this is only on paper at this point). I then

Re: Machine Learning With Cassandra

2014-08-29 Thread Alex Kamil
Adaryl, most ML algorithms are based on some form of numerical optimization, using something like online gradient descent http://en.wikipedia.org/wiki/Stochastic_gradient_descent or conjugate gradient http://www.math.buffalo.edu/~pitman/courses/cor502/odes/node4.html (e.g. in SVM classifiers). In