random partitioner and key scan

2012-07-18 Thread Patrik Modesto
Hi,

I know that RandomPartitioner takes the MD5 of a key and that the MD5 is
then used for key distribution AND key ordering. I was just wondering if
it's possible to have RandomPartitioner just for key distribution and
an ordered partitioner just for per-node key ordering. That would solve
the often-requested key scan feature.
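
For reference, a rough sketch (not Cassandra's actual source) of how a
RandomPartitioner-style token is derived from a key; it shows why keys that
sort adjacently as strings end up far apart on the ring:

import java.math.BigInteger;
import java.security.MessageDigest;

public class Md5TokenSketch {
    // Roughly what RandomPartitioner does: MD5 the key and use the absolute
    // value of the digest as a 127-bit token that decides node placement.
    static BigInteger token(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
        return new BigInteger(digest).abs();
    }

    public static void main(String[] args) throws Exception {
        // Keys that are adjacent in byte order land on unrelated tokens,
        // which is why range scans over keys are not meaningful here.
        System.out.println(token("user:0001"));
        System.out.println(token("user:0002"));
    }
}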

Regards,
Patrik


Re: Tripling size of a cluster

2012-07-18 Thread Mariusz Dymarek

Hi again,
we have now moved all nodes to their correct positions in the ring, but we 
can see a higher load on 2 nodes than on the other nodes:

...
node01-05   rack1  Up  Normal  244.65 GB  6.67%  102084710076281539039012382229530463432
node02-13   rack2  Up  Normal  240.26 GB  6.67%  107756082858297180096735292353393266961
node01-13   rack1  Up  Normal  243.75 GB  6.67%  113427455640312821154458202477256070485
node02-05   rack2  Up  Normal  249.31 GB  6.67%  119098828422328462212181112601118874004
node01-14   rack1  Up  Normal  244.95 GB  6.67%  124770201204344103269904022724981677533
node02-14   rack2  Up  Normal  392.7 GB   6.67%  130441573986359744327626932848844481058
node01-06   rack1  Up  Normal  249.3 GB   6.67%  136112946768375385385349842972707284576
node02-15   rack2  Up  Normal  286.82 GB  6.67%  141784319550391026443072753096570088106
node01-15   rack1  Up  Normal  245.21 GB  6.67%  147455692332406667500795663220432891630
node02-06   rack2  Up  Normal  244.9 GB   6.67%  153127065114422308558518573344295695148

...

Overloaded nodes:
* node02-15 => 286.82 GB
* node02-14 => 392.7 GB

The average load on all other nodes is around 245 GB. The nodetool cleanup 
command was invoked on the problematic nodes after the move operation...

Why has this happened?
And how can we rebalance the cluster?
On 06.07.2012 20:15, aaron morton wrote:

If you have the time, yes, I would wait for the bootstrap to finish. It
will make your life easier.

good luck.


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 6/07/2012, at 7:12 PM, Mariusz Dymarek wrote:


Hi,
we're in the middle of extending our cluster from 10 to 30 nodes;
we're running Cassandra 1.1.1...
We've generated initial tokens for the new nodes:
"0": 0, # existing: node01-01
"1": 5671372782015641057722910123862803524, # new: node02-07
"2": 11342745564031282115445820247725607048, # new: node01-07
"3": 17014118346046923173168730371588410572, # existing: node02-01
"4": 22685491128062564230891640495451214097, # new: node01-08
"5": 28356863910078205288614550619314017621, # new: node02-08
"6": 34028236692093846346337460743176821145, # existing: node01-02
"7": 39699609474109487404060370867039624669, # new: node02-09
"8": 45370982256125128461783280990902428194, # new: node01-09
"9": 51042355038140769519506191114765231718, # existing: node02-02
"10": 56713727820156410577229101238628035242, # new: node01-10
"11": 62385100602172051634952011362490838766, # new: node02-10
"12": 68056473384187692692674921486353642291, # existing: node01-03
"13": 7372784616620750397831610216445815, # new: node02-11
"14": 79399218948218974808120741734079249339, # new: node01-11
"15": 85070591730234615865843651857942052864, # existing: node02-03
"16": 90741964512250256923566561981804856388, # new: node01-12
"17": 96413337294265897981289472105667659912, # new: node02-12
"18": 102084710076281539039012382229530463436, # existing: node01-05
"19": 107756082858297180096735292353393266961, # new: node02-13
"20": 113427455640312821154458202477256070485, # new: node01-13
"21": 119098828422328462212181112601118874009, # existing: node02-05
"22": 124770201204344103269904022724981677533, # new: node01-14
"23": 130441573986359744327626932848844481058, # new: node02-14
"24": 136112946768375385385349842972707284582, # existing: node01-06
"25": 141784319550391026443072753096570088106, # new: node02-15
"26": 147455692332406667500795663220432891630, # new: node01-15
"27": 153127065114422308558518573344295695155, # existing: node02-06
"28": 158798437896437949616241483468158498679, # new: node01-16
"29": 164469810678453590673964393592021302203 # new: node02-16
then we started to bootstrap the new nodes, but due to a copy-and-paste
mistake:
* node node01-14 was started with
130441573986359744327626932848844481058 as its initial token (so node01-14
has the initial_token that should belong to node02-14); it should have
124770201204344103269904022724981677533 as its initial_token
* node node02-14 was started with
136112946768375385385349842972707284582 as its initial token, so it has
the token that was generated for the existing node node01-06

However, we used a different program to generate the previous
initial_tokens, and the actual token of node01-06 in the ring is
136112946768375385385349842972707284576.
Summing up, we currently have this situation in the ring:

node02-05  rack2  Up  Normal   596.31 GB  6.67%   119098828422328462212181112601118874004
node01-14  rack1  Up  Joining  242.92 KB  0.00%   130441573986359744327626932848844481058
node01-06  rack1  Up  Normal   585.5 GB   13.33%  136112946768375385385349842972707284576
node02-14  rack2  Up  Joining  113.17 KB  0.00%   136112946768375385385349842972707284582
node02-15  rack2  Up  Joining  178.05 KB  0.00%   141784319550391026443072753096570088106
node01-15  rack1  Up  Joining  191.7 GB   0.00%   147455692332406667500795663220432891630
node02-06  rack2  Up  Normal   597.69 GB  20.00%  153127065114422308558518573344295695148


We would like to get back to our original configuration.
Is it safe to wait for all new nodes to finish bootstrapping and
after that inv
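
For reference, a minimal sketch (not the tool we used) of how the evenly
spaced initial tokens listed above can be computed for a 30-node
RandomPartitioner ring; token "1" matches 5671372782015641057722910123862803524:

import java.math.BigInteger;

public class InitialTokenSketch {
    public static void main(String[] args) {
        // Evenly spaced RandomPartitioner tokens: token_i = i * 2^127 / nodeCount.
        int nodeCount = 30;
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        for (int i = 0; i < nodeCount; i++) {
            BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(nodeCount));
            System.out.println("\"" + i + "\": " + token);
        }
    }
}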

Re: An experiment using Spring Data w/ Cassandra (initially via JPA/Kundera)

2012-07-18 Thread Vivek Mishra
Thanks. The team is working on extending support for
SimpleJPARepository (including an implementation for ManagedType).

-Vivek

On Thu, Jul 19, 2012 at 9:06 AM, Roshan  wrote:

> Hi Brian
>
> This is wonderful news for me, because we are using a lot of Spring
> support in the project. Good luck and keep posting.
>
> Cheers
>
> /Roshan.


Cassandra startup times

2012-07-18 Thread Ben Kaehne
Good evening,

I am interested in improving the startup time of our cassandra cluster.

We have a 3 node cluster (replication factor of 3) in which our application
requires quorum reads and writes to function.

Each machine is well specced with 24 GB of RAM, 10 cores, JNA enabled, etc.

On each server our keyspace files are so far around 90 GB (stored on NFS,
although I am not seeing signs that we have much network I/O). This size
will grow in the future.

Our startup time for one server at the moment is greater than half an hour
(45 to 50 minutes even), which is putting a risk factor on the
resilience of our service. I have tried versions from 1.0.9 to the latest 1.1.2.

I do not see too much system utilization while starting either.

I came across an article suggesting increased speed in 1.2, although when I
set it up, it did not seem to be any faster at all (if not slower).

I was observing what was happening during startup and noticed (via
strace) that Cassandra was doing lots of 8-byte reads from:

 
/var/lib/cassandra/data/XX/YY/XXX-YYY-hc-1871-CompressionInfo.db
/var/lib/cassandra/data/XX/YY/XXX-YYY-hc-1874-CompressionInfo.db

Also... is there some way I can change the 8-byte reads to something
larger? 8-byte reads across NFS are terribly inefficient (and, I am guessing,
the cause of our terribly slow startup times).
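
For illustration only (a sketch, not a Cassandra patch; the path and buffer
size are made up), this is the kind of buffering that turns many 8-byte
requests into a few large ones:

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class BufferedReadSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical file name; the point is only the buffering.
        String path = "/var/lib/cassandra/data/KS/CF/KS-CF-hc-1871-CompressionInfo.db";
        long sum = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path), 64 * 1024))) {
            // Each readLong() asks for 8 bytes, but the OS/NFS only sees
            // the 64 KB requests issued by the buffer in between.
            while (in.available() >= 8) {
                sum += in.readLong();
            }
        }
        System.out.println(sum);
    }
}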

Regards,

-- 
-Ben


Re: An experiment using Spring Data w/ Cassandra (initially via JPA/Kundera)

2012-07-18 Thread Roshan
Hi Brian

This is wonderful news for me, because we are using a lot of Spring
support in the project. Good luck and keep posting.

Cheers

/Roshan.



An experiment using Spring Data w/ Cassandra (initially via JPA/Kundera)

2012-07-18 Thread Brian O'Neill
This is just an FYI.

I experimented w/ Spring Data JPA w/ Cassandra leveraging Kundera.

It sort of worked:
https://github.com/boneill42/spring-data-jpa-cassandra
http://brianoneill.blogspot.com/2012/07/spring-data-w-cassandra-using-jpa.html
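
For anyone curious, a minimal sketch of the kind of wiring involved (the
entity, repository, and column family names here are made up, and Kundera
still needs its own persistence.xml configuration):

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;
import org.springframework.data.jpa.repository.JpaRepository;

// Hypothetical entity mapped with plain JPA annotations; Kundera's JPA
// provider persists it into a Cassandra column family.
@Entity
@Table(name = "users")
class User {
    @Id
    private String userId;
    private String email;
    // getters/setters omitted
}

// Spring Data JPA derives the implementation of this interface at runtime.
interface UserRepository extends JpaRepository<User, String> {
    User findByEmail(String email);
}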

I'm now working on a pure Spring Data adapter using Astyanax:
https://github.com/boneill42/spring-data-cassandra

I'll keep you posted.

(Thanks to all those that helped out w/ advice)

-brian

-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Can't change replication factor in Cassandra 1.1.2

2012-07-18 Thread Douglas Muth
Hi folks,

I have an interesting problem in Cassandra 1.1.2. A Google search
wasn't much help, so I thought I'd ask here.

Essentially, I have a "problem keyspace" in my 2-node cluster whose
replication factor I cannot change.
It's probably easier to show what I'm seeing in cassandra-cli:

[default@foobar] update keyspace test1 with strategy_options =
{replication_factor:1};
2d5f0d16-bb4b-3d75-a084-911fe39f7629
Waiting for schema agreement...
... schemas agree across the cluster
[default@foobar] update keyspace test1 with strategy_options =
{replication_factor:1};
7745dd06-ee5d-3e74-8734-7cdc18871e67
Waiting for schema agreement...
... schemas agree across the cluster

Even though keyspace "test1" had a replication_factor of 1 to start
with, each of the above UPDATE KEYSPACE commands caused a new UUID to
be generated for the schema, which I assume is normal and expected.

Then I try it with the problem keyspace:

[default@foobar] update keyspace foobar with strategy_options =
{replication_factor:1};
7745dd06-ee5d-3e74-8734-7cdc18871e67
Waiting for schema agreement...
... schemas agree across the cluster

Note that the UUID did not change, and the replication_factor in the
underlying database did not change either.
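
A rough sketch of one way to check what the cluster actually reports for the
keyspace definition, independent of what cassandra-cli echoes back (Hector
client; the method names are as I recall them, so treat them as assumptions):

import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
import me.prettyprint.hector.api.factory.HFactory;

public class ReplicationCheckSketch {
    public static void main(String[] args) {
        // Hypothetical cluster name and host; prints the strategy options the
        // cluster reports for the "foobar" keyspace.
        Cluster cluster = HFactory.getOrCreateCluster("check-cluster", "localhost:9160");
        KeyspaceDefinition def = cluster.describeKeyspace("foobar");
        System.out.println("strategy_options: " + def.getStrategyOptions());
        System.out.println("replication_factor: " + def.getReplicationFactor());
    }
}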

The funny thing is that foobar had a replication_factor of 1
yesterday, then I brought my second node online and changed the
replication_factor to 2 without incident.  I only ran into issues when
I tried changing it back to 1.

I tried running "nodetool cleanup" on both nodes, but the problem persists.

Any suggestions?

Thanks,

-- Doug

-- 
http://twitter.com/dmuth


Re: Composite Column Expiration Behavior

2012-07-18 Thread Thomas Van de Velde
I answered my own question with a test:


[default@context] list context_session_views;
Using default limit of 100
---
RowKey: test
=> (column=89b81b00-d0f3-11e1-8d4c-000c29d2a972:A, value=,
timestamp=1342628020428000, ttl=10)
=> (column=89b81b00-d0f3-11e1-8d4c-000c29d2a972:B, value=,
timestamp=1342628020428000, ttl=30)

1 Row Returned.
Elapsed time: 4 msec(s).
[default@context] list context_session_views;
Using default limit of 100
---
RowKey: test
=> (column=89b81b00-d0f3-11e1-8d4c-000c29d2a972:B, value=,
timestamp=1342628020428000, ttl=30)

1 Row Returned.
Elapsed time: 3 msec(s).
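
For completeness, a rough sketch of how the two test columns could be written
with independent TTLs from a client (Hector-style API; the serializer and
method names here are assumptions, not verified code):

import java.util.UUID;

import me.prettyprint.cassandra.serializers.CompositeSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.serializers.UUIDSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.Composite;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class CompositeTtlSketch {
    static void writeTestRow(Keyspace keyspace, UUID sessionId) {
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        // Two composite columns share the same first component but carry
        // independent TTLs; the ttl=10 column disappears first.
        mutator.addInsertion("test", "context_session_views", column(sessionId, "A", 10));
        mutator.addInsertion("test", "context_session_views", column(sessionId, "B", 30));
        mutator.execute();
    }

    static HColumn<Composite, String> column(UUID sessionId, String suffix, int ttl) {
        Composite name = new Composite();
        name.addComponent(sessionId, UUIDSerializer.get());
        name.addComponent(suffix, StringSerializer.get());
        HColumn<Composite, String> col = HFactory.createColumn(
                name, "", new CompositeSerializer(), StringSerializer.get());
        col.setTtl(ttl);
        return col;
    }
}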


On Wed, Jul 18, 2012 at 11:06 AM, rohit bhatia  wrote:

> Hi,
>
> I don't think that composite columns have "parent columns". Your point
> might be true for supercolumns, but each composite column is probably
> independent.
>
> On Wed, Jul 18, 2012 at 9:14 PM, Thomas Van de Velde
>  wrote:
> > Hi there,
> >
> > I am trying to understand the expiration behavior of composite columns.
> > Assume I have two entries that both have the same parent column name but
> > each one has a different TTL. Would expiration be applied at the parent
> > column level (taking into account TTLs set per column under the parent and
> > expiring all of the child columns when the most recent TTL is met), or is
> > each child entry expired independently?
> >
> > Would this be correct?
> >
> > A:B->ttl=5
> > A:C->ttl=10
> >
> >
> > t+5: Nothing gets expired (because A:C's expiration has not yet been
> > reached)
> > t+10: Both A:B and A:C are expired
> >
> >
> > Thanks,
> > Thomas
>


Re: Composite Column Expiration Behavior

2012-07-18 Thread rohit bhatia
Hi,

I don't think that composite columns have "parent columns". Your point
might be true for supercolumns, but each composite column is probably
independent.

On Wed, Jul 18, 2012 at 9:14 PM, Thomas Van de Velde
 wrote:
> Hi there,
>
> I am trying to understand the expiration behavior of composite columns.
> Assume I have two entries that both have the same parent column name but
> each one has a different TTL. Would expiration be applied at the parent
> column level (taking into account TTLs set per column under the parent and
> expiring all of the child columns when the most recent TTL is met), or is
> each child entry expired independently?
>
> Would this be correct?
>
> A:B->ttl=5
> A:C->ttl=10
>
>
> t+5: Nothing gets expired (because A:C's expiration has not yet been
> reached)
> t+10: Both A:B and A:C are expired
>
>
> Thanks,
> Thomas


Re: Batch update efficiency with composite key

2012-07-18 Thread Dave Brosius
Cassandra doesn't do reads before writes; it just places the updates in
memtables. In effect, updates are the same as inserts. Batches certainly help
with network latency, and some minor amount of repetition on the server
side.
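
To make the single-round-trip point concrete, a small sketch of a client-side
batch (Hector-style mutator; the column family and key names are made up):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class BatchUpdateSketch {
    static void updateWideRow(Keyspace keyspace, String partitionKey) {
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        // All insertions are buffered client-side; nothing is read back from
        // the server, and execute() ships them in a single batch_mutate call.
        for (int i = 0; i < 100; i++) {
            mutator.addInsertion(partitionKey, "events",
                    HFactory.createStringColumn("col" + i, "value" + i));
        }
        mutator.execute(); // one round trip instead of 100
    }
}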

Re: Cassandra Evaluation/ Benchmarking: Throughput not scaling as expected neither latency showing good numbers

2012-07-18 Thread Hontvári József Levente

On 2012.07.18. 7:13, Code Box wrote:
The cassandra stress tool gives me values around 2.5 milliseconds for 
writing. The problem with the cassandra stress tool is that it just 
gives the average latency numbers, and the average latency numbers that 
I am getting are comparable in some cases. It is the 95th percentile and 
99th percentile numbers that are bad. So it means that 95% of the 
requests are really bad and the remaining 5% are really good, which 
makes the average go down.



No, the opposite is true: 95% of the requests are fast and 5% are slow. 
Or, in the case of the 99th percentile, 99% are fast and 1% are slow, 
unless you order your samples in the opposite direction from the usual one.
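
A tiny sketch of the usual nearest-rank reading of a percentile, to make the
direction explicit (the sample numbers are invented):

import java.util.Arrays;

public class PercentileSketch {
    // 95th percentile: sort ascending and take the value at or below which
    // 95% of the samples fall; only the slowest ~5% are above it.
    static long percentile(long[] latenciesMicros, double p) {
        long[] sorted = latenciesMicros.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(index, 0)];
    }

    public static void main(String[] args) {
        long[] samples = {900, 950, 1000, 1100, 1200, 1300, 1500, 2000, 8000, 60000};
        System.out.println("p95 = " + percentile(samples, 0.95) + " us");
    }
}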


Batch update efficiency with composite key

2012-07-18 Thread Leonid Ilyevsky
I have a question about the efficiency of updates to a CF with a composite key.

Let's say I have 100 logical rows to update, and they all belong to the same 
physical wide row. In my naïve understanding (correct me if I am wrong), in 
order to update a logical row, Cassandra has to retrieve the whole physical 
row, add columns to it, and put it back. So I put all my 100 updates in a batch 
and send it over. Would Cassandra be smart enough to recognize that they all 
belong to one physical row, retrieve it once, do all the updates, and put it 
back once? Is my batch even relevant in this case? What happens if I just 
send the updates one by one?

I want to understand why I should use batches. I don't really care about one 
timestamp for all records, I only care about efficiency. So I thought I would 
at least save on the number of remote calls, but I also wonder what happens 
on the Cassandra side.





Re: Cassandra Evaluation/ Benchmarking: Throughput not scaling as expected neither latency showing good numbers

2012-07-18 Thread Manoj Mainali
What kind of client are you using with YCSB? If you want to improve latency,
try distributing the requests among the nodes instead of stressing a single
node, and try host connection pooling instead of creating a connection for each
request. Check out high-level clients like Hector or Astyanax if you
are not already using them. Some clients have ring-aware request handling.
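
For example, with Hector something along these lines spreads requests and
pools connections (the host addresses, pool size, and keyspace name here are
made-up examples):

import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class PooledClientSketch {
    static Keyspace connect() {
        // Seed with several nodes and let the client discover the rest, so
        // requests are spread across the ring instead of hammering one host.
        CassandraHostConfigurator hosts =
                new CassandraHostConfigurator("10.0.0.1:9160,10.0.0.2:9160,10.0.0.3:9160");
        hosts.setAutoDiscoverHosts(true); // learn the full ring from the seeds
        hosts.setMaxActive(50);           // pooled connections per host
        Cluster cluster = HFactory.getOrCreateCluster("bench-cluster", hosts);
        return HFactory.createKeyspace("usertable", cluster);
    }
}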

You have a 3-node cluster and are using an RF of three, which means every node
will get the data. What CL are you using for writes? Latency increases for
stronger CLs.

If you want to increase throughput, try increasing the number of clients.
Of course, that doesn't mean throughput will always increase; my
observation was that it increases up to a certain number of clients and
then decreases again.

Regards,
Manoj Mainali


On Wednesday, July 18, 2012, Code Box wrote:

> The cassandra stress tool gives me values around 2.5 milliseconds for
> writing. The problem with the cassandra stress tool is that it just gives
> the average latency numbers, and the average latency numbers that I am
> getting are comparable in some cases. It is the 95th percentile and 99th
> percentile numbers that are bad. So it means that 95% of the requests are
> really bad and the remaining 5% are really good, which makes the average go
> down. I want to make sure that the 95th and 99th percentile values are in
> single-digit milliseconds. I want them to be single digit because I have
> seen people getting those numbers.
>
> These are my conclusions so far from all of the investigations:
>
> A three-node cluster with a replication factor of 3 gets me around 10 ms
> for 100% writes with consistency level ONE. The reads are really bad; they
> are around 65 ms.
>
> I thought the network was the issue, so I moved the client to a local
> machine. The client on the local machine with a one-node cluster again
> gives me good average write latencies, but the 99th and 95th percentiles
> are bad. I am getting around 10 ms for writes and 25 ms for reads.
>
> Network bandwidth between the client and server is 1 Gigabit/second. I was
> able to generate at most 25K requests. So it could be that the client is
> the bottleneck. I am using YCSB; maybe I should change to some other client.
>
> The maximum throughput I got from a single client was 35K locally and 17K
> remotely.
>
>
> I can try these things now:
>
> Use a different client and see what numbers I get for the 99th and 95th
> percentiles. I am not sure if there is any client that gives me this level
> of detail, or whether I have to write one of my own.
>
> Tweak some hard disk settings (RAID 0, xfs/ext4) and see if that helps.
>
> It could be that from Cassandra 0.8 to 1.1 the 95th and 99th percentile
> numbers have gotten worse. The throughput numbers have also gone down.
>
> Is there any other client I can use besides the cassandra stress tool and
> YCSB, and are the numbers I have gotten so far any good?
>
>
> --Akshat Vig.
>
>
>
>
> On Tue, Jul 17, 2012 at 9:22 PM, aaron morton wrote:
>
> I would benchmark a default installation, then start tweaking. That way
> you can see if your changes result in improvements.
>
> To simplify things further try using the tools/stress utility in the
> cassandra source distribution first. It's pretty simple to use.
>
> Add clients until you see the latency increase and tasks start to back up
> in nodetool tpstats. If you see it report dropped messages, it is
> overloaded.
>
> Hope that helps.
>
>   -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 18/07/2012, at 4:48 AM, Code Box wrote:
>
> Thanks a lot for your replies, guys. I was trying fsync = batch with a
> window of 0 ms to see if the disk utilization was maxing out on my drive. I
> checked the numbers using iostat; utilization was around 60% and the CPU
> usage was also not too high.
>
> Configuration of my Setup :-
>
> I have three m1.xlarge hosts, each with 15 GB RAM and 4 CPUs (8 EC2
> Compute Units).
> I have kept the replication factor equal to 3. The typical write size is
> 1 KB.
>
> I tried adding different client nodes, each with 200 threads, and the
> throughput got split in two. If I do it from a single host with fsync set
> to periodic and a window size of 1000 ms, using two nodes, I get these
> numbers:
>
>
> [OVERALL], Throughput(ops/sec), 4771
> [INSERT], AverageLatency(us), 18747
> [INSERT], MinLatency(us), 1470
> [INSERT], MaxLatency(us), 446413
> [INSERT], 95thPercentileLatency(ms), 55
> [INSERT], 99thPercentileLatency(ms), 167
>
> [OVERALL], Throughput(ops/sec), 4678
> [INSERT], AverageLatency(us), 22015
> [INSERT], MinLatency(us), 1439
> [INSERT], MaxLatency(us), 466149
> [INSERT], 95thPercentileLatency(ms), 62
> [INSERT], 99thPercentileLatency(ms), 171
>
> Is there something I am doing wrong in the Cassandra setup? What is the
> best setup for Cassandra to get high throughput and good write latency
> numbers?
>
>
>
> On Tue, Jul 17, 2012 at 7:02 AM, Sylvain Lebresne 
>
>