Re: SSTable format

2012-07-13 Thread prasenjit mukherjee
>
> It depends on what partitioner you use. You should be using the
> RandomPartitioner, and if so, the rows are sorted by the hash of the row
key. There are partitioners that sort based on the raw key value, but these
> partitioners shouldn't be used as they have problems due to uneven
> partitioning of data.
>

Any reason row keys are not stored by their raw keys on a given node
for RP?  I understand the partitioning across nodes should be
randomized, but on a given node why are they sorted by the hash of
their keys and not just by the raw keys?

What do we gain by 'decorating' the keys with the hash?  (ref section 3
in http://wiki.apache.org/cassandra/ArchitectureSSTable)

-Thanks,
Prasenjit


Re: SSTable format

2012-07-13 Thread Dave Brosius
While it's in memory, Cassandra calls it a memtable; but yes, SSTables are
write-once, and are later combined with others into new ones through compaction.




On 07/13/2012 09:54 PM, Michael Theroux wrote:

Thanks for the information,

So is the SSTable essentially kept in memory, then sorted and written to disk
on flush?  After that point, an SSTable is not modified, but can be merged into
another SSTable through compaction?

-Mike

On Jul 13, 2012, at 8:22 PM, Rob Coli wrote:


On Fri, Jul 13, 2012 at 5:18 PM, Dave Brosius  wrote:

It depends on what partitioner you use. You should be using the
RandomPartitioner, and if so, the rows are sorted by the hash of the row
key. There are partitioners that sort based on the raw key value, but these
partitioners shouldn't be used as they have problems due to uneven
partitioning of data.

The formal way this works in the code is that SSTables are ordered by
"decorated" row key, where "decoration" is only a transformation when
you are not using OrderedPartitioner. FWIW, in case you see that
"DecoratedKey" syntax while reading code..

=Rob

--
=Robert Coli
AIM>ALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb






Re: SSTable format

2012-07-13 Thread Michael Theroux
Thanks for the information,

So is the SSTable essentially kept in memory, then sorted and written to disk
on flush?  After that point, an SSTable is not modified, but can be merged into
another SSTable through compaction?

-Mike

On Jul 13, 2012, at 8:22 PM, Rob Coli wrote:

> On Fri, Jul 13, 2012 at 5:18 PM, Dave Brosius  
> wrote:
>> It depends on what partitioner you use. You should be using the
>> RandomPartitioner, and if so, the rows are sorted by the hash of the row
>> key. There are partitioners that sort based on the raw key value, but these
>> partitioners shouldn't be used as they have problems due to uneven
>> partitioning of data.
> 
> The formal way this works in the code is that SSTables are ordered by
> "decorated" row key, where "decoration" is only a transformation when
> you are not using OrderedPartitioner. FWIW, in case you see that
> "DecoratedKey" syntax while reading code..
> 
> =Rob
> 
> -- 
> =Robert Coli
> AIM>ALK - rc...@palominodb.com
> YAHOO - rcoli.palominob
> SKYPE - rcoli_palominodb



Re: SSTable format

2012-07-13 Thread Rob Coli
On Fri, Jul 13, 2012 at 5:18 PM, Dave Brosius  wrote:
> It depends on what partitioner you use. You should be using the
> RandomPartitioner, and if so, the rows are sorted by the hash of the row
> key. There are partitioners that sort based on the raw key value, but these
> partitioners shouldn't be used as they have problems due to uneven
> partitioning of data.

The formal way this works in the code is that SSTables are ordered by
"decorated" row key, where "decoration" is only a transformation when
you are not using OrderedPartitioner. FWIW, in case you see that
"DecoratedKey" syntax while reading code..

=Rob

-- 
=Robert Coli
AIM>ALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: SSTable format

2012-07-13 Thread Dave Brosius

On 07/13/2012 08:00 PM, Michael Theroux wrote:

Hello,

I've been trying to understand in greater detail how SSTables are stored, and
how information is transferred between Cassandra nodes, especially when a new
node is joining a cluster.

Specifically, is information stored in SSTables ordered by row keys?  Some of
the articles I've read suggest this is the case (although it's a little vague
whether they actually mean that the columns are stored in order, rather than
the row keys).  However, if data is stored in row-key order, how is this
achieved, given that SSTables are immutable?

Thanks for any insights,
-Mike


It depends on what partitioner you use. You should be using the
RandomPartitioner, and if so, the rows are sorted by the hash of the row
key. There are partitioners that sort based on the raw key value, but
these partitioners shouldn't be used as they have problems due to uneven
partitioning of data.


As for how this is done, remember an SSTable doesn't hold all the data
for a column family. Not only does the data for a column family exist on
multiple servers, there are usually multiple SSTable files on disk that
represent data from one column family on one machine. So at the time the
SSTable is written, the rows that are to be put in the SSTable are
sorted, and written in sorted order. In fact the same row key may be
written in multiple SSTables, one SSTable having one set of columns for
the key, another SSTable having other columns for the same key.
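
In sketch form (Row and SSTableWriter are made-up stand-ins, not the
real internals), the flush step amounts to walking a sorted map and
appending each row once:

    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;

    interface Row { Row mergeWith(Row other); }
    interface SSTableWriter { void append(String key, Row row); void close(); }

    class MemtableSketch {
        // Kept sorted by decorated key, so iteration order matches
        // the on-disk order of the SSTable we are about to write.
        private final ConcurrentSkipListMap<String, Row> rows =
                new ConcurrentSkipListMap<String, Row>();

        void put(String decoratedKey, Row row) {
            rows.merge(decoratedKey, row, Row::mergeWith);
        }

        // Flush: append rows sequentially in sorted order, then seal
        // the file; this SSTable is never modified again.
        void flush(SSTableWriter writer) {
            for (Map.Entry<String, Row> e : rows.entrySet()) {
                writer.append(e.getKey(), e.getValue());
            }
            writer.close();
        }
    }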


On a query for some row by key, Cassandra is responsible for finding
which SSTables (potentially several) hold columns for that key and
merging the results.
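
Again as a sketch with made-up helper names: the read path asks every
SSTable that might contain the key for its columns, and keeps the
newest version of each column by timestamp:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    record Column(String name, byte[] value, long timestamp) {}

    interface SSTableReader {
        List<Column> read(String key); // empty if the key isn't in this file
    }

    class RowMergeSketch {
        static Map<String, Column> merge(String key, List<SSTableReader> sstables) {
            Map<String, Column> merged = new HashMap<>();
            for (SSTableReader sstable : sstables) {
                for (Column c : sstable.read(key)) {
                    // Last write wins, per column name.
                    merged.merge(c.name(), c,
                            (a, b) -> a.timestamp() >= b.timestamp() ? a : b);
                }
            }
            return merged;
        }
    }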


SSTable format

2012-07-13 Thread Michael Theroux
Hello,

I've been trying to understand in greater detail how SSTables are stored, and
how information is transferred between Cassandra nodes, especially when a new
node is joining a cluster.

Specifically, is information stored in SSTables ordered by row keys?  Some of
the articles I've read suggest this is the case (although it's a little vague
whether they actually mean that the columns are stored in order, rather than
the row keys).  However, if data is stored in row-key order, how is this
achieved, given that SSTables are immutable?

Thanks for any insights,
-Mike 

Re: Increased replication factor not evident in CLI

2012-07-13 Thread Dustin Wenz
I was able to apply the patch in the cited bug report to the public source for
version 1.1.2. It seemed pretty straightforward; six lines in
MigrationManager.java were switched from System.currentTimeMillis() to
FBUtilities.timestampMicros(). I then re-built the project by running 'ant
artifacts' in the Cassandra root.
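
Schematically, the change was along these lines (my paraphrase, not the
exact upstream diff):

    // Before: millisecond precision, so a schema change could be
    // stamped at or before an earlier change and be silently ignored.
    long timestamp = System.currentTimeMillis();

    // After: microsecond precision, using the existing helper in
    // org.apache.cassandra.utils.FBUtilities.
    long timestamp = FBUtilities.timestampMicros();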

After I was up and running with the new version, I attempted to increase the
replication factor, and then to change the compression options.

Unfortunately, the new patch did not seem to help in my case. Neither of the
schema attributes would change. Running a "describe cluster" shows that all
node schemas are consistent.

Are there any other ways that I could potentially force Cassandra to accept
these changes?

- .Dustin

On Jul 13, 2012, at 10:02 AM, Dustin Wenz wrote:

> It sounds plausible that this is what we are running into. All of our nodes
> report a replication factor of 2 (both using describe and show schema), even
> though the cluster reported that all schemas agreed after I issued the change
> to 4.
> 
> If this is related to the bug that you filed, it might also explain why I've 
> had difficulty changing the compression options on this same cluster. I issue 
> an update command, schemas agree, yet the change is not evident.
> 
>   - .Dustin
> 
> On Jul 12, 2012, at 7:56 PM, Michael Theroux wrote:
> 
>> Sounds a lot like a bug that I hit that was filed and fixed recently:
>> 
>> https://issues.apache.org/jira/browse/CASSANDRA-4432
>> 
>> -Mike
>> 
>> On Jul 12, 2012, at 8:16 PM, Edward Capriolo wrote:
>> 
>>> Possibly the bug with nanotime causing cassandra to think the change 
>>> happened in the past. Talked about onlist in past few days.
>>> On Thursday, July 12, 2012, aaron morton  wrote:
>>>> Do multiple nodes say the RF is 2 ? Can you show the output from the CLI ?
>>>> Do show schema and show keyspace say the same thing ?
>>>> Cheers
>>>>
>>>> -
>>>> Aaron Morton
>>>> Freelance Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 13/07/2012, at 7:39 AM, Dustin Wenz wrote:
>>>>
>>>> We recently increased the replication factor of a keyspace in our
>>>> cassandra 1.1.1 cluster from 2 to 4. This was done by setting the
>>>> replication factor to 4 in cassandra-cli, and then running a repair on
>>>> each node.
>>>>
>>>> Everything seems to have worked; the commands completed successfully and
>>>> disk usage increased significantly. However, if I perform a describe on
>>>> the keyspace, it still shows replication_factor:2. So, it appears that the
>>>> replication factor might be 4, but it reports as 2. I'm not entirely sure
>>>> how to confirm one or the other.
>>>>
>>>> Since then, I've stopped and restarted the cluster, and even ran an
>>>> upgradesstables on each node. The replication factor still doesn't report
>>>> as I would expect. Am I missing something here?
>>>>
>>>> - .Dustin
>> 
> 



2012 Cassandra MVP nominations

2012-07-13 Thread Jonathan Ellis
DataStax would like to recognize individuals who go above and beyond
in their contributions to Apache Cassandra.  To formalize this a
little bit, we're creating an MVP program, the first class of which
will be announced at the Cassandra Summit [1] in August.

To make this program a success, we need your help to nominate either
yourself or someone else you think merits consideration.  We're looking
for people who take the initiative in organizing user groups, who explain
Cassandra in talks, blogs, Twitter, or other forums, or who answer
questions on the mailing list, IRC, StackOverflow, etc.

Please take five minutes and submit your nomination today at [2].
Nominations will be open throughout the next week.  Those selected
will be notified in advance.

[1] http://www.datastax.com/events/cassandrasummit2012
[2] http://www.surveymonkey.com/s/WVBZGHR

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Cassandra Summit 2012

2012-07-13 Thread Jonathan Ellis
Hi all,

The 2012 Cassandra Summit will be in San Jose on August 8.  The 2011
Summit sold out with almost 500 attendees; this year we found a bigger
venue to accommodate 700+.  It's fantastic to see the Cassandra
community grow like this!

The 2012 Summit will have *four* talk tracks, plus the popular "Ask
the Experts" breakout room where DataStax engineers will take any
question, all day.  Accepted talks are posted at
http://www.datastax.com/events/cassandrasummit2012#Sessions, and
speaker bios at
http://www.datastax.com/events/cassandrasummit2012#Speakers.  More
abstracts will be posted as they are confirmed.

Learn more and register at
http://www.datastax.com/events/cassandrasummit2012.  Use the
"cassandra-list-20" code when registering and save 20%!

P.S. Brandon Williams and I will be conducting a developer training
course immediately before the Summit.  More information at
http://www.datastax.com/services/training

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: How to speed up data loading

2012-07-13 Thread Tupshin Harper
Any chance your server has been running for the last two weeks with the
leap second bug?
http://www.datastax.com/dev/blog/linux-cassandra-and-saturdays-leap-second-problem

-Tupshin
On Jul 12, 2012 1:43 PM, "Leonid Ilyevsky" 
wrote:

>  I am loading a large set of data into a CF with a composite key. The load
> is going pretty slowly, hundreds or even thousands of times slower than it
> would in an RDBMS.
>
> I have a choice of how granular my physical key (the first component of
> the primary key) is; this way I can balance between smaller rows and too
> many keys vs. wide rows and fewer keys. What are the guidelines about
> this?  How does the width of the physical row affect the speed of the load?
>
>
> I see that Cassandra is doing a lot of processing behind the scenes; even
> when I kill the client, the server keeps consuming a lot of CPU for a
> long time.
>
>
> What else should I look at?  Anything in the configuration?
>
> --
> This email, along with any attachments, is confidential and may be legally
> privileged or otherwise protected from disclosure. Any unauthorized
> dissemination, copying or use of the contents of this email is strictly
> prohibited and may be in violation of law. If you are not the intended
> recipient, any disclosure, copying, forwarding or distribution of this
> email is strictly prohibited and this email and any attachments should be
> deleted immediately. This email and any attachments do not constitute an
> offer to sell or a solicitation of an offer to purchase any interest in any
> investment vehicle sponsored by Moon Capital Management LP (“Moon
> Capital”). Moon Capital does not provide legal, accounting or tax advice.
> Any statement regarding legal, accounting or tax matters was not intended
> or written to be relied upon by any person as advice. Moon Capital does not
> waive confidentiality or privilege as a result of this email.
>


Re: Increased replication factor not evident in CLI

2012-07-13 Thread Dustin Wenz
It sounds plausible that this is what we are running into. All of our nodes
report a replication factor of 2 (both using describe and show schema), even
though the cluster reported that all schemas agreed after I issued the change
to 4.

If this is related to the bug that you filed, it might also explain why I've 
had difficulty changing the compression options on this same cluster. I issue 
an update command, schemas agree, yet the change is not evident.

- .Dustin

On Jul 12, 2012, at 7:56 PM, Michael Theroux wrote:

> Sounds a lot like a bug that I hit that was filed and fixed recently:
> 
> https://issues.apache.org/jira/browse/CASSANDRA-4432
> 
> -Mike
> 
> On Jul 12, 2012, at 8:16 PM, Edward Capriolo wrote:
> 
>> Possibly the bug with nanotime causing cassandra to think the change 
>> happened in the past. Talked about onlist in past few days.
>> On Thursday, July 12, 2012, aaron morton  wrote:
>> > Do multiple nodes say the RF is 2 ? Can you show the output from the CLI ? 
>> > Do show schema and show keyspace say the same thing ?
>> > Cheers
>> >
>> >
>> > -
>> > Aaron Morton
>> > Freelance Developer
>> > @aaronmorton
>> > http://www.thelastpickle.com
>> > On 13/07/2012, at 7:39 AM, Dustin Wenz wrote:
>> >
>> > We recently increased the replication factor of a keyspace in our 
>> > cassandra 1.1.1 cluster from 2 to 4. This was done by setting the 
>> > replication factor to 4 in cassandra-cli, and then running a repair on 
>> > each node.
>> >
>> > Everything seems to have worked; the commands completed successfully and 
>> > disk usage increased significantly. However, if I perform a describe on 
>> > the keyspace, it still shows replication_factor:2. So, it appears that the 
>> > replication factor might be 4, but it reports as 2. I'm not entirely sure 
>> > how to confirm one or the other.
>> >
>> > Since then, I've stopped and restarted the cluster, and even ran an 
>> > upgradesstables on each node. The replication factor still doesn't report 
>> > as I would expect. Am I missing something here?
>> >
>> > - .Dustin
>> >
>> >
>> >
> 



Re: Cassandra and Tableau

2012-07-13 Thread Robin Verlangen
Thank you Aaron and Brian. We're currently investigating several options.
Hadoop + Hive combo also seems a good choice as our input files are flat.
I'll keep you up-to-date about our final decision.

- Robin

2012/7/6 aaron morton 

> Here are two links I've noticed in my travels, have not looked into what
> they offer.
>
> http://www.pentaho.com/big-data/nosql/cassandra/
>
> http://www.jaspersoft.com/bigdata
>
> Cheers
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 7/07/2012, at 3:03 AM, Brian O'Neill wrote:
>
>> Robin,
>>
>> We have the same issue right now.  We use Tableau for all of our
>> reporting needs, but we couldn't find any acceptable bridge between it
>> and Cassandra.
>>
>> We ended up using cassandra-triggers to replicate the data to Oracle.
>> https://github.com/hmsonline/cassandra-triggers/
>>
>> Let us know if you get things set up with a direct connection.
>> We'd be *very* interested in helping out if you find a way to do it.
>>
>> -brian
>>
>> On Fri, Jul 6, 2012 at 5:31 AM, Robin Verlangen wrote:
>>
>>> Hi there,
>>>
>>> Is there anyone out there who's using Tableau in combination with a
>>> Cassandra cluster? There seems to be no standard solution to connect, at
>>> least I couldn't find one. Does anyone know how to tackle this problem?
>>>
>>> With kind regards,
>>>
>>> Robin Verlangen
>>> Software engineer
>>>
>>> W http://www.robinverlangen.nl
>>> E ro...@us2.nl
>>
>> --
>> Brian ONeill
>> Lead Architect, Health Market Science (http://healthmarketscience.com)
>> mobile: 215.588.6024
>> blog: http://weblogs.java.net/blog/boneill42/
>> blog: http://brianoneill.blogspot.com/


-- 
With kind regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E ro...@us2.nl

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.


Never ending manual repair after adding second DC

2012-07-13 Thread Bart Swedrowski
Hello everyone,

I'm facing quite a weird problem with Cassandra since we added a
secondary DC to our cluster, and we have totally run out of ideas; this
email is a call for help/advice!

History looks like:
- we used to have 4 nodes in a single DC
- running Cassandra 0.8.7
- RF:3
- around 50GB of data on each node
- RandomPartitioner and SimpleSnitch

All was working fine for over 9 months.  A few weeks ago we decided to
add another 4 nodes in a second DC and join them to the cluster.  Prior
to doing that, we upgraded Cassandra to 1.0.9, to get that out of the
door before the multi-DC work.  After the upgrade, we left it running
for over a week and it was all good; no issues.

Then we added 4 additional nodes in another DC, bringing the cluster
to 8 nodes in total spread across two DCs, so now we have:
- 8 nodes across 2 DCs, 4 in each DC
- a 100Mbps low-latency connection (sub 5ms) running over a Cisco ASA
site-to-site VPN (which is IKEv1 based)
- RF of DC1:3, DC2:3
- RandomPartitioner, and now PropertyFileSnitch

nodetool ring looks as follows:
$ nodetool -h localhost ring
Address         DC   Rack  Status  State   Load      Owns    Token
                                                              148873535527910577765226390751398592512
192.168.81.2    DC1  RC1   Up      Normal  37.9 GB   12.50%  0
192.168.81.3    DC1  RC1   Up      Normal  35.32 GB  12.50%  21267647932558653966460912964485513216
192.168.81.4    DC1  RC1   Up      Normal  39.51 GB  12.50%  42535295865117307932921825928971026432
192.168.81.5    DC1  RC1   Up      Normal  19.42 GB  12.50%  63802943797675961899382738893456539648
192.168.94.178  DC2  RC1   Up      Normal  40.72 GB  12.50%  85070591730234615865843651857942052864
192.168.94.179  DC2  RC1   Up      Normal  30.42 GB  12.50%  106338239662793269832304564822427566080
192.168.94.180  DC2  RC1   Up      Normal  30.94 GB  12.50%  127605887595351923798765477786913079296
192.168.94.181  DC2  RC1   Up      Normal  12.75 GB  12.50%  148873535527910577765226390751398592512

(Please ignore the fact that the nodes are not interleaved; they should
be, but there was a hiccup during the implementation phase.  Unless
*this* is the problem!)

Now, the problem: over 7 out of 10 manual repairs do not finish.  They
usually get stuck, showing 3 different symptoms:

  1). Say node 192.168.81.2 runs a manual repair; it requests merkle
trees from 192.168.81.2, 192.168.81.3, 192.168.81.5, 192.168.94.178,
192.168.94.179, and 192.168.94.181.  It receives them from all of these
except 192.168.94.181.  The 192.168.94.181 logs say that it has sent
the merkle tree back, but the tree is never received by 192.168.81.2.
  2). As in 1), but the 192.168.94.181 logs say *nothing* about a merkle
tree being sent, and compactionstats doesn't show the tree being
validated (generated) either.
  3). The merkle trees are delivered, and nodes start streaming data to
sync themselves.  On certain occasions they get "stuck" streaming files
between each other at 100% and won't move forward.  The interesting bit
is that the nodes that get stuck are always in different DCs!

Now, pretty much every scenario points towards a connectivity problem;
however, we also have a few PostgreSQL replication streams running over
this connection, some other traffic, and quite a lot of monitoring, and
none of those are affected in any way.

Also, if random packets are being lost, I'd expect TCP to correct that
(re-transmit them).

It doesn't matter whether it's a full repair or just a -pr repair; both
end up pretty much the same way.

Has anyone come across this kind of issue before, or does anyone have
ideas how else I could investigate it?  The issue is pressing massively,
as this is our live cluster and I have to run repairs by hand (usually
multiple times before one finally goes through) every single day…  I'm
also not sure whether the cluster is being affected in any other way.

I've gone through the Jira issues and considered upgrading to 1.1.x,
but I can't see anything that even looks like what is happening to my
cluster.

If any further information, such as logs or configuration files, is
needed, please let me know.

Any information, suggestions, or advice would be greatly appreciated.

Kind regards,
Bart


Re: Using a node in separate cluster without decommissioning.

2012-07-13 Thread rohit bhatia
Hi

Just wanted to say that it worked. I also made sure to modify the
Thrift rpc_port and the storage_port so that the two clusters don't
interfere. Thanks for the suggestion.

Thanks
Rohit

On Thu, Jul 12, 2012 at 10:01 AM, aaron morton  wrote:
> Since the replication factor is 2 in the first cluster, I
> won't lose any data.
>
> Assuming you have been running repair or working at CL QUORUM (which is the
> same as CL ALL for RF 2)
>
> Is it advisable and safe to go ahead?
>
> Um, so the plan is to turn off 2 nodes in the first cluster, re-task them
> into the new cluster, and then reverse the process?
>
> If you simply turn two nodes off in the first cluster, you will have reduced
> the availability for a portion of the ring. 25% of the keys will now have at
> best 1 node they can be stored on. If a node is having any sort of problems,
> and it is a replica for one of the down nodes, the cluster will appear
> down for 12.5% of the keyspace.
>
> If you work at QUORUM you will not have enough nodes available to write /
> read 25% of the keys.
>
> If you decomission the nodes, you will still have 2 replicas available for
> each key range. This is the path I would recommend.
>
> If you _really_ need to do it, what you suggest will probably work. Some
> tips:
>
> * do safe shutdowns - nodetool disablegossip, disablethrift, drain
> * don't forget to copy the yaml file.
> * in the first cluster the other nodes will collect hints for the first hour
> the nodes are down. You are not going to want these so disable HH.
> * get the nodes back into the first cluster before gc_grace_seconds expires.
> * bring them back and repair them.
> * when you bring them back, reading at CL ONE will give inconsistent
> results. Reading at QUORUM may result in a lot of repair activity.
>
> Hope that helps.
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 11/07/2012, at 6:35 AM, rohit bhatia wrote:
>
> Hi
>
> I want to take out 2 nodes from an 8-node cluster and use them in another
> cluster, but I can't afford the overhead of streaming the data and
> rebalancing the cluster. Since the replication factor is 2 in the first
> cluster, I won't lose any data.
>
> I'm planning to save my commit_log and data directories and
> bootstrap the node in the second cluster. Afterwards I'll just
> replace both directories and join the node back to the original
> cluster.  This should work since Cassandra saves all the cluster and
> schema info in the system keyspace.
>
> Is it advisable and safe to go ahead?
>
> Thanks
> Rohit
>
>