Hi everyone,

At my work we are in the early stages of moving our data, which lives on EC2
machines, from a Flare/memcache system to Cassandra, so this thread has been
interesting to me.

I realize that this might complicate things and make them less "simple", but
would it be useful for the nodes themselves to advertise some of their info?
For example, when a node starts to bootstrap, it pushes its specs to the seed
node, and the seed node uses them to figure out what configuration to push back.

Useful things that nodes could advertise:

- the data centre they are in
- performance info: memory, CPU, etc. (these could be used, for example, to
decide more intelligently how to partition the data a new node receives)
- geographical info
- perhaps a preferred hash range, not just a token (and presumably everything
else would automatically rebalance to make that happen)

P.S. The last two could be useful for someone whose data lives in Cassandra
but is most relevant to a particular geography. Think of something like
Craigslist: having the data for the San Francisco listings just happen to
bootstrap over to a datacenter on the east coast wouldn't be very efficient.
But having two completely separate datastores might not be the simplest design
either. It would be nice to just tell the datastore where the info is most
relevant and have it make intelligent choices about where to store things for
you.

In my case we are building a reputation system. It would be nice to have a way
to ensure that at least one replica of the data stays on the customer's
machine and one or more copies stay on our servers. I don't know how to do
that, and the reverse is important too: making sure one customer's data
doesn't get replicated to another customer's node. I guess rather than a ring
topology I'd like to try a star: everything in the center, plus
location-specific info at the points. An option would be to use different
datastores at both ends and push updates over to the central store, which
would be Cassandra, but that isn't as transparent as just having Cassandra
nodes everywhere and having the replication happen in a smart way.
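To make the star idea concrete, here is a toy placement rule: pin one replica
to the owning customer's node and place the rest only on central servers. This
is a sketch of the policy I mean, not something Cassandra's replication
strategies actually support:

```python
def place_replicas(owner_node, central_nodes, replication_factor):
    """Pin one replica on the data owner's node; put all remaining
    replicas on central servers, never on other customers' nodes.
    (Hypothetical policy sketch, not a real Cassandra strategy.)"""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    replicas = [owner_node]
    for node in central_nodes:
        if len(replicas) == replication_factor:
            break
        if node != owner_node:
            replicas.append(node)
    if len(replicas) < replication_factor:
        raise ValueError("not enough central nodes to satisfy the factor")
    return replicas

# One copy stays with the customer, the rest go to our central cluster:
# place_replicas("cust-1", ["core-a", "core-b", "core-c"], 3)
```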

On 2010-04-03, at 3:04 PM, Joe Stump wrote:

> 
> On Apr 3, 2010, at 2:54 PM, Benjamin Black wrote:
> 
>> I'm pretty familiar with EC2, hence the question.  I don't believe any
>> patches are required to do these things.  Regardless, as I noted in
>> that ticket, you definitely do NOT need AWS credentials to determine
>> your availability zone.  It is available through the metadata web
>> server for each instance as 'placement_availability_zone', avoiding
>> the need to speak the EC2 API or store credentials in the configs.
> 
> Good point on the metadata web server. Though I'm unsure how Cassandra would 
> know anything about those AZ's without using code that's aware of such 
> things, such as the rack-aware strategy we made.
> 
> Am I missing something further? I asked a friend on the EC2 networking team 
> if you could determine AZ by IP address and he said, "No." 
> 
> --Joe
> 
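For anyone following along, the metadata-server lookup Benjamin describes can
be sketched like this. The URL below is the standard EC2 instance metadata
endpoint; deriving the region by stripping the trailing zone letter is a
common convention, not something the API guarantees:

```python
from urllib.request import urlopen

METADATA_AZ_URL = (
    "http://169.254.169.254/latest/meta-data/placement/availability-zone"
)

def fetch_availability_zone(url=METADATA_AZ_URL, timeout=2):
    """Read this instance's AZ from the EC2 metadata server.
    Works only on an EC2 instance; no AWS credentials needed."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()

def region_from_az(az):
    """'us-east-1a' -> 'us-east-1' (strip the trailing zone letter)."""
    return az.rstrip("abcdefghijklmnopqrstuvwxyz")

# On an EC2 instance:
# az = fetch_availability_zone()   # e.g. "us-east-1a"
# region = region_from_az(az)      # e.g. "us-east-1"
```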
