Hi everyone,

At work we are in the early stages of moving our data, which lives on EC2 machines, from a Flare/memcache system to Cassandra, so your chat has been interesting to me.
I realize this might complicate things and make them less "simple", but would it be useful for the nodes themselves to advertise some of their info? For example, when a node starts to bootstrap, it pushes its specs over to the seed node, and the seed node uses them to figure out what configuration to push back.

Useful things that nodes could advertise:

- the data centre they are in
- performance info: memory, CPU, etc. (this could be used to decide more intelligently how to partition the data that the new node gets, for example)
- geographical info
- perhaps a preferred hash range, not just a token (and presumably everything else would automatically rebalance itself to make that happen)

P.S. The last two could be useful to someone whose data lives in Cassandra but is most relevant to a particular geography. Think of something like Craigslist. Having the data for the San Francisco listings just happen to bootstrap over to a datacenter on the east coast wouldn't be very efficient. But having two completely separate datastores might not be the simplest design either. It would be nice to just tell the datastore where the info is most relevant and have it make intelligent choices about where to store things for you.

In my case we are making a reputation system. It would be nice to have a way to make sure that at least one replica of the data stays on the customer's machine and one or more copies live on our servers. I don't know how to do that. The reverse would be important too: making sure one customer's data doesn't get replicated to another customer's node.

I guess rather than a ring topology I'd like to try for a star: everything in the center, plus location-specific info at the points. One option would be to use different datastores at the two ends and push updates over to the central store, which would be Cassandra, but that isn't as transparent as having Cassandra nodes everywhere and letting the replication happen in a smart way.
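To make the "nodes advertise their info" idea a bit more concrete, here is a rough Python sketch of what a bootstrapping node might send and how a seed might react. None of this is Cassandra's actual bootstrap protocol: NodeAdvert, encode_advert, decode_advert, plan_bootstrap and the 8 GB baseline are all names and numbers I made up for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class NodeAdvert:
    """Hypothetical message a bootstrapping node pushes to the seed."""
    datacenter: str
    mem_gb: float
    cpu_cores: int
    geo: Optional[str] = None                          # e.g. "San Francisco"
    preferred_range: Optional[Tuple[int, int]] = None  # (start, end) of a hash range


def encode_advert(advert: NodeAdvert) -> str:
    # JSON is only a stand-in wire format for this sketch.
    return json.dumps(asdict(advert), sort_keys=True)


def decode_advert(payload: str) -> NodeAdvert:
    data = json.loads(payload)
    # JSON turns tuples into lists, so restore the tuple on the way back in.
    if data.get("preferred_range") is not None:
        data["preferred_range"] = tuple(data["preferred_range"])
    return NodeAdvert(**data)


def plan_bootstrap(advert: NodeAdvert, baseline_mem_gb: float = 8.0) -> dict:
    """Toy seed-side policy (an assumption, not how Cassandra decides):
    weight the new node's share of data by its memory, and pass through a
    preferred range if the node asked for one."""
    return {
        "datacenter": advert.datacenter,
        "load_weight": max(advert.mem_gb / baseline_mem_gb, 0.1),
        "token_range": advert.preferred_range,  # None means the seed chooses
    }
```

A node would call encode_advert() on its own specs, and the seed would run decode_advert() followed by plan_bootstrap() to decide what to push back. The interesting (and hard) part, rebalancing everyone else around a preferred range, is deliberately left out.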
On 2010-04-03, at 3:04 PM, Joe Stump wrote:

>
> On Apr 3, 2010, at 2:54 PM, Benjamin Black wrote:
>
>> I'm pretty familiar with EC2, hence the question. I don't believe any
>> patches are required to do these things. Regardless, as I noted in
>> that ticket, you definitely do NOT need AWS credentials to determine
>> your availability zone. It is available through the metadata web
>> server for each instance as 'placement_availability_zone', avoiding
>> the need to speak the EC2 API or store credentials in the configs.
>
> Good point on the metadata web server. Though I'm unsure how Cassandra would
> know anything about those AZ's without using code that's aware of such
> things, such as the rack-aware strategy we made.
>
> Am I missing something further? I asked a friend on the EC2 networking team
> if you could determine AZ by IP address and he said, "No."
>
> --Joe
>
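
On the metadata web server point in the quoted thread: a minimal sketch of reading the availability zone that way, with no AWS credentials involved. As far as I know the value lives at the placement/availability-zone path on the link-local metadata address. The function names are mine, the sketch assumes a plain unauthenticated GET is allowed on the instance, and region_of() assumes the zone name is just the region plus trailing letters.

```python
from urllib.request import urlopen

# Link-local instance metadata address, reachable only from inside EC2.
METADATA_URL = (
    "http://169.254.169.254/latest/meta-data/placement/availability-zone"
)


def fetch_text(url: str, timeout: float = 2.0) -> str:
    """Plain HTTP GET returning the stripped response body."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def availability_zone(fetch=fetch_text) -> str:
    """Return this instance's AZ, e.g. "us-east-1a".

    `fetch` is injectable so the lookup can be exercised off EC2.
    """
    return fetch(METADATA_URL)


def region_of(az: str) -> str:
    """Assumption: the AZ is the region name plus trailing zone letters,
    so "us-east-1a" becomes "us-east-1"."""
    return az.rstrip("abcdefghijklmnopqrstuvwxyz")
```

This only answers "which AZ am I in?" for the local node. Something AZ-aware, like the rack-aware strategy Joe mentions, would still be needed to make placement decisions with it.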