We have been facing a lot of slowdown in production whenever a shard server is added or removed...
Shards that were served locally via short-circuit reads suddenly become fully remote, and at scale this melts down. The block cache is a reactive cache and takes a long time to settle down (at least for us!). I have been thinking about handling this locality issue for some time now:

1. For every shard, Blur can map a primary server and a secondary server in ZooKeeper (rough sketch below).
2. File writes can use Hadoop's favored-nodes hint and write to both of these servers [https://issues.apache.org/jira/browse/HDFS-2576] (sketch below).
3. When a machine goes down, instead of randomly assigning its shards to different shard servers, Blur can allocate them to the designated secondary servers, which already hold the data locally.

Adding a new machine is a different problem: it will immediately start serving shards from remote machines, while it really needs local copies of all the primary shards it is supposed to serve. Hadoop has something called BlockPlacementPolicy that can be hooked into [http://hadoopblog.blogspot.in/2009/09/hdfs-block-replica-placement-in-your.html].

When booting a new machine, we can increase the replication factor from 3 to 4 for the shards the new machine will host (setrep from the hdfs console). Hadoop will then call our CustomBlockPlacementPolicy class to place the extra replica, and that is where we can sneak in the new machine's IP. Once all shards to be hosted by the new machine are replicated, we close those shards, update the mappings in ZK and reopen them, and the data will be served locally. Similarly, when restoring the replication factor from 4 to 3, the CustomBlockPlacementPolicy can consult ZK, figure out which node's replica to delete, and proceed.

Rough sketches of each of these pieces follow. Do let me know your thoughts on this...
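For (1), a minimal sketch of what the ZooKeeper mapping could look like, using the plain ZooKeeper client. The znode layout (/blur/shard-layout/<table>/<shard>) and the ShardLayout class name are made up for illustration; whatever layout Blur already uses for shard assignment would be the natural home for this.

import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch: store a primary/secondary shard-server pair per shard
// under a znode such as /blur/shard-layout/<table>/<shard> (parents assumed to exist).
public class ShardLayout {

  private final ZooKeeper zk;

  public ShardLayout(ZooKeeper zk) {
    this.zk = zk;
  }

  // Record that this shard should live on 'primary', with 'secondary' as the failover target.
  public void assign(String table, String shard, String primary, String secondary)
      throws KeeperException, InterruptedException {
    String path = "/blur/shard-layout/" + table + "/" + shard;
    byte[] data = (primary + "," + secondary).getBytes(StandardCharsets.UTF_8);
    if (zk.exists(path, false) == null) {
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } else {
      zk.setData(path, data, -1); // -1 = ignore version
    }
  }

  // Look up the primary/secondary pair, e.g. when a shard server dies and its shards must move.
  public String[] lookup(String table, String shard)
      throws KeeperException, InterruptedException {
    String path = "/blur/shard-layout/" + table + "/" + shard;
    return new String(zk.getData(path, false, null), StandardCharsets.UTF_8).split(",");
  }
}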
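For (2), HDFS-2576 surfaces the favored-nodes hint through a create() overload on DistributedFileSystem (Hadoop 2.x; the exact overload varies between releases, so take the signature below as approximate). The hostnames are assumed to come from the ZK mapping above, and the namenode treats the hint as best-effort, not a guarantee.

import java.io.IOException;
import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class FavoredNodeWriter {

  // Write a shard file and ask HDFS to place replicas on the shard's primary and
  // secondary servers (HDFS-2576).
  public static FSDataOutputStream createOnFavoredNodes(Configuration conf,
      Path file, String primaryHost, String secondaryHost) throws IOException {
    FileSystem fs = file.getFileSystem(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      // Local or other filesystems have no notion of favored nodes.
      return fs.create(file);
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    InetSocketAddress[] favoredNodes = new InetSocketAddress[] {
        new InetSocketAddress(primaryHost, 50010),   // default datanode transfer port
        new InetSocketAddress(secondaryHost, 50010)
    };
    // Overload added for HDFS-2576; the parameter list may differ by Hadoop version.
    return dfs.create(file, FsPermission.getFileDefault(), true /* overwrite */,
        conf.getInt("io.file.buffer.size", 4096), fs.getDefaultReplication(file),
        fs.getDefaultBlockSize(file), null /* progress */, favoredNodes);
  }
}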
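For the new-machine flow, the placement policy runs inside the namenode and its API changes between Hadoop releases, so the following is only a skeleton of the idea, not a working policy: register the class via dfs.block.replicator.classname in hdfs-site.xml, fall through to the stock rack-aware policy normally, and steer the extra replica to the incoming server when ZK says a shard move is in flight. The chooseTarget signature shown roughly follows Hadoop 2.6 and will need adjusting per release; the class name and ZK lookup are hypothetical.

import java.util.List;
import java.util.Set;

import org.apache.hadoop.hdfs.protocol.BlockStoragePolicy;
import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault;
import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo;
import org.apache.hadoop.net.Node;

// Registered on the namenode via hdfs-site.xml:
//   dfs.block.replicator.classname = org.example.blur.CustomBlockPlacementPolicy
public class CustomBlockPlacementPolicy extends BlockPlacementPolicyDefault {

  @Override
  public DatanodeStorageInfo[] chooseTarget(String srcPath, int numOfReplicas,
      Node writer, List<DatanodeStorageInfo> chosen, boolean returnChosenNodes,
      Set<Node> excludedNodes, long blocksize, BlockStoragePolicy storagePolicy) {
    String incoming = incomingServerFor(srcPath);
    if (incoming != null) {
      // A shard move is in flight: the real implementation would resolve 'incoming'
      // to its datanode (namenode-internal and version-specific) and return that
      // storage as the target for the extra replica. Left out of this sketch.
    }
    // Otherwise: stock rack-aware placement.
    return super.chooseTarget(srcPath, numOfReplicas, writer, chosen,
        returnChosenNodes, excludedNodes, blocksize, storagePolicy);
  }

  // Stub: would read the shard-layout znodes (first sketch) to check whether a new
  // server has been nominated for the shard that owns srcPath. The same lookup can
  // back chooseReplicaToDelete() when replication is dropped back from 4 to 3.
  private String incomingServerFor(String srcPath) {
    return null;
  }
}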
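And the bring-up flow around it, using only public FileSystem APIs (the Java equivalent of `hdfs dfs -setrep 4 <path>` plus a wait loop). The 5-second poll interval is arbitrary, and listStatus here is non-recursive, so it assumes the index files sit directly under the shard directory.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShardMigration {

  // Step 1: bump replication (e.g. 3 -> 4) for every file in the shard directory.
  public static void bumpReplication(Configuration conf, Path shardDir, short target)
      throws IOException {
    FileSystem fs = shardDir.getFileSystem(conf);
    for (FileStatus file : fs.listStatus(shardDir)) {
      if (!file.isDirectory()) {
        fs.setReplication(file.getPath(), target);
      }
    }
  }

  // Step 2: poll until every block of every file has at least 'target' replicas;
  // replication is asynchronous, so this just watches it catch up.
  public static void waitForReplication(Configuration conf, Path shardDir, short target)
      throws IOException, InterruptedException {
    FileSystem fs = shardDir.getFileSystem(conf);
    while (true) {
      boolean done = true;
      for (FileStatus file : fs.listStatus(shardDir)) {
        if (file.isDirectory()) {
          continue;
        }
        for (BlockLocation loc : fs.getFileBlockLocations(file, 0, file.getLen())) {
          if (loc.getHosts().length < target) {
            done = false;
          }
        }
      }
      if (done) {
        return;
      }
      Thread.sleep(5000);
    }
  }
}

Once waitForReplication() returns (and, assuming the placement policy did its job, the 4th copy sits on the new server), the shard can be closed, the ZK mapping flipped to the new server, the shard reopened there, and replication restored to 3.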
