From: [email protected]
Date: Wed, 25 Jan 2012 06:48:45 -0800
Subject: Re: Should Riak have used dedicated nodes for secondary indices?
To: [email protected]
CC: [email protected]

Good news! Riak doesn't use sharding.

Data locality is critical in a distributed system. When you create an index, 
your structure looks something like:

indexed_value:record_id

Reading from an index requires locating indexed_value, finding all matching 
values, and then retrieving all matching record_ids. By keeping index data on 
the same node as the source data, Riak avoids having to remote the query to 
retrieve object data. This is a Good Thing. The network is slow and unreliable. 
Just ask an Australian.



Riak's approach is intended to provide a uniform system where you can treat any 
node equally. The idea that there should be an unsharded index node is a bit 
ludicrous. Let's say you have 1TB of raw data. Your indexes are pretty light 
and are only about 20% of your data size. This means that you need 200GB of 
good storage (not some cheap $150 SATA HDD you found on NewEgg). 200GB of RAID 
10 SAS storage isn't that pricey to put in a single unsharded machine. Over 
time as your data grows and your indexing changes, you may have 10TB and your 
index size is ~40% of your data. Your unsharded index server now has to have 
4TB of fast, reliable storage. And, since this is an unsharded system, you'll 
want multiple replicas of your unsharded index server to make sure that a 
hardware hiccup doesn't take down your ability to perform fast lookups. Besides 
- a single indexing server becomes a single bottleneck and a single point of 
failure in your system.



Most people using Lucene as their indexing store are sharding Lucene. From an 
anecdotal standpoint, about 70% of the people I've talked to using Lucene are 
getting to the point of sharding their replicated Lucene indexes.



I'm not saying that either approach is good or bad; just remember that every 
solution has drawbacks.---
Jeremiah Peschka, SQL Server MVP
Managing Director, Brent Ozar PLF, LLC





On Wed, Jan 25, 2012 at 5:15 AM, Runar Jordahl <[email protected]> wrote:


Siddharth Anand, says that secondary indices (for a key-value store)

best is placed on a separate node, avoiding the need to look up 1 / N

nodes during a query:



"Systems that shard data based on a primary key will do well when

routed by that key. When routed by a secondary key, the system will

need to “spray” a query across all shards. If one of the shards is

experiencing high latency, the system will return either no results or

incomplete (i.e. inconsistent) results. For this reason, it would make

sense to store the secondary index on an unsharded (but replicated)

system."



http://highscalability.com/blog/2012/1/24/the-state-of-nosql-in-2012.html



If I understand Riak correctly, it takes the opposite approach,

storing secondary indices together with the data.



To me at appears like Riak’s approach gives a more uniform system,

with all nodes having the same responsibilities. Does anyone else have

any thoughts on this?



Kind regards

Runar Jordahl

blog.epigent.com



_______________________________________________

riak-users mailing list

[email protected]

http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com




_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com              
                          
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Reply via email to