Hi Dibyendu,

First off, let me say welcome. Sorry for my delayed response; I have been digging out from a large snow storm. :-) I will attempt to answer your questions below.
On Thu, Feb 13, 2014 at 9:04 AM, Dibyendu Bhattacharya <[email protected]> wrote:

> Hi,
>
> We are presently evaluating Blur for our distributed search platform. I
> have few questions on Blur architecture which I need to clarify.
>
> 1. During Table creation, we mention the Shard count. Is this count be
> changed at later point of time, or this is fixed ?

At this point, yes, they are fixed. Lucene can operate very efficiently from small to fairly large indexes, and each shard can easily operate into the millions of records. Depending on what you are indexing and how many fields you are indexing, I would say that narrow records with small text fields or numeric types should be able to reach 40-60 million easily, perhaps more. There is a backlog issue to allow for splitting shards at runtime; we haven't tackled that yet though.

> 2. As the shards are self sufficient indexes, is there any Routing support
> available for keys to a specific shards during indexing and during search ?

Sort of. A standard query will spray all the shards of a table with the inbound query. Fetches (data lookups) are routed to the specific shard. Also, if you are performing a record-level search, there is a feature to route that search to a specific row; since records exist within groups called rows, this provides a search within a single row. Shard servers also present the same API as the controllers, though this is normally used for debugging.

> 3. Do you have any writeup available how Lucene Segments got merged in HDFS
> and how the Block Cache gets updated after segments merge.

I am working on finishing the 0.2.2 release and will hopefully have more descriptions in the documentation. However, I will attempt to describe it here. There are two ways for data to enter Blur.

The first is via Thrift mutates. The easy way to answer the block cache question is that certain files are written into the block cache via a write-through pattern (configurable). There are a few areas in Blur where we cache pieces of the index, but the block cache is by far the largest. As for what we do during merges, Blur has a centralized thread pool per shard server for merging segments across any of the shards (indexes) that it is currently serving. Because of this separation we can configure whether merges are written into the block cache during the merge or not. By default, the merge output is written into the block cache for every file type except the fdt file (the field store file). Also, Blur commits the index after every mutate and gets a refreshed reader, so the client can always read their own writes (there's no delay). A lot of work was done to make the commit time as low as possible so that real-time update performance would not suffer. The biggest component here is a key/value store written to store data in HDFS; we use it inside a custom Lucene directory that gives us much better performance with small, rapidly changing files than the standard HDFS directory.

The second way to get data into Blur is via MapReduce updates. After the given MapReduce job runs, the segments are moved into their respective locations in the shard, then an index reader refresh occurs and the data becomes visible. After that, normal cache behaviors take over.
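Going back to questions 1 and 2 for a moment, here is roughly what creating a table with a fixed shard count and then fetching a row (a routed lookup) looks like through the Thrift API. I'm writing this from memory against 0.2.2, so treat the class and package names (BlurClient, TableDescriptor, Selector), the connection string, and the HDFS URI as a sketch and double-check them against the generated Thrift code:

import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur;
import org.apache.blur.thrift.generated.FetchResult;
import org.apache.blur.thrift.generated.Selector;
import org.apache.blur.thrift.generated.TableDescriptor;

public class BlurSketch {
  public static void main(String[] args) throws Exception {
    // Connect through the controllers (connection string is hypothetical).
    Blur.Iface client = BlurClient.getClient("controller1:40010,controller2:40010");

    // The shard count is set once here at table creation and is fixed afterwards.
    TableDescriptor td = new TableDescriptor();
    td.setName("test_table");
    td.setShardCount(16);
    td.setTableUri("hdfs://namenode/blur/tables/test_table");
    client.createTable(td);

    // A fetch (data lookup) by row id is routed to the one shard that owns the row,
    // unlike a standard query, which is sprayed to every shard of the table.
    Selector selector = new Selector();
    selector.setRowId("rowid-1");
    FetchResult result = client.fetchRow("test_table", selector);
    System.out.println(result);
  }
}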
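And just to make the merge/block-cache behavior a little more concrete, the decision is essentially a per-file-type write-through policy. This is only an illustrative sketch of the idea (the class and the configuration switch are made up, not Blur's actual code):

// Illustrative sketch only -- not Blur's actual code.
// During a merge, each newly written file either goes through the block cache
// (write-through) or bypasses it, decided per file type.
public class MergeCachePolicy {

  // Hypothetical switch standing in for the real configuration property.
  private final boolean cacheMergeOutput;

  public MergeCachePolicy(boolean cacheMergeOutput) {
    this.cacheMergeOutput = cacheMergeOutput;
  }

  // By default everything except the .fdt file (the field store) is written
  // through to the block cache as the merge produces it.
  public boolean writeThroughDuringMerge(String fileName) {
    return cacheMergeOutput && !fileName.endsWith(".fdt");
  }
}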
> 4. How Blur handles fail-over ? If a Shard Server dies, how replica shards
> took over ? I understand Zookeeper comes here, but do you have some
> writeup on this ?

ZooKeeper is used to know whether a shard server is alive through the use of ephemeral nodes. When the ZooKeeper session of the shard server that died expires (or is closed), ZooKeeper notifies the shard cluster of the change. When that happens, the first shard server in the cluster to react calculates a new layout for the shards that are currently offline; this calculation minimizes shard movement and ensures level load across the shard cluster. After the calculation is made, the shard servers that need to open the missing shards do so as quickly as possible. Since all the data is stored in HDFS, no replication is needed from Blur's perspective. However, if the server that failed was also running an HDFS datanode, HDFS will need to replicate the missing data, but that usually occurs several minutes after Blur reacts.

> 4. Do you have any benchmark for Blur MTTR (Mean time to recover) ?

Not yet. There are two stages to a recovery; read and write are separated. In most systems the MTTR for reads (searching) is usually a few seconds (5-10 seconds) depending on index size and the number of shards and shard servers. The MTTR for writes can take longer, possibly 30 seconds in a worst case for very large indexes, and while the writer is opening, all write operations block on that shard. As with any project we strive to improve these numbers, but the main focus of Blur is the read side because that affects end users performing searches.

> 5. From my initial understanding, Blur has similar issue like HBase on Data
> Locality on fail over. If all files for a given Shards are replicated all
> over HDFS, during fail over , it will loose the data locality. Is that
> correct ?

Yes, it will. However, if you are performing updates via MapReduce you will be losing HDFS locality anyway, and if you are using real-time updates via the Thrift API, the rapid merging that Lucene performs will likely begin to bring that data local as it's rewritten during the merges. To date I haven't seen it be a problem, but I know people want to get as much performance out of their systems as possible.

I hope this helps answer your questions. Let me know if you need anything else. Also, if you are going to do an evaluation of Blur, I strongly recommend the 0.2.2 version that's in git. If you need help getting started with that, please let me know.

Aaron

> Regards,
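P.S. If it helps, here is the general shape of the ephemeral-node pattern I described for question 4, written directly against the plain ZooKeeper API. The znode paths are made up for the example and this is not Blur's actual failover code, just the mechanism it relies on:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralLivenessSketch {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> {});

    // Each shard server registers itself with an ephemeral node; the node
    // disappears automatically when the server's session expires or closes.
    // (The parent path must already exist; the path layout here is made up.)
    zk.create("/blur/online-shard-servers/shard-server-1", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // Every server watches the list of online servers. When a node vanishes,
    // a NodeChildrenChanged event fires and the first server to react can
    // calculate the new layout for the shards that just went offline.
    List<String> online = zk.getChildren("/blur/online-shard-servers", event -> {
      if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
        // recalculate layout for the missing shards here
      }
    });
    System.out.println("Online shard servers: " + online);
  }
}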
