Hi Dibyendu,

First off, let me say welcome. Sorry for my delayed response; I have been digging out from a large snow storm. :-) I will attempt to answer your questions below.
On Thu, Feb 13, 2014 at 9:04 AM, Dibyendu Bhattacharya <[email protected]> wrote:

> Hi,
>
> We are presently evaluating Blur for our distributed search platform. I
> have few questions on Blur architecture which I need to clarify.
>
> 1. During Table creation, we mention the Shard count. Is this count be
> changed at later point of time, or this is fixed ?

At this point, yes, they are fixed. Lucene can operate very efficiently from small to fairly large indexes, and each shard can easily operate into the millions of records. Depending on what you are indexing and how many fields you are indexing, I would say that narrow records with small text fields or numeric types should be able to reach 40-60 million easily, perhaps more. There is a backlog issue to allow for splitting shards at runtime; we haven't tackled that yet though.

> 2. As the shards are self sufficient indexes, is there any Routing support
> available for keys to a specific shards during indexing and during search ?

Sort of. A standard query will spray all the shards of a table with the inbound query. Fetches (data lookups) are routed to the specific shard. Also, if you are performing a record-level search, there is a feature to route that search to a specific row; since records exist within groups called rows, this provides a search within a single row. Shard servers also present the same API as the controllers, though this is normally used for debugging.

> 3. Do you have any writeup available how Lucene Segments got merged in HDFS
> and how the Block Cache gets updated after segments merge.

I am working on finishing the 0.2.2 release and will hopefully have more descriptions in the documentation. However, I will attempt to describe it here. There are two ways for data to enter Blur.

The first is via Thrift mutates. The easy way to answer the block cache question is that certain files are written into the block cache via a write-through pattern (configurable). There are a few areas in Blur where we cache pieces of the index, but the block cache is by far the largest. As for what we do during merges, Blur has a centralized thread pool per shard server for merging segments across any of the shards (indexes) that it is currently serving. Because of this separation we can configure whether merges are written into the block cache during the merge or not. By default, the merge output is written into the block cache for every file type except the fdt file (the field store file). Also, Blur commits the index after every mutate and gets a refreshed reader, so the client can always read their own writes (there's no delay). A lot of work was done to make the commit time as low as possible so that real-time update performance would not suffer. The biggest component here is a key/value store written to store data in HDFS; we use it inside a custom Lucene directory that gives us much better performance with small, rapidly changing files than the standard HDFS directory.

The second way to get data into Blur is via MapReduce updates. After the given MapReduce job runs, the segments are moved into their respective locations in the shard, then an index reader refresh occurs and the data becomes visible. After that, normal cache behaviors take over.
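Going back to questions 1 and 2 for a moment, here is roughly what creating a table with a fixed shard count and then fetching a row (a routed lookup) looks like through the Thrift API. I'm writing this from memory against 0.2.2, so treat the class and package names (BlurClient, TableDescriptor, Selector), the connection string, and the HDFS URI as a sketch and double-check them against the generated Thrift code:

import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur;
import org.apache.blur.thrift.generated.FetchResult;
import org.apache.blur.thrift.generated.Selector;
import org.apache.blur.thrift.generated.TableDescriptor;

public class BlurSketch {
  public static void main(String[] args) throws Exception {
    // Connect through the controllers (connection string is hypothetical).
    Blur.Iface client = BlurClient.getClient("controller1:40010,controller2:40010");

    // The shard count is set once here at table creation and is fixed afterwards.
    TableDescriptor td = new TableDescriptor();
    td.setName("test_table");
    td.setShardCount(16);
    td.setTableUri("hdfs://namenode/blur/tables/test_table");
    client.createTable(td);

    // A fetch (data lookup) by row id is routed to the one shard that owns the row,
    // unlike a standard query, which is sprayed to every shard of the table.
    Selector selector = new Selector();
    selector.setRowId("rowid-1");
    FetchResult result = client.fetchRow("test_table", selector);
    System.out.println(result);
  }
}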
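And just to make the merge/block-cache behavior a little more concrete, the decision is essentially a per-file-type write-through policy. This is only an illustrative sketch of the idea (the class and the configuration switch are made up, not Blur's actual code):

// Illustrative sketch only -- not Blur's actual code.
// During a merge, each newly written file either goes through the block cache
// (write-through) or bypasses it, decided per file type.
public class MergeCachePolicy {

  // Hypothetical switch standing in for the real configuration property.
  private final boolean cacheMergeOutput;

  public MergeCachePolicy(boolean cacheMergeOutput) {
    this.cacheMergeOutput = cacheMergeOutput;
  }

  // By default everything except the .fdt file (the field store) is written
  // through to the block cache as the merge produces it.
  public boolean writeThroughDuringMerge(String fileName) {
    return cacheMergeOutput && !fileName.endsWith(".fdt");
  }
}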
> 4. How Blur handles fail-over ? If a Shard Server dies, how replica shards
> took over ? I understand Zookeeper comes here, but do you have some
> writeup on this ?

ZooKeeper is used to know whether a shard server is alive through the use of ephemeral nodes. When the ZooKeeper session of the shard server that died expires (or is closed), ZooKeeper notifies the shard cluster of the change. When that happens, the first shard server in the cluster to react calculates a new layout for the shards that are currently offline; this calculation minimizes shard movement and ensures level load across the shard cluster. After the calculation is made, the shard servers that need to open the missing shards do so as quickly as possible. Since all the data is stored in HDFS, no replication is needed from Blur's perspective. However, if the server that failed was also running an HDFS datanode, HDFS will need to replicate the missing data, but that usually occurs several minutes after Blur reacts.

> 4. Do you have any benchmark for Blur MTTR (Mean time to recover) ?

Not yet. There are two stages to a recovery; read and write are separated. In most systems the MTTR for reads (searching) is usually a few seconds (5-10 seconds) depending on index size and the number of shards and shard servers. The MTTR for writes can take longer, possibly 30 seconds in a worst case for very large indexes, and while the writer is opening, all write operations block on that shard. As with any project we strive to improve these numbers, but the main focus of Blur is the read side because that affects end users performing searches.

> 5. From my initial understanding, Blur has similar issue like HBase on Data
> Locality on fail over. If all files for a given Shards are replicated all
> over HDFS, during fail over , it will loose the data locality. Is that
> correct ?

Yes, it will. However, if you are performing updates via MapReduce you will be losing HDFS locality anyway, and if you are using real-time updates via the Thrift API, the rapid merging that Lucene performs will likely begin to bring that data local as it's rewritten during the merges. To date I haven't seen it be a problem, but I know people want to get as much performance out of their systems as possible.

I hope this helps answer your questions. Let me know if you need anything else. Also, if you are going to do an evaluation of Blur, I strongly recommend the 0.2.2 version that's in git. If you need help getting started with that, please let me know.

Aaron

> Regards,
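P.S. If it helps, here is the general shape of the ephemeral-node pattern I described for question 4, written directly against the plain ZooKeeper API. The znode paths are made up for the example and this is not Blur's actual failover code, just the mechanism it relies on:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralLivenessSketch {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> {});

    // Each shard server registers itself with an ephemeral node; the node
    // disappears automatically when the server's session expires or closes.
    // (The parent path must already exist; the path layout here is made up.)
    zk.create("/blur/online-shard-servers/shard-server-1", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // Every server watches the list of online servers. When a node vanishes,
    // a NodeChildrenChanged event fires and the first server to react can
    // calculate the new layout for the shards that just went offline.
    List<String> online = zk.getChildren("/blur/online-shard-servers", event -> {
      if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
        // recalculate layout for the missing shards here
      }
    });
    System.out.println("Online shard servers: " + online);
  }
}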
