Re: Contrast of Blur to ElasticSearch, Solr

Aaron McCurry Sun, 08 Dec 2013 06:27:34 -0800

Hi James,

Thanks for your interest and questions. I will attempt to answer them
below.

On Sat, Dec 7, 2013 at 8:47 AM, James Kebinger <[email protected]> wrote:

> Hi Aaron, I'm wondering if you can talk a little about how you see Blur
> differentiating itself from ElasticSearch and Solr. It seems like both of
> them, in particular Solr after picking up some Blur code, are gaining more
> abilities to interact with Hadoop and HDFS.
>

Unfortunately I'm not an expert in Solr or ElasticSearch, but I can tell you
about Blur's high-level features when it comes to how it interacts with
Hadoop:

- Index storage (the obvious one).
- Bulk offline indexing, with incremental updates. This gives you the
  ability to perform indexing on a dedicated MapReduce cluster and simply
  move the index updates to the running Blur cluster for importing.
- The WAL (write-ahead log) is written to use HDFS.
- We are also currently moving most of the metadata from ZooKeeper storage
  to HDFS storage. This makes interacting with the metadata of a table easy
  to do from within MapReduce jobs.

> How does a Blur install differ from a Solr setup reading off HDFS?
>

Again, I'm not an expert in Solr. Blur's setup runs a cluster of shard
servers that serve shards (indexes) of the table within that shard cluster.
The indexes are stored once in HDFS (not counting the HDFS replication
here) and evenly distributed across whatever shard servers are online.
Blur utilizes a BlockCache (think file system cache) that is an off-heap
system. The first version of this was originally picked up by
Cloudera, modified (I'm assuming), and committed back into the
Lucene/Solr code base. The second version of this block cache (Blur 0.2.2
stable) is now the default in Blur. It has several advantages over the first
version:

http://mail-archives.apache.org/mod_mbox/incubator-blur-dev/201310.mbox/%3CCAB6tTr0Nr2aDLc4kkHoeqiO-utwzBAhb=Ru==gmhqry4axp...@mail.gmail.com%3E
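
To illustrate the general block cache idea described above, here is a
minimal sketch in Python. It is not Blur's implementation: the class name,
the 8 KB block size, the LRU eviction policy, and the read_block callback
(standing in for a read from HDFS) are all assumptions made for
illustration, and it does not model the off-heap memory management that
the real cache relies on in the JVM.

```python
from collections import OrderedDict

BLOCK_SIZE = 8192  # assumed block size, chosen only for illustration


class BlockCache:
    """Tiny LRU cache of fixed-size file blocks keyed by (path, block index)."""

    def __init__(self, capacity_blocks, read_block):
        self.capacity = capacity_blocks
        # read_block(path, index) -> bytes; hypothetical stand-in for an HDFS read
        self.read_block = read_block
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, path, index):
        key = (path, index)
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.blocks[key]
        self.misses += 1
        data = self.read_block(path, index)
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return data

    def read(self, path, offset, length):
        """Return bytes [offset, offset + length) by stitching cached blocks."""
        out = bytearray()
        end = offset + length
        while offset < end:
            index, start = divmod(offset, BLOCK_SIZE)
            block = self.get(path, index)
            take = min(end - offset, BLOCK_SIZE - start)
            chunk = block[start:start + take]
            if not chunk:
                break  # reached the end of the file
            out += chunk
            offset += len(chunk)
        return bytes(out)
```

Repeated reads of the same region of an index file are then served from
memory instead of going back to the underlying storage, which is the
effect a file system cache would normally provide for local disks.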

One interesting feature of Blur is the ability to run a cluster of
controllers (controllers are used to make the shard cluster look like a
single service) in front of multiple shard clusters. This can help to deal
with reindexing of data, meaning that you can reindex all your data into a
new cluster and not affect the performance of the cluster that your users may
be interacting with.
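
As a rough illustration of the controller idea, the following Python
sketch fans a query out to every shard of one cluster and merges the
partial results, so callers see a single service. Everything here is
hypothetical and is not Blur's API: the Controller class, the
term-frequency scoring, the in-memory dict shards, and in particular the
switch_to step, which assumes that queries are pointed at the new cluster
once a reindex has finished.

```python
import heapq


def search_shard(shard, term, limit):
    """Score each document in one shard by term frequency; return the best hits."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(limit, hits)


class Controller:
    """Makes the shards of the active cluster look like one search service."""

    def __init__(self, clusters, active):
        self.clusters = clusters  # cluster name -> list of shards (doc_id -> text)
        self.active = active      # name of the cluster that serves user queries

    def search(self, term, limit=10):
        # Scatter the query to every shard, then gather and merge the top hits.
        partials = (search_shard(s, term, limit) for s in self.clusters[self.active])
        return heapq.nlargest(limit, (hit for part in partials for hit in part))

    def switch_to(self, name):
        # Assumed cutover step: serve queries from another (e.g. reindexed) cluster.
        self.active = name
```

In this sketch a reindex would populate a second entry in clusters while
queries continue to hit the active one, and only the final switch_to call
changes what users see.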

Some of the overall features of Blur are:
- NRT updates of data
- Offline bulk indexing
- Block cache for fast query performance
- Index warmup (pulls parts of the index up into the block cache when a
  segment is brought online)
- Performance metrics gathering
- Distributed tracing
- Custom index types
- Custom server-side logic can be implemented (basic)

I'm sure there are many more.

Hope this helps.

Aaron

>
> thanks
>
> James
>
