Hi James,

Thanks for your interest. I will attempt to answer your questions below.
On Sat, Dec 7, 2013 at 8:47 AM, James Kebinger <[email protected]> wrote:

> Hi Aaron, I'm wondering if you can talk a little about how you Blur
> differentiating itself from ElasticSearch and Solr. It seems like both of
> them, in particular Solr after picking up some Blur code, are gaining more
> abilities to interact with hadoop and HDFS.

Unfortunately, I'm not an expert in Solr or ElasticSearch, but I can tell you about Blur's high-level features in terms of how it interacts with Hadoop:

- Index storage (the obvious one).
- Bulk offline indexing, with incremental updates. This gives you the ability to perform indexing on a dedicated MapReduce cluster and simply move the index updates to the running Blur cluster for importing.
- The WAL (write-ahead log) is written to HDFS.
- We are also currently moving most of the metadata from ZooKeeper storage to HDFS storage. This makes it easy to interact with a table's metadata from within MapReduce jobs.

> How does a blur install differ from a solr setup reading off hdfs?

Again, I'm not an expert in Solr. Blur's setup runs a cluster of shard servers that serve shards (indexes) of the table within that shard cluster. The indexes are stored once in HDFS (not counting the HDFS replication here) and are evenly distributed across whatever shard servers are online.

Blur utilizes a BlockCache (think file system cache) that is an off-heap based system. The first version of this was originally picked up by Cloudera, modified (I'm assuming), and committed back into the Lucene/Solr code base. The second version of this block cache (Blur 0.2.2 stable) is now the default in Blur.
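To make the "think file system cache" idea concrete, here is a minimal sketch in Python. It is not Blur's implementation: Blur is written in Java and keeps its cache off-heap, while this example only imitates that with a single preallocated buffer. The class name `BlockCache`, the `read_block` callback, and the block and slot sizes are all assumptions made up for the illustration.

```python
from collections import OrderedDict


class BlockCache:
    """Toy LRU cache of fixed-size file blocks backed by one preallocated slab."""

    def __init__(self, block_size, num_slots):
        self.block_size = block_size
        # One buffer allocated up front, standing in for off-heap memory.
        self.slab = bytearray(block_size * num_slots)
        self.free_slots = list(range(num_slots))
        # (file_name, block_id) -> (slot, valid_length), least recently used first.
        self.index = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _load(self, key, read_block):
        # Fetch from the backing store (HDFS in Blur's case) before touching the cache.
        data = read_block(*key)[: self.block_size]
        if self.free_slots:
            slot = self.free_slots.pop()
        else:
            # Cache is full: evict the least recently used block and reuse its slot.
            _, (slot, _) = self.index.popitem(last=False)
        start = slot * self.block_size
        self.slab[start:start + len(data)] = data
        self.index[key] = (slot, len(data))
        return slot, len(data)

    def read(self, file_name, block_id, read_block):
        key = (file_name, block_id)
        if key in self.index:
            self.hits += 1
            self.index.move_to_end(key)  # mark as most recently used
            slot, length = self.index[key]
        else:
            self.misses += 1
            slot, length = self._load(key, read_block)
        start = slot * self.block_size
        return bytes(self.slab[start:start + length])


def fake_hdfs_read(file_name, block_id):
    # Hypothetical stand-in for reading one block of an index file from HDFS.
    return f"{file_name}:{block_id}".encode()


cache = BlockCache(block_size=16, num_slots=2)
cache.read("segment_0", 0, fake_hdfs_read)  # miss: loaded from the backing store
cache.read("segment_0", 0, fake_hdfs_read)  # hit: served from the slab
```

Because the buffer is allocated once and its slots are reused on eviction, the cache's memory footprint stays fixed no matter how many blocks pass through it, which is the main point of keeping such a cache out of the garbage-collected heap.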
The second version has several advantages over the first:

http://mail-archives.apache.org/mod_mbox/incubator-blur-dev/201310.mbox/%3CCAB6tTr0Nr2aDLc4kkHoeqiO-utwzBAhb=Ru==gmhqry4axp...@mail.gmail.com%3E

One interesting feature of Blur is the ability to run a cluster of controllers (controllers are used to make the shard cluster look like a single service) in front of multiple shard clusters. This can help when reindexing data: you can reindex all of your data into a new cluster without affecting the performance of the cluster that your users may be interacting with.

Some of the overall features of Blur are:

- NRT updates of data
- Offline bulk indexing
- Block cache for fast query performance
- Index warmup (pulls parts of the index up into the block cache when a segment is brought online)
- Performance metrics gathering
- Distributed tracing
- Custom index types
- Custom server-side logic can be implemented (basic)

I'm sure there are many more. Hope this helps.

Aaron

> thanks
>
> James
