(Re)Introducing Culvert - A secondary indexing framework for BigTable like systems

Jesse Yates Thu, 22 Dec 2011 11:45:08 -0800

Culvert was originally introduced at Hadoop Summit 2011, but recent updates
have made it very applicable to current systems. Recently, we added support
for Accumulo as well as upgraded HBase support to 0.92. Since Hadoop
Summit, there have also been significant code cleanup and added some small
features. However, we found that most people hadn't heard of Culvert, so we
wanted to re-release the framework.

For an introduction to using Culvert, check out the blog post here:
http://jyates.github.com/2011/11/17/intro-to-culvert.html

Also, the original presentation (where we discuss the internals) is
available on
slideshare<http://www.slideshare.net/jesse_yates/culvert-a-robust-framework-for-secondary-indexing-of-structured-and-unstructured-data>
.

There is a Culvert hackathon in the middle of January:
http://culverthackathon2012.eventbrite.com/

Oh, and you can find the code on
github<https://github.com/booz-allen-hamilton/culvert>
.

Below is an overview of why we wrote Culvert and what it does.

Secondary indexing is a common design pattern in BigTable-like databases
that allows users to index one or more columns in a table. This technique
enables fast search of records in a database based on a particular column
instead of the row id, thus enabling relational-style semantics in a NoSQL
environment. Frequently, the index is stored either in a reserved namespace
in the table or another index table.

Despite the fact that this is a common design pattern in BigTable-based
applications, most implementations of this practice to date have been
tightly coupled with a particular application. As a result, few
general-purpose frameworks for secondary indexing on BigTable-like
databases exist, and those that do are tied to a particular implementation
of the BigTable model.

There are several existing tools (Solr, Lily), but these are focused on
doing text based search and are highly restrictive to indexes created
through their framework. What if you want to use your existing indexes? Or
leverage the indexes to do complex queries?

We developed a solution to this problem called Culvert that supports online
index updates as well as a variation of the HIVE query language. In
designing Culvert, we sought to make the solution pluggable so that it can
be used on any of the many BigTable-like databases (HBase, Cassandra,
etc.). Furthermore, it is also easily extensible to existing, hand rolled
indexes.

As well as being a secondary indexing framework, it is also a query
execution mechanism - think pig/hive minus the fancy command line. We
support a subset of SQL, but are able to take full advantage of home-rolled
and built-in indexes, leading to query execution times potentially orders
of magnitude smaller than existing approaches and certainly orders of
magnitude more easily.

-- Jesse
-------------------
Jesse Yates
240-888-2200
@jesse_yates

(Re)Introducing Culvert - A secondary indexing framework for BigTable like systems

Reply via email to