Culvert was originally introduced at Hadoop Summit 2011, but recent updates have made it very applicable to current systems. We recently added support for Accumulo and upgraded HBase support to 0.92. Since Hadoop Summit, we have also done significant code cleanup and added some small features. However, we found that most people hadn't heard of Culvert, so we wanted to re-release the framework.

For an introduction to using Culvert, check out the blog post here: http://jyates.github.com/2011/11/17/intro-to-culvert.html

Also, the original presentation (where we discuss the internals) is available on SlideShare: http://www.slideshare.net/jesse_yates/culvert-a-robust-framework-for-secondary-indexing-of-structured-and-unstructured-data

There is a Culvert hackathon in the middle of January: http://culverthackathon2012.eventbrite.com/

Oh, and you can find the code on GitHub: https://github.com/booz-allen-hamilton/culvert

Below is an overview of why we wrote Culvert and what it does.

Secondary indexing is a common design pattern in BigTable-like databases that allows users to index one or more columns in a table. This technique enables fast search of records in a database based on a particular column instead of the row id, thus enabling relational-style semantics in a NoSQL environment. Frequently, the index is stored either in a reserved namespace in the table or in a separate index table.

Although this is a common design pattern in BigTable-based applications, most implementations to date have been tightly coupled to a particular application. As a result, few general-purpose frameworks for secondary indexing on BigTable-like databases exist, and those that do are tied to a particular implementation of the BigTable model. There are several existing tools (Solr, Lily), but they focus on text-based search and are restricted to indexes created through their own framework. What if you want to use your existing indexes? Or leverage the indexes to do complex queries?

We developed a solution to this problem called Culvert that supports online index updates as well as a variation of the Hive query language. In designing Culvert, we sought to make the solution pluggable so that it can be used on any of the many BigTable-like databases (HBase, Cassandra, etc.).
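
As a rough illustration of the secondary indexing pattern described above, here is a minimal in-memory sketch in Python. It is not Culvert's API and does not talk to HBase or Accumulo; the class IndexedTable and its put/lookup methods are made-up names for this example, and a sorted list of (value, row id) keys stands in for a separate index table.

```python
from bisect import bisect_left, insort


class IndexedTable:
    """Toy stand-in (hypothetical, not Culvert) for a table plus one index table."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}    # primary table: row id -> {column: value}
        self.index = []   # index table: sorted (value, row id) keys

    def put(self, row_id, columns):
        # Online index update: drop the stale index entry, then write both tables.
        old = self.rows.get(row_id, {}).get(self.indexed_column)
        if old is not None:
            self.index.remove((old, row_id))
        self.rows[row_id] = dict(columns)
        new = columns.get(self.indexed_column)
        if new is not None:
            insort(self.index, (new, row_id))

    def lookup(self, value):
        # Scan only the index keys whose first part equals `value`,
        # rather than scanning every row of the primary table.
        i = bisect_left(self.index, (value,))
        row_ids = []
        while i < len(self.index) and self.index[i][0] == value:
            row_ids.append(self.index[i][1])
            i += 1
        return row_ids


table = IndexedTable("city")
table.put("r1", {"name": "ann", "city": "oslo"})
table.put("r2", {"name": "bob", "city": "rome"})
table.put("r3", {"name": "cy", "city": "oslo"})
matches = table.lookup("oslo")
```

Here lookup("oslo") finds rows r1 and r3 by walking a short range of index keys. In a real BigTable-like store the same idea would be a range scan over the index table followed by gets on the primary table, and the index write in put would have to be coordinated with the primary write.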

Culvert is also easily extended to existing, hand-rolled indexes. As well as being a secondary indexing framework, it is a query execution mechanism: think Pig/Hive minus the fancy command line. We support a subset of SQL, but can take full advantage of hand-rolled and built-in indexes, which leads to query execution times potentially orders of magnitude shorter than existing approaches, and certainly with far less effort.

-- Jesse
-------------------
Jesse Yates
240-888-2200
@jesse_yates