Any reason why you aren't using Lucandra directly?

On Fri, May 7, 2010 at 8:21 PM, Tobias Jungen <tobias.jun...@gmail.com> wrote:
> Greetings,
>
> Started getting my feet wet with Cassandra in earnest this week. I'm building a custom inverted index of sorts on top of Cassandra, in part inspired by the work of Jake Luciani on Lucandra. I've successfully loaded nearly a million documents over a 3-node cluster, and initial query tests look promising.
>
> The problem is that our target use case has hundreds of millions of documents (each document is very small, however). Loading time will be an important factor. I've investigated using the BinaryMemtable interface (as found in contrib/bmt_example) to speed up bulk insertion. I have a prototype up that successfully inserts data using BMT, but there is a problem.
>
> If I perform multiple writes for the same row key & column family, the row ends up containing only one of the writes. I'm guessing this is because with BMT I need to group all writes for a given row key & column family into one operation, rather than doing it incrementally as is possible with the Thrift interface. Hadoop is obviously the solution for doing such a grouping. Unfortunately, we can't perform such a process over our entire dataset; we will need to do it in increments.
>
> So my question is: if I properly flush every node after performing a large bulk insert, can Cassandra merge multiple writes to a single row & column family when using the BMT interface? Or is using BMT only feasible for loading data into rows that don't exist yet?
>
> Thanks in advance,
> Toby Jungen
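The grouping step the quoted message describes (collecting every column write for a row into a single per-row operation before a BMT load) can be sketched in plain Java. This is a hypothetical illustration, not the actual BinaryMemtable API: the class name `RowGrouping` and the string-based row/column/value representation are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the pre-grouping step described above: before a
// BinaryMemtable load, all column writes destined for the same row key are
// merged into one batch, since a later BMT write for the same row replaces
// the earlier one rather than merging with it.
public class RowGrouping {

    // rowKey -> (columnName -> value); each inner map would become one
    // BMT write for that row.
    static Map<String, Map<String, String>> groupByRow(String[][] writes) {
        Map<String, Map<String, String>> batches = new LinkedHashMap<>();
        for (String[] w : writes) {
            String rowKey = w[0], column = w[1], value = w[2];
            batches.computeIfAbsent(rowKey, k -> new LinkedHashMap<>())
                   .put(column, value);
        }
        return batches;
    }

    public static void main(String[] args) {
        String[][] writes = {
            {"doc1", "term:cassandra", "1"},
            {"doc2", "term:index",     "1"},
            {"doc1", "term:lucandra",  "1"},  // same row as the first write
        };
        Map<String, Map<String, String>> batches = groupByRow(writes);
        System.out.println(batches.size());             // 2 distinct rows
        System.out.println(batches.get("doc1").size()); // 2 columns merged into one row
    }
}
```

In a Hadoop job this grouping falls out of the shuffle for free (row key as the map output key), which is why the quoted message points to Hadoop; the sketch above only shows what must be true of the data at submit time, not how BMT itself is invoked.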