[jira] Updated: (LUCENE-2456) A Column-Oriented Cassandra-Based Lucene Directory

Karthick Sankarachary (JIRA) Tue, 11 May 2010 16:50:04 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Karthick Sankarachary updated LUCENE-2456:
------------------------------------------

    Description: 
Herein, we describe a type of Lucene directory that stores its files in a 
Cassandra server, which makes for a scalable and robust store for Lucene 
indices.

In brief, the CassandraDirectory maps the concept of a Lucene directory to a 
column family that belongs to a certain keyspace located in a given Cassandra 
server. Further, it stores each file under this directory as a row in that 
column family.

Specifically, its files are broken down into blocks (whose sizes are capped), 
where each block (see FileBlock) is stored as the value of a column in the 
corresponding row. As per 
http://wiki.apache.org/cassandra/CassandraLimitations, this is the recommended 
approach for dealing with large objects, which Lucene files tend to be. In 
addition, a descriptor of the file (see FileDescriptor) that outlines a map of 
blocks therein is stored as one of the columns in that row as well. Think of 
this descriptor as an inode for Cassandra-based files.
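
To make the block layout concrete, here is a minimal sketch of capped-size
blocks plus a descriptor that maps file positions to blocks. It is not taken
from the attached patch: the class and method names (BlockSplitSketch, split,
describe, blockIndexFor) are hypothetical, and the real FileBlock and
FileDescriptor may well carry more information than a list of block lengths.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the FileBlock/FileDescriptor from the patch.
class BlockSplitSketch {

    // Split a file's bytes into blocks of at most blockSize bytes each.
    static List<byte[]> split(byte[] data, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<byte[]> blocks = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += blockSize) {
            int end = Math.min(data.length, offset + blockSize);
            blocks.add(Arrays.copyOfRange(data, offset, end));
        }
        return blocks;
    }

    // A toy "descriptor": the length of each block, in file order.
    static int[] describe(List<byte[]> blocks) {
        int[] lengths = new int[blocks.size()];
        for (int i = 0; i < lengths.length; i++) {
            lengths[i] = blocks.get(i).length;
        }
        return lengths;
    }

    // Find the index of the block that holds the given file position,
    // or -1 if the position lies outside the file.
    static int blockIndexFor(int[] lengths, long position) {
        if (position < 0) {
            return -1;
        }
        long start = 0;
        for (int i = 0; i < lengths.length; i++) {
            if (position < start + lengths[i]) {
                return i;
            }
            start += lengths[i];
        }
        return -1;
    }
}
```

With a 10-byte file and a 4-byte cap, split yields blocks of 4, 4 and 2 bytes,
and position 5 resolves to block 1. Each such block would become one column
value in the file's row, with the descriptor stored in a column of its own.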

The exhaustive mapping of a Lucene directory (file) to a Cassandra column 
family (row) is captured in the ColumnOrientedDirectory (ColumnOrientedFile) 
inner class. Specifically, each interprets Cassandra's data model in terms of 
Lucene's, and vice versa. More importantly, these are the only two 
inner classes that have a foot in both the Lucene and Cassandra camps.

All writes to a file in this directory occur through a CassandraIndexOutput, 
which puts the data flushed from a write-behind buffer into the fitting set of 
blocks. By the same token, all reads from a file in this directory occur 
through a CassandraIndexInput, which gets the data needed by a read-ahead 
buffer from the right set of blocks.
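
The write path can be pictured with the following sketch of a write-behind
buffer. It is an assumption-laden illustration rather than the
CassandraIndexOutput from the patch: WriteBehindSketch is a hypothetical name,
the buffer is sized to exactly one block, and an in-memory list stands in for
the columns that would be written to Cassandra.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the CassandraIndexOutput from the patch.
class WriteBehindSketch {
    private final byte[] buffer;   // holds bytes not yet pushed to the store
    private int buffered = 0;      // number of valid bytes in the buffer

    // Stand-in for the block columns that would live in Cassandra.
    final List<byte[]> flushedBlocks = new ArrayList<>();

    WriteBehindSketch(int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        buffer = new byte[blockSize];
    }

    // Writes accumulate locally; a store write happens only when a block fills.
    void writeByte(byte b) {
        buffer[buffered++] = b;
        if (buffered == buffer.length) {
            flush();
        }
    }

    // Push whatever is buffered as one (possibly short) block.
    void flush() {
        if (buffered == 0) {
            return;
        }
        flushedBlocks.add(Arrays.copyOf(buffer, buffered));
        buffered = 0;
    }
}
```

Writing 10 bytes with a 4-byte block size triggers two automatic flushes, and
a final explicit flush() emits the remaining 2 bytes as a third block. A
read-ahead buffer is the mirror image: it fetches a whole block on the first
read that touches it and serves later reads from memory.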

The last (but not least) inner class, CassandraClient, acts as a facade 
over a Thrift-based Cassandra client. In short, it provides operations to 
get/put rows/columns in the column family and keyspace associated with this 
directory.
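
As an illustration of the facade idea, the sketch below shows a narrow
get/put interface bound to one keyspace and column family. The interface and
its method names are hypothetical and do not reproduce CassandraClient or the
Thrift API; the in-memory class merely stands in for an implementation that
would talk to a Cassandra server.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical narrow interface; illustrative names, not the patch's API.
interface ColumnStore {
    void put(String rowKey, String columnName, byte[] value);
    byte[] get(String rowKey, String columnName);   // null when absent
    void deleteRow(String rowKey);
}

// In-memory stand-in for a Thrift-backed implementation. The keyspace and
// column family are fixed at construction, so callers never pass them around.
class InMemoryColumnStore implements ColumnStore {
    final String keyspace;
    final String columnFamily;
    private final Map<String, Map<String, byte[]>> rows = new HashMap<>();

    InMemoryColumnStore(String keyspace, String columnFamily) {
        this.keyspace = keyspace;
        this.columnFamily = columnFamily;
    }

    @Override
    public void put(String rowKey, String columnName, byte[] value) {
        rows.computeIfAbsent(rowKey, k -> new HashMap<>()).put(columnName, value);
    }

    @Override
    public byte[] get(String rowKey, String columnName) {
        Map<String, byte[]> columns = rows.get(rowKey);
        return columns == null ? null : columns.get(columnName);
    }

    @Override
    public void deleteRow(String rowKey) {
        rows.remove(rowKey);
    }
}
```

Keeping the rest of the directory code behind such an interface means that
only one class needs to change if the client library or its API does.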

Unlike Lucandra, which attempts to bridge the gap between Lucene and Cassandra 
at the document level, the CassandraDirectory is self-sufficient in the sense 
that it does not require a rewrite of any other component in the Lucene stack. 
In other words, one may use the CassandraDirectory in conjunction with the 
Lucene IndexWriter and IndexReader, as one would any other kind of Lucene 
Directory. Moreover, given that the data unit transferred to and from 
Cassandra is a large-sized block, one may expect fewer round trips, and hence 
better throughput, from the CassandraDirectory.
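
The round-trip argument is simple arithmetic, sketched below. The helper name
and the sizes used in the example are illustrative assumptions only; they are
not measurements, and they are not the defaults used by the patch.

```java
// Hypothetical helper: how many transfers are needed to move a file of
// fileLength bytes when each transfer carries at most unitSize bytes.
class RoundTripSketch {
    static long roundTrips(long fileLength, long unitSize) {
        if (fileLength < 0 || unitSize <= 0) {
            throw new IllegalArgumentException("fileLength >= 0 and unitSize > 0 required");
        }
        // Ceiling division without floating point.
        return (fileLength + unitSize - 1) / unitSize;
    }
}
```

For instance, a 64 MB file moved in 1 MB blocks needs 64 transfers, whereas
moving the same bytes in 1 KB units would need 65536.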

In conclusion, this directory attempts to marry the rich search-based query 
language of Lucene with the distributed fault-tolerant database that is 
Cassandra. By delegating the responsibilities of replication, durability and 
elasticity to the directory, we free the layers above from such non-functional 
concerns. Our hope is that users will choose to make their large-scale indices 
instantly scalable by seamlessly migrating them to this type of directory 
(using Directory#copyTo(Directory)).


> A Column-Oriented Cassandra-Based Lucene Directory
> --------------------------------------------------
>
>                 Key: LUCENE-2456
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2456
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*, Store
>    Affects Versions: 3.0.1
>            Reporter: Karthick Sankarachary
>         Attachments: LUCENE-2456.patch, LUCENE-2456.zip
>
