Hi, I have a data modeling question.

I have been investigating Cassandra as a trivial replacement for S3 for 
storing small objects.  GET, PUT and DELETE are all easy, but LIST is what is 
tripping me up.


S3 does a hierarchical list, driven by a prefix and a delimiter, that 
simulates traversing folders:

http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html


So say my schema is this:

CREATE TABLE "stuff" (key BLOB PRIMARY KEY, value BLOB);


I know that the prefix part is easy with a ByteOrderedPartitioner (and possibly 
with a secondary index in Cassandra 3.x?).  What trips me up is the delimiter 
part.


I have looked at a handful of open source projects that are S3 clones built on 
Cassandra, and they seem to do the prefix match and then manually search each 
matching key for the delimiter.  I have also looked at writing a user-defined 
aggregate (UDA), but UDAs seem to send all of the data to a single node to do 
the aggregation.
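
To make the manual approach concrete, here is a minimal Python sketch of what I understand those projects to be doing.  It is not taken from any of them: the function name and the in-memory sorted list of keys are my own stand-ins for the rows a prefix query would return.

```python
def list_with_delimiter(sorted_keys, prefix=b"", delimiter=b"/", max_keys=1000):
    """Roll up keys under `prefix` in the style of an S3 delimiter listing.

    `sorted_keys` is a hypothetical stand-in for the byte-ordered keys that a
    prefix query returns.  Returns (contents, common_prefixes).
    """
    contents = []
    common_prefixes = []
    for key in sorted_keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        idx = rest.find(delimiter)
        if idx == -1:
            # No delimiter after the prefix: this is a plain object.
            contents.append(key)
        else:
            # Everything up to and including the first delimiter is one "folder".
            rolled = prefix + rest[:idx + len(delimiter)]
            # Keys are sorted, so duplicates of a folder are always adjacent.
            if not common_prefixes or common_prefixes[-1] != rolled:
                common_prefixes.append(rolled)
        if len(contents) + len(common_prefixes) >= max_keys:
            break
    return contents, common_prefixes


keys = sorted([
    b"a.txt",
    b"photos/2016/a.jpg",
    b"photos/2016/b.jpg",
    b"photos/2017/c.jpg",
    b"photos/readme",
    b"zeta",
])
print(list_with_delimiter(keys))
print(list_with_delimiter(keys, prefix=b"photos/"))
```

The problem is that every key under the prefix is still read and inspected, so a "folder" holding a million objects costs a million rows just to produce one common prefix.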


What I am hoping to do is achieve what S3 does: "List performance is not 
substantially affected by the total number of keys in your bucket, nor by the 
presence or absence of the prefix, marker, maxkeys, or delimiter arguments."

(http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html)


Is there some sort of denormalization, indexing, or querying technique that I 
am missing that might help solve this?  I think it would work if a UDA could 
run a summary step on each node and return only that summary to be aggregated, 
but as far as I know that isn't possible.  It seems like a binary search within 
each partition involved in the prefix listing would be a quick and easy way to 
return the first 1000 results.
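
Here is a rough Python sketch of the kind of seek-based listing I have in mind.  It assumes the keys are available in byte order and uses an in-memory sorted list with `bisect` as a stand-in for seeking within a sorted range; nothing here is a Cassandra API, and `successor` and `list_skip_scan` are names I made up for illustration.

```python
from bisect import bisect_left


def successor(prefix):
    """Smallest byte string greater than every key that starts with `prefix`.

    Returns None when no such string exists (empty or all-0xff prefix).
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return None
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def list_skip_scan(sorted_keys, prefix=b"", delimiter=b"/", max_keys=1000):
    """List under `prefix`, jumping over each rolled-up "folder" in one seek.

    Returns (results, seeks): results is a list of ("key" | "prefix", bytes)
    in key order, seeks counts the binary searches performed.
    """
    results = []
    pos = bisect_left(sorted_keys, prefix)  # seek to the first candidate key
    seeks = 1
    while pos < len(sorted_keys) and len(results) < max_keys:
        key = sorted_keys[pos]
        if not key.startswith(prefix):
            break  # sorted input: we have left the prefix range
        rest = key[len(prefix):]
        idx = rest.find(delimiter)
        if idx == -1:
            results.append(("key", key))
            pos += 1
        else:
            rolled = prefix + rest[:idx + len(delimiter)]
            results.append(("prefix", rolled))
            nxt = successor(rolled)
            if nxt is None:
                break
            # Skip every key inside this "folder" with one binary search.
            pos = bisect_left(sorted_keys, nxt, pos)
            seeks += 1
    return results, seeks


big = sorted(
    [b"a.txt", b"photos/2017/c.jpg", b"photos/readme", b"zeta"]
    + [b"photos/2016/%04d.jpg" % i for i in range(1000)]
)
print(list_skip_scan(big, prefix=b"photos/"))
```

With this shape the work is proportional to the number of entries returned rather than the number of keys under the prefix, which is the property the S3 quote describes.  What I can't see is how to get Cassandra to do the equivalent seek on the server side.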


Is this even possible using Cassandra?


Thanks,

Jake Willoughby
