Others have certainly found benefits in combining Spark/Shark with a Dynamo-type KV store. With robust Hadoop Input/OutputFormats it's not too difficult (e.g. see this <http://www.slideshare.net/EvanChan2/cassandra2013-spark-talk-final> and this <http://tuplejump.github.io/calliope/>), and it may be possible to do as you suggest with the S3 API of Riak CS. What may also be worth exploring is whether Riak and Spark/Shark can rendezvous via Tachyon <https://github.com/amplab/tachyon/wiki>. That would be more of a research project right now, but it could end up someplace interesting.
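If you go the S3 route, the Spark side is mostly configuration. Here is a minimal sketch in Scala of what that could look like, assuming a Riak CS node fronting the cluster and a jets3t.properties on the classpath whose s3service.s3-endpoint points at it; the bucket, hostname, and keys below are placeholders, not anything Riak CS ships with:

    import org.apache.spark.SparkContext

    // Minimal sketch: read from Riak CS over the S3 API as Spark's "deep store".
    // Assumes a jets3t.properties on the classpath with
    //   s3service.s3-endpoint = <your Riak CS host>
    // so the stock s3n:// connector talks to Riak CS instead of Amazon.
    object RiakCsSparkSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[*]", "riak-cs-sketch")

        // Credentials are the Riak CS user's key id and secret (placeholders).
        sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "RIAK_CS_KEY_ID")
        sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "RIAK_CS_SECRET")

        // Riak CS speaks the S3 API, so plain s3n:// paths work as input.
        val lines = sc.textFile("s3n://my-bucket/events/*.log")
        println("lines: " + lines.count())

        sc.stop()
      }
    }

Since Shark tables can be declared over any path the underlying Hadoop filesystem layer can read, the same endpoint configuration should let Shark query those buckets as well.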
On Tue, Jul 30, 2013 at 1:24 PM, Dan Kerrigan <[email protected]> wrote:

> Geert-Jan -
>
> We're currently working on a somewhat similar project to integrate Flume
> to ingest data into Riak CS for later processing using Hadoop. The
> limitations of HDFS/S3, when using the s3:// or s3n:// URIs, seem to
> revolve around renaming objects (copy/delete) in Riak CS. If you can avoid
> that, this link should work fine.
>
> Regarding how data is stored in Riak CS, the data block storage is Bitcask,
> with manifest storage held in LevelDB. Riak CS is optimized for larger
> object sizes, and I believe smaller object sizes would not be nearly as
> efficient as working with plain Riak, if only because of the overhead
> incurred by Riak CS. The benefits of Riak generally carry over to Riak CS,
> so there shouldn't be any need to worry about losing raw power.
>
> Respectfully -
> Dan Kerrigan
>
>
> On Tue, Jul 30, 2013 at 2:21 PM, gbrits <[email protected]> wrote:
>
>> This may be totally missing the mark, but I've been reading up on ways
>> to do fast iterative processing in Storm or Spark/Shark, with the
>> ultimate goal of results ending up in Riak for fast multi-key retrieval.
>>
>> I want this setup to be as lean as possible for obvious reasons, so I've
>> started to look more closely at the possible Riak CS / Spark combo.
>>
>> Apparently (please correct me if I'm wrong) Riak CS sits on top of Riak
>> and is S3-API compliant. The underlying db for the objects is LevelDB
>> (which would have been my choice anyway, because of the low in-memory
>> key overhead). Apparently Bitcask is also used, although it's not clear
>> to me what for exactly.
>>
>> At the same time, Spark (with Shark on top, which is to Spark what Hive
>> is to Hadoop, if that in any way makes things clearer) can use HDFS or
>> S3 as its so-called 'deep store'.
>>
>> Combining this, it seems Riak CS and Spark/Shark could be a pretty tight
>> combo, providing iterative and ad-hoc querying through Shark plus all
>> the excellent stuff of Riak, via the S3 protocol which they both speak.
>>
>> Is this correct?
>> Would I lose any of the raw power of Riak when going with Riak CS?
>> Anyone ever tried this combo?
>>
>> Thanks,
>> Geert-Jan
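Picking up on Dan's point above about renames: the copy/delete pair usually comes from the Hadoop output committer renaming files out of a _temporary directory at job commit, and on an S3-style store a rename is a copy plus a delete. One way to sidestep that is to let the job write its part files to HDFS or local disk, then push the finished objects into Riak CS with plain PUTs. A rough sketch using the JetS3t client (the bucket, keys, and output directory are made up for illustration, and the endpoint is again assumed to come from jets3t.properties):

    import java.io.File
    import org.jets3t.service.impl.rest.httpclient.RestS3Service
    import org.jets3t.service.model.S3Object
    import org.jets3t.service.security.AWSCredentials

    // Rough sketch: upload finished part files to Riak CS with one PUT each,
    // so the CS side never sees the copy/delete that a rename would generate.
    object PushResultsToRiakCs {
      def main(args: Array[String]): Unit = {
        val service =
          new RestS3Service(new AWSCredentials("RIAK_CS_KEY_ID", "RIAK_CS_SECRET"))

        // Part files produced by a job that wrote its output to local disk.
        val parts = new File("/tmp/job-output")
          .listFiles()
          .filter(f => f.isFile && f.getName.startsWith("part-"))

        for (part <- parts)
          service.putObject("results-bucket", new S3Object(part))
      }
    }

Each object then lands in a single request, which keeps the S3-side semantics that Riak CS is happiest with.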
