Joe, I would say a good rule of thumb is tens of megabytes for a single cell. There are two limits that affect this:
1) Amount of memory used: this includes ingesting into the BatchWriter, buffering in the in-memory maps, scanning RFiles, and preparing query responses. At any given point there could be a few copies of the cell hanging out in memory, so you don't want to pack things too tightly. If you have ridiculous amounts of memory then you can squeeze in some pretty large docs.

2) Message size for client/server communication: this is limited to 1 GB by default, but can be increased if needed. A single key/value pair will not be fragmented across these message frames.

Whether to store bigger files as fragmented cells or as references to HDFS files typically comes down to security and lifecycle management. If you want cell-level security and encryption protection, you'll probably want to go with a fragmented key/value approach. If you want to keep all of your data in one spot for easier management, you might also prefer to fragment the files in Accumulo. Otherwise, sticking the file in HDFS and storing a reference to it in Accumulo is a pretty simple and good solution. There's a rough sketch of the chunking approach at the bottom of this message.

Billie did a project a while ago to fragment and store larger files in Accumulo. I'm not sure what happened with that, but it might be out there somewhere for you to use.

Cheers,
Adam

On Mon, Aug 18, 2014 at 11:36 AM, Joe Stein <[email protected]> wrote:
> Hi, for Accumulo is there a recommended max for column value size? So if
> we want to store files, at what point do we have to split the file into
> parts or (rather) just store it in HDFS with a reference path to it?
>
> /*******************************************
>  Joe Stein
>  Founder, Principal Consultant
>  Big Data Open Source Security LLC
>  http://www.stealth.ly
>  Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
> ********************************************/
>
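
P.S. Here is the rough sketch I mentioned of the fragmented key/value approach, written against the standard 1.6-era client API (BatchWriter, Mutation, Value). The table name "filedata", the 1 MB chunk size, the visibility label, and the row/qualifier layout are all placeholders I picked for illustration, not a fixed convention:

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;

public class FileChunkIngest {

  static final int CHUNK_SIZE = 1 << 20; // 1 MB per key/value pair

  public static void main(String[] args) throws Exception {
    // Connection details are placeholders for your cluster.
    Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
        .getConnector("user", new PasswordToken("secret"));

    // Keep the BatchWriter buffer comfortably larger than a single chunk.
    BatchWriterConfig cfg = new BatchWriterConfig()
        .setMaxMemory(64 * 1024 * 1024);
    BatchWriter writer = conn.createBatchWriter("filedata", cfg);

    String fileId = "docs/report.pdf";              // one row per file
    ColumnVisibility vis = new ColumnVisibility("PUBLIC");

    try (InputStream in = new FileInputStream("/tmp/report.pdf")) {
      byte[] buf = new byte[CHUNK_SIZE];
      int chunk = 0;
      int len;
      while ((len = in.read(buf)) > 0) {
        byte[] data = new byte[len];
        System.arraycopy(buf, 0, data, 0, len);

        // One mutation per chunk; a zero-padded qualifier keeps the
        // chunks sorted so a scan over the row returns them in order.
        Mutation m = new Mutation(fileId);
        m.put("chunk", String.format("%08d", chunk++), vis, new Value(data));
        writer.addMutation(m);
      }
    }
    writer.close();
  }
}

Reassembly is just a scan over the row, concatenating values in qualifier order. The HDFS-reference alternative would replace the loop with a single small mutation whose value is the HDFS path of the file.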
