Just as a point of reference, in one of our systems we have 500+ million rows 
with a cell in its own column family that is usually about 100 bytes, but in 
roughly 10,000 of those rows the cell can reach 300MB (the larger values 
average around 30MB). The jumbo-sized data gets loaded separately from the 
smaller data, although it all goes through the same pipeline. We are using 
cdh3b45 (0.90.1) with GZ compression, a region size of 1GB, and a max value 
size of 500MB. So far we have had no problems with the larger values.
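
A minimal sketch (not from the thread) of roughly that setup, using the 
0.90-era HBase Java client API. The table and family names are made up, and 
I'm assuming the "max value size" refers to hbase.client.keyvalue.maxsize; 
verify the property names and import paths against your HBase version.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class BlobTableSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Let the client send individual values up to ~500MB.
        conf.setLong("hbase.client.keyvalue.maxsize", 500L * 1024 * 1024);

        // ~1GB regions, with a GZ-compressed family for the data.
        HTableDescriptor table = new HTableDescriptor("blob_table");
        table.setMaxFileSize(1024L * 1024 * 1024);

        HColumnDescriptor family = new HColumnDescriptor("data");
        family.setCompressionType(Compression.Algorithm.GZ);
        table.addFamily(family);

        new HBaseAdmin(conf).createTable(table);
      }
    }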

Our biggest problem was insert performance when loading the small values into 
several column families, along with pauses while the memstores were flushing. 
0.90.1 helped quite a bit with that.

-chris



On Mar 8, 2011, at 10:54 AM, Jean-Daniel Cryans wrote:

>> The blobs vary in size from smallish (10K) to largish (20MB).
> 
> 20MB is quite large, but it could be harmless if most of the rows are under 1MB.
> 
>> They are too small to put into individual files in HDFS, but if I have too 
>> many largish rows in a region, I think I would suffer.
> 
> Yeah, need more info about the size distribution.
> 
>> 
>> Would it be possible to put the blobs in their own column family that has a 
>> significantly different block size (10x)?  I hesitate to do this mostly 
>> because I already have too many column families, but since I don't expect 
>> the blobs to be touched very often, a separate column family would make them 
>> mostly harmless.
> 
> The block size is dynamic: if you store a single 20MB cell, it will end up
> in one block of that size. Instead of creating a new family, you could
> also create a new table.
> 
> J-D
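
For illustration only, a rough sketch (mine, not from the thread) of the 
separate-family idea being discussed, using the same 0.90-era classes and 
imports as the sketch above. The table and family names and the 10x block 
size are made-up values, and as J-D notes, a cell larger than one block 
becomes its own block regardless of this setting.

    // Small, frequently-read cells in one family; rarely-touched blobs in
    // another family with a ~10x block size and GZ compression.
    HTableDescriptor table = new HTableDescriptor("docs");

    HColumnDescriptor meta = new HColumnDescriptor("meta");
    meta.setBlocksize(64 * 1024);            // default-sized 64KB blocks

    HColumnDescriptor blobs = new HColumnDescriptor("blob");
    blobs.setBlocksize(640 * 1024);          // ~10x larger blocks for the blobs
    blobs.setCompressionType(Compression.Algorithm.GZ);

    table.addFamily(meta);
    table.addFamily(blobs);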
