Hi all, 

I would like to discuss the available data block encoding implementations in 
HBase and how we can improve them. 
The most interesting one for me is FastDiffDeltaEncoder, because it encodes not 
only keys but also other fields such as timestamp, type, keyLen, etc. It also 
removes duplicated values, which in my opinion is its most controversial 
feature. Look at the following image: 

[IMG]http://i68.tinypic.com/8z2wzn.png[/IMG] 

This is an example of a small table with row keys Row-1, Row-2, Row-3 and 
columns Column-A, Column-B, Column-C. 
DataBlockEncoder encodes cells ordered by key. Each key consists of RowKey, 
Family and Qualifier, so the cells are encoded in the order shown by the blue 
line in the image. 

FastDiffDeltaEncoder calculates the difference between two consecutive cells. 
Because of that, the duplicated values in Column-A will not be removed: in key 
order, cells of the same column from different rows are not neighbors. The 
only case where this works is a single-column table. 
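
To make the ordering problem concrete, here is a minimal Java sketch. It is 
not HBase code: the class name, the helper methods and the cell values are 
all made up for illustration. It lays values out in key order and counts how 
many are equal to the value of the immediately preceding cell, which is the 
only kind of duplicate a neighbor-based comparison can see. 

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of HBase.
class NeighborDuplicateDemo {

    // Counts values that are byte-for-byte equal to the value of the
    // immediately preceding cell in key order.
    static int countNeighborDuplicates(List<byte[]> valuesInKeyOrder) {
        int duplicates = 0;
        for (int i = 1; i < valuesInKeyOrder.size(); i++) {
            if (Arrays.equals(valuesInKeyOrder.get(i - 1), valuesInKeyOrder.get(i))) {
                duplicates++;
            }
        }
        return duplicates;
    }

    // Lays out cell values in key order: all columns of Row-1, then Row-2, ...
    // Assumption for the demo: the first column holds the same value in every
    // row, every other cell holds a unique value.
    static List<byte[]> valuesInKeyOrder(int rows, int columns) {
        List<byte[]> values = new ArrayList<>();
        for (int r = 1; r <= rows; r++) {
            for (int c = 0; c < columns; c++) {
                String value = (c == 0) ? "same" : "r" + r + "c" + c;
                values.add(value.getBytes(StandardCharsets.UTF_8));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Three rows, three columns: the repeated values are never adjacent.
        System.out.println(countNeighborDuplicates(valuesInKeyOrder(3, 3)));
        // Three rows, one column: every cell after the first is a duplicate.
        System.out.println(countNeighborDuplicates(valuesInKeyOrder(3, 1)));
    }
}
```

With three columns the repeated first-column values are always separated by 
two other cells, so the neighbor comparison finds nothing; with a single 
column it finds a duplicate in every row after the first. 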

So, my suggestion is to detect duplicates within columns, not only between 
neighboring cells. I have also heard an idea not just to remove duplicated 
values, but to calculate the prefix difference between them, as is done for 
keys. 
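
As a rough illustration of what a prefix difference between two values could 
look like, here is a small Java sketch. The class, the method names and the 
sample values are hypothetical and do not reflect how HBase encoders store 
data; the sketch only shows the idea of keeping the shared prefix length plus 
the remaining suffix. 

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of HBase.
class ValuePrefixDelta {

    // Number of leading bytes shared by the previous and the current value.
    static int commonPrefixLength(byte[] previous, byte[] current) {
        int limit = Math.min(previous.length, current.length);
        int i = 0;
        while (i < limit && previous[i] == current[i]) {
            i++;
        }
        return i;
    }

    // The part of the current value that still has to be stored.
    static byte[] suffix(byte[] current, int prefixLength) {
        return Arrays.copyOfRange(current, prefixLength, current.length);
    }

    // Rebuilds the current value from the previous value and the stored suffix.
    static byte[] restore(byte[] previous, int prefixLength, byte[] suffix) {
        byte[] value = Arrays.copyOf(previous, prefixLength + suffix.length);
        System.arraycopy(suffix, 0, value, prefixLength, suffix.length);
        return value;
    }

    public static void main(String[] args) {
        byte[] previous = "value-100".getBytes(StandardCharsets.UTF_8);
        byte[] current = "value-101".getBytes(StandardCharsets.UTF_8);
        int prefixLength = commonPrefixLength(previous, current);
        byte[] stored = suffix(current, prefixLength);
        byte[] restored = restore(previous, prefixLength, stored);
        System.out.println(prefixLength + " shared bytes, " + stored.length + " stored");
        System.out.println(Arrays.equals(restored, current));
    }
}
```

An exact duplicate is just the special case where the suffix is empty, so 
this would cover duplicate removal as well. 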

To implement this we have to keep the previous value for each column. The most 
efficient way, in my opinion, is to keep them in a HashMap using 
ByteArrayWrapper for keys. The size of this map will equal the number of 
unique columns in the block being encoded. 
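
A minimal Java sketch of such a per-column map follows. It is only an 
illustration under a few assumptions: the class and method names are invented, 
the column is passed as one byte array (family plus qualifier), and 
java.nio.ByteBuffer is used as the map key in place of ByteArrayWrapper 
because it already compares and hashes by content. 

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical tracker, not part of HBase.
class PerColumnValueTracker {

    // Last value seen for each column. ByteBuffer stands in for a byte[]
    // wrapper because its equals/hashCode are content-based.
    private final Map<ByteBuffer, byte[]> previousValueByColumn = new HashMap<>();

    // Records the value for the column and reports whether it equals the
    // value previously recorded for that same column.
    boolean isDuplicateOfPrevious(byte[] column, byte[] value) {
        // Copies guard against the caller reusing its buffers.
        byte[] previous =
            previousValueByColumn.put(ByteBuffer.wrap(column.clone()), value.clone());
        return previous != null && Arrays.equals(previous, value);
    }

    // One entry per unique column seen so far.
    int trackedColumns() {
        return previousValueByColumn.size();
    }

    // Walks a demo table in key order and counts per-column duplicates.
    // Assumption: the first column holds the same value in every row.
    static int countColumnDuplicates(int rows, String[] columns) {
        PerColumnValueTracker tracker = new PerColumnValueTracker();
        int duplicates = 0;
        for (int r = 1; r <= rows; r++) {
            for (int c = 0; c < columns.length; c++) {
                String value = (c == 0) ? "same" : "r" + r + "c" + c;
                if (tracker.isDuplicateOfPrevious(bytes(columns[c]), bytes(value))) {
                    duplicates++;
                }
            }
        }
        return duplicates;
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String[] columns = {"Column-A", "Column-B", "Column-C"};
        System.out.println(countColumnDuplicates(3, columns));
    }
}
```

The cost is one map lookup per cell plus the copied key and value, which is 
probably where the CPU and memory trade-off would need to be measured. 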

It looks very easy to implement, but I guess there must be some hidden 
obstacles, because this has not been implemented yet. 

What do you think about the idea? Is there a more efficient way (in terms of 
CPU/memory) to keep the previous values? 
Should I try to implement prefix delta encoding for values?
