Hi Michael,

Thank you for your reply.

I will answer your questions one by one.

---I have to ask... why are you deleting anything from your columns?

When a column is null in the source system, we have to reflect that in
HBase, because HBase may still hold a value that the source system has
since dropped. The options are a Put or a Delete. A Delete is preferable
because it uses less disk space.
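
To make the two options concrete, here is a minimal sketch in plain Java of
how a record could be split into the columns that go into the Put and the
columns that go into the Delete. It deliberately uses no HBase client
classes; NullColumnSplitter and its fields are hypothetical names made up
for this example, not our actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of HBase: splits one source record into
// the columns that still have a value and the columns that are now null.
final class NullColumnSplitter {
    // Column name -> value, for columns that would be written with a Put.
    final Map<String, String> toPut = new LinkedHashMap<>();
    // Column names whose stored value would be removed with a Delete.
    final List<String> toDelete = new ArrayList<>();

    static NullColumnSplitter split(Map<String, String> sourceRecord) {
        NullColumnSplitter result = new NullColumnSplitter();
        for (Map.Entry<String, String> column : sourceRecord.entrySet()) {
            if (column.getValue() == null) {
                // Null in the source: anything HBase holds here is stale.
                result.toDelete.add(column.getKey());
            } else {
                result.toPut.put(column.getKey(), column.getValue());
            }
        }
        return result;
    }
}
```

The two collections would then be turned into one Put and one Delete for the
row and handed to HTable.batch(List<Row>) along with those of other records.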

---Your current logic is to take a field that contains a NULL and then
delete the contents from HBase. Why?
No really, Why?

Our current logic is to write a masking KeyValue with an empty value, so
that the most recent version of the cell is empty. Unfortunately this takes
up a quarter of our space in the grid. If truly deleting can reclaim 25% of
the HBase space, that sounds awesome, and should speed up reads with less

---You then have to make your DAO object class more complex when reading
from HBase because you need to account for the fields that are NULL.

This is a non-issue for us. It already accounts for nulls.

---If you just insert the NULL value in the column, you don't have that
Do you waste disk space? Sure. But really, does that matter?

Yes. The accompanying metadata eats up a quarter of our space.

---For your specific use case, you may actually want to track
(versioning) the values in that column. So that if there was a value and
then it's gone, you're going to want to know its history.

I think this is the most applicable downside to propagating deletes
instead of puts.

Storing a null in a row store makes a lot of sense to me. You eat a fixed
amount of space depending on the size of the row. In a sparse column store
like HBase, the frame and metadata are a real overhead. There's definitely
a tradeoff between this storage and the benefits of simply treating HBase
like a row store, and that's why I am curious if other engineers have
addressed this and are willing to weigh in.
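
As a rough illustration of that overhead, here is a back-of-the-envelope
sketch of what a single cell costs even when its value is empty. It assumes
the usual KeyValue layout as I understand it (key length, value length, row
length, family length, timestamp and type as fixed-width fields) and ignores
HFile block overhead and compression, so treat the result as an estimate
only. CellSizeEstimate and the row, family and qualifier names are
hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical estimator, not an HBase class.
final class CellSizeEstimate {
    // Fixed bytes per KeyValue: key length (4) + value length (4)
    // + row length (2) + family length (1) + timestamp (8) + type (1).
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    // Estimated serialized size of one cell, before any compression.
    static int serializedSize(String row, String family,
                              String qualifier, byte[] value) {
        return FIXED_OVERHEAD
                + utf8Length(row)
                + utf8Length(family)
                + utf8Length(qualifier)
                + value.length;
    }

    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For example, serializedSize("row-0000000001", "d", "phone", new byte[0])
comes to 40 bytes for a cell that stores nothing. As far as I know a delete
marker is also written as a KeyValue, so the space should only come back
once a major compaction drops the marker and the cells it masks.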

Thanks for your suggestions. I think the point about versioning is very
good and I will think carefully about it.


On 7/6/12 7:51 AM, "Michael Segel" <michael_se...@hotmail.com> wrote:

>I was going to post this yesterday, but real work got in the way...
>I have to ask... why are you deleting anything from your columns?
>The reason I ask is that you're sync'ing an object from an RDBMS to
>HBase. While HBase allows fields that contain NULL not to exist, your
>RDBMS doesn't. 
>Your current logic is to take a field that contains a NULL and then
>delete the contents from HBase. Why?
>No really, Why? 
>You then have to make your DAO object class more complex when reading
>from HBase because you need to account for the fields that are NULL.
>(In your use case, the DAO is reading/writing against NoSQL and RDBMSs.
>So it's a consistency issue.)
>If you just insert the NULL value in the column, you don't have that
>Do you waste disk space? Sure. But really, does that matter?
>For your specific use case, you may actually want to track (versioning)
>the values in that column. So that if there was a value and then it's
>gone, you're going to want to know its history.
>(For Auditing at a minimum.)
>I don't know, it's your application. The point is that you are making
>things more complex and you should think about the alternatives in design
>and the true cost differences.
>On Jul 5, 2012, at 1:28 PM, Ted Yu wrote:
>> Take a look at HBASE-3584: Allow atomic put/delete in one call
>> It is in 0.94, meaning it is not even in cdh4
>> Cheers
>> On Thu, Jul 5, 2012 at 11:19 AM, Keith Wyss <keith.w...@explorys.com>
>>> Hi,
>>> My organization has been doing something zany to simulate atomic row
>>> operations in HBase.
>>> We have a converter-object model for the writables that are populated
>>> an HBase table, and one of the governing assumptions
>>> is that if you are dealing with an Object record, you read all the
>>> that compose it out of HBase or a different data source.
>>> When we read lots of data in from a source system that we are trying to
>>> mirror with HBase, if a column is null that means that whatever is
>>> in HBase for that column is no longer valid. We have simulated what I
>>> believe is now called an AtomicRowMutation by using a single Put
>>> and populating it with blanks. The downside is the wasted space
>>> accrued by the metadata for the blank columns.
>>> Atomicity is not of utmost importance to us, but performance is. My
>>> approach has been to create a Put and Delete object for a record and
>>> populate the Delete with the null columns. Then we call
>>> HTable.batch(List<Row>) on a bunch of these. It is my impression that
>>> shouldn't appreciably increase network traffic as the RPC calls will be
>>> bundled.
>>> Has anyone else addressed this problem? Does this seem like a
>>> approach?
>>> What sort of performance overhead should I expect?
>>> Also, I've seen some Jira tickets about making this an atomic
>>> operation in its own right. Is that something that
>>> I can expect with CDH3U4?
>>> Thanks,
>>> Keith Wyss
