Hi Michael,

Thank you for your reply.

I will answer your questions one by one.

---I have to ask... why are you deleting anything from your columns?


When a column is null in the source system, we have to reflect that
somehow: we must assume that HBase may still hold a value that the source
system has recently dropped. The options are a Put or a Delete. Delete is
preferable because it reduces disk space.
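
For concreteness, here is a minimal sketch of the delete-on-null approach,
written against the client API this thread assumes (HTable and
HTable.batch(List<Row>)). The table layout and field names are made up
for illustration:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

public class NullSyncSketch {
    // Hypothetical column family and qualifiers.
    static final byte[] CF = Bytes.toBytes("d");
    static final byte[] NAME = Bytes.toBytes("name");
    static final byte[] PHONE = Bytes.toBytes("phone");

    static void syncRecord(HTable table, byte[] rowKey,
                           String name, String phone) throws Exception {
        Put put = new Put(rowKey);
        Delete delete = new Delete(rowKey);
        boolean hasPuts = false;
        boolean hasDeletes = false;

        if (name != null) {
            put.add(CF, NAME, Bytes.toBytes(name));
            hasPuts = true;
        } else {
            delete.deleteColumns(CF, NAME);   // drop all versions
            hasDeletes = true;
        }
        if (phone != null) {
            put.add(CF, PHONE, Bytes.toBytes(phone));
            hasPuts = true;
        } else {
            delete.deleteColumns(CF, PHONE);
            hasDeletes = true;
        }

        List<Row> batch = new ArrayList<Row>();
        // An empty Delete would wipe the whole row, so only add it when
        // at least one column was actually null.
        if (hasPuts) batch.add(put);
        if (hasDeletes) batch.add(delete);

        // The client bundles these per region server, but the Put and the
        // Delete are not applied atomically with respect to readers.
        Object[] results = new Object[batch.size()];
        table.batch(batch, results);
    }
}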


---Your current logic is to take a field that contains a NULL and then
delete the contents from HBase. Why?
No really, Why?


Our current logic writes a masking KeyValue with an empty value, so that
the most recent version of the column reads as empty. Unfortunately these
masking cells take up a quarter of our space on the grid. If truly
deleting can reclaim that 25% of the HBase space, that sounds awesome, and
it should also speed up reads, since there are fewer KeyValues to scan
past.
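
For comparison, the masking approach looks roughly like this (same
hypothetical names as the sketch above; EMPTY_BYTE_ARRAY comes from
org.apache.hadoop.hbase.HConstants). The empty value hides the old one on
read, but the cell's full key is still written to disk:

Put put = new Put(rowKey);
// The empty value masks the previous one, yet the KeyValue still carries
// the full row key, family, qualifier, and timestamp on disk.
put.add(CF, NAME, HConstants.EMPTY_BYTE_ARRAY);
table.put(put);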

---You then have to make your DAO object class more complex when reading
from HBase because you need to account for the fields that are NULL.


This is a non-issue for us: our DAO already accounts for nulls when
reading from HBase.
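
To illustrate with the same made-up names: on the read path, a missing
(deleted) column and an empty masking value both map to a Java null, so
true deletes would not add any branching we don't already have:

byte[] raw = result.getValue(CF, NAME);
// A deleted (absent) column and an empty masking value both become null.
String name = (raw == null || raw.length == 0) ? null : Bytes.toString(raw);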

---If you just insert the NULL value in the column, you don't have that
issue. Do you waste disk space? Sure. But really, does that matter?

Yes. Each masking cell is a full KeyValue whose key (row key, column
family, qualifier, timestamp, type) is stored in its entirety even though
the value is empty, and that metadata eats up a quarter of our space.
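
You can see the overhead directly by building an empty-valued KeyValue
(org.apache.hadoop.hbase.KeyValue) and checking its serialized length;
the row key here is made up:

KeyValue kv = new KeyValue(Bytes.toBytes("patient-000123"), CF, PHONE,
        System.currentTimeMillis(), KeyValue.Type.Put,
        HConstants.EMPTY_BYTE_ARRAY);
// Dozens of bytes on disk for a cell whose value is zero bytes long.
System.out.println(kv.getLength());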

---For your specific use case, you may actually want to track
(versioning) the values in that column. So that if there was a value and
then it's gone, you're going to want to know its history.

I think this is the most significant downside to propagating deletes
instead of puts.
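
With masking Puts the history stays readable (given enough versions
configured on the column family); with Delete, and deleteColumns in
particular, the old versions are masked immediately and physically removed
at the next major compaction. A sketch of reading that history under the
masking scheme:

Get get = new Get(rowKey);
get.setMaxVersions(10);  // assumes VERSIONS >= 10 on the column family
Result result = table.get(get);
// Newest-first versions of the column; empty values mark "dropped".
List<KeyValue> history = result.getColumn(CF, PHONE);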


Storing a null in a row store makes a lot of sense to me: you pay a fixed
amount of space determined by the width of the row. In a sparse column
store like HBase, the per-cell framing and metadata are a real overhead.
There's definitely a tradeoff between that storage cost and the benefits
of simply treating HBase like a row store, and that's why I am curious
whether other engineers have addressed this and are willing to weigh in.

Thanks for your suggestions. The point about versioning is a very good
one, and I will give it serious thought.

Keith


On 7/6/12 7:51 AM, "Michael Segel" <michael_se...@hotmail.com> wrote:

>I was going to post this yesterday, but real work got in the way...
>
>I have to ask... why are you deleting anything from your columns?
>
>The reason I ask is that you're sync'ing an object from an RDBMS to
>HBase. While HBase allows fields that contain NULL not to exist, your
>RDBMS doesn't. 
>
>Your current logic is to take a field that contains a NULL and then
>delete the contents from HBase. Why?
>No really, Why? 
>
>You then have to make your DAO object class more complex when reading
>from HBase because you need to account for the fields that are NULL.
>(In your use case, the DAO is reading/writing against NoSQL and RDBMSs.
>So it's a consistency issue.)
>
>If you just insert the NULL value in the column, you don't have that
>issue. 
>Do you waste disk space? Sure. But really, does that matter?
>
>For your specific use case, you may actually want to track (versioning)
>the values in that column. So that if there was a value and then it's
>gone, you're going to want to know its history.
>(For Auditing at a minimum.)
>
>I don't know, it's your application. The point is that you are making
>things more complex and you should think about the alternatives in design
>and the true cost differences.
>
>HTH
>
>-Mike
>
>On Jul 5, 2012, at 1:28 PM, Ted Yu wrote:
>
>> Take a look at HBASE-3584: Allow atomic put/delete in one call
>> It is in 0.94, meaning it is not even in cdh4
>> 
>> Cheers
>> 
>> On Thu, Jul 5, 2012 at 11:19 AM, Keith Wyss <keith.w...@explorys.com>
>>wrote:
>> 
>>> Hi,
>>> 
>>> My organization has been doing something zany to simulate atomic row
>>> operations in HBase.
>>> 
>>> We have a converter-object model for the writables that are populated
>>> in an HBase table, and one of the governing assumptions is that if you
>>> are dealing with an Object record, you read all the columns that
>>> compose it out of HBase or a different data source.
>>> 
>>> When we read lots of data in from a source system that we are trying
>>> to mirror with HBase, if a column is null, that means that whatever is
>>> in HBase for that column is no longer valid. We have simulated what I
>>> believe is now called an AtomicRowMutation by using a single Put and
>>> populating it with blanks. The downside is the wasted space accrued by
>>> the metadata for the blank columns.
>>> 
>>> Atomicity is not of utmost importance to us, but performance is. My
>>> approach has been to create a Put and a Delete object for each record
>>> and populate the Delete with the null columns. Then we call
>>> HTable.batch(List<Row>) on a bunch of these. It is my impression that
>>> this shouldn't appreciably increase network traffic, as the RPC calls
>>> will be bundled.
>>> 
>>> Has anyone else addressed this problem? Does this seem like a
>>> reasonable approach?
>>> What sort of performance overhead should I expect?
>>> 
>>> Also, I've seen some Jira tickets about making this an atomic
>>> operation in its own right. Is that something that I can expect with
>>> CDH3U4?
>>> 
>>> Thanks,
>>> 
>>> Keith Wyss
>>> 
>
>

