If you are talking about updates / deletes, then I would imagine you already have some notion of a primary key to refer to rows by.
As for handling deletes: from an HBase schema design perspective, you may need a secondary, insert-only table that stores delete transactions exclusively. The MR pipeline can then periodically scan that insert-only table to take note of deletions.

To handle updates on a single-family table (as a trivial case): storing the updated snapshot in the table is relatively straightforward, but capturing the update transactions themselves may require a secondary table (like a meta index), since scanning through the whole table to look for updates would be expensive, even as an M-R process.

The actual decision depends on the frequency of delete/update transactions on the schema under consideration and on the *width* of the column family changes, in terms of storing the transaction representations.

On 01/28/2010 09:06 PM, [email protected] wrote:
> What about if I want to analyse data that has update and delete records?
> In this scenario, HBase is a better M/R source than raw HDFS files, is that correct?
>
> Fleming Chiu(邱宏明)
> 707-6128
> [email protected]
> Meat-Free Monday: go vegetarian, save the Earth (Meat Free Monday Taiwan)
>
> Kay Kay <kaykay.uni...@gmail.com>
> To: [email protected]
> cc: (bcc: Y_823910/TSMC)
> Subject: Re: Hbase as Map/Reduce source
> 2010/01/29 11:05 AM
> Please respond to hbase-user
>
> HDFS is a double-edged sword. Since it is a raw file system, you can feed its files to a MapReduce program, although it might be necessary to define InputSplit-s as appropriate to chop down the input size.
>
> OTOH, HBase is structured data (well, sort of!) using a file format on top of HDFS to store the schema, and hence comes with predefined InputSplit-s that make it easy to get started on a MapReduce program. From an API simplicity point of view, HBase can get you started relatively faster because of it (assuming you have your data in hbase).
>
> Refer to http://wiki.apache.org/hadoop/Hbase/MapReduce .
>
> Although the wiki says deprecated, in reality it is suggested to stick with the *.mapred.* packages for some time, since the underlying *.mapreduce.* packages are not mature enough at this point.
>
> The decision has entirely to do with the kind of data you have and with identifying the data by a primary key amenable to your application, which is all hbase in its rudimentary form needs.
>
> On the other hand, if having a schema and defining a primary key for your data seems non-orthogonal for your app, you can stick with HDFS and a custom InputSplit depending on your data. HBase provides a lot more than HDFS in terms of scanning / row id ordering; if these features are not necessary for what you do, then storing data in HDFS should be just about ok.
>
> On 1/28/10 6:20 PM, Otis Gospodnetic wrote:
>> I asked a similar question recently:
>>
>> http://search-hadoop.com/[email protected]||hbase%20mapreduce%20otis%20TableInputFormat
>>
>> Otis
>>
>> ----- Original Message ----
>>> From: "[email protected]"<[email protected]>
>>> To: [email protected]
>>> Sent: Thu, January 28, 2010 8:02:39 PM
>>> Subject: Hbase as Map/Reduce source
>>>
>>> Hi,
>>>
>>> I want to understand clearly about Hbase as Map/Reduce source.
>>> Basically, if a table has 100 regions, it means 100 maps will be started, right?
>>> What's the difference between hdfs and hbase as a Map/Reduce source?
>>> Thanks
>>>
>>> Fleming Chiu(邱宏明)
>>> 707-6128
>>> [email protected]
>>> Meat-Free Monday: go vegetarian, save the Earth (Meat Free Monday Taiwan)
>>>
>>> ---------------------------------------------------------------------------
>>> TSMC PROPERTY
>>> This email communication (and any attachments) is proprietary information for the sole use of its intended recipient. Any unauthorized review, use or distribution by anyone other than the intended recipient is strictly prohibited. If you are not the intended recipient, please notify the sender by replying to this email, and then delete this email and any copies of it immediately. Thank you.
>>> ---------------------------------------------------------------------------
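
P.S. To make the insert-only delete table idea from the top of this reply a bit more concrete, here is a rough sketch in plain Python. It does not use any HBase client API: the two tables are modelled as in-memory structures, and the names `Store`, `delete_log`, `record_delete` and `apply_deletes` are made up purely for illustration. The point is only to show the shape of the approach: deletes are appended to a separate log, and a periodic job scans the log from its last checkpoint instead of scanning the main table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Store:
    # Main table: primary key -> latest snapshot of the row.
    rows: Dict[str, dict] = field(default_factory=dict)
    # Insert-only delete log: (sequence number, primary key); entries are never modified.
    delete_log: List[Tuple[int, str]] = field(default_factory=list)
    next_seq: int = 0

    def put(self, key: str, value: dict) -> None:
        # Insert or overwrite the snapshot for this primary key.
        self.rows[key] = value

    def record_delete(self, key: str) -> None:
        # Only append to the log; the main table is left untouched here.
        self.delete_log.append((self.next_seq, key))
        self.next_seq += 1


def apply_deletes(store: Store, last_seen: int) -> int:
    """Process log entries newer than last_seen and return the new checkpoint."""
    for seq, key in store.delete_log:
        if seq > last_seen:
            store.rows.pop(key, None)  # tolerate keys that are already gone
            last_seen = seq
    return last_seen
```

A periodic job would start with a checkpoint of -1 and pass the returned value into its next run, so each run only looks at deletions it has not seen yet. This sketch ignores the case where a key is written again after its delete was logged; a real design would likely need timestamps or versions to order those events.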
