bq. the bulk of the work involves deleting the files from the column family from HDFS
I think the first step when you delete files from a column family is archiving them, rather than removing them from HDFS outright.

FYI

On Mon, Feb 8, 2016 at 7:53 AM, Cameron, David A <david.a.came...@lmco.com> wrote:

> Hi,
>
> I'm working on a project where we have a strange use case.
>
> First off, we use bulk loading exclusively. We never use the put or bulk
> put interface to load data into tables.
>
> We have drivers that make me want to segregate data by tables and column
> families. Our data is clearly delineated by the job it came from. We
> would like to quickly either delete or export data from a given data set.
> To enable this I have been considering using column families to make it
> quick for us and easy on HBase to delete data that is no longer needed.
>
> It is my understanding that multiple column families bite you in the back
> side via the put interface and memstore, and that having multiple column
> families with different distributions among the partitions can cause
> lumpiness in your partitions. I have convinced myself that because our key
> space is so incredibly consistent, we don't have the lumpiness issue.
>
> And so I ask this: given that we don't use the memstore, are there any
> other drawbacks to using tables and column families to segregate data for
> easy/quick backup and deletion? If you are wondering about our backup
> strategy, it involves using snapshots and clones. Once a table is cloned we
> can delete the column families we don't want to export to tape from the
> cloned table. And delete becomes quick because the bulk of the work involves
> deleting the files from the column family from HDFS.
>
> All feedback is greatly appreciated!
>
> Thanks
>
> Dave