Since storage is your primary concern, take a look at Doug Meil's blog post
'The Effect of ColumnFamily, RowKey and KeyValue Design on HFile Size' on the
Apache HBase blog: http://blogs.apache.org/hbase/
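The gist of that post, applied to your case: every KeyValue stored in an
HFile repeats the full row key, column family name and qualifier, so short
names translate directly into smaller files. A minimal sketch of what that
means at table-creation time (0.98-era API; the table and family names here
are made up):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.io.compress.Compression;

public class ShortNamesExample {
  public static HTableDescriptor archiveTable() {
    // Every KeyValue in an HFile repeats the row key, family and qualifier,
    // so one-letter names save real space at this volume.
    HTableDescriptor table = new HTableDescriptor(TableName.valueOf("t"));
    HColumnDescriptor family = new HColumnDescriptor("d"); // "d", not "detailedData"
    family.setTimeToLive(30 * 24 * 60 * 60);                // your current 30-day ttl
    family.setCompressionType(Compression.Algorithm.SNAPPY); // as you have today
    table.addFamily(family);
    return table;
  }
}

The same applies to whatever schema you load the subset back into. Sketches
for the export and subset-load steps themselves are below, after the quoted
thread.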
Cheers

On Wed, Oct 8, 2014 at 9:45 AM, Nishanth S <nishanth.2...@gmail.com> wrote:

> Thanks Andrey. In the current system the hbase cfs have a ttl of 30 days
> and data gets deleted after this (the cfs have snappy compression). Below
> is what I am trying to achieve:
>
> 1. Export the data from the hbase table before it gets deleted.
> 2. Store it in some format which supports maximum compression (storage
>    cost is my primary concern here), so I am looking at parquet.
> 3. Load a subset of this data back into hbase based on certain rules
>    (say I want to load all rows which have a particular string in one of
>    the fields).
>
> I was thinking of bulk loading this data back into hbase, but I am not
> sure how I can load a subset of the data using
> org.apache.hadoop.hbase.mapreduce.Driver import.
>
> On Wed, Oct 8, 2014 at 10:20 AM, Andrey Stepachev <oct...@gmail.com>
> wrote:
>
> > Hi Nishanth.
> >
> > It is not clear what exactly you are building. Can you share a more
> > detailed description of what you are building and how the parquet
> > files are supposed to be ingested? Some questions arise:
> > 1. Is this an online import or a bulk load?
> > 2. Why do the rules need to be deployed to the cluster? Do you intend
> >    to do the reading inside the hbase region server?
> >
> > As for deploying filters, you can try to use coprocessors instead.
> > They can be configurable and loadable (but not unloadable, so you need
> > to think about some class loading magic like ClassWorlds).
> > For bulk imports you can create HFiles directly and add them
> > incrementally:
> > http://hbase.apache.org/book/arch.bulk.load.html
> >
> > On Wed, Oct 8, 2014 at 8:13 PM, Nishanth S <nishanth.2...@gmail.com>
> > wrote:
> >
> > > I was thinking of using org.apache.hadoop.hbase.mapreduce.Driver
> > > import. I could see that we can pass in filters to this utility, but
> > > that looks less flexible, since you need to deploy a new filter
> > > every time the rules for processing records change. Is there some
> > > way that we could define a rules engine?
> > >
> > > Thanks,
> > > -Nishan
> > >
> > > On Wed, Oct 8, 2014 at 9:50 AM, Nishanth S <nishanth.2...@gmail.com>
> > > wrote:
> > >
> > > > Hey folks,
> > > >
> > > > I am evaluating loading an hbase table from parquet files, based
> > > > on some rules that would be applied to the parquet file records.
> > > > Could someone help me with the best way to do this?
> > > >
> > > > Thanks,
> > > > Nishan
>
> > --
> > Andrey.
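To the numbered steps in your 9:45 message (going from memory on the
0.98-era APIs here, so treat class names and signatures as things to verify
against your release):

Step 1: the stock Export job can be driven programmatically as well as from
the CLI. Note it writes SequenceFiles of Result, so the parquet conversion
in step 2 is a second job. The table name and path below are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.Export;
import org.apache.hadoop.mapreduce.Job;

public class ArchiveBeforeTtl {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Same arguments as the CLI form: table, output dir, max versions.
    // Schedule it comfortably inside the 30-day ttl window.
    Job job = Export.createSubmittableJob(conf,
        new String[] { "t", "/archive/t-2014-10", "1" });
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}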
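On the "deploy a new filter every time" concern from earlier in the thread:
Import applies its filter in the mapper of the import job, so the filter
class ships with the job (e.g. via -libjars) rather than being installed on
the region servers, and a rule change is a job resubmit, not a cluster
deploy. Here is a sketch of a filter Import could use; the
import.filter.class / import.filter.args property names and the static
createFilterFromArguments hook are what I remember from 0.98's Import, so
double-check them:

import java.util.ArrayList;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

// Keeps only cells whose value contains a marker string. As far as I can
// tell Import applies the filter cell by cell, so this gives per-cell,
// not per-row, semantics.
public class ContainsStringFilter extends FilterBase {
  private final String needle;

  public ContainsStringFilter(String needle) {
    this.needle = needle;
  }

  @Override
  public ReturnCode filterKeyValue(Cell cell) {
    String value = Bytes.toString(CellUtil.cloneValue(cell));
    return value.contains(needle) ? ReturnCode.INCLUDE : ReturnCode.SKIP;
  }

  // Import instantiates filters through this static hook (the same one
  // the filter language uses), passing the -Dimport.filter.args values.
  public static Filter createFilterFromArguments(ArrayList<byte[]> args) {
    return new ContainsStringFilter(Bytes.toString(args.get(0)));
  }
}

Invocation would then look roughly like:

hbase org.apache.hadoop.hbase.mapreduce.Driver import \
  -Dimport.filter.class=ContainsStringFilter \
  -Dimport.filter.args=needle t /archive/t-2014-10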
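For whole-row rules, though, I would skip filters entirely and put the rule
in the mapper of a bulk-load job, which is the direction Andrey points at:
read the parquet files, emit Puts only for matching records, write HFiles,
then move them in with LoadIncrementalHFiles. Everything below (the field
names "id" and "msg", table "t", family "d", the marker string) is made up
for illustration, and the parquet-mr example API (Group/ExampleInputFormat)
is just the quick way to read parquet from MapReduce:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

import parquet.example.data.Group;
import parquet.hadoop.example.ExampleInputFormat;

public class ParquetSubsetBulkLoad {

  // The "rules engine" lives here: change the predicate, resubmit the job.
  public static class RuleMapper
      extends Mapper<Void, Group, ImmutableBytesWritable, Put> {
    @Override
    protected void map(Void key, Group record, Context ctx)
        throws IOException, InterruptedException {
      String msg = record.getString("msg", 0);   // hypothetical field
      if (!msg.contains("needle")) return;       // keep only matching rows
      byte[] row = Bytes.toBytes(record.getString("id", 0));
      Put put = new Put(row);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("m"), Bytes.toBytes(msg));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "parquet-subset-bulkload");
    job.setJarByClass(ParquetSubsetBulkLoad.class);
    job.setMapperClass(RuleMapper.class);
    job.setInputFormatClass(ExampleInputFormat.class);
    ExampleInputFormat.addInputPath(job, new Path(args[0]));
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    HTable table = new HTable(conf, "t");
    // Wires the partitioner and PutSortReducer so the HFiles line up with
    // the table's region boundaries.
    HFileOutputFormat2.configureIncrementalLoad(job, table);
    HFileOutputFormat2.setOutputPath(job, new Path(args[1]));

    if (!job.waitForCompletion(true)) System.exit(1);
    // Moves the finished HFiles into the regions (the incremental load
    // from http://hbase.apache.org/book/arch.bulk.load.html).
    new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table);
  }
}

Since the rule is plain Java in the mapper, swapping in a real rules engine
(Drools, or a config-driven predicate) is a job-side change only, which I
think answers the rules-engine question from the thread.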