I would have to agree. The use case doesn't make much sense for HBase and sounds a bit more like a problem for Hive.
The OP indicated that the data was disposable after a round of processing. IMHO Hive is a better fit.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 29, 2013, at 12:46 AM, Asaf Mesika <asaf.mes...@gmail.com> wrote:

> I actually don't see the benefit of saving the data into HBase if all you
> do is read per job id and purge it. Why not accumulate into HDFS per job
> id and then dump the file? The way I see it, HBase is good for querying
> parts of your data, even if it is only 10 rows. In your case your average
> is 1 billion, so streaming it from HDFS seems faster.
>
> On Saturday, April 27, 2013, Enis Söztutar wrote:
>
>> Hi,
>>
>> Interesting use case. I think it depends on how many jobId's you expect
>> to have. If it is on the order of thousands, I would caution against
>> going the one-table-per-jobId approach, since for every table there is
>> some master overhead, as well as file structures in HDFS. If the number
>> of jobId's is manageable, going with separate tables makes sense if you
>> want to efficiently delete all the data related to a job.
>>
>> Also, pre-splitting will depend on the expected number of jobIds /
>> batchIds and their ranges vs. the desired number of regions. You would
>> want to keep the number of regions hosted by a single region server in
>> the low tens; thus, your splits can be across jobs or within jobs
>> depending on cardinality. Can you share some more?
>>
>> Enis
>>
>>
>> On Fri, Apr 26, 2013 at 2:34 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> My understanding of your use case is that data for different jobIds
>>> would be continuously loaded into the underlying table(s).
>>>
>>> Looks like you can have one table per job. This way you drop the table
>>> after the map reduce is complete. In the single-table approach, you
>>> would delete many rows in the table, which is not as fast as dropping
>>> the separate table.
>>>
>>> Cheers
>>>
>>> On Sat, Apr 27, 2013 at 3:49 AM, Cameron Gandevia <cgande...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I am new to HBase. I have been trying to POC an application and have a
>>>> design question.
>>>>
>>>> Currently we have a single table with the following key design:
>>>>
>>>> jobId_batchId_bundleId_uniquefileId
>>>>
>>>> This is an offline processing system, so data would be bulk loaded
>>>> into HBase via map/reduce jobs. We only need to support
>>>> report-generation queries using map/reduce over a batch (and possibly
>>>> a single column filter) with the batchId as the start/end scan key.
>>>> Once we have finished processing a job we are free to remove the data
>>>> from HBase.
>>>>
>>>> We have varied workloads, so a job could be made up of 10 rows,
>>>> 100,000 rows or 1 billion rows, with the average falling somewhere
>>>> around 10 million rows.
>>>>
>>>> My question is related to pre-splitting. If we have a billion rows
>>>> all with the same batchId (our map/reduce scan key), my understanding
>>>> is we should perform pre-splitting to create buckets hosted by
>>>> different regions. If a job's workload can be so varied, would it
>>>> make sense to have a single table containing all jobs? Or should we
>>>> create one table per job and pre-split the table for the given
>>>> workload? If we had separate tables we could drop them when no longer
>>>> needed.
>>>>
>>>> If we didn't have a separate table per job, how should we perform
>>>> splitting? Should we choose our largest possible workload and split
>>>> for that, even though 90% of our jobs would fall in the lower bound
>>>> in terms of row count? Would we experience any issues purging jobs of
>>>> varying sizes if everything was in a single table?
>>>>
>>>> Any advice would be greatly appreciated.
>>>>
>>>> Thanks
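
To make the pre-splitting advice above concrete, here is a minimal sketch against the HBase client API of that era (HBaseAdmin / HTableDescriptor). The table name job_12345, the column family d, the region count, and the assumption of zero-padded four-digit batchIds are all hypothetical; real split points should come from the actual key distribution.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PresplitJobTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Hypothetical per-job table named after the jobId; with one table
        // per job, the whole job can be dropped once processing finishes.
        HTableDescriptor desc = new HTableDescriptor("job_12345");
        desc.addFamily(new HColumnDescriptor("d"));

        // Assume zero-padded numeric batchIds (0000-9999) and aim for ~10
        // regions: split points at 1000, 2000, ..., 9000 spread the batch
        // range evenly across regions instead of one hot region.
        int numRegions = 10;
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = Bytes.toBytes(String.format("%04d", i * 1000));
        }

        admin.createTable(desc, splits);
        admin.close();
    }
}

With a single shared table the same idea applies, except the split points would need jobId prefixes as well, which is hard to choose when job sizes vary by five orders of magnitude; that is one argument for the per-job-table route.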
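
And a sketch of the read-and-purge side under the same assumptions (batchId-prefixed row keys in a per-job table; MyBatchMapper and all id values are placeholders): a map/reduce scan bounded by batchId, then a table drop once the whole job is done, rather than a billion individual deletes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class BatchReport {
    // Placeholder mapper; a real one would emit report aggregates.
    static class MyBatchMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context) {
            // report-generation logic goes here
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Scan a single batch: with zero-padded batchIds, the next batchId
        // value is a natural exclusive stop row.
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("0007"));
        scan.setStopRow(Bytes.toBytes("0008"));
        scan.setCaching(500);          // fewer RPCs for a full-batch scan
        scan.setCacheBlocks(false);    // don't churn the block cache from MR

        Job job = new Job(conf, "report-job_12345-batch_0007");
        job.setJarByClass(BatchReport.class);
        TableMapReduceUtil.initTableMapperJob("job_12345", scan,
                MyBatchMapper.class, ImmutableBytesWritable.class,
                Result.class, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        job.waitForCompletion(true);

        // Once every batch in the job has been processed, purging is a
        // constant-time disable + drop instead of per-row deletes.
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.disableTable("job_12345");
        admin.deleteTable("job_12345");
        admin.close();
    }
}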