> What I can say is that I have a billion rows of data.
> I want to pull a specific 100K rows from the table.
Michael, I think I have exactly the same use case. Even the numbers are the same. I posted a similar question a couple of weeks ago, but unfortunately did not get a definite answer:

http://mail-archives.apache.org/mod_mbox/hbase-user/201009.mbox/%[email protected]%3e

So far I have decided to put HBase aside and experiment with Hadoop directly, using its BloomMapFile and its ability to quickly discard files that do not contain the requested keys. This implies that I need a custom InputFormat, many input map files, and a sorted list of input keys. I do not have any performance numbers yet to compare this approach to a full scan, but I am writing tests as we speak.

Please keep me posted if you find a good solution to this problem in general (M/R scanning through a random key subset, based on either HBase or Hadoop).
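
A minimal sketch of that BloomMapFile check, assuming Hadoop 0.20-era APIs and
Text keys/values; the custom InputFormat that would wrap this in an M/R job is
omitted, and the class, path, and variable names are invented for illustration:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BloomMapFile;
import org.apache.hadoop.io.Text;

// Sketch only: assumes Hadoop 0.20-era BloomMapFile API and Text keys/values.
public class BloomKeyLookup {

  // Looks up a sorted list of keys across several BloomMapFile directories.
  public static void lookup(Configuration conf, List<Path> mapFileDirs,
                            List<Text> sortedKeys) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    for (Path dir : mapFileDirs) {
      BloomMapFile.Reader reader =
          new BloomMapFile.Reader(fs, dir.toString(), conf);
      try {
        Text value = new Text();
        for (Text key : sortedKeys) {
          // The in-memory bloom filter rejects most absent keys without
          // touching the data file; false positives fall through to the
          // normal MapFile index lookup inside get().
          if (reader.probablyHasKey(key) && reader.get(key, value) != null) {
            System.out.println(key + "\t" + value);  // process the hit here
          }
        }
      } finally {
        reader.close();
      }
    }
  }
}

If none of the keys pass probablyHasKey() for a given file, that file's data
records are never read, which is where the "quickly discard files" part comes
from.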
On 10/12/10, Michael Segel <[email protected]> wrote:
>
> Dave,
>
> It's a bit more complicated than that.
>
> What I can say is that I have a billion rows of data.
> I want to pull a specific 100K rows from the table.
>
> The row keys are not contiguous, and you could say they are 'random', such
> that if I were to do a table scan, I'd have to scan the entire table (all
> regions).
>
> Now, if I had a list of the 100K rows, then from a single client I could
> just create 100 threads and grab rows from HBase, one at a time, in each
> thread.
>
> But in a m/r job I can't really do that. (I want to do processing on the
> data I get returned.)
>
> So, given a List object with the row keys, how do I do a map/reduce with
> this list as the starting point?
>
> Sure, I could write it to HDFS and then do a m/r job that reads from the
> file, setting my own splits to control parallelism.
> But I'm hoping for a more elegant solution.
>
> I know that it's possible, but I haven't thought it out... I was hoping
> someone else had this solved.
>
> thx
>
>> From: [email protected]
>> To: [email protected]
>> Date: Tue, 12 Oct 2010 08:35:25 -0700
>> Subject: RE: Using external indexes in an HBase Map/Reduce job...
>>
>> Sorry, I am not clear on exactly what you are trying to accomplish here.
>> I have a table roughly of that size, and it doesn't seem to cause me any
>> trouble. I also have a few separate solr indexes for data in the table
>> for query -- the solr query syntax is sufficient for my current needs.
>> This setup allows me to do a few things efficiently:
>> 1) batch processing of all records (e.g. tagging records that match a
>> particular criterion)
>> 2) search/lookup from a UI in an online manner
>> 3) it is also fairly easy to insert a bunch of records (keeping track of
>> their keys) and then run various batch processes only over those new
>> records -- essentially doing what you suggest: create a file of keys and
>> split the map task over that file.
>>
>> Dave
>>
>>
>> -----Original Message-----
>> From: Michael Segel [mailto:[email protected]]
>> Sent: Tuesday, October 12, 2010 5:36 AM
>> To: [email protected]
>> Subject: Using external indexes in an HBase Map/Reduce job...
>>
>> Hi,
>>
>> Now I realize that most everyone is sitting in NY, while some of us can't
>> leave our respective cities....
>>
>> I came across this problem and was wondering how others have solved it.
>>
>> Suppose you have a really large table with 1 billion rows of data.
>> Since HBase really doesn't have any indexes built in (don't get me
>> started about the contrib/transactional stuff...), you're forced to use
>> some sort of external index, or roll your own index table.
>>
>> The net result is that you end up with a list object that contains your
>> result set.
>>
>> So the question is... what's the best way to feed that list object in?
>>
>> One option I thought about is writing the object to a file, then using it
>> as the input file and controlling the splits. Not the most efficient
>> approach, but it would work.
>>
>> I was trying to find a more 'elegant' solution, and I'm sure that anyone
>> using SOLR or LUCENE or whatever... has come across this problem too.
>>
>> Any suggestions?
>>
>> Thx
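
And a minimal sketch of the other option discussed above -- writing the key
list to an HDFS file and letting the input splits control the parallelism --
assuming the newer org.apache.hadoop.mapreduce API with NLineInputFormat and a
reasonably recent HBase client; the table name, output handling, and class
names are invented for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch only: the table name "mytable" and the rest of the setup are
// assumptions, not taken from the thread.
public class KeyListJob {

  // Input is a text file of row keys, one per line; each mapper does random
  // Gets against HBase for its share of the keys.
  public static class GetMapper extends Mapper<LongWritable, Text, Text, Text> {
    private HTable table;

    @Override
    protected void setup(Context ctx) throws IOException {
      table = new HTable(HBaseConfiguration.create(ctx.getConfiguration()),
                         "mytable");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String rowKey = line.toString().trim();
      if (rowKey.isEmpty()) {
        return;
      }
      Result r = table.get(new Get(Bytes.toBytes(rowKey)));
      if (!r.isEmpty()) {
        // Real processing of the row goes here; this just emits the number
        // of cells that came back for the key.
        ctx.write(new Text(rowKey), new Text(Integer.toString(r.size())));
      }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
      table.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "fetch-specific-rows");
    job.setJarByClass(KeyListJob.class);
    job.setMapperClass(GetMapper.class);
    job.setNumReduceTasks(0);                        // map-only
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1000); // keys per map task
    FileInputFormat.addInputPath(job, new Path(args[0]));  // file of row keys
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

setNumLinesPerSplit is what controls the fan-out: 100K keys at 1,000 keys per
split gives about 100 map tasks doing the random reads in parallel, which is
roughly the "100 threads" idea spread across the cluster instead of a single
client.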
