> What I can say is that I have a billion rows of data.
> I want to pull a specific 100K rows from the table.

Michael, I think I have exactly the same use case. Even numbers are the same.

I posted a similar question a couple of weeks ago, but unfortunately
did not get a definite answer:

http://mail-archives.apache.org/mod_mbox/hbase-user/201009.mbox/%[email protected]%3e

So far, I have decided to put HBase aside and experiment with Hadoop
directly, using its BloomMapFile and its ability to quickly discard
files that do not contain the requested keys.
This implies that I need a custom InputFormat, many input map files,
and a sorted list of input keys.
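
For what it's worth, the lookup side of that looks roughly like this
(a minimal sketch, not my actual code; it assumes Text keys and values
and a flat directory of BloomMapFiles):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BloomMapFile;
import org.apache.hadoop.io.Text;

// Look up one key across many BloomMapFiles, skipping any file whose
// bloom filter says the key cannot be present.
public class BloomLookup {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);  // directory of BloomMapFile dirs
    Text key = new Text(args[1]);       // one of the requested keys
    Text value = new Text();

    for (FileStatus part : fs.listStatus(inputDir)) {
      BloomMapFile.Reader reader =
          new BloomMapFile.Reader(fs, part.getPath().toString(), conf);
      try {
        // probablyHasKey() is an in-memory bloom filter test; a false
        // result means the key is definitely not in this file, so the
        // index seek and data read are skipped entirely.
        if (reader.probablyHasKey(key) && reader.get(key, value) != null) {
          System.out.println(part.getPath() + "\t" + value);
        }
      } finally {
        reader.close();
      }
    }
  }
}

The idea for the custom InputFormat is the same: drop whole files at
split time when the bloom filter rejects every requested key, so they
never get a map task at all.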

I do not have any performance numbers yet to compare this approach to
the full scan, but I am writing tests as we speak.

Please keep me posted if you find a good solution for this problem in
general (M/R scanning through a random key subset, based on either
HBase or Hadoop).
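
For the HBase-based variant, the most direct thing I can think of is
what Michael describes below: dump the keys to a file in HDFS (one row
key per line), let NLineInputFormat hand each map task a slice of that
file, and issue point Gets from the mappers. A rough sketch using the
old mapred API; the table name, column family and qualifier ("mytable",
"cf", "qual") are just placeholders:

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Driver side (not shown): set NLineInputFormat as the input format and
// tune mapred.line.input.format.linespermap to control how many keys
// each map task handles, i.e. the degree of parallelism.
public class KeyListMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private HTable table;

  @Override
  public void configure(JobConf job) {
    try {
      // Placeholder table name; one HTable per task, reused across map() calls.
      table = new HTable(new HBaseConfiguration(), "mytable");
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  // The input value is one row key per line from the key file in HDFS.
  public void map(LongWritable offset, Text rowKey,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    Result result = table.get(new Get(Bytes.toBytes(rowKey.toString())));
    if (!result.isEmpty()) {
      // Real processing would go here; this just emits one column's value.
      byte[] v = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
      out.collect(rowKey, new Text(v == null ? "" : Bytes.toString(v)));
    }
  }
}

The Gets go straight to the regions that own the keys, so nothing scans
the full table, and the parallelism is simply the number of map tasks.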



On 10/12/10, Michael Segel <[email protected]> wrote:
>
>
> Dave,
>
> It's a bit more complicated than that.
>
> What I can say is that I have a billion rows of data.
> I want to pull a specific 100K rows from the table.
>
> The row keys are not contiguous, and you could say they are 'random', such
> that if I were to do a table scan, I'd have to scan the entire table (all
> regions).
>
> Now, if I had a list of the 100K rows, then from a single client I could
> just create 100 threads and grab rows from HBase one at a time in each
> thread.
>
> But in an M/R job, I can't really do that.  (I want to do processing on the
> data I get returned.)
>
> So given a List object with the row keys, how do I do a map/reduce with
> this list as the starting point?
>
> Sure, I could write it to HDFS and then do an M/R job reading from the file,
> setting my own splits to control parallelism.
> But I'm hoping for a more elegant solution.
>
> I know that it's possible, but I haven't thought it out... I was hoping
> someone else had this solved.
>
> thx
>
>> From: [email protected]
>> To: [email protected]
>> Date: Tue, 12 Oct 2010 08:35:25 -0700
>> Subject: RE: Using external indexes in an HBase Map/Reduce job...
>>
>> Sorry, I am not clear on exactly what you are trying to accomplish here.
>> I have a table roughly of that size, and it doesn't seem to cause me any
>> trouble.  I also have a few separate Solr indexes for data in the table
>> for query -- the Solr query syntax is sufficient for my current needs.
>> This setup allows me to do a few things efficiently:
>> 1) batch processing of all records (e.g. tagging records that match a
>> particular criteria)
>> 2) search/lookup from a UI in an online manner
>> 3) it is also fairly easy to insert a bunch of records (keeping track of
>> their keys), and then run various batch processes only over those new
>> records -- essentially doing what you suggest: create a file of keys and
>> split the map task over that file.
>>
>> Dave
>>
>>
>> -----Original Message-----
>> From: Michael Segel [mailto:[email protected]]
>> Sent: Tuesday, October 12, 2010 5:36 AM
>> To: [email protected]
>> Subject: Using external indexes in an HBase Map/Reduce job...
>>
>>
>> Hi,
>>
>> Now I realize that most everyone is sitting in NY, while some of us can't
>> leave our respective cities....
>>
>> Came across this problem and I was wondering how others solved it.
>>
>> Suppose you have a really large table with 1 billion rows of data.
>> Since HBase really doesn't have any indexes built in (Don't get me started
>> about the contrib/transactional stuff...), you're forced to use some sort
>> of external index, or roll your own index table.
>>
>> The net result is that you end up with a list object that contains your
>> result set.
>>
>> So the question is... what's the best way to feed the list object in?
>>
>> One option I thought about is writing the object to a file, using that
>> file as the input, and then controlling the splits. Not the most
>> efficient, but it would work.
>>
>> I was trying to find a more 'elegant' solution, and I'm sure that anyone
>> using Solr or Lucene or whatever... has come across this problem too.
>>
>> Any suggestions?
>>
>> Thx
>>
>
