Sorry, I am not clear on exactly what you are trying to accomplish here.  I 
have a table of roughly that size, and it doesn't seem to cause me any trouble. 
I also keep a few separate Solr indexes over the data in the table for querying 
-- the Solr query syntax is sufficient for my current needs.  This setup allows 
me to do a few things efficiently:
1) batch processing of all records (e.g. tagging records that match a 
particular criterion)
2) search/lookup from a UI in an online manner
3) inserting a batch of records (keeping track of their keys) and then running 
various batch processes over only those new records -- essentially doing what 
you suggest: create a file of keys and split the map task over that file (a 
rough sketch of that approach follows below).
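
For concreteness, here is a minimal, untested sketch of that key-file pattern 
(not my actual code): the job's input is a plain text file with one HBase row 
key per line, and each mapper looks the row up with a Get and does its batch 
work there. The table name and output types are placeholders, and it assumes 
the usual HBase client API of that vintage (HTable, Get, Result).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: a text file with one HBase row key per line.
// Each map() call fetches the corresponding row and does the batch work.
public class KeyFileMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = HBaseConfiguration.create(context.getConfiguration());
        table = new HTable(conf, "my_table");   // placeholder table name
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String rowKey = line.toString().trim();
        if (rowKey.isEmpty()) {
            return;
        }
        Result row = table.get(new Get(Bytes.toBytes(rowKey)));
        if (!row.isEmpty()) {
            // Real batch work (e.g. tagging the record) would go here;
            // this sketch just emits the keys that were actually found.
            context.write(new Text(rowKey), NullWritable.get());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        table.close();
    }
}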

Dave


-----Original Message-----
From: Michael Segel [mailto:[email protected]] 
Sent: Tuesday, October 12, 2010 5:36 AM
To: [email protected]
Subject: Using external indexes in an HBase Map/Reduce job...


Hi,

Now I realize that most everyone is sitting in NY, while some of us can't leave 
our respective cities....

Came across this problem and I was wondering how others solved it.

Suppose you have a really large table with 1 billion rows of data. 
Since HBase really doesn't have any indexes built in (Don't get me started 
about the contrib/transactional stuff...), you're forced to use some sort of 
external index, or roll your own index table.

The net result is that you end up with a list object that contains your result 
set.

So the question is... what's the best way to feed the list object in?

One option I thought about is writing the list out to a file, using that file as 
the job's input, and then controlling the splits. Not the most efficient, but it 
would work.
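
(For illustration only -- a rough, untested sketch of that file-based option, 
assuming a Hadoop release whose new mapreduce API ships NLineInputFormat so you 
can control how many keys each map task gets. The paths and lines-per-split 
value are placeholders, and KeyFileMapper stands in for a mapper like the one 
sketched in the reply above.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyFileDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "hbase-lookup-from-key-file");
        job.setJarByClass(KeyFileDriver.class);

        // Feed the job the file of keys and control the splits:
        // each map task gets up to 10,000 lines (keys) -- an arbitrary number.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 10000);
        FileInputFormat.addInputPath(job, new Path("/tmp/result-keys.txt"));   // placeholder path

        job.setMapperClass(KeyFileMapper.class);   // e.g. the mapper sketched above
        job.setNumReduceTasks(0);                  // map-only batch job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/key-lookup-out"));  // placeholder path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}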

I was trying to find a more 'elegant' solution, and I'm sure that anyone using 
Solr or Lucene or whatever... has come across this problem too.

Any suggestions? 

Thx
