Sorry, I am not clear on exactly what you are trying to accomplish here. I have a table of roughly that size, and it doesn't seem to cause me any trouble. I also keep a few separate Solr indexes over the data in the table for queries -- the Solr query syntax is sufficient for my current needs. This setup lets me do a few things efficiently:

1) batch processing of all records (e.g. tagging records that match a particular criterion)
2) search/lookup from a UI in an online manner
3) inserting a batch of records (keeping track of their keys) and then running various batch processes over only those new records -- essentially doing what you suggest: create a file of keys and split the map task over that file. A sketch of that pattern is below.
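For illustration, here's a minimal sketch of the key-file-driven batch pass, assuming a recent Hadoop/HBase client API. The table name "mytable" and the "meta:tag" column are placeholders, not anything from your setup:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class KeyFileTagJob {

  // Input is a text file with one HBase row key per line.
  public static class TagMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
      // "mytable" is a placeholder table name.
      table = new HTable(HBaseConfiguration.create(context.getConfiguration()), "mytable");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      byte[] rowKey = Bytes.toBytes(line.toString().trim());
      Result row = table.get(new Get(rowKey));
      if (row.isEmpty()) {
        return; // key no longer present, skip it
      }
      // Example batch operation: tag the row. Family/qualifier are placeholders.
      Put tag = new Put(rowKey);
      tag.add(Bytes.toBytes("meta"), Bytes.toBytes("tag"), Bytes.toBytes("processed"));
      table.put(tag);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      table.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "tag-rows-from-key-file");
    job.setJarByClass(KeyFileTagJob.class);
    job.setMapperClass(TagMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    // Default TextInputFormat; swap in NLineInputFormat if you want to cap keys per map task.
    FileInputFormat.addInputPath(job, new Path(args[0])); // path to the key file
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}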
Dave

-----Original Message-----
From: Michael Segel [mailto:[email protected]]
Sent: Tuesday, October 12, 2010 5:36 AM
To: [email protected]
Subject: Using external indexes in an HBase Map/Reduce job...

Hi,

Now I realize that most everyone is sitting in NY, while some of us can't leave our respective cities...

I came across this problem and was wondering how others have solved it.

Suppose you have a really large table with 1 billion rows of data. Since HBase doesn't really have any indexes built in (don't get me started on the contrib/transactional stuff...), you're forced to use some sort of external index, or roll your own index table. The net result is that you end up with a list object containing your result set.

So the question is... what's the best way to feed that list object into the map/reduce job? One option I thought about is writing the object out to a file, using that file as the job input, and then controlling the splits. Not the most efficient approach, but it would work.

I was trying to find a more 'elegant' solution, and I'm sure that anyone using Solr or Lucene (or whatever) has come across this problem too.

Any suggestions?

Thx
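A minimal sketch of the producer side of that file-based option -- querying Solr for matching documents and writing their HBase row keys to an HDFS file that the map/reduce job then takes as input. The Solr URL, the "rowkey" stored field, the output path, and the command-line query are all assumptions, and the simple start/rows paging is chosen for brevity rather than efficiency:

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrKeyFileWriter {

  public static void main(String[] args) throws Exception {
    // Placeholder Solr core URL.
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path keyFile = new Path("/tmp/jobs/keyfile.txt"); // becomes the M/R input path

    BufferedWriter out =
        new BufferedWriter(new OutputStreamWriter(fs.create(keyFile, true)));
    try {
      int pageSize = 10000;
      int start = 0;
      while (true) {
        // args[0] is the Solr query, e.g. "status:pending";
        // "rowkey" is assumed to be a stored field holding the HBase row key.
        SolrQuery q = new SolrQuery(args[0]);
        q.setFields("rowkey");
        q.setStart(start);
        q.setRows(pageSize);
        QueryResponse rsp = solr.query(q);
        if (rsp.getResults().isEmpty()) {
          break;
        }
        for (SolrDocument doc : rsp.getResults()) {
          out.write(doc.getFieldValue("rowkey").toString());
          out.newLine();
        }
        start += pageSize;
      }
    } finally {
      out.close();
    }
  }
}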
