He's saying that if you're going to be searching for one random item at a time, and you want it to be /fast/, then issuing a full table scan to a Hadoop cluster as a map/reduce job is not a good solution. It's not practical to send out a swarm of nodes hunting through your whole dataset every time you want to find one (or a few) rows.
For example, say you have a web form where users can search for other users by last name, "name:last" being one of your columns. This is an especially bad case for using map/reduce to scan the full table.

-- Jim

On Thu, May 29, 2008 at 1:06 PM, shimon golan <[EMAIL PROTECTED]> wrote:
> You said "If they need to be random and low-latency, then you're going to
> have some trouble using map/reduce" -
> why is that?
>
> Thanks
> Shimon
>
> On Thu, May 29, 2008 at 5:51 PM, Bryan Duxbury <[EMAIL PROTECTED]> wrote:
>
>> Right now, we don't have a plan to add indexing. It's a particularly
>> complicated area of functionality. We'd love to get suggestions (or
>> patches :) for how to do this the right way.
>>
>> Map/reduce would be the perfect way to find all records that match on a
>> column value, so long as you're able to do all the accesses in batch. If
>> they need to be random and low-latency, then you're going to have some
>> trouble using map/reduce. HBase provides an interface to map/reduce that
>> lets you split up the work a region at a time, so you can highly
>> parallelize your processing; it should be reasonably fast.
>>
>> -Bryan
>>
>> On May 29, 2008, at 7:56 AM, shimon golan wrote:
>>
>>> Thanks, Bryan, for the quick response!
>>> My table is very sparse and contains 100 column families, each
>>> containing about 20 items. Also, I need to search the table by each of
>>> the columns, so the solution you suggested seems somewhat complicated
>>> for this purpose.
>>>
>>> So my ensuing questions are:
>>> 1. Do you plan to support indexes on columns, thus avoiding the need to
>>> scan when a column value is queried?
>>> 2. I thought about using map/reduce to parallelize the search over
>>> different regions of the table - could this be accomplished?
>>>
>>> On Thu, May 29, 2008 at 5:42 PM, Bryan Duxbury <[EMAIL PROTECTED]> wrote:
>>>
>>>> In a regular HBase table, there is no "most efficient" way to do this.
>>>> The only operation you have available is a table scan.
>>>>
>>>> If you find yourself looking up rows by their column values, then you
>>>> must do some extra work to make that possible. First, make absolutely
>>>> sure that you have the right row key picked out. If your accesses are
>>>> dominated by searching on a column value, then perhaps that column
>>>> should be your primary key.
>>>>
>>>> If you must have both the existing primary key and the
>>>> column-value-based lookups, then your best bet is probably to make an
>>>> "index" table. The way that works is that every time you write or
>>>> delete a value in the primary table, you also write or delete a value
>>>> in the index table, with the column value as the key and the row key as
>>>> a column value. Then, when you are trying to find a row by its column
>>>> value, you look in the index table to find the row key, and then you
>>>> query the main table with that row key. It's more work, but this is the
>>>> best that HBase can offer at the moment.
>>>>
>>>> -Bryan
>>>>
>>>> On May 29, 2008, at 5:47 AM, Shimon wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> What is the most efficient way to retrieve a row when the value of a
>>>>> certain column is specified (rather than the row key)?
>>>>>
>>>>> Thanks in advance
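Bryan's "index table" pattern can be sketched in plain Python, using dicts as stand-ins for the primary and index tables. All names here (users, users_by_lastname, the helper functions) are illustrative, not HBase API; the point is only to show the double-write on update/delete and the two-step lookup that replaces a full scan:

```python
# Stand-ins for two HBase tables (illustrative only, not HBase client code):
users = {}              # primary table: row key -> row data
users_by_lastname = {}  # index table: column value -> set of row keys

def put_user(row_key, data):
    """Write to the primary table and keep the index table in sync."""
    old = users.get(row_key)
    if old is not None:
        # Remove the stale index entry before overwriting the row.
        users_by_lastname.get(old["name:last"], set()).discard(row_key)
    users[row_key] = data
    users_by_lastname.setdefault(data["name:last"], set()).add(row_key)

def delete_user(row_key):
    """Delete from the primary table and from the index table."""
    old = users.pop(row_key, None)
    if old is not None:
        users_by_lastname.get(old["name:last"], set()).discard(row_key)

def find_by_lastname(last):
    """Two point lookups instead of a scan: index first, then primary."""
    return [users[k] for k in users_by_lastname.get(last, set())]

put_user("u1", {"name:first": "Ada", "name:last": "Lovelace"})
put_user("u2", {"name:first": "Alan", "name:last": "Turing"})
print(find_by_lastname("Turing"))
```

The cost of the pattern is that every write touches two tables, and (since there are no cross-table transactions here) the two can briefly disagree; the benefit is that the by-column lookup becomes two keyed gets rather than a scan of the whole table.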