He's saying that if you're going to be searching for one random item at a time, and you want it to be /fast/, then issuing a full table scan to a Hadoop cluster as a map/reduce job is not a good solution. It's not practical to send out a swarm of nodes hunting through your whole dataset every time you want to find one (or a few) rows.
For example, say you have a web form where users can search for other users by last name, "name:last" being one of your columns. This is an especially bad case for using map/reduce to scan the full table.

-- Jim

On Thu, May 29, 2008 at 1:06 PM, shimon golan <[EMAIL PROTECTED]> wrote:
> You said "If they need to be random and low-latency, then you're going to
> have some trouble using map/reduce" -
> why is that?
>
> Thanks
> Shimon
>
> On Thu, May 29, 2008 at 5:51 PM, Bryan Duxbury <[EMAIL PROTECTED]> wrote:
>
>> Right now, we don't have a plan to add indexing. It's a particularly
>> complicated area of functionality. We'd love to get suggestions (or
>> patches :) for how to do this the right way.
>>
>> Map/reduce would be the perfect way to find all records that match on a
>> column value, so long as you're able to do all the accesses in batch. If
>> they need to be random and low-latency, then you're going to have some
>> trouble using map/reduce. HBase provides an interface to map/reduce that
>> lets you split up the work a region at a time, so you can highly
>> parallelize your processing; it should be reasonably fast.
>>
>> -Bryan
>>
>> On May 29, 2008, at 7:56 AM, shimon golan wrote:
>>
>>> Thanks, Bryan, for the quick response!
>>> My table is very sparse and contains 100 column families, each
>>> containing about 20 items. Also, I need to search the table by each of
>>> the columns, so the solution you suggested seems somewhat complicated
>>> for this purpose.
>>>
>>> So my ensuing questions are:
>>> 1. Do you plan to support indexes on columns, thus avoiding the need to
>>> scan when a column value is queried?
>>> 2. I thought about using map/reduce to parallelize the search over
>>> different regions of the table - could this be accomplished?
>>>
>>> On Thu, May 29, 2008 at 5:42 PM, Bryan Duxbury <[EMAIL PROTECTED]> wrote:
>>>
>>>> In a regular HBase table, there is no "most efficient" way to do this.
>>>> The only operation you have available is a table scan.
>>>>
>>>> If you find yourself looking up rows by their column values, then you
>>>> must do some extra work to make that possible. First, make absolutely
>>>> sure that you have the right row key picked out. If your accesses are
>>>> dominated by searching on a column value, then perhaps that column
>>>> should be your primary key.
>>>>
>>>> If you must have both the existing primary key and the
>>>> column-value-based lookups, then your best bet is probably to make an
>>>> "index" table. The way that works is that every time you write or
>>>> delete a value in the primary table, you also write or delete a value
>>>> in the index table, with the column value as the key and the row key as
>>>> a column value. Then, when you are trying to find a row by its column
>>>> value, you look in the index table to find the row key, and then you
>>>> query the main table with that row key. It's more work, but this is the
>>>> best that HBase can offer at the moment.
>>>>
>>>> -Bryan
>>>>
>>>> On May 29, 2008, at 5:47 AM, Shimon wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> What is the most efficient way to retrieve a row when the value of a
>>>>> certain column is specified (rather than the row key)?
>>>>>
>>>>> Thanks in advance
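Bryan's "index table" pattern can be sketched in plain Python, using dicts as stand-ins for the primary and index tables. All names here (users, users_by_lastname, the helper functions) are illustrative, not HBase API; the point is only to show the double-write on update/delete and the two-step lookup that replaces a full scan:

```python
# Stand-ins for two HBase tables (illustrative only, not HBase client code):
users = {}              # primary table: row key -> row data
users_by_lastname = {}  # index table: column value -> set of row keys

def put_user(row_key, data):
    """Write to the primary table and keep the index table in sync."""
    old = users.get(row_key)
    if old is not None:
        # Remove the stale index entry before overwriting the row.
        users_by_lastname.get(old["name:last"], set()).discard(row_key)
    users[row_key] = data
    users_by_lastname.setdefault(data["name:last"], set()).add(row_key)

def delete_user(row_key):
    """Delete from the primary table and from the index table."""
    old = users.pop(row_key, None)
    if old is not None:
        users_by_lastname.get(old["name:last"], set()).discard(row_key)

def find_by_lastname(last):
    """Two point lookups instead of a scan: index first, then primary."""
    return [users[k] for k in users_by_lastname.get(last, set())]

put_user("u1", {"name:first": "Ada", "name:last": "Lovelace"})
put_user("u2", {"name:first": "Alan", "name:last": "Turing"})
print(find_by_lastname("Turing"))
```

The cost of the pattern is that every write touches two tables, and (since there are no cross-table transactions here) the two can briefly disagree; the benefit is that the by-column lookup becomes two keyed gets rather than a scan of the whole table.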