Going out on a limb, I think it will perform MUCH faster with multiple copies,
as the data is already sitting in each mapper's memory, ready to be accessed
locally. The processing time per mapper should drop dramatically.
With that in mind, you only have to scale up as disk space requires it, and
disk space is cheap.

With your current method, adding three more identical data nodes is only going
to cut your time in half. So unless you have the budget for the number of
machines required, it's worth at least trying multiple copies, since that only
costs you your time.
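
To put rough numbers on the scale-up point, here is a back-of-the-envelope
sketch. It assumes ideal linear scale-up (scan time inversely proportional to
node count), a current cluster of three data nodes (inferred from "three more
cuts it in half"), and a made-up 60-minute job; the function name is
hypothetical.

```python
def estimated_scan_time(current_minutes, current_nodes, added_nodes):
    """Estimate scan time after adding nodes, assuming ideal linear scale-up."""
    # With perfectly even work distribution, time scales as 1 / node count.
    return current_minutes * current_nodes / (current_nodes + added_nodes)


# Assumed figures for illustration only: 3 nodes today, 60-minute scan.
print(estimated_scan_time(60.0, 3, 3))   # doubling the cluster -> 30.0
print(estimated_scan_time(60.0, 3, 27))  # 10x the cluster -> 6.0
```

In other words, under this model a 10x speedup from hardware alone needs
roughly 10x the machines, which is why the extra copies look cheap by
comparison.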

HTH,

Travis Hegner
http://www.travishegner.com/


-----Original Message-----
From: Luke Forehand [mailto:luke.foreh...@networkedinsights.com]
Sent: Tuesday, August 03, 2010 12:37 PM
To: user@hbase.apache.org
Subject: Re: Secondary Index versus Full Table Scan

Edward Capriolo <edlinuxg...@...> writes:

> Generally speaking: If you are doing full range scans of a table
> indexes will not help. Adding indexes will make the performance worse,
> it will take longer to load your data and now fetching the data will
> involve two lookups instead of one.
>
> If you are doing full range scans adding more nodes should result in
> linear scale up.
>
>

Edward,

Can you clarify what "full range scan" means?  I am not doing "full" range
scans, but I am doing relatively large range scans (3 million records), so I
think what you are saying applies.  Thanks for the insight.

We initially implemented the secondary index out of a need to have our main data
sorted by multiple dimensions for various use cases.  Now I'm thinking it may be
better to have multiple copies of our main data, sorted in multiple ways, to
avoid the two lookups.  So I'm faced with two options right now: keep multiple
copies of the data sorted in multiple ways to do range scans, or buy a lot more servers
and do full scans.  Given these two choices, do people have general
recommendations on which makes the most sense?
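
To make the difference between the two access patterns concrete, here is a
minimal in-memory sketch. It is not HBase code: sorted Python lists stand in
for tables sorted by row key, and the record layout (id, timestamp, topic,
body) and all names are made up for illustration.

```python
from bisect import bisect_left

# Hypothetical records: (record_id, timestamp, topic, body).
RECORDS = [
    ("r1", 300, "sports", "body-1"),
    ("r2", 100, "music", "body-2"),
    ("r3", 200, "sports", "body-3"),
    ("r4", 400, "news", "body-4"),
]

# Main table keyed by record_id.
MAIN = {rid: (ts, topic, body) for rid, ts, topic, body in RECORDS}

# Option 1: secondary index holding only (topic, timestamp, record_id).
INDEX = sorted((topic, ts, rid) for rid, ts, topic, body in RECORDS)

# Option 2: a second full copy of the data, sorted by (topic, timestamp).
COPY_BY_TOPIC = sorted((topic, ts, rid, body) for rid, ts, topic, body in RECORDS)


def prefix_range(rows, topic):
    """Return the contiguous slice of sorted rows whose first field is topic."""
    lo = bisect_left(rows, (topic,))
    hi = bisect_left(rows, (topic + "\x00",))
    return rows[lo:hi]


def scan_with_index(topic):
    """Range scan on the index, then one point lookup per hit in the main table."""
    results, point_lookups = [], 0
    for _, ts, rid in prefix_range(INDEX, topic):
        _, _, body = MAIN[rid]  # second lookup, once per matching row
        point_lookups += 1
        results.append((rid, ts, body))
    return results, point_lookups


def scan_copy(topic):
    """Single range scan on the copy; each row already carries the full record."""
    return [(rid, ts, body) for _, ts, rid, body in prefix_range(COPY_BY_TOPIC, topic)]


via_index, lookups = scan_with_index("sports")
via_copy = scan_copy("sports")
print(via_index == via_copy, lookups)  # True 2
```

Both paths return the same rows, but the index path pays one extra lookup per
matching record (3 million of them in my case), while the copy pays in storage
and in writing every record more than once.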

Thanks!
-Luke

