Going out on a limb, I think it will perform MUCH faster with multiple copies, as the data is already sitting in each mapper's memory, ready to be accessed locally. The processing time per mapper should drop dramatically. With that in mind, you only have to scale up as disk space requires it, and disk space is cheap.
With your current method, adding three more identical data nodes is only going to cut your time in half. So unless you have the budget for the number of machines required, it's at least worth a try to have multiple copies; that only costs you your time.

HTH,

Travis Hegner
http://www.travishegner.com/

-----Original Message-----
From: Luke Forehand [mailto:luke.foreh...@networkedinsights.com]
Sent: Tuesday, August 03, 2010 12:37 PM
To: user@hbase.apache.org
Subject: Re: Secondary Index versus Full Table Scan

Edward Capriolo <edlinuxg...@...> writes:

> Generally speaking: If you are doing full range scans of a table
> indexes will not help. Adding indexes will make the performance worse,
> it will take longer to load your data and now fetching the data will
> involve two lookups instead of one.
>
> If you are doing full range scans adding more nodes should result in
> linear scale up.
>
>

Edward,

Can you clarify what "full range scan" means? I am not doing "full" range scans, but I am doing relatively large range scans (3 million records), so I think what you are saying applies. Thanks for the insight.

We initially implemented the secondary index out of a need to have our main data sorted by multiple dimensions for various use cases. Now I'm thinking it may be better to have multiple copies of our main data, sorted in multiple ways, to avoid the two lookups.

So I'm faced with two options right now: keep multiple copies of the data, sorted in multiple ways, and do range scans; or buy a lot more servers and do full scans. Given these two choices, do people have general recommendations on which makes the most sense?

Thanks!

-Luke
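
For what it's worth, the "two lookups instead of one" point in this thread can be illustrated with a toy model. The sketch below is plain Python, not the HBase API: `SortedTable`, the record fields (`id`, `ts`, `body`), and the key layout are all made-up names for illustration, and a counter of read operations stands in for real I/O cost. It compares a range query answered through a secondary index (one index scan, then one get per hit against the main table) with the same query answered from a second copy of the data sorted by the query dimension (one contiguous scan).

```python
import bisect


class SortedTable:
    """Toy stand-in for a table kept sorted by row key (hypothetical, not HBase)."""

    def __init__(self, rows):
        # rows: iterable of (row_key, value) pairs
        self.rows = sorted(rows, key=lambda r: r[0])
        self.keys = [k for k, _ in self.rows]
        self.reads = 0  # number of separate read operations issued

    def scan(self, start, stop):
        """One contiguous read of all rows with start <= key < stop."""
        self.reads += 1
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return self.rows[lo:hi]

    def get(self, key):
        """One point lookup by exact row key."""
        self.reads += 1
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.rows[i][1]
        return None


# Made-up sample data: 50 records with distinct timestamps 1000..1049.
records = [
    {"id": f"r{i:03d}", "ts": 1000 + (i * 7) % 50, "body": f"post {i}"}
    for i in range(50)
]

# Option A: main table keyed by id, plus a secondary index keyed by (ts, id).
main = SortedTable((r["id"], r) for r in records)
index = SortedTable(((r["ts"], r["id"]), r["id"]) for r in records)

# Option B: a second full copy of the data, keyed by (ts, id).
copy_by_ts = SortedTable(((r["ts"], r["id"]), r) for r in records)

start, stop = (1010, ""), (1020, "")  # all records with 1010 <= ts < 1020

# A: scan the index, then fetch each matching row from the main table.
result_a = [main.get(row_id) for _, row_id in index.scan(start, stop)]

# B: a single range scan over the copy sorted by timestamp.
result_b = [rec for _, rec in copy_by_ts.scan(start, stop)]

print("index + main reads:", index.reads + main.reads)
print("sorted copy reads: ", copy_by_ts.reads)
```

Both approaches return the same records; the difference in the model is that the indexed path issues one extra read per matching row, while the sorted copy pays for that with duplicated storage. How that translates to a real cluster depends on row size, caching, and region layout, so treat the counts as an illustration only.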