I think this is one of those "damned if you do..." situations. If you want to do a lot of quick single-record lookups (a Get is actually a Scan under the covers), then a caching value of 1 is what you want. But for MapReduce jobs, or for scanning a large number of records like you're doing, you'll want the value higher.
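To make that tradeoff concrete, here is a back-of-the-envelope sketch (the row count is hypothetical): with scanner caching set to N, the client fetches roughly N rows per round-trip to the region server, so a full scan of R rows costs about ceil(R / N) RPCs, ignoring region boundaries.

```java
// Rough sketch of why scanner caching matters for wide scans.
public class CachingMath {

    // Approximate client round-trips for a scan: ceil(rows / caching).
    static long rpcs(long rows, int caching) {
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1_000_000;  // hypothetical table size
        System.out.println(rpcs(rows, 1));     // 1000000 round-trips at the default
        System.out.println(rpcs(rows, 1000));  // 1000 with scan.setCaching(1000)
    }
}
```

The flip side is memory: each round-trip buffers `caching` rows on the client, so very large values can cause heap pressure or scanner timeouts, which is presumably why the out-of-the-box default is conservative.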
On 1/25/12 1:09 PM, "Jeff Whiting" <je...@qualtrics.com> wrote:

>Does it make sense to have better defaults so the performance out of the
>box is better?
>
>~Jeff
>
>On 1/25/2012 8:06 AM, Peter Wolf wrote:
>> Ah ha! I appear to be insane ;-)
>>
>> Adding the following speeded things up quite a bit
>>
>> scan.setCacheBlocks(true);
>> scan.setCaching(1000);
>>
>> Thank you, it was a duh!
>>
>> P
>>
>> On 1/25/12 8:13 AM, Doug Meil wrote:
>>> Hi there-
>>>
>>> Quick sanity check: what caching level are you using? (default is 1)
>>> I know this is basic, but it's always good to double-check.
>>>
>>> If "language" is already in the lead position of the rowkey, why use
>>> the filter?
>>>
>>> As for EC2, that's a wildcard.
>>>
>>> On 1/25/12 7:56 AM, "Peter Wolf" <opus...@gmail.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am looking for advice on speeding up my Scanning.
>>>>
>>>> I want to iterate over all rows where a particular column (language)
>>>> equals a particular value ("JA").
>>>>
>>>> I am already creating my row keys using that column in the first
>>>> bytes, and I do my scans using partial row matching, like this...
>>>>
>>>>     public static byte[] calculateStartRowKey(String language) {
>>>>         int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>         byte[] language2 = Bytes.toBytes(languageHash);
>>>>         byte[] accountID2 = Bytes.toBytes(0);
>>>>         byte[] timestamp2 = Bytes.toBytes(0);
>>>>         return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>     }
>>>>
>>>>     public static byte[] calculateEndRowKey(String language) {
>>>>         int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>         byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>>         byte[] accountID2 = Bytes.toBytes(0);
>>>>         byte[] timestamp2 = Bytes.toBytes(0);
>>>>         return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>     }
>>>>
>>>>     Scan scan = new Scan(calculateStartRowKey(language),
>>>>             calculateEndRowKey(language));
>>>>
>>>> Since I am using a hash value for the string, I need to re-check the
>>>> column to make sure that some other string does not get the same hash
>>>> value:
>>>>
>>>>     Filter filter = new SingleColumnValueFilter(resultFamily,
>>>>             languageCol, CompareFilter.CompareOp.EQUAL,
>>>>             Bytes.toBytes(language));
>>>>     scan.setFilter(filter);
>>>>
>>>> I am using the Cloudera 0.90.4 release, and a cluster of 3 machines on
>>>> EC2.
>>>>
>>>> I think that this should be really fast, but it is not. Any advice on
>>>> how to debug/speed it up?
>>>>
>>>> Thanks
>>>> Peter

>--
>Jeff Whiting
>Qualtrics Senior Software Engineer
>je...@qualtrics.com
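For readers following along, the row-key scheme quoted above can be exercised standalone without an HBase cluster. The sketch below substitutes `ByteBuffer` for HBase's `Bytes` utility (both emit big-endian 4-byte ints), producing a 12-byte key: a 4-byte language hash followed by zeroed 4-byte account-ID and timestamp slots; the end key simply uses hash + 1 to bound the prefix range. The class and method names are mine, not from the thread.

```java
import java.nio.ByteBuffer;

// Standalone sketch of the thread's row-key layout:
// [ 4-byte language hash | 4-byte accountID | 4-byte timestamp ]
// ByteBuffer defaults to big-endian, matching HBase's Bytes.toBytes(int).
public class RowKeySketch {

    static byte[] key(int languageHash, int accountId, int timestamp) {
        return ByteBuffer.allocate(12)
                .putInt(languageHash)
                .putInt(accountId)
                .putInt(timestamp)
                .array();
    }

    static byte[] startRowKey(String language) {
        int h = language.length() > 0 ? language.hashCode() : 0;
        return key(h, 0, 0);
    }

    // hash + 1 makes the (exclusive) end key cover every row whose
    // key begins with this language's hash prefix.
    static byte[] endRowKey(String language) {
        int h = language.length() > 0 ? language.hashCode() : 0;
        return key(h + 1, 0, 0);
    }

    // Byte-wise unsigned comparison, i.e. the order HBase scans in.
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] start = startRowKey("JA");
        byte[] end = endRowKey("JA");
        System.out.println(start.length);                    // 12
        System.out.println(compareUnsigned(start, end) < 0); // true
    }
}
```

One caveat worth noting: `String.hashCode()` can be negative, and a signed big-endian int does not sort the same way under unsigned byte-wise comparison (negative hashes sort after positive ones). The prefix-range trick still works per language, but it is an extra reason the `SingleColumnValueFilter` re-check in the thread is prudent, alongside ordinary hash collisions.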