Re: Speeding up Scans

Jeff Whiting Wed, 25 Jan 2012 10:09:42 -0800

Does it make sense to have better defaults so the performance out of the box is 
better?


~Jeff

On 1/25/2012 8:06 AM, Peter Wolf wrote:

Ah ha!  I appear to be insane ;-)

Adding the following sped things up quite a bit:

        scan.setCacheBlocks(true);
        scan.setCaching(1000);

Thank you, it was a duh!

P



On 1/25/12 8:13 AM, Doug Meil wrote:

Hi there-

Quick sanity check:  what caching level are you using?  (default is 1)  I
know this is basic, but it's always good to double-check.
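
To see why the caching level matters so much, here is a rough model of how many client-to-server round trips a scan needs. It is only a sketch: the class name, method name and row count are made up for illustration, and it ignores region boundaries, filters and result-size limits, all of which change the real number.

```java
class ScanRoundTrips {
    // Rough estimate: each round trip returns at most `caching` rows,
    // so a scan over `rows` rows needs about ceil(rows / caching) trips.
    static long roundTrips(long rows, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be >= 1");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical table size
        System.out.println("caching=1    -> " + roundTrips(rows, 1) + " round trips");
        System.out.println("caching=1000 -> " + roundTrips(rows, 1000) + " round trips");
    }
}
```

Under this model, going from a caching level of 1 to 1000 cuts the number of round trips by about a factor of 1000, at the cost of more memory held per scanner on the client.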

If "language" is already in the lead position of the rowkey, why use the
filter?

As for EC2, that's a wildcard.





On 1/25/12 7:56 AM, "Peter Wolf"<opus...@gmail.com>  wrote:

Hello all,

I am looking for advice on speeding up my Scanning.

I want to iterate over all rows where a particular column (language)
equals a particular value ("JA").

I am already creating my row keys using that column in the first bytes.
And I do my scans using partial row matching, like this...

     public static byte[] calculateStartRowKey(String language) {
         int languageHash = language.length() > 0 ? language.hashCode() : 0;
         byte[] language2 = Bytes.toBytes(languageHash);
         byte[] accountID2 = Bytes.toBytes(0);
         byte[] timestamp2 = Bytes.toBytes(0);
         return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
     }

     public static byte[] calculateEndRowKey(String language) {
         int languageHash = language.length() > 0 ? language.hashCode() : 0;
         byte[] language2 = Bytes.toBytes(languageHash + 1);
         byte[] accountID2 = Bytes.toBytes(0);
         byte[] timestamp2 = Bytes.toBytes(0);
         return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
     }

     Scan scan = new Scan(calculateStartRowKey(language),
             calculateEndRowKey(language));
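
One edge case in the "hash + 1" end key is worth checking. Row keys are compared as unsigned bytes, so a hash of -1 (bytes FF FF FF FF) plus one wraps to 0 (bytes 00 00 00 00), which sorts before the start key. The sketch below reproduces the 12-byte key layout with plain java.nio instead of the HBase Bytes class, on the assumption that Bytes.toBytes(int) writes 4 big-endian bytes. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

class PrefixRange {
    // 4-byte big-endian hash + two zeroed 4-byte fields, mirroring the
    // calculateStartRowKey/calculateEndRowKey layout quoted above.
    static byte[] key(int languageHash) {
        return ByteBuffer.allocate(12).putInt(languageHash).putInt(0).putInt(0).array();
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // True when the start key sorts strictly before the "hash + 1" end key.
    static boolean rangeIsValid(int languageHash) {
        return compareUnsigned(key(languageHash), key(languageHash + 1)) < 0;
    }

    public static void main(String[] args) {
        System.out.println("JA: " + rangeIsValid("JA".hashCode()));
        System.out.println("hash -1: " + rangeIsValid(-1));
    }
}
```

In this model only a hash of exactly -1 produces an inverted range. Integer.MAX_VALUE + 1 overflows to 0x80000000, which still sorts after 0x7FFFFFFF as unsigned bytes. A guard for the -1 case would close the gap.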


Since I am using a hash value for the string, I need to re-check the
column to make sure that some other string does not get the same hash
value:

     Filter filter = new SingleColumnValueFilter(resultFamily, languageCol,
             CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
     scan.setFilter(filter);
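
The collision concern is real for String.hashCode(): "Aa" and "BB" are a well-known pair with the same hash, so both would land in the same key range. The sketch below shows the collision and the equivalent re-check done on plain strings. The class, the method and the list of scanned values are hypothetical stand-ins, not the HBase filter itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HashCollisionCheck {
    // Keep only values that really equal the wanted language, which is
    // what the EQUAL column filter does for rows in the hashed key range.
    static List<String> keepMatching(List<String> scannedLanguages, String wanted) {
        List<String> kept = new ArrayList<>();
        for (String value : scannedLanguages) {
            if (value.equals(wanted)) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode());
        // Hypothetical column values found in one hashed key range.
        List<String> scanned = Arrays.asList("Aa", "BB", "Aa");
        System.out.println(keepMatching(scanned, "Aa"));
    }
}
```

Without the re-check, a scan for "Aa" in this example would also return the "BB" row.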

I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on
EC2.

I think that this should be really fast, but it is not.  Any advice on
how to debug/speed it up?

Thanks
Peter


--
Jeff Whiting
Qualtrics Senior Software Engineer
je...@qualtrics.com
