I think this is one of those "damned if you do..." situations.  If you
want to do a lot of quick single-record lookups (a Get is actually a Scan
under the covers), then "1" is what you want.  But for MapReduce
jobs, or for scanning over a wide range of rows like you're doing,
you'll want a higher value.
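The arithmetic behind that tradeoff is worth spelling out: with the default caching of 1, a scan over N rows costs roughly N round trips to the RegionServer, while caching=C cuts that to ceil(N/C).  A minimal sketch of the math (row counts are illustrative, not from this thread):

```java
public class ScanCachingMath {
    // RPC round trips needed to stream totalRows back to the client
    // when the scanner returns `caching` rows per trip (ceiling division).
    static long roundTrips(long totalRows, long caching) {
        return (totalRows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        System.out.println(roundTrips(rows, 1));    // one trip per row with the default
        System.out.println(roundTrips(rows, 1000)); // 1000x fewer trips with scan.setCaching(1000)
    }
}
```

The cost is client memory: each trip's rows are buffered on the client side, so a very large caching value can pressure the client heap, which is why one default can't suit both single Gets and MapReduce-style scans.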




On 1/25/12 1:09 PM, "Jeff Whiting" <je...@qualtrics.com> wrote:

>Does it make sense to have better defaults so the performance out of the
>box is better?
>
>~Jeff
>
>On 1/25/2012 8:06 AM, Peter Wolf wrote:
>> Ah ha!  I appear to be insane ;-)
>>
>> Adding the following sped things up quite a bit
>>
>>         scan.setCacheBlocks(true);
>>         scan.setCaching(1000);
>>
>> Thank you, it was a duh!
>>
>> P
>>
>>
>>
>> On 1/25/12 8:13 AM, Doug Meil wrote:
>>> Hi there-
>>>
>>> Quick sanity check:  what caching level are you using?  (default is 1)
>>> I know this is basic, but it's always good to double-check.
>>>
>>> If "language" is already in the lead position of the rowkey, why use
>>> the filter?
>>>
>>> As for EC2, that's a wildcard.
>>>
>>>
>>>
>>>
>>>
>>> On 1/25/12 7:56 AM, "Peter Wolf"<opus...@gmail.com>  wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am looking for advice on speeding up my Scanning.
>>>>
>>>> I want to iterate over all rows where a particular column (language)
>>>> equals a particular value ("JA").
>>>>
>>>> I am already creating my row keys using that column in the first
>>>> bytes.  And I do my scans using partial row matching, like this...
>>>>
>>>>      public static byte[] calculateStartRowKey(String language) {
>>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>          byte[] language2 = Bytes.toBytes(languageHash);
>>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>      }
>>>>
>>>>      public static byte[] calculateEndRowKey(String language) {
>>>>          int languageHash = language.length() > 0 ? language.hashCode() : 0;
>>>>          byte[] language2 = Bytes.toBytes(languageHash + 1);
>>>>          byte[] accountID2 = Bytes.toBytes(0);
>>>>          byte[] timestamp2 = Bytes.toBytes(0);
>>>>          return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
>>>>      }
>>>>
>>>>      Scan scan = new Scan(calculateStartRowKey(language),
>>>> calculateEndRowKey(language));
>>>>
>>>>
>>>> Since I am using a hash value for the string, I need to re-check the
>>>> column to make sure that some other string does not get the same
>>>> hash value.
>>>>
>>>>      Filter filter = new SingleColumnValueFilter(resultFamily,
>>>> languageCol, CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
>>>>      scan.setFilter(filter);
>>>>
>>>> I am using the Cloudera 0.90.4 release, and a cluster of 3 machines on
>>>> EC2.
>>>>
>>>> I think that this should be really fast, but it is not.  Any advice on
>>>> how to debug/speed it up?
>>>>
>>>> Thanks
>>>> Peter
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>-- 
>Jeff Whiting
>Qualtrics Senior Software Engineer
>je...@qualtrics.com
>
>

