Anil: I am having trouble accessing JIRA. Ted Yu and Zhihong Yu are the same person :-)
I think it would be good to remind users of the aggregation client to
narrow the range of their scan; that's why I proposed adding a check of
hasFilter(). (A rough sketch of such a check appears at the end of this
thread.)

Cheers

On Tue, May 15, 2012 at 10:47 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Take your time.
> Once you complete your first submission, subsequent contributions
> would be easier.
>
> On Tue, May 15, 2012 at 10:34 AM, anil gupta <anilgupt...@gmail.com> wrote:
>
>> Hi Ted,
>>
>> I created the JIRA https://issues.apache.org/jira/browse/HBASE-5999
>> for fixing this.
>>
>> Creating the patch might take me some time (due to the learning
>> curve), as this is the first time I will be creating a patch.
>>
>> Thanks,
>> Anil Gupta
>>
>> On Mon, May 14, 2012 at 4:00 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> I was aware of the following change.
>>>
>>> Can you log a JIRA and attach the patch to it?
>>>
>>> Thanks for trying out and improving the aggregation client.
>>>
>>> On Mon, May 14, 2012 at 3:31 PM, anil gupta <anilgupt...@gmail.com> wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> If we change the condition of the if statement in the
>>>> validateParameters method of AggregationClient.java to:
>>>>
>>>>     if (scan == null
>>>>         || (Bytes.equals(scan.getStartRow(), scan.getStopRow())
>>>>             && !Bytes.equals(scan.getStartRow(), HConstants.EMPTY_START_ROW))
>>>>         || (Bytes.compareTo(scan.getStartRow(), scan.getStopRow()) > 0
>>>>             && *!Bytes.equals(scan.getStopRow(), HConstants.EMPTY_END_ROW)*))
>>>>
>>>> the condition marked in bold and italic will handle the case where
>>>> the stopRow is not specified. IMHO, it is not an error if we do not
>>>> specify the stopRow. This is what I was looking for, because in my
>>>> case I did not want to set the stop row since I am using a prefix
>>>> filter. I have tested the code above and it works fine when I only
>>>> specify the startRow. Is this desirable functionality? If yes,
>>>> should it be added to trunk?
>>>>
>>>> Here is the link to the source of AggregationClient:
>>>> http://grepcode.com/file_/repo1.maven.org/maven2/org.apache.hbase/hbase/0.92.0/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java/?v=source
>>>>
>>>> Thanks,
>>>> Anil Gupta
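For reference, the relaxed condition written out in full might look
roughly like the sketch below. The class wrapper and method name are
only for illustration; the real check lives in
AggregationClient#validateParameters and also verifies that exactly one
column family is set on the scan, which is omitted here.

    import java.io.IOException;

    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RelaxedRangeCheckSketch {

      // Relaxed range check: an empty stop row is read as "scan to the
      // end of the table" instead of being rejected as an error.
      static void validateRange(Scan scan) throws IOException {
        if (scan == null
            || (Bytes.equals(scan.getStartRow(), scan.getStopRow())
                && !Bytes.equals(scan.getStartRow(), HConstants.EMPTY_START_ROW))
            || (Bytes.compareTo(scan.getStartRow(), scan.getStopRow()) > 0
                && !Bytes.equals(scan.getStopRow(), HConstants.EMPTY_END_ROW))) {
          throw new IOException(
              "Agg client Exception: Startrow should be smaller than Stoprow");
        }
      }
    }

With this version, a scan that sets only a startRow (stop row left as
HConstants.EMPTY_END_ROW) passes validation, while a genuinely inverted
range is still rejected.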
>>>>> Anil:
>>>>> As code #3 shows, having a stopRow helps narrow the range of rows
>>>>> participating in the aggregation.
>>>>>
>>>>> Do you have a suggestion on how this process can be made more
>>>>> user-friendly?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Mon, May 14, 2012 at 1:47 PM, anil gupta <anilgupt...@gmail.com> wrote:
>>>>>
>>>>>> Hi Ted,
>>>>>>
>>>>>> My bad, I missed a big difference between the Scan object I am
>>>>>> using in my filter and the Scan object used in the coprocessor,
>>>>>> so the scan objects are not the same. Basically, I am filtering on
>>>>>> a prefix of the row key.
>>>>>>
>>>>>> So, in my filter I do this to build the scanner:
>>>>>>
>>>>>> Code 1:
>>>>>>     Filter filter = new PrefixFilter(Bytes.toBytes(strPrefix));
>>>>>>     Scan scan = new Scan();
>>>>>>     scan.setFilter(filter);
>>>>>>     scan.setStartRow(Bytes.toBytes(strPrefix)); // I don't set any stopRow in this scanner.
>>>>>>
>>>>>> In the coprocessor, I do the following for the scanner:
>>>>>>
>>>>>> Code 2:
>>>>>>     Scan scan = new Scan();
>>>>>>     scan.setFilter(new PrefixFilter(Bytes.toBytes(prefix)));
>>>>>>
>>>>>> I do not set a startRow in the code above, because if I use only
>>>>>> the startRow in the coprocessor scanner I get the following
>>>>>> exception (which is why I removed the startRow from the CP scan
>>>>>> code):
>>>>>>
>>>>>>     java.io.IOException: Agg client Exception: Startrow should be smaller than Stoprow
>>>>>>         at org.apache.hadoop.hbase.client.coprocessor.AggregationClient.validateParameters(AggregationClient.java:116)
>>>>>>         at org.apache.hadoop.hbase.client.coprocessor.AggregationClient.max(AggregationClient.java:85)
>>>>>>         at com.intuit.ihub.hbase.poc.DummyClass.doAggregation(DummyClass.java:81)
>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>
>>>>>> I then modified code #2 to add the stopRow as well:
>>>>>>
>>>>>> Code 3:
>>>>>>     Scan scan = new Scan();
>>>>>>     scan.setStartRow(Bytes.toBytes(prefix));
>>>>>>     scan.setStopRow(Bytes.toBytes(String.valueOf(Long.parseLong(prefix) + 1)));
>>>>>>     scan.setFilter(new PrefixFilter(Bytes.toBytes(prefix)));
>>>>>>
>>>>>> When I run the coprocessor with code #3 it is blazing fast: it
>>>>>> gives the result in around 200 milliseconds. :)
>>>>>>
>>>>>> Since this was just a test of the coprocessor, I added the stopRow
>>>>>> logic manually. What is the reason that the Scan object in the
>>>>>> coprocessor always requires a stopRow along with the startRow?
>>>>>> (Code #1 works fine even when I do not use a stopRow.) Can this
>>>>>> restriction be relaxed?
>>>>>>
>>>>>> Thanks,
>>>>>> Anil Gupta
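The stop-row trick in code #3 above (parsing the prefix as a long and
adding one) only works for purely numeric row keys. A more general way
to derive an exclusive stop row from an arbitrary prefix is to drop any
trailing 0xFF bytes and increment the last remaining byte; a minimal
sketch, with illustrative class and method names that are not part of
HBase:

    import java.util.Arrays;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrefixScanSketch {

      // Smallest row key that sorts after every key starting with
      // "prefix": drop trailing 0xFF bytes and increment the last
      // remaining byte. An empty array means "no upper bound" (the
      // prefix was all 0xFF bytes).
      static byte[] stopRowForPrefix(byte[] prefix) {
        for (int i = prefix.length - 1; i >= 0; i--) {
          if (prefix[i] != (byte) 0xFF) {
            byte[] stop = Arrays.copyOf(prefix, i + 1);
            stop[i]++;
            return stop;
          }
        }
        return new byte[0];
      }

      // Scan bounded on both ends, keeping the PrefixFilter as a safety net.
      static Scan prefixScan(String strPrefix) {
        byte[] prefix = Bytes.toBytes(strPrefix);
        Scan scan = new Scan();
        scan.setStartRow(prefix);
        scan.setStopRow(stopRowForPrefix(prefix));
        scan.setFilter(new PrefixFilter(prefix));
        return scan;
      }
    }

Note that the all-0xFF case yields an empty stop row, which the current
validateParameters would still reject; that corner case is exactly what
the relaxation discussed above is about.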
>>>>>>> Anil:
>>>>>>> I think the performance issue was related to your custom filter.
>>>>>>>
>>>>>>> Please tell us more about the filter next time.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Mon, May 14, 2012 at 12:31 PM, anil gupta <anilgupt...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Stack,
>>>>>>>>
>>>>>>>> I'll look into Gary Helmling's post, try to profile the
>>>>>>>> coprocessor, and share the results.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anil Gupta
>>>>>>>>
>>>>>>>> On Mon, May 14, 2012 at 12:08 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>
>>>>>>>>> On Mon, May 14, 2012 at 12:02 PM, anil gupta <anilgupt...@gmail.com> wrote:
>>>>>>>>>> I loaded around 70 thousand 1-2 KB records into HBase. For
>>>>>>>>>> scans with my custom filter I am able to get 97 rows in 500
>>>>>>>>>> milliseconds, while doing sum, max, min (the built-in
>>>>>>>>>> aggregations of HBase) with the same custom filter takes
>>>>>>>>>> 11000 milliseconds. Does this mean that coprocessor
>>>>>>>>>> aggregation is supposed to be around ~20x slower than scans?
>>>>>>>>>> Am I missing any trick here?
>>>>>>>>>
>>>>>>>>> That seems like a high tax to pay for running CPs. Can you dig
>>>>>>>>> in on where the time is being spent? (See another recent note
>>>>>>>>> on this list, or on dev, where Gary Helmling talks about how he
>>>>>>>>> did basic profiling of CPs.)
>>>>>>>>>
>>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Anil Gupta
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Anil Gupta
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Anil Gupta
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
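To close the loop on the hasFilter() idea at the top of this thread:
one lightweight way to remind callers would be to log a warning, rather
than fail, whenever the scan carries a filter but no explicit stop row.
A rough sketch, assuming the check would sit next to validateParameters
(the class name and the use of commons-logging below are only
illustrative):

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanRangeReminderSketch {

      private static final Log LOG =
          LogFactory.getLog(ScanRangeReminderSketch.class);

      // If the caller relies on a filter alone, with no stop row, the
      // aggregation still has to visit every region of the table; log a
      // reminder to narrow the scan range instead of failing the call.
      static void warnIfUnbounded(Scan scan) {
        if (scan.hasFilter()
            && Bytes.equals(scan.getStopRow(), HConstants.EMPTY_END_ROW)) {
          LOG.warn("Scan has a filter but no stop row; the aggregation "
              + "will touch every region. Consider setting start and "
              + "stop rows to narrow the range.");
        }
      }
    }

Whether this should stay a warning or become a hard error is exactly
the user-friendliness question raised earlier in the thread.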