Re: [FastBit-users] PATCH: new CS_PATTERN_MATCH define added to match LIKE patterns case-sensitively and perform specific optimizations

Dominique Prunier Wed, 21 Mar 2012 11:34:39 -0700

Hey John,

I'm back on this topic. Did you have some time to look into this ? If not, i'm 
about to clean my initial (small and dirty) patch and if you have not started 
something on that yet, i could continue to extend it to other functions and 
send you the patch for review. Would that be ok for you ?


Thanks,

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Dominique Prunier
Sent: Sunday, March 18, 2012 11:42 AM
To: K. John Wu
Cc: FastBit Users
Subject: Re: [FastBit-users] PATCH: new CS_PATTERN_MATCH define added to match 
LIKE patterns case-sensitively and perform specific optimizations

Hi John,

I agree with your potential returned value (i'm also never comfortable at 
braking specs). In my test, it was very rought (didnt answered 0 very often). 
I'm not sure if the impact on perf was good or not , but ultimately, that could 
be replaced by a bitevector::isEmpty() method which would be less precise than 
count but hopefully much faster. My guess was that it is probably as expansive  
(if  not morre) to retrieve a bitvector, compute count to possibly skip the 
evaluation of a AND than evaluate the AND directly (except the obvious case 
where i could return 0). It might depend on the dataset though.

Thanks,

________________________________________
From: K. John Wu [[email protected]]
Sent: March-18-12 2:30 AM
To: Dominique Prunier
Cc: FastBit Users
Subject: Re: [FastBit-users] PATCH: new CS_PATTERN_MATCH define added to match 
LIKE patterns case-sensitively and perform specific optimizations

Hi, Dominique,

I have given this some thought.  In general, I agree with your
approach of trying to avoid computing the exact number of ones in the
intermediate bitvectors.  In the current code, the documentation for
doEvaluate and friends promises that the return values to be the
number of 1s in the bitvector (or the number of hits for the
expression), we will need to relax this promise to be something else.
 Probably something along the line of

>  0 : zero or more hits
== 0 : definitely zero hit
<  0  : errors

I will look into making this change in the next few days.

John




On 3/16/12 2:12 PM, Dominique Prunier wrote:
> Hey John,
>
> I'll try to isolate my individual predicates to have an idea of the size.
>
> It doesn't solve the mystery but here is what i'm trying right now and that 
> seems to be promising. I'm changing evaluation methods to only return 0 is 
> they are _sure_ there is no result, 1 if there is a chance and a negative 
> value for errors.
>
> I tried first on category (this is the only column type i've got). I also 
> removed the decompression/compression phase:
>
>         const ibis::bitvector *bv = rlc->getBitvector(tmp[j]);
>         if (bv != 0) {
>             ++ cnt;
> -           est += bv->cnt();
>             if (hits.empty()) {
>                 hits.copy(*bv);
>             }
>             else {
> -               if (cnt > 32 || (j > 3 && cnt*16 > j))
> -                   hits.decompress();
>                 hits |= *bv;
>             }
>         }
>      }
> -    if (est > static_cast<long>(hits.size() >> 7))
> -       hits.compress();
> -    return est;
> +    return cnt == 0 ? 0 : 1;
>
> Similarly, in query::doEvaluate, i removed the calls to count where i can (i 
> only have AND/OR/NOT/STR/LIKE, it would apply at other places):
>
>      case ibis::qExpr::LOGICAL_NOT: {
>         ierr = doEvaluate(term->getLeft(), mask, ht);
>         if (ierr >= 0) {
>             ht.flip();
>             ht &= mask;
> -           ierr = ht.cnt();
> +           ierr = 1;
>         }
>         break;
>      }
>
>      case ibis::qExpr::LOGICAL_OR: {
>         ierr = doEvaluate(term->getLeft(), mask, ht);
> -       if (ierr >= 0 && ht.cnt() < mask.cnt()) {
> +       if (ierr >= 0 ){
>             ibis::bitvector b1;
> -           if (ht.cnt() > mask.bytes() + ht.bytes()) {
> -               std::auto_ptr<ibis::bitvector> newmask(mask - ht);
> -               ierr = doEvaluate(term->getRight(), *newmask, b1);
> -           }
> -           else {
>                 ierr = doEvaluate(term->getRight(), mask, b1);
> -           }
>             if (ierr >= 0)
>                 ht |= b1;
> -           ierr = ht.cnt();
> +           ierr = 1;
>         }
>         break;
>      }
>
>      case ibis::qExpr::STRING: {
>         ierr = mypart->lookforString
>             (*(reinterpret_cast<const ibis::qString*>(term)), ht);
>         if (ierr >= 0) {
>             ht &= mask;
> -           ierr = ht.cnt();
> +           ierr = 1;
>         }
>         break;
>      }
>
>      case ibis::qExpr::LIKE: {
>         ierr = mypart->patternSearch
>             (*(reinterpret_cast<const ibis::qLike*>(term)), ht);
>         if (ierr >= 0) {
>             ht &= mask;
> -           ierr = ht.cnt();
> +           ierr = 1;
>         }
>         break;
>      }
>
> All this made my query test set execution time drop from ~120 sec to ~20 sec. 
> I'd like your opinion on the correctness of this, and if it ends up being 
> correct, we could imagine pushing this a bit further with #ifdef and see if 
> some use cases gets slower with this.
>
> Thanks,
>
> -----Original Message-----
> From: K. John Wu [mailto:[email protected]]
> Sent: Friday, March 16, 2012 4:55 PM
> To: Dominique Prunier
> Cc: FastBit Users
> Subject: Re: [FastBit-users] PATCH: new CS_PATTERN_MATCH define added to 
> match LIKE patterns case-sensitively and perform specific optimizations
>
> Hi, Dominique,
>
> Thanks for digging into this issue.  Regarding my previous question
> about how many answers, actually, the number of intermediate answers
> is more closely related to the expected execution time.  For example,
> if you have a query condition involve a conjunction between two simple
> expressions, e.g., "A < 5 and B > 8," the final answer might be small
> or even 0, but the two terms "A < 5" and "B > 8" might each produce
> millions of hits of their own.   One way or another these millions of
> answers have to be represented internally can computed from some
> information.  Therefore, the cost of query evaluation is more closely
> related to these intermediate answers than to the final solution.
> Maybe it is worthwhile to look into this a little bit.
>
> John
>
>
>
> On 3/16/12 1:48 PM, Dominique Prunier wrote:
>> Hey John,
>>
>> I did a quick experiment of making the evaluation much dumber that it is 
>> right now (evaluate OR and AND completely, ...) and removed the 
>> compression/decompression phase in ibis::category::patternSearch and now i'm 
>> back to my initial timing with the same number of results. I'm not 100% sure 
>> that i did things correctly but so far it seems to work.
>>
>> I'll try to do further testings.
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Dominique Prunier
>> Sent: Friday, March 16, 2012 4:06 PM
>> To: K. John Wu
>> Cc: FastBit Users
>> Subject: Re: [FastBit-users] PATCH: new CS_PATTERN_MATCH define added to 
>> match LIKE patterns case-sensitively and perform specific optimizations
>>
>> Hi John,
>>
>> I ran my 90k query set against the broken version of last Friday which 
>> produced a grand total of 1426117 hits (the sum of hits of each queries). 
>> The fixed version yields a total 1633307, which is not so different. For the 
>> number of bytes, this is bit tougher to evaluate, not sure how i can do this.
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: K. John Wu [mailto:[email protected]]
>> Sent: Friday, March 16, 2012 3:42 PM
>> To: Dominique Prunier
>> Cc: FastBit Users
>> Subject: Re: [FastBit-users] PATCH: new CS_PATTERN_MATCH define added to 
>> match LIKE patterns case-sensitively and perform specific optimizations
>>
>> Hi, Dominique,
>>
>> Thanks for the info.  There is one piece of information that might be
>> able to resolve the mystery about the timing difference.
>>
>> How different are the number of answers you get back from the previous
>> version of the code (incorrect) compared to the correct answers?
>>
>> If the correct answers are much more than reported previously, then
>> simply by the fact there are more answers, it would take longer to
>> process the bitmaps.  The more direct measure is the number of bytes
>> (of memory) used by the bit vector objects.  Presumably, that
>> information is a little harder to gather.
>>
>> John
>>
>>
>> On 3/16/12 12:30 PM, Dominique Prunier wrote:
>>> Hi John,
>>>
>>> I just tried the new code, it seems that it saves a few percents (1-5% 
>>> depending how much pattern matching there is in the query) but we're still 
>>> far from the initial results.
>>>
>>> Thanks,
>>>
>>> -----Original Message-----
>>> From: K. John Wu [mailto:[email protected]]
>>> Sent: Friday, March 16, 2012 2:06 PM
>>> To: Dominique Prunier
>>> Cc: FastBit Users
>>> Subject: Re: [FastBit-users] PATCH: new CS_PATTERN_MATCH define added to 
>>> match LIKE patterns case-sensitively and perform specific optimizations
>>>
>>> Hi, Dominique,
>>>
>>> The function ibis::category::fillIndex now sorts the dictionary before
>>> actually creating the index.  The source code is SVN 492.  You could
>>> give it a try and see if it helps.
>>>
>>> John
>>>
>>>
>>> On 3/16/12 6:34 AM, Dominique Prunier wrote:
>>>> Hey John,
>>>>
>>>> I tried the sort. It ends up speeding up some cases but slowing down some 
>>>> others (i guess when there are many values, you are paying the sort). At 
>>>> the end, on my large query test set, it ended up being slower so i 
>>>> reverted it.
>>>>
>>>> Like i said earlier, this exact same set is now 6 times slower. I think it 
>>>> is related to the fact that my bit vectors are much more complex (and more 
>>>> distributed) and more likely to match something (since the previous 
>>>> results were pure garbage), but i'm not sure why it make such a difference 
>>>> (i'd assume the first N one would be as dispersed that an set of N 
>>>> vectors). The hot spot is ibis::bitvector::do_cnt and i have the 
>>>> impression that it is called quite often with a test like cnt() != 0 or > 
>>>> 0 which could probably be simplified. I'll take a look at that if i have 
>>>> time today.
>>>>
>>>> Thanks,
>>>>
>>>> -----Original Message-----
>>>> From: [email protected] 
>>>> [mailto:[email protected]] On Behalf Of Dominique 
>>>> Prunier
>>>> Sent: Thursday, March 15, 2012 5:08 PM
>>>> To: K. John Wu
>>>> Cc: FastBit Users
>>>> Subject: Re: [FastBit-users] PATCH: new CS_PATTERN_MATCH define added to 
>>>> match LIKE patterns case-sensitively and perform specific optimizations
>>>>
>>>> Hi John,
>>>>
>>>> About the perf, i'm trying to understand what could justify such a 
>>>> difference (my ~90k queries test jumped from 20s to 120s while we should 
>>>> be handling the same number of bit vectors). So far, it seems that ~70% of 
>>>> the time is spent in ibis::bitvector::do_cnt(). I'll try the sort trick to 
>>>> see how it affects performance. I tend to be carefull with profiler 
>>>> results and favor your knowledge of the codebase.
>>>>
>>>> Thanks,
>>>>
>>>> -----Original Message-----
>>>> From: K. John Wu [mailto:[email protected]]
>>>> Sent: Thursday, March 15, 2012 4:56 PM
>>>> To: Dominique Prunier
>>>> Cc: FastBit Users
>>>> Subject: Re: [FastBit-users] PATCH: new CS_PATTERN_MATCH define added to 
>>>> match LIKE patterns case-sensitively and perform specific optimizations
>>>>
>>>> Hi, Dominique,
>>>>
>>>> Regarding the test cases, whenever you are ready to share is fine.
>>>>
>>>> Regarding the performance, my guess is that the bit vectors are more
>>>> spread out in the new corrected approach.  When the bit vectors are
>>>> spread out, FastBit needs to read more bytes in order to get them out.
>>>>  If this is indeed the case, the thing to do might to sort the values
>>>> before returning them from ibis::dictionary::patternSearch with
>>>> something like
>>>>
>>>> std::sort(matches.begin(), matches.end());
>>>>
>>>> John
>>>>
>>>>
>>>> On 3/15/12 12:24 PM, Dominique Prunier wrote:
>>>>> Hey John,
>>>>>
>>>>> The problem is that right know, my test suite mainly tests the
>>>>> import and not so much the queries (i have maybe 10-20 test cases
>>>>> that actually evaluates queries). It has been basically modeled
>>>>> after our existing test suite for MySQL/Oracle and is supposed to
>>>>> prove that the imported data is equivalent to our source.
>>>>>
>>>>> My next step is to do a micro testing framework to test queries. I
>>>>> have a very simplistic example which only uses CATEGORY and LONG
>>>>> (the only two types that we're using), but it is not convenient
>>>>> enough to use.
>>>>>
>>>>> Be sure that i'll post in the group if i have something good enough
>>>>> to be shared :)
>>>>>
>>>>> Regarding the bug i found yesterday, surprisingly enough, fixing it
>>>>> changed the performance quite significantly. I'm still not sure why
>>>>> (the number of selected bitmap was the same), but i'll try to
>>>>> investigate a bit further.
>>>>>
>>>>> Thanks,
>>>>>
>>>> _______________________________________________
>>>> FastBit-users mailing list
>>>> [email protected]
>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] PATCH: new CS_PATTERN_MATCH define added to match LIKE patterns case-sensitively and perform specific optimizations

Reply via email to