Hi, Dominique,

I see what you are trying to do.  The weird_category only has value
"\"val%\"", therefore, the where clause "weird_category LIKE
'\"val\\%'" should match every row, but "weird_category LIKE 'val\\%'"
should match no row.

If you can step into category::patternSearch, you will see that
"val\\%" has been stripped to "val%", which will have the same outcome
as you intended, but it stripped away too many back slashes.  You
intend category::patternSearch to see "val\%" to match the literal
percent sign (%), however, because the back slash was stripped twice,
you only got a bare percent sign left, which means it is a wild card
character, not a literal character as you intended.

My theory is this.  The string "val\\%" becomes "val\%" when it gets
to inside the C code.  The runtime system has stripped away the first
back slash.  The constructor of ibis::qLike take away the second one.

Since we have gone back and forth many times on this, I will wait for
you confirmation before doing anything about it.

John


On 3/29/12 8:09 AM, Dominique Prunier wrote:
> Hey John,
> 
> When i'll have some time, i'll make my test suite dump the queries and 
> expected results so that you try it yourself. It only tests category (and a 
> small bit of long) data types for the things i'm doing but it can be useful.
> 
> In the meantime, here is my test partition (see attached) and the query that 
> generated the segfault with r507 was:
> 
> SELECT weird_category FROM <partition> WHERE weird_category LIKE '"val\\%' 
> (which is not supposed to return any result)
> 
> It might help you understand what could have happened.
> 
> Thanks,
> 
> -----Original Message-----
> From: K. John Wu [mailto:[email protected]]
> Sent: Thursday, March 29, 2012 10:55 AM
> To: Dominique Prunier
> Cc: FastBit Users
> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
> 
> Good to know.  The problem was then I did not check the array bounds.
>  Odd though, I did not think the values could be out of bounds..
> 
> On 3/29/12 7:24 AM, Dominique Prunier wrote:
>> Hi John,
>>
>> I ran the query with r507, and apparently, the problem was there:
>>
>> ==14864== Invalid read of size 8
>> ==14864==    at 0x550067A: ibis::index::sumBins(ibis::array_t<unsigned int> 
>> const&, ibis::bitvector&) const (index.cpp:6371)
>> ==14864==    by 0x576E76D: ibis::category::patternSearch(char const*, 
>> ibis::bitvector&) const (category.cpp:871)
>> ==14864==    by 0x509933C: ibis::part::patternSearch(ibis::qLike const&, 
>> ibis::bitvector&) const (part.cpp:3260)
>> ==14864==    by 0x545FF03: ibis::query::doEvaluate(ibis::qExpr const*, 
>> ibis::bitvector const&, ibis::bitvector&) const (query.cpp:3962)
>> ==14864==    by 0x545B63D: ibis::query::computeHits() (query.cpp:2771)
>> ==14864==    by 0x5452413: ibis::query::evaluate(bool) (query.cpp:847)
>> ==14864==    by 0x587328C: fastbit_build_query (capi.cpp:477)
>> ==14864==    by 0x4030F8: main (main.cpp:38)
>>
>> In r508, the problem is gone !
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: K. John Wu [mailto:[email protected]]
>> Sent: Thursday, March 29, 2012 12:33 AM
>> To: Dominique Prunier
>> Cc: FastBit Users
>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>
>> Hi, Dominique,
>>
>> The stack trace shows that it is invoking a copy constructor of the
>> ibis::bitvector class when it encountered the seg fault.  Not sure
>> what is the problem here.  I have tried to reproduce the problem by
>> modifying an existing test suite check-maurel.  However, the code
>> seems to work.
>>
>> There is a minor change ibis::index::sumBins to check that the
>> incoming array contains only values less than bits.size() (the number
>> of bitvectors stored in an index object - ibis::direkte is an index
>> object).  This might prevent attempting to out-of-bound accesses.  The
>> change is in SVN Revision 508.
>>
>> If you are able to find more information.  Please let me know.
>>
>> John
>>
>>
>> On 3/28/12 2:07 PM, Dominique Prunier wrote:
>>> Woops, r507 segfaults right away in:
>>>
>>> C  [libfastbit.so.0.0.9+0x800fbb]  
>>> ibis::bitvector::bitvector(ibis::bitvector const&)+0x23
>>> C  [libfastbit.so.0.0.9+0x6d16c1]  
>>> ibis::index::sumBins(ibis::array_t<unsigned> const&, ibis::bitvector&) 
>>> const+0x103
>>> C  [libfastbit.so.0.0.9+0x93f76e]  ibis::category::patternSearch(char 
>>> const*, ibis::bitvector&) const+0x3c8
>>> C  [libfastbit.so.0.0.9+0x26a33d]  ibis::part::patternSearch(ibis::qLike 
>>> const&, ibis::bitvector&) const+0xcf
>>> C  [libfastbit.so.0.0.9+0x630f04]  ibis::query::doEvaluate(ibis::qExpr 
>>> const*, ibis::bitvector const&, ibis::bitvector&) const+0xe1a
>>> C  [libfastbit.so.0.0.9+0x62c63e]  ibis::query::computeHits()+0x356
>>> C  [libfastbit.so.0.0.9+0x623414]  ibis::query::evaluate(bool)+0x4a6
>>> C  [libfastbit.so.0.0.9+0xa4428d]  float+0x3c0
>>> C  [jna1347076628273574295.tmp+0x11b20]  float+0x4c
>>>
>>> I'll try to investigate this latter but the last query of my tests it 
>>> executed was
>>>
>>> SELECT weird_category FROM /tmp/junit8378310225719591578 WHERE 
>>> weird_category LIKE '"val\\%'
>>>
>>> I'll try to investigate this, but it might be related to the fact that this 
>>> query is supposed to return no result.
>>>
>>> Thanks,
>>>
>>> -----Original Message-----
>>> From: [email protected] 
>>> [mailto:[email protected]] On Behalf Of Dominique Prunier
>>> Sent: Wednesday, March 28, 2012 4:58 PM
>>> To: K. John Wu
>>> Cc: FastBit Users
>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>>
>>> Hey John,
>>>
>>> I'll have a look to r507, probably tomorrow. To limit the risk, i've chosen 
>>> the well tested r506 as my stable version using the 
>>> FASTBIT_EMPTY_STRING_AS_NULL define.
>>>
>>> My problem with null mask is that i write my partitions in pure Java code. 
>>> It is quite straightforward to write data files and -part.txt, but writing 
>>> a NULL mask is something else. Besides, I don't really need a difference 
>>> between NULL and empty anyway, since treating empty strings as NULLs is no 
>>> different than some RDBMS engines that we support (Oracle does that for 
>>> example).
>>>
>>> I have one quick question about category columns. Do they really have a 
>>> separate NULL mask or do they use the bitmap stored at key 0 as their NULL 
>>> mask ? If so, that means that keeping the FASTBIT_EMPTY_STRING_AS_NULL 
>>> would allow me to indirectly build some real NULL values which would work 
>>> with the NOT NULL syntax (which, by the way, in SQL is IS NOT NULL/IS NULL, 
>>> maybe it would be worth changing it).
>>>
>>> Thanks,
>>>
>>> -----Original Message-----
>>> From: K. John Wu [mailto:[email protected]]
>>> Sent: Wednesday, March 28, 2012 3:26 PM
>>> To: Dominique Prunier
>>> Cc: FastBit Users
>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>>
>>> Hi, Dominique,
>>>
>>> I have added code to accept "colname NOT NULL" in the where clauses.
>>> The new code is in SVN Revision 507.
>>>
>>> The new revision should also consolidate the handling of many
>>> bitvectors in category::patternSearch and address the issue of
>>> possibly missing calls to index::activate (which leads to incorrect
>>> answers).
>>>
>>> You can input null values through tablex::readCSV.  There is an
>>> example in tests/Makefile.am in the way it generates data partition
>>> w7.  The test case 15 of really-small also makes use of the new
>>> expression "NOT NULL".
>>>
>>> Let me know if you spot any problems.
>>>
>>> John
>>>
>>>
>>> On 3/28/12 10:45 AM, Dominique Prunier wrote:
>>>> Hey John,
>>>>
>>>> My guess is that empty strings are used more often that we'd think for the 
>>>> same use case: use them as NULL marker because it is the easiest way to 
>>>> both insert and query (which is exactly what i use them for). Even if it 
>>>> is not an exact synonym of the SQL NULL (especially for propagation), for 
>>>> most use cases, it is close enough. The fact that its key is fixed and 
>>>> well known provides a valid workaround to simulate IS NULL/IS NOT NULL 
>>>> predicates and since it is excluded from most string related predicates 
>>>> (most importantly, a LIKE '%' wont select it), it is a quite easy and 
>>>> intuitive to use.
>>>>
>>>> Besides, the column data file for strings can't represent NULL values so 
>>>> you would have to deal with null masks, which make things significantly 
>>>> more complex (especially, to create a partition from scratch without using 
>>>> the FastBit library) and potentially more resource hungry (have to 
>>>> read/write null mask, ...).
>>>>
>>>> That being said, my only problem with making the distinction between NULL 
>>>> and empty in the dictionary is the current revision's inability to query 
>>>> empty strings and/or NULLs (which is currently worked around by the uint 
>>>> alternative, =0) and the fact that it matches a pattern like '%'. Since 
>>>> there is no guarantee on the empty string key, i don't have any workaround 
>>>> for now.
>>>>
>>>> For the time being, i'll stick with the FASTBIT_EMPTY_STRING_AS_NULL 
>>>> define waiting for the ambiguity to vanish.
>>>>
>>>> Thanks,
>>>>
>>>> -----Original Message-----
>>>> From: K. John Wu [mailto:[email protected]]
>>>> Sent: Wednesday, March 28, 2012 1:22 PM
>>>> To: Dominique Prunier
>>>> Cc: FastBit Users
>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>>>
>>>> The ability to distinguish between empty string and null string can be
>>>> useful.  The ambiguity in the current code is related to some earlier
>>>> design oversight, which we intend to correct.  My hope is that not too
>>>> many people have empty strings that want to do something with, so
>>>> breaking this compatibility is not a big deal.
>>>>
>>>> There are also a problem with the implementation of
>>>> category::patternSearch that neglects to read the bitmaps from index
>>>> files (neglecting to call index::activate).  I am consolidating some
>>>> code to make use of the existing strategies involving summation of a
>>>> large number of bitmaps in ibis::index::sumBits and
>>>> ibis::index::sumBins.  These functions should encapsulate the idea of
>>>> when to decompress bitmaps a lot better than the simple version
>>>> currently in category::patternSearch.
>>>>
>>>> I am doing tests now and will let everyone know when I am ready to
>>>> check the code in.
>>>>
>>>> John
>>>>
>>>>
>>>> On 3/28/12 8:44 AM, Dominique Prunier wrote:
>>>>> Thanks John for your help. It is always very appreciated.
>>>>>
>>>>> With the macro FASTBIT_EMPTY_STRING_AS_NULL enabled, all my test cases 
>>>>> now works. I'll test performance next but it could be a good candidate 
>>>>> for my stable version.
>>>>>
>>>>> About the FASTBIT_EMPTY_STRING_AS_NULL, what do you think would be the 
>>>>> best default ? For my use case, it is obviously to enable it, but for 
>>>>> everybody else i don't know.  My concern is the backward compatibility 
>>>>> here, especially the fact that it influences the index creation, and not 
>>>>> usage. This means that somebody who upgrades won't notice this change 
>>>>> before it regenerates indexes. What's your opinion on this ?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -----Original Message-----
>>>>> From: K. John Wu [mailto:[email protected]]
>>>>> Sent: Wednesday, March 28, 2012 11:20 AM
>>>>> To: Dominique Prunier
>>>>> Cc: FastBit Users
>>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>>>>
>>>>> Hi, Dominique,
>>>>>
>>>>> Thanks for the suggestions and and test cases.  Just checked in a set
>>>>> of changes as SVN Revision 506.  Here is a bit more explanation.
>>>>>
>>>>> On item 1, I have taken the option you've suggested in the first
>>>>> message, i.e., use the macro FASTBIT_EMPTY_STRING_AS_NULL
>>>>>
>>>>> On item 2, I have restored the escaping of using backslash as you
>>>>> requested.
>>>>>
>>>>> Will go through you tests cases next and see what else I need to change.
>>>>>
>>>>> John
>>>>>
>>>>>
>>>>> On 3/28/12 8:02 AM, Dominique Prunier wrote:
>>>>>> Hi John,
>>>>>>
>>>>>> Here is two simple test cases for my two issues with r505.
>>>>>>
>>>>>> 1. empty-strings.zip: there is no way to select empty string anymore
>>>>>>
>>>>>> **** FAILED (bad result count) with where clause << a='' >>
>>>>>>
>>>>>> **** FAILED (bad result count) with where clause << a=0 >>
>>>>>>
>>>>>> **** FAILED (hard) with where clause << a LIKE '' >>
>>>>>>
>>>>>> 2. de-escaping.zip: there is no way to select a string with a reserved 
>>>>>> char in it
>>>>>>
>>>>>> **** FAILED (bad result count) with where clause << a='it\'s good' >>
>>>>>>
>>>>>> Warning -- ibis::whereParser encountered syntax error, unexpected name 
>>>>>> string at location a='it's good':1.7
>>>>>> Warning -- query[96w0-3xn8Tg----1]::setWhereClause -- failed to parse 
>>>>>> the WHERE clause "a='it's good'"
>>>>>> Warning -- fastbit_build_query failed to assign conditions (a='it's 
>>>>>> good') to a query
>>>>>> **** FAILED (hard) with where clause << a='it's good' >>
>>>>>>
>>>>>> Warning -- ibis::whereParser encountered syntax error, unexpected 
>>>>>> $undefined at location a=it's good:1.5
>>>>>> Warning -- query[96w0-3xn8Tg----2]::setWhereClause -- failed to parse 
>>>>>> the WHERE clause "a=it's good"
>>>>>> Warning -- fastbit_build_query failed to assign conditions (a=it's good) 
>>>>>> to a query
>>>>>> **** FAILED (hard) with where clause << a=it's good >>
>>>>>>
>>>>>> With the attached patch, it fixes my test cases (except a LIKE '' that 
>>>>>> didn't work before and a='it's good' and a=it's good whieh are invalid 
>>>>>> where clauses).
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: [email protected] 
>>>>>> [mailto:[email protected]] On Behalf Of Dominique 
>>>>>> Prunier
>>>>>> Sent: Wednesday, March 28, 2012 10:04 AM
>>>>>> To: K. John Wu
>>>>>> Cc: FastBit Users
>>>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>>>>>
>>>>>> Hey John,
>>>>>>
>>>>>> I'm trying r505 right know, i have two questions/remarks:
>>>>>>
>>>>>>  1. I saw a change about the dictionary now able to accept empty strings 
>>>>>> and treat them as normal strings instead of NULLs. This breaks quite a 
>>>>>> lot of my test cases, specifically those which used to test category=0 
>>>>>> or !=0 to simulate the IS NULL/IS NOT NULL because testing for an empty 
>>>>>> string doesn't work (='' or LIKE '' used to fail or return invalid 
>>>>>> results). What would be your recommendation for that use case ? Could we 
>>>>>> add a define for those relying on this behavior ? Something like:
>>>>>>
>>>>>> -    //if (*str == 0) return 0;
>>>>>> +#ifdef FASTBIT_EMPTY_STRING_AS_NULL
>>>>>> +    if (*str == 0) return 0;
>>>>>> +#endif
>>>>>>
>>>>>>  2. I can't find anywhere the de-escaping patch in r505, am i missing 
>>>>>> something ?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: K. John Wu [mailto:[email protected]]
>>>>>> Sent: Wednesday, March 28, 2012 1:55 AM
>>>>>> To: Dominique Prunier
>>>>>> Cc: FastBit Users
>>>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>>>>>
>>>>>> Hi, Dominique,
>>>>>>
>>>>>> There are some updates to involving merge of dictionaries and to
>>>>>> exercise the operations involving unmatched quotes.  The new code is
>>>>>> SVN R505.
>>>>>>
>>>>>> Please let me know if you have any additional questions.
>>>>>>
>>>>>> John
>>>>>>
>>>>>>
>>>>>> On 3/27/12 1:03 PM, Dominique Prunier wrote:
>>>>>>> Hey John,
>>>>>>>
>>>>>>> There is definitely a need for FastBit escaping. The escaping problem 
>>>>>>> is not at the shell level (though we could have one there) since in 
>>>>>>> pure C/C++ code, there's no shell involved when i'm building a where 
>>>>>>> clause from a string. The problem is at the where clause parsing level 
>>>>>>> (in the lexer to be more precise) to be able to express string literals 
>>>>>>> among other things (and not only metas, it is also white spaces, ...).
>>>>>>> Typically, my test that fails is as simple as calling 
>>>>>>> fastbit_build_query(..., ..., "a='it\\'s good'"). This is expected to 
>>>>>>> create a qString << a = it's good >> but now, it creates a qString << a 
>>>>>>> = it\'s good >> which is wrong. The attached patch restores the 
>>>>>>> descaping, but _not_ the double quote stripping (because it is already 
>>>>>>> handled in the lexer). All my test cases works after applying it on 
>>>>>>> r503.
>>>>>>
>>>>>> The constructors for ibis::qString and ibis::qLike really should not
>>>>>> strip away anything.  In your case, you should be able to do the 
>>>>>> following
>>>>>>
>>>>>> .../tcapi data-dir "a=\"it's good\""
>>>>>>
>>>>>> if you are using fastbit_build_query, you can use the same string
>>>>>> fastbit_build_query(..., ..., "a=\"it's good\"");
>>>>>>
>>>>>> Since FastBit regular expression only support four meta characters ? *
>>>>>> _ %.  There is no need to escape anything.  It is probably cleaner to
>>>>>> not introduce stripping of anything special (except the outer most
>>>>>> quotes, which should be only done once).
>>>>>>
>>>>>>>
>>>>>>> About the decompression, thanks for the link, this is very interesting 
>>>>>>> stuff ! But my point here is not about questing the fact the 
>>>>>>> decompression can be better is some case, i was just under the 
>>>>>>> impression that the hit vector given to category::patternSearch was 
>>>>>>> _always_ already decompressed since it is ultimately a bitvector that 
>>>>>>> has been created from scratch for query evaluation (it would need 
>>>>>>> verification though). My guess is that the few percent of performance 
>>>>>>> i'm loosing here are attributable to the check (hits.isCompressed() && 
>>>>>>> hits.bytes()*mult + bv->bytes() > hits.size()), since it gets executed 
>>>>>>> _A LOT_ of times. I'll try to investigate it a little bit further.
>>>>>>
>>>>>> I have rearranged the tests for decompression in layers which
>>>>>> hopefully will eliminate the need to perform more expensive tests in
>>>>>> your case that presumably involve a fairly small number of values.
>>>>>> _______________________________________________
>>>>>> FastBit-users mailing list
>>>>>> [email protected]
>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>> _______________________________________________
>>> FastBit-users mailing list
>>> [email protected]
>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Reply via email to