Hey John,
I found the problem. When the exact value doesn't exists in the dictionary,
operator[] is supposed to return size()+1 but here it returns 3 instead of 2
(it returns raw_.size()+1 instead of key_.size()+1) which make the following
code fail in dictionary::patternSearch:
if (!meta) {
uint32_t code = operator[](prefix.c_str());
if (code != size() + 1) {
matches.push_back(code);
}
return;
}
We probably never saw it before for at least 3 reasons:
* it only affects linear search from dictionary::operator[] since in the other
case it returns raw_.size() so it can't happen is the dic is larger than 16
* index::getBitvector that was previously used in category::patternSearch
validate the given index and return 0 if it is out of bounds
* category::patternSearch was validating that index::getBitvector didn't
return NULL
Thanks,
-----Original Message-----
From: K. John Wu [mailto:[email protected]]
Sent: Thursday, March 29, 2012 3:48 PM
To: Dominique Prunier
Cc: FastBit Users
Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
Hi, Dominique,
The query seg faulted in r507 because ibis::dictionary::patternSearch
placed the number 3 into the output array, however, the dictionary has
only one value "\"val%\"". This creates an opportunity for
ibis::index::sumBins attempt to access bits[3], but there are only two
values in bits. Any idea why is ibis::dictionary::patternSearch
producing 3?
John
On 3/29/12 10:05 AM, Dominique Prunier wrote:
> Hey John,
>
> I'm sorry, my test case was not really "minimal" and too complex for what i
> wanted to show you.
> Please don't change the escaping, the query evaluation works just fine.
>
> Just for your information, this test case was here to validate one thing:
> consistent de-escaping in all layers:
> - C source : "'\"val\\\\%'"
> - Compiled : '"val\\%'
> - After lexer : "val\\%
> - In qLike : "val\%
> Ultimately, this tests try to validate that % escaping works in patternMatch
> by not treating % as a wildcard but as a regular character, thus this is not
> supposed to return any match (it ends up being an equivalent of = "val%). And
> it works just as expected. There is nothing to change about it.
>
> However, it just happened to be a query that used to segfault in r507, but my
> guess is that it was related to one of these specificities:
> * this column only has a single value
> * there is no nulls
> * the query returns no results
>
> To give you a simpler example, the query:
>
> SELECT single_value_category FROM <partition> WHERE single_value_category
> LIKE 'missing'
>
> segfaulted as well in r507 for the same reasons. And this exemple doesn't
> involve any fancy de-escaping but share the same specificities (single value,
> no nulls, no results).
>
> Thanks,
>
> -----Original Message-----
> From: K. John Wu [mailto:[email protected]]
> Sent: Thursday, March 29, 2012 12:41 PM
> To: Dominique Prunier
> Cc: FastBit Users
> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>
> Hi, Dominique,
>
> I see what you are trying to do. The weird_category only has value
> "\"val%\"", therefore, the where clause "weird_category LIKE
> '\"val\\%'" should match every row, but "weird_category LIKE 'val\\%'"
> should match no row.
>
> If you can step into category::patternSearch, you will see that
> "val\\%" has been stripped to "val%", which will have the same outcome
> as you intended, but it stripped away too many back slashes. You
> intend category::patternSearch to see "val\%" to match the literal
> percent sign (%), however, because the back slash was stripped twice,
> you only got a bare percent sign left, which means it is a wild card
> character, not a literal character as you intended.
>
> My theory is this. The string "val\\%" becomes "val\%" when it gets
> to inside the C code. The runtime system has stripped away the first
> back slash. The constructor of ibis::qLike take away the second one.
>
> Since we have gone back and forth many times on this, I will wait for
> you confirmation before doing anything about it.
>
> John
>
>
> On 3/29/12 8:09 AM, Dominique Prunier wrote:
>> Hey John,
>>
>> When i'll have some time, i'll make my test suite dump the queries and
>> expected results so that you try it yourself. It only tests category (and a
>> small bit of long) data types for the things i'm doing but it can be useful.
>>
>> In the meantime, here is my test partition (see attached) and the query that
>> generated the segfault with r507 was:
>>
>> SELECT weird_category FROM <partition> WHERE weird_category LIKE '"val\\%'
>> (which is not supposed to return any result)
>>
>> It might help you understand what could have happened.
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: K. John Wu [mailto:[email protected]]
>> Sent: Thursday, March 29, 2012 10:55 AM
>> To: Dominique Prunier
>> Cc: FastBit Users
>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>
>> Good to know. The problem was then I did not check the array bounds.
>> Odd though, I did not think the values could be out of bounds..
>>
>> On 3/29/12 7:24 AM, Dominique Prunier wrote:
>>> Hi John,
>>>
>>> I ran the query with r507, and apparently, the problem was there:
>>>
>>> ==14864== Invalid read of size 8
>>> ==14864== at 0x550067A: ibis::index::sumBins(ibis::array_t<unsigned int>
>>> const&, ibis::bitvector&) const (index.cpp:6371)
>>> ==14864== by 0x576E76D: ibis::category::patternSearch(char const*,
>>> ibis::bitvector&) const (category.cpp:871)
>>> ==14864== by 0x509933C: ibis::part::patternSearch(ibis::qLike const&,
>>> ibis::bitvector&) const (part.cpp:3260)
>>> ==14864== by 0x545FF03: ibis::query::doEvaluate(ibis::qExpr const*,
>>> ibis::bitvector const&, ibis::bitvector&) const (query.cpp:3962)
>>> ==14864== by 0x545B63D: ibis::query::computeHits() (query.cpp:2771)
>>> ==14864== by 0x5452413: ibis::query::evaluate(bool) (query.cpp:847)
>>> ==14864== by 0x587328C: fastbit_build_query (capi.cpp:477)
>>> ==14864== by 0x4030F8: main (main.cpp:38)
>>>
>>> In r508, the problem is gone !
>>>
>>> Thanks,
>>>
>>> -----Original Message-----
>>> From: K. John Wu [mailto:[email protected]]
>>> Sent: Thursday, March 29, 2012 12:33 AM
>>> To: Dominique Prunier
>>> Cc: FastBit Users
>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>>
>>> Hi, Dominique,
>>>
>>> The stack trace shows that it is invoking a copy constructor of the
>>> ibis::bitvector class when it encountered the seg fault. Not sure
>>> what is the problem here. I have tried to reproduce the problem by
>>> modifying an existing test suite check-maurel. However, the code
>>> seems to work.
>>>
>>> There is a minor change ibis::index::sumBins to check that the
>>> incoming array contains only values less than bits.size() (the number
>>> of bitvectors stored in an index object - ibis::direkte is an index
>>> object). This might prevent attempting to out-of-bound accesses. The
>>> change is in SVN Revision 508.
>>>
>>> If you are able to find more information. Please let me know.
>>>
>>> John
>>>
>>>
>>> On 3/28/12 2:07 PM, Dominique Prunier wrote:
>>>> Woops, r507 segfaults right away in:
>>>>
>>>> C [libfastbit.so.0.0.9+0x800fbb]
>>>> ibis::bitvector::bitvector(ibis::bitvector const&)+0x23
>>>> C [libfastbit.so.0.0.9+0x6d16c1]
>>>> ibis::index::sumBins(ibis::array_t<unsigned> const&, ibis::bitvector&)
>>>> const+0x103
>>>> C [libfastbit.so.0.0.9+0x93f76e] ibis::category::patternSearch(char
>>>> const*, ibis::bitvector&) const+0x3c8
>>>> C [libfastbit.so.0.0.9+0x26a33d] ibis::part::patternSearch(ibis::qLike
>>>> const&, ibis::bitvector&) const+0xcf
>>>> C [libfastbit.so.0.0.9+0x630f04] ibis::query::doEvaluate(ibis::qExpr
>>>> const*, ibis::bitvector const&, ibis::bitvector&) const+0xe1a
>>>> C [libfastbit.so.0.0.9+0x62c63e] ibis::query::computeHits()+0x356
>>>> C [libfastbit.so.0.0.9+0x623414] ibis::query::evaluate(bool)+0x4a6
>>>> C [libfastbit.so.0.0.9+0xa4428d] float+0x3c0
>>>> C [jna1347076628273574295.tmp+0x11b20] float+0x4c
>>>>
>>>> I'll try to investigate this latter but the last query of my tests it
>>>> executed was
>>>>
>>>> SELECT weird_category FROM /tmp/junit8378310225719591578 WHERE
>>>> weird_category LIKE '"val\\%'
>>>>
>>>> I'll try to investigate this, but it might be related to the fact that
>>>> this query is supposed to return no result.
>>>>
>>>> Thanks,
>>>>
>>>> -----Original Message-----
>>>> From: [email protected]
>>>> [mailto:[email protected]] On Behalf Of Dominique
>>>> Prunier
>>>> Sent: Wednesday, March 28, 2012 4:58 PM
>>>> To: K. John Wu
>>>> Cc: FastBit Users
>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>>>
>>>> Hey John,
>>>>
>>>> I'll have a look to r507, probably tomorrow. To limit the risk, i've
>>>> chosen the well tested r506 as my stable version using the
>>>> FASTBIT_EMPTY_STRING_AS_NULL define.
>>>>
>>>> My problem with null mask is that i write my partitions in pure Java code.
>>>> It is quite straightforward to write data files and -part.txt, but writing
>>>> a NULL mask is something else. Besides, I don't really need a difference
>>>> between NULL and empty anyway, since treating empty strings as NULLs is no
>>>> different than some RDBMS engines that we support (Oracle does that for
>>>> example).
>>>>
>>>> I have one quick question about category columns. Do they really have a
>>>> separate NULL mask or do they use the bitmap stored at key 0 as their NULL
>>>> mask ? If so, that means that keeping the FASTBIT_EMPTY_STRING_AS_NULL
>>>> would allow me to indirectly build some real NULL values which would work
>>>> with the NOT NULL syntax (which, by the way, in SQL is IS NOT NULL/IS
>>>> NULL, maybe it would be worth changing it).
>>>>
>>>> Thanks,
>>>>
>>>> -----Original Message-----
>>>> From: K. John Wu [mailto:[email protected]]
>>>> Sent: Wednesday, March 28, 2012 3:26 PM
>>>> To: Dominique Prunier
>>>> Cc: FastBit Users
>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>>>
>>>> Hi, Dominique,
>>>>
>>>> I have added code to accept "colname NOT NULL" in the where clauses.
>>>> The new code is in SVN Revision 507.
>>>>
>>>> The new revision should also consolidate the handling of many
>>>> bitvectors in category::patternSearch and address the issue of
>>>> possibly missing calls to index::activate (which leads to incorrect
>>>> answers).
>>>>
>>>> You can input null values through tablex::readCSV. There is an
>>>> example in tests/Makefile.am in the way it generates data partition
>>>> w7. The test case 15 of really-small also makes use of the new
>>>> expression "NOT NULL".
>>>>
>>>> Let me know if you spot any problems.
>>>>
>>>> John
>>>>
>>>>
>>>> On 3/28/12 10:45 AM, Dominique Prunier wrote:
>>>>> Hey John,
>>>>>
>>>>> My guess is that empty strings are used more often that we'd think for
>>>>> the same use case: use them as NULL marker because it is the easiest way
>>>>> to both insert and query (which is exactly what i use them for). Even if
>>>>> it is not an exact synonym of the SQL NULL (especially for propagation),
>>>>> for most use cases, it is close enough. The fact that its key is fixed
>>>>> and well known provides a valid workaround to simulate IS NULL/IS NOT
>>>>> NULL predicates and since it is excluded from most string related
>>>>> predicates (most importantly, a LIKE '%' wont select it), it is a quite
>>>>> easy and intuitive to use.
>>>>>
>>>>> Besides, the column data file for strings can't represent NULL values so
>>>>> you would have to deal with null masks, which make things significantly
>>>>> more complex (especially, to create a partition from scratch without
>>>>> using the FastBit library) and potentially more resource hungry (have to
>>>>> read/write null mask, ...).
>>>>>
>>>>> That being said, my only problem with making the distinction between NULL
>>>>> and empty in the dictionary is the current revision's inability to query
>>>>> empty strings and/or NULLs (which is currently worked around by the uint
>>>>> alternative, =0) and the fact that it matches a pattern like '%'. Since
>>>>> there is no guarantee on the empty string key, i don't have any
>>>>> workaround for now.
>>>>>
>>>>> For the time being, i'll stick with the FASTBIT_EMPTY_STRING_AS_NULL
>>>>> define waiting for the ambiguity to vanish.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -----Original Message-----
>>>>> From: K. John Wu [mailto:[email protected]]
>>>>> Sent: Wednesday, March 28, 2012 1:22 PM
>>>>> To: Dominique Prunier
>>>>> Cc: FastBit Users
>>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>>>>
>>>>> The ability to distinguish between empty string and null string can be
>>>>> useful. The ambiguity in the current code is related to some earlier
>>>>> design oversight, which we intend to correct. My hope is that not too
>>>>> many people have empty strings that want to do something with, so
>>>>> breaking this compatibility is not a big deal.
>>>>>
>>>>> There are also a problem with the implementation of
>>>>> category::patternSearch that neglects to read the bitmaps from index
>>>>> files (neglecting to call index::activate). I am consolidating some
>>>>> code to make use of the existing strategies involving summation of a
>>>>> large number of bitmaps in ibis::index::sumBits and
>>>>> ibis::index::sumBins. These functions should encapsulate the idea of
>>>>> when to decompress bitmaps a lot better than the simple version
>>>>> currently in category::patternSearch.
>>>>>
>>>>> I am doing tests now and will let everyone know when I am ready to
>>>>> check the code in.
>>>>>
>>>>> John
>>>>>
>>>>>
>>>>> On 3/28/12 8:44 AM, Dominique Prunier wrote:
>>>>>> Thanks John for your help. It is always very appreciated.
>>>>>>
>>>>>> With the macro FASTBIT_EMPTY_STRING_AS_NULL enabled, all my test cases
>>>>>> now works. I'll test performance next but it could be a good candidate
>>>>>> for my stable version.
>>>>>>
>>>>>> About the FASTBIT_EMPTY_STRING_AS_NULL, what do you think would be the
>>>>>> best default ? For my use case, it is obviously to enable it, but for
>>>>>> everybody else i don't know. My concern is the backward compatibility
>>>>>> here, especially the fact that it influences the index creation, and not
>>>>>> usage. This means that somebody who upgrades won't notice this change
>>>>>> before it regenerates indexes. What's your opinion on this ?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: K. John Wu [mailto:[email protected]]
>>>>>> Sent: Wednesday, March 28, 2012 11:20 AM
>>>>>> To: Dominique Prunier
>>>>>> Cc: FastBit Users
>>>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>>>>>
>>>>>> Hi, Dominique,
>>>>>>
>>>>>> Thanks for the suggestions and and test cases. Just checked in a set
>>>>>> of changes as SVN Revision 506. Here is a bit more explanation.
>>>>>>
>>>>>> On item 1, I have taken the option you've suggested in the first
>>>>>> message, i.e., use the macro FASTBIT_EMPTY_STRING_AS_NULL
>>>>>>
>>>>>> On item 2, I have restored the escaping of using backslash as you
>>>>>> requested.
>>>>>>
>>>>>> Will go through you tests cases next and see what else I need to change.
>>>>>>
>>>>>> John
>>>>>>
>>>>>>
>>>>>> On 3/28/12 8:02 AM, Dominique Prunier wrote:
>>>>>>> Hi John,
>>>>>>>
>>>>>>> Here is two simple test cases for my two issues with r505.
>>>>>>>
>>>>>>> 1. empty-strings.zip: there is no way to select empty string anymore
>>>>>>>
>>>>>>> **** FAILED (bad result count) with where clause << a='' >>
>>>>>>>
>>>>>>> **** FAILED (bad result count) with where clause << a=0 >>
>>>>>>>
>>>>>>> **** FAILED (hard) with where clause << a LIKE '' >>
>>>>>>>
>>>>>>> 2. de-escaping.zip: there is no way to select a string with a reserved
>>>>>>> char in it
>>>>>>>
>>>>>>> **** FAILED (bad result count) with where clause << a='it\'s good' >>
>>>>>>>
>>>>>>> Warning -- ibis::whereParser encountered syntax error, unexpected name
>>>>>>> string at location a='it's good':1.7
>>>>>>> Warning -- query[96w0-3xn8Tg----1]::setWhereClause -- failed to parse
>>>>>>> the WHERE clause "a='it's good'"
>>>>>>> Warning -- fastbit_build_query failed to assign conditions (a='it's
>>>>>>> good') to a query
>>>>>>> **** FAILED (hard) with where clause << a='it's good' >>
>>>>>>>
>>>>>>> Warning -- ibis::whereParser encountered syntax error, unexpected
>>>>>>> $undefined at location a=it's good:1.5
>>>>>>> Warning -- query[96w0-3xn8Tg----2]::setWhereClause -- failed to parse
>>>>>>> the WHERE clause "a=it's good"
>>>>>>> Warning -- fastbit_build_query failed to assign conditions (a=it's
>>>>>>> good) to a query
>>>>>>> **** FAILED (hard) with where clause << a=it's good >>
>>>>>>>
>>>>>>> With the attached patch, it fixes my test cases (except a LIKE '' that
>>>>>>> didn't work before and a='it's good' and a=it's good whieh are invalid
>>>>>>> where clauses).
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: [email protected]
>>>>>>> [mailto:[email protected]] On Behalf Of Dominique
>>>>>>> Prunier
>>>>>>> Sent: Wednesday, March 28, 2012 10:04 AM
>>>>>>> To: K. John Wu
>>>>>>> Cc: FastBit Users
>>>>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>>>>>>
>>>>>>> Hey John,
>>>>>>>
>>>>>>> I'm trying r505 right know, i have two questions/remarks:
>>>>>>>
>>>>>>> 1. I saw a change about the dictionary now able to accept empty
>>>>>>> strings and treat them as normal strings instead of NULLs. This breaks
>>>>>>> quite a lot of my test cases, specifically those which used to test
>>>>>>> category=0 or !=0 to simulate the IS NULL/IS NOT NULL because testing
>>>>>>> for an empty string doesn't work (='' or LIKE '' used to fail or return
>>>>>>> invalid results). What would be your recommendation for that use case ?
>>>>>>> Could we add a define for those relying on this behavior ? Something
>>>>>>> like:
>>>>>>>
>>>>>>> - //if (*str == 0) return 0;
>>>>>>> +#ifdef FASTBIT_EMPTY_STRING_AS_NULL
>>>>>>> + if (*str == 0) return 0;
>>>>>>> +#endif
>>>>>>>
>>>>>>> 2. I can't find anywhere the de-escaping patch in r505, am i missing
>>>>>>> something ?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: K. John Wu [mailto:[email protected]]
>>>>>>> Sent: Wednesday, March 28, 2012 1:55 AM
>>>>>>> To: Dominique Prunier
>>>>>>> Cc: FastBit Users
>>>>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501
>>>>>>>
>>>>>>> Hi, Dominique,
>>>>>>>
>>>>>>> There are some updates to involving merge of dictionaries and to
>>>>>>> exercise the operations involving unmatched quotes. The new code is
>>>>>>> SVN R505.
>>>>>>>
>>>>>>> Please let me know if you have any additional questions.
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>>
>>>>>>> On 3/27/12 1:03 PM, Dominique Prunier wrote:
>>>>>>>> Hey John,
>>>>>>>>
>>>>>>>> There is definitely a need for FastBit escaping. The escaping problem
>>>>>>>> is not at the shell level (though we could have one there) since in
>>>>>>>> pure C/C++ code, there's no shell involved when i'm building a where
>>>>>>>> clause from a string. The problem is at the where clause parsing level
>>>>>>>> (in the lexer to be more precise) to be able to express string
>>>>>>>> literals among other things (and not only metas, it is also white
>>>>>>>> spaces, ...).
>>>>>>>> Typically, my test that fails is as simple as calling
>>>>>>>> fastbit_build_query(..., ..., "a='it\\'s good'"). This is expected to
>>>>>>>> create a qString << a = it's good >> but now, it creates a qString <<
>>>>>>>> a = it\'s good >> which is wrong. The attached patch restores the
>>>>>>>> descaping, but _not_ the double quote stripping (because it is already
>>>>>>>> handled in the lexer). All my test cases works after applying it on
>>>>>>>> r503.
>>>>>>>
>>>>>>> The constructors for ibis::qString and ibis::qLike really should not
>>>>>>> strip away anything. In your case, you should be able to do the
>>>>>>> following
>>>>>>>
>>>>>>> .../tcapi data-dir "a=\"it's good\""
>>>>>>>
>>>>>>> if you are using fastbit_build_query, you can use the same string
>>>>>>> fastbit_build_query(..., ..., "a=\"it's good\"");
>>>>>>>
>>>>>>> Since FastBit regular expression only support four meta characters ? *
>>>>>>> _ %. There is no need to escape anything. It is probably cleaner to
>>>>>>> not introduce stripping of anything special (except the outer most
>>>>>>> quotes, which should be only done once).
>>>>>>>
>>>>>>>>
>>>>>>>> About the decompression, thanks for the link, this is very interesting
>>>>>>>> stuff ! But my point here is not about questing the fact the
>>>>>>>> decompression can be better is some case, i was just under the
>>>>>>>> impression that the hit vector given to category::patternSearch was
>>>>>>>> _always_ already decompressed since it is ultimately a bitvector that
>>>>>>>> has been created from scratch for query evaluation (it would need
>>>>>>>> verification though). My guess is that the few percent of performance
>>>>>>>> i'm loosing here are attributable to the check (hits.isCompressed() &&
>>>>>>>> hits.bytes()*mult + bv->bytes() > hits.size()), since it gets executed
>>>>>>>> _A LOT_ of times. I'll try to investigate it a little bit further.
>>>>>>>
>>>>>>> I have rearranged the tests for decompression in layers which
>>>>>>> hopefully will eliminate the need to perform more expensive tests in
>>>>>>> your case that presumably involve a fairly small number of values.
>>>>>>> _______________________________________________
>>>>>>> FastBit-users mailing list
>>>>>>> [email protected]
>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>> _______________________________________________
>>>> FastBit-users mailing list
>>>> [email protected]
>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users