Hi, Dominique, The query seg faulted in r507 because ibis::dictionary::patternSearch placed the number 3 into the output array, however, the dictionary has only one value "\"val%\"". This creates an opportunity for ibis::index::sumBins attempt to access bits[3], but there are only two values in bits. Any idea why is ibis::dictionary::patternSearch producing 3?
John On 3/29/12 10:05 AM, Dominique Prunier wrote: > Hey John, > > I'm sorry, my test case was not really "minimal" and too complex for what i > wanted to show you. > Please don't change the escaping, the query evaluation works just fine. > > Just for your information, this test case was here to validate one thing: > consistent de-escaping in all layers: > - C source : "'\"val\\\\%'" > - Compiled : '"val\\%' > - After lexer : "val\\% > - In qLike : "val\% > Ultimately, this tests try to validate that % escaping works in patternMatch > by not treating % as a wildcard but as a regular character, thus this is not > supposed to return any match (it ends up being an equivalent of = "val%). And > it works just as expected. There is nothing to change about it. > > However, it just happened to be a query that used to segfault in r507, but my > guess is that it was related to one of these specificities: > * this column only has a single value > * there is no nulls > * the query returns no results > > To give you a simpler example, the query: > > SELECT single_value_category FROM <partition> WHERE single_value_category > LIKE 'missing' > > segfaulted as well in r507 for the same reasons. And this exemple doesn't > involve any fancy de-escaping but share the same specificities (single value, > no nulls, no results). > > Thanks, > > -----Original Message----- > From: K. John Wu [mailto:[email protected]] > Sent: Thursday, March 29, 2012 12:41 PM > To: Dominique Prunier > Cc: FastBit Users > Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 > > Hi, Dominique, > > I see what you are trying to do. The weird_category only has value > "\"val%\"", therefore, the where clause "weird_category LIKE > '\"val\\%'" should match every row, but "weird_category LIKE 'val\\%'" > should match no row. > > If you can step into category::patternSearch, you will see that > "val\\%" has been stripped to "val%", which will have the same outcome > as you intended, but it stripped away too many back slashes. You > intend category::patternSearch to see "val\%" to match the literal > percent sign (%), however, because the back slash was stripped twice, > you only got a bare percent sign left, which means it is a wild card > character, not a literal character as you intended. > > My theory is this. The string "val\\%" becomes "val\%" when it gets > to inside the C code. The runtime system has stripped away the first > back slash. The constructor of ibis::qLike take away the second one. > > Since we have gone back and forth many times on this, I will wait for > you confirmation before doing anything about it. > > John > > > On 3/29/12 8:09 AM, Dominique Prunier wrote: >> Hey John, >> >> When i'll have some time, i'll make my test suite dump the queries and >> expected results so that you try it yourself. It only tests category (and a >> small bit of long) data types for the things i'm doing but it can be useful. >> >> In the meantime, here is my test partition (see attached) and the query that >> generated the segfault with r507 was: >> >> SELECT weird_category FROM <partition> WHERE weird_category LIKE '"val\\%' >> (which is not supposed to return any result) >> >> It might help you understand what could have happened. >> >> Thanks, >> >> -----Original Message----- >> From: K. John Wu [mailto:[email protected]] >> Sent: Thursday, March 29, 2012 10:55 AM >> To: Dominique Prunier >> Cc: FastBit Users >> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 >> >> Good to know. The problem was then I did not check the array bounds. >> Odd though, I did not think the values could be out of bounds.. >> >> On 3/29/12 7:24 AM, Dominique Prunier wrote: >>> Hi John, >>> >>> I ran the query with r507, and apparently, the problem was there: >>> >>> ==14864== Invalid read of size 8 >>> ==14864== at 0x550067A: ibis::index::sumBins(ibis::array_t<unsigned int> >>> const&, ibis::bitvector&) const (index.cpp:6371) >>> ==14864== by 0x576E76D: ibis::category::patternSearch(char const*, >>> ibis::bitvector&) const (category.cpp:871) >>> ==14864== by 0x509933C: ibis::part::patternSearch(ibis::qLike const&, >>> ibis::bitvector&) const (part.cpp:3260) >>> ==14864== by 0x545FF03: ibis::query::doEvaluate(ibis::qExpr const*, >>> ibis::bitvector const&, ibis::bitvector&) const (query.cpp:3962) >>> ==14864== by 0x545B63D: ibis::query::computeHits() (query.cpp:2771) >>> ==14864== by 0x5452413: ibis::query::evaluate(bool) (query.cpp:847) >>> ==14864== by 0x587328C: fastbit_build_query (capi.cpp:477) >>> ==14864== by 0x4030F8: main (main.cpp:38) >>> >>> In r508, the problem is gone ! >>> >>> Thanks, >>> >>> -----Original Message----- >>> From: K. John Wu [mailto:[email protected]] >>> Sent: Thursday, March 29, 2012 12:33 AM >>> To: Dominique Prunier >>> Cc: FastBit Users >>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 >>> >>> Hi, Dominique, >>> >>> The stack trace shows that it is invoking a copy constructor of the >>> ibis::bitvector class when it encountered the seg fault. Not sure >>> what is the problem here. I have tried to reproduce the problem by >>> modifying an existing test suite check-maurel. However, the code >>> seems to work. >>> >>> There is a minor change ibis::index::sumBins to check that the >>> incoming array contains only values less than bits.size() (the number >>> of bitvectors stored in an index object - ibis::direkte is an index >>> object). This might prevent attempting to out-of-bound accesses. The >>> change is in SVN Revision 508. >>> >>> If you are able to find more information. Please let me know. >>> >>> John >>> >>> >>> On 3/28/12 2:07 PM, Dominique Prunier wrote: >>>> Woops, r507 segfaults right away in: >>>> >>>> C [libfastbit.so.0.0.9+0x800fbb] >>>> ibis::bitvector::bitvector(ibis::bitvector const&)+0x23 >>>> C [libfastbit.so.0.0.9+0x6d16c1] >>>> ibis::index::sumBins(ibis::array_t<unsigned> const&, ibis::bitvector&) >>>> const+0x103 >>>> C [libfastbit.so.0.0.9+0x93f76e] ibis::category::patternSearch(char >>>> const*, ibis::bitvector&) const+0x3c8 >>>> C [libfastbit.so.0.0.9+0x26a33d] ibis::part::patternSearch(ibis::qLike >>>> const&, ibis::bitvector&) const+0xcf >>>> C [libfastbit.so.0.0.9+0x630f04] ibis::query::doEvaluate(ibis::qExpr >>>> const*, ibis::bitvector const&, ibis::bitvector&) const+0xe1a >>>> C [libfastbit.so.0.0.9+0x62c63e] ibis::query::computeHits()+0x356 >>>> C [libfastbit.so.0.0.9+0x623414] ibis::query::evaluate(bool)+0x4a6 >>>> C [libfastbit.so.0.0.9+0xa4428d] float+0x3c0 >>>> C [jna1347076628273574295.tmp+0x11b20] float+0x4c >>>> >>>> I'll try to investigate this latter but the last query of my tests it >>>> executed was >>>> >>>> SELECT weird_category FROM /tmp/junit8378310225719591578 WHERE >>>> weird_category LIKE '"val\\%' >>>> >>>> I'll try to investigate this, but it might be related to the fact that >>>> this query is supposed to return no result. >>>> >>>> Thanks, >>>> >>>> -----Original Message----- >>>> From: [email protected] >>>> [mailto:[email protected]] On Behalf Of Dominique >>>> Prunier >>>> Sent: Wednesday, March 28, 2012 4:58 PM >>>> To: K. John Wu >>>> Cc: FastBit Users >>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 >>>> >>>> Hey John, >>>> >>>> I'll have a look to r507, probably tomorrow. To limit the risk, i've >>>> chosen the well tested r506 as my stable version using the >>>> FASTBIT_EMPTY_STRING_AS_NULL define. >>>> >>>> My problem with null mask is that i write my partitions in pure Java code. >>>> It is quite straightforward to write data files and -part.txt, but writing >>>> a NULL mask is something else. Besides, I don't really need a difference >>>> between NULL and empty anyway, since treating empty strings as NULLs is no >>>> different than some RDBMS engines that we support (Oracle does that for >>>> example). >>>> >>>> I have one quick question about category columns. Do they really have a >>>> separate NULL mask or do they use the bitmap stored at key 0 as their NULL >>>> mask ? If so, that means that keeping the FASTBIT_EMPTY_STRING_AS_NULL >>>> would allow me to indirectly build some real NULL values which would work >>>> with the NOT NULL syntax (which, by the way, in SQL is IS NOT NULL/IS >>>> NULL, maybe it would be worth changing it). >>>> >>>> Thanks, >>>> >>>> -----Original Message----- >>>> From: K. John Wu [mailto:[email protected]] >>>> Sent: Wednesday, March 28, 2012 3:26 PM >>>> To: Dominique Prunier >>>> Cc: FastBit Users >>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 >>>> >>>> Hi, Dominique, >>>> >>>> I have added code to accept "colname NOT NULL" in the where clauses. >>>> The new code is in SVN Revision 507. >>>> >>>> The new revision should also consolidate the handling of many >>>> bitvectors in category::patternSearch and address the issue of >>>> possibly missing calls to index::activate (which leads to incorrect >>>> answers). >>>> >>>> You can input null values through tablex::readCSV. There is an >>>> example in tests/Makefile.am in the way it generates data partition >>>> w7. The test case 15 of really-small also makes use of the new >>>> expression "NOT NULL". >>>> >>>> Let me know if you spot any problems. >>>> >>>> John >>>> >>>> >>>> On 3/28/12 10:45 AM, Dominique Prunier wrote: >>>>> Hey John, >>>>> >>>>> My guess is that empty strings are used more often that we'd think for >>>>> the same use case: use them as NULL marker because it is the easiest way >>>>> to both insert and query (which is exactly what i use them for). Even if >>>>> it is not an exact synonym of the SQL NULL (especially for propagation), >>>>> for most use cases, it is close enough. The fact that its key is fixed >>>>> and well known provides a valid workaround to simulate IS NULL/IS NOT >>>>> NULL predicates and since it is excluded from most string related >>>>> predicates (most importantly, a LIKE '%' wont select it), it is a quite >>>>> easy and intuitive to use. >>>>> >>>>> Besides, the column data file for strings can't represent NULL values so >>>>> you would have to deal with null masks, which make things significantly >>>>> more complex (especially, to create a partition from scratch without >>>>> using the FastBit library) and potentially more resource hungry (have to >>>>> read/write null mask, ...). >>>>> >>>>> That being said, my only problem with making the distinction between NULL >>>>> and empty in the dictionary is the current revision's inability to query >>>>> empty strings and/or NULLs (which is currently worked around by the uint >>>>> alternative, =0) and the fact that it matches a pattern like '%'. Since >>>>> there is no guarantee on the empty string key, i don't have any >>>>> workaround for now. >>>>> >>>>> For the time being, i'll stick with the FASTBIT_EMPTY_STRING_AS_NULL >>>>> define waiting for the ambiguity to vanish. >>>>> >>>>> Thanks, >>>>> >>>>> -----Original Message----- >>>>> From: K. John Wu [mailto:[email protected]] >>>>> Sent: Wednesday, March 28, 2012 1:22 PM >>>>> To: Dominique Prunier >>>>> Cc: FastBit Users >>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 >>>>> >>>>> The ability to distinguish between empty string and null string can be >>>>> useful. The ambiguity in the current code is related to some earlier >>>>> design oversight, which we intend to correct. My hope is that not too >>>>> many people have empty strings that want to do something with, so >>>>> breaking this compatibility is not a big deal. >>>>> >>>>> There are also a problem with the implementation of >>>>> category::patternSearch that neglects to read the bitmaps from index >>>>> files (neglecting to call index::activate). I am consolidating some >>>>> code to make use of the existing strategies involving summation of a >>>>> large number of bitmaps in ibis::index::sumBits and >>>>> ibis::index::sumBins. These functions should encapsulate the idea of >>>>> when to decompress bitmaps a lot better than the simple version >>>>> currently in category::patternSearch. >>>>> >>>>> I am doing tests now and will let everyone know when I am ready to >>>>> check the code in. >>>>> >>>>> John >>>>> >>>>> >>>>> On 3/28/12 8:44 AM, Dominique Prunier wrote: >>>>>> Thanks John for your help. It is always very appreciated. >>>>>> >>>>>> With the macro FASTBIT_EMPTY_STRING_AS_NULL enabled, all my test cases >>>>>> now works. I'll test performance next but it could be a good candidate >>>>>> for my stable version. >>>>>> >>>>>> About the FASTBIT_EMPTY_STRING_AS_NULL, what do you think would be the >>>>>> best default ? For my use case, it is obviously to enable it, but for >>>>>> everybody else i don't know. My concern is the backward compatibility >>>>>> here, especially the fact that it influences the index creation, and not >>>>>> usage. This means that somebody who upgrades won't notice this change >>>>>> before it regenerates indexes. What's your opinion on this ? >>>>>> >>>>>> Thanks, >>>>>> >>>>>> -----Original Message----- >>>>>> From: K. John Wu [mailto:[email protected]] >>>>>> Sent: Wednesday, March 28, 2012 11:20 AM >>>>>> To: Dominique Prunier >>>>>> Cc: FastBit Users >>>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 >>>>>> >>>>>> Hi, Dominique, >>>>>> >>>>>> Thanks for the suggestions and and test cases. Just checked in a set >>>>>> of changes as SVN Revision 506. Here is a bit more explanation. >>>>>> >>>>>> On item 1, I have taken the option you've suggested in the first >>>>>> message, i.e., use the macro FASTBIT_EMPTY_STRING_AS_NULL >>>>>> >>>>>> On item 2, I have restored the escaping of using backslash as you >>>>>> requested. >>>>>> >>>>>> Will go through you tests cases next and see what else I need to change. >>>>>> >>>>>> John >>>>>> >>>>>> >>>>>> On 3/28/12 8:02 AM, Dominique Prunier wrote: >>>>>>> Hi John, >>>>>>> >>>>>>> Here is two simple test cases for my two issues with r505. >>>>>>> >>>>>>> 1. empty-strings.zip: there is no way to select empty string anymore >>>>>>> >>>>>>> **** FAILED (bad result count) with where clause << a='' >> >>>>>>> >>>>>>> **** FAILED (bad result count) with where clause << a=0 >> >>>>>>> >>>>>>> **** FAILED (hard) with where clause << a LIKE '' >> >>>>>>> >>>>>>> 2. de-escaping.zip: there is no way to select a string with a reserved >>>>>>> char in it >>>>>>> >>>>>>> **** FAILED (bad result count) with where clause << a='it\'s good' >> >>>>>>> >>>>>>> Warning -- ibis::whereParser encountered syntax error, unexpected name >>>>>>> string at location a='it's good':1.7 >>>>>>> Warning -- query[96w0-3xn8Tg----1]::setWhereClause -- failed to parse >>>>>>> the WHERE clause "a='it's good'" >>>>>>> Warning -- fastbit_build_query failed to assign conditions (a='it's >>>>>>> good') to a query >>>>>>> **** FAILED (hard) with where clause << a='it's good' >> >>>>>>> >>>>>>> Warning -- ibis::whereParser encountered syntax error, unexpected >>>>>>> $undefined at location a=it's good:1.5 >>>>>>> Warning -- query[96w0-3xn8Tg----2]::setWhereClause -- failed to parse >>>>>>> the WHERE clause "a=it's good" >>>>>>> Warning -- fastbit_build_query failed to assign conditions (a=it's >>>>>>> good) to a query >>>>>>> **** FAILED (hard) with where clause << a=it's good >> >>>>>>> >>>>>>> With the attached patch, it fixes my test cases (except a LIKE '' that >>>>>>> didn't work before and a='it's good' and a=it's good whieh are invalid >>>>>>> where clauses). >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: [email protected] >>>>>>> [mailto:[email protected]] On Behalf Of Dominique >>>>>>> Prunier >>>>>>> Sent: Wednesday, March 28, 2012 10:04 AM >>>>>>> To: K. John Wu >>>>>>> Cc: FastBit Users >>>>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 >>>>>>> >>>>>>> Hey John, >>>>>>> >>>>>>> I'm trying r505 right know, i have two questions/remarks: >>>>>>> >>>>>>> 1. I saw a change about the dictionary now able to accept empty >>>>>>> strings and treat them as normal strings instead of NULLs. This breaks >>>>>>> quite a lot of my test cases, specifically those which used to test >>>>>>> category=0 or !=0 to simulate the IS NULL/IS NOT NULL because testing >>>>>>> for an empty string doesn't work (='' or LIKE '' used to fail or return >>>>>>> invalid results). What would be your recommendation for that use case ? >>>>>>> Could we add a define for those relying on this behavior ? Something >>>>>>> like: >>>>>>> >>>>>>> - //if (*str == 0) return 0; >>>>>>> +#ifdef FASTBIT_EMPTY_STRING_AS_NULL >>>>>>> + if (*str == 0) return 0; >>>>>>> +#endif >>>>>>> >>>>>>> 2. I can't find anywhere the de-escaping patch in r505, am i missing >>>>>>> something ? >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: K. John Wu [mailto:[email protected]] >>>>>>> Sent: Wednesday, March 28, 2012 1:55 AM >>>>>>> To: Dominique Prunier >>>>>>> Cc: FastBit Users >>>>>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 >>>>>>> >>>>>>> Hi, Dominique, >>>>>>> >>>>>>> There are some updates to involving merge of dictionaries and to >>>>>>> exercise the operations involving unmatched quotes. The new code is >>>>>>> SVN R505. >>>>>>> >>>>>>> Please let me know if you have any additional questions. >>>>>>> >>>>>>> John >>>>>>> >>>>>>> >>>>>>> On 3/27/12 1:03 PM, Dominique Prunier wrote: >>>>>>>> Hey John, >>>>>>>> >>>>>>>> There is definitely a need for FastBit escaping. The escaping problem >>>>>>>> is not at the shell level (though we could have one there) since in >>>>>>>> pure C/C++ code, there's no shell involved when i'm building a where >>>>>>>> clause from a string. The problem is at the where clause parsing level >>>>>>>> (in the lexer to be more precise) to be able to express string >>>>>>>> literals among other things (and not only metas, it is also white >>>>>>>> spaces, ...). >>>>>>>> Typically, my test that fails is as simple as calling >>>>>>>> fastbit_build_query(..., ..., "a='it\\'s good'"). This is expected to >>>>>>>> create a qString << a = it's good >> but now, it creates a qString << >>>>>>>> a = it\'s good >> which is wrong. The attached patch restores the >>>>>>>> descaping, but _not_ the double quote stripping (because it is already >>>>>>>> handled in the lexer). All my test cases works after applying it on >>>>>>>> r503. >>>>>>> >>>>>>> The constructors for ibis::qString and ibis::qLike really should not >>>>>>> strip away anything. In your case, you should be able to do the >>>>>>> following >>>>>>> >>>>>>> .../tcapi data-dir "a=\"it's good\"" >>>>>>> >>>>>>> if you are using fastbit_build_query, you can use the same string >>>>>>> fastbit_build_query(..., ..., "a=\"it's good\""); >>>>>>> >>>>>>> Since FastBit regular expression only support four meta characters ? * >>>>>>> _ %. There is no need to escape anything. It is probably cleaner to >>>>>>> not introduce stripping of anything special (except the outer most >>>>>>> quotes, which should be only done once). >>>>>>> >>>>>>>> >>>>>>>> About the decompression, thanks for the link, this is very interesting >>>>>>>> stuff ! But my point here is not about questing the fact the >>>>>>>> decompression can be better is some case, i was just under the >>>>>>>> impression that the hit vector given to category::patternSearch was >>>>>>>> _always_ already decompressed since it is ultimately a bitvector that >>>>>>>> has been created from scratch for query evaluation (it would need >>>>>>>> verification though). My guess is that the few percent of performance >>>>>>>> i'm loosing here are attributable to the check (hits.isCompressed() && >>>>>>>> hits.bytes()*mult + bv->bytes() > hits.size()), since it gets executed >>>>>>>> _A LOT_ of times. I'll try to investigate it a little bit further. >>>>>>> >>>>>>> I have rearranged the tests for decompression in layers which >>>>>>> hopefully will eliminate the need to perform more expensive tests in >>>>>>> your case that presumably involve a fairly small number of values. >>>>>>> _______________________________________________ >>>>>>> FastBit-users mailing list >>>>>>> [email protected] >>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>>> _______________________________________________ >>>> FastBit-users mailing list >>>> [email protected] >>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
