Hi John, I ran the query with r507, and apparently, the problem was there:
==14864== Invalid read of size 8 ==14864== at 0x550067A: ibis::index::sumBins(ibis::array_t<unsigned int> const&, ibis::bitvector&) const (index.cpp:6371) ==14864== by 0x576E76D: ibis::category::patternSearch(char const*, ibis::bitvector&) const (category.cpp:871) ==14864== by 0x509933C: ibis::part::patternSearch(ibis::qLike const&, ibis::bitvector&) const (part.cpp:3260) ==14864== by 0x545FF03: ibis::query::doEvaluate(ibis::qExpr const*, ibis::bitvector const&, ibis::bitvector&) const (query.cpp:3962) ==14864== by 0x545B63D: ibis::query::computeHits() (query.cpp:2771) ==14864== by 0x5452413: ibis::query::evaluate(bool) (query.cpp:847) ==14864== by 0x587328C: fastbit_build_query (capi.cpp:477) ==14864== by 0x4030F8: main (main.cpp:38) In r508, the problem is gone ! Thanks, -----Original Message----- From: K. John Wu [mailto:[email protected]] Sent: Thursday, March 29, 2012 12:33 AM To: Dominique Prunier Cc: FastBit Users Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 Hi, Dominique, The stack trace shows that it is invoking a copy constructor of the ibis::bitvector class when it encountered the seg fault. Not sure what is the problem here. I have tried to reproduce the problem by modifying an existing test suite check-maurel. However, the code seems to work. There is a minor change ibis::index::sumBins to check that the incoming array contains only values less than bits.size() (the number of bitvectors stored in an index object - ibis::direkte is an index object). This might prevent attempting to out-of-bound accesses. The change is in SVN Revision 508. If you are able to find more information. Please let me know. John On 3/28/12 2:07 PM, Dominique Prunier wrote: > Woops, r507 segfaults right away in: > > C [libfastbit.so.0.0.9+0x800fbb] ibis::bitvector::bitvector(ibis::bitvector > const&)+0x23 > C [libfastbit.so.0.0.9+0x6d16c1] > ibis::index::sumBins(ibis::array_t<unsigned> const&, ibis::bitvector&) > const+0x103 > C [libfastbit.so.0.0.9+0x93f76e] ibis::category::patternSearch(char const*, > ibis::bitvector&) const+0x3c8 > C [libfastbit.so.0.0.9+0x26a33d] ibis::part::patternSearch(ibis::qLike > const&, ibis::bitvector&) const+0xcf > C [libfastbit.so.0.0.9+0x630f04] ibis::query::doEvaluate(ibis::qExpr > const*, ibis::bitvector const&, ibis::bitvector&) const+0xe1a > C [libfastbit.so.0.0.9+0x62c63e] ibis::query::computeHits()+0x356 > C [libfastbit.so.0.0.9+0x623414] ibis::query::evaluate(bool)+0x4a6 > C [libfastbit.so.0.0.9+0xa4428d] float+0x3c0 > C [jna1347076628273574295.tmp+0x11b20] float+0x4c > > I'll try to investigate this latter but the last query of my tests it > executed was > > SELECT weird_category FROM /tmp/junit8378310225719591578 WHERE weird_category > LIKE '"val\\%' > > I'll try to investigate this, but it might be related to the fact that this > query is supposed to return no result. > > Thanks, > > -----Original Message----- > From: [email protected] > [mailto:[email protected]] On Behalf Of Dominique Prunier > Sent: Wednesday, March 28, 2012 4:58 PM > To: K. John Wu > Cc: FastBit Users > Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 > > Hey John, > > I'll have a look to r507, probably tomorrow. To limit the risk, i've chosen > the well tested r506 as my stable version using the > FASTBIT_EMPTY_STRING_AS_NULL define. > > My problem with null mask is that i write my partitions in pure Java code. It > is quite straightforward to write data files and -part.txt, but writing a > NULL mask is something else. Besides, I don't really need a difference > between NULL and empty anyway, since treating empty strings as NULLs is no > different than some RDBMS engines that we support (Oracle does that for > example). > > I have one quick question about category columns. Do they really have a > separate NULL mask or do they use the bitmap stored at key 0 as their NULL > mask ? If so, that means that keeping the FASTBIT_EMPTY_STRING_AS_NULL would > allow me to indirectly build some real NULL values which would work with the > NOT NULL syntax (which, by the way, in SQL is IS NOT NULL/IS NULL, maybe it > would be worth changing it). > > Thanks, > > -----Original Message----- > From: K. John Wu [mailto:[email protected]] > Sent: Wednesday, March 28, 2012 3:26 PM > To: Dominique Prunier > Cc: FastBit Users > Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 > > Hi, Dominique, > > I have added code to accept "colname NOT NULL" in the where clauses. > The new code is in SVN Revision 507. > > The new revision should also consolidate the handling of many > bitvectors in category::patternSearch and address the issue of > possibly missing calls to index::activate (which leads to incorrect > answers). > > You can input null values through tablex::readCSV. There is an > example in tests/Makefile.am in the way it generates data partition > w7. The test case 15 of really-small also makes use of the new > expression "NOT NULL". > > Let me know if you spot any problems. > > John > > > On 3/28/12 10:45 AM, Dominique Prunier wrote: >> Hey John, >> >> My guess is that empty strings are used more often that we'd think for the >> same use case: use them as NULL marker because it is the easiest way to both >> insert and query (which is exactly what i use them for). Even if it is not >> an exact synonym of the SQL NULL (especially for propagation), for most use >> cases, it is close enough. The fact that its key is fixed and well known >> provides a valid workaround to simulate IS NULL/IS NOT NULL predicates and >> since it is excluded from most string related predicates (most importantly, >> a LIKE '%' wont select it), it is a quite easy and intuitive to use. >> >> Besides, the column data file for strings can't represent NULL values so you >> would have to deal with null masks, which make things significantly more >> complex (especially, to create a partition from scratch without using the >> FastBit library) and potentially more resource hungry (have to read/write >> null mask, ...). >> >> That being said, my only problem with making the distinction between NULL >> and empty in the dictionary is the current revision's inability to query >> empty strings and/or NULLs (which is currently worked around by the uint >> alternative, =0) and the fact that it matches a pattern like '%'. Since >> there is no guarantee on the empty string key, i don't have any workaround >> for now. >> >> For the time being, i'll stick with the FASTBIT_EMPTY_STRING_AS_NULL define >> waiting for the ambiguity to vanish. >> >> Thanks, >> >> -----Original Message----- >> From: K. John Wu [mailto:[email protected]] >> Sent: Wednesday, March 28, 2012 1:22 PM >> To: Dominique Prunier >> Cc: FastBit Users >> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 >> >> The ability to distinguish between empty string and null string can be >> useful. The ambiguity in the current code is related to some earlier >> design oversight, which we intend to correct. My hope is that not too >> many people have empty strings that want to do something with, so >> breaking this compatibility is not a big deal. >> >> There are also a problem with the implementation of >> category::patternSearch that neglects to read the bitmaps from index >> files (neglecting to call index::activate). I am consolidating some >> code to make use of the existing strategies involving summation of a >> large number of bitmaps in ibis::index::sumBits and >> ibis::index::sumBins. These functions should encapsulate the idea of >> when to decompress bitmaps a lot better than the simple version >> currently in category::patternSearch. >> >> I am doing tests now and will let everyone know when I am ready to >> check the code in. >> >> John >> >> >> On 3/28/12 8:44 AM, Dominique Prunier wrote: >>> Thanks John for your help. It is always very appreciated. >>> >>> With the macro FASTBIT_EMPTY_STRING_AS_NULL enabled, all my test cases now >>> works. I'll test performance next but it could be a good candidate for my >>> stable version. >>> >>> About the FASTBIT_EMPTY_STRING_AS_NULL, what do you think would be the best >>> default ? For my use case, it is obviously to enable it, but for everybody >>> else i don't know. My concern is the backward compatibility here, >>> especially the fact that it influences the index creation, and not usage. >>> This means that somebody who upgrades won't notice this change before it >>> regenerates indexes. What's your opinion on this ? >>> >>> Thanks, >>> >>> -----Original Message----- >>> From: K. John Wu [mailto:[email protected]] >>> Sent: Wednesday, March 28, 2012 11:20 AM >>> To: Dominique Prunier >>> Cc: FastBit Users >>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 >>> >>> Hi, Dominique, >>> >>> Thanks for the suggestions and and test cases. Just checked in a set >>> of changes as SVN Revision 506. Here is a bit more explanation. >>> >>> On item 1, I have taken the option you've suggested in the first >>> message, i.e., use the macro FASTBIT_EMPTY_STRING_AS_NULL >>> >>> On item 2, I have restored the escaping of using backslash as you >>> requested. >>> >>> Will go through you tests cases next and see what else I need to change. >>> >>> John >>> >>> >>> On 3/28/12 8:02 AM, Dominique Prunier wrote: >>>> Hi John, >>>> >>>> Here is two simple test cases for my two issues with r505. >>>> >>>> 1. empty-strings.zip: there is no way to select empty string anymore >>>> >>>> **** FAILED (bad result count) with where clause << a='' >> >>>> >>>> **** FAILED (bad result count) with where clause << a=0 >> >>>> >>>> **** FAILED (hard) with where clause << a LIKE '' >> >>>> >>>> 2. de-escaping.zip: there is no way to select a string with a reserved >>>> char in it >>>> >>>> **** FAILED (bad result count) with where clause << a='it\'s good' >> >>>> >>>> Warning -- ibis::whereParser encountered syntax error, unexpected name >>>> string at location a='it's good':1.7 >>>> Warning -- query[96w0-3xn8Tg----1]::setWhereClause -- failed to parse the >>>> WHERE clause "a='it's good'" >>>> Warning -- fastbit_build_query failed to assign conditions (a='it's good') >>>> to a query >>>> **** FAILED (hard) with where clause << a='it's good' >> >>>> >>>> Warning -- ibis::whereParser encountered syntax error, unexpected >>>> $undefined at location a=it's good:1.5 >>>> Warning -- query[96w0-3xn8Tg----2]::setWhereClause -- failed to parse the >>>> WHERE clause "a=it's good" >>>> Warning -- fastbit_build_query failed to assign conditions (a=it's good) >>>> to a query >>>> **** FAILED (hard) with where clause << a=it's good >> >>>> >>>> With the attached patch, it fixes my test cases (except a LIKE '' that >>>> didn't work before and a='it's good' and a=it's good whieh are invalid >>>> where clauses). >>>> >>>> Thanks, >>>> >>>> -----Original Message----- >>>> From: [email protected] >>>> [mailto:[email protected]] On Behalf Of Dominique >>>> Prunier >>>> Sent: Wednesday, March 28, 2012 10:04 AM >>>> To: K. John Wu >>>> Cc: FastBit Users >>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 >>>> >>>> Hey John, >>>> >>>> I'm trying r505 right know, i have two questions/remarks: >>>> >>>> 1. I saw a change about the dictionary now able to accept empty strings >>>> and treat them as normal strings instead of NULLs. This breaks quite a lot >>>> of my test cases, specifically those which used to test category=0 or !=0 >>>> to simulate the IS NULL/IS NOT NULL because testing for an empty string >>>> doesn't work (='' or LIKE '' used to fail or return invalid results). What >>>> would be your recommendation for that use case ? Could we add a define for >>>> those relying on this behavior ? Something like: >>>> >>>> - //if (*str == 0) return 0; >>>> +#ifdef FASTBIT_EMPTY_STRING_AS_NULL >>>> + if (*str == 0) return 0; >>>> +#endif >>>> >>>> 2. I can't find anywhere the de-escaping patch in r505, am i missing >>>> something ? >>>> >>>> Thanks, >>>> >>>> -----Original Message----- >>>> From: K. John Wu [mailto:[email protected]] >>>> Sent: Wednesday, March 28, 2012 1:55 AM >>>> To: Dominique Prunier >>>> Cc: FastBit Users >>>> Subject: Re: [FastBit-users] PATCH: perf boost on top of r501 >>>> >>>> Hi, Dominique, >>>> >>>> There are some updates to involving merge of dictionaries and to >>>> exercise the operations involving unmatched quotes. The new code is >>>> SVN R505. >>>> >>>> Please let me know if you have any additional questions. >>>> >>>> John >>>> >>>> >>>> On 3/27/12 1:03 PM, Dominique Prunier wrote: >>>>> Hey John, >>>>> >>>>> There is definitely a need for FastBit escaping. The escaping problem is >>>>> not at the shell level (though we could have one there) since in pure >>>>> C/C++ code, there's no shell involved when i'm building a where clause >>>>> from a string. The problem is at the where clause parsing level (in the >>>>> lexer to be more precise) to be able to express string literals among >>>>> other things (and not only metas, it is also white spaces, ...). >>>>> Typically, my test that fails is as simple as calling >>>>> fastbit_build_query(..., ..., "a='it\\'s good'"). This is expected to >>>>> create a qString << a = it's good >> but now, it creates a qString << a = >>>>> it\'s good >> which is wrong. The attached patch restores the descaping, >>>>> but _not_ the double quote stripping (because it is already handled in >>>>> the lexer). All my test cases works after applying it on r503. >>>> >>>> The constructors for ibis::qString and ibis::qLike really should not >>>> strip away anything. In your case, you should be able to do the following >>>> >>>> .../tcapi data-dir "a=\"it's good\"" >>>> >>>> if you are using fastbit_build_query, you can use the same string >>>> fastbit_build_query(..., ..., "a=\"it's good\""); >>>> >>>> Since FastBit regular expression only support four meta characters ? * >>>> _ %. There is no need to escape anything. It is probably cleaner to >>>> not introduce stripping of anything special (except the outer most >>>> quotes, which should be only done once). >>>> >>>>> >>>>> About the decompression, thanks for the link, this is very interesting >>>>> stuff ! But my point here is not about questing the fact the >>>>> decompression can be better is some case, i was just under the impression >>>>> that the hit vector given to category::patternSearch was _always_ already >>>>> decompressed since it is ultimately a bitvector that has been created >>>>> from scratch for query evaluation (it would need verification though). My >>>>> guess is that the few percent of performance i'm loosing here are >>>>> attributable to the check (hits.isCompressed() && hits.bytes()*mult + >>>>> bv->bytes() > hits.size()), since it gets executed _A LOT_ of times. I'll >>>>> try to investigate it a little bit further. >>>> >>>> I have rearranged the tests for decompression in layers which >>>> hopefully will eliminate the need to perform more expensive tests in >>>> your case that presumably involve a fairly small number of values. >>>> _______________________________________________ >>>> FastBit-users mailing list >>>> [email protected] >>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
