Hi John, Thank you very much for considering my use case valid, even if it means more work to fix Andrew's one. I appreciate it.
On top of my head, i think getKey needs a bit more bullet proofing: check dic == 0, call prepareMembers, ... But i'm pretty sure it'll work and won't break write spec. Thanks, -----Original Message----- From: K. John Wu [mailto:[email protected]] Sent: Friday, March 23, 2012 12:48 PM To: Dominique Prunier Cc: FastBit Users Subject: Re: [FastBit-users] Changes in src/colValues.cpp@495 breaks ibis::bundle::getUInt for CATEGORY Hi, Dominique, Thanks for the clarification. Looks like here are my immediate actionable items. - allow ibis::bundle to create integer representation of categorical values (because each bundle can only be associated with one data partition and therefore no possibility of inconsistent dictionaries) - in colUInts::write allow categorical values to make use of dictionaries more directly (through getKey instead of getString) - add function to synchronize the dictionaries when multiple data partitions are involved The first two items should take a couple of hours to implement and test, the last one might take longer. Stay tuned. John On 3/22/12 4:58 PM, Dominique Prunier wrote: > Hey John, > > Don't worry, that is exactly why i'm building my query test suite. To make > sure i catch these sooner than later. > > About the use of uints for categories, this is actually a very convenient > feature that i would not want to see disappear (the issue i raised only > applies to bundles (select distinct) since retrieving uints with > category::selectUInts still works fine). > > I understand that it raises concerns for the multi partition scenario, but > i'm pretty sure there is a way reconcialiate both use cases. > > I have two concerns with the change introduced in colValues.cpp in r495: the > potential performance hit in bundles for not using a colUInts (even though > this is pure guess, i have not measured it yet) and the inability to get a > string key for category columns when using bundles (which ultimately has a > performance impact for me). > > Using uints allows an application to handle smaller objects much more > efficiently. This applies to FastBit itself where the select distinct based > on uints is extremely efficient. In my application, i retrieves uints instead > of strings too because the this is much faster if i have processing to do on > the keys before retrieving the actual string. > > About the change in colUInts::write i don't think i'm breaking the spec. I'm > writing getKey((*array)[i]), not getKey(i) so this is still the ith row as > promised. > > Thanks, > > -----Original Message----- > From: K. John Wu [mailto:[email protected]] > Sent: Thursday, March 22, 2012 7:37 PM > To: Dominique Prunier > Cc: FastBit Users > Subject: Re: [FastBit-users] Changes in src/colValues.cpp@495 breaks > ibis::bundle::getUInt for CATEGORY > > Hi, Dominique, > > Thanks for your patience while we work through these issues. > > SVN Revision 495 has to be replaced because the strategy of using > integer representation for categorical values does not work in the > general case. That was the main reason for moving all the cases of > ibis::CATEGORY back together with ibis::TEXT. This was triggered by > the test case from Andrew Olson last week. > > When there are multiple data partitions in a data table, the > dictionaries from all partitions must agree, otherwise, the integer > value of 1 for on data partition can mean a different thing in the > next data partition. The current code allows one to work with string > values as order by keys no matter whether the dictionaries agree or not. > > The particular function, colStrings::write, you've patched can not be > modified this way. It would violate the promise of writing the ith > row of the string-valued column. Your change will print the ith word > in the dictionary. > > If you can describe a little bit more of the issues you are having, we > will see how we might resolve them. > > John > > > > On 3/22/12 3:50 PM, Dominique Prunier wrote: >> Oh, sorry, it was r495, not line 495. I reverted the change introduced in >> src/colValues.cpp at revision 495 to restore functionality. >> >> However, it ended up being a bit more complex than that because of r496 >> which changed the behavior of category::getString (use row index, not key). >> The problem, after i reverted colValues, is that the bundle call the >> colUInt::getString which call category::getString with the key, not the row >> index. The follwoing change should fix it but prevent use of colUINT with >> TEXT columns (i don't think it was used, not sure it even makes sense): >> >> diff --git a/src/colValues.h b/src/colValues.h >> index ff7c1c3..232be61 100755 >> --- a/src/colValues.h >> +++ b/src/colValues.h >> @@ -239,10 +239,9 @@ public: >> } >> /// Write the ith element as text. >> virtual void write(std::ostream& out, uint32_t i) const { >> - if (col->type() == ibis::CATEGORY || col->type() == ibis::TEXT) { >> - std::string str; >> - col->getString((*array)[i], str); >> - if (str.empty()) { >> + if (col->type() == ibis::CATEGORY) { >> + const char * str = reinterpret_cast<const >> ibis::category*>(col)->getKey((*array)[i]); >> + if (str == 0 || (*str) == 0) { >> out << "<NULL>"; >> } >> else { >> >> All my test case works after this final change. I have attached the complete >> patch i'm applying on top of r497 to make everything work. I'm not sure i'm >> not breaking something else though (especially the things i reverted from >> r495). >> >> Thanks, >> >> -----Original Message----- >> From: K. John Wu [mailto:[email protected]] >> Sent: Thursday, March 22, 2012 6:43 PM >> To: FastBit Users >> Cc: Dominique Prunier >> Subject: Re: [FastBit-users] Changes in src/colValues.cpp@495 breaks >> ibis::bundle::getUInt for CATEGORY >> >> Hi, Dominique, >> >> I see the problem of assign a nil pointer to a string object. I would >> prefer to handle this case a little earlier in the code. The change >> will appear in the next SVN check in. >> >> Would you mind check the file and line number in the original message? >> I don't see any problem with src/colValues.cpp line 495 in my copy of >> the code.. >> >> John >> >> >> >> On 3/22/12 3:04 PM, Dominique Prunier wrote: >>> While i’m at it, the following patch fixes the NUL string case (key 0) >>> >>> >>> >>> diff --git a/src/category.cpp b/src/category.cpp >>> >>> index d8c1a9f..e7d28a1 100644 >>> >>> --- a/src/category.cpp >>> >>> +++ b/src/category.cpp >>> >>> @@ -715,7 +715,7 @@ void ibis::category::getString(uint32_t i, >>> std::string &str) const { >>> >>> ibis::array_t<uint32_t> ints; >>> >>> ierr = ibis::fileManager::instance().getFile(fnm.c_str(), ints); >>> >>> if (ierr >= 0 && ints.size() == thePart->nRows()) { >>> >>> - if (i < ints.size()) { >>> >>> + if (i < ints.size() && ints[i] > 0) { >>> >>> str = dic[ints[i]]; >>> >>> } >>> >>> return; >>> >>> >>> >>> Thanks, >>> >>> >>> >>> *From:*[email protected] >>> [mailto:[email protected]] *On Behalf Of *Dominique >>> Prunier >>> *Sent:* Thursday, March 22, 2012 5:59 PM >>> *To:* FastBit Users >>> *Subject:* [FastBit-users] Changes in src/colValues.cpp@495 breaks >>> ibis::bundle::getUInt for CATEGORY >>> >>> >>> >>> Hey John, >>> >>> >>> >>> The title says pretty much everything. I’m not sure why this has been >>> changed (this is a huge changeset in a part of the lib i don’t really >>> know), but it breaks the ability to access a CATEGORY string key >>> though bundles. If i revert only this file, it seems to work. >>> >>> >>> >>> Thanks, >>> >>> >>> >>> */Dominique Prunier/**//* >>> >>> APG Lead Developper >>> >>> Logo-W4N-100dpi >>> >>> 4388, rue Saint-Denis >>> >>> Bureau 309 >>> >>> Montreal (Quebec) H2J 2L1 >>> >>> Tel. +1 514-842-6767 x310 >>> >>> Fax +1 514-842-3989 >>> >>> [email protected] <mailto:[email protected]> >>> >>> www.watch4net.com <http://www.watch4net.com/> >>> >>> / / >>> >>> /This message is for the designated recipient only and may contain >>> privileged, proprietary, or otherwise private information. If you have >>> received it in error, please notify the sender immediately and delete >>> the original. Any other use of this electronic mail by you is prohibited. >>> >>> //Ce message est pour le récipiendaire désigné seulement et peut >>> contenir des informations privilégiées, propriétaires ou autrement >>> privées. Si vous l'avez reçu par erreur, S.V.P. avisez l'expéditeur >>> immédiatement et effacez l'original. Toute autre utilisation de ce >>> courrier électronique par vous est prohibée./// >>> >>> >>> >>> >>> >>> _______________________________________________ >>> FastBit-users mailing list >>> [email protected] >>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
