Hi Nina, Rajarshi, thanks a lot for your feedback and ideas. There seems to be no easy solution, though.
I just tested a very simple approach:
* strip away small compounds (number of non-H atoms <= 2)
* for the remaining mixtures / dot-connected compounds, use the mean of the
  feature values computed for the single compounds

I tried it on the CPDBAS data (1508 entries with lots of salts and
mixtures, 7 different endpoints), comparing standard PC descriptors with
the new approach. There were no significant changes in prediction
performance (5 times 10-fold cross-validation).

Our project partner (an expert chemist) just stated that we have to go
through each compound in our data. And I think you are right: one has to
decide for each descriptor how to properly handle dot-connected compounds.

Best regards,
Martin

On Fri, Aug 23, 2013 at 6:23 AM, Nina Jeliazkova <[email protected]> wrote:
> Hi Martin, All,
>
> On 22 August 2013 23:55, Martin Guetlein <[email protected]> wrote:
>>
>> On Thu, Aug 22, 2013 at 6:28 PM, Rajarshi Guha <[email protected]> wrote:
>> > Do you mean dot-connected compounds? In that sense, most (if not all)
>> > QSAR descriptors should be evaluating descriptors for individual
>> > components separately. After that, what they do depends on the
>> > application: if we're talking about salt forms, probably drop the salt
>> > components. Alternatively, if we're talking about mixtures (which is
>> > not really the case for a dot-connected representation), there could
>> > be various ways to generate a mixture descriptor.
>>
>> Hi Rajarshi,
>>
>> Yep, in SMILES representation it's dot-connected compounds.
>> In our data these compounds are mostly salts/ions, some mixtures and
>> some isomers.
>> So in more detail, how would you compute the descriptor values for
>> the mixtures?
>
> It depends. Calculating properties of mixtures is a science of its own.
> Properties of salts could be different from those of the parent compounds.
> For isomers, taking the mean is just one of the options.
>
> On the other hand, it depends on the descriptors as well. A single
> fingerprint value (or structure alert) for the entire (dot-connected)
> compound is perfectly fine in most cases. Simple examples where one value
> is valid are the molecular mass or the atom/bond counts. I'm sure the
> chemists on this list will come up with more examples :)
>
> My preferred solution actually would be to introduce interfaces specifying
> whether a descriptor accepts a disconnected structure or not. If it does
> accept one, then it might return a single value (or a set of values; this
> means a second interface specifying the return value). If not, then an
> exception is thrown, and it is the application's responsibility to take
> the proper action (e.g. most commercial software does some kind of
> standardization, splitting salts, etc.).
>
> Best regards,
> Nina
>
>>
>> For isomers, using the mean value should be fine, what do you think?
>>
>> Kind regards,
>> Martin
>>
>> > On Thu, Aug 22, 2013 at 12:14 PM, Martin Guetlein
>> > <[email protected]> wrote:
>> >>
>> >> Hi,
>> >>
>> >> How do CDK descriptors handle molecules with multiple compounds in
>> >> them?
>> >>
>> >> I experimented a bit and found out that it depends on the descriptor:
>> >> * most descriptors apparently just add up the values of the single
>> >>   compounds (like xlogp, which makes no sense, does it?)
>> >> * some fail for multi-compound molecules
>> >> * some compute something else
>> >>
>> >> My application is building QSAR models. I am not a chemist, but my
>> >> feeling is that the clean but complicated solution would be to have
>> >> 'set-valued features' (a set of values instead of a single value) for
>> >> multi-compound molecules. But that's pretty complicated, and most of
>> >> my molecules have only one compound. But I think that the average
>> >> value of the single compounds should be preferred for descriptors
>> >> like molecular weight or logp.
>> >>
>> >> Kind regards,
>> >> Martin
>> >>
>> >> P.S.: Sorry if I missed existing discussions/documentation on this
>> >> issue; I had some problems naming (and therefore googling) it.
>> >
>> > --
>> > Rajarshi Guha | http://blog.rguha.net
>> > NIH Center for Advancing Translational Science

--
Dipl-Inf. Martin Gütlein
Phone:
+49 (0)761 203 8442 (office)
+49 (0)177 623 9499 (mobile)
Email:
[email protected]

_______________________________________________
Cdk-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/cdk-user
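P.S.: For reference, the simple approach described at the top of this message (drop fragments with at most 2 non-H atoms, then average the per-compound feature values) can be sketched in a few lines. This is only a minimal illustration, not the code used for the experiments and not CDK API: the `Component` container, the function name `mixture_descriptors`, the fallback for structures that consist only of small fragments, and the numeric values in the usage example are all assumptions made for the example. The per-compound descriptor values are assumed to have been computed elsewhere.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Component:
    """One fragment of a dot-connected structure (hypothetical container)."""
    smiles: str
    heavy_atoms: int               # number of non-hydrogen atoms
    descriptors: Dict[str, float]  # per-fragment values, computed elsewhere


def mixture_descriptors(components: List[Component],
                        max_small: int = 2) -> Dict[str, float]:
    """Drop fragments with <= max_small heavy atoms, then average the rest."""
    if not components:
        raise ValueError("at least one component is required")
    kept = [c for c in components if c.heavy_atoms > max_small]
    if not kept:
        # Assumption: if every fragment is small, keep them all rather
        # than returning nothing.
        kept = list(components)
    names = kept[0].descriptors.keys()
    return {name: mean(c.descriptors[name] for c in kept) for name in names}


# Usage with placeholder descriptor values (not real xlogp numbers).
parts = [
    Component("c1ccccc1", 6, {"xlogp": 1.0}),
    Component("CCO", 3, {"xlogp": 3.0}),
    Component("[Na+]", 1, {"xlogp": 0.5}),  # small ion, stripped
]
result = mixture_descriptors(parts)  # averages the two larger fragments only
```

Whether the mean is appropriate still has to be decided per descriptor, as discussed in the thread; for additive properties such as molecular mass a sum over the kept fragments may be the better aggregate.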

