Hi Nina, Rajarshi,

Thanks a lot for your feedback and ideas. There seems to be no easy
solution, though.

I just tested a very simple approach:
* strip away small compounds (number of non-H atoms <= 2)
* for the remaining mixtures / dot-connected compounds: use the mean of
the single-compound feature values
I tried it on the CPDBAS data (1508 entries with lots of salts and
mixtures, 7 different endpoints), comparing standard PC descriptors
against the new approach. There were no significant changes in
prediction performance (5 times 10-fold CV).
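
For illustration, here is a minimal Python sketch of this procedure. It is
not the code used for the experiment and does not use CDK. It assumes that
the components can be obtained by splitting the SMILES string on ".", and it
counts heavy atoms with a deliberately crude regular expression. The names
count_heavy_atoms, mixture_descriptor and toy_descriptor are hypothetical.
What should happen when every component is small is not specified above; the
sketch simply falls back to using all components.

```python
import re
from statistics import mean

# Crude SMILES atom tokenizer: bracket atoms such as [Na+] or [nH] first,
# then unbracketed organic-subset atoms (two-letter symbols before one-letter).
ATOM = re.compile(
    r"\[\d*(?P<bracket>[A-Za-z][a-z]?|\*)[^\]]*\]"
    r"|(?P<organic>Cl|Br|[BCNOPSFI]|[bcnops])"
)


def count_heavy_atoms(smiles):
    """Count non-hydrogen atoms in a SMILES string (simplified, not a full parser)."""
    symbols = [m.group("bracket") or m.group("organic") for m in ATOM.finditer(smiles)]
    return sum(1 for symbol in symbols if symbol != "H")


def mixture_descriptor(smiles, descriptor, max_small_atoms=2):
    """Strip small components, then average the descriptor over the rest."""
    # Assumption: "." separates components (ignores ring closures across dots).
    components = [c for c in smiles.split(".") if c]
    kept = [c for c in components if count_heavy_atoms(c) > max_small_atoms]
    if not kept:
        # Assumption: if only small fragments exist, keep them all.
        kept = components
    return mean([descriptor(c) for c in kept])


def toy_descriptor(component):
    """Hypothetical stand-in for a real PC descriptor: the heavy-atom count."""
    return float(count_heavy_atoms(component))


# Sodium acetate: the [Na+] counter-ion is stripped before averaging.
print(mixture_descriptor("CC(=O)[O-].[Na+]", toy_descriptor))
# Ethanol/benzene mixture: mean over both components.
print(mixture_descriptor("CCO.c1ccccc1", toy_descriptor))
```

In a real setup the descriptor callable would wrap a toolkit descriptor, and
the component split would come from the toolkit's connectivity check rather
than from string splitting.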

Our project partner (an expert chemist) just stated that we have to go
through each compound in our data.
And I think you are right: one has to decide for each descriptor how
to handle dot-connected compounds properly.
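
One way to write such per-descriptor decisions down is a small policy
table, in the spirit of Nina's suggestion that a descriptor either accepts
a disconnected structure or throws an exception. This is only a Python
sketch, not CDK code. The Policy names, evaluate, and the toy carbon_count
descriptor are made up for illustration, and which policy suits which
descriptor remains the open question.

```python
from enum import Enum
from statistics import mean


class Policy(Enum):
    WHOLE = "whole"    # one value for the entire dot-connected structure
    MEAN = "mean"      # average of the per-component values
    REJECT = "reject"  # refuse disconnected input; the caller must standardize


class DisconnectedStructureError(ValueError):
    """Raised when a descriptor does not accept a disconnected structure."""


def evaluate(smiles, descriptor, policy):
    """Apply a descriptor to a SMILES string according to its policy."""
    # Assumption: "." separates components (ignores ring closures across dots).
    components = [c for c in smiles.split(".") if c]
    if len(components) <= 1 or policy is Policy.WHOLE:
        return descriptor(smiles)
    if policy is Policy.MEAN:
        return mean([descriptor(c) for c in components])
    raise DisconnectedStructureError(f"descriptor does not accept {smiles!r}")


def carbon_count(smiles):
    """Toy descriptor: counts 'C' characters (only valid without Cl, Ca, etc.)."""
    return smiles.count("C")


# Ethanol plus methane as a dot-connected input, under each policy.
for policy in Policy:
    try:
        print(policy.name, evaluate("CCO.C", carbon_count, policy))
    except DisconnectedStructureError as err:
        print(policy.name, "rejected:", err)
```

Following the examples in this thread, molecular mass or atom counts would
map to WHOLE, while a descriptor that fails on multi-compound input would
map to REJECT, leaving the standardization step to the application.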

Best regards,
Martin

On Fri, Aug 23, 2013 at 6:23 AM, Nina Jeliazkova
<[email protected]> wrote:
> Hi Martin, All,
>
>
> On 22 August 2013 23:55, Martin Guetlein <[email protected]>
> wrote:
>>
>> On Thu, Aug 22, 2013 at 6:28 PM, Rajarshi Guha <[email protected]>
>> wrote:
>> > Do you mean dot connected compounds? In that sense, most (if not all) QSAR
>> > descriptors should be evaluating descriptors for individual components
>> > separately. After that what they do depends on the application - if we're
>> > talking about salt forms, probably drop the salt components. Alternatively,
>> > if we're talking about mixtures (which is not really the case for a dot
>> > connected representation), there could be various ways to generate a
>> > mixture descriptor
>>
>> Hi Rajarshi,
>>
>> Yep, in SMILES representation these are dot-connected compounds.
>> In our data these compounds are mostly salts/ions, some mixtures and
>> some isomers.
>> So in more detail, how would you compute the descriptor values for
>> the mixtures?
>
>
> It depends. Calculating properties of mixtures is a science of its own.
> Properties of salts could be different to those of parent compounds. For
> isomers getting mean is just one of the options.
>
> On the other side, it depends on the descriptors as well. A single
> fingerprint value (or structure alert) for the entire (dot-connected)
> compound is perfectly fine in most cases. Simple examples where one value is
> valid are the molecular mass or atom/bond counts. I'm sure the chemists on
> this list will come up with more examples :)
>
> My preferred solution actually would be to introduce interfaces specifying
> whether or not a descriptor accepts a disconnected structure. If it does,
> then it might return a single value (or a set of values, which means a second
> interface specifying the return value). If not, then an exception is
> thrown, and it is the application's responsibility to take the proper action
> (e.g. most commercial software does some kind of standardization,
> splitting salts, etc.).
>
> Best regards,
> Nina
>
>
>>
>> For isomers, using the mean value should be fine, what do you think?
>>
>> Kind regards,
>> Martin
>>
>>
>>
>> >
>> >
>> > On Thu, Aug 22, 2013 at 12:14 PM, Martin Guetlein
>> > <[email protected]> wrote:
>> >>
>> >> Hi,
>> >>
>> >> How do CDK descriptors handle molecules with multiple compounds in them?
>> >>
>> >> I experimented a bit and found out that it depends on the descriptor:
>> >> * most descriptors apparently just add up the values of the single
>> >> compounds (like xlogp; that makes no sense, does it?)
>> >> * some fail for multi-compound molecules
>> >> * some compute something else
>> >>
>> >> My application is building QSAR models. I am not a chemist, but my
>> >> feeling is that the clean but complicated solution would be to have
>> >> 'set-valued features' (a set of values instead of a single value) for
>> >> multi-compound molecules. But that's pretty complicated and most of my
>> >> molecules have only one compound. But I think that the average value
>> >> of the single compounds should be preferred for descriptors like
>> >> molecular weight or logp.
>> >>
>> >> Kind regards,
>> >> Martin
>> >>
>> >> P.S.: Sorry if I missed existing discussions/documentation on this
>> >> issue; I had some trouble naming (and therefore googling) it.
>> >>
>> >> --
>> >> Dipl-Inf. Martin Gütlein
>> >> Phone:
>> >> +49 (0)761 203 8442 (office)
>> >> +49 (0)177 623 9499 (mobile)
>> >> Email:
>> >> [email protected]
>> >>
>> >>
>> >>
>> >> ------------------------------------------------------------------------------
>> >> Introducing Performance Central, a new site from SourceForge and
>> >> AppDynamics. Performance Central is your source for news, insights,
>> >> analysis and resources for efficient Application Performance
>> >> Management.
>> >> Visit us today!
>> >>
>> >>
>> >> http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk
>> >> _______________________________________________
>> >> Cdk-user mailing list
>> >> [email protected]
>> >> https://lists.sourceforge.net/lists/listinfo/cdk-user
>> >
>> >
>> >
>> >
>> > --
>> > Rajarshi Guha | http://blog.rguha.net
>> > NIH Center for Advancing Translational Science
>>
>
>



-- 
Dipl-Inf. Martin Gütlein
Phone:
+49 (0)761 203 8442 (office)
+49 (0)177 623 9499 (mobile)
Email:
[email protected]

