[Rdkit-discuss] how to come to a good model

Paul . Czodrowski Fri, 07 Oct 2011 03:32:13 -0700

Dear RDKitters,

I'm in the process of training a 3-class decision tree model. I have 
roughly 1500 compounds, distributed almost equally across the 3 
classes.


This is the Grow command I'm using for the MorganFP model:
nPossible = [0]+[2]*2048+[3]
cmp.Grow(pts,attrs=[1],nPossibleVals=[3],nTries=10,buildDriver=CrossValidate.CrossValidationDriver,treeBuilder=SigTreeBuilder,needsQuantization=False,maxDepth=3)

and these are the code lines for the descriptor-based model:
ndescrs = len(pts[0])-2
boundsPerVar = [0]+[1]*ndescrs+[0]
nPossible = [0]+[2]*ndescrs+[3]
attrs = range(1,ndescrs+1)
cmp.Grow(pts,attrs=attrs,nPossibleVals=nPossible,nTries=10,buildDriver=CrossValidate.CrossValidationDriver,treeBuilder=QuantTreeBoot,needsQuantization=False,nQuantBounds=boundsPerVar,maxDepth=3)
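
To make the list layout of the descriptor-based setup easier to check, here is a minimal sketch in plain Python (no RDKit calls). The toy data point and its values are invented for illustration, and the layout [id, descriptors..., class label] is an assumption inferred from the len(pts[0])-2 expression above.

```python
# Hypothetical toy data point: [compound id, descriptor values..., class label].
# The values are invented for illustration only.
point = ["cmpd-1", 0.12, 3.4, 7.0, 2]

# Drop the id (first column) and the class label (last column).
ndescrs = len(point) - 2

# One quantization bound per descriptor, none for the id or the label.
bounds_per_var = [0] + [1] * ndescrs + [0]

# Two possible values per quantized descriptor, three for the class label.
n_possible = [0] + [2] * ndescrs + [3]

# Column indices of the descriptors within each point.
attrs = list(range(1, ndescrs + 1))

print(ndescrs)         # 3
print(bounds_per_var)  # [0, 1, 1, 1, 0]
print(n_possible)      # [0, 2, 2, 2, 3]
print(attrs)           # [1, 2, 3]
```

All three lists must have the same length as a data point (except attrs, which only names the descriptor columns), so printing them for one point is a quick way to spot an off-by-one error.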


Apparently, I screwed up parts of my code, because the "Cycle output" is 
the following:
Cycle:    0
Cycle:    3
Cycle:    6
Cycle:    9
Cycle:   12
Cycle:   15
Cycle:   18
Cycle:   21
Cycle:   24
Cycle:   27


Up to yesterday, the numbering scheme was 0,1,2. This effect does 
not really worry me, but is it something to take care of?

I played around with the following settings:
* random training / test set selection (training set size: 75 %)
* diverse selection of training / test set (training set size: 75 %)
* MorganFP as well as RDKit descriptors, each with either a random or a 
diverse selection of the training set
* nTries = 10 or 20 or 30


In all cases, the statistics are really bad: about 50 percent of the 
compounds are misclassified, e.g.:
"
        *** Vote Results ***
misclassified: 580/1180 (%49.15)        580/1180 (%49.15)

average correct confidence:    0.7837
average incorrect confidence:  0.7528
"

Interestingly, there is only a very small difference between the average 
confidence for the correct and the incorrect classifications. 
As far as I understand it, this tells me that the model is really bad, 
which I already knew from the vote results themselves.
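
For reference, the numbers from the vote results can be put next to a naive baseline. The sketch below only redoes the arithmetic; the chance-level error assumes uniformly random guessing over three equally populated classes, which is an assumption added here and not something the vote output reports.

```python
# Numbers copied from the vote results above.
n_total = 1180
n_wrong = 580
avg_correct_conf = 0.7837
avg_incorrect_conf = 0.7528

# Assumption: three equally populated classes and uniformly random guessing.
n_classes = 3

error_rate = n_wrong / n_total
chance_error = 1 - 1 / n_classes
confidence_gap = avg_correct_conf - avg_incorrect_conf

print("observed error: %.2f%%" % (100 * error_rate))    # 49.15%
print("chance error:   %.2f%%" % (100 * chance_error))  # 66.67%
print("confidence gap: %.4f" % confidence_gap)          # 0.0309
```

Under that assumption the observed error is below the chance level, while the confidence gap of about 0.03 shows that the votes barely separate correct from incorrect predictions.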


Which parameters are worthwhile to test?


Cheers & Thanks,
Paul



