Dear RDKitters,
I'm in the process of training a 3-class decision tree model. I have
roughly 1500 compounds with an almost equal distribution across the 3
classes.
This is the Grow command I'm using for the MorganFP model:
nPossible = [0]+[2]*2048+[3]
cmp.Grow(pts,attrs=[1],nPossibleVals=[3],nTries=10,buildDriver=CrossValidate.CrossValidationDriver,treeBuilder=SigTreeBuilder,needsQuantization=False,maxDepth=3)
and these are the lines of code for the descriptor-based model:
ndescrs = len(pts[0])-2
boundsPerVar = [0]+[1]*ndescrs+[0]
nPossible = [0]+[2]*ndescrs+[3]
attrs = range(1,ndescrs+1)
cmp.Grow(pts,attrs=attrs,nPossibleVals=nPossible,nTries=10,buildDriver=CrossValidate.CrossValidationDriver,treeBuilder=QuantTreeBoot,
needsQuantization=False,nQuantBounds=boundsPerVar, maxDepth=3)
Apparently, I screwed up parts of my code, because the "Cycle output" is
the following:
Cycle: 0
Cycle: 3
Cycle: 6
Cycle: 9
Cycle: 12
Cycle: 15
Cycle: 18
Cycle: 21
Cycle: 24
Cycle: 27
Up to yesterday, the numbering scheme was 0,1,2. This effect does not
really worry me, though. Or is it something I should take care of?
I played around with the following settings:
* random training / test set selection (training set size: 75 %)
* diverse selection of training / test set (training set size: 75 %)
* MorganFP as well as RDKit descriptors - either a random selection of the
training set or a diverse selection
* nTries = 10 or 20 or 30
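
For reference, the random 75 % selection mentioned above can be sketched in
plain Python as follows. This is only an illustration, not the code used for
the runs reported here: the helper name stratified_split is hypothetical, the
toy data is made up, and it assumes that the class label is the last element
of each point. The split is done per class so that the training and test sets
keep the same class balance.

```python
import random
from collections import defaultdict


def stratified_split(points, train_fraction=0.75, seed=42):
    """Randomly split points into (train, test), class by class.

    Hypothetical helper: assumes the last element of each point is
    its class label.
    """
    rng = random.Random(seed)

    # Group the points by class label.
    by_class = defaultdict(list)
    for pt in points:
        by_class[pt[-1]].append(pt)

    train, test = [], []
    for label in sorted(by_class):
        members = list(by_class[label])
        rng.shuffle(members)
        n_train = int(round(train_fraction * len(members)))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


# Made-up toy data: (id, one binary descriptor, class), 3 balanced classes.
toy_points = [("cmpd%d" % i, i % 2, i % 3) for i in range(1500)]
train_set, test_set = stratified_split(toy_points)
print(len(train_set), len(test_set))
```

Fixing the seed keeps the split reproducible, so that runs with different
nTries values are compared on the same training and test sets.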
In all cases, the statistics are really bad: about 50 percent of the
compounds are misclassified, e.g.:
"
*** Vote Results ***
misclassified: 580/1180 (%49.15) 580/1180 (%49.15)
average correct confidence: 0.7837
average incorrect confidence: 0.7528
"
Interestingly, there is only a very small difference between the average
confidence for the correct and the incorrect classifications.
As far as I understand, this tells me that the model is really bad, which
I already knew from the vote results themselves.
Which parameters are worthwhile to test?
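
As a quick sanity check, the numbers from the vote results can be compared
with a chance baseline. The sketch below only redoes the arithmetic; it
assumes uniform random guessing over 3 perfectly balanced classes, which is
an approximation of the "almost equal" distribution. Under that assumption
random guessing would misclassify about 67 % of the compounds, so the
observed rate of about 49 % is better than chance, though not by much.

```python
# Numbers copied from the vote results above.
n_wrong = 580
n_total = 1180
n_classes = 3

# Observed fraction of misclassified compounds.
error_rate = n_wrong / float(n_total)

# Expected error for uniform random guessing on balanced classes
# (an assumption; the real classes are only roughly balanced).
chance_error = 1.0 - 1.0 / n_classes

# Difference between average correct and incorrect confidence.
conf_gap = 0.7837 - 0.7528

print("observed error: %.4f" % error_rate)
print("chance error:   %.4f" % chance_error)
print("confidence gap: %.4f" % conf_gap)
```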
Cheers & Thanks,
Paul