Hi Igor,

On Thu, May 5, 2011 at 5:37 PM, Igor Filippov <igor.v.filip...@gmail.com> wrote:
>
> Thank you.
> Questions :)

let's see how many answers I have. ;-)

> 1) I'm getting the following error message:
> nms.remove('MolecularFormula')
> ValueError: list.remove(x): x not in list
>
> when I check the list of descriptor names indeed I don't see
> MolecularFormula,

You're probably using an older version of the RDKit.
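In the meantime you can guard the removal so the same script runs against older and newer versions. A minimal sketch; here `nms` is a stand-in for the list you'd build from `[x[0] for x in Descriptors._descList]`, and the names shown are just a hypothetical subset:

```python
# Stand-in for the RDKit descriptor-name list,
# i.e. nms = [x[0] for x in Descriptors._descList]
nms = ['MolWt', 'TPSA', 'MolLogP']  # hypothetical subset for illustration

# Guarded removal: works whether or not 'MolecularFormula' is present,
# so the same code runs against older and newer RDKit versions.
if 'MolecularFormula' in nms:
    nms.remove('MolecularFormula')
```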

> 2) I'm getting the warning message:
> [11:27:24] WARNING: The AvailDescriptors module is deprecated. Please
> switch to using the Descriptors module.

I'm aware of this one and will get it fixed for the next release. It's
safe to ignore.

> 3) Does the following mean that each descriptor is effectively
> binarized?
> # number of possible values of each descriptor:
> nPossible = [0]+[2]*ndescrs+[2]

Yes, that variable and boundsPerVar control it.
A pretty simple approach is used to find the best bounds for each
variable at each node in the decision tree.
There's a very brief description of the algorithm here:
Landrum, G.A., Penzotti, J.E. & Putta, S. Machine-learning models for
combinatorial catalyst discovery. Meas. Sci. Technol. 16, 270-277
(2005).

You could also do multiple bins for each descriptor, but that is more
time consuming.
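To make the single-bound idea concrete, here's a small sketch of picking the threshold on one descriptor that maximizes information gain. This is only an illustration of the general technique, not the RDKit implementation; the function names are mine:

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def best_bound(values, labels):
    """Pick the single threshold on one descriptor that maximizes
    information gain. Illustrative sketch, not RDKit's code."""
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    cands = sorted(set(values))
    # candidate thresholds: midpoints between adjacent distinct values
    for lo, hi in zip(cands, cands[1:]):
        t = (lo + hi) / 2.0
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = base \
            - (len(left) / len(labels)) * entropy(left) \
            - (len(right) / len(labels)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```

A perfectly separable toy case, e.g. `best_bound([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])`, picks the midpoint 0.5 with a gain of one full bit.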

> 4) Is this the correct way to build RandomForest? The wiki page sadly
> stops at "bag of decision trees", I tried to extrapolate example from
> RandomForest for binary fingerprint:
> cmp.Grow(pts, attrs=attrs, nPossibleVals=nPossible, nTries=100,
>          randomDescriptors=20,
>          buildDriver=CrossValidate.CrossValidationDriver,
>          treeBuilder=QuantTreeBoot, needsQuantization=False,
>          nQuantBounds=boundsPerVar, maxDepth=100)

That's correct.

> I'm specifically concerned that needsQuantization=False (should it be
> True in this case?), also maxDepth parameter - from what I understand
> the randomForest trees should not be pruned, why is there maxDepth
> parameter at all?

If you don't set the maxDepth parameter, it will grow trees as large
as it can. If you do set it, it stops growing at the specified depth
(this is not pruning, it's just stopping the growth). I usually use
the maxDepth parameter because I don't think I've ever worked on a
problem where either the data were reliable enough or the descriptors
were good enough to justify building really huge trees. If we know the
deeper bits of the tree are just contributing noise, why not just
save a bunch of time by not generating them at all?
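The stop-versus-prune distinction can be sketched in a few lines: growth simply halts at the depth limit, so the deeper subtrees are never built in the first place. A toy illustration (my own names, not the RDKit tree builder), splitting on the first descriptor at its median:

```python
from collections import Counter

def grow_tree(pts, labels, depth=0, max_depth=3):
    """Depth-limited tree growth sketch (illustrative, not RDKit's
    builder). Growth stops -- rather than being pruned afterwards --
    once max_depth is reached or the node is pure."""
    counts = Counter(labels)
    majority = counts.most_common(1)[0][0]
    if depth >= max_depth or len(counts) == 1:
        return {'leaf': majority}
    # toy split: threshold on the first descriptor at its median
    vals = sorted(p[0] for p in pts)
    t = vals[len(vals) // 2]
    left = [(p, l) for p, l in zip(pts, labels) if p[0] < t]
    right = [(p, l) for p, l in zip(pts, labels) if p[0] >= t]
    if not left or not right:
        return {'leaf': majority}
    return {'bound': t,
            'left': grow_tree([p for p, _ in left],
                              [l for _, l in left], depth + 1, max_depth),
            'right': grow_tree([p for p, _ in right],
                               [l for _, l in right], depth + 1, max_depth)}
```

With `max_depth=0` the root itself becomes a majority-vote leaf; with a larger budget the same data grow a real split, and nothing below the limit is ever generated.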

> 5) How do I suppress output of
> Cycle:    0
> Cycle:   10
> ...
> ?

add the argument: silent=True

>
> 6) The wiki seems to stop abruptly at
> "Composite models can also be pickled to disk and then reloaded to
> classify new points:"

Yeah, the wiki has some formatting (and content) errors. I will work
on cleaning them up.

> 7) For some strange reason the model I build in this way predicts
> nothing but zeros! The previous model - with fingerprints was predicting
> something at least...
> Do I still use ClassifyExample similar to as before?
> cmp.ClassifyExample([i]+list(test_descrs[i])+[act])

That looks correct to me and works when I try it with my test case:
[7]>>> mdl = cPickle.load(file('mdl.pkl','rb'))

[8]>>> mdl.ClassifyExample(pts[0])
Out[8]: (1, 0.83999999999999997)

Are you sure that list(test_descrs[i]) is correct?
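One quick sanity check before classifying: make sure each example row really has the `[id] + descriptors + [activity]` layout the model was trained on, with no NaN descriptor values. A hypothetical helper, not part of RDKit:

```python
import math

def check_row(row, ndescrs):
    """Sanity-check one example row of the form
    [id] + descriptors + [activity].
    (Hypothetical helper for debugging, not part of RDKit.)"""
    assert len(row) == ndescrs + 2, "unexpected row length"
    for v in row[1:-1]:
        # a NaN descriptor silently poisons downstream comparisons
        assert not (isinstance(v, float) and math.isnan(v)), "NaN descriptor"
    return True
```

If a model predicts nothing but one class, a row length that doesn't match what was used at training time is a common culprit.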

-greg

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss