Re: [Rdkit-discuss] RDKit workflow in KNIME

2016-10-24 Thread Simon Saubern
Stuart,

The PAINS file is available in the RDKit GitHub repository. If that's 
too complicated to deal with at this early stage, try some of the 
workflows on myexperiment.org:

http://www.myexperiment.org/workflows/1841.html (embedded file)
or
http://www.myexperiment.org/workflows/4748.html (just in a table)

Simon

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] updated SMARTS filters for PAINS

2015-08-26 Thread Simon Saubern
I have the original Sybyl output from Jonathan. It's not in the most 
friendly format. All I did was run a few sed commands over it to extract 
the ID numbers, and also compile some frequency tables per PAINS query.
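
For anyone who would rather do the extraction in Python than sed, here is a minimal sketch. The input format is an assumption (one hit per line, a WEHI registration ID followed by the name of the PAINS query it hit); the real Sybyl output is not shown in this thread, so the regular expression would need adjusting. The sample IDs and query names are made up.

```python
import re
from collections import Counter

# Assumed line shape: "<WEHI id> <PAINS query name>". The real Sybyl
# output format is not shown in the thread, so adjust this pattern.
HIT_RE = re.compile(r"(WEHI-\d+)\s+(\S+)")

def tabulate_hits(lines):
    """Return (sorted unique IDs, Counter of hits per PAINS query)."""
    ids = set()
    per_query = Counter()
    for line in lines:
        m = HIT_RE.search(line)
        if m is None:
            continue  # skip headers, blanks, anything unrecognised
        reg_id, query = m.groups()
        ids.add(reg_id)
        per_query[query] += 1
    return sorted(ids), per_query

# Made-up sample lines, for illustration only.
sample = [
    "WEHI-0000001 query_A",
    "WEHI-0000002 query_A",
    "WEHI-0000002 query_B",
    "some header line",
]
ids, freq = tabulate_hits(sample)
```

`ids` is the de-duplicated ID list and `freq.most_common()` gives the frequency table, most-hit query first.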


I've sent a zip file to you directly.

Simon

On 26/08/2015 15:20, Greg Landrum wrote:

Thanks for that.
Do you have a version that says which of the molecules hit which 
PAINS? That would really help with the refinement.


-greg




--

CSIRO Manufacturing Flagship,   phone: +61 3 9545-
Bag 10,                         fax: +61 3 9545-2453
Clayton South VIC 3169,         http://www.csiro.au/manufacturing
Australia                       mailto:simon.saub...@csiro.au



Re: [Rdkit-discuss] updated SMARTS filters for PAINS

2015-08-25 Thread Simon Saubern

Attached is the original list from Jonathan of the 861 SLN hits.

S.

On 26/08/2015 13:08, Greg Landrum wrote:


On Wed, Aug 26, 2015 at 2:32 AM, Simon Saubern <simon.saub...@csiro.au> 
wrote:


Thanks for doing this Greg.

Fixing those SMARTS queries always looked like it would be a
real...pain.


:-)

I've dropped your GitHub file into the KNIME workflow, and the
RDKit version of the workflow (using RDKit nodes
2.5.0.201505221301) now hits 770 structures in the WEHI-10k test set.


For what it's worth, I now get 888 matches across the WEHI-10K set 
when running my Python test script. I am not 100% sure that the KNIME 
nodes are doing (or can do) the mergeQueryHs step; that's something 
else for me to follow up on.


But that includes 19 false positives that weren't being caught by
the SLN filters.

One filter alone is responsible for 17 of those false positives:

anil_di_alk_C(246)
old: c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7](-[#6;X4])-[$([#1]),$([#6;X4])]
new: c:1:c:c(:c:c:c:1-[#8]-[#6;X4])-[#7;!H0,$([#7]-[#6;X4])]-[#6;X4]

An example of one of the false positive structures is the aniline
sulfonamide WEHI-18518.

I've checked with Jonathan, and the intention of that query is
that the nitrogen has a single bond to a carbon that has four
atoms bonded to it (i.e. sp3), and that the other atom singly
bonded to the nitrogen is either H or an sp3 carbon.

So sulfonamides should not match, and neither should some of the
acetamides (sp2 C) that are showing up as hits.


Thanks for pointing that out and providing the clarification about 
what is expected!

I just committed a fix for this:
https://github.com/rdkit/rdkit/commit/e2487ffe79c393a6b0e472882bfb6eb66a3bcb8b

As an aside: If you could provide a text file that has the matches 
found for each pattern in the WEHI-10k test set when you use the SLN 
version of the PAINS, I would be very happy to use that to further 
refine these patterns and to incorporate those results into the tests.


-greg





WEHI-0002757
WEHI-0003718
WEHI-0004345
WEHI-0005047
WEHI-0005752
WEHI-0006137
WEHI-0006195
WEHI-0006607
WEHI-0006892
WEHI-0007328
WEHI-0007435
WEHI-0007798
WEHI-0008187
WEHI-0008558
WEHI-0009314
WEHI-0011538
WEHI-0011957
WEHI-0012384
WEHI-0012615
WEHI-0012702
WEHI-0012773
WEHI-0012790
WEHI-0012829
WEHI-0012939
WEHI-0013053
WEHI-0013276
WEHI-0013384
WEHI-0013507
WEHI-0013892
WEHI-0013909
WEHI-0014006
WEHI-0014370
WEHI-0014546
WEHI-0014816
WEHI-0014836
WEHI-0014902
WEHI-0014937
WEHI-0015069
WEHI-0015806
WEHI-0015833
WEHI-0016142
WEHI-0016145
WEHI-0016287
WEHI-0016293
WEHI-0016316
WEHI-0016680
WEHI-0016735
WEHI-0016897
WEHI-0016957
WEHI-0016962
WEHI-0016985
WEHI-0017369
WEHI-0017518
WEHI-0017809
WEHI-0017964
WEHI-0018024
WEHI-0018269
WEHI-0018910
WEHI-0018941
WEHI-0018980
WEHI-0019026
WEHI-0019035
WEHI-0019132
WEHI-0019903
WEHI-0020161
WEHI-0020193
WEHI-0020284
WEHI-0020337
WEHI-0020458
WEHI-0020926
WEHI-0020933
WEHI-0020934
WEHI-0020935
WEHI-0020941
WEHI-0021184
WEHI-0023024
WEHI-0023287
WEHI-0023407
WEHI-0023681
WEHI-0023788
WEHI-0023867
WEHI-0023878
WEHI-0023997
WEHI-0024471
WEHI-0024472
WEHI-0024647
WEHI-0024825
WEHI-0024863
WEHI-0024880
WEHI-0024921
WEHI-0025079
WEHI-0025267
WEHI-0025330
WEHI-0025376
WEHI-0025383
WEHI-0025388
WEHI-0025503
WEHI-0025579
WEHI-0025580
WEHI-0025582
WEHI-0025928
WEHI-0026032
WEHI-0026074
WEHI-0026076
WEHI-0026387
WEHI-0026861
WEHI-0026867
WEHI-0027388
WEHI-0027950
WEHI-0028261
WEHI-0028555
WEHI-0029002
WEHI-0029119
WEHI-0029150
WEHI-0029798
WEHI-0030010
WEHI-0030096
WEHI-0030547
WEHI-0030565
WEHI-0030575
WEHI-0030680
WEHI-0030930
WEHI-0030934
WEHI-0030951
WEHI-0030982
WEHI-0031003
WEHI-0031038
WEHI-0031099
WEHI-0031466
WEHI-0031501
WEHI-0031558
WEHI-0031567
WEHI-0031580
WEHI-0031588
WEHI-0031724
WEHI-0031740
WEHI-0031760
WEHI-0031812
WEHI-0031877
WEHI-0031964
WEHI-0032008
WEHI-0032062
WEHI-0032098
WEHI-0032137
WEHI-0032203
WEHI-0032316
WEHI-0032441
WEHI-0032550
WEHI-0032578
WEHI-0032654
WEHI-0032721
WEHI-0032885
WEHI-0032911
WEHI-0033083
WEHI-0033129
WEHI-0033323
WEHI-000
WEHI-005
WEHI-0033533
WEHI-0033701
WEHI-0033845
WEHI-0033898
WEHI-0033908
WEHI-0033945
WEHI-0034021
WEHI-0034271
WEHI-0034396
WEHI-0034445
WEHI-0034452
WEHI-0034461
WEHI-0034530
WEHI-0034703
WEHI-0034822
WEHI-0034838
WEHI-0034845
WEHI-0035236
WEHI-0035238
WEHI-0035255
WEHI-0035272
WEHI-0035277
WEHI-0035450
WEHI-0035595
WEHI-0035597
WEHI-0035630
WEHI-0035869
WEHI-0035912
WEHI-0036028
WEHI-0036184
WEHI-0036307
WEHI-0036313
WEHI-0036341
WEHI-0036510
WEHI-0036533
WEHI-0036558
WEHI-0036724
WEHI-0036737
WEHI-0036751
WEHI-0036982
WEHI-0037607
WEHI-0037998
WEHI-0038095
WEHI-0038687
WEHI-0038931
WEHI-0039118
WEHI-0039383
WEHI-0039450
WEHI-0039487
WEHI-0039519
WEHI-0039633
WEHI-004
WEHI-0040073
WEHI-0040109
WEHI-0040114
WEHI-0040305
WEHI-0040558
WEHI-0040725
WEHI-0040837
WEHI-0041008
WEHI-0041069
WEHI-0041267
WEHI-0041687
WEHI-0041730
WEHI

Re: [Rdkit-discuss] PAINS

2013-04-25 Thread Simon Saubern
Nicholas, this is an O(n^2) problem (many-to-many) and difficult to make 
efficient. It is, however, 'embarrassingly parallel' so you can take 
advantage of multiple cores.
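
A rough sketch of what "embarrassingly parallel" means in practice: split the molecule list into chunks, test each chunk against every query in a separate worker process, and concatenate the results. This uses only the Python standard library. The matcher here (`substring_match`) is a hypothetical stand-in that does a plain substring test so the sketch runs anywhere; in real use it would be replaced by a proper substructure search. The function names and the chunk size are illustrative assumptions, not anything from the workflows below.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def substring_match(molecule, query):
    """Stand-in matcher (plain substring test), NOT a substructure search."""
    return query in molecule

def match_chunk(args):
    """Test every molecule in one chunk against every query."""
    molecules, queries = args
    return [(mol, q) for mol in molecules for q in queries
            if substring_match(mol, q)]

def match_all(molecules, queries, chunk_size=1000, workers=None):
    """Farm the chunks out to worker processes and collect the hits."""
    jobs = [(chunk, queries) for chunk in chunked(molecules, chunk_size)]
    hits = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in job order, so hits stay ordered.
        for partial in pool.map(match_chunk, jobs):
            hits.extend(partial)
    return hits
```

The total work is still molecules times queries; the chunks simply do not depend on each other, so they can run on separate cores. Call `match_all` from under an `if __name__ == "__main__":` guard so that worker processes can import the module cleanly.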

Have a look at how these 2 KNIME workflows implement the PAINS filters 
with the RDKit nodes in KNIME:

http://www.myexperiment.org/workflows/1841.html
http://www.myexperiment.org/workflows/2485.html

-- 

Cheers,

Simon




Re: [Rdkit-discuss] KNIME - Treatment of H in 2.0.0.1061 nodes

2011-09-30 Thread Simon Saubern
So the 2.0.0.1088 nodes now generate 636 matches and only 2 false positives:

WEHI-0054407
  SMILES: S=C1N(C(=C(C2=C1CN(C(C2)(C)C)C)C#N)N)C
  SMARTS: [#6](-[#1])(-[#1])-[#7]([#6]:[#6])~[#6][#6]=,:[#6]-[#6]~[#6][#7]
  filter: dyes5A(27)

WEHI-0063070
  SMILES: N1C(=NC=C(C1=O)C)NN=Cc2ccc(cc2)N(C)C
  SMARTS: [#6](-[#1])(-[#1])-[#7](-[#6](-[#1])-[#1])-c:1:c(:c(:c(:c(:c:1-[#1])-[#1])-[#6](-[#1])=[#7]-[#7]-[$([#6](=[#8])-[#6](-[#1])(-[#1])-[#16]-[#6]:[#7]),$([#6](=[#8])-[#6](-[#1])(-[#1])-[!#1]:[!#1]:[#7]),$([#6](=[#8])-[#6]:[#6]-[#8]-[#1]),$([#6]:[#7]),$([#6](-[#1])(-[#1])-[#6](-[#1])-[#8]-[#1])])-[#1])-[#1]
  filter: hzone_anil_di_alk(35)


-- 

Cheers,

Simon



[Rdkit-discuss] KNIME - Treatment of H in 2.0.0.1061 nodes

2011-09-27 Thread Simon Saubern

Hi Greg,

The recent updates to the way explicit hydrogens are handled in the 
RDKit nodes for KNIME (http://goo.gl/DK0FS) have dramatically increased 
the number of correct matches that we observe when using the PAINS 
filters workflow (http://goo.gl/T9mT2).


Against the reference set from WEHI, we're now seeing 652 matches (up 
from 329), but we also now get 231 false positives where we were 
getting none before.
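
For reference, one way to arrive at counts like these is a plain set comparison of the IDs our workflow flags against the reference list. This is a minimal sketch only; it assumes a "false positive" is a structure we flag that the reference does not, and the IDs in the example are made up.

```python
def compare_to_reference(hits, reference):
    """Split flagged IDs into matches, false positives and misses."""
    hits, reference = set(hits), set(reference)
    return {
        "matches": hits & reference,          # flagged by both
        "false_positives": hits - reference,  # flagged only by us
        "missed": reference - hits,           # flagged only by the reference
    }

# Made-up IDs, for illustration only.
ours = ["WEHI-0000001", "WEHI-0000002", "WEHI-0000003"]
ref = ["WEHI-0000002", "WEHI-0000003", "WEHI-0000004"]
summary = compare_to_reference(ours, ref)
counts = {name: len(ids) for name, ids in summary.items()}
```

Comparing at the level of (ID, filter) pairs instead of bare IDs works the same way if the inputs are tuples.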


Attached is a tab-separated file containing the mismatches (regID, 
smiles, smarts, smartsID).


The SMARTS strings come from Raj's blog: http://blog.rguha.net/?p=850

Let us know if you need additional info to diagnose what's going on.
--

Cheers,

Simon

[Attachment: RDKIT2-231.txt]