Re: [Rdkit-discuss] Oracle, pypl and rdkit

Jan Holst Jensen Fri, 13 Mar 2015 05:19:17 -0700

Hi Michal,

> I'm going to use fingerprints, but I was hoping to use UTL_RAW bitwise 
> operation to
> handle them (we'll see how this goes).


That sounds like something I would love to hear about :-).

> What worries me that invoking structure matching via PYPL for each
> molecule would be slow, do you see any way of doing it batchwise?

I would try to bulk collect structures with PL/SQL and pass them as an 
array or list to a PYPL function. You have to encode the structures as 
text and you will be limited by PYPL's max. 32K of text per call. So the 
input has to be split into chunks of max. 32K data.

The 32K data limitation per roundtrip can in principle be removed by 
implementing direct data access to tables in the C-code of the PYPL 
cartridge. However, since the maximum data transfer unit of Oracle is 
32K (at least it was in 10g) I expect that you would get a roundtrip 
overhead for each 32K of data anyway. So the benefit of implementing 
direct table access in the cartridge might be less than expected.

Then again, if such functionality was added to the cartridge the 
structures could be passed more directly to RDKit without encoding and 
text parsing by the Python interpreter, so... But I would see how far I 
could get with a combination of PL/SQL and plain PYPL before adding code 
to the cartidge.

Cheers
-- Jan

On 2015-03-13 10:29, Michal Krompiec wrote:
> Hi Jan and TJ,
> Thank you very much for your comments. Yes, I'm going to use
> fingerprints, but I was hoping to use UTL_RAW bitwise operation to
> handle them (we'll see how this goes).
> What worries me that invoking structure matching via PYPL for each
> molecule would be slow, do you see any way of doing it batchwise? (for
> example, using oracle's table functions)
> Best wishes,
> Michal
>
> On 13 March 2015 at 07:50, Jan Holst Jensen <[email protected]> wrote:
>> Hi Michal and TJ,
>>
>> The nice thing about Postgres extensions is that they are loaded directly
>> into the session's process space. Therefore the overhead is minimal, almost
>> non-existing. Not so with Oracle cartridges/extensions that are loaded in a
>> separate process, the extproc process.
>>
>> The overhead per call into PYPL is on the order of tens of microseconds,
>> which could be a lot or not, depending on how many calls you do and what
>> kind of calls.
>>
>> I have tried to do a naïve SSS search with PYPL and HasSubstructMatch() on a
>> database of 70 000 compounds (seventy thousand) and it took several minutes
>> to complete so it was not really usable. If you need any kind of speed you
>> need to use fingerprints to find an initial hit list, and you need to pass
>> fingerprints in bulk to PYPL to avoid too much call overhead.
>>
>>> Do consecutive pypl calls always share the same interpreter?
>> On Oracle 10g and 11g, yes. I do have a disclaimer that it might not be the
>> case if you run shared server, but in my experience even shared server
>> ensures that each session gets its own private instance of an interpreter
>> (its own extproc process). And, if you run a multi-threaded extproc
>> configuration then there are no guarantees, but I don't know anyone who does
>> that.
>>
>> On 12c I just don't know yet. The little I have done with it seems to
>> indicate that it behaves like 10 and 11, so looking good so far.
>>
>> Cheers
>> -- Jan
>>
>> On 2015-03-13 00:43, TJ O'Donnell wrote:
>>
>> I've implemented a suite of rdkit functions
>> for postgres using plpython
>> https://github.com/tjod/rdchord
>> and the overhead is minimal
>> since most of the heavy lifting of substructure searching
>> is done by rdkit.
>>
>> I think the same would be true of oracle.
>> -------------------
>> TJ O'Donnell
>>
>> On Thu, Mar 12, 2015 at 4:24 PM, Michal Krompiec <[email protected]>
>> wrote:
>>> Hello, has anybody tried to implement substructure searching in an Oracle
>>> database using PYPL and RDKit? Is it just a matter of writing a wrapper
>>> function for molecule.HasSubstructMatch(pattern) or is the overhead of
>>> calling pypl each time too costly timewise? Do consecutive pypl calls always
>>> share the same interpreter?
>>> Best wishes,
>>> Michal
>>>
>>>


------------------------------------------------------------------------------
Dive into the World of Parallel Programming The Go Parallel Website, sponsored
by Intel and developed in partnership with Slashdot Media, is your hub for all
things parallel software development, from weekly thought leadership blogs to
news, videos, case studies, tutorials and more. Take a look and join the 
conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Oracle, pypl and rdkit

Reply via email to