[Rdkit-discuss] New module for RDKit - PANDAS integration

2013-04-19 Thread Nikolas Fechner
Dear all,
We developed a new module (rdkit.Chem.PandasTools) that allows RDKit
molecule objects to be used directly in pandas dataframes. Pandas
(http://pandas.pydata.org/) is a Python library that offers table-like
data containers, which are incredibly useful for anything related to data mining.
Moreover, it integrates nicely with the IPython notebook, producing rendered HTML
tables for the dataframes. The RDKit integration provides molecule-type
columns and functionality to perform substructure-based row filtering directly
on the pandas table. Additionally, if a dataframe is exported as HTML or shown
within an IPython notebook, the molecules in the table are rendered as 2D
structures.

The new module is available in the current SF trunk and contains a doctest
header that provides examples of how to use it.

I hope some of you find that interesting. As always, bug reports, comments,
ideas... are very much appreciated.

Best,
Nikolas

--
Precog is a next-generation analytics platform capable of advanced
analytics on semi-structured data. The platform includes APIs for building
apps and a phenomenal toolset for data science. Developers can use
our toolset for easy data analysis & visualization. Get a free account!
http://www2.precog.com/precogplatform/slashdotnewsletter
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] New module for RDKit - PANDAS integration

2013-04-19 Thread Greg Landrum
I think Nikolas is being a bit modest... the Pandas integration is
pretty cool. :-)

Here's an example of using it from the IPython prompt (it's better in
the notebook, but that doesn't paste so nicely into email)

Loading an SD file:

In [1]: from rdkit import Chem

In [2]: from rdkit.Chem import PandasTools

In [3]: import pandas as pd

In [4]: df = PandasTools.LoadSDF('hERG_inhibition_dataset.sdf',includeFingerprints=True)

In [5]: df
Out[5]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 242 entries, 0 to 241
Data columns:
ACTIVITY_CLASS    242  non-null values
CompoundName      242  non-null values
ID                242  non-null values
MDLPublicKeys     242  non-null values
SMILES            242  non-null values
pIC50             242  non-null values
ROMol             242  non-null values
dtypes: object(7)
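
For readers without the hERG file at hand, here is a minimal sketch of the same loading step on a throwaway SD file. The two molecules, their names, and the pIC50 values are made up for illustration, and the file is written to a temporary directory first; only PandasTools.LoadSDF itself comes from the session above.

```python
import os
import tempfile

from rdkit import Chem
from rdkit.Chem import PandasTools

# Hypothetical toy data: (name, SMILES, pIC50 as text). Not from the hERG set.
records = [
    ("trimethylamine", "CN(C)C", "5.1"),
    ("ethanol", "CCO", "4.2"),
]

# Write a small SD file so that there is something to load.
path = os.path.join(tempfile.mkdtemp(), "toy.sdf")
writer = Chem.SDWriter(path)
for name, smiles, activity in records:
    mol = Chem.MolFromSmiles(smiles)
    mol.SetProp("_Name", name)       # molecule title, loaded into the ID column
    mol.SetProp("pIC50", activity)   # SD data field, loaded as its own column
    writer.write(mol)
writer.close()

# One row per molecule; the molecule objects end up in the ROMol column.
df = PandasTools.LoadSDF(path)
print(df[["ID", "pIC50"]])
```

Each SD data field becomes a column, alongside ID and ROMol, which matches the column listing in Out[5].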


And doing a substructure search:

In [6]: N3s = df[df['ROMol']>=Chem.MolFromSmiles('N(C)(C)C')]

In [7]: N3s
Out[7]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 177 entries, 0 to 239
Data columns:
ACTIVITY_CLASS    177  non-null values
CompoundName      177  non-null values
ID                177  non-null values
MDLPublicKeys     177  non-null values
SMILES            177  non-null values
pIC50             177  non-null values
ROMol             177  non-null values
dtypes: object(7)
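
The >= filter in In [6] can also be spelled out by hand. The sketch below builds a boolean mask with HasSubstructMatch and indexes the dataframe with it; it should select the same rows, but it is an explicit equivalent for illustration and not necessarily how PandasTools implements the operator. The four SMILES are invented toy data.

```python
import pandas as pd
from rdkit import Chem

# Hypothetical toy table; the real session loads 242 compounds from an SD file.
df = pd.DataFrame({"SMILES": ["CN(C)C", "CCO", "CCN(CC)CC", "c1ccccc1"]})
df["ROMol"] = df["SMILES"].apply(Chem.MolFromSmiles)

# Same query as in the session: a nitrogen with three carbon neighbours.
query = Chem.MolFromSmiles("N(C)(C)C")

# True for every row whose molecule contains the query as a substructure.
mask = df["ROMol"].apply(lambda mol: mol.HasSubstructMatch(query))
N3s = df[mask]
print(N3s["SMILES"].tolist())
```

Only the two tertiary amines are kept, and the surviving rows retain their original index labels, which is why Out[7] reports 177 entries spread over the range 0 to 239.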

Because I used the "includeFingerprints" argument, that actually did
the search using a substructure fingerprint to speed things up. This
is using the Avalon fingerprint at the moment, but that will change
between now and the release so as not to add an additional dependency.
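
To illustrate the fingerprint screening idea, here is a rough sketch that uses RDKit's pattern fingerprint as a stand-in for the Avalon fingerprint (an assumption on my side, chosen because it needs no extra dependency). The helper name substructure_hits and the toy molecules are hypothetical and not part of PandasTools.

```python
from rdkit import Chem, DataStructs


def substructure_hits(mols, query):
    """Return the molecules that contain query, screening by fingerprint first."""
    query_fp = Chem.PatternFingerprint(query)
    hits = []
    for mol in mols:
        # In real use this fingerprint would be computed once and stored.
        mol_fp = Chem.PatternFingerprint(mol)
        # Cheap screen: a molecule can only contain the query if every bit
        # set in the query fingerprint is also set in its own fingerprint.
        if not DataStructs.AllProbeBitsMatch(query_fp, mol_fp):
            continue
        # The screen can let false positives through, so confirm properly.
        if mol.HasSubstructMatch(query):
            hits.append(mol)
    return hits


mols = [Chem.MolFromSmiles(s) for s in ["CN(C)C", "CCO", "CCN(CC)CC", "c1ccccc1"]]
query = Chem.MolFromSmiles("N(C)(C)C")
hits = [Chem.MolToSmiles(m) for m in substructure_hits(mols, query)]
print(hits)
```

The speed-up comes from skipping the full graph match for molecules that fail the bit test; storing the fingerprints next to the molecules, as includeFingerprints presumably does, avoids recomputing them for every query.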

-greg
