Thanks Marc, we'll take this offlist...
... copying my colleagues

On Fri, Oct 10, 2014 at 3:52 PM, Marc Davis <[email protected]> wrote:

> Peter, being an Organofluorine Chemist, this is precisely what we are
> seeking - being able to extract PDFs that contain organic structures along
> with text and tables,


We are actively developing this. See
https://bitbucket.org/petermr/xhtml2stm/wiki/Home (ChemVisitor) where we
are extracting molecules from documents where the PDF contains Paths, not
Pixels. Paths result when the author (human or program) has used a vector
drawing tool (ChemDraw does this and so do most others).

The publisher MAY carry the vector information through to the PDF
(BioMedCentral and NaturePublishingGroup do this) while other publishers
(e.g. Am. Chem. Soc.) translate it to pixel images. The first case is easier
but we have made progress with interpreting the second. The result is
ChemicalMarkupLanguage, which is XML.
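
To give a flavour of what "CML, which is XML" means in practice, here is a
minimal sketch in Python (standard library only). It is not output from our
tools: the molecule (fluoromethane, hydrogens omitted) and the function names
are made up for illustration, and it uses only the basic CML elements
molecule, atomArray, atom, bondArray and bond.

```python
import xml.etree.ElementTree as ET

# CML namespace URI
CML_NS = "http://www.xml-cml.org/schema"


def fluoromethane_cml():
    """Return a tiny CML document for a C-F fragment (illustrative only)."""
    ET.register_namespace("", CML_NS)
    mol = ET.Element(f"{{{CML_NS}}}molecule", {"id": "m1"})
    atoms = ET.SubElement(mol, f"{{{CML_NS}}}atomArray")
    ET.SubElement(atoms, f"{{{CML_NS}}}atom", {"id": "a1", "elementType": "C"})
    ET.SubElement(atoms, f"{{{CML_NS}}}atom", {"id": "a2", "elementType": "F"})
    bonds = ET.SubElement(mol, f"{{{CML_NS}}}bondArray")
    # atomRefs2 names the two atoms joined by the bond; order "1" is a single bond
    ET.SubElement(bonds, f"{{{CML_NS}}}bond", {"atomRefs2": "a1 a2", "order": "1"})
    return ET.tostring(mol, encoding="unicode")


def element_types(cml):
    """Read the element symbols back out of a CML string."""
    root = ET.fromstring(cml)
    return [atom.get("elementType") for atom in root.iter(f"{{{CML_NS}}}atom")]


if __name__ == "__main__":
    cml = fluoromethane_cml()
    print(cml)
    print(element_types(cml))
```

Because the atoms and bonds are explicit, any XML tool can consume the
result downstream without re-interpreting a drawing.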

> we need to place extract this data and transfer into readable Word docs.


machine-readable or human-readable or both?


>   I guess in this case, since it’s XML, the .docx format is a lot easier
> to create.
>

Not really. Modern Word and Word-like files use XML as their basis. If they
are well created they can contain semantic objects (spectra, etc.). But first
we have to extract those.
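
As a rough illustration of "XML as the basis": a .docx is a ZIP package
whose main part is word/document.xml. The sketch below (Python, standard
library only) builds a toy package in memory and reads the paragraph text
back. The helper names are hypothetical and the toy package is NOT a complete
.docx that Word would open; it only shows where the XML lives.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML main namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def toy_package(text):
    """Build a ZIP holding only word/document.xml (illustrative, incomplete)."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r><w:t>'
        f"{escape(text)}"
        "</w:t></w:r></w:p></w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as package:
        package.writestr("word/document.xml", body)
    return buf.getvalue()


def docx_text(data):
    """Return the text of each paragraph found in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


if __name__ == "__main__":
    print(docx_text(toy_package("CF3 group")))
```

Note that this XML describes paragraphs and runs, i.e. layout, not chemistry.
The semantics still have to be extracted first and then put in.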


> Thanks,
> Marc
>
>
The main problem is actually sociopoliticolegal.  Until this year it has
not been clear whether it's legal to extract factual material from
copyrighted documents. Now, in the UK, it IS - assuming it's used for
non-commercial research. So we are starting to do this on a - hopefully -
massive scale and generating a whole new research area - knowledge-driven
research.

The value of this for this list is that it validates all the hard work done
by list members in writing PDFBox. Because the process is now legal in the UK
there is more incentive to develop and publish downstream analytic tools,
and that's what we are doing (Apache2-Open, of course).



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
