Thanks Marc, we'll take this offlist... copying my colleagues

On Fri, Oct 10, 2014 at 3:52 PM, Marc Davis <[email protected]> wrote:
> Peter, being an Organofluorine Chemist, this is precisely what we are
> seeking - being able to extract PDFs that contain organic structures along
> with text and tables,

We are actively developing this. See
https://bitbucket.org/petermr/xhtml2stm/wiki/Home (ChemVisitor), where we are
extracting molecules from documents in which the PDF contains Paths, not
Pixels. Paths result when the author (human or program) has used a vector
drawing tool (ChemDraw does this and so do most others). The publisher MAY
carry the vector information through to the PDF (BioMedCentral and
NaturePublishingGroup do this), while other publishers (e.g. Am. Chem. Soc.)
translate these to pixel images. The first case is easier, but we have made
progress with interpreting the second. The result is ChemicalMarkupLanguage,
which is XML.

> we need to extract this data and transfer into readable Word docs.

machine-readable or human-readable or both?

> I guess in this case, since it’s XML, the .docx format is a lot easier
> to create.
>

Not really. Modern Word and Word-like files use XML as their basis. If they
are well created they can contain semantic spectra, etc. But first we have to
extract those.

> Thanks,
> Marc
>

The main problem is actually sociopoliticolegal. Until this year it has not
been clear whether it's legal to extract factual material from copyrighted
documents. Now, in the UK, it IS - assuming it's used for non-commercial
research. So we are starting to do this on a - hopefully - massive scale and
generating a whole new research area - knowledge-driven research.

The value of this for this list is that it validates all the hard work done
by list members in writing PDFBox. Because the process is now legal in the UK
there is more incentive to develop and publish downstream analytic tools, and
that's what we are doing (Apache2-Open, of course).

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

