On Tue, Jan 22, 2013 at 6:54 AM, Guillaume Bailleul <[email protected]> wrote:

> Hi All,
>
> Is there a way to validate a pdf with PDFBox? I mean to ensure that
> the document complies with the PDF Reference.
>
> My idea was to load the document and then ensure each xobject is
> parsed (retrieving each one). Is it a good way to do it ?
>

I am also very interested in this. In the PDF2SVG project (https://bitbucket.org/petermr/pdf2svg) we convert non-standard PDFs to Unicode characters and SVG. If the input were PDF-Reference-compliant (e.g. used the standard 14 fonts and Unicode) our job would be relatively easy. However, we are working with STM (Scientific, Technical, Medical) publications, which seem to be very non-compliant. Sadly, the worst compliance comes in the mathematical and symbol components.

Many fonts are proprietary, so we have developed heuristics, by manual inspection, which map their characters to Unicode. Other fonts derive from (say) Mathematical-PI, which uses proprietary codes (e.g. H11001 for "plus") and for which there is no published mapping. (There is a great tool, shapecatcher.com, which allows you to look up many Unicode characters from the glyph.)

In many cases it may be possible to re-emit compliant PDF (although my current primary interest is to determine the Unicode code point and do semantic processing). It should therefore be possible to create a PDFTidy tool which removes the non-compliance (cf. HTMLTidy). A little of the kerning might be lost, but I think standards are a good idea!

Question: is Symbol *necessary* in the PDF spec, or can equivalent functionality be found in Unicode code points?

So far I have hacked about 55 fonts, normally covering only the characters I discover in the wild. See
https://bitbucket.org/petermr/pdf2svg/src/905f2fa94bcf/src/main/resources/org/xmlcml/pdf2svg/fontFamilySets/nonStandardFontFamilySet.xml?at=default
and
https://bitbucket.org/petermr/pdf2svg/src/905f2fa94bcf5e8d3ea17eabd7bc94b53bd02ae8/src/main/resources/org/xmlcml/pdf2svg/codepoints?at=default
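
To make the per-font mapping idea concrete, here is a minimal sketch of such a lookup, written in Python for brevity. It is not the PDF2SVG code: the table layout, the font-family key "Mathematical-PI" and the function names `to_unicode` and `convert` are all assumptions for illustration. Only the H11001 -> "plus" entry comes from the text above.

```python
from typing import Iterable, Optional

# Hypothetical table: (font family, proprietary glyph code) -> Unicode character.
# Only the H11001 -> "plus" entry is taken from the message; a real table would
# be filled in by manual inspection, font by font.
GLYPH_TO_UNICODE = {
    ("Mathematical-PI", "H11001"): "\u002B",  # PLUS SIGN
}


def to_unicode(font_family: str, glyph_code: str) -> Optional[str]:
    """Return the Unicode character for a proprietary glyph code, or None if unmapped."""
    return GLYPH_TO_UNICODE.get((font_family, glyph_code))


def convert(font_family: str, glyph_codes: Iterable[str], unknown: str = "\uFFFD") -> str:
    """Map a run of glyph codes to text, marking unmapped codes with U+FFFD."""
    return "".join(to_unicode(font_family, code) or unknown for code in glyph_codes)
```

Unmapped codes are deliberately kept visible as U+FFFD (the replacement character) rather than dropped, so that characters still needing manual inspection are easy to spot in the output.
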

Any insight or contributions here would be very valuable - and please feel free to fork and develop it.

FWIW the next phase (SVGPlus) uses heuristics to recreate paragraphs and other objects (super/subscripts, maths equations, tables, semantic graphs). The third phase turns these into semantic chemistry, biology, etc. - all from the PDF.

P.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

