On Tue, Feb 5, 2013 at 6:36 PM, Andreas Lehmkuehler <[email protected]> wrote:
> Hi,
>
> On 05.02.2013 15:01, kulbhushan singh wrote:
>
>> Hi,
>>
>> I am trying to extract text from a pdf file with custom fonts but it is
>> giving me junk characters. The fonts used are ArialMT (embedded subset) &
>> Arial-BoldMT (embedded subset). The producer of the pdf file is GPL
>> Ghostscript 8.15. I am using PDFTextStripper to extract the text. How can
>> I do it for custom fonts? Any reference or solution would be appreciated.
>>
> Did you do the "adobe" test? [1]
>

Does this require buying Adobe Acrobat? Or is there a free version?

I have created heuristics for about 100 of these non-conformant fonts
( http://bitbucket.org/petermr/pdf2svg which uses PDFBox). If you mail me a
sample file I can see whether these would help. I have done several TeX
fonts (CMM etc.) but haven't done a Ghostscript one, and it would be useful.

But as Andreas says, ultimately these are probably non-conformant. A mixture
of heuristics and glyph analysis (OCR and/or heuristics) is required. Again,
PDF2SVG is addressing these; any community involvement is valued.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

