Re: Fonts in pdf to image conversion

Hamed Iravanchi Tue, 03 Apr 2012 23:58:07 -0700

Hi Nicklas,

I've been working on this issue for a while.
Right now, PDFBox cannot correctly convert PDF files created by OpenOffice or
LibreOffice to images.
In my tests, PDF files created by Microsoft Word do not have this problem
with the latest trunk code.


This happens because the image is rendered from the extracted text, rather
than from the code points.
Andreas used to reply to my emails so we could collaborate and resolve such
issues faster, but I haven't received any reply lately.
I don't know if I'm posting in the right place, though...

Anyway, to fix this issue for TrueType fonts (which are typically used in
your case), PDFBox should do the following:
- It should use code points for all TrueType fonts, instead of extracted
text
- The code points should be mapped to glyph codes using the font's cmap table
- Glyph codes should be used to draw text on the image.
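
To make the mapping step above a bit more concrete, here is a minimal
sketch of the idea. It is not the actual change to the trunk code and it
does not use any PDFBox classes: the class name, the method name and the
Map standing in for the font's cmap table are all hypothetical, chosen
only for illustration.

```java
import java.util.Map;

// Hypothetical illustration only: none of these names are PDFBox API.
final class GlyphMappingSketch {

    // Glyph 0 is the "missing glyph" (.notdef) in TrueType fonts.
    static final int NOTDEF_GLYPH = 0;

    /**
     * Maps the raw character codes read from the PDF content stream to
     * glyph codes, using a lookup table that stands in for the font's cmap.
     * Codes that have no cmap entry fall back to the .notdef glyph.
     */
    static int[] toGlyphCodes(int[] characterCodes, Map<Integer, Integer> cmap) {
        int[] glyphCodes = new int[characterCodes.length];
        for (int i = 0; i < characterCodes.length; i++) {
            Integer glyph = cmap.get(characterCodes[i]);
            glyphCodes[i] = (glyph != null) ? glyph : NOTDEF_GLYPH;
        }
        return glyphCodes;
    }
}
```

With Java 2D, an array like this could then be passed to
Font.createGlyphVector(FontRenderContext, int[]) and drawn with
Graphics2D.drawGlyphVector, so that the extracted text is never involved in
rendering. Whether that is the best way to wire it into PDFBox is exactly
what I would like feedback on.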

I just managed to fix this yesterday in my code for my sample PDF files, by
modifying the trunk code.
But I'm waiting for the developer team to collaborate so that I can make sure
what I'm doing is right and doesn't break other parts of PDFBox.

-Hamed


On Wed, Mar 28, 2012 at 11:15 AM, Nicklas Karlsson <[email protected]> wrote:

> Hi,
>
>  I'm using the latest LibreOffice to produce a PDF and the latest PDFBox
> to extract the pages as images but I'm having some problems with the fonts.
> If I use Times New Roman I get a
>
> org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
> Changing font on <test> from <Times New Roman> to the default font
>
>  If I embed some more exotic fonts in the PDF, I get a
>
> org.apache.pdfbox.util.PDFStreamEngine processOperator
> unsupported/disabled operation: BMC
> org.apache.pdfbox.util.PDFStreamEngine processOperator
> unsupported/disabled operation: EMC
> org.apache.pdfbox.util.PDFStreamEngine processOperator
> unsupported/disabled operation: BDC
> org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
> Changing font on <test> from <Algerian> to the default font
>
> This is all on the same machine. Is there a special trick in getting the
> fonts working?
>
> The extraction is done with something like
>
> PDDocument doc = PDDocument.load(pdf);
> List pages = doc.getDocumentCatalog().getAllPages();
> for (int i = 0; i < pages.size(); i++)
> {
>     PDPage page = (PDPage) pages.get(i);
>     pics.add(page.convertToImage());
> }
>
>
> Thanks in advance,
>  Nik
>
> --
> ---
> Nik
>
