Hello there,

It would be best to open a new issue in PDFBox's JIRA and summarize
your findings there:
https://issues.apache.org/jira

Please note that e-mail attachments don't survive in this mailing list.

> Hello. I want to extract text from pages and when I try to write it into a 
> new PDF, some characters are mixed up.
> I extract the text using the TextPosition objects that contain the actual 
> text strings, font, position etc.
>

You're dealing with a PDF document which contains Type1C fonts and has
been generated with pdfTeX-1.10b. This is a rather tricky combination.

> This is the important code that I use to write the text into the page:
> contentStream is a PDPageContentStream, te is a TextPosition,
> page is a PDPage
>
>
> contentStream.setFont(te.getFont(), te.getFontSizeInPt());
> contentStream.setTextMatrix(1, 0, 0, 1, te.getXDirAdj(),
>         page.getArtBox().getHeight() - te.getYDirAdj());
> contentStream.drawString(te.getCharacter());
>

The current Type1C font support has been tested with PDF text
extraction and rendering, but to my knowledge not with PDF generation.
The conversion from Java characters to raw bytes could be misbehaving.
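
To illustrate what I mean by a misbehaving character-to-byte
conversion, here is a minimal sketch in plain Java. It is not PDFBox's
actual code path; the class and method names are made up, and the
ISO-8859-1 charset is only an assumption standing in for any
single-byte encoding that lacks the glyph in question. A character
with no byte in the target encoding is silently replaced, so the
written PDF ends up pointing at the wrong glyph.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of PDFBox.
class CharToByteDemo {

    // Encode text as a single-byte Latin-1 writer would (an assumption
    // made for illustration only).
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Render the bytes as hex so that substitutions are easy to spot.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Plain ASCII survives the conversion: prints "28 78 29".
        System.out.println(hex(toLatin1("(x)")));
        // Greek small sigma has no Latin-1 byte and becomes '?': prints "3F".
        System.out.println(hex(toLatin1("\u03C3")));
    }
}
```

Dumping the bytes like this for the strings returned by
te.getCharacter() might help narrow down where the characters get
mixed up.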

Are you experiencing the same behaviour with PDFBox 0.8.0 (and earlier
versions)?

> It works for normal text, however there are problems with mathematical terms, 
> see the attachment please.
> The out.png has the converted page using pdftoimage; everything went fine 
> except that the sigma sign is missing. myresult.pdf on the other hand has 
> lots of font problems: nearly every special character is the root sign and if 
> it isn't the root sign, it's some other mixed character.

I rendered a couple of pages myself with PDFBox 1.0.1-SNAPSHOT and all
the Greek letters (deltas and sigmas) appeared to be correct. However,
all the parentheses were missing from the mathematical expressions.


VR
