[ 
https://issues.apache.org/jira/browse/PDFBOX-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845235#action_12845235
 ] 

Villu Ruusmann commented on PDFBOX-624:
---------------------------------------

It's in "The Type 2 Charstring Format" specification (Technical Note 
#5177), which can be retrieved from the Adobe site:
http://www.adobe.com/devnet/font/pdfs/5177.Type2.pdf

Chapter 3.2, "Charstring Number Encoding", contains the following statement:
"If the charstring byte contains the value 255, the next four bytes
indicate a two's complement signed number. The first of these
four bytes contains the highest order bits, the second byte
contains the next higher order bits and the fourth byte contains
the lowest order bits. This number is interpreted as a Fixed; that
is, a signed number with 16 bits of fraction."
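
To make the quoted rule concrete, here is a minimal sketch of decoding the four bytes that follow a 255 prefix byte. The class and method names are hypothetical and chosen for illustration only; this is not the FontBox code or the attached patch.

```java
// Hypothetical helper, not part of FontBox.
final class FixedOperand {

    // Combines the four bytes that follow the 255 prefix byte into a
    // two's complement 32-bit integer (highest order byte first) and
    // interprets it as a Fixed: 16 integer bits, 16 fraction bits.
    static double decodeFixed(int b1, int b2, int b3, int b4) {
        int raw = (b1 & 0xFF) << 24
                | (b2 & 0xFF) << 16
                | (b3 & 0xFF) << 8
                | (b4 & 0xFF);
        // Dividing by 2^16 moves the binary point 16 bits to the left.
        return raw / 65536.0;
    }
}
```

For example, the bytes 00 01 80 00 decode to 1.5 and the bytes FF FF 80 00 decode to -0.5. A parser that treats the 32-bit value as a plain integer would get 98304 and -32768 instead.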

The problem with this 5-byte number encoding is that it is very rarely used 
(this TeX-generated document is the only one where I've seen it). Most numbers 
are capped at around 2000, and they are represented using the 2-byte (value 
range -1131 .. +1131) or 3-byte (value range -32768 .. +32767) number encodings.
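
For comparison, the shorter encodings from the same chapter can be sketched as follows. The formulas are the ones given in Technical Note #5177; the class and method names are hypothetical and are not taken from FontBox.

```java
// Hypothetical helper, not part of FontBox.
final class ShortOperands {

    // First byte b0 in 32..246: one-byte form, values -107 .. +107.
    static int oneByte(int b0) {
        return b0 - 139;
    }

    // First byte b0 in 247..250: two-byte form, values +108 .. +1131.
    static int twoBytePositive(int b0, int b1) {
        return (b0 - 247) * 256 + b1 + 108;
    }

    // First byte b0 in 251..254: two-byte form, values -1131 .. -108.
    static int twoByteNegative(int b0, int b1) {
        return -(b0 - 251) * 256 - b1 - 108;
    }

    // First byte 28: the next two bytes form a signed 16-bit integer,
    // values -32768 .. +32767. The cast to short applies the sign.
    static int threeByte(int b1, int b2) {
        return (short) (((b1 & 0xFF) << 8) | (b2 & 0xFF));
    }
}
```

The largest two-byte operand is twoBytePositive(250, 255), which yields 1131, and the extremes of the three-byte form are threeByte(0x80, 0x00) = -32768 and threeByte(0x7F, 0xFF) = 32767.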

> Misplaced text
> --------------
>
>                 Key: PDFBOX-624
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-624
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox, Text extraction, Utilities
>    Affects Versions: 1.0.0
>            Reporter: Villu Ruusmann
>            Priority: Critical
>             Fix For: 1.1.0
>
>         Attachments: documenta_math-fixed.txt, documenta_math.pdf, 
> documenta_math.txt, documenta_math_page4-fixed.png, documenta_math_page4.png, 
> FontBox.patch
>
>
> Thomas Fischer reported to [email protected] that 
> org.apache.pdfbox.ExtractText interchanges typographic ligatures "fi" and 
> "fl". The sample document "documenta_math.pdf" was created using TeX and AFPL 
> Ghostscript 6.50.
> I used PDFBox 1.0.1-SNAPSHOT to verify this problem. The "fi" ligature 
> behaves correctly (i.e. text extraction yields "finite" and "infinite", not 
> "flnite" and "inflnite"), but the overall text layout is a complete mess. 
> Please see the PDF text extraction result "documenta_math.txt" and PDF 
> rendering result "documenta_math_page4.png".
> The cause of the horizontal text misplacement is not yet known. This could 
> affect all PDF documents which have been created using TeX.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
