[ 
https://issues.apache.org/jira/browse/PDFBOX-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726228#comment-15726228
 ] 

Tilman Hausherr commented on PDFBOX-3330:
-----------------------------------------

For the 2.0 FAQ:


Q: Why does the extracted text appear in the wrong sequence?

A: By default, text is extracted in the same sequence in which it appears in 
the PDF page content stream. PDF is a graphic format, not a text format, and 
unlike HTML, it does not require that the text on a page be rendered in any 
particular order. The order is whatever the software that created the PDF 
chose. To get text sorted from left to right and top to bottom, use 
{{setSortByPosition(true)}}.
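
To illustrate the difference between content stream order and position order, 
here is a minimal sketch in plain Java. It is not PDFBox code and not 
PDFBox's actual sorting algorithm: the {{Fragment}} class, its coordinates 
(with y growing downward) and the method name are assumptions made for this 
example only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PositionSortSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox class. */
    static final class Fragment {
        final String text;
        final float x;
        final float y; // assumption: y grows downward from the top of the page

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Takes fragments in the order they were written to the content stream
     * and joins them top to bottom, then left to right.
     */
    static String readInPositionOrder(List<Fragment> streamOrder) {
        List<Fragment> sorted = new ArrayList<>(streamOrder);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));

        StringBuilder result = new StringBuilder();
        for (Fragment f : sorted) {
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append(f.text);
        }
        return result.toString();
    }
}
```

With PDFBox itself, none of this is needed: calling 
{{setSortByPosition(true)}} on a {{PDFTextStripper}} before {{getText()}} 
requests the sorted order.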

Q: Why is some text rendered in poor quality and not antialiased?

A: In some PDFs (e.g. the one in PDFBOX-2814 
https://issues.apache.org/jira/browse/PDFBOX-2814), text is not rendered 
directly, but as a shaped clipping from a background. Java graphics does not 
support "soft clipping" https://bugs.openjdk.java.net/browse/JDK-4212743 , and 
because of that, the edges do not look smooth. Soft clipping could be 
achieved with some extra steps 
https://community.oracle.com/blogs/campbell/2006/07/19/java-2d-trickery-soft-clipping
 , but these would cost additional time and memory. You can get higher 
quality by rendering at a higher DPI and then downscaling the image.
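
The downscaling step could look roughly like the sketch below, which uses 
only standard Java 2D. The class and method names are made up for this 
example, and the choice of bilinear interpolation is an assumption, not 
something PDFBox prescribes. The input image would typically come from 
{{PDFRenderer.renderImageWithDPI(pageIndex, dpi)}} called with a multiple of 
the DPI you actually want.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

final class DownscaleSketch {

    /**
     * Shrinks an image rendered at a higher resolution by an integer factor.
     * Example: render at 192 dpi, then call downscale(image, 2) for a 96 dpi
     * result whose hard clip edges have been smoothed by the interpolation.
     */
    static BufferedImage downscale(BufferedImage source, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be >= 1");
        }
        int width = Math.max(1, source.getWidth() / factor);
        int height = Math.max(1, source.getHeight() / factor);

        BufferedImage target =
                new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = target.createGraphics();
        try {
            // Bilinear interpolation blends neighbouring source pixels.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(source, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return target;
    }
}
```

A factor of 2 is a reasonable starting point. Larger factors increase 
rendering time and memory use accordingly, and may look better when applied 
in several halving steps rather than one.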

An unrelated anti-aliasing bug (PDFBOX-3615) has been fixed in 2.0.4.

> Enhance and update PDFBox website & documentation
> -------------------------------------------------
>
>                 Key: PDFBOX-3330
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3330
>             Project: PDFBox
>          Issue Type: Task
>          Components: Documentation
>            Reporter: Maruan Sahyoun
>
> General purpose ticket to track enhancements to the website and documentation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
