[ https://issues.apache.org/jira/browse/PDFBOX-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726228#comment-15726228 ]
Tilman Hausherr commented on PDFBOX-3330:
-----------------------------------------
For the 2.0 FAQ:
Q: Why does the extracted text appear in the wrong sequence?
A: By default, text is extracted in the order in which it appears in the PDF
page content stream. PDF is a graphics format, not a text format, and unlike
HTML, it does not require that text on a page be rendered in any particular
order. The order is whatever the software that created the PDF chose. To get
text sorted from left to right and top to bottom, use
{{setSortByPosition(true)}}.
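To make the difference between the two orders concrete, here is a minimal
sketch in plain Java. It is only an illustration, not PDFBox's code: the
{{Fragment}} class, the coordinates and both helper methods are hypothetical,
and real position sorting also has to tolerate small vertical offsets within a
line. In PDFBox 2.0 itself, the call would be {{setSortByPosition(true)}} on a
{{PDFTextStripper}} before {{getText(document)}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; this is not PDFBox's internal code.
class PositionSortSketch {

    // A text fragment and the top-left point where it is drawn.
    // Assumption: y grows downwards, so a smaller y is higher on the page.
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins the fragments in the order given, i.e. content stream order.
    static String inStreamOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    // Sorts a copy top to bottom, then left to right, and joins the result.
    static String inPositionOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.<Fragment>comparingDouble(f -> f.y)
                .thenComparingDouble(f -> f.x));
        return inStreamOrder(sorted);
    }

    public static void main(String[] args) {
        // The producer wrote "world" first, then the footer, then "Hello".
        List<Fragment> page = Arrays.asList(
                new Fragment("world", 120f, 50f),
                new Fragment("Footer", 20f, 700f),
                new Fragment("Hello", 20f, 50f));
        System.out.println(inStreamOrder(page));   // world Footer Hello
        System.out.println(inPositionOrder(page)); // Hello world Footer
    }
}
```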
Q: Why is some text rendered in poor quality, without anti-aliasing?
A: In some PDFs (e.g. the one in PDFBOX-2814
https://issues.apache.org/jira/browse/PDFBOX-2814), text is not rendered
directly, but as a shaped clip of a background. Java graphics does not
support "soft clipping" https://bugs.openjdk.java.net/browse/JDK-4212743 , so
the edges do not look smooth. Soft clipping could be
achieved with some extra steps
https://community.oracle.com/blogs/campbell/2006/07/19/java-2d-trickery-soft-clipping
, but these would cost additional time and memory. You can get higher
quality by rendering at a higher dpi and then downscaling the image.
An unrelated anti-aliasing bug (PDFBOX-3615) was fixed in 2.0.4.
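The downscaling step of that workaround could look roughly like the sketch
below. It uses only java.awt; the class name, the {{downscale}} helper and the
integer scale factor are assumptions made for illustration. In a real program
the source image would come from PDFBox, e.g. from
{{PDFRenderer.renderImageWithDPI(pageIndex, dpi)}} called with a multiple of
the target dpi; here a blank image stands in for it.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper; not part of PDFBox.
class DownscaleSketch {

    // Scales src down by an integer factor with bicubic interpolation,
    // which averages neighbouring pixels and so smooths hard clip edges.
    static BufferedImage downscale(BufferedImage src, int factor) {
        int w = Math.max(1, src.getWidth() / factor);
        int h = Math.max(1, src.getHeight() / factor);
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.setRenderingHint(RenderingHints.KEY_RENDERING,
                    RenderingHints.VALUE_RENDER_QUALITY);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    public static void main(String[] args) {
        // Stand-in for a page rendered at 4x the target resolution.
        BufferedImage highRes = new BufferedImage(400, 200, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = highRes.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 400, 200);
        g.dispose();

        BufferedImage result = downscale(highRes, 4);
        System.out.println(result.getWidth() + "x" + result.getHeight()); // 100x50
    }
}
```

The trade-off is the one mentioned above: the intermediate image needs
factor * factor times as many pixels as the final one.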
> Enhance and update PDFBox website & documentation
> -------------------------------------------------
>
> Key: PDFBOX-3330
> URL: https://issues.apache.org/jira/browse/PDFBOX-3330
> Project: PDFBox
> Issue Type: Task
> Components: Documentation
> Reporter: Maruan Sahyoun
>
> General purpose ticket to track enhancements to the website and documentation