[
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14910107#comment-14910107
]
Andreas Meier commented on PDFBOX-2252:
---------------------------------------
I tested the latest patch with the documents.
There are only small differences, and the files that differ are so
complex that I would rather not judge them.
I think you've done a great job rewriting the code. Thanks for taking
it on.
The only problem with this implementation is that "setSortByPosition"
currently must be set to true.
This will cause problems with text laid out in two or more column
blocks, since they will no longer be handled correctly.
Do you know if there is already a ticket for that problem?
Furthermore it would be nice to know your opinion about document layout
analysis programs. Is something planned in that direction?
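To illustrate the column problem mentioned above, here is a minimal sketch.
It is not PDFBox code: the Chunk class, the coordinates, and the naive
top-to-bottom, left-to-right sort are all assumptions made for the example
(with y growing downward). It only shows how sorting by position can
interleave two columns that the content stream stores one after the other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a positioned piece of text; not a PDFBox type.
final class Chunk {
    final String text;
    final double x;
    final double y; // assumption: y grows downward, so smaller y is higher on the page

    Chunk(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ColumnOrderDemo {
    // Two-column page in content-stream order: left column first, then right column.
    static List<Chunk> twoColumnPage() {
        return Arrays.asList(
                new Chunk("A1", 0, 10), new Chunk("A2", 0, 20),
                new Chunk("B1", 300, 10), new Chunk("B2", 300, 20));
    }

    // Keeps the order in which the chunks were written to the stream.
    static String inStreamOrder(List<Chunk> chunks) {
        return join(chunks);
    }

    // Naive position sort: by row first, then by horizontal position.
    static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return join(sorted);
    }

    private static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> page = twoColumnPage();
        System.out.println(inStreamOrder(page));    // A1 A2 B1 B2
        System.out.println(sortedByPosition(page)); // A1 B1 A2 B2
    }
}
```

In the sorted output the lines of the two columns alternate, which is the
behaviour a column-aware ordering would have to avoid.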
> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
> Key: PDFBOX-2252
> URL: https://issues.apache.org/jira/browse/PDFBOX-2252
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.6, 2.0.0
> Reporter: Amir
> Assignee: Maruan Sahyoun
> Priority: Critical
> Fix For: 2.1.0
>
> Attachments: BidiMirroring.txt, IsMirroredDeviations.txt,
> PDFTextStripper-201709271718.patch, PDFTextStripper-201709272018.patch,
> PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf,
> bugzilla867751.pdf, overlap.jpg, test.pdf, wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper is a combination of right-to-left
> and left-to-right languages, the output characters of one language are
> reversed.
> A sample bilingual pdf document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function,
> which is defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
> This class clearly counts the number of rtl characters and decides whether
> the whole content should be reversed or not. That is incorrect; it must
> operate on each word, not the whole document.
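The difference between a whole-text vote and a per-run decision, as
described in the issue above, can be sketched with the JDK's java.text.Bidi
class. This is only an illustration under stated assumptions: the class and
method names below are made up for the example, the majority count merely
imitates the quoted "rtlCount > ltrCount" line, and none of it is the code
from the patch.

```java
import java.text.Bidi;
import java.util.ArrayList;
import java.util.List;

final class BidiRunsDemo {
    // Whole-text majority vote, imitating the "rtlCount > ltrCount" idea from the issue.
    static boolean isRtlDominant(String text) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < text.length(); i++) {
            byte dir = Character.getDirectionality(text.charAt(i));
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
        }
        return rtlCount > ltrCount;
    }

    // Per-run decision: the Bidi class splits the text into directional runs,
    // and an odd embedding level marks a right-to-left run.
    static List<String> describeRuns(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        List<String> runs = new ArrayList<>();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            String run = text.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            boolean rtl = (bidi.getRunLevel(i) & 1) == 1;
            runs.add((rtl ? "RTL:" : "LTR:") + run);
        }
        return runs;
    }

    public static void main(String[] args) {
        // Three Latin letters, a space, then four Hebrew letters (logical order).
        String mixed = "abc \u05D0\u05D1\u05D2\u05D3";
        System.out.println(isRtlDominant(mixed)); // true: 4 RTL letters vs 3 LTR letters
        for (String run : describeRuns(mixed)) {
            System.out.println(run);
        }
    }
}
```

For the sample string the majority vote reports RTL for the whole text, while
the run analysis still identifies the Latin part as a left-to-right run, so
only the Hebrew run would be a candidate for reversal.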