[jira] [Commented] (PDFBOX-2252) PDFTextStripper has problem with documents with mixed language directions

Tilman Hausherr (JIRA) Sat, 03 Oct 2015 05:01:46 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942255#comment-14942255
 ]


Tilman Hausherr commented on PDFBOX-2252:
-----------------------------------------

No, that would be a separate thing, related to the 2.0 release, and would bring 
tons of differences (hopefully improvements). The wish now would only be to 
find out whether any new bugs were created in 2.0 text extraction recently.

> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Assignee: Maruan Sahyoun
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: BidiMirroring.txt, IsMirroredDeviations.txt, 
> PDFTextStripper-201709271718.patch, PDFTextStripper-201709272018.patch, 
> PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf, 
> bugzilla867751.pdf, overlap.jpg, test.pdf, wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper is a combination of right-to-left 
> and left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual pdf document is attached.
> PDFTextStripper has a variable "isRtlDominant" in "writePage" function, which 
> is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> This class clearly counts the number of RTL characters and decides whether the 
> whole content should be reversed or not. That is not correct; it must operate 
> on each word, not on the whole document.
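
To illustrate the reporter's point, here is a minimal sketch contrasting a
whole-text majority vote (the same idea as rtlCount > ltrCount) with a
per-word decision. This is not PDFBox code: the class BidiSketch and all of
its method names are hypothetical, and it only assumes the standard
java.lang.Character.getDirectionality call. The input is taken to be in
visual order, as text often is when read from a PDF page.

```java
// Hypothetical illustration only; not taken from PDFTextStripper.
final class BidiSketch {

    /** True if the character has strong right-to-left directionality. */
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    /** True if the character has strong left-to-right directionality. */
    static boolean isLtr(char c) {
        return Character.getDirectionality(c) == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
    }

    /** Majority vote over strong characters, analogous to rtlCount > ltrCount. */
    static boolean isRtlDominant(CharSequence text) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isRtl(c)) {
                rtlCount++;
            } else if (isLtr(c)) {
                ltrCount++;
            }
        }
        return rtlCount > ltrCount;
    }

    /** Whole-text decision: everything is reversed, or nothing is. */
    static String reverseWhole(String text) {
        return isRtlDominant(text) ? new StringBuilder(text).reverse().toString() : text;
    }

    /** Per-word decision: only words that are themselves RTL-dominant are reversed. */
    static String reversePerWord(String text) {
        StringBuilder out = new StringBuilder();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = i == text.length();
            if (atEnd || Character.isWhitespace(text.charAt(i))) {
                // Flush the current word, reversing it only if it is RTL-dominant.
                out.append(isRtlDominant(word) ? word.reverse() : word);
                word.setLength(0);
                if (!atEnd) {
                    out.append(text.charAt(i));
                }
            } else {
                word.append(text.charAt(i));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Latin "ab" followed by three Hebrew letters stored in visual order.
        String visual = "ab \u05D2\u05D1\u05D0";
        System.out.println(reverseWhole(visual));   // the Latin word gets reversed too
        System.out.println(reversePerWord(visual)); // only the Hebrew word is reversed
    }
}
```

The per-word variant leaves the Latin word intact, which is the behaviour
asked for above. It is still a simplification: it does not reorder the words
inside a right-to-left run, nor handle digits or mirrored punctuation, for
which the Unicode Bidirectional Algorithm (available in the JDK as
java.text.Bidi) would be the more complete tool.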



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

