[ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908359#comment-14908359
 ] 

Tilman Hausherr edited comment on PDFBOX-2252 at 9/25/15 6:48 PM:
------------------------------------------------------------------

Pages 52 and 56 of bugzilla867751.pdf might help. I'll attach it.

We'd need to search the web for more such files. Maybe we could find PDFs of 
home appliance user manuals. For that we'd need some words that are common in 
such a manual in an RTL language. In German, such words would be e.g. 
"einschalten", "reinigen", "Öffnen", "Bedienungsanleitung" (which mean switch 
on, clean, open, user manual). We'd need similar words in an RTL language that 
are not too short. They don't have to be translations of the words I mention, 
just specific to a home appliance user manual.


was (Author: tilman):
Pages 52 and 56 of bugzilla867751.pdf might help. I'll attach it.

We'd need to search the web for more such files. Maybe we could find PDFs of 
kitchen appliance user manuals. We'd need some words that are common for such a 
manual in an RTL language, like the German (e.g. "einschalten", "reinigen", 
"Öffnen", "Bedienungsanleitung") or English ones.

> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Assignee: Maruan Sahyoun
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: BidiMirroring.txt, IsMirroredDeviations.txt, 
> PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf, 
> bugzilla867751.pdf, overlap.jpg, test.pdf, wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper combines right-to-left and 
> left-to-right languages, the output characters of one language are reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> This class clearly counts the number of RTL characters and decides whether 
> the whole content should be reversed or not. That is not correct: it must 
> operate on each word, not the whole document.
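
For illustration, the difference between the page-level decision quoted above 
and a per-word decision could be sketched roughly as follows. This is only a 
sketch, not the PDFTextStripper code: the names DirectionSketch and fixPerWord 
are made up here, isRtlDominant merely mirrors the comparison from the report, 
and the input is assumed to be in visual order with words separated by single 
spaces. A real fix would also have to deal with mirrored characters and with 
the order of words inside an RTL run, which this ignores.

```java
// Hypothetical sketch: names and structure are assumptions, not PDFBox API.
final class DirectionSketch {

    // True for strong right-to-left characters (e.g. Hebrew, Arabic letters).
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // True for strong left-to-right characters (e.g. Latin letters).
    static boolean isLtr(char c) {
        return Character.getDirectionality(c) == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
    }

    // Same comparison as in the report (rtlCount > ltrCount), applied to
    // whatever text is passed in: a whole page or a single word.
    static boolean isRtlDominant(String text) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (char c : text.toCharArray()) {
            if (isRtl(c)) {
                rtlCount++;
            } else if (isLtr(c)) {
                ltrCount++;
            }
        }
        return rtlCount > ltrCount;
    }

    // Per-word variant: reverse only those words that are mostly RTL,
    // and leave all other words untouched.
    static String fixPerWord(String visualLine) {
        String[] words = visualLine.split(" ", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String word = words[i];
            out.append(isRtlDominant(word)
                ? new StringBuilder(word).reverse().toString()
                : word);
        }
        return out.toString();
    }
}
```

With a line that has more LTR than RTL characters, isRtlDominant on the whole 
line is false, so a page-level decision would leave the RTL word in its 
reversed visual order, while fixPerWord reverses just that word.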



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

