[ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634577#comment-14634577
 ] 

Andreas Meier edited comment on PDFBOX-2252 at 7/21/15 9:43 AM:
----------------------------------------------------------------

Yes, numbers are written LTR.

Everything happening in this case depends on the strong LTR and RTL characters.

For example:
The "((" occurs because Adobe Reader notices the strong RTL characters and 
extracts them. The strong RTL characters then flip the direction of the 
trailing ")", so it is written to the left of the RTL word.
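
To illustrate the point about strong characters, here is a minimal sketch 
using only the JDK (java.lang.Character and java.text.Bidi). It is not PDFBox 
code, and the class and method names are hypothetical. It assumes an LTR base 
direction and shows that ")" has no direction of its own: the Unicode bidi 
algorithm resolves it from the strong characters around it.

```java
import java.text.Bidi;

// Hypothetical demo class, not part of PDFBox.
class StrongDirectionDemo {

    // Classifies one character as strong LTR, strong RTL, or weak/neutral.
    static String classify(char c) {
        switch (Character.getDirectionality(c)) {
            case Character.DIRECTIONALITY_LEFT_TO_RIGHT:
                return "strong LTR";
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT:
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:
                return "strong RTL";
            default:
                return "weak or neutral"; // digits, brackets, spaces, ...
        }
    }

    // Embedding level the bidi algorithm resolves for the character at index,
    // assuming an LTR base direction. Even levels are LTR, odd levels are RTL.
    static int resolvedLevel(String text, int index) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_LEFT_TO_RIGHT);
        return bidi.getLevelAt(index);
    }

    public static void main(String[] args) {
        System.out.println(classify('a'));      // strong LTR
        System.out.println(classify('\u05D0')); // strong RTL (Hebrew alef)
        System.out.println(classify(')'));      // weak or neutral
        System.out.println(classify('1'));      // weak or neutral

        // The same ")" between LTR letters and between RTL letters.
        System.out.println(resolvedLevel("a) b", 1));           // even: LTR
        System.out.println(resolvedLevel("\u05D0) \u05D1", 1)); // odd: RTL
    }
}
```

The ")" between the two Hebrew letters is resolved to an RTL level, which is 
the effect described above.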

A simple approach without markers, one that relies only on the direction of 
the strong RTL/LTR characters, will not handle this problem.

That's the reason why I suggested the multi-stage approach:

We can't rely on the person who creates a PDF or on the integrity of the 
software that converts text to PDF.


was (Author: andreasmeier):
Yes, numbers are written ltr

Everything happening in this case depends on the strong LTR and RTL characters.

For example:
The (( occurs, because Adobe Reader notices the strong RTL characters and 
writes the next character ")" to the left.

> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: PDFTextStripper.java.patch, atest.pdf, overlap.jpg, 
> test.pdf, wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper is a combination of right-to-left 
> and left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> This class counts the number of RTL characters and decides whether the whole 
> content should be reversed or not. That is not correct: it must operate on 
> each word, not on the whole document.
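
The difference between a whole-text decision and a per-word decision can be 
sketched as follows. This is not the PDFTextStripper implementation and not 
the attached patch: the class and method names are hypothetical, and splitting 
on whitespace is an assumption made only for illustration. The counting rule 
mirrors the quoted "rtlCount > ltrCount" comparison.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of PDFBox.
class PerWordDirection {

    // True if the text has more strong RTL than strong LTR characters.
    // Weak and neutral characters (digits, brackets, spaces) are ignored.
    static boolean isRtlDominant(String text) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
            i += Character.charCount(cp);
        }
        return rtlCount > ltrCount;
    }

    // Applies the same rule to each whitespace-separated word.
    static List<Boolean> rtlDominantPerWord(String line) {
        List<Boolean> result = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                result.add(isRtlDominant(word));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // One Hebrew word embedded in an English line.
        String line = "PDFBox \u05E9\u05DC\u05D5\u05DD text extraction";
        System.out.println(isRtlDominant(line));      // false
        System.out.println(rtlDominantPerWord(line)); // [false, true, false, false]
    }
}
```

With a single decision for the whole line, the Hebrew word is treated like the 
surrounding LTR text. The per-word decision singles it out.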



