[ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634565#comment-14634565
 ] 

Andreas Meier edited comment on PDFBOX-2252 at 7/21/15 6:14 AM:
----------------------------------------------------------------

I assembled atest.pdf by copying the individual parts together in LibreOffice.

It's true that the dates are not typeset correctly.
I created this document because I received some Arabic documents where 
this is the case.
I think this is what Tilman Hausherr meant by:
{quote}
Yes, but it is risky to create something ourselves in a foreign language that 
we don't understand.
{quote}
I don't know why this happens in the text of Arabic newspapers, whether it is a 
bug in some conversion software or the result of text being copied together by 
hand, but it happens in several documents I received for testing.
Unfortunately, I can't provide these files to you.

If you think we should not address this problem, it would be better to delete 
atest.pdf so it won't confuse others.


The problem is that characters from the line
"This is the story of my life and = (waḥabbī"
and the following line
"my love"

are mixed up during extraction:

"This ‎is ‎the ‎storyofmy ‎life ‎and=(mwayḥlaobvbeī"

By "non-visual" overlapping I meant that the characters from "waḥabbī" and 
"my love" don't touch each other. See the red fields in overlap.jpg.
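
The mixed-up output is consistent with glyphs from two lines being merged into 
one line and then ordered by horizontal position only. The following is a 
minimal sketch of that effect, not PDFBox code: the Glyph class, the 
coordinates, and the tolerance value are all hypothetical and chosen only to 
reproduce the interleaving seen above. Whether this is the actual cause in 
PDFTextStripper is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LineGroupingSketch {

    /** Hypothetical glyph record: a character and the position of its origin. */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;

        Glyph(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Places each character of text at the matching x offset on baseline y. */
    static List<Glyph> place(String text, double[] xs, double y) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), xs[i], y));
        }
        return glyphs;
    }

    static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    /** Sorts by x only, as if every glyph belonged to the same line. */
    static String mergeAsOneLine(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        return join(sorted);
    }

    /** Groups glyphs by baseline (within tolerance), then sorts each line by x. */
    static List<String> splitByBaseline(List<Glyph> glyphs, double tolerance) {
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort(Comparator.comparingDouble((Glyph g) -> g.y));
        List<String> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : byY) {
            if (!current.isEmpty() && g.y - current.get(0).y > tolerance) {
                lines.add(mergeAsOneLine(current));
                current = new ArrayList<>();
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(mergeAsOneLine(current));
        }
        return lines;
    }

    /** Made-up positions for "wahabbi" (with diacritics) and "my love" (no space). */
    static List<Glyph> sample() {
        List<Glyph> all = new ArrayList<>();
        // \u1E25 is h with dot below, \u012B is i with macron
        all.addAll(place("wa\u1E25abb\u012B", new double[] {1, 2, 4, 6, 8, 10, 12}, 100));
        all.addAll(place("mylove", new double[] {0, 3, 5, 7, 9, 11}, 112));
        return all;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = sample();
        System.out.println(mergeAsOneLine(glyphs));       // interleaved, like the extraction result
        System.out.println(splitByBaseline(glyphs, 2.0)); // two separate lines
    }
}
```

With these made-up positions, sorting by x alone yields the same interleaved 
sequence as in the extraction result, while grouping by baseline first keeps 
the two lines apart.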




was (Author: andreasmeier):
I assembled atest.pdf by copying the individual parts together in LibreOffice.

It's true that the dates are not typeset correctly.
I created this document because I received some Arabic documents where 
this is the case.
I don't know why this happens in the text of Arabic newspapers, whether it is a 
bug in some conversion software or the result of text being copied together by 
hand, but it happens in several documents I received for testing.
Unfortunately, I can't provide them to you.

If you think we should not address this problem, it would be better to delete 
atest.pdf so it won't confuse others.


The problem is that characters from the line
"This is the story of my life and = (waḥabbī"
and the following line
"my love"

are mixed up during extraction:

"This ‎is ‎the ‎storyofmy ‎life ‎and=(mwayḥlaobvbeī"

By "non-visual" overlapping I meant that the characters from "waḥabbī" and 
"my love" don't touch each other. See the red fields in overlap.jpg.



> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: PDFTextStripper.java.patch, atest.pdf, overlap.jpg, 
> test.pdf, wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper combines right-to-left and 
> left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> This class counts the number of RTL characters and decides whether the whole 
> content should be reversed. That is not correct: the decision must be made 
> for each word, not for the whole document.
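
To illustrate the per-word decision the description asks for, here is a 
minimal sketch that applies the same "rtlCount > ltrCount" test to each word 
instead of the whole content. It is not the attached patch and not 
PDFTextStripper code: the class and method names are hypothetical, words are 
assumed to be separated by single spaces, and the input is assumed to arrive 
in visual order. A complete solution would more likely follow the Unicode 
Bidirectional Algorithm (for example via java.text.Bidi) and would also have 
to handle the order of consecutive RTL words, which this sketch ignores.

```java
public class PerWordDirection {

    /** Returns true if the word has more RTL than LTR characters. */
    static boolean isRtlDominant(String word) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
            i += Character.charCount(cp);
        }
        return rtlCount > ltrCount;
    }

    /** Reverses only the RTL-dominant words; LTR words are left untouched. */
    static String normalizeLine(String line) {
        String[] words = line.split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String word = words[i];
            if (isRtlDominant(word)) {
                out.append(new StringBuilder(word).reverse());
            } else {
                out.append(word);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "abc" followed by three Hebrew letters stored in visual (reversed) order
        String visual = "abc \u05D2\u05D1\u05D0";
        System.out.println(normalizeLine(visual));
    }
}
```

In this sketch the Latin word passes through unchanged and only the Hebrew 
word is reversed, whereas a single decision for the whole content would 
reverse either both words or neither.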



