[jira] [Comment Edited] (PDFBOX-2252) PDFTextStripper has problem with documents with mixed language directions

Andreas Meier (JIRA) Fri, 17 Jul 2015 01:03:59 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630834#comment-14630834
 ]


Andreas Meier edited comment on PDFBOX-2252 at 7/17/15 8:02 AM:
----------------------------------------------------------------

After some investigation I can say: Acrobat uses almost no directional marks. 
It relies strictly on strong characters, which affect the surrounding neutrals 
so that they are written in the correct direction.
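
A minimal sketch of what "strong" versus "neutral" means here, assuming the 
Unicode bidirectional classes exposed by java.lang.Character. The class name 
BidiClasses and the Kind enum are hypothetical and not part of PDFBox or of the 
patch; how Acrobat classifies characters internally is not known.

```java
// Hypothetical helper, not PDFBox code: groups the Unicode bidi classes
// reported by java.lang.Character into the three buckets discussed above.
final class BidiClasses {
    enum Kind { STRONG_LTR, STRONG_RTL, WEAK_OR_NEUTRAL }

    static Kind classify(char c) {
        switch (Character.getDirectionality(c)) {
            case Character.DIRECTIONALITY_LEFT_TO_RIGHT:
                return Kind.STRONG_LTR;          // e.g. Latin letters
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT:
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:
                return Kind.STRONG_RTL;          // e.g. Hebrew and Arabic letters
            default:
                // Digits, punctuation and whitespace carry no strong direction;
                // they take their direction from the surrounding text.
                return Kind.WEAK_OR_NEUTRAL;
        }
    }
}
```

Digits (European and Arabic-Indic alike) fall into the last bucket, so a 
"number text number" sequence contains only one strongly directional part.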

I also found out that Acrobat has a problem extracting Arabic dates. In one 
case I had a document with

"number text number" (all Arabic; my guess is that this is a date: year, 
text/month, day), where Acrobat extracted the date in the wrong order: "number 
number text". In this special case Acrobat would have needed one or more 
RTL/LTR marks to produce the correct result.

I guess there is a lot of know-how in Acrobat's text extraction mechanism, 
because it normally handles text very well without using many marks, but it 
also has bugs with mixed language directions.

If we want to provide really good text extraction without the number of 
LTR/RTL marks I used (while achieving 1-to-1 text extraction as good as or even 
better than Acrobat's), we need to do this in several steps: 
1. Extract the text and set the LTR/RTL marks I already set, so we get the 
correct order and layout of the text.
2. Check whether RTL/LTR marks are unnecessary because of strong characters (at 
that point we need to check the words/characters or tuples around the marks).
3. Remove the unnecessary marks.
4. Add a few new marks to remove many others.
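
Steps 2 and 3 could be sketched roughly as follows. This is only an 
illustration under simplifying assumptions, not the attached patch: the class 
RedundantMarkFilter and its method are hypothetical, only the marks U+200E (LRM) 
and U+200F (RLM) are considered, and a mark is treated as unnecessary only when 
it sits directly next to a strong character of the same direction. Arabic 
letters are deliberately not counted as making an RLM redundant, because digits 
are resolved differently after an Arabic letter than after an RLM.

```java
// Hypothetical sketch, not PDFBox code: drops LRM/RLM marks that sit directly
// next to a strong character of the same direction.
final class RedundantMarkFilter {
    private static final char LRM = '\u200E';
    private static final char RLM = '\u200F';

    // True if c is a real character (not itself a mark) of the wanted bidi class.
    private static boolean isStrongLetter(char c, byte wanted) {
        return c != LRM && c != RLM && Character.getDirectionality(c) == wanted;
    }

    static String stripRedundantMarks(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == LRM || c == RLM) {
                byte wanted = (c == LRM)
                        ? Character.DIRECTIONALITY_LEFT_TO_RIGHT
                        : Character.DIRECTIONALITY_RIGHT_TO_LEFT;
                boolean prev = i > 0 && isStrongLetter(text.charAt(i - 1), wanted);
                boolean next = i + 1 < text.length()
                        && isStrongLetter(text.charAt(i + 1), wanted);
                if (prev || next) {
                    continue; // a neighbouring strong character already sets the direction
                }
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

A mark between two digits or two neutrals is kept, since nothing next to it 
establishes the direction. Step 4 (adding a few new marks to replace many) is 
not covered by this sketch.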

The above patch is only my first approach to extracting the text of PDFs 
1-to-1. It can be seen as the first step on our long way to perfect text 
extraction... (some things are better than Acrobat's text extraction, some 
things are worse...)

For more information about how Acrobat works, I would need to know the 
internals of Acrobat.
Does Adobe provide any papers on Acrobat?


was (Author: andreasmeier):
After some investigation I can say: Acrobat uses almost no directional marks. 
It relies strictly on strong characters, which affect the surrounding neutrals 
so that they are written in the correct direction.

I also found out that Acrobat has a problem extracting Arabic dates. In one 
case I had a document with

"number text number" (all Arabic; my guess is that this is a date: year, 
text/month, day), where Acrobat extracted the date in the wrong order: "number 
number text". In this special case Acrobat would have needed one or more 
RTL/LTR marks to produce the correct result.

There is a lot of know-how in Acrobat's text extraction mechanism, but it also 
has bugs with mixed language directions.

If we want to provide really good text extraction without the number of 
LTR/RTL marks I used (while achieving 1-to-1 text extraction as good as or even 
better than Acrobat's), we need to do this in several steps: 
1. Extract the text and set the LTR/RTL marks I already set, so we get the 
correct order and layout of the text.
2. Check whether RTL/LTR marks are unnecessary because of strong characters (at 
that point we need to check the words/characters or tuples around the marks).
3. Remove the unnecessary marks.
4. Add a few new marks to remove many others.

The above patch is only my first approach to extracting the text of PDFs 
1-to-1. It can be seen as the first step on our long way to perfect text 
extraction... (some things are better than Acrobat's text extraction, some 
things are worse...)

For more information about how Acrobat works, I would need to know the 
internals of Acrobat.
Does Adobe provide any papers on Acrobat?

> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: PDFTextStripper.java.patch, test.pdf
>
>
> When the input document of PDFTextStripper is a combination of right-to-left 
> and left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> This class clearly counts the number of RTL characters and decides whether the 
> whole content should be reversed or not. That is not correct; it must operate 
> on each word, not the whole document.
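
The per-word decision asked for in the description above could look roughly 
like this. It is a sketch under stated assumptions, not the PDFTextStripper 
implementation: the class PerWordDirection and its methods are hypothetical, 
words are assumed to be separated by single spaces, and only the characters 
inside each word are reordered (the order of the words themselves is left 
untouched).

```java
// Hypothetical sketch, not PDFBox code: applies the rtlCount > ltrCount test
// to each word instead of to the whole page.
final class PerWordDirection {
    static boolean isRtlDominant(CharSequence s) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < s.length(); i++) {
            byte d = Character.getDirectionality(s.charAt(i));
            if (d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (d == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
        }
        return rtlCount > ltrCount;
    }

    // Reverses only the words in which RTL characters dominate.
    static String normalizeLine(String line) {
        String[] words = line.split(" ", -1);
        StringBuilder out = new StringBuilder(line.length());
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String w = words[i];
            out.append(isRtlDominant(w) ? new StringBuilder(w).reverse().toString() : w);
        }
        return out.toString();
    }
}
```

With a line made of one Latin word and one Hebrew word of equal length, the 
whole-line test reports no RTL dominance and would leave both words alone, 
while the per-word test reverses only the Hebrew word.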



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

