[ https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630834#comment-14630834 ]
Andreas Meier edited comment on PDFBOX-2252 at 7/17/15 8:02 AM:
----------------------------------------------------------------

After some investigation I can say: Acrobat uses almost no directional marks. It relies strictly on strong characters, which affect the surrounding neutrals so that they are written in the correct direction.

I also found that Acrobat has a problem extracting Arabic dates. In one case I had a document with "number text number" (all Arabic; my guess is that this is a date: year, text/month, day), where Acrobat extracted the date in the wrong order: "number number text". In this particular case Acrobat would have needed one or more RTL/LTR marks to produce the correct result.

I guess there is a lot of know-how in Acrobat's text extraction mechanism, because it normally handles text very well without using many marks, but it also has bugs with mixed language directions.

If we want to provide really good text extraction without the number of LTR/RTL marks I used (while having 1-to-1 text extraction as good as or even better than Acrobat's), we need to do this in several steps:

1. Extract the text and set the LTR/RTL marks as I already do, so that we get the correct order and layout of the text.
2. Check which RTL/LTR marks are not needed because of strong characters (at that point we need to check the words/characters or tuples around the marks).
3. Remove the marks that are not needed.
4. Add a few new marks to remove many others.

The above patch is only my first approach to extracting the text of PDFs 1-to-1. It can be seen as the first step on our long way to perfect text extraction... (some things are better than Acrobat's text extraction, some things are worse...)

For more information about how Acrobat works, I would need to know its internals. Does Adobe provide any papers on Acrobat?

was (Author: andreasmeier):
After some investigation I can say: Acrobat uses almost no directional marks.
It relies strictly on strong characters, which affect the surrounding neutrals so that they are written in the correct direction.

I also found that Acrobat has a problem extracting Arabic dates. In one case I had a document with "number text number" (all Arabic; my guess is that this is a date: year, text/month, day), where Acrobat extracted the date in the wrong order: "number number text". In this particular case Acrobat would have needed one or more RTL/LTR marks to produce the correct result.

There is a lot of know-how in Acrobat's text extraction mechanism, but it also has bugs with mixed language directions.

If we want to provide really good text extraction without the number of LTR/RTL marks I used (while having 1-to-1 text extraction as good as or even better than Acrobat's), we need to do this in several steps:

1. Extract the text and set the LTR/RTL marks as I already do, so that we get the correct order and layout of the text.
2. Check which RTL/LTR marks are not needed because of strong characters (at that point we need to check the words/characters or tuples around the marks).
3. Remove the marks that are not needed.
4. Add a few new marks to remove many others.

The above patch is only my first approach to extracting the text of PDFs 1-to-1. It can be seen as the first step on our long way to perfect text extraction... (some things are better than Acrobat's text extraction, some things are worse...)

For more information about how Acrobat works, I would need to know its internals. Does Adobe provide any papers on Acrobat?
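Steps 2 and 3 of the plan above could be sketched roughly as follows. This is only an illustration, not what the attached patch or Acrobat does: the class and method names (RedundantMarkRemover, removeRedundantMarks) are hypothetical, and the sketch assumes that the JDK's java.text.Bidi is an acceptable oracle for how a consumer of the extracted text will resolve directions. A mark (U+200E LRM or U+200F RLM) is treated as not needed when removing it leaves the resolved embedding level of every other character unchanged.

```java
import java.text.Bidi;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox.
final class RedundantMarkRemover {
    private static final char LRM = '\u200E';
    private static final char RLM = '\u200F';

    private static boolean isMark(char c) {
        return c == LRM || c == RLM;
    }

    // Resolved embedding levels of all characters except the marks themselves.
    private static int[] levelsIgnoringMarks(String text, int baseFlag) {
        if (text.isEmpty()) {
            return new int[0];
        }
        Bidi bidi = new Bidi(text, baseFlag);
        int[] levels = new int[text.length()];
        int k = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!isMark(text.charAt(i))) {
                levels[k++] = bidi.getLevelAt(i);
            }
        }
        return Arrays.copyOf(levels, k);
    }

    // Greedily drops every mark whose removal does not change the levels
    // of the remaining characters (assumption: same base direction as the reader).
    static String removeRedundantMarks(String text, int baseFlag) {
        StringBuilder sb = new StringBuilder(text);
        int i = 0;
        while (i < sb.length()) {
            char c = sb.charAt(i);
            if (isMark(c)) {
                int[] before = levelsIgnoringMarks(sb.toString(), baseFlag);
                sb.deleteCharAt(i);
                int[] after = levelsIgnoringMarks(sb.toString(), baseFlag);
                if (Arrays.equals(before, after)) {
                    continue; // mark was redundant; re-check the new character at i
                }
                sb.insert(i, c); // mark is needed; put it back
            }
            i++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String cleaned = removeRedundantMarks("abc \u200E def", Bidi.DIRECTION_LEFT_TO_RIGHT);
        System.out.println("length after cleanup: " + cleaned.length());
    }
}
```

The greedy pass is quadratic in the worst case and does not attempt step 4 (adding a few new marks in order to drop many others); it only shows how "not needed" could be decided mechanically.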
> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: PDFTextStripper.java.patch, test.pdf
>
>
> When the input document of PDFTextStripper is a combination of right-to-left
> and left-to-right languages, the output characters of one language are
> reversed.
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, which
> is defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
> This class simply counts the number of RTL characters and decides whether the whole
> content should be reversed or not. This is not correct; it must operate on each
> word, not the whole document.
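To illustrate the reporter's suggestion of deciding per word instead of per page, here is a minimal sketch. It is not the actual PDFBox code or fix: the class and method names (PerWordDirection, isRtlDominant as a method, fixLine) are hypothetical, and it assumes the input line is in visual order with words separated by single spaces. The same rtlCount > ltrCount comparison is applied to each word on its own, and only RTL-dominant words are reversed.

```java
// Hypothetical helper, not part of PDFBox.
final class PerWordDirection {

    // Same idea as "rtlCount > ltrCount", but counted for a single word.
    static boolean isRtlDominant(String word) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
            i += Character.charCount(cp);
        }
        return rtlCount > ltrCount;
    }

    // Reverses only the RTL-dominant words of a visually ordered line.
    static String fixLine(String visualLine) {
        String[] words = visualLine.split(" ", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String word = words[i];
            out.append(isRtlDominant(word)
                    ? new StringBuilder(word).reverse().toString()
                    : word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Latin word stays as is; the Hebrew word is reversed on its own.
        System.out.println(fixLine("hello \u05D1\u05D0"));
    }
}
```

With a page-wide count, the line "hello" plus a two-letter Hebrew word would be treated as LTR-dominant and the Hebrew word would never be reversed. The per-word decision handles that case, but it does not reorder the words within an RTL run and would misplace combining marks, so it is only a starting point.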