[ 
https://issues.apache.org/jira/browse/PDFBOX-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029925#comment-14029925
 ] 

John Hewson edited comment on PDFBOX-1919 at 6/12/14 10:34 PM:
---------------------------------------------------------------

I took a detailed look at the PDF in question. Using Acrobat Pro XI, I get "IN 
NoRtheRN IReLAND" -and I always get the same result if I copy & paste,- when I 
export plain text or export accessible text. OS X Preview gives the same 
result, as does Chrome's PDF viewer. *Update:* Actually, when doing copy & 
paste with Acrobat I get "IN NORTHE RN IRELAND", which still isn't correct, but 
is closer to what was expected.

I -really don't think- *still doubt* that Acrobat is using the span tags to 
repair the ToUnicode table (how would it know the table was bad? What if the 
span tags were bad?). Andreas, what version of Acrobat did you use? Given that 
every PDF viewer I've tried produces the same text in all cases, I'd say that 
PDFBox's behaviour is standard - *but is it correct? Could it be better?*

I'll do some more investigating...


was (Author: jahewson):
I took a detailed look at the PDF in question. Using Acrobat Pro XI, I get "IN 
NoRtheRN IReLAND" -and I always get the same result if I copy & paste,- when I 
export plain text or export accessible text. OS X Preview gives the same 
result, as does Chrome's PDF viewer. *Update:* Actually, when doing copy & 
paste with Acrobat I get "IN NORTHE RN IRELAND", which still isn't correct, but 
is closer to what was expected.

I -really don't think- *still find it hard to believe* that Acrobat is using 
the span tags to repair the ToUnicode table (how would it know the table was 
bad? What if the span tags were bad?). Andreas, what version of Acrobat did you 
use? Given that every PDF viewer I've tried produces the same text in all 
cases, I'd say that PDFBox's behaviour is standard - *but is it correct? Could 
it be better?*

I'll do some more investigating...

> Font descriptor flags are not implemented
> -----------------------------------------
>
>                 Key: PDFBOX-1919
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1919
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.5, 1.8.6, 2.0.0
>            Reporter: Corentin Regal
>         Attachments: PDFBOX-1919.AdobeReader.txt, PDFBOX-1919.pdf, 
> PDFBOX-1919.txt
>
>
> The font descriptor flags are not set.
> They are described in the "PDF Reference 1.7" document, section 5.7.1, "Font 
> Descriptor Flags".
> The methods in PDFontDescriptor are ready but never called:
> setFlags()
> setSerif()
> setAllCap(), which is used in a lot of PDFs
> ...
> I saw some TODOs in the code that relate to this issue. Is it planned to be 
> implemented soon?
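
For readers unfamiliar with the flags the issue refers to: the /Flags entry of a font descriptor is a single integer whose bits mark font characteristics such as Serif or AllCap. The sketch below shows only the bit layout from the PDF Reference and how a flag value could be read or changed. It is an illustration, not PDFBox code: the class name FontDescriptorFlagsSketch and its helper methods are hypothetical and do not correspond to PDFontDescriptor's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical, standalone illustration of the font descriptor flag bits
 * (PDF Reference 1.7, section 5.7.1). Not part of PDFBox.
 */
public final class FontDescriptorFlagsSketch {
    // The specification numbers bits from 1 (lowest order), so bit N is 1 << (N - 1).
    public static final int FIXED_PITCH  = 1 << 0;  // bit 1
    public static final int SERIF        = 1 << 1;  // bit 2
    public static final int SYMBOLIC     = 1 << 2;  // bit 3
    public static final int SCRIPT       = 1 << 3;  // bit 4
    public static final int NON_SYMBOLIC = 1 << 5;  // bit 6
    public static final int ITALIC       = 1 << 6;  // bit 7
    public static final int ALL_CAP      = 1 << 16; // bit 17
    public static final int SMALL_CAP    = 1 << 17; // bit 18
    public static final int FORCE_BOLD   = 1 << 18; // bit 19

    private FontDescriptorFlagsSketch() {
    }

    /** Returns true if the given flag bit is set in the /Flags value. */
    public static boolean isSet(int flags, int mask) {
        return (flags & mask) != 0;
    }

    /** Returns a copy of the /Flags value with the given bit switched on or off. */
    public static int with(int flags, int mask, boolean on) {
        return on ? (flags | mask) : (flags & ~mask);
    }

    /** Lists the names of all flags set in the value, in bit order. */
    public static List<String> describe(int flags) {
        List<String> names = new ArrayList<>();
        if (isSet(flags, FIXED_PITCH))  names.add("FixedPitch");
        if (isSet(flags, SERIF))        names.add("Serif");
        if (isSet(flags, SYMBOLIC))     names.add("Symbolic");
        if (isSet(flags, SCRIPT))       names.add("Script");
        if (isSet(flags, NON_SYMBOLIC)) names.add("Nonsymbolic");
        if (isSet(flags, ITALIC))       names.add("Italic");
        if (isSet(flags, ALL_CAP))      names.add("AllCap");
        if (isSet(flags, SMALL_CAP))    names.add("SmallCap");
        if (isSet(flags, FORCE_BOLD))   names.add("ForceBold");
        return names;
    }

    public static void main(String[] args) {
        // Example value chosen for illustration: a nonsymbolic, all-caps font.
        int flags = NON_SYMBOLIC | ALL_CAP;
        System.out.println(flags + " -> " + describe(flags));
    }
}
```

A text extractor could consult such bits (for instance AllCap or SmallCap) when deciding how to interpret a font, though whether that would change the output for the attached PDF is exactly what is still being investigated above.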



--
This message was sent by Atlassian JIRA
(v6.2#6252)
