[jira] [Commented] (PDFBOX-1572) PDFBox ExtracText problems with "ª"

Timo Boehme (JIRA) Fri, 19 Apr 2013 03:53:19 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636268#comment-13636268
 ]


Timo Boehme commented on PDFBOX-1572:
-------------------------------------

To my knowledge there are no plans to add OCR-based text extraction to 
PDFBox. That is rather a feature for an OCR engine that reads and writes PDF. 
In your case the problem is not missing OCR: the OCR that was used to create 
your PDF produced wrong text content, and PDFBox simply parses that content. 
The only thing PDFBox could do is try to correct typical OCR errors. However, 
that again is not a task for PDFBox but for a separate, specialized filter. 
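
To illustrate what such a separate filter might look like, here is a minimal 
sketch. It is not part of PDFBox; the class name OcrOrdinalFilter and its 
correction table are hypothetical, and the three mappings are taken only from 
the examples in this issue. It replaces whole whitespace-delimited tokens, so 
it will also rewrite a genuine "22" or "32"; a usable filter would need 
context (language, neighbouring words) to decide.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical post-processing filter for extracted text; not a PDFBox class. */
final class OcrOrdinalFilter {

    // Misread token -> intended text. Assumption: these pairs come only from
    // the three examples reported in this issue ("\u00AA" is the character ª).
    private static final Map<String, String> FIXES = new LinkedHashMap<>();
    static {
        FIXES.put("P", "1\u00AA");
        FIXES.put("22", "2\u00AA");
        FIXES.put("32", "3\u00AA");
    }

    // A token is a maximal run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\\S+");

    /** Replaces every token that exactly matches a known misreading. */
    static String correct(String extracted) {
        Matcher m = TOKEN.matcher(extracted);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the whitespace between tokens unchanged.
            out.append(extracted, last, m.start());
            String token = m.group();
            out.append(FIXES.getOrDefault(token, token));
            last = m.end();
        }
        // Copy any trailing whitespace.
        out.append(extracted, last, extracted.length());
        return out.toString();
    }
}
```

The input would be the string returned by PDFTextStripper.getText(document), 
passed through OcrOrdinalFilter.correct(...) before further use. Tokens that 
only contain a misreading, such as "322", are left untouched because the match 
is on the whole token.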
                
> PDFBox ExtracText problems with "ª"
> -----------------------------------
>
>                 Key: PDFBOX-1572
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1572
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Daniel Tizon
>
> PDFBox has problems detecting ª in some PDFs.
> Examples: 
> I have in my PDF: 1ª
> When I extract text: P
> I have in my PDF: 2ª
> When I extract text: 22
> I have in my PDF: 3ª
> When I extract text: 32
> There are many more examples related to "ª".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
