[ 
https://issues.apache.org/jira/browse/PDFBOX-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964781#action_12964781
 ] 

Martijn Brinkers commented on PDFBOX-860:
-----------------------------------------

Could you show us the raw extracted text as proof, instead of the text converted to RTF/DOC/GIF?

When I try the text extractor and grep for "officers" I get:

signed by military or police officers accepting you as a journalist 
journalists, using liaison officers to feed them propaganda and 
than to your pictures or copy. Junior soldiers or officers have little 
rebels of Casamance, I had problems with some officers who suspected me of 
officers say they need to "control sensitive information". For the soldiers 
"sensi-
whether police officers, paramedics or journalists, are at risk of being 
southern Liberia, was publicly flogged by four police officers for 
While support networks have long been in place for police officers 
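
One possible explanation for the difference, not confirmed anywhere in this thread: the PDF may store "fi" as the single ligature character U+FB01, which is only replaced by '?' when the extracted text is written or converted with a charset that cannot represent it. The sketch below uses only the Java standard library (no PDFBox calls); the class and method names are hypothetical, chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class, not part of PDFBox.
final class LigatureDemo {

    // Encode the text with the given charset and decode it again.
    // String.getBytes(Charset) substitutes the charset's replacement
    // byte (typically '?') for characters it cannot map.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // NFKC normalization applies compatibility decompositions, which
    // splits the U+FB01 ligature into the two letters 'f' and 'i'.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Assumption: the extracted word starts with the U+FB01 ligature.
        String extracted = "\uFB01rst";

        // Lossy: US-ASCII has no mapping for U+FB01.
        System.out.println(roundTrip(extracted, StandardCharsets.US_ASCII));

        // Safe even on an ASCII-only terminal: expand the ligature first.
        System.out.println(
            roundTrip(expandLigatures(extracted), StandardCharsets.US_ASCII));
    }
}
```

If this is what is happening, writing the extracted text as UTF-8, or normalizing it before writing, should preserve the words. Posting the raw output would show whether the '?' is already in the extracted text or only appears after conversion.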



> 'fi' getting converted to '?'
> -----------------------------
>
>                 Key: PDFBOX-860
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-860
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>         Environment: Solaris 10
>            Reporter: Saurabh Mehrotra
>         Attachments: INSI-SURVIVAL-GUIDE-4-JOURNALISTS.zip, new_evidence.zip
>
>
> Hi
> I am trying to use PDFBox version 1.2.1 to extract text from PDF files.
> The following issue is observed in the extracted text:
> 1. Combination of the characters 'fi' is converted to a '?'
> example:  first becomes ?rst
>                   classifier becomes classi?er
>                   find becomes ?nd
> Is this a known bug? Can some setting of PDFBox be turned off to prevent 
> this?
> Thanks & Regards
> Saurabh

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
