RE: Issues with extraction content of PDF files

Allison, Timothy B. Fri, 18 Dec 2015 10:41:17 -0800

Colleagues,
  To spare you at least the initial diagnosis, here is what I found.  From [0]:


>>That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode
>>mapping for CID+71 (71) in font 505Eddc6Arial
>>So, if the file has no Unicode mapping for the font, I doubt they'll be able 
>>to fix it.
>>pdftotext is also unable to extract anything useful from the file.

 [0] http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201512.mbox/%3cby2pr09mb11297223e13e266cfb2a5ffc7...@by2pr09mb112.namprd09.prod.outlook.com%3E
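
For anyone who wants to flag such files before they reach the index, here is a
minimal sketch of a check on the already-extracted text. It is not part of
PDFBox, Tika, or Solr; the class name ExtractionCheck, the method
looksUnextractable, and the threshold parameter are all hypothetical names
made up for illustration. It only treats text as suspect when it is empty or
dominated by '?' or U+FFFD, which matches the symptoms described in this
thread but will not catch every kind of bad extraction.

```java
// Hypothetical helper, not part of PDFBox, Tika, or Solr.
final class ExtractionCheck {
    private ExtractionCheck() {}

    /**
     * Returns true when the text is null, contains no non-whitespace
     * characters, or when at least `threshold` (0.0 to 1.0) of its
     * non-whitespace code points are '?' or U+FFFD (the replacement character).
     */
    static boolean looksUnextractable(String text, double threshold) {
        if (text == null) {
            return true;
        }
        int total = 0;
        int bad = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace says nothing about extraction quality
            }
            total++;
            if (cp == '?' || cp == 0xFFFD) {
                bad++;
            }
        }
        // No visible characters at all counts as a failed extraction.
        return total == 0 || (double) bad / total >= threshold;
    }
}
```

A caller could run this on the extracted content with an assumed threshold
such as 0.5 and route flagged documents to OCR or manual review instead of
indexing them.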


-----Original Message-----
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com] 
Sent: Friday, December 18, 2015 12:58 PM
To: users@pdfbox.apache.org
Subject: Issues with extraction content of PDF files

Hi,

I'm indexing some PDF documents in Solr. However, certain PDF files contain 
Chinese text, and after indexing, the indexed content is either a series of 
"??????" or empty.

I've also tried the Tika app, and I get the same results.

What could be causing this?

I've shared one of the affected files on Dropbox; you can access it via the 
link here:
https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0


Regards,
Edwin
