Generally, I'd recommend opening an issue on PDFBox's Jira and attaching the file that you shared. Tika uses PDFBox, so if a fix can be made there, it will propagate back through Tika to Solr.
That said, PDFBox 2.0-RC2 extracts no text and warns:

WARNING: No Unicode mapping for CID+71 (71) in font 505Eddc6Arial

So, if the file has no Unicode mapping for the font, I doubt they'll be able to fix it.

pdftotext is also unable to extract anything useful from the file.

Sorry.

Best,

Tim

-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk]
Sent: Thursday, December 17, 2015 5:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Issues when indexing PDF files

On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
> Hi Alexandre,
>
> Thanks for your reply.
>
> So the only way to solve this issue is to explore with PDF specific
> tools and change the encoding of the file?
> Is there any way to configure it in Solr?

Solr uses Tika to extract plain text from PDFs. If the PDFs have been created in a way that Tika cannot easily extract the text, there's nothing you can do in Solr that will help.

Unfortunately PDF isn't a content format but a presentation format - so extracting plain text is fraught with difficulty. You may see a character on a PDF page, but exactly how that character is generated (using a specific encoding, font, or even by drawing a picture) is outside your control. There are various businesses built on this premise - they charge for creating clean extracted text from PDFs - and even they have trouble with some PDFs.

HTH

Charlie

>
> Regards,
> Edwin
>
>
> On 17 December 2015 at 15:42, Alexandre Rafalovitch
> <arafa...@gmail.com> wrote:
>
>> They could be using custom fonts and non-Unicode characters. That's
>> probably something to explore with PDF specific tools.
>>
>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com>
>> wrote:
>>
>>> I've checked all the files which has problem with the content in the
>>> Solr index using the Tika app. All of them shows the same issues as
>>> what I see in the Solr index.
>>>
>>> So does the issues lies with the encoding of the file? Are we able
>>> to check the encoding of the file?
>>>
>>>
>>> Regards,
>>> Edwin
>>>
>>>
>>> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo
>>> <edwinye...@gmail.com> wrote:
>>>
>>>> Hi Erik,
>>>>
>>>> I've shared the file on dropbox, which you can access via the link here:
>>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>>>>
>>>> This is what I get from the Tika app after dropping the file in.
>>>>
>>>> Content-Length: 75092
>>>> Content-Type: application/pdf
>>>> Type: COSName{Info}
>>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>>>> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
>>>> X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
>>>> access_permission:assemble_document: true
>>>> access_permission:can_modify: true
>>>> access_permission:can_print: true
>>>> access_permission:can_print_degraded: true
>>>> access_permission:extract_content: true
>>>> access_permission:extract_for_accessibility: true
>>>> access_permission:fill_in_form: true
>>>> access_permission:modify_annotations: true
>>>> dc:format: application/pdf; version=1.3
>>>> pdf:PDFVersion: 1.3
>>>> pdf:encrypted: false
>>>> producer: null
>>>> resourceName: Desmophen+670+BAe.pdf
>>>> xmpTPg:NPages: 3
>>>>
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>>
>>>> On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>>>>
>>>>> Edwin - Can you share one of those PDF files?
>>>>>
>>>>> Also, drop the file into the Tika app and see what it sees directly - get
>>>>> the tika-app JAR and run that desktop application.
>>>>>
>>>>> Could be an encoding issue?
>>>>>
>>>>> Erik
>>>>>
>>>>> --
>>>>> Erik Hatcher, Senior Solutions Architect
>>>>> http://www.lucidworks.com
>>>>>
>>>>>
>>>>>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm using Solr 5.3.0
>>>>>>
>>>>>> I'm indexing some PDF documents. However, for certain PDF files, there are
>>>>>> chinese text in the documents, but after indexing, what is indexed in the
>>>>>> content is either a series of "??????" or an empty content.
>>>>>>
>>>>>> I'm using the post.jar that comes together with Solr.
>>>>>>
>>>>>> What could be the reason that causes this?
>>>>>>
>>>>>> Regards,
>>>>>> Edwin

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
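
Since nothing in Solr can repair a PDF whose fonts lack a Unicode mapping, one possible workaround is to flag such documents before they reach the index. The following is only a rough sketch, not something Solr, Tika, or PDFBox provides: the function name `looks_unextractable` and the 0.5 threshold are made up for illustration, and it assumes you already have the extracted text as a string (for example, copied from the Tika app output). It simply checks for the two symptoms reported in this thread: empty content, or content that is mostly "?" or U+FFFD replacement characters.

```python
def looks_unextractable(text: str, max_bad_ratio: float = 0.5) -> bool:
    """Heuristic check (hypothetical helper, not part of Solr/Tika/PDFBox).

    Returns True when the extracted text is empty, or when more than
    `max_bad_ratio` of its non-whitespace characters are '?' or the
    Unicode replacement character U+FFFD.
    """
    # Drop all whitespace so blank-only output counts as empty.
    stripped = "".join(text.split())
    if not stripped:
        return True
    # Count characters that typically stand in for unmappable glyphs.
    bad = sum(1 for ch in stripped if ch in {"?", "\ufffd"})
    return bad / len(stripped) > max_bad_ratio


if __name__ == "__main__":
    # Illustrative inputs only; these are not taken from the shared PDF.
    samples = {
        "empty": "",
        "question marks": "??????",
        "normal text": "What is this? 这是中文",
    }
    for name, sample in samples.items():
        print(name, "->", looks_unextractable(sample))
```

Documents flagged this way could be set aside for manual review or OCR rather than being indexed with unusable content. The threshold is arbitrary and would need tuning against real files.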