RE: Issues when indexing PDF files

Allison, Timothy B. Thu, 17 Dec 2015 05:57:13 -0800

Generally, I'd recommend opening an issue on PDFBox's Jira with the file that 
you shared.  Tika uses PDFBox...if a fix can be made there, it will propagate 
back through Tika to Solr.


That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode 
mapping for CID+71 (71) in font 505Eddc6Arial

So, if the file has no Unicode mapping for the font, I doubt they'll be able to 
fix it.

pdftotext is also unable to extract anything useful from the file.
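
One way to repeat this kind of cross-check is to run the same file through more than one extractor and compare the output. The sketch below is only an illustration: it assumes `java` and `pdftotext` are on the PATH, and the `tika-app.jar` location and the helper names are hypothetical.

```python
import subprocess


def extraction_commands(pdf_path, tika_jar="tika-app.jar"):
    """Build command lines for two independent PDF text extractors.

    The tika_jar default is a placeholder; point it at your own tika-app JAR.
    """
    return {
        # tika-app: --text prints the extracted plain text to stdout.
        "tika": ["java", "-jar", tika_jar, "--text", pdf_path],
        # pdftotext: "-" as the output file sends the text to stdout.
        "pdftotext": ["pdftotext", pdf_path, "-"],
    }


def extract_all(pdf_path, tika_jar="tika-app.jar"):
    """Run each extractor and return {tool name: extracted text}."""
    results = {}
    for name, cmd in extraction_commands(pdf_path, tika_jar).items():
        proc = subprocess.run(cmd, capture_output=True, check=False)
        # Decode leniently so stray bytes do not abort the comparison.
        results[name] = proc.stdout.decode("utf-8", errors="replace")
    return results
```

If every tool returns nothing useful, as happened with this file, the problem most likely sits in the PDF itself rather than in any one extractor.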

Sorry.

Best,

            Tim
-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Thursday, December 17, 2015 5:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Issues when indexing PDF files

On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
> Hi Alexandre,
>
> Thanks for your reply.
>
> So the only way to solve this issue is to explore with PDF specific 
> tools and change the encoding of the file?
> Is there any way to configure it in Solr?

Solr uses Tika to extract plain text from PDFs. If the PDFs have been created 
in a way that Tika cannot easily extract the text, there's nothing you can do 
in Solr that will help.

Unfortunately PDF isn't a content format but a presentation format - so 
extracting plain text is fraught with difficulty. You may see a character on a 
PDF page, but exactly how that character is generated (using a specific 
encoding, font, or even by drawing a picture) is outside your control. There 
are various businesses built on this premise - they charge for creating clean 
extracted text from PDFs - and even they have trouble with some PDFs.
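
Since the damage cannot be repaired at index time, it may still help to detect it before the document reaches Solr. A minimal sketch, assuming the extracted text is already available as a string; the function name and the 0.5 threshold are arbitrary choices for illustration, not something Solr or Tika provides:

```python
def looks_unextractable(text, threshold=0.5):
    """Return True if extracted text is empty or mostly placeholder characters."""
    # Ignore whitespace so layout padding does not dilute the ratio.
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True
    # "?" and U+FFFD are common stand-ins for glyphs with no Unicode mapping.
    placeholders = sum(1 for ch in visible if ch in "?\ufffd")
    return placeholders / len(visible) >= threshold
```

Documents flagged this way could be routed to manual review or OCR instead of being indexed with unusable content.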

HTH

Charlie

>
> Regards,
> Edwin
>
>
> On 17 December 2015 at 15:42, Alexandre Rafalovitch 
> <arafa...@gmail.com>
> wrote:
>
>> They could be using custom fonts and non-Unicode characters. That's 
>> probably something to explore with PDF specific tools.
>> On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com>
>> wrote:
>>
>>> I've checked all the files which have problems with the content in the 
>>> Solr index using the Tika app. All of them show the same issues as 
>>> what I see in the Solr index.
>>>
>>> So does the issue lie with the encoding of the file? Are we able to
>>> check the encoding of the file?
>>>
>>>
>>> Regards,
>>> Edwin
>>>
>>>
>>> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo 
>>> <edwinye...@gmail.com>
>>> wrote:
>>>
>>>> Hi Erik,
>>>>
>>>> I've shared the file on Dropbox, which you can access via the link here:
>>>>
>>>> https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
>>>>
>>>> This is what I get from the Tika app after dropping the file in.
>>>>
>>>> Content-Length: 75092
>>>> Content-Type: application/pdf
>>>> Type: COSName{Info}
>>>> X-Parsed-By: org.apache.tika.parser.DefaultParser
>>>> X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
>>>> X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
>>>> access_permission:assemble_document: true
>>>> access_permission:can_modify: true
>>>> access_permission:can_print: true
>>>> access_permission:can_print_degraded: true
>>>> access_permission:extract_content: true
>>>> access_permission:extract_for_accessibility: true
>>>> access_permission:fill_in_form: true
>>>> access_permission:modify_annotations: true
>>>> dc:format: application/pdf; version=1.3
>>>> pdf:PDFVersion: 1.3
>>>> pdf:encrypted: false
>>>> producer: null
>>>> resourceName: Desmophen+670+BAe.pdf
>>>> xmpTPg:NPages: 3
>>>>
>>>>
>>>> Regards,
>>>> Edwin
>>>>
>>>>
>>>> On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>>>>
>>>>> Edwin - Can you share one of those PDF files?
>>>>>
>>>>> Also, drop the file into the Tika app and see what it sees 
>>>>> directly - get the tika-app JAR and run that desktop application.
>>>>>
>>>>> Could be an encoding issue?
>>>>>
>>>>>          Erik
>>>>>
>>>>> —
>>>>> Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com 
>>>>> <http://www.lucidworks.com/>
>>>>>
>>>>>
>>>>>
>>>>>> On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm using Solr 5.3.0
>>>>>>
>>>>>> I'm indexing some PDF documents. However, for certain PDF files, there
>>>>>> is Chinese text in the documents, but after indexing, what is indexed in
>>>>>> the content is either a series of "??????" or empty content.
>>>>>>
>>>>>> I'm using the post.jar that comes together with Solr.
>>>>>>
>>>>>> What could be the reason that causes this?
>>>>>>
>>>>>> Regards,
>>>>>> Edwin
>>>>>
>>>>>
>>>>
>>>
>>
>


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
