Re: Issues when indexing PDF files

Zheng Lin Edwin Yeo Wed, 16 Dec 2015 08:34:19 -0800

Hi Erik,

I've shared the file on Dropbox; you can access it via this link:
https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0


This is what I get from the Tika app after dropping the file in.

Content-Length: 75092
Content-Type: application/pdf
Type: COSName{Info}
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
access_permission:assemble_document: true
access_permission:can_modify: true
access_permission:can_print: true
access_permission:can_print_degraded: true
access_permission:extract_content: true
access_permission:extract_for_accessibility: true
access_permission:fill_in_form: true
access_permission:modify_annotations: true
dc:format: application/pdf; version=1.3
pdf:PDFVersion: 1.3
pdf:encrypted: false
producer: null
resourceName: Desmophen+670+BAe.pdf
xmpTPg:NPages: 3
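
For reference, on the "??????" symptom and the encoding question quoted
below: one way such runs can appear is when text is encoded into a charset
that cannot represent Chinese characters. The following is only a minimal
Python sketch of that general effect, not Solr or Tika code; the helper
name lossy_roundtrip and the sample string are hypothetical and are not
taken from the PDF. It does not account for the empty-content case.

```python
# Illustration only: shows how "?" runs can appear when Chinese text is
# forced through a charset that cannot represent it.
# This is not Solr or Tika code; lossy_roundtrip is a hypothetical helper.

def lossy_roundtrip(text: str, charset: str) -> str:
    """Encode text to charset, replacing unencodable characters, then decode."""
    return text.encode(charset, errors="replace").decode(charset)


sample = "中文内容"  # hypothetical sample text, not taken from the PDF

# latin-1 has no Chinese characters, so each one is replaced with "?".
print(lossy_roundtrip(sample, "latin-1"))

# UTF-8 can represent the text, so it survives the round trip unchanged.
print(lossy_roundtrip(sample, "utf-8"))
```

If the same kind of lossy conversion happens anywhere between extraction and
indexing, the stored content would show question marks even though the PDF
itself is fine.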


Regards,
Edwin


On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com> wrote:

> Edwin - Can you share one of those PDF files?
>
> Also, drop the file into the Tika app and see what it sees directly - get
> the tika-app JAR and run that desktop application.
>
> Could be an encoding issue?
>
>         Erik
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com
>
>
>
> > On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >
> > Hi,
> >
> > I'm using Solr 5.3.0
> >
> > I'm indexing some PDF documents. However, certain PDF files contain
> > Chinese text, and after indexing, the indexed content is either a
> > series of "??????" or empty.
> >
> > I'm using the post.jar that ships with Solr.
> >
> > What could be causing this?
> >
> > Regards,
> > Edwin
>
>
