Re: Issues when indexing PDF files

Alexandre Rafalovitch Wed, 16 Dec 2015 23:43:56 -0800

The PDFs could be using custom fonts and non-Unicode character encodings.
That's probably something to explore with PDF-specific tools.
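
As a rough first check before reaching for PDF tools, one could measure how
much of the extracted text consists of placeholder characters. Below is a
minimal sketch in Python, assuming the extracted content (from the Tika app or
from the Solr content field) is already available as a string. The function
names and the 0.5 threshold are illustrative only and are not part of Solr or
Tika.

```python
def garbled_ratio(text):
    """Fraction of non-whitespace characters that are '?' or U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # Empty extraction is treated as a complete failure.
        return 1.0
    bad = sum(1 for ch in visible if ch in ("?", "\ufffd"))
    return bad / len(visible)


def looks_garbled(text, threshold=0.5):
    """Heuristic: flag text that is mostly placeholders (threshold is a guess)."""
    return garbled_ratio(text) >= threshold


if __name__ == "__main__":
    # Hypothetical samples standing in for extracted PDF content.
    for sample in ["?????? ????", "", "Plain readable text."]:
        print(repr(sample), looks_garbled(sample))
```

A high ratio or an empty result would be consistent with a text layer that
cannot be mapped to Unicode, which points at the PDF's fonts rather than at
Solr itself.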
On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com> wrote:


> I've checked all the files which have problems with their content in the
> Solr index using the Tika app. All of them show the same issues as what I
> see in the Solr index.
>
> So does the issue lie with the encoding of the file? Are we able to check
> the encoding of the file?
>
>
> Regards,
> Edwin
>
>
> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
>
> > Hi Erik,
> >
> > I've shared the file on dropbox, which you can access via the link here:
> > https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> >
> > This is what I get from the Tika app after dropping the file in.
> >
> > Content-Length: 75092
> > Content-Type: application/pdf
> > Type: COSName{Info}
> > X-Parsed-By: org.apache.tika.parser.DefaultParser
> > X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> > X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> > access_permission:assemble_document: true
> > access_permission:can_modify: true
> > access_permission:can_print: true
> > access_permission:can_print_degraded: true
> > access_permission:extract_content: true
> > access_permission:extract_for_accessibility: true
> > access_permission:fill_in_form: true
> > access_permission:modify_annotations: true
> > dc:format: application/pdf; version=1.3
> > pdf:PDFVersion: 1.3
> > pdf:encrypted: false
> > producer: null
> > resourceName: Desmophen+670+BAe.pdf
> > xmpTPg:NPages: 3
> >
> >
> > Regards,
> > Edwin
> >
> >
> > On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com>
> > wrote:
> >
> >> Edwin - Can you share one of those PDF files?
> >>
> >> Also, drop the file into the Tika app and see what it sees directly - get
> >> the tika-app JAR and run that desktop application.
> >>
> >> Could be an encoding issue?
> >>
> >>         Erik
> >>
> >> —
> >> Erik Hatcher, Senior Solutions Architect
> >> http://www.lucidworks.com <http://www.lucidworks.com/>
> >>
> >>
> >>
> >> > On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I'm using Solr 5.3.0
> >> >
> >> > I'm indexing some PDF documents. However, certain PDF files contain
> >> > Chinese text, but after indexing, what is indexed in the content is
> >> > either a series of "??????" or empty content.
> >> >
> >> > I'm using the post.jar that comes together with Solr.
> >> >
> >> > What could be the reason that causes this?
> >> >
> >> > Regards,
> >> > Edwin
> >>
> >>
> >
>
