The PDFs could be using custom fonts and non-Unicode characters. That's probably something to explore with PDF-specific tools.

On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com> wrote:
> I've checked all the files that have problems with their content in the Solr
> index using the Tika app. All of them show the same issues as what I see
> in the Solr index.
>
> So does the issue lie with the encoding of the file? Are we able to check
> the encoding of the file?
>
> Regards,
> Edwin
>
> On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
>
> > Hi Erik,
> >
> > I've shared the file on Dropbox, which you can access via the link here:
> > https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0
> >
> > This is what I get from the Tika app after dropping the file in.
> >
> > Content-Length: 75092
> > Content-Type: application/pdf
> > Type: COSName{Info}
> > X-Parsed-By: org.apache.tika.parser.DefaultParser
> > X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
> > X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
> > access_permission:assemble_document: true
> > access_permission:can_modify: true
> > access_permission:can_print: true
> > access_permission:can_print_degraded: true
> > access_permission:extract_content: true
> > access_permission:extract_for_accessibility: true
> > access_permission:fill_in_form: true
> > access_permission:modify_annotations: true
> > dc:format: application/pdf; version=1.3
> > pdf:PDFVersion: 1.3
> > pdf:encrypted: false
> > producer: null
> > resourceName: Desmophen+670+BAe.pdf
> > xmpTPg:NPages: 3
> >
> > Regards,
> > Edwin
> >
> > On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com> wrote:
> >
> >> Edwin - Can you share one of those PDF files?
> >>
> >> Also, drop the file into the Tika app and see what it sees directly -
> >> get the tika-app JAR and run that desktop application.
> >>
> >> Could be an encoding issue?
> >>
> >> Erik
> >>
> >> --
> >> Erik Hatcher, Senior Solutions Architect
> >> http://www.lucidworks.com
> >>
> >> > On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I'm using Solr 5.3.0
> >> >
> >> > I'm indexing some PDF documents. However, for certain PDF files, there
> >> > is Chinese text in the documents, but after indexing, what is indexed
> >> > in the content is either a series of "??????" or empty content.
> >> >
> >> > I'm using the post.jar that comes together with Solr.
> >> >
> >> > What could be the reason that causes this?
> >> >
> >> > Regards,
> >> > Edwin
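
As a rough way to triage the affected files before reaching for PDF tools, something like the Python sketch below could be run over the text that Tika extracts. It only sorts the extracted text by symptom (empty, mostly "?" placeholders, mostly Private Use Area code points, or containing CJK characters). The function name, the labels, and the 50% threshold are all made up for illustration, and the idea that custom fonts may surface as Private Use Area code points is an assumption, not something confirmed for this particular file.

```python
import unicodedata


def describe_extracted_text(text: str) -> str:
    """Label extracted text by symptom (hypothetical helper, not part of Tika or Solr)."""
    # Ignore whitespace so that a page of blank lines counts as empty.
    stripped = "".join(ch for ch in text if not ch.isspace())
    if not stripped:
        return "empty"

    # "?" and U+FFFD (the Unicode replacement character) are the usual
    # stand-ins for characters that could not be mapped or encoded.
    lost = sum(1 for ch in stripped if ch in "?\ufffd")
    if lost / len(stripped) > 0.5:  # arbitrary threshold, chosen for illustration
        return "mostly-placeholders"

    # Category "Co" is the Private Use Area. Assumption: glyphs from a custom
    # font with no Unicode mapping may show up in this range.
    private = sum(1 for ch in stripped if unicodedata.category(ch) == "Co")
    if private / len(stripped) > 0.5:
        return "private-use"

    # CJK Unified Ideographs (basic block only), which covers common Chinese text.
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in stripped)
    return "has-cjk" if has_cjk else "no-cjk"
```

If a file already comes back as "empty" or "mostly-placeholders" straight from the Tika app, the characters were lost during extraction rather than in Solr, which matches what Edwin reports above.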