I tried this solution from Tim Allison, and it works.

http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files
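
For anyone who would rather run OCR outside Solr, as Arian suggests further down the thread, here is a minimal sketch of that route in Python. It is not the Tika-based solution from the link above. It assumes the `tesseract` command-line tool is installed and on the PATH, that each scanned page is already an image file such as PNG or TIFF (a scanned PDF would need to be rendered to images first), and that Solr accepts JSON updates at the usual `/update` endpoint. The field name `content_txt`, the collection name, and the file names are hypothetical placeholders.

```python
import json
import subprocess
import urllib.request


def tesseract_command(image_path, lang="eng"):
    """Build the Tesseract CLI call; the 'stdout' output base prints the text."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_ocr(image_path, lang="eng"):
    """Run Tesseract on one page image and return the recognised text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout.strip()


def solr_update_request(solr_url, collection, doc_id, text):
    """Build (but do not send) a JSON update request for one document."""
    # 'content_txt' is a hypothetical field name; use one from your schema.
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        f"{solr_url}/{collection}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_scan(image_path, solr_url, collection, doc_id):
    """OCR one page image and post the extracted text to Solr."""
    request = solr_update_request(solr_url, collection, doc_id, run_ocr(image_path))
    with urllib.request.urlopen(request) as response:
        return response.status
```

A call such as `index_scan("scan-page-1.png", "http://localhost:8983/solr", "docs", "scan-1")` would then OCR the page and index the text. Keeping the command and request builders separate from the functions that execute them makes it easy to check what will be sent before touching Tesseract or Solr.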

Regards,
Edwin

On 27 March 2017 at 20:07, Allison, Timothy B. <talli...@mitre.org> wrote:

> Please also see:
>
> https://wiki.apache.org/tika/TikaOCR
>
> and
>
> https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
>
> If you have any other questions about Apache Tika and OCR, please feel
> free to ask on our users list as well: u...@tika.apache.org
>
> Cheers,
>
>            Tim
>
> -----Original Message-----
> From: Arian Pasquali [mailto:arianpasqu...@gmail.com]
> Sent: Sunday, March 26, 2017 11:44 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Index scanned documents
>
> Hi Waleed,
>
> I've never done that with Solr, but you would probably need to run some
> OCR preprocessing before indexing.
> The most popular library I know for the job is tesseract-ocr
> <https://github.com/tesseract-ocr>.
>
> If you want to do that inside Solr, I've found that Tika has some support
> for it too.
> Take a look at Vijay Mhaskar's post on how to do this using TikaOCR:
>
> http://blog.thedigitalgroup.com/vijaym/using-solr-and-tikaocr-to-search-text-inside-an-image/
>
> I hope that helps.
>
> On Sun, 26 Mar 2017 at 16:09, Waleed Raza <
> waleed.raza.parhi...@gmail.com> wrote:
>
> > Hello,
> > How can we extract text in Solr from images that are embedded in PDF
> > and MS Office documents?
> > I searched many websites but did not find an answer. Please guide me.
> >
> > On Sun, Mar 26, 2017 at 2:57 PM, Waleed Raza <
> > waleed.raza.parhi...@gmail.com> wrote:
> >
> > > Hello,
> > > How can we extract text in Solr from images that are embedded in PDF
> > > and MS Office documents?
> > > I searched many websites but did not find an answer. Please guide me.
> > >
> > >
> >
> --
>
> *Arian Rodrigo Pasquali*
> Laboratório de Inteligência Artificial e Apoio à Decisão Laboratory of
> Artificial Intelligence and Decision Support
>
> *INESC TEC*
> Campus da FEUP
> Rua Dr Roberto Frias
> 4200-465 Porto
> Portugal
>
> T +351 22 040 2963
> F +351 22 209 4050
> arian.r.pasqu...@inesctec.pt
> www.inesctec.pt
>
