Re: Index scanned documents

Arian Pasquali Sun, 26 Mar 2017 08:51:04 -0700

Hi Waleed,

I've never done that with Solr, but you would probably need to run some OCR
preprocessing before indexing.
The most popular library I know of for the job is Tesseract OCR
<https://github.com/tesseract-ocr>.
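
A minimal sketch of that preprocessing route, assuming the tesseract
command-line tool is installed and on the PATH. The core name "scans", the
Solr URL, and the field name content_txt (which relies on a *_txt dynamic
field in the schema) are placeholders for illustration, not something from
your setup:

```python
import json
import subprocess
import urllib.request


def ocr_image(image_path, lang="eng"):
    """Run the tesseract CLI on one image and return the recognized text."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing it to a file.
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def build_solr_doc(doc_id, text):
    """Build one Solr document; the field names here are assumptions."""
    # Collapse the line breaks and repeated spaces that OCR output tends to contain.
    return {"id": doc_id, "content_txt": " ".join(text.split())}


def index_docs(docs, solr_url="http://localhost:8983/solr/scans"):
    """POST a list of documents to Solr's JSON update handler and commit."""
    request = urllib.request.Request(
        solr_url + "/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Example usage (needs tesseract installed and a running Solr core):
# text = ocr_image("scan-page-1.png")
# index_docs([build_solr_doc("scan-page-1", text)])
```

For images embedded in PDFs you would first have to get them out as image
files, for example by rendering the pages with poppler's pdftoppm, and then
run each one through ocr_image.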


If you want to do that inside Solr, I've found that Tika has some support
for that too.
Take a look at Vijay Mhaskar's post on how to do this using TikaOCR:

http://blog.thedigitalgroup.com/vijaym/using-solr-and-tikaocr-to-search-text-inside-an-image/
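
As a rough sketch of the inside-Solr route, the snippet below sends a file to
Solr's extracting request handler (/update/extract), which hands it to Tika.
It assumes that handler is enabled for your core and that tesseract is
installed on the Solr machine so Tika can use it. Whether images embedded in
PDFs actually get OCRed also depends on Tika's PDF parser settings, so treat
this as a starting point. The core name "scans" and the file name are
made up:

```python
import urllib.parse
import urllib.request


def build_extract_url(solr_url, doc_id, commit=True):
    """Build the /update/extract URL with the document id as a literal field."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return solr_url.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def post_file(solr_url, doc_id, file_path, content_type="application/octet-stream"):
    """Send the raw file bytes to the extracting request handler."""
    with open(file_path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        build_extract_url(solr_url, doc_id),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Example usage (needs a running Solr core with /update/extract enabled):
# post_file("http://localhost:8983/solr/scans", "report-1", "report.pdf",
#           content_type="application/pdf")
```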

I hope that helps.

On Sun, 26 Mar 2017 at 16:09, Waleed Raza <
waleed.raza.parhi...@gmail.com> wrote:

> Hello,
> How can we extract text in Solr from images that are embedded in PDF and
> MS Office documents?
> I searched many websites but did not find an answer. Please guide me.
-- 

*Arian Rodrigo Pasquali*
Laboratório de Inteligência Artificial e Apoio à Decisão
Laboratory of Artificial Intelligence and Decision Support

*INESC TEC*
Campus da FEUP
Rua Dr Roberto Frias
4200-465 Porto
Portugal

T +351 22 040 2963
F +351 22 209 4050
arian.r.pasqu...@inesctec.pt
www.inesctec.pt
