To index the text in an image, the image needs to be searchable, i.e. the
text needs to be overlaid on the image as a text layer, as in a searchable
PDF. You can do this with OCR, but it is a bit unreliable if the images are
scanned copies of written text.
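
As a rough illustration of the OCR step, the sketch below shells out to the
Tesseract command-line tool, whose "pdf" output config writes a PDF with an
invisible text layer over the image. It assumes tesseract is installed and on
the PATH; the function names and the default language are only illustrative.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image_path, out_base, lang="eng"):
    """Return the tesseract command that writes <out_base>.pdf with a text layer."""
    # The trailing "pdf" is a tesseract config name that selects searchable-PDF output.
    return ["tesseract", str(image_path), str(out_base), "-l", lang, "pdf"]


def make_searchable_pdf(image_path, out_dir, lang="eng"):
    """Run OCR on one image and return the path of the generated PDF."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    image_path = Path(image_path)
    out_base = Path(out_dir) / image_path.stem
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_cmd(image_path, out_base, lang), check=True)
    # tesseract appends ".pdf" to the output base name itself.
    return Path(str(out_base) + ".pdf")
```

The resulting PDF can then be indexed like any other PDF, although the quality
of the text layer depends entirely on how well OCR handles the scan.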

On 10-Apr-2018 4:12 PM, "Rahul Singh" <rahul.xavier.si...@gmail.com> wrote:

You may need to extract the text outside Solr and index pure text with an
external ingestion process. That gives you much more control over the Tika
attributes and behaviors.
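
A minimal sketch of such an external ingestion process, assuming a standalone
Tika Server on localhost:9998 and a Solr core named "mycore" (both are
placeholders, as are the "id" and "content" field names). Tika only returns
text for images if Tesseract is installed on the machine running Tika.

```python
import json
import urllib.request
from pathlib import Path

# Assumed endpoints; adjust host, port and core name to your setup.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"


def extract_text(path, tika_url=TIKA_URL):
    """PUT the raw file to Tika Server and return the extracted plain text."""
    req = urllib.request.Request(
        tika_url,
        data=Path(path).read_bytes(),
        headers={"Accept": "text/plain"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def build_solr_doc(path, text):
    """Build the document to index; the field names here are illustrative."""
    return {"id": str(path), "content": text.strip()}


def index_docs(docs, solr_url=SOLR_UPDATE_URL):
    """POST a list of documents to Solr's JSON update handler."""
    req = urllib.request.Request(
        solr_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

For each file, calling extract_text, passing the result to build_solr_doc, and
sending the collected documents to index_docs would index the text directly,
so you can inspect or clean what Tika returns before Solr ever sees it.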

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation


On Apr 9, 2018, 10:23 PM -0400, Zheng Lin Edwin Yeo <edwinye...@gmail.com>,
wrote:
> Hi,
>
> Currently I am facing an issue whereby the text in image files such as jpg
> and bmp is not being extracted and indexed. After indexing, Tika did
> extract all the metadata and index it under the attr_* fields.
> However, the content field is always empty for image files. For other types
> of document files like .doc, the content is extracted correctly.
>
> I have already set extractInlineImages to true in tika-parsers-1.17.jar,
> under \prg\apache\tika\parser\pdf\.
>
>
> What could be the reason?
>
> I have just upgraded to Solr 7.3.0.
>
> Regards,
> Edwin
