Re: Tiff extraction question

Jukka Zitting Wed, 09 Mar 2011 02:38:55 -0800

Hi,

On 03/09/2011 11:19 AM, Eliott wrote:

During the final phase of a project, it came to my attention that TIFF
files are also capable of storing the image and the OCR-ed text in the
same file, just like PDFs do. Since we have many such files, we have
a business need to extract text from these TIFFs.


Has anybody written a text extractor, or does anybody know of a library
that can get the text layer from these files? Is there any specific
reason why JR does not support this out of the box?

Jackrabbit uses Apache Tika [1], which contains a parser for TIFF images.
Currently the parser only extracts XMP and EXIF metadata embedded in
TIFFs (and we've disabled it by default in Jackrabbit), but you might
want to check to see if you can extend it to also handle such text layers.
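
Before extending the parser, it may help to find out where your files
actually keep the text. The rough sketch below uses plain Java (no Tika
or Jackrabbit APIs) to list the ASCII-typed tags in the first IFD of a
TIFF file, following the TIFF 6.0 header and directory layout. The class
and method names (TiffAsciiTags, readAsciiTags) are made up for
illustration, and it is only an assumption that the OCR text sits in an
ASCII tag such as ImageDescription (270); the producing tool may just as
well use XMP, a private tag, or a later IFD.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika or Jackrabbit.
final class TiffAsciiTags {

    /** Returns tag number -> text for every ASCII-typed entry in the first IFD. */
    static Map<Integer, String> readAsciiTags(byte[] tiff) {
        if (tiff.length < 8) {
            throw new IllegalArgumentException("Too short to be a TIFF file");
        }
        ByteBuffer buf = ByteBuffer.wrap(tiff);
        // Bytes 0-1 select the byte order: "II" little-endian, "MM" big-endian.
        if (tiff[0] == 'I' && tiff[1] == 'I') {
            buf.order(ByteOrder.LITTLE_ENDIAN);
        } else if (tiff[0] == 'M' && tiff[1] == 'M') {
            buf.order(ByteOrder.BIG_ENDIAN);
        } else {
            throw new IllegalArgumentException("Missing TIFF byte-order mark");
        }
        // Bytes 2-3 hold the magic number 42; bytes 4-7 the offset of the first IFD.
        if (buf.getShort(2) != 42) {
            throw new IllegalArgumentException("Missing TIFF magic number");
        }
        int ifdOffset = buf.getInt(4);
        int entryCount = buf.getShort(ifdOffset) & 0xFFFF;

        Map<Integer, String> result = new LinkedHashMap<>();
        for (int i = 0; i < entryCount; i++) {
            // Each IFD entry is 12 bytes: tag, type, count, value-or-offset.
            int entry = ifdOffset + 2 + i * 12;
            int tag = buf.getShort(entry) & 0xFFFF;
            int type = buf.getShort(entry + 2) & 0xFFFF;
            int count = buf.getInt(entry + 4);
            if (type != 2) {
                continue; // type 2 is ASCII; everything else is skipped
            }
            // Values of at most 4 bytes are stored inline, longer ones at an offset.
            int valueOffset = count <= 4 ? entry + 8 : buf.getInt(entry + 8);
            // Drop the trailing NUL terminator(s) before decoding.
            int end = count;
            while (end > 0 && tiff[valueOffset + end - 1] == 0) {
                end--;
            }
            result.put(tag, new String(tiff, valueOffset, end, StandardCharsets.US_ASCII));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        readAsciiTags(data).forEach((tag, text) -> System.out.println(tag + ": " + text));
    }
}
```

Run it against one of your files with the file path as the only
argument. It does no bounds checking on malformed files and ignores
further IFDs (extra pages), so treat it as a diagnostic only. If the
text does turn up in one of these tags, that tag would be the thing to
teach the parser about.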


[1] http://tika.apache.org/

--
Jukka Zitting
