Hi,
On 03/09/2011 11:19 AM, Eliott wrote:
> During the final phase of a project, it came to my attention that TIFF
> files can also store the image and the OCR-ed text in the same file,
> just like PDFs do. Since we have many such files, we have a business
> need to extract the text from these TIFFs.
>
> Has anybody written a text extractor, or does anybody know of a library
> that can get the text layer from these files? Is there any specific
> reason why JR does not support this out of the box?

Jackrabbit uses Apache Tika [1], which contains a parser for TIFF images.
Currently that parser only extracts the XMP and EXIF metadata embedded in
TIFFs (and we've disabled it by default in Jackrabbit), but you might
want to check whether you can extend it to also handle such text layers.
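
As a starting point for such an extension, it may help to first find out where your scanning software actually puts the OCR text. The sketch below is not Tika or Jackrabbit code. It is a minimal, standalone reader that lists every ASCII-typed field in the first IFD of a classic TIFF file, on the assumption (which I have not verified for your files) that the text is stored in such a field, for example ImageDescription (tag 270) or a vendor-specific private tag. The class name TiffAsciiTags and the method asciiTags are made up for illustration. It does only minimal validation and ignores BigTIFF and additional IFDs.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class TiffAsciiTags {
    // TIFF 6.0 field type 2: 7-bit ASCII, NUL-terminated
    private static final int TYPE_ASCII = 2;

    /** Returns tag number -> text for every ASCII field in the first IFD. */
    public static Map<Integer, String> asciiTags(byte[] tiff) {
        if (tiff.length < 8) {
            throw new IllegalArgumentException("too short to be a TIFF");
        }
        ByteBuffer buf = ByteBuffer.wrap(tiff);
        // The first two bytes select the byte order: "II" little-endian, "MM" big-endian.
        if (tiff[0] == 'I' && tiff[1] == 'I') {
            buf.order(ByteOrder.LITTLE_ENDIAN);
        } else if (tiff[0] == 'M' && tiff[1] == 'M') {
            buf.order(ByteOrder.BIG_ENDIAN);
        } else {
            throw new IllegalArgumentException("missing TIFF byte-order mark");
        }
        if (buf.getShort(2) != 42) {
            throw new IllegalArgumentException("not a classic TIFF (magic != 42)");
        }

        // Offsets are unsigned 32-bit in the format; int is enough for arrays.
        int ifd = buf.getInt(4);
        int entries = buf.getShort(ifd) & 0xFFFF;
        Map<Integer, String> result = new LinkedHashMap<>();

        // Each IFD entry is 12 bytes: tag(2) type(2) count(4) value-or-offset(4).
        for (int i = 0; i < entries; i++) {
            int entry = ifd + 2 + 12 * i;
            int tag = buf.getShort(entry) & 0xFFFF;
            int type = buf.getShort(entry + 2) & 0xFFFF;
            int count = buf.getInt(entry + 4);
            if (type != TYPE_ASCII || count <= 0) {
                continue;
            }
            // Values of up to 4 bytes are stored inline; longer ones via an offset.
            int start = count <= 4 ? entry + 8 : buf.getInt(entry + 8);
            int end = start + count;
            // Drop trailing NUL terminators, keep any embedded ones.
            while (end > start && tiff[end - 1] == 0) {
                end--;
            }
            result.put(tag, new String(tiff, start, end - start, StandardCharsets.US_ASCII));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        asciiTags(data).forEach((tag, text) -> System.out.println(tag + ": " + text));
    }
}
```

Running it against one of your files (java TiffAsciiTags scan.tif, where scan.tif is a placeholder name) should show whether the text lives in a plain ASCII field at all. If it does not show up, it is probably stored in a binary or XMP-based field, and the extension would have to look there instead.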
[1] http://tika.apache.org/
--
Jukka Zitting