[CODE4LIB] Extracting Text From .tiff Files

Gavin Spomer Mon, 12 May 2014 15:10:18 -0700

Hello folks, 

I'm in the process of migrating a student newspaper collection, currently 
implemented with ResCarta, into our new bepress institutional repository. 
ResCarta has each page of a newspaper stored as a tiff file. Not only does the 
tiff file contain the graphics data, but it has some metadata in xml format and 
the fulltext of the page. I know this because I opened up some of the tiffs 
with a plain-text editor (Vim).


Although I can see the text in the file, my script has only been about 90% 
accurate in extracting it. Some of those "weird" characters seem to do wonky 
things during file I/O, for some reason. Is there a more reliable way 
to extract text stored in a tiff file? I've Googled and Googled and have pulled 
up almost nothing. But there's got to be a way, since ResCarta stores it there 
and can extract it. 
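
One possible cause of the "wonky" file IO is opening the tiff in text mode, 
where the image bytes get run through a character decoder. Below is a minimal 
Python sketch of reading the file as raw bytes instead. It rests on assumptions 
not confirmed by anything above: that the embedded xml begins with an "<?xml" 
declaration, that it ends at the next NUL byte (as TIFF ASCII fields do), and 
that it is UTF-8. The function names are made up for illustration, and page 
text stored outside such an xml block would not be picked up. 

```python
def extract_xml_blocks(data: bytes) -> list[str]:
    """Return every embedded XML block found in raw TIFF bytes.

    Assumption: each block starts at "<?xml" and runs to the next NUL byte
    (or to the end of the data if there is no NUL).
    """
    blocks = []
    start = data.find(b"<?xml")
    while start != -1:
        end = data.find(b"\x00", start)
        if end == -1:
            end = len(data)
        # Assumption: the XML is UTF-8. Undecodable bytes become U+FFFD
        # instead of raising an exception.
        blocks.append(data[start:end].decode("utf-8", errors="replace"))
        start = data.find(b"<?xml", end)
    return blocks


def extract_xml_from_file(path: str) -> list[str]:
    """Read a file in binary mode ("rb") so no byte is altered or rejected."""
    with open(path, "rb") as handle:
        return extract_xml_blocks(handle.read())
```

Calling extract_xml_from_file("page.tif") (a placeholder filename) would return 
a list of xml strings that can then be handed to a regular xml parser. 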

Any ideas? 
Gavin Spomer
Systems Programmer
Brooks Library
Central Washington University
