I've been investigating the DjVu format to find out whether it could do what you mentioned. I asked the guys at LizardTech (www.lizardtech.com) if they had an off-the-shelf product. They don't, but they could do a custom job. DjVu has what is called a "hidden text layer". Since it is regular text, I figured it could also carry XML markup. Linux has text-based DjVu utilities, but I think the tools to manipulate the hidden text layer are all Windows-based.
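Because the hidden text layer is regular text, the original question (list the files where a given text is found) comes down to "extract text, then match". Here is a rough sketch of that idea, not a tested tool. It assumes DjVuLibre's djvutxt is on the PATH for reading the hidden text layer, and, for plain images, a tesseract build that accepts "stdout" as its output name. The function names (djvu_hidden_text, tesseract_text, find_matches) are made up for illustration.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List


def djvu_hidden_text(path: Path) -> str:
    """Dump the hidden text layer of a DjVu file.

    Assumption: DjVuLibre's `djvutxt` command is installed and on PATH.
    """
    result = subprocess.run(
        ["djvutxt", str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout


def tesseract_text(path: Path) -> str:
    """OCR a plain image file.

    Assumption: a tesseract build that writes to standard output when the
    output name is given as `stdout`.
    """
    result = subprocess.run(
        ["tesseract", str(path), "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case, since OCR output wraps lines freely."""
    return " ".join(text.split()).casefold()


def find_matches(
    paths: Iterable[Path],
    needle: str,
    extract_text: Callable[[Path], str],
) -> List[Path]:
    """Return the paths whose extracted text contains `needle`."""
    wanted = normalize(needle)
    hits = []
    for path in paths:
        try:
            text = extract_text(path)
        except subprocess.CalledProcessError:
            # The external tool rejected this file; skip it and carry on.
            continue
        if wanted in normalize(text):
            hits.append(path)
    return hits
```

To try it on a directory of DjVu files, something like find_matches(sorted(Path("scans").glob("*.djvu")), "some text", djvu_hidden_text) should return the matching paths; swap in tesseract_text and a different glob pattern for plain images. The extractor is passed in as a parameter so the matching logic does not depend on which OCR engine or file format sits underneath.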
Rich

On Sunday 18 March 2007 10:08, Kai Ponte wrote:
> On Sunday 18 March 2007 02:57:49 am Anders Norrbring wrote:
> > Does anybody know of a way to scan several thousands pictures on disk
> > with an OCR application to look for a specific text, and then list the
> > images where that text was found?
>
> I've been doing apps like that for the better part of twelve years.
> However, I've yet to see an OCR app in Linux. That doesn't mean they don't
> exist, however, because I'm stuck in a predominantly windows world at work.
> I know we currently either use OCR For Anydocs or Kofax Ascent. In fact,
> we're looking at replacing our current systems in the next few years.
>
> What you will need to do is probably write some program to take the imaged
> documents - done so with whatever scanner you've got - and then process the
> documents through the OCR engine supplied by the manufacturer. Typically
> this is a library like the AVI or MP3 libraries used by your most commonly
> requested SUSE applications.
>
> Keep in mind, that you'll need to also have a retrieval program of some
> sort, to actually get the documents and view them - along with the OCR data
> - in some manner. This is one I wrote in 2003, which combined OCR from
> barcode and an imaging application based on FileNet:
> http://www.filesite.org/viewtopic.php?t=173
>
> You didn't mention whether you're doing spot or forms recognition or
> full-text OCR. You might also look at barcode recognition, because those
> are VERY reliable, even over fax.
>
>
> Try these links for the OCR software:
>
> Google apparently has an OCR engine that is now OSS..
>
> http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html
>
> http://sourceforge.net/projects/tesseract-ocr
>
> ..never heard of it before this morning. Should be interesting to look at
> though. Apparently it is an old HP-based software that had been shelved
> for twelve years and now resurrected.
>
>
> ABBYY is a well-known industrial-strength app for OCR. I've never
> personally used them (mostly stick with Caere) but have heard great
> things....AND...they have an SDK for *nix and/or TheCultOfMac.
>
> http://www.abbyy.com/sdk/?param=59956
>
> I also saw this one...
>
> http://www.linux-ocr.ekitap.gen.tr/
>
> Keep in mind that we process over 3M documents/year - that comes out to
> roughly 15,000 every day, including weekends. We currently have eight
> high-speed scanners, and are evaluating whether to purchase some new Kodak
> i860 models at $75,000 each. I just state this so you know our volume.
>
> --
> kai
> www.perfectreign.com || www.4thedadz.com
> www.filesite.org || www.donutmonster.com
>
> closing the doors that surround me
> so no one will ever penetrate
> complete my retreat just to wait for the day
> that never comes so i will laugh alone