I've been investigating the DjVu format to find out if it
could do what you mentioned.  I asked the guys at LizardTech
(www.lizardtech.com) if they had an off-the-shelf product.  They
don't, but they could do a custom job.  DjVu has what is called a
"hidden text layer".  Since it is regular text, I figured it could
also carry XML markup.  Linux has text-based DjVu utilities, but I think
the tools to manipulate the hidden text layer are all Windows-based.
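
For reading (as opposed to editing) the hidden text layer, here is a
rough sketch of what I have in mind.  It assumes the djvutxt utility from
DjVuLibre is installed and on the PATH; the function names are my own,
and I have not run this against a large collection.

```python
import subprocess
from pathlib import Path
from typing import Callable


def hidden_text(djvu_file: Path) -> str:
    """Return the hidden text layer of a DjVu file.

    Assumption: DjVuLibre's djvutxt is installed; called with only the
    input file, it writes the text layer to standard output.
    """
    result = subprocess.run(
        ["djvutxt", str(djvu_file)],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def mentions(
    djvu_file: Path,
    phrase: str,
    extract: Callable[[Path], str] = hidden_text,
) -> bool:
    """True if the file's hidden text contains phrase, ignoring case."""
    return phrase.casefold() in extract(djvu_file).casefold()
```

Something like mentions(Path("scan.djvu"), "some phrase") would then
answer the question for one file that already carries a text layer.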

Rich


On Sunday 18 March 2007 10:08, Kai Ponte wrote:
> On Sunday 18 March 2007 02:57:49 am Anders Norrbring wrote:
> > Does anybody know of a way to scan several thousands pictures on disk
> > with an OCR application to look for a specific text, and then list the
> > images where that text was found?
>
> I've been doing apps like that for the better part of twelve years.
> However, I've yet to see an OCR app in Linux. That doesn't mean they
> don't exist, however, because I'm stuck in a predominantly Windows world
> at work. I know we currently either use OCR For Anydocs or Kofax Ascent.
> In fact, we're looking at replacing our current systems in the next few
> years.
>
> What you will need to do is probably write some program to take the
> imaged documents - done so with whatever scanner you've got - and then
> process the documents through the OCR engine supplied by the
> manufacturer. Typically this is a library like the AVI or MP3 libraries
> used by your most commonly requested SUSE applications.
>
> Keep in mind that you'll also need a retrieval program of some
> sort, to actually get the documents and view them - along with the
> OCR data - in some manner.  This is one I wrote in 2003, which combined
> OCR from barcode and an imaging application based on FileNet:
> http://www.filesite.org/viewtopic.php?t=173
>
> You didn't mention whether you're doing spot or forms recognition or
> full-text OCR. You might also look at barcode recognition, because those
> are VERY reliable, even over fax.
>
>
> Try these links for the OCR software:
>
> Google apparently has an OCR engine that is now OSS..
>
>
> http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html
>
> http://sourceforge.net/projects/tesseract-ocr
>
> ..never heard of it before this morning. Should be interesting to look at
> though.  Apparently it is old HP software that had been shelved
> for twelve years and has now been resurrected.
>
>
> ABBYY is a well-known industrial-strength app for OCR. I've never
> personally used them (mostly stick with Caere) but have heard great
> things....AND...they have an SDK for *nix and/or TheCultOfMac.
>
> http://www.abbyy.com/sdk/?param=59956
>
> I also saw this one...
>
> http://www.linux-ocr.ekitap.gen.tr/
>
> Keep in mind that we process over 3M documents/year - that comes out to
> roughly 15,000 every day, including weekends.  We currently have eight
> high-speed scanners, and are evaluating whether to purchase some new
> Kodak i860 models at $75,000 each. I just state this so you know our
> volume.
>
> --
> kai
> www.perfectreign.com || www.4thedadz.com
> www.filesite.org || www.donutmonster.com
>
> closing the doors that surround me
> so no one will ever penetrate
> complete my retreat just to wait for the day
> that never comes so i will laugh alone
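
The workflow discussed above (run every image through an OCR engine,
then list the images where the text was found) might be sketched as
follows.  This is only an illustration under assumptions: it shells out
to the tesseract command-line tool, assumes a build that accepts
"stdout" as the output name and can read the listed image formats, and
all function names are made up for the example.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List

# Assumed set of scan formats; adjust to what the OCR build can read.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def ocr_with_tesseract(image: Path) -> str:
    """Run the tesseract CLI on one image and return the recognized text."""
    result = subprocess.run(
        ["tesseract", str(image), "stdout"],
        capture_output=True,
        text=True,
        check=False,  # a failed page simply yields no text
    )
    return result.stdout


def list_images(root: Path) -> List[Path]:
    """Collect image files below root, sorted for stable output."""
    return sorted(
        p for p in root.rglob("*") if p.suffix.lower() in IMAGE_SUFFIXES
    )


def find_images_with_text(
    images: Iterable[Path],
    needle: str,
    ocr: Callable[[Path], str] = ocr_with_tesseract,
) -> List[Path]:
    """Return the images whose OCR text contains needle, ignoring case."""
    wanted = " ".join(needle.split()).casefold()
    matches = []
    for image in images:
        # Collapse whitespace so a phrase split across OCR lines still matches.
        text = " ".join(ocr(image).split()).casefold()
        if wanted in text:
            matches.append(image)
    return matches
```

Calling find_images_with_text(list_images(Path("/path/to/scans")),
"invoice number") would print nothing by itself; it returns the list of
matching paths.  The OCR step is passed in as a function so that a
different engine can be swapped in, and for thousands of images it would
probably pay to cache the recognized text rather than re-run OCR for
every search.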

