Geoff Hutchison wrote:
>
> On Tue, 28 May 2002, Elaine Fortin wrote:
>
> > We want to store faxes and be able to search them with htdig.
> ...
> > In order to be able to search the content, would we have to run them
> > through an OCR program, or is there something else that can translate them?
>
> You'd have to have some sort of OCR in there.
>
> A fax TIFF file is pure graphic--there's very little text content. (TIFF
> files in general can have some useful text info, but I think you're
> looking for the text in the fax, not text that the fax program may or may
> not store in the TIFF.)

OCR programs need to be "trained" before they can reliably recognize
text in a graphic, and the text in the image needs to be properly
aligned for the OCR software to recognize it as text at all.

That means: the OCR program needs to know the font used in the graphics
files, *plus* there should be little alignment offset (less than 3°),
or else any OCR program, even the best commercially available one, will
produce almost completely unreadable output.
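
To put a rough number on that alignment figure, here is a minimal sketch
of how far a text baseline drifts across a page at a given skew. The page
width (8.5 inches) and the line spacing (6 text lines per inch) are
assumptions chosen for illustration, not values taken from any particular
fax or OCR program.

```python
import math

def baseline_drift_inches(skew_degrees, line_width_inches):
    """Vertical distance a text baseline drifts across the given width."""
    return line_width_inches * math.tan(math.radians(skew_degrees))

# Assumed values for illustration only: a letter-width text line of
# 8.5 inches and 6 text lines per inch (12-point line spacing).
drift = baseline_drift_inches(3.0, 8.5)
lines_crossed = drift * 6
print(f"drift: {drift:.2f} in, about {lines_crossed:.1f} text lines")
```

Under these assumptions a 3-degree skew moves the baseline by a bit under
half an inch from one margin to the other, so a single horizontal scan
row cuts through two to three different lines of text.
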
When facsimiles are to be indexed by ht://Dig by translating the graphics
to text with any given OCR software, one has to take into account that

(a) facsimiles are in most cases *not* aligned accurately enough to be
    analyzed by an OCR program (hand-faxed sheets will normally have
    offsets of 3° or more)
(b) the fonts used in facsimiles cannot be controlled, so the OCR
    software cannot be trained on their characters (in particular,
    they will never be sent in specially designed OCR fonts)
(c) facsimiles (especially hand-transmitted ones) will in many cases
    contain valuable information added in handwriting
(d) facsimiles will contain "useless" information that text-indexing
    software like ht://Dig cannot skip (since there is no way of
    inserting the respective control statements for the analyzing
    software)

All this makes facsimiles (and most scanned texts) nearly unfit for
automatic processing with OCR and indexing programs.
cheers,
Torsten
--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14 Tel: +49-4101-403605
D-25474 Ellerbek Fax: +49-4101-403606
E-Mail: [EMAIL PROTECTED] Internet: http://www.inwise.de
_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev