Hyphenation example:
On our site we are doing large-scale conversion of previously published
material to HTML via OCR. As we are reproducing the formatting as well as
the text, this results in many hyphenations. For a page with several examples:
http://mdsa.net/megafile/msa/speccol/sc2900/sc2908/000001/000138/html/am138--606.html
The hyphens appear as regular (-). No special characters are inserted by
the OCR programs.
At 10:35 AM 2/6/01, you wrote:
>Geoff Hutchison wrote:
> >
> > On Mon, 5 Feb 2001, Greg Lepore wrote:
> >
> > > Have searched the site and the faq with no results. Is there
> > > any way for HTDIG to re-create words that are broken across two lines
> > > with a hyphen?
> >
> > I suspect you're talking about external documents as I've never seen
> > hyphenation in HTML documents (and rarely seen it in text documents).
> >
> > You'd probably have to tackle this on the converter or parser level and I
> > don't know if this can happen at the moment. Of course if you give us more
> > detail (like the file types you're considering), someone might be able to
> > come up with a solution for you.
> >
> > You could undoubtedly do it in the source itself by keeping track of the
> > last word requested if it ends in a hyphen. But this hasn't been requested
> > before. Test documents would be quite welcome.
>
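The "keep track of the last word if it ends in a hyphen" idea can be sketched roughly as below. This is only an illustration in Python, not ht://Dig code; the function name join_line_end_hyphens is made up for the example, and it assumes the text arrives line by line with the hyphens as plain "-".

```python
def join_line_end_hyphens(lines):
    """Yield words, merging a word that ends a line with '-' into the
    first word of the next non-blank-free line (hypothetical helper)."""
    pending = ""  # fragment carried over from the previous line
    for line in lines:
        words = line.split()
        if not words:
            # Blank line: give up on the carried fragment and emit it as-is.
            if pending:
                yield pending
                pending = ""
            continue
        if pending:
            # Drop the trailing '-' and glue the fragment to the next word.
            words[0] = pending[:-1] + words[0]
            pending = ""
        if len(words[-1]) > 1 and words[-1].endswith("-"):
            # Hold back the last word until the next line is seen.
            pending = words.pop()
        yield from words
    if pending:
        yield pending
```

Note that this joins blindly: a genuine compound broken at its own hyphen (say "large-" / "scale") would come out as "largescale", so it is only a starting point.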
>I think that this feature would significantly increase the usability of
>Ht://Dig on PDF and other "pre-print" document types. However,
>reconstruction of hyphenated words would need an additional database -
>probably something similar to the TeX hyphenation database - and would
>slow down the indexing process for those documents.
>
>If the TeX hyphenation databases could be transformed into a pattern
>recognition database for hyphenated words, the slow-down of the indexer
>process would not hurt too much - after all, only words ending with "-"
>would be considered for lookup in the de-hyphenation database. If those
>words produce a hit, the next portion of the document could be checked
>against the value parts of the pattern database.
>
>E.g. the TeX patterns "hy\-per hy\-phe\-na\-tion" could be transformed
>into the following key/value pairs:
> "hy-" -> ( "per" "phenation" )
> "hyphe-" -> ( "nation" )
> "hyphena-" -> ( "tion" )
>
>This is quite a simple approach and does not take multiple hyphenated
>words into account, but it might work for most cases where hyphenation
>occurs in PDF or PostScript documents. It also requires quite some
>storage space for de-hyphenation lookup tables, so maybe there is a
>somewhat nicer approach?
>
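For what it's worth, the key/value idea above might look something like this in Python. It takes the "hy\-per" notation from the example literally; real TeX pattern files use a different, digit-based notation and would need a conversion step first. The names build_table and dehyphenate are invented for this sketch and are not part of ht://Dig.

```python
from collections import defaultdict


def build_table(patterns):
    """Map each 'prefix-' to the set of remainders that may follow it.

    `patterns` is a whitespace-separated string in which a backslash
    followed by '-' marks each allowed break point, as in the example.
    """
    table = defaultdict(set)
    for pattern in patterns.split():
        parts = pattern.split("\\-")
        for i in range(1, len(parts)):
            prefix = "".join(parts[:i]) + "-"
            suffix = "".join(parts[i:])
            table[prefix].add(suffix)
    return table


def dehyphenate(fragment, next_word, table):
    """Return the joined word if (fragment, next_word) is a known split,
    otherwise None so the caller can index both pieces unchanged."""
    if next_word in table.get(fragment, ()):
        return fragment[:-1] + next_word
    return None
```

With table = build_table(r"hy\-per hy\-phe\-na\-tion"), a call such as dehyphenate("hyphe-", "nation", table) gives back the joined word, while an unknown continuation after "hy-" gives None. Only words ending in "-" ever touch the table, which is the point about limiting the slow-down.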
>HTML documents *could* (in theory) be hyphenated as well - there is a
>special entity (soft hyphen, "&shy;") which could be used to
>automagically hyphenate documents in the web client. It should be no
>problem to make the Ht://Dig indexer recognize this special entity (by
>simply skipping over it instead of translating it to "-"). However, there
>are only a few browsers out there which support the "&shy;" hyphenation
>feature - AFAIK only Lynx is able to display "&shy;"-hyphenated documents
>correctly (all other browsers translate it to "-" regardless of whether
>hyphenation is required or not).
>
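Skipping over the soft hyphen could be as simple as the following Python sketch. It is only an illustration of the idea, not the ht://Dig parser; strip_soft_hyphens is a made-up name, and it assumes the soft hyphen shows up either as the "&shy;" entity, as a numeric reference, or as the raw U+00AD character.

```python
import re

# Soft hyphen as named entity, decimal/hex character reference, or raw U+00AD.
_SOFT_HYPHEN = re.compile("&shy;|&#173;|&#[xX][aA][dD];|\u00ad")


def strip_soft_hyphens(text):
    """Remove soft hyphens so the word is indexed in one piece.
    Ordinary '-' characters are left untouched."""
    return _SOFT_HYPHEN.sub("", text)
```

Because ordinary hyphens are not touched, compounds such as "pre-print" would still be indexed the way they are today.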
>
>ciao,
>
> Torsten
>
>--
>InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
>Waldhofstraße 14 Tel: +49-4101-403605
>D-25474 Ellerbek Fax: +49-4101-403606
>E-Mail: [EMAIL PROTECTED] Internet: http://www.inwise.de
Gregory Lepore
Maryland Electronic Capital Webmaster
410-260-6425
[EMAIL PROTECTED]
_______________________________________________
htdig-general mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/htdig-general