Not sure where '^\376\067\0\043' came from. I've pulled down the newest doc2html.pl
version (http://cvs.sourceforge.net/viewcvs.py/htdig/htdig/contrib/doc2html/doc2html.pl)
and reconfigured. Adding the "binmode FILE;" line in sub
read_magic has corrected the problem.
Most of the parsed documents are clean; however,
there are a number that contain characters like those displayed below. Any ideas on how
to clean this up further?
| | | | | ÃâÅ | d
Ã
8 8 8 8 L Ãâ Ã Ã Ã Ã ÃÅ
k ÃÅ ÃÂ ÃÂ ÃÂ ÃÂ ÃÂ 8
ÃÅ ÃÅ ÃÅ ÃÅ ÃÅ ÃÅ ÃÅ ÃÅ ÃÅ
ÃÅ ÃÅ ÃÅ ÃÅ Ãâ ÃÅ ÃÅ Ãâ Ãâ Ãâ ÃÅ ÃÅ ÃÅ Ãâ ÃÅ Ãâ Ãâ ÃÅ Ãââ Ãâ Ãâ Ãâ
--- NOTE for developers ---
In doc2html.pl (all versions?), around line 253, in the if($XLS2HTML) statement,
the $cmdl = "$cmd -fw $input" line uses the options "-fw". It appears those are
no longer valid in xls2csv (Catdoc version 0.93.3).
Thanks very much for your suggestion!

From: David Adams [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 22, 2004 4:02 AM
To: Wendt, Trevor; [EMAIL PROTECTED]
Subject: Re: [htdig] Issues when parsing msword docs

Trevor,
The magic number for Word documents (Word6 &
later) is \320\317\021\340, the same as Excel spreadsheets. That is
why doc2html.pl has to check both magic number and MIME-type.
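Because Word and Excel files share that signature, the magic number alone cannot tell them apart. A minimal sketch of the two-step check follows. It is not the actual doc2html.pl code: the helper name classify_ole and its return values are hypothetical, and the two MIME-type strings are assumed to be what the caller passes in.

```perl
use strict;
use warnings;

# Signature quoted in the message: the first four bytes of the file.
my $OLE_MAGIC = "\320\317\021\340";

# Hypothetical helper (not from doc2html.pl): decide the document type
# from the magic number first, then use the MIME type to disambiguate.
sub classify_ole {
    my ($magic, $mime) = @_;
    return 'unknown' unless substr($magic, 0, 4) eq $OLE_MAGIC;
    return 'word'  if $mime eq 'application/msword';
    return 'excel' if $mime eq 'application/vnd.ms-excel';
    return 'ole-other';    # right signature, unrecognised MIME type
}

print classify_ole("\320\317\021\340\241\261\032\341", 'application/msword'), "\n";
```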
I don't recognise '^\376\067\0\043',
could it be the magic number for some form of Word for Macintosh
document?
Doc2html.pl should read the start of each file as
binary to get the magic number, but on some systems it reads it as text. Add a
line to sub read_magic so that it becomes:
open(FILE, "< $Input") || die "Can't open file $Input\n";
binmode FILE;
read FILE,$Magic,256;
close FILE;
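For anyone who wants to try the fix outside doc2html.pl, here is a self-contained sketch of the same idea. The function name read_magic_bytes, the lexical filehandle and the temporary test file are assumptions made for illustration and are not part of the script. On DOS-like systems a text-mode read translates CRLF pairs and may stop at a Ctrl-Z byte, which is why binmode matters for a binary signature.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-alone variant of read_magic: return the first
# $len bytes of a file (256 by default), read in binary mode.
sub read_magic_bytes {
    my ($path, $len) = @_;
    open(my $fh, '<', $path) or die "Can't open file $path: $!\n";
    binmode $fh;    # no newline translation, no encoding layers
    my $magic = '';
    read($fh, $magic, $len // 256);
    close $fh;
    return $magic;
}

# Demo: write the signature plus some awkward bytes (CR, LF, Ctrl-Z)
# to a temporary file, then read them back.
my ($tmp, $tmpname) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\320\317\021\340\r\n\032rest";
close $tmp;

my $magic = read_magic_bytes($tmpname, 8);

# Print each byte as a three-digit octal escape for inspection.
print join(' ', map { sprintf('\\%03o', ord($_)) } split(//, $magic)), "\n";
```

With binmode in place, the eight bytes come back exactly as written, whatever the platform's text-mode conventions are.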
Let us know how you get on.
David Adams
Corporate Information Services
Information Systems Services
University of Southampton

