I'm working on upgrading to the new 3.2.0b6 version of htdig and am running
into issues when parsing MS Word and Excel docs (doc, xls).
When running doc2html.pl from the command line directly on a Word doc, I see
that the $MAGIC numbers aren't being determined correctly for the doc file.
Instead of pulling: $magic = '^\376\067\0\043'; (word6,7/97)
it pulls: ��ࡱ�>��
.0����-�������(continuous)
Similarly with xls: when running doc2html.pl from the command line, the $MAGIC
numbers aren't: $magic = '^\320\317\021\340'; as expected, but:
��ࡱ�>�� �������(continuous)
It looks like doc2html.pl is receiving these as binary files instead of text.
Running "file -i" on the documents confirms the MIME type to be
application/msword for both.
An octal dump of the first line of the file does reveal the correct $magic
numbers:
$ od -b /opt/htdig/contrib/doc2html/todays-news.doc | head -1
0000000 320 317 021 340 241 261 032 341 000 000 000 000 000 000 000 000
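For anyone who wants to repeat this check without depending on how the terminal renders raw bytes, here is a minimal shell sketch. The helper name magic_octal is hypothetical (it is not part of htdig or doc2html.pl), and it assumes an od that accepts the -An, -b and -N options, as GNU od does.

```shell
# Hypothetical helper, not part of htdig or doc2html.pl:
# print the first four bytes of a file as space-separated octal values.
magic_octal() {
    # -An drops the offset column, -b prints octal bytes, -N4 reads 4 bytes only.
    # tr/sed normalise the spacing so the result is easy to compare.
    od -An -b -N4 "$1" | tr -s ' ' | sed 's/^ //; s/ $//'
}

# Usage (path taken from the od command above):
# magic_octal /opt/htdig/contrib/doc2html/todays-news.doc
```

For a file that starts with the bytes shown in the od output above, this should print "320 317 021 340", which can then be compared directly with the octal escapes in the $magic pattern.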
Running the doc file directly through catdoc returns
pristine text output as does running the xls file directly through xls2csv.
The server is Red Hat Enterprise Linux with Perl 5.8.0.
Any ideas on how to counter and/or correct this
problem?
Thanks for any
suggestions.

