>
> > > Hello,
> > >
> > > Does anybody know a converter for MS PowerPoint and MS Excel documents,
> > > or some other trick to index these document types?
> > >
> > > Thank you.
> > >
> > > Herbert Hölzlwimmer
> >
> > David J Adams wrote:
> >
> > Recent versions of the catdoc MS Word to text conve
>
> I have set up htdig so that every night, it indexes a long list of small
> web sites. In general, this works very well, but I've found that I have to
> be very careful when adding new sites to the end of the list.
>
> Some sites seem to cause htdig to fail. When this happens, htdig doesn't
>
> Hello David,
>
> thanks for your feedback.
>
> I have not played with ppHtml, as it was not a requirement in my project,
> so I have no experience of it failing...
I found out that the "Not enough space" message was only produced by
files so large that they exceeded the maximum do
I think there is a confusion here as to the meaning of "pages".
By default, when a large number of documents satisfy the search criteria
what you will get from htsearch is a maximum of ten pages, each page
containing links to ten documents - 100 documents in total.
Changing the configuration at
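
The arithmetic in the message above (ten pages of ten links each, 100 documents in total) can be sketched as follows. This is only an illustration: the function and parameter names are hypothetical and are not htdig configuration attributes.

```python
# Hypothetical illustration of the paging arithmetic described above.
# The names used here are made up for the example; they are not htdig attributes.

def max_documents(pages: int = 10, per_page: int = 10) -> int:
    """Largest number of documents reachable through the result pages."""
    return pages * per_page


def page_for_rank(rank: int, per_page: int = 10) -> int:
    """1-based result page on which the document with 1-based rank appears."""
    return (rank - 1) // per_page + 1


if __name__ == "__main__":
    print(max_documents())      # ten pages of ten links each
    print(page_for_rank(11))    # the eleventh match starts the second page
```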
>
> Hello,
>
> Does anybody know a converter for MS PowerPoint and MS Excel documents,
> or some other trick to index these document types?
>
> Thank you.
>
> Herbert Hölzlwimmer
Recent versions of the catdoc MS Word to text converter come with a program
for converting MS Excel files into .CSV
> Hmmm... well, if you go to http://archive.midrange.com/rpg400-l/index.htm and
> search for "afp print" using the form's default values, you'll see what I mean.
I tried, but:
lynx: Can't access startfile http://archive.midrange.com/rpg400-l/index.htm
--
David J Adams
<[EMAIL PROTECTED]>
Comp
Geoff Hutchison wrote:
>
> On Wed, 26 Jul 2000, Gilles Detillieux wrote:
>
> > According to [EMAIL PROTECTED]:
> > > Now I have to investigate why certain pages are flagged as
> > > "Deleted, noexcerpt"!
> >
> > Main causes:
> > - disallowed in robots.txt
> > - indexing turned off by meta robo
I've been bombarding this list with emails about htmerge 3.1.5 producing
the message "Deleted, invalid:" for pages which were apparently OK.
I still don't know why this happens, but I have found a way of avoiding
it. I was running htdig twice to produce two indexes and then htmerge
just once to
>
> According to David Adams:
> > I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a
> > year and I have been very pleased with it. I would say that we've given it a
> > good workout here. The problem with the "Deleted, invalid" messages only
> > occurs with a secon
Sorry for the length of this!
>
> According to David Adams:
> > Why does htmerge 3.1.5 flag some pages, which look OK to me, as
> > "Deleted, invalid" and not index them?
> >
> > This is happening not just with .html pages but also .doc and .pdf files.
> >
> > It happens with a simple merge f
Re using doc2html to process .xls files.
> >
> > I don't think it is quite so simple: doc2html.pl (and
> > parse_doc and conv_doc) only use the "magic number" of the
> > file to decide which utility to use for conversion.
> >
> > MS Word and Excel files can have the same magic number.
>
> Oh
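
The point about magic numbers can be illustrated with a small sketch. It is not taken from doc2html.pl; it only assumes the well-known leading byte signatures of OLE2 compound files (the container used by both legacy .doc and .xls files), PDF and RTF. The function name is hypothetical.

```python
# A minimal sketch, not doc2html.pl's actual logic: classify a file by its
# leading bytes only. Legacy Word and Excel files are both OLE2 compound
# files, so they share one signature and cannot be told apart this way.

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff(header: bytes) -> str:
    """Return a coarse type guess from the first bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"   # could be Word, Excel or PowerPoint
    if header.startswith(b"%PDF"):
        return "pdf"
    if header.startswith(b"{\\rtf"):
        return "rtf"
    return "unknown"


if __name__ == "__main__":
    fake_doc = OLE2_MAGIC + b"\x00" * 8   # stand-in for the start of a .doc
    fake_xls = OLE2_MAGIC + b"\x01" * 8   # stand-in for the start of a .xls
    print(sniff(fake_doc), sniff(fake_xls))
```

Because both stand-in headers give the same answer, a converter chosen from the magic number alone needs some further hint, such as the file extension, to pick between a Word and an Excel utility.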
Here is a second attempt to send the patch for doc2html.pl.
The error is that the line
$cmdl = "$cmd '$Input' | $ED 's#^$Input#[$Name]#";
should read:
$cmdl = "$cmd '$Input' | $ED 's#^$Input#[$Name]#'";
And the patch is:
*** doc2html.pl.err Thu Jun 8 11:41:16 2000
--- doc2html.pl Tue
I've had a report of a bug in doc2html.pl from Alain Forcioli:
>
> Dear David,
>
> I am taking the liberty of sending you a patch for the doc2html.pl Perl script.
>
> There is just a small syntax error in the case where the RTF
> converter is called. A single quote is missing at the end of the
> line.
>
>
> Hello,
>
> I use htdig 3.1.5 on linux Redhat 6.1.
>
> I have configured htdig.conf file as follows :
>
> valid_extensions: .html .htm .doc .pdf .txt
> local_default_doc: new_index.html index.html index.htm main.htm
> main_frame.htm frame.htm content.htm title.htm main2.htm
>
> local_urls
>
> hi,
>
> I have a menu system that is used to access my site. But I don't want to
> index that menu, or I don't want the menu files in search results. Is
> that possible? And how? I looked in the configuration section but
> couldn't find anything.
>
> rutger
>
> --
> Homepage: http://huize
> >Rather than a fuzzy accents search method, why not make the htdig database
> >accent independent? After all, it is case independent already!
> >For example:
> >
> >Garçon -> Garçon -> garçon -> garcon
>
> I would make the analogy to word suffixes rather than to case. There
> is a
Our web pages are overwhelmingly in English, but we do have academics who
put up pages in other languages and would like them to be searchable. I'm
sure that this is quite common.
Rather than a fuzzy accents search method, why not make the htdig database
accent independent? After all, it is ca
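
The folding suggested above (Garçon to garcon) can be sketched with Unicode decomposition. This is an illustration in Python of the general idea, not htdig code, and the function name is hypothetical.

```python
# A minimal sketch of accent- and case-independent folding, assuming the
# words are available as Unicode text. This is not how htdig stores words.
import unicodedata


def fold(word: str) -> str:
    """Lower-case a word and strip its combining accent marks."""
    decomposed = unicodedata.normalize("NFD", word)   # "ç" becomes "c" + cedilla
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.lower()


if __name__ == "__main__":
    print(fold("Garçon"))   # garcon
```

Applying the same folding to both the indexed words and the query terms would make "garcon" and "Garçon" match each other, at the cost of losing distinctions that matter in some languages.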
>
> I've spent the whole day trying...
> This is what I have in my htdig.conf file
>
> external_parsers: application/pdf /export/home/htdig/bin/parse_doc.pl
>
> This is what I have in my parse_doc.pl
>
> $CATPDF = /export/home/xpdf/bin/pdftotext";
            ^
            |
            Double
Your SGI compilation looks OK to me: there are warnings but no errors that I can see.
I get similar warnings and htdig, etc. work fine for me.
Are you saying that you get no executables built?
--
David J Adams
<[EMAIL PROTECTED]>
Computing Services
University of Southampton
---
>
> According to Geoff Hutchison:
> > At 9:09 AM + 2/9/00, [EMAIL PROTECTED] wrote:
> > >Does htdig recognize that some files may have server-side includes and
> > >always fetch them via http despite these attributes in the config file?
> > >
> > >An Apache server will process SSIs in .shtm a
How smart is the handling of the local_user_urls: and local_urls:
attributes?
Does htdig recognize that some files may have server-side includes and
always fetch them via http despite these attributes in the config file?
An Apache server will process SSIs in .shtm and .shtml files, plus .htm
and
I have found that if a page contains keywords with just a space in the
contents, like so:
Test
then the page is indexed ok, but no excerpt is shown by htsearch with
Format=Long.
Just changing that to:
Test
clears the problem.
How did I find this, and why does it matter?
Well I'
I'm testing htdig 3.1.4 in preparation for upgrading from version 3.1.2.
It looks good, but a bug I reported to this list months ago is still
there. No one commented on it at the time, so I guess it got overlooked.
A match-all search for "the problem", for example, works as expected:
"the" is a ba
I've done a quick investigation of two programs which parse Word
documents, and I thought it might interest others on the htdig list.
The two are Catdoc from http://www.fe.msk.ru/~vitus/catdoc/,
and Wp2html from http://www.res.bbsrc.ac.uk/wp2html/.
Catdoc is freeware. Wp2html is available for a
24 matches