Re: [htdig] converter

2000-08-29 Thread D . J . Adams
> > > > Hello, > > > > > > does anybody know a converter for MS PowerPoint and MS Excel documents > > > or some other trick to index these document types. > > > > > > Thank you. > > > > > > Herbert Hölzlwimmer > > > > David J Adams wrote: > > > > Recent versions of the catdoc MS Word to text conve

Re: [htdig] Indexing a list of sites -- catching failures

2000-08-29 Thread D . J . Adams
> > I have set up htdig so that every night, it indexes a long list of small > web sites. In general, this works very well, but I've found that I have to > be very careful when adding new sites to the end of the list. > > Some sites seem to cause htdig to fail. When this happens, htdig doesn't
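
One way to keep a single bad site from spoiling a nightly run is to index each site in its own process and record which ones fail. The sketch below is only an illustration of that idea, not something taken from the thread: `run_each` is a hypothetical helper, and the command lines are supplied by the caller, so no particular htdig options are assumed.

```python
import subprocess
import sys


def run_each(jobs):
    """Run each (name, argv) job in its own process.

    Returns a list of (name, exit_status) for every job that failed,
    so one failing site does not stop the remaining sites.
    """
    failures = []
    for name, argv in jobs:
        try:
            result = subprocess.run(
                argv,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            status = result.returncode
        except OSError:
            # The command could not be started at all (e.g. not installed).
            status = -1
        if status != 0:
            failures.append((name, status))
    return failures


# Stand-in jobs: each argv would be the real indexing command for one site.
example_jobs = [
    ("site-a", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("site-b", [sys.executable, "-c", "raise SystemExit(3)"]),
]
```

Calling `run_each(example_jobs)` reports only `site-b`, and a newly added site can then be tried on its own before it joins the main list.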

Re: [htdig] converter

2000-08-29 Thread D . J . Adams
> > Hello David, > > thanks for your feedback. > > I have not played with ppHtml, as this was no requirement in my project - > thus I do not have any experience of it failing... I found out that the "Not enough space" message was only produced by files so large that they exceeded the maximum do

Re: [htdig] Htdig not displaying results more than 9 pages

2000-08-14 Thread D . J . Adams
I think there is some confusion here as to the meaning of "pages". By default, when a large number of documents satisfy the search criteria, what you get from htsearch is a maximum of ten pages, each page containing links to ten documents: 100 documents in total. Changing the configuration at
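
The arithmetic in that reply can be written out as a small sketch. The function name is hypothetical, and the two parameter names are borrowed from what I believe are the corresponding ht://Dig attributes (`matches_per_page` and `maximum_pages`); treat that mapping as an assumption and check it against your own configuration documentation.

```python
def reachable_results(total_matches, matches_per_page=10, maximum_pages=10):
    """Return how many matching documents can be reached via result pages.

    With ten pages of ten links each, at most 100 documents are reachable,
    however many documents actually matched.
    """
    return min(total_matches, matches_per_page * maximum_pages)
```

For instance, 250 matches with the defaults leave 100 reachable documents, while raising the page limit to 25 makes all 250 reachable.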

Re: [htdig] converter

2000-08-08 Thread D . J . Adams
> > Hello, > > does anybody know a converter for MS PowerPoint and MS Excel documents > or some other trick to index these document types. > > Thank you. > > Herbert Hölzlwimmer Recent versions of the catdoc MS Word to text converter come with a program for converting MS Excel files into .CSV

Re: [htdig] First file found: Invalid?

2000-07-27 Thread D . J . Adams
> Hmmm... well, if you go http://archive.midrange.com/rpg400-l/index.htm and > search for "afp print" using the form's default values, you'll see what I mean. I tried, but: lynx: Can't access startfile http://archive.midrange.com/rpg400-l/index.htm -- David J Adams <[EMAIL PROTECTED]> Comp

[htdig] "Deleted, noexcerpt"

2000-07-27 Thread D . J . Adams
Geoff Hutchison wrote: > > On Wed, 26 Jul 2000, Gilles Detillieux wrote: > > > According to [EMAIL PROTECTED]: > > > Now I have to investigate why certain pages are flagged as > > > "Deleted, noexcerpt"! > > > > Main causes: > > - disallowed in robots.txt > > - indexing turned off by meta robo

[htdig] "Deleted, invalid" messages solved

2000-07-26 Thread D . J . Adams
I've been bombarding this list with emails about htmerge 3.1.5 producing the message "Deleted, invalid:" for pages which were apparently OK. I still don't know why this happens, but I have found a way of avoiding it. I was running htdig twice to produce two indexes and then htmerge just once to

Re: [htdig] Htmerge: "Deleted, invalid"

2000-07-25 Thread D . J . Adams
> > According to David Adams: > > I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a > > year and I have been very pleased with it. I would say that we've given it a > > good workout here. The problem with the "Deleted, invalid" messages only > > occurs with a secon

Re: [htdig] Htmerge: "Deleted, invalid"

2000-07-14 Thread D . J . Adams
Sorry for the length of this! > > According to David Adams: > > Why does htmerge 3.1.5 flag some pages, which look OK to me, as > > "Deleted, invalid" and not index them? > > > > This is happening not just with .html pages but also .doc and .pdf files. > > > > It happens with a simple merge f

Re: [htdig] .pdf and .doc-files (fwd)

2000-06-09 Thread D . J . Adams
Re using doc2html to process .xls files. > > > > I don't think it is quite so simple: doc2html.pl (and > > parse_doc and conv_doc) only use the "magic number" of the > > file to decide which utility to use for conversion. > > > > MS Word and Excel files can have the same magic number. > > Oh
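
To illustrate why a magic number alone cannot separate Word from Excel files: both older formats are stored in the same OLE2 compound-file container, which always begins with the same eight bytes. The sketch below is not the logic of doc2html.pl itself; `sniff` and its return labels are hypothetical names chosen for the example.

```python
# Signature shared by OLE2 compound files (legacy .doc, .xls and .ppt alike).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
PDF_MAGIC = b"%PDF-"
RTF_MAGIC = b"{\\rtf"


def sniff(header: bytes) -> str:
    """Classify a file from its leading bytes only."""
    if header.startswith(OLE2_MAGIC):
        # Cannot tell Word from Excel here: the container is identical.
        return "ole2"
    if header.startswith(PDF_MAGIC):
        return "pdf"
    if header.startswith(RTF_MAGIC):
        return "rtf"
    return "unknown"
```

A converter that dispatches on this result alone would send a spreadsheet to the Word converter, so a second test (for example the file extension) is needed to choose between the two.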

[htdig] Patch for doc2html.pl

2000-06-08 Thread D . J . Adams
Here is a second attempt to send the patch for doc2html.pl. The error is that the line $cmdl = "$cmd '$Input' | $ED 's#^$Input#[$Name]#"; should read: $cmdl = "$cmd '$Input' | $ED 's#^$Input#[$Name]#'"; And the patch is: *** doc2html.pl.err Thu Jun 8 11:41:16 2000 --- doc2html.pl Tue
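
A missing closing quote of this kind can be caught before the command ever reaches the shell. The sketch below uses Python's standard `shlex` module as a rough check; it is not part of doc2html.pl, and `conv`, `sed`, `in.rtf` and `name` are placeholder values standing in for the interpolated Perl variables.

```python
import shlex


def quotes_balanced(command: str) -> bool:
    """Return True if every quote in a shell command line is closed."""
    try:
        shlex.split(command)
    except ValueError:
        # shlex raises ValueError when a quotation is left open.
        return False
    return True


# The same command with and without the final single quote.
broken = "conv 'in.rtf' | sed 's#^in.rtf#[name]#"
fixed = "conv 'in.rtf' | sed 's#^in.rtf#[name]#'"
```

Here `quotes_balanced(broken)` is false and `quotes_balanced(fixed)` is true, which mirrors the one-character difference the patch corrects.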

[htdig] Patch for doc2html.pl

2000-06-08 Thread D . J . Adams
I've had a report of a bug in doc2html.pl from Alain Forcioli: > > Dear David, > > I allow myself to send you a patch for the doc2html.pl perl script. > > There is just a little syntax error in the case where the RTF > converter is called. A single quote is missing at the end of the > line. >

Re: [htdig] Word documents indexing problem

2000-06-07 Thread D . J . Adams
> > Hello, > > I use htdig 3.1.5 on linux Redhat 6.1. > > I have configured htdig.conf file as follows : > > valid_extensions: .html .htm .doc .pdf .txt > local_default_doc: new_index.html index.html index.htm main.htm > main_frame.htm frame.htm content.htm title.htm main2.htm > > local_urls

Re: [htdig] Spidering but not indexing

2000-06-07 Thread D . J . Adams
> > hi, > > I have a menu system that is used to access my site. But I don't want to > index that menu, or I don't want the menu files in search results. Is > that possible? And how? I looked in the configuration section but > couldn't find anything. > > rutger > > -- > Homepage: http://huize

Re: [htdig] A Suggestion on Accents

2000-05-16 Thread D . J . Adams
> >Rather than a fuzzy accents search method, why not make the htdig database > >accent independent? After all, it is case independent already! > >For example: > > > >Garçon -> Garçon -> garçon -> garcon > > I would make the analogy to word suffixes rather than to case. There > is a
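
The accent-independent idea can be sketched as a folding step applied to every word both when indexing and when searching. This is only an illustration of the suggestion using Python's standard `unicodedata` module, not how ht://Dig stores words; `fold` is a hypothetical name.

```python
import unicodedata


def fold(word: str) -> str:
    """Lower-case a word and strip its accents, e.g. for index keys."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.lower()
```

Because the same function is applied to both the indexed text and the query, a search for either the accented or the unaccented spelling finds the same entry. The cost is that words differing only by accent become indistinguishable.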

[htdig] A Suggestion on Accents

2000-05-15 Thread D . J . Adams
Our web pages are overwhelmingly in English, but we do have academics who put up pages in other languages and would like them to be searchable. I'm sure that this is quite common. Rather than a fuzzy accents search method, why not make the htdig database accent independent? After all, it is ca

Re: [htdig] indexing pdf files

2000-05-11 Thread D . J . Adams
> > I've spent the whole day trying... > This is what I have in my htdig.conf file > > external_parsers: application/pdf /export/home/htdig/bin/parse_doc.pl > > This is what I have in my parse_doc.pl > > $CATPDF = /export/home/xpdf/bin/pdftotext"; ^ | Double

Re: [htdig] Compile problems on SGI system

2000-03-27 Thread D . J . Adams
Your SGI compilation looks OK to me: there are warnings but no errors that I can see. I get similar warnings and htdig, etc. work fine for me. Are you saying that you get no executables built? -- David J Adams <[EMAIL PROTECTED]> Computing Services University of Southampton ---

Re: [htdig] local_user_urls: query

2000-02-10 Thread D . J . Adams
> > According to Geoff Hutchison: > > At 9:09 AM + 2/9/00, [EMAIL PROTECTED] wrote: > > >Does htdig recognize that some files may have server-side includes and > > >always fetch them via http despite these attributes in the config file? > > > > > >An Apache server will process SSIs in .shtm a

[htdig] local_user_urls: query

2000-02-09 Thread D . J . Adams
How smart is the handling of the local_user_urls: and local_urls: attributes? Does htdig recognize that some files may have server-side includes and always fetch them via http despite these attributes in the config file? An Apache server will process SSIs in .shtm and .shtml files, plus .htm and
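
The behaviour being asked about can be made concrete with a sketch of the decision involved: map a URL to a local file unless its extension suggests the server would process it first. This does not describe what htdig actually does; `local_path_for` is a hypothetical function, the extension set only covers the two extensions named in the message, and the host and directory in the usage note are invented examples.

```python
import posixpath
from urllib.parse import urlsplit

# Extensions assumed to be processed for server-side includes.
SSI_EXTENSIONS = frozenset({".shtm", ".shtml"})


def local_path_for(url, url_prefix, local_dir, ssi_extensions=SSI_EXTENSIONS):
    """Return a filesystem path for url, or None to fetch it via HTTP."""
    if not url.startswith(url_prefix):
        return None  # not covered by the URL-to-directory mapping
    extension = posixpath.splitext(urlsplit(url).path)[1].lower()
    if extension in ssi_extensions:
        return None  # the server must expand the includes first
    return posixpath.join(local_dir, url[len(url_prefix):])
```

With the prefix `http://www.example.org/` mapped to `/var/www/htdocs`, an `.html` page resolves to a local file while an `.shtml` page is left to be fetched over HTTP.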

[htdig] odd little bug in htdig 3.1.4

2000-02-03 Thread D . J . Adams
I have found that if a page contains keywords with just a space in the contents, like so: Test then the page is indexed ok, but no excerpt is shown by htsearch with Format=Long. Just changing that to: Test clears the problem. How did I find this, and why does it matter? Well I'

[htdig] Bug in 3.1.4

2000-01-27 Thread D . J . Adams
I'm testing htdig 3.1.4 in preparation for upgrading from version 3.1.2. It looks good, but a bug I reported to this list months ago is still there. No one commented on it at the time, so I guess it got overlooked. A match-all search for "the problem", for example, works as expected: "the" is a ba

[htdig] Word parsers

2000-01-26 Thread D . J . Adams
I've done a quick investigation of two programs which parse Word documents, and I thought it might interest others on the htdig list. The two are Catdoc from http://www.fe.msk.ru/~vitus/catdoc/, and Wp2html from http://www.res.bbsrc.ac.uk/wp2html/. Catdoc is freeware. Wp2html is available for a