Re: [htdig] Indexing a list of sites -- catching failures

2000-08-29 Thread D . J . Adams

 
 I have set up htdig so that every night, it indexes a long list of small
 web sites. In general, this works very well, but I've found that I have to
 be very careful when adding new sites to the end of the list. 
 
 Some sites seem to cause htdig to fail. When this happens, htdig doesn't 
 continue on with the rest of the list -- it simply skips to the next step
 in rundig. This means that I have to do some careful adding and subtracting
 to the list of sites before I figure out what caused the index to fail.
 
 What I would like to do is to somehow index each site separately and have
 some kind of error log if htdig hits a site that it fails on (for whatever
 reason). Then, I would like for it to proceed to the next site in the list,
 whether or not it failed to index the previous site.
 
 Thanks for any help,
 Todd Wallace

I do a weekly index of over 900 servers and have never had this happen.
What version of htdig are you using?

Also, I would recommend using the http_proxy attribute in the config
file if you possibly can. 
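
If you do want per-site indexing with an error log, a wrapper along these
lines would do it.  This is an untested sketch: the file names and paths
are examples, and it assumes your htdig config accepts the include:
directive.

    #!/bin/sh
    # Sketch: index each site on its own, log failures, carry on.
    # sites.txt holds one start URL per line; all paths are examples.
    CONF=/opt/htdig/conf
    LOG=/opt/htdig/htdig-errors.log
    while read SITE
    do
        # Build a one-off config inheriting the common settings.
        echo "include: $CONF/htdig.conf" >  $CONF/one-site.conf
        echo "start_url: $SITE"          >> $CONF/one-site.conf
        # Each site would also need its own database_dir here (or an
        # htmerge -m pass afterwards) to end up with one combined index.
        /opt/htdig/bin/htdig -i -c $CONF/one-site.conf ||
            echo "`date`: htdig failed on $SITE" >> $LOG
    done < sites.txt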

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ: http://www.htdig.org/FAQ.html




Re: [htdig] converter

2000-08-29 Thread D . J . Adams

 
   Hallo,
  
   does anybody know a converter for MS PowerPoint and MS Excel documents,
   or some other trick to index these document types?
  
   Thank you.
  
   Herbert Hölzlwimmer
 
  David J Adams wrote:
 
  Recent versions of the catdoc MS Word to text converter come with a
  program for converting MS Excel files into .CSV files which should do
  what you want.
 
  I too would like to know of a MS PowerPoint converter for Unix.
 
 Hi,
 
 just having gone through the same problem, I used xlHtml to index Excel
 files. For this I had to change the parse_doc.pl file; it can be found together
 with the instructions at the address below:
 http://www.haberer-online.de/htdig/default.htm
 
 xlHtml also has an option to convert MS powerpoint, but I did not take a
 look at this.
 
 Hope that helps,
 Sven
 

Sven,
Thanks for this very useful tip.

I've tried xlHtml (version 0.2.7.2) and it seems at least as good as
xls2csv, the converter that comes with catdoc, though it could be
better:

- option handling seems flaky.

- HTML output can be generated, but hyperlinks in spreadsheets
  are not marked up as links.

I've also tried ppthtml, which converts PowerPoint files to HTML, and it
seems adequate as a converter.  While indexing our web pages using
doc2html.pl it processed about a hundred .ppt files OK, and failed on
three with the message "Not enough space".  (As I'm using a sizable IRIX
system with plenty of memory and disk space, I don't know why I should
get such a message.)

The next version of doc2html.pl will include examples of using both
ppthtml and xlHtml as converters.  I should be releasing it sometime in
September.
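
In the meantime, a rough idea of the shape it will take (not the actual
code; the content-types, paths, and option-free xlhtml/ppthtml calls here
are guesses for your own setup) is a small wrapper producing HTML on
stdout, hooked in through external_parsers:

    #!/bin/sh
    # xls2html.sh (hypothetical): Excel or PowerPoint file to HTML on stdout.
    # Assumes htdig calls external parsers with:
    #   $1 = temp file, $2 = content-type, $3 = URL
    case "$2" in
        *excel*)      /usr/local/bin/xlhtml  "$1" ;;
        *powerpoint*) /usr/local/bin/ppthtml "$1" ;;
    esac

with, in the config file, something like:

    external_parsers: application/vnd.ms-excel->text/html /usr/local/bin/xls2html.sh \
      application/vnd.ms-powerpoint->text/html /usr/local/bin/xls2html.sh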

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






Re: [htdig] Htdig not displaying results more than 9 pages

2000-08-14 Thread D . J . Adams

I think there is a confusion here as to the meaning of "pages".

By default, when a large number of documents satisfy the search criteria
what you will get from htsearch is a maximum of ten pages, each page
containing links to ten documents - 100 documents in total. 

Changing the configuration attribute "maximum_pages" will change the number 
of pages of links that htsearch will return to you, but as has been pointed out,
graphics are only available for pages 1 to 10.

You can also change the configuration attribute "matches_per_page" from 10
to any number you like.  I think this will achieve what you want.

NB: "matches_per_page" is documented in the alphabetical list of all attributes,
but not in the list of htsearch-specific attributes.  It should be.
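
For example, to show 20 matches on each of up to 10 pages:

    matches_per_page: 20
    maximum_pages: 10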

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






[htdig] Deleted, noexcerpt

2000-07-27 Thread D . J . Adams

Geoff Hutchison wrote:
 
 On Wed, 26 Jul 2000, Gilles Detillieux wrote:
 
  According to [EMAIL PROTECTED]:
   Now I have to investigate why certain pages are flagged as 
   "Deleted, noexcerpt"!
  
  Main causes:
  - disallowed in robots.txt
  - indexing turned off by meta robots or noindex tags
  - no indexable text in documents
  - server_max_docs exceeded
 
 Also when merging:
 - duplicates between the two databases (oldest is removed)

Ah!  That last might explain a lot of them.  Any chance of more helpful
messages in a future version, e.g. "Deleted, duplicate:"?

If "indexing turned off by meta robots or noindex tags" results in
"Deleted, noexcerpt", what condition gives the message "Deleted, noindex:" ?

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






Re: [htdig] First file found: Invalid?

2000-07-27 Thread D . J . Adams

 Hmmm... well, if you go http://archive.midrange.com/rpg400-l/index.htm and 
 search for "afp print" using the form's default values, you'll see what I mean.

I tried, but:

lynx: Can't access startfile http://archive.midrange.com/rpg400-l/index.htm


-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






[htdig] Deleted, invalid messages solved

2000-07-26 Thread D . J . Adams

I've been bombarding this list with emails about htmerge 3.1.5 producing
the message "Deleted, invalid:" for pages which were apparently OK. 

I still don't know why this happens, but I have found a way of avoiding
it.  I was running htdig twice to produce two indexes and then htmerge
just once to merge them.  If I instead run htmerge three times (once for
each index on its own, and then once more to merge them), I don't
get any "Deleted, invalid:" pages.  SOLVED!
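
That is, with hypothetical config file names:

    htmerge -c index1.conf
    htmerge -c index2.conf
    htmerge -c index1.conf -m index2.conf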

Two lessons learnt on the way:

- Htdig will run perfectly well on IRIX compiled with the SGI MIPSpro
  compiler.

- Do use http_proxy when indexing servers not in your local domain
  (example below).
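
The relevant attributes, using our own cache as the example:

    http_proxy: http://www-cachea.soton.ac.uk:3128
    http_proxy_exclude: http://www.soton.ac.uk/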

My thanks especially to Gilles for his patience and for lots of suggested
lines of investigation.

Now I have to investigate why certain pages are flagged as 
"Deleted, noexcerpt"!

-- 
 
David J Adams 
[EMAIL PROTECTED] 
Computing Services 
University of Southampton







Re: [htdig] Htmerge: Deleted, invalid

2000-07-25 Thread D . J . Adams

 
 According to David Adams:
  I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a 
  year and I have been very pleased with it.  I would say that we've given it a 
  good workout here.  The problem with the "Deleted, invalid" messages only 
  occurs with a second, relatively new search index.
 
 I guess I should have read your message before responding to Geoff's!
 
  The first index is made from a single run of htdig covering 33 servers, all in 
  the local domain, and on this week's initial dig htmerge reports 49,233 
  documents and not a single "Deleted, invalid".
  
  The second index is made from two runs of htdig covering a total of 969
  (yes, 969!) servers using a proxy.  Htmerge reports a mere 3,096 documents
  and 86 "Deleted, invalid".
  
  I have looked at the db.wordlist files (which are written to only by htdig - is 
  that right?)
 
 Yes and no.  htdig creates and writes the initial db.wordlist, then htmerge
 sorts it, merges words together, and processes flags for page removals.  It
 then rewrites this file before creating the word index database.
 
  and it would appear that htdig is flagging the pages for htmerge 
  to delete and is not finding any words in them.
  
  I can advance these theories:
  
  It is not a bug, but is due to the use of a proxy. (I use a proxy 
  because without one, a portion of the sites on any run of htdig were 
  found to be not responding or even unknown.  With a proxy, htdig appears
  to have no such problems.)
 
 Hold on there!  The problem of sites being down (unknown or not
 responding) is exactly the sort of thing that causes the "Deleted,
 invalid" situation, and I said so last week.  How did you conclude that
 htdig appears to have no such problems with a proxy, when it does indeed
 appear to be having exactly that problem?  It would make sense that if
 a site is not responding, the proxy would inform htdig of this (unless
 it happened to quietly substitute a cached copy of the requested page
 - assuming it had one), and htdig would respond the same way it would
 without a proxy.  I think this is the most likely theory.

How did I conclude that htdig is having no such problems?
Two reasons: 
1). At least one page on our main server, covered by my
http_proxy_exclude statement, is "Deleted, invalid".
2). When I do not use http_proxy then htdig -v gives clear
messages, such as "Unable to connect to server" and
"Server not responding".
With http_proxy I get no such messages, not even with htdig -vvv

Additionally:
3). I can access the pages using IE (same proxy) the same day,
no problem. 
4). One or two pages from a site may be affected while others
are not.

I have now re-run the index with htdig -i -vvv etc.  I have rather a lot of 
information to go through, but I've found nothing yet.

And that nothing is significant.  What do you make of this?  The log from
htmerge includes:

Deleted, invalid: 2200/http://www.folkmania.org.uk/LeeZachinfo.htm

While the log from htdig includes this, which looks OK to me:

pick: www.folkmania.org.uk, # servers = 246
1226:895:2:http://www.folkmania.org.uk/LeeZachinfo.htm: Retrieval command for http://www.folkmania.org.uk/LeeZachinfo.htm: GET http://www.folkmania.org.uk/LeeZachinfo.htm HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Referer: http://www.folkmania.org.uk/
Host: www.folkmania.org.uk

Header line: HTTP/1.0 200 OK
Header line: Server: thttpd/2.07 02dec99
Header line: Content-Type: text/html
Header line: Date: Mon, 24 Jul 2000 03:35:01 GMT
Header line: Last-Modified: Fri, 23 Jun 2000 18:34:50 GMT
Translated Fri, 23 Jun 2000 18:34:50 GMT to 2000-06-23 18:34:50 (100)
And converted to Fri, 23 Jun 2000 18:34:50
Header line: Accept-Ranges: bytes
Header line: Content-Length: 4586
Header line: Age: 127170
Header line: X-Cache: HIT from www-cacheb.soton.ac.uk
Header line: X-Cache-Lookup: HIT from www-cacheb.soton.ac.uk:3128
Header line: X-Cache: MISS from www-cachea.soton.ac.uk
Header line: X-Cache-Lookup: MISS from www-cachea.soton.ac.uk:3128
Header line: Proxy-Connection: close
Header line: 
returnStatus = 0
Read 4586 from document
Read a total of 4586 bytes

title: LeeZachInfo
[snip]
 size = 4586

And that page is only retrieved once.

 
  It is a bug due to the use of a proxy.
  
  It is a bug which only shows when compiled under IRIX.
  
  It is a bug which only occurs when there many different servers.
  

I can add another theory:

It is a bug when merging a second index
 - all the "Deleted, invalid" pages come from the htdig run specified
   with the htmerge -m option

This theory is easy to check out; I'll investigate tomorrow.


  I intend to re-build the second index using htdig -vvv and perhaps learn 
  something.
 
 The only sure way to rule 

Re: [htdig] Htmerge: Deleted, invalid

2000-07-14 Thread D . J . Adams

Sorry for the length of this!

 
 According to David Adams:
  Why does htmerge 3.1.5 flag some pages, which look OK to me, as 
  "Deleted, invalid" and not index them?
  
  This is happening not just with .html pages but also .doc and .pdf files.
  
  It happens with a simple merge following a run of htdig -i -a
  and also when two htdig runs are merged using the htmerge -m option.
 
 htmerge does this when the remove_bad_urls attribute is true, and the
 page in question is not found (404 error), the server name no longer
 exists, the server is down, or in the case of an update dig, the page
 has been updated, superseding the old document database record for it.
 In the latter case, htdig creates a new record for the updated document,
 with a new DocID, so the old one is discarded.  As this only happens in
 update digs, it wouldn't be the case during an htdig -i, so I'd look at
 the other possibilities.
 
 In any case, run both htdig and htmerge with at least two verbose options,
 and cross-reference the DocID of the "Deleted, invalid" messages to other
 messages with the same ID, to get a clearer picture of what's happening.
 
 -- 
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930
 
 

I've run htdig -vv followed by htmerge -vvv and I still cannot see
any reason why htmerge decides, apparently arbitrarily, that a page is
invalid.  None of the reasons given above seem to fit.

I'll take a single example: http://www.tregalic.co.uk/sacred-heart/ is
one of many in the limit_urls_to directive.

Htdig finds http://www.tregalic.co.uk/sacred-heart/ and then
http://www.tregalic.co.uk/sacred-heart/churchpage1.html
http://www.tregalic.co.uk/sacred-heart/churchpage2.html
  ...
http://www.tregalic.co.uk/sacred-heart/churchpage7.html
amongst others.

Grepping for "churchpage" in the htmerge log I find:

htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage1.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage2.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage3.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage4.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage5.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage6.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage7.html
1897/http://www.tregalic.co.uk/sacred-heart/churchpage1.html
1898/http://www.tregalic.co.uk/sacred-heart/churchpage2.html
1899/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
Deleted, invalid: 1900/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
Deleted, invalid: 1901/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
1902/http://www.tregalic.co.uk/sacred-heart/churchpage6.html
1903/http://www.tregalic.co.uk/sacred-heart/churchpage7.html

So I try an experiment: I reduce limit_urls_to to include only the starting URL
and http://www.tregalic.co.uk/sacred-heart/ and run htdig and htmerge.

Then htmerge reports:

htmerge: Total word count: 3806
0/http://www.soton.ac.uk/services/local/alpha.html
1/http://www.tregalic.co.uk/sacred-heart/
9/http://www.tregalic.co.uk/sacred-heart/baptism.html
2/http://www.tregalic.co.uk/sacred-heart/churchpage1.html
3/http://www.tregalic.co.uk/sacred-heart/churchpage2.html
4/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
5/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
6/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
7/http://www.tregalic.co.uk/sacred-heart/churchpage6.html
8/http://www.tregalic.co.uk/sacred-heart/churchpage7.html
htmerge: 10
12/http://www.tregalic.co.uk/sacred-heart/information.html
11/http://www.tregalic.co.uk/sacred-heart/links.html
10/http://www.tregalic.co.uk/sacred-heart/newsletter.html

I do not accept that pages 4 and 5 just happened to be unavailable on the
first occasion and available on the second.  Nor can I see any
differences in the htdig logs for these pages.  The same sizes are
reported in both cases. 

I think there is a bug in htmerge 3.1.5 which causes it to declare
some pages as "invalid" in some cases.

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






[htdig] Patch for doc2html.pl

2000-06-08 Thread D . J . Adams

I've had a report of a bug in doc2html.pl from Alain Forcioli:

 
 Dear David,
 
 I allow myself to send you a patch for the doc2html.pl perl script.
 
 There is just a little syntax error in the case where the RTF
 converter is called. A single quote is missing at the end of the
 line.
 
 Best regards,
 
 

He is quite right: there is a bug, and here is the patch he sent me to
fix it:

What puzzles us both is that in my case doc2html.pl was still handling
RTF files correctly. 



--- doc2html.pl	Fri Mar 31 16:36:11 2000
+++ doc2html.pl.good	Tue May 30 14:19:55 2000
@@ -136,7 +136,7 @@
   if ((defined $RTF2HTML) and (length $RTF2HTML)) {
     $cmd = $RTF2HTML;
     # Rtf2html uses filename as title, change this:
-    $cmdl = "$cmd '$Input' | $ED 's#^<TITLE>$Input</TITLE>#<TITLE>[$Name]</TITLE>#";
+    $cmdl = "$cmd '$Input' | $ED 's#^<TITLE>$Input</TITLE>#<TITLE>[$Name]</TITLE>#'";
     $magic = '^{\134rtf';
     &store_html_method('RTF',$cmd,$cmdl,$magic);
   }



-- 
Alain FORCIOLI  ``Who belong to the Dream Starting 5 ?'' 
---
RISC Technology http://www.risc.fr/ [EMAIL PROTECTED]
APRIL   http://www.april.org/   [EMAIL PROTECTED]
Debian GNU/Linuxhttp://www.debian.org/
---
"Resistance is futile. Open your source code and prepare for assimilation."



-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






[htdig] Patch for doc2html.pl

2000-06-08 Thread D . J . Adams

Here is a second attempt to send the patch for doc2html.pl.

The error is that the line

  $cmdl = "$cmd '$Input' | $ED 's#^<TITLE>$Input</TITLE>#<TITLE>[$Name]</TITLE>#";

should read:

  $cmdl = "$cmd '$Input' | $ED 's#^<TITLE>$Input</TITLE>#<TITLE>[$Name]</TITLE>#'";

And the patch is:

*** doc2html.pl.err	Thu Jun  8 11:41:16 2000
--- doc2html.pl	Tue May 30 14:21:25 2000
***************
*** 136,142 ****
    if ((defined $RTF2HTML) and (length $RTF2HTML)) {
      $cmd = $RTF2HTML;
      # Rtf2html uses filename as title, change this:
!     $cmdl = "$cmd '$Input' | $ED 's#^<TITLE>$Input</TITLE>#<TITLE>[$Name]</TITLE>#";
      $magic = '^{\134rtf';
      &store_html_method('RTF',$cmd,$cmdl,$magic);
    }
--- 136,142 ----
    if ((defined $RTF2HTML) and (length $RTF2HTML)) {
      $cmd = $RTF2HTML;
      # Rtf2html uses filename as title, change this:
!     $cmdl = "$cmd '$Input' | $ED 's#^<TITLE>$Input</TITLE>#<TITLE>[$Name]</TITLE>#'";
      $magic = '^{\134rtf';
      &store_html_method('RTF',$cmd,$cmdl,$magic);
    }


-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






Re: [htdig] Spidering but not indexing

2000-06-07 Thread D . J . Adams

 
 hi,
 
 I have a menu system that is used to access my site. But I don't want to
 index that menu, or I don't want the menu files in search results. Is
 that possible? And how? i looked in the configuration section but
 couldn't find anything.
 
 rutger
 
 -- 
 Homepage: http://huizen.dds.nl/~rarwes

Yes.  There are two different approaches; use whichever suits you:

1)  Stop your menu files being indexed by adding this line to the head
section of each one:

<META NAME="robots" content="noindex, follow">

2)  Allow the menu files to be indexed, but have them omitted from
the results of a search by a hidden "exclude" field on the
HTML form (see the example below).
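
For the second approach, something like this in the form (the pattern is
an example; htsearch drops matching URLs from the results):

<INPUT type="hidden" name="exclude" value="http://your.site/menu/">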

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






Re: [htdig] Word documents indexing problem

2000-06-07 Thread D . J . Adams


 
 Hello,
 
 I use htdig 3.1.5 on linux Redhat 6.1.
 
 I have configured htdig.conf file as follows :
 
 valid_extensions: .html .htm .doc .pdf .txt
 local_default_doc: new_index.html index.html index.htm main.htm \
   main_frame.htm frame.htm content.htm title.htm main2.htm
 
 local_urls_only: true
 
 local_urls: http://gnbuxsl.grenoble.hp.com:8090/=/var/opt/web/
 
 #
 # Since ht://Dig does not (and cannot) parse every document type, this
 # attribute is a list of strings (extensions) that will be ignored
 during
 # indexing. These are *only* checked at the end of a URL, whereas
 # exclude_url patterns are matched anywhere.
 #
 bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
   .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi
 
 max_doc_size:   2000
 
 external_parsers: application/msword->text/html /usr/local/bin/parse_doc.pl \
   application/postscript->text/html /usr/local/bin/parse_doc.pl \
   application/pdf->text/html /usr/local/bin/parse_doc.pl
 
 pdf files indexing works fine whereas I get the following message when
 indexing msword files :
  
 30:30:2:http://gnbuxsl.grenoble.hp.com:8090/doc/tech/casc/details_casc.doc:
 Trying local files
   found existing file /var/opt/web/doc/tech/casc/details_casc.doc
  not found
 
 The file /var/opt/web/doc/tech/casc/details_casc.doc actually exists...
 
 I don't understand what the problem can be. Running rundig with several
 additional -v options does not help.
 
 Could somebody help me ? 
 
 Thanks,
 Jean-Francois.
 --

I think the "not found" could refer to the utility which you are using
within parse_doc.pl to handle Word documents.

Try calling parse_doc.pl from the command line:

parse_doc.pl /var/opt/web/doc/tech/casc/details_casc.doc arg2 arg3

and see what happens.
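
(If memory serves, htdig calls an external parser with the file name, the
content-type and the URL, so for a realistic test arg2 and arg3 would be
application/msword and the document's URL.)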

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






[htdig] A Suggestion on Accents

2000-05-15 Thread D . J . Adams

Our web pages are overwhelmingly in English, but we do have academics who
put up pages in other languages and would like them to be searchable.  I'm
sure that this is quite common.

Rather than a fuzzy accents search method, why not make the htdig database 
accent independent?  After all, it is case independent already!
For example:

Gar&ccedil;on  -  Garçon  -  garçon  -  garcon

and 'garcon' goes into the database.

Is this a sensible suggestion?  Entering 'garcon' into an English-language
version of (say) Netscape is a lot easier than entering 'garçon', and it
seems reasonable to me that a search for 'garcon' will find not only 'garcon'
and 'Garcon' but also 'garçon' and 'Garçon'. 

I would even volunteer to work on a patch myself, but I lack knowledge of
locales, and anything I wrote would probably cause more problems than it
would solve. 
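
(To make the idea concrete: for ISO-8859-1 the mapping is the kind of
thing this sed transliteration does, though the character list here is
illustrative rather than complete:

    sed 'y/àáâãäåçèéêëìíîïñòóôõöøùúûüý/aaaaaaceeeeiiiinoooooouuuuy/'

In htdig itself it would have to live in the word handling, not in an
external filter.)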

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






Re: [htdig] Compile problems on SGI system

2000-03-27 Thread D . J . Adams

Your SGI compilation looks OK to me:  there are warnings but no errors that I can see.

I get similar warnings and htdig, etc. work fine for me.

Are you saying that you get no executables built?

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






Re: [htdig] local_user_urls: query

2000-02-10 Thread D . J . Adams

 
 According to Geoff Hutchison:
  At 9:09 AM + 2/9/00, [EMAIL PROTECTED] wrote:
  Does htdig recognize that some files may have server-side includes and
  always fetch them via http despite these attributes in the config file?
  
  An Apache server will process SSIs in .shtm and .shtml files, plus .htm
  and .html files with execution permission set.
  
  Currently, the local_* attributes only read .htm and .html files. It 
  makes no attempt to emulate server-parsing. So if you have set 
  XBitHack for your Apache server, there isn't any way htdig will know 
  that and it will fly right through, ignoring your SSI code.
  
  However, .shtml, .phtml, .php3 files and the like will not be indexed 
  through the local filesystem, instead going to HTTP.
 
 Actually, htdig 3.1.4 also accepts .txt, .asc, .pdf, .ps and .eps files
 locally.  For some reason, that change never made it into 3.2.0b1.
 I imagine it got lost in the merge.  Anyway, with 3.2's mime.types
 support, that's the way RetrieveLocal() should determine the content-type
 for local files.  It'll just need a few lines of code to add that in,
 I expect.
 
 In any case, htdig has no equivalent to Apache's XBitHack, so for SSI
 documents, I'd recommend using .shtml if you want server-side parsing.
 For my own system, I use SSI only to add a few bits and pieces, so I
 don't mind that that stuff doesn't get indexed.  I now index everything
 through local_urls.

I must ask: why does htdig handle local_urls this way?

Doesn't it make more sense, when the attribute is set, to get ALL pages
locally except when:

- the file does not exist, or
- the file has execution permission set, or
- the file name ends in .shtm, .shtml or one or two others
  which have special meanings to servers?
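
In shell terms, what I mean is roughly this (the suffix list is
illustrative):

    case "$FILE" in
        *.shtm|*.shtml|*.phtml|*.php3)
            fetch=http ;;            # server-parsed suffixes
        *)
            if [ ! -f "$FILE" ] || [ -x "$FILE" ]
            then fetch=http          # missing, or XBitHack-style executable
            else fetch=local
            fi ;;
    esac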


 
 -- 
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton





[htdig] local_user_urls: query

2000-02-09 Thread D . J . Adams

How smart is the handling of the local_user_urls: and local_urls:
attributes?

Does htdig recognize that some files may have server-side includes and
always fetch them via http despite these attributes in the config file?

An Apache server will process SSIs in .shtm and .shtml files, plus .htm
and .html files with execution permission set. 

Thanks.

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton





[htdig] odd little bug in htdig 3.1.4

2000-02-03 Thread D . J . Adams

I have found that if a page contains keywords with just a space in the
contents, like so:

<html>
<head>
<title>Test</title>
<META name=keywords content=" ">
<META name=description content=" ">
</head>
<body>

then the page is indexed OK, but no excerpt is shown by htsearch with
Format=Long. 

Just changing that to:

<html>
<head>
<title>Test</title>
<META name=keywords content="">
<META name=description content="">
</head>
<body>

clears the problem.

How did I find this, and why does it matter?

Well, I'm working on an external conversion script which tries to extract
the keywords and summary from WordPerfect documents.  In real life such
documents often have no summary or keywords, and I was using a space as
the default.

I can work around this, so it's no big deal, but the bug may have other
consequences I haven't found yet.

By the way, my script is based on conv_doc.pl and can be used in its
place.  I hope to send it in when I've finished polishing it. 

---
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton





[htdig] Word parsers

2000-01-26 Thread D . J . Adams

I've done a quick investigation of two programs which parse Word
documents, and I thought it might interest others on the htdig list.

The two are Catdoc from http://www.fe.msk.ru/~vitus/catdoc/,
and Wp2html from http://www.res.bbsrc.ac.uk/wp2html/.

Catdoc is freeware.  Wp2html is available for a small sum from a one-man
business, and the source code is made available.  (It cost us, as a
University, a mere 25 pounds for the right to run it on one Unix server
and receive upgrades.)

I saved a Word97 document in Word2 and Word6 formats and then tried
to see if the programs could extract text from the files:

Version of Word   Catdoc 0.90a   Catdoc 0.91.2   Wp2html 3.2

2.0               Yes (1)        Yes             No
6.0               Yes (2)        Yes             No
97                Yes (2)        Yes             Yes

Notes: (1) Very large number of spurious characters output with the text.
       (2) A few spurious characters at the end of the output.

For conversion of Word documents to plain text, Catdoc 0.91.2 is a clear
winner, and it comes bundled with a utility for creating CSV files from
Excel spreadsheets.  If you are using an earlier version of Catdoc then
there are good grounds for upgrading.
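
Typical use (file names hypothetical; both programs write to standard
output):

    catdoc report.doc > report.txt
    xls2csv figures.xls > figures.csv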

Wp2html is sold as a utility for converting WordPerfect documents to
HTML, and works with everything from version 5.1 to 8.0 that I have
tried.  It is very configurable and I was able to get it to output plain
text without too much trouble.  If you want to convert Word97 files into
HTML then it is the clear choice.  It continues in development and we
may hope that later versions will cope with other Word formats. 


Is somebody able to try these products with Word 2000?


-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton

