Re: [htdig] Indexing a list of sites -- catching failures
I have set up htdig so that every night it indexes a long list of small web sites. In general this works very well, but I've found that I have to be very careful when adding new sites to the end of the list. Some sites seem to cause htdig to fail. When this happens, htdig doesn't continue with the rest of the list -- it simply skips to the next step in rundig. This means that I have to do some careful adding and subtracting to the list of sites before I figure out what caused the index to fail.

What I would like to do is to somehow index each site separately and have some kind of error log if htdig hits a site that it fails on (for whatever reason). Then I would like it to proceed to the next site in the list, whether or not it failed to index the previous site.

Thanks for any help,
Todd Wallace

I do a weekly index of over 900 servers and have never had this happen. What version of htdig are you using? Also, I would recommend using the http_proxy attribute in the config file if you possibly can.

--
David J Adams
Computing Services, University of Southampton

To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED]. You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
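For what it's worth, the per-site approach Todd describes can be sketched as a small wrapper script. This is only a sketch: the file names, the use of the `include:` config directive, and the variable names are all assumptions, not something from rundig itself.

```shell
#!/bin/sh
# Assumed names -- adjust to your setup.
SITES_FILE=${SITES_FILE:-sites.txt}        # one start URL per line
BASE_CONF=${BASE_CONF:-base.conf}          # shared htdig attributes
TMP_CONF=${TMP_CONF:-/tmp/one-site.conf}   # throwaway per-site config
FAIL_LOG=${FAIL_LOG:-failed_sites.log}     # sites htdig choked on
HTDIG=${HTDIG:-htdig}

# Dig a single site with its own config; returns htdig's exit status.
index_site() {
    printf 'include: %s\nstart_url: %s\n' "$BASE_CONF" "$1" > "$TMP_CONF"
    "$HTDIG" -i -c "$TMP_CONF"
}

# Dig every site in turn; a failure is logged, never fatal.
index_all() {
    : > "$FAIL_LOG"
    while read -r site; do
        index_site "$site" || echo "$site" >> "$FAIL_LOG"
    done < "$SITES_FILE"
}
```

After index_all finishes, the failure log names the culprit sites, and the remaining rundig steps (htmerge etc.) can proceed as usual.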
Re: [htdig] converter
Hallo, does anybody know of a converter for MS PowerPoint and MS Excel documents, or some other trick to index these document types? Thank you. Herbert Hölzlwimmer

David J Adams wrote: Recent versions of the catdoc MS Word to text converter come with a program for converting MS Excel files into .CSV files, which should do what you want. I too would like to know of an MS PowerPoint converter for Unix.

Hi, just having gone through the same problem, I used xlHtml to index Excel files. For this I had to change the parse_doc file; it can be found together with the instructions at the address below: http://www.haberer-online.de/htdig/default.htm xlHtml also has an option to convert MS PowerPoint, but I did not take a look at this. Hope that helps, Sven

Sven, thanks for this very useful tip. I've tried xlHtml (version 0.2.7.2) and it seems at least as good as xls2csv, the converter that comes with catdoc, though it could be better: option handling seems flaky. HTML output can be generated, but hyperlinks in spreadsheets are not marked up as links.

I've also tried ppHtml, which converts PowerPoint files to HTML, and it seems adequate as a converter. While indexing our web pages using doc2html.pl it processed about a hundred .ppt files OK, and failed on three with the message "Not enough space". (As I'm using a sizable IRIX system with plenty of memory and disk space, I don't know why I should get such a message.)

The next version of doc2html.pl will include examples of using both ppHtml and xlHtml as converters. I should be releasing it sometime in September.

--
David J Adams
Computing Services, University of Southampton
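For anyone wiring these tools up themselves, the dispatch part of such a converter wrapper might look like the sketch below. The xlHtml/ppHtml invocations are assumptions (their option handling varies by version), so treat the commands as illustrative.

```shell
#!/bin/sh
# convert_to_html FILE : write an HTML rendition of an Excel or
# PowerPoint file to stdout, for use by an external parser/converter.
convert_to_html() {
    case "$1" in
        *.xls) xlHtml "$1" ;;     # Excel -> HTML (assumed invocation)
        *.ppt) ppHtml "$1" ;;     # PowerPoint -> HTML (assumed invocation)
        *)     echo "convert_to_html: unsupported file: $1" >&2
               return 1 ;;
    esac
}
```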
Re: [htdig] Htdig not displaying results more than 9 pages
I think there is a confusion here as to the meaning of "pages". By default, when a large number of documents satisfy the search criteria, what you will get from htsearch is a maximum of ten pages, each page containing links to ten documents - 100 documents in total.

Changing the configuration attribute "maximum_pages" will change the number of pages of links that htsearch will return to you, but as has been pointed out, graphics are only available for pages 1 to 10. You can also change the configuration attribute "matches_per_page" from 10 to any number you like. I think this will achieve what you want.

NB: "matches_per_page" is documented in the alphabetical list of all attributes, but not in the list of htsearch-specific attributes. It should be.

--
David J Adams
Computing Services, University of Southampton
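Concretely, the two attributes combine like this in the config file; the numbers here are examples, not defaults:

```
# Up to 20 result pages of 25 matches each: 500 documents reachable in total.
matches_per_page: 25
maximum_pages: 20
```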
[htdig] Deleted, noexcerpt
Geoff Hutchison wrote:

On Wed, 26 Jul 2000, Gilles Detillieux wrote: According to [EMAIL PROTECTED]: Now I have to investigate why certain pages are flagged as "Deleted, noexcerpt"!

Main causes:
- disallowed in robots.txt
- indexing turned off by meta robots or noindex tags
- no indexable text in documents
- server_max_docs exceeded

Also when merging:
- duplicates between the two databases (oldest is removed)

Ah! That last might explain a lot of them. Any chance of more helpful messages in a future version, e.g. "Deleted, duplicate:"?

If "indexing turned off by meta robots or noindex tags" results in "Deleted, noexcerpt", what condition gives the message "Deleted, noindex:"?

--
David J Adams
Computing Services, University of Southampton
Re: [htdig] First file found: Invalid?
Hmmm... well, if you go to http://archive.midrange.com/rpg400-l/index.htm and search for "afp print" using the form's default values, you'll see what I mean.

I tried, but: lynx: Can't access startfile http://archive.midrange.com/rpg400-l/index.htm

--
David J Adams
Computing Services, University of Southampton
[htdig] Deleted, invalid messages solved
I've been bombarding this list with emails about htmerge 3.1.5 producing the message "Deleted, invalid:" for pages which were apparently OK. I still don't know why this happens, but I have found a way of avoiding it.

I was running htdig twice to produce two indexes and then htmerge just once to merge them. If I instead run htmerge three times (once for each index on its own, and then once more to merge them), I don't get any "Deleted, invalid:" pages. SOLVED!

Two lessons learnt on the way:
- Htdig will run perfectly well on IRIX compiled with the SGI MIPSpro compiler.
- Do use http_proxy when indexing servers not in your local domain.

My thanks especially to Gilles for his patience and for lots of suggested lines of investigation. Now I have to investigate why certain pages are flagged as "Deleted, noexcerpt"!

--
David J Adams
Computing Services, University of Southampton
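In script form, the working sequence is something like the sketch below. The config file names are invented; the point is that each index gets its own htmerge pass before the final merge with -m.

```shell
#!/bin/sh
# Paths to the programs, overridable for testing.
HTDIG=${HTDIG:-htdig}
HTMERGE=${HTMERGE:-htmerge}

build_indexes() {
    "$HTDIG" -i -c first.conf &&             # dig index 1
    "$HTMERGE" -c first.conf &&              # htmerge index 1 on its own
    "$HTDIG" -i -c second.conf &&            # dig index 2
    "$HTMERGE" -c second.conf &&             # htmerge index 2 on its own
    "$HTMERGE" -c first.conf -m second.conf  # only now merge 2 into 1
}
```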
Re: [htdig] Htmerge: Deleted, invalid
According to David Adams: I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a year and I have been very pleased with it. I would say that we've given it a good workout here. The problem with the "Deleted, invalid" messages only occurs with a second, relatively new search index.

I guess I should have read your message before responding to Geoff's!

The first index is made from a single run of htdig covering 33 servers, all in the local domain, and on this week's initial dig htmerge reports 49,233 documents and not a single "Deleted, invalid". The second index is made from two runs of htdig covering a total of 969 (yes, 969!) servers using a proxy. Htmerge reports a mere 3,096 documents and 86 "Deleted, invalid".

I have looked at the db.wordlist files (which are written to only by htdig - is that right?)

Yes and no. htdig creates and writes the initial db.wordlist, then htmerge sorts it, merges words together, and processes flags for page removals. It then rewrites this file before creating the word index database.

and it would appear that htdig is flagging the pages for htmerge to delete and is not finding any words in them. I can advance these theories: It is not a bug, but is due to the use of a proxy. (I use a proxy because without one, a portion of the sites on any run of htdig were found to be not responding or even unknown. With a proxy, htdig appears to have no such problems.)

Hold on there! The problem of sites being down (unknown or not responding) is exactly the sort of thing that causes the "Deleted, invalid" situation, and I said so last week. How did you conclude that htdig appears to have no such problems with a proxy, when it does indeed appear to be having exactly that problem? It would make sense that if a site is not responding, the proxy would inform htdig of this (unless it happened to quietly substitute a cached copy of the requested page - assuming it had one), and htdig would respond the same way it would without a proxy.
I think this is the most likely theory. How did I conclude that htdig is having no such problems? Two reasons:

1) At least one page on our main server, covered by my http_proxy_exclude statement, is "Deleted, invalid".
2) When I do not use http_proxy then htdig -v gives clear messages, such as "Unable to connect to server" and "Server not responding". With http_proxy I get no such messages, not even with htdig -vvv.

Additionally:

3) I can access the pages using IE (same proxy) the same day, no problem.
4) One or two pages from a site may be affected while others are not.

I have now re-run the index with htdig -i -vvv etc. I have rather a lot of information to go through, but I've found nothing yet. And that nothing is significant. What do you make of this? The log from htmerge includes:

Deleted, invalid: 2200/http://www.folkmania.org.uk/LeeZachinfo.htm

While the log from htdig includes this (slightly mangled by the "more" command), which looks OK to me:

pick: www.folkmania.org.uk, # servers = 246
1226:895:2:http://www.folkmania.org.uk/LeeZachinfo.htm:
Retrieval command for http://www.folkmania.org.uk/LeeZachinfo.htm:
GET http://www.folkmania.org.uk/LeeZachinfo.htm HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Referer: http://www.folkmania.org.uk/
Host: www.folkmania.org.uk
Header line: HTTP/1.0 200 OK
Header line: Server: thttpd/2.07 02dec99
Header line: Content-Type: text/html
Header line: Date: Mon, 24 Jul 2000 03:35:01 GMT
Header line: Last-Modified: Fri, 23 Jun 2000 18:34:50 GMT
Translated Fri, 23 Jun 2000 18:34:50 GMT to 2000-06-23 18:34:50 (100)
And converted to Fri, 23 Jun 2000 18:34:50
Header line: Accept-Ranges: bytes
Header line: Content-Length: 4586
Header line: Age: 127170
Header line: X-Cache: HIT from www-cacheb.soton.ac.uk
Header line: X-Cache-Lookup: HIT from www-cacheb.soton.ac.uk:3128
Header line: X-Cache: MISS from www-cachea.soton.ac.uk
Header line: X-Cache-Lookup: MISS from www-cachea.soton.ac.uk:3128
Header line: Proxy-Connection: close
Header line:
returnStatus = 0
Read 4586 from document
Read a total of 4586 bytes
title: LeeZachInfo
[snip]
size = 4586

And that page is only retrieved once.

The other theories were:
- It is a bug due to the use of a proxy.
- It is a bug which only shows when compiled under IRIX.
- It is a bug which only occurs when there are many different servers.

I can add another theory:
- It is a bug when merging a second index - all the "Deleted, invalid" pages come from the htdig run specified with the htmerge -m option.

This theory is easy to check out; I'll investigate tomorrow. I intend to re-build the second index using htdig -vvv and perhaps learn something. The only sure way to rule
Re: [htdig] Htmerge: Deleted, invalid
Sorry for the length of this!

According to David Adams: Why does htmerge 3.1.5 flag some pages, which look OK to me, as "Deleted, invalid" and not index them? This is happening not just with .html pages but also .doc and .pdf files. It happens with a simple merge following a run of htdig -i -a, and also when two htdig runs are merged using the htmerge -m option.

htmerge does this when the remove_bad_urls attribute is true, and the page in question is not found (404 error), the server name no longer exists, the server is down, or, in the case of an update dig, the page has been updated, superseding the old document database record for it. In the latter case, htdig creates a new record for the updated document, with a new DocID, so the old one is discarded. As this only happens in update digs, it wouldn't be the case during an htdig -i, so I'd look at the other possibilities. In any case, run both htdig and htmerge with at least two verbose options, and cross-reference the DocID of the "Deleted, invalid" messages to other messages with the same ID, to get a clearer picture of what's happening.

--
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre  WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone: (204) 789-3766
Winnipeg, MB R3E 3J7 (Canada)  Fax: (204) 789-3930

I've run htdig -vv followed by htmerge -vvv and I still cannot see any reason why htmerge decides, apparently arbitrarily, that a page is invalid. None of the reasons given above seem to fit.

I'll take a single example: http://www.tregalic.co.uk/sacred-heart/ is one of many in the limit_urls_to directive. Htdig finds http://www.tregalic.co.uk/sacred-heart/ and then

http://www.tregalic.co.uk/sacred-heart/churchpage1.html
http://www.tregalic.co.uk/sacred-heart/churchpage2.html
...
http://www.tregalic.co.uk/sacred-heart/churchpage7.html

amongst others.
Grepping for "churchpage" in the htmerge log I find:

htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage1.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage2.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage3.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage4.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage5.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage6.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage7.html
1897/http://www.tregalic.co.uk/sacred-heart/churchpage1.html
1898/http://www.tregalic.co.uk/sacred-heart/churchpage2.html
1899/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
Deleted, invalid: 1900/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
Deleted, invalid: 1901/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
1902/http://www.tregalic.co.uk/sacred-heart/churchpage6.html
1903/http://www.tregalic.co.uk/sacred-heart/churchpage7.html

So I try an experiment: I reduce limit_urls_to to include only the starting URL and http://www.tregalic.co.uk/sacred-heart/ and run htdig and htmerge.
Then htmerge reports:

htmerge: Total word count: 3806
0/http://www.soton.ac.uk/services/local/alpha.html
1/http://www.tregalic.co.uk/sacred-heart/
9/http://www.tregalic.co.uk/sacred-heart/baptism.html
2/http://www.tregalic.co.uk/sacred-heart/churchpage1.html
3/http://www.tregalic.co.uk/sacred-heart/churchpage2.html
4/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
5/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
6/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
7/http://www.tregalic.co.uk/sacred-heart/churchpage6.html
8/http://www.tregalic.co.uk/sacred-heart/churchpage7.html
htmerge: 10
12/http://www.tregalic.co.uk/sacred-heart/information.html
11/http://www.tregalic.co.uk/sacred-heart/links.html
10/http://www.tregalic.co.uk/sacred-heart/newsletter.html

I do not accept that pages 4 and 5 just happened to be unavailable on the first occasion and available on the second. Nor can I see any differences in the htdig logs for these pages; the same sizes are reported in both cases. I think there is a bug in htmerge 3.1.5 which causes it to declare some pages "invalid" in some cases.

--
David J Adams
Computing Services, University of Southampton
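Incidentally, Gilles' earlier advice to cross-reference DocIDs can be mechanised. A sketch, assuming the verbose output of htmerge and htdig has been saved to files, and guessing at the log formats from the excerpts above:

```shell
#!/bin/sh
# xref_deleted HTMERGE_LOG HTDIG_LOG : for each DocID that htmerge
# flagged "Deleted, invalid", print the htdig log lines mentioning it,
# so the two accounts of the same document can be compared.
xref_deleted() {
    sed -n 's|^Deleted, invalid: \([0-9]*\)/.*|\1|p' "$1" |
    while read -r id; do
        echo "=== DocID $id ==="
        grep ":$id:" "$2" || echo "(no htdig lines for DocID $id)"
    done
}
```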
[htdig] Patch for doc2html.pl
I've had a report of a bug in doc2html.pl from Alain Forcioli:

Dear David, I allow myself to send you a patch for the doc2html.pl perl script. There is just a little syntax error in the case where the RTF converter is called. A single quote is missing at the end of the line. Best regards,

He is quite right, there is a bug, and I attach the patch to fix it that he sent me. What puzzles us both is that in my case doc2html.pl was still handling RTF files correctly.

[Attachment: doc2html.pl.patch - a readable copy of the patch follows in the next message.]

--
Alain FORCIOLI ``Who belong to the Dream Starting 5?''
RISC Technology http://www.risc.fr/ [EMAIL PROTECTED]
APRIL http://www.april.org/ [EMAIL PROTECTED]
Debian GNU/Linux http://www.debian.org/
"Resistance is futile. Open your source code and prepare for assimilation."

--
David J Adams
Computing Services, University of Southampton
[htdig] Patch for doc2html.pl
Here is a second attempt to send the patch for doc2html.pl. The error is that the line

$cmdl = "$cmd '$Input' | $ED 's#^<TITLE>$Input</TITLE>#<TITLE>[$Name]</TITLE>#";

should read:

$cmdl = "$cmd '$Input' | $ED 's#^<TITLE>$Input</TITLE>#<TITLE>[$Name]</TITLE>#'";

And the patch is:

*** doc2html.pl.err     Thu Jun  8 11:41:16 2000
--- doc2html.pl Tue May 30 14:21:25 2000
***************
*** 136,142 ****
  if ((defined $RTF2HTML) and (length $RTF2HTML)) {
    $cmd = $RTF2HTML;
    # Rtf2html uses filename as title, change this:
!   $cmdl = "$cmd '$Input' | $ED 's#^<TITLE>$Input</TITLE>#<TITLE>[$Name]</TITLE>#";
    $magic = '^{\134rtf';
    &store_html_method('RTF',$cmd,$cmdl,$magic);
  }
--- 136,142 ----
  if ((defined $RTF2HTML) and (length $RTF2HTML)) {
    $cmd = $RTF2HTML;
    # Rtf2html uses filename as title, change this:
!   $cmdl = "$cmd '$Input' | $ED 's#^<TITLE>$Input</TITLE>#<TITLE>[$Name]</TITLE>#'";
    $magic = '^{\134rtf';
    &store_html_method('RTF',$cmd,$cmdl,$magic);
  }

--
David J Adams
Computing Services, University of Southampton
Re: [htdig] Spidering but not indexing
hi, I have a menu system that is used to access my site. But I don't want to index that menu, or rather I don't want the menu files in search results. Is that possible? And how? I looked in the configuration section but couldn't find anything. rutger
--
Homepage: http://huizen.dds.nl/~rarwes

Yes, there are two different approaches; use whichever suits you:

1) Stop your menu files being indexed by adding this line to the head section of each one:

<META NAME="robots" content="noindex, follow">

2) Allow the menu files to be indexed, but have them omitted from the results of a search by a hidden "exclude" field on the HTML form.

--
David J Adams
Computing Services, University of Southampton
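For approach 2, the search form might look something like this sketch: "exclude" is an htsearch input parameter, and the pattern shown is just an example for hypothetical menu files living under /menu/.

```html
<!-- Search form that silently filters menu pages out of the results. -->
<form method="get" action="/cgi-bin/htsearch">
  <input type="text" name="words" size="30">
  <input type="hidden" name="exclude" value="/menu/">
  <input type="submit" value="Search">
</form>
```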
Re: [htdig] Word documents indexing problem
Hello, I use htdig 3.1.5 on Linux Red Hat 6.1. I have configured the htdig.conf file as follows:

valid_extensions: .html .htm .doc .pdf .txt
local_default_doc: new_index.html index.html index.htm main.htm main_frame.htm frame.htm content.htm title.htm main2.htm
local_urls_only: true
local_urls: http://gnbuxsl.grenoble.hp.com:8090/=/var/opt/web/
#
# Since ht://Dig does not (and cannot) parse every document type, this
# attribute is a list of strings (extensions) that will be ignored during
# indexing. These are *only* checked at the end of a URL, whereas
# exclude_url patterns are matched anywhere.
#
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
    .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi
max_doc_size: 2000
external_parsers: application/msword->text/html /usr/local/bin/parse_doc.pl \
    application/postscript->text/html /usr/local/bin/parse_doc.pl \
    application/pdf->text/html /usr/local/bin/parse_doc.pl

PDF files index fine, whereas I get the following message when indexing MS Word files:

30:30:2:http://gnbuxsl.grenoble.hp.com:8090/doc/tech/casc/details_casc.doc:
Trying local files
found existing file /var/opt/web/doc/tech/casc/details_casc.doc
not found

The file /var/opt/web/doc/tech/casc/details_casc.doc actually exists... I don't understand what the problem can be. Running rundig with several additional -v options does not help. Could somebody help me? Thanks, Jean-Francois.

I think the "not found" could refer to the utility which you are using within parse_doc.pl to handle Word documents. Try calling parse_doc.pl from the command line:

parse_doc.pl /var/opt/web/doc/tech/casc/details_casc.doc arg2 arg3

and see what happens.

--
David J Adams
Computing Services, University of Southampton
[htdig] A Suggestion on Accents
Our web pages are overwhelmingly in English, but we do have academics who put up pages in other languages and would like them to be searchable. I'm sure that this is quite common.

Rather than a fuzzy accents search method, why not make the htdig database accent-independent? After all, it is case-independent already! For example:

Gar&ccedil;on -> Garçon -> garçon -> garcon

and 'garcon' goes into the database.

Is this a sensible suggestion? Entering 'garcon' into an English-language version of (say) Netscape is a lot easier than entering 'garçon', and it seems reasonable to me that a search for 'garcon' should find not only 'garcon' and 'Garcon' but also 'garçon' and 'Garçon'.

I would even volunteer to work on a patch myself, but I lack knowledge of locales, and anything I wrote would probably cause more problems than it would solve.

--
David J Adams
Computing Services, University of Southampton
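As a toy illustration of the folding (not a patch!), here is a filter that maps a handful of accented Latin-1 letters to their base letters. A real implementation would need proper locale tables; the character list here is deliberately tiny and the function name is made up.

```shell
#!/bin/sh
# fold_accents : copy stdin to stdout with a few accented letters
# replaced by their unaccented equivalents.  Illustrative only.
fold_accents() {
    sed -e 's/ç/c/g' -e 's/Ç/C/g' \
        -e 's/é/e/g' -e 's/è/e/g' -e 's/ê/e/g' \
        -e 's/à/a/g' -e 's/â/a/g' \
        -e 's/ô/o/g' -e 's/û/u/g' -e 's/ü/u/g'
}
```

If both the indexed words and the query terms pass through the same folding, a search for 'garcon' would match 'garçon' as suggested.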
Re: [htdig] Compile problems on SGI system
Your SGI compilation looks OK to me: there are warnings but no errors that I can see. I get similar warnings and htdig, etc. work fine for me. Are you saying that you get no executables built? -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this.
Re: [htdig] local_user_urls: query
According to Geoff Hutchison:

At 9:09 AM + 2/9/00, [EMAIL PROTECTED] wrote: Does htdig recognize that some files may have server-side includes and always fetch them via http despite these attributes in the config file? An Apache server will process SSIs in .shtm and .shtml files, plus .htm and .html files with execution permission set.

Currently, the local_* attributes only read .htm and .html files. It makes no attempt to emulate server-parsing. So if you have set XBitHack for your Apache server, there isn't any way htdig will know that and it will fly right through, ignoring your SSI code. However, .shtml, .phtml, .php3 files and the like will not be indexed through the local filesystem, instead going to HTTP.

Actually, htdig 3.1.4 also accepts .txt, .asc, .pdf, .ps and .eps files locally. For some reason, that change never made it into 3.2.0b1; I imagine it got lost in the merge. Anyway, with 3.2's mime.types support, that's the way RetrieveLocal() should determine the content-type for local files. It'll just need a few lines of code to add that in, I expect. In any case, htdig has no equivalent to Apache's XBitHack, so for SSI documents, I'd recommend using .shtml if you want server-side parsing.

For my own system, I use SSI only to add a few bits and pieces, so I don't mind that that stuff doesn't get indexed. I now index everything through local_urls. I must ask: why does htdig handle local_urls this way? Doesn't it make more sense, when the attribute is set, to get ALL pages locally except when:
- the file does not exist, or
- the file has execution permission set, or
- the file name ends in .shtm, .shtml or one or two others which have special meanings to servers?

--
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre  WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone: (204) 789-3766
Winnipeg, MB R3E 3J7 (Canada)  Fax: (204) 789-3930

--
David J Adams
Computing Services, University of Southampton
[htdig] local_user_urls: query
How smart is the handling of the local_user_urls: and local_urls: attributes? Does htdig recognize that some files may have server-side includes and always fetch them via http despite these attributes in the config file? An Apache server will process SSIs in .shtm and .shtml files, plus .htm and .html files with execution permission set. Thanks. -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this.
[htdig] odd little bug in htdig 3.1.4
I have found that if a page contains keywords with just a space in the contents, like so:

<html>
<head>
<title>Test</title>
<META name=keywords content=" ">
<META name=description content=" ">
</head>
<body>

then the page is indexed OK, but no excerpt is shown by htsearch with Format=Long. Just changing that to:

<html>
<head>
<title>Test</title>
<META name=keywords content="">
<META name=description content="">
</head>
<body>

clears the problem.

How did I find this, and why does it matter? Well, I'm working on an external conversion script which tries to extract the keywords and summary from WordPerfect documents. In real life such documents often have no summary or keywords, and I was using a space as the default. I can work around this, so it's no great deal, but the bug may have other consequences I haven't found yet.

By the way, my script is based on conv_doc.pl and can be used in its place. I hope to send it in when I've finished polishing it.

--
David J Adams
Computing Services, University of Southampton
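In the conversion script itself, the workaround can be as simple as suppressing the META line when the value is blank. A sketch in shell (the function and variable names are made up; the real script is Perl, but the idea carries over):

```shell
#!/bin/sh
# emit_meta NAME VALUE : print a META tag only if VALUE contains
# something other than spaces, sidestepping the empty-excerpt bug.
emit_meta() {
    case "$2" in
        *[!\ ]*) printf '<META name="%s" content="%s">\n' "$1" "$2" ;;
    esac
}
```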
[htdig] Word parsers
I've done a quick investigation of two programs which parse Word documents, and I thought it might interest others on the htdig list. The two are Catdoc from http://www.fe.msk.ru/~vitus/catdoc/, and Wp2html from http://www.res.bbsrc.ac.uk/wp2html/.

Catdoc is freeware. Wp2html is available for a small sum from a one-man business, and the source code is made available. (It cost us, as a University, a mere 25 pounds for the right to run it on one Unix server and receive upgrades.)

I saved a Word97 document in Word2 and Word6 formats and then tried to see if the programs could extract text from the files:

Version    Catdoc    Catdoc    Wp2html
of Word    0.90a     0.91.2    3.2
2.0        Yes(1)    Yes       No
6.0        Yes(2)    Yes       No
97         Yes(2)    Yes       Yes

Notes:
(1) - Very large number of spurious characters output with the text.
(2) - A few spurious characters at the end of the output.

For conversion of Word documents to plain text, Catdoc-0.91.2 is a clear winner, and it comes bundled with a utility for creating CSV files from Excel spreadsheets. If you are using an earlier version of Catdoc then there are good grounds for upgrading.

Wp2html is sold as a utility for converting WordPerfect documents to HTML, and works with everything from version 5.1 to 8.0 that I have tried. It is very configurable and I was able to get it to output plain text without too much trouble. If you want to convert Word97 files into HTML then it is the clear choice. It continues in development and we may hope that later versions will cope with other Word formats.

Is somebody able to try these products with Word2000?

--
David J Adams
Computing Services, University of Southampton