[htdig] Pdf search
Hello,

When my search results include PDF documents, only the link is correct. The result title is shown as Microsoft Word - "filename".doc, and that filename is not the right name.

Regards
--
Benoit LEROYER - G.I.D.E ([EMAIL PROTECTED])
Tel: 02.40.89.92.87
Web: http://www.gide.net
--
To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] containing the single word unsubscribe in the SUBJECT of the message.
[htdig] htdig and symbolic links
Hi,

We are implementing htdig (v3.1.2 + the patch kit on Solaris 2.6) on our main web server. One comment we have had is that there are a lot of duplicate search results pointing to the same web pages. This is usually caused by several different Unix symbolic links pointing to the same directory or file in the web document tree.

Is there any way we can prevent the indexing of these duplicates? I see from the mailing list archives that there were patches to fix this issue for previous versions of htdig, but they are not available for the current version. I see from the bug database that the latest advice is to eliminate symbolic links; however, for many practical reasons it is not possible for us to do this.

Is it, for example, possible to configure htdig to index our URLs via the filesystem instead of HTTP (i.e. using local_urls) and to ignore the symbolic links? How are people on the list working round this problem? Or is this an unresolved bug I will need to (re)log with the htdig developers?

Rgds.,
Nick.

"Animal? No, worse - human!", Manny - "Runaway Train"
Nick O'Brien, Computer Officer, Reading University, UK
Phone: +44 118 931 8432
Email: [EMAIL PROTECTED]
Web: http://www.rdg.ac.uk/~suq98ngo/
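As a starting point for working around the duplicates, it can help to list every symbolic link in the document tree together with the real path it resolves to. The following is only a minimal sketch under stated assumptions: it is not part of htdig, the function name `find_symlink_aliases` is hypothetical, and it assumes the web document tree is readable on the local filesystem. The walk does not descend into symlinked directories, so link cycles cannot trap it.

```python
import os


def find_symlink_aliases(doc_root):
    """Map each symlink under doc_root (relative path) to the real path it resolves to.

    Hypothetical helper, not an htdig feature.
    """
    aliases = {}
    # os.walk does not follow symlinked directories by default, but it still
    # lists them in dirnames, so both link kinds are seen here.
    for dirpath, dirnames, filenames in os.walk(doc_root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                aliases[os.path.relpath(path, doc_root)] = os.path.realpath(path)
    return aliases
```

The resulting alias paths could then be turned into URL patterns by hand, for example as input to an exclusion list such as htdig's exclude_urls attribute; how the paths map to URLs depends on the server configuration and is left open here.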
[htdig] XHTML compliance/Tidy
Apologies if this has been discussed on the developers' version of this forum, but what XHTML 1.0 / HTML 4.01 conformance is planned or implemented for htdig?

The second question is slightly off topic for htdig, but does anyone know of a robot version of Tidy, Dave Raggett's HTML to XHTML converter, that could "dig" and convert a site automatically? The command line versions of Tidy (http://www.w3.org/People/Raggett/tidy/) seem to process single files.

Dr Henry Rzepa, Dept. Chemistry, Imperial College, LONDON SW7 2AY
mailto:[EMAIL PROTECTED]
Tel: (44) 171 594 5774; Fax: (44) 171 594 5804
URL: http://www.ch.ic.ac.uk/rzepa/
Re: [htdig] XHTML compliance/Tidy
According to Rzepa, Henry:

> Apologies if this has been discussed on the developers' version of this forum, but what XHTML 1.0 / HTML 4.01 conformance is planned or implemented for htdig?
>
> The second question is slightly off topic for htdig, but does anyone know of a robot version of Tidy, Dave Raggett's HTML to XHTML converter, that could "dig" and convert a site automatically? The command line versions of Tidy (http://www.w3.org/People/Raggett/tidy/) seem to process single files.

Quick solution: Dig the site with ht://Dig using the URL list output directive. Then use the generated URL list as input for the tidy program (e.g. "for d in `cat url.list | sort | uniq` ; do tidy $d ; done").

hth,
Torsten
--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14, D-25474 Ellerbek
Tel: +49-4101-403605
Fax: +49-4101-403606
E-Mail: [EMAIL PROTECTED]
Internet: http://www.inwise.de
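The de-duplication step in that one-liner (sort, then uniq) can also be sketched in Python, which makes it easier to add filtering later. This is only an illustration: the function name `unique_urls` is hypothetical, the file name url.list is taken from the one-liner above, and how each URL is then handed to tidy (directly, or after fetching the page to a local file) is left open.

```python
def unique_urls(lines):
    """Return the distinct, non-empty URLs from an iterable of text lines, sorted."""
    # A set comprehension drops repeats; strip() removes newlines and stray spaces.
    return sorted({line.strip() for line in lines if line.strip()})
```

Typical use would be `with open("url.list") as fh: urls = unique_urls(fh)`, followed by a loop that processes each entry.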
Re: [htdig] htdig and symbolic links
On Fri, 10 Sep 1999, Nick O'Brien wrote:

> Date: Fri, 10 Sep 1999 15:13:20 +0100 (GMT Daylight Time)
> From: Nick O'Brien [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> Subject: [htdig] htdig and symbolic links
>
> Hi,
>
> We are implementing htdig (v3.1.2 + the patch kit on Solaris 2.6) on our main web server. One comment we have had is that there are a lot of duplicate search results pointing to the same web pages. This is usually caused by several different Unix symbolic links pointing to the same directory or file in the web document tree.
>
> Is there any way we can prevent the indexing of these duplicates? I see from the mailing list archives that there were patches to fix this issue for previous versions of htdig, but they are not available for the current version. I see from the bug database that the latest advice is to eliminate symbolic links; however, for many practical reasons it is not possible for us to do this.
>
> Is it, for example, possible to configure htdig to index our URLs via the filesystem instead of HTTP (i.e. using local_urls) and to ignore the symbolic links? How are people on the list working round this problem? Or is this an unresolved bug I will need to (re)log with the htdig developers?

Our site is in the same boat as yours. I use the same old patch written for version 3.0.8b2, but I apply it manually at every new release. You can get it from:

ftp://sol.ccsf.cc.ca.us/htdig-patches/3.0.8b2/Retriever.cc.0

Then, with an ugly, extensive set of local_urls entries for each and every symbolic link in the site :( I manage to suppress duplicates, quadruplicates, and multiplicates ;)

Boy, do I look forward to 3.2, which is promised to take care of the menace of duplicates.

Regards,
Joe
--
[EMAIL PROTECTED]
Re: [htdig] problem indexing a site - no errors but nothing is
On Fri, 10 Sep 1999, Jay Tsao wrote:

> sites within our intranet. I am running with -v output but the output does not indicate any errors. It looks like as follows:
>
> New server: site1.hp.com, 80
> New server: site2.hp.com, 80
> 0:0:0:http://site2.hp.com/: *+*+++--++-+++--+---+-+-- size = 17070

You'll probably see what's going on better with -vvv or -. This will show the connection status, any HTTP headers, and the results of reading the robots.txt file.

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
[htdig] Verbose Mode Indexing
Just curious. What do the pluses and asterisks mean when indexing in verbose mode?

*+*+++--++-+++--+---+-+--

Frank

Frank Martini
Cadence Development
5075 Westheimer, Ste. 1266
Houston, Texas 77056
Voice: 713/621-1917
FAX: 713/621-1960
eMail: [EMAIL PROTECTED]
WWW: http://www.caddev.com/

Cadence Fact: The Carolyn Farb WebSite is actually a database written in 4th Dimension which dynamically serves HTML. The site is updated via the web, which allows posting of new stories from anywhere in the world. Check it out at http://www.CarolynFarb.com/
Re: [htdig] Verbose Mode Indexing
On Fri, 10 Sep 1999, Frank Martini wrote:

> *+*+++--++-+++--+---+-+--

I took a look just now at the source itself.

+ new URL
- rejected URL
* URL already indexed

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
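Using the legend above, a progress string from the verbose output can be tallied to see how many links on a page were new, rejected, or already known. This is a minimal sketch: the names `LEGEND` and `tally_progress` are hypothetical, and only the three symbols listed in the reply are counted.

```python
from collections import Counter

# Symbol meanings as given in the reply above.
LEGEND = {"+": "new URL", "-": "rejected URL", "*": "URL already indexed"}


def tally_progress(marks):
    """Count each known progress symbol in a verbose-mode string."""
    counts = Counter(marks)
    # Characters outside the legend are ignored.
    return {meaning: counts[symbol] for symbol, meaning in LEGEND.items()}
```

For the string quoted in this thread, `tally_progress("*+*+++--++-+++--+---+-+--")` reports 12 new URLs, 11 rejected URLs, and 2 URLs already indexed.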
[htdig] doclist, perl db.docdb access
Hello,

I am running into problems using the contrib Perl scripts. I understand that various fields have been added to the docdb that aren't handled in some of the scripts; I have accounted for those. I have access to the database, but the hashed information doesn't seem right. For example, the key should be the URL in question, yet when running doclist.pl the output is something like:

^Gwww.somewhere.org/index.html^S

where ^G and ^S are control characters that only show up when piping through less. I modified the script to use BerkeleyDB instead of GDBM_File, but that made no difference.

Any pointers?

Thanks,
Bill Carlson
Systems Programmer, Virtual Hospital, University of Iowa Hospitals and Clinics
[EMAIL PROTECTED]
http://www.vh.org/
Opinions are mine, not my employer's.
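In the caret notation that less uses, ^G is byte 0x07 and ^S is byte 0x13. To inspect keys without piping through less, the raw bytes can be rendered the same way. The sketch below is purely diagnostic: the function name `caret_notation` is hypothetical, and it does not decode the key or explain why the bytes are there.

```python
def caret_notation(raw):
    """Render a bytes key with control bytes shown in caret notation, as less does."""
    out = []
    for byte in raw:
        if byte < 0x20:
            # Control byte: 0x07 becomes ^G, 0x13 becomes ^S, and so on.
            out.append("^" + chr(byte + 0x40))
        elif byte == 0x7F:
            out.append("^?")
        else:
            out.append(chr(byte))
    return "".join(out)
```

Printing each key this way shows exactly which bytes surround the URL text, which is a useful first step before deciding how a script should treat them.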
[htdig] I'm heading out
Hi, folks. Just a quick note to let you all know I'll be away for 3 weeks of much needed vacation. I'm unsubscribing from the lists, 'cause I know there's no way I can catch up with three weeks' worth of postings piling up in my mailbox. I'll resubscribe when I'm back and caught up on other stuff. Good luck with the ongoing development!

--
Gilles R. Detillieux
Spinal Cord Research Centre
Dept. Physiology, U. of Manitoba
Winnipeg, MB R3E 3J7 (Canada)
E-mail: [EMAIL PROTECTED]
WWW: http://www.scrc.umanitoba.ca/~grdetil
Phone: (204)789-3766
Fax: (204)789-3930