[htdig] Pdf search

1999-09-10 Thread Benoit LEROYER


Hello,

When my search results include PDF documents, only the link is correct.
The result title is shown as Microsoft Word - "filename".doc,
and the filename is not the right name.



Regards

--
Benoit LEROYER - G.I.D.E ([EMAIL PROTECTED])
Tél : 02.40.89.92.87
Web : http://www.gide.net
--


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED] containing the single word unsubscribe in
the SUBJECT of the message.



[htdig] htdig and symbolic links

1999-09-10 Thread Nick O'Brien


Hi,

We are implementing htdig (v3.1.2 + the patch kit on Solaris 2.6) on our 
main web server. One comment we have had is that there are a lot of 
duplicate search results pointing to the same web pages. This is usually 
caused by having several different Unix symbolic links pointing to the 
same directory/file in the web document tree.

Is there any way we can prevent the indexing of these duplicates? I see 
from the mailing list archives that for previous versions of htdig there 
were patches to fix this issue but they are not available for the current 
version.

I see from the bug database the latest advice is to eliminate symbolic 
links - however for many practical reasons it is not possible for us to 
do this.


Is it for example possible to configure htdig to index our URLs via the 
filesystem instead of HTTP (i.e. using local_urls) and to ignore the 
symbolic links?

How are people on the list working round this problem? Or is this an 
unresolved bug I will need to (re)log with the htdig developers?

Rgds.,

Nick.


"Animal? No, worse - human!", Manny - "Runaway Train"
Nick O'Brien            Phone: +44 118 931 8432
Computer Officer        Email: [EMAIL PROTECTED]
Reading University, UK  Web: http://www.rdg.ac.uk/~suq98ngo/
 






[htdig] XHTML compliance/Tidy

1999-09-10 Thread Rzepa, Henry


Apologies if this has been discussed on the developers' version of this forum,
but what XHTML 1.0 / HTML 4.01 conformance is planned or implemented for htdig?

The second question is slightly off topic for htdig, but does anyone know
of a robot version of Tidy, Dave Raggett's HTML to XHTML
converter, that could "dig" and convert a site automatically? The
command-line versions of Tidy

http://www.w3.org/People/Raggett/tidy/

seem to process single files only.

Dr Henry Rzepa,  Dept. Chemistry,  Imperial College,  LONDON SW7 2AY;
mailto:[EMAIL PROTECTED]; Tel  (44) 171 594 5774; Fax: (44) 171 594 5804.
URL: http://www.ch.ic.ac.uk/rzepa/ 





Re: [htdig] XHTML compliance/Tidy

1999-09-10 Thread Torsten Neuer


According to Rzepa, Henry:
> Apologies if this has been discussed on the developers version of this forum,
> but what XHTML 1.0/4.01 conformance for  htdig is planned/implemented?
>
> The second question is slightly off topic for  htdig,  but does anyone know
> of a robot-version of  Tidy, Dave Raggett's HTML to XHTML
> converter, that could "dig" and convert a site automatically. The
> command line versions of  Tidy
>
> http://www.w3.org/People/Raggett/tidy/
>
> seem to process single files.

Quick solution:

Dig the site with ht://Dig using the URL list output directive.
Then use the generated URL list as input for the tidy program, e.g.:

    for d in $(sort -u url.list); do tidy "$d"; done
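
For readers who would rather script this step, here is a minimal Python sketch of the same loop. It is not part of the original reply. The file name url.list comes from the example above; the function names unique_urls and tidy_all are made up for illustration, and the sketch assumes a tidy executable on the PATH that accepts each list entry as its argument, as the shell loop does.

```python
import subprocess


def unique_urls(lines):
    """Return the distinct, non-empty entries from an iterable of lines, sorted."""
    return sorted({line.strip() for line in lines if line.strip()})


def tidy_all(list_path="url.list", tidy_cmd="tidy"):
    """Run the tidy command once per distinct entry in the URL list file."""
    with open(list_path, encoding="utf-8") as handle:
        for url in unique_urls(handle):
            # check=False: tidy returns a non-zero status for warnings,
            # which should not abort the rest of the run.
            subprocess.run([tidy_cmd, url], check=False)
```

Calling tidy_all() in the directory that holds url.list would process each entry once, like the sort and uniq steps in the one-liner.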


hth,
  Torsten

--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                 Tel: +49-4101-403605
D-25474 Ellerbek                 Fax: +49-4101-403606
E-Mail: [EMAIL PROTECTED]        Internet: http://www.inwise.de





Re: [htdig] htdig and symbolic links

1999-09-10 Thread Joe R. Jah


On Fri, 10 Sep 1999, Nick O'Brien wrote:

> Date: Fri, 10 Sep 1999 15:13:20 +0100 (GMT Daylight Time)
> From: Nick O'Brien [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> Subject: [htdig] htdig and symbolic links
>
> Hi,
>
> We are implementing htdig (v3.1.2 + the patch kit on Solaris 2.6) on our
> main web server. One comment we have had is that there are alot of
> duplicate search results pointing to the same web pages. This is usually
> caused by having several different Unix symbolic links pointing to the
> same directory/file in the web document tree.
>
> Is there any way we can prevent the indexing of these duplicates? I see
> from the mailing list archives that for previous versions of htdig there
> were patches to fix this issue but they are not available for the current
> version.
>
> I see from the bug database the latest advice is to eliminate symbolic
> links - however for many practical reasons it is not possible for us to
> do this.
>
> Is it for example possible to configure htdig to index our URLs via the
> filesystem instead of HTTP (i.e using local_urls) and to ignore the
> symbolic links?
>
> How are people on the list working round this problem? Or is this an
> unresolved bug I will need to (re)log with the htdig developers?

Our site is in the same boat that your site is in; I use the same old
patch for version 3.0.8b2, but I apply it manually at every new release.
You can get it from:

ftp://sol.ccsf.cc.ca.us/htdig-patches/3.0.8b2/Retriever.cc.0

Then, with an ugly, extensive set of local_urls for each and every symbolic
link in the site :( I manage to suppress duplicates, quadruplicates, and
multiplicates ;)

Boy, do I look forward to 3.2, which is promised to take care of the
menace of duplicates. 
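
Before writing per-link rules by hand, it can help to see which paths in the document tree are aliases of one another. The Python sketch below is an illustration only: it is not the Retriever.cc patch and not an htdig feature. The function name group_aliases is hypothetical, and the sketch assumes you already have a list of filesystem paths for the indexed documents.

```python
import os
from collections import defaultdict


def group_aliases(paths):
    """Group paths by the real file they resolve to once symlinks are followed."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.path.realpath(path)].append(path)
    # Keep only real files that are reachable under more than one name.
    return {real: sorted(names) for real, names in groups.items() if len(names) > 1}
```

Each key in the result is a real file, and its value lists every path from the input that leads to it, so each group is one set of candidate duplicates.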

Regards,

Joe
-- 
 _/   _/_/_/   _/  __o
 _/   _/   _/  _/ __ _-\,_
 _/  _/   _/_/_/   _/  _/ ..(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ah[EMAIL PROTECTED]






Re: [htdig] problem indexing a site - no errors but nothing is

1999-09-10 Thread Geoff Hutchison


On Fri, 10 Sep 1999, Jay Tsao wrote:

> sites within our intranet.  I am running with -v output but the output does
> not indicate any errors.  It looks like as follows:
>
> New server: site1.hp.com, 80
>
> New server: site2.hp.com, 80
> 0:0:0:http://site2.hp.com/:
> *+*+++--++-+++--+---+-+-- size = 17070

You'll probably see what's going on better with -vvv or -. This will
show the connection status, any HTTP headers, and the results of the
robots.txt file.

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






[htdig] Verbose Mode Indexing

1999-09-10 Thread Frank Martini


Just curious.

What do the pluses and asterisks mean when indexing in verbose mode?


*+*+++--++-+++--+---+-+--

Frank

Frank Martini Voice: 713/621-1917
Cadence Development FAX: 713/621-1960
5075 Westheimer, Ste. 1266 eMail: [EMAIL PROTECTED]
Houston, Texas 77056  WWW: http://www.caddev.com/

Cadence Fact: The Carolyn Farb WebSite is actually a database written in 
4th Dimension which dynamically serves HTML. The site is updated via the 
web, which allows posting of new stories from anywhere in the world. 
Check it out at http://www.CarolynFarb.com/






Re: [htdig] Verbose Mode Indexing

1999-09-10 Thread Geoff Hutchison


On Fri, 10 Sep 1999, Frank Martini wrote:

> *+*+++--++-+++--+---+-+--

I took a look just now at the source itself.

+ new URL
- rejected URL
* URL already indexed
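
As a small illustration of the legend above, this Python sketch tallies the markers in one progress line. It is not htdig code. The names MEANINGS and tally_marks are made up, and the sketch assumes the progress string has been copied out of the -v output on its own.

```python
from collections import Counter

# Marker meanings as given in the reply above.
MEANINGS = {"+": "new URL", "-": "rejected URL", "*": "URL already indexed"}


def tally_marks(progress):
    """Count each known marker in a progress string, ignoring other characters."""
    counts = Counter(ch for ch in progress if ch in MEANINGS)
    return {MEANINGS[ch]: counts[ch] for ch in MEANINGS}
```

Passing the line quoted in the question, tally_marks("*+*+++--++-+++--+---+-+--"), returns a dictionary with one count per category.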

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






[htdig] doclist, perl db.docdb access

1999-09-10 Thread Bill Carlson


Hello,

I am stumbling into some problems using any of the contrib perl scripts. I
understand that various fields have been added to the docdb that aren't in
some of the scripts; I have accounted for those.

I have access to the database, but the hashed information doesn't seem
right. For example, the key should be the URL in question, yet when
running doclist.pl for example, the output is something like:

^Gwww.somewhere.org/index.html^S

where those are control characters that only show when piping through
less.

I modified the script to use BerkeleyDB instead of GDBM_File, but no
change.

Any pointers?

Thanks,

Bill Carlson

Systems Programmer    [EMAIL PROTECTED]    |  Opinions are mine,
Virtual Hospital      http://www.vh.org/   |  not my employer's.
University of Iowa Hospitals and Clinics   |







[htdig] I'm heading out

1999-09-10 Thread Gilles Detillieux


Hi, folks.  Just a quick note to let you all know I'll be away for 3 weeks
of much needed vacation.  I'm unsubscribing from the lists, 'cause I know
there's no way I can catch up with three weeks' worth of postings piling
up in my mailbox.  I'll resubscribe when I'm back and caught up in other
stuff.  Good luck with the ongoing development!

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930

