On Mon, 11 Mar 2002, Gilles Detillieux wrote:

> Date: Mon, 11 Mar 2002 17:22:15 -0600 (CST)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: Geoff Hutchison <[EMAIL PROTECTED]>,
>     [EMAIL PROTECTED]
> Subject: Re: [htdig] "file name.html" -> "filename.html";(
> 
> According to Joe R. Jah:
> > On Sat, 9 Mar 2002, Geoff Hutchison wrote:
> > > On Friday, March 8, 2002, at 01:51  PM, Joe R. Jah wrote:
> > > > Unfortunately htdig removes the space, looks for "filename.html", and
> > > > reports:
> > > >
> > > > Not found: http://domain.com/some/path/filename.html Ref: 
> > > > http://domain.com/some/path/file.html
> > > 
> > > Joe, I think you should understand that this isn't much help as a bug 
> > > report. Do you see this in 3.1.x, 3.2.0bX, both, etc.? When does the 
> > > space seem to "disappear?" Is it when it first encounters the link 
> > > (parser error), as it normalizes and accepts/rejects the URL (retriever 
> > > or URL parser error) or as it tries to fetch it?
> > > 
> > > A bit more feedback would go a long way towards debugging this.
> > 
> > OK, I run 3.1.6; rundig -vvvvv gives the following for one link in one
> > file:
> > ----------------------------------8<-------------------------------
> > 0:0:0:http://domain.com/Path/To/: Trying local files
> >   tried local file /domain.com/Path/To/index.html
> >   tried local file /domain.com/Path/To/index.shtml
> >   found existing file /domain.com/Path/To/index.htm
> > Read 5785 from document
> > Read a total of 5785 bytes
> > Tag: <html>, matched -1
> > Tag: <head>, matched -1
> > Tag: <title>, matched 0
> > word: Handouts@7
> > Tag: </title>, matched 1
> > title: Handouts
> > Tag: <a href="fa01HP2-Basic Unix Commands.htm">, matched 2
> > word: Basic@696
> > word: UNIX@698
> > word: Commands@700
> > Tag: </a>, matched 3
> > href: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm (Basic UNIX Commands)
> > resolving 'http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm'
> >    pushing http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> > ----------------------------------8<-------------------------------
> > ...
> > ----------------------------------8<-------------------------------
> > 14:14:1:http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: Trying local files
> >   tried local file /domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> > Local retrieval failed, trying HTTP
> > Retrieval command for http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: GET /Path/To/fa01HP2-BasicUnixCommands.htm HTTP/1.0
> > User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
> > Referer: http://domain.com/Path/To/
> > Host: domain.com   
> > 
> > Header line: HTTP/1.1 404 Not Found
> > Header line: Date: Sun, 10 Mar 2002 08:03:36 GMT
> > ----------------------------------8<-------------------------------
> > 
> > And it reports:
> > ----------------------------------8<-------------------------------
> > Not found: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm Ref: http://domain.com/Path/To/
> > ----------------------------------8<-------------------------------
> 
> What most browsers do with unencoded spaces within URLs is a violation of
> RFC 1738 and RFC 2396.  htdig does the correct thing, if not what some
> users would prefer it did.  You can of course patch the URL class to leave
> the spaces in there, in violation of the standard, to conform with the
> incorrect behaviour of most browsers and, apparently, some really bad
> HTML code generators.  That would save you from having to fix all the bad
> HTML code you're indexing.  Spaces within URLs should always be
> encoded as %20.
> 
> See http://www.geocrawler.com/archives/3/8822/2002/1/300/7455555/
> and http://www.geocrawler.com/archives/3/8822/2002/1/250/7495651/
> 
> My recommendation, if you have a choice, is to avoid spaces in filenames
> altogether, because they cause all sorts of grief.  Some caching proxy
> servers mess up URLs with spaces, even if the space is properly encoded
> as %20.

You are absolutely right.  I made a patch from your tips in the above
thread:
-----------------------8<-----------------------
*** htlib/URL.cc.orig   Thu Feb  7 17:15:38 2002
--- htlib/URL.cc        Tue Mar 12 12:54:45 2002
***************
*** 75,81 ****
  URL::URL(char *ref, URL &parent)
  {
      String    temp(ref);
-     temp.remove(" \r\n\t");
      ref = temp;
  
      _host = parent._host;
--- 75,82 ----
  URL::URL(char *ref, URL &parent)
  {
      String    temp(ref);
+     temp.remove("\r\n\t");
+     temp.chop(' ');
      ref = temp;
  
      _host = parent._host;
***************
*** 249,255 ****
  void URL::parse(char *u)
  {
      String    temp(u);
-     temp.remove(" \t\r\n");
      char      *nurl = temp;
  
      //
--- 250,257 ----
  void URL::parse(char *u)
  {
      String    temp(u);
+     temp.remove("\t\r\n");
+     temp.chop(' ');
      char      *nurl = temp;
  
      //
-----------------------8<-----------------------
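
To make the change concrete, here is a minimal standalone sketch in standard
C++; it is not htdig code.  It assumes that String::remove(chars) deletes
every occurrence of the listed characters and that String::chop(ch) strips
only trailing occurrences of ch.  The names remove_chars, chop_trailing,
clean_old and clean_patched are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <string>

// Assumed equivalent of String::remove(const char *): drop every occurrence
// of any character listed in `chars`.
std::string remove_chars(const std::string& s, const std::string& chars) {
    std::string out;
    for (char c : s) {
        if (chars.find(c) == std::string::npos) out += c;
    }
    return out;
}

// Assumed equivalent of String::chop(char): drop trailing occurrences of
// `ch` only; leading and interior occurrences are left alone.
std::string chop_trailing(std::string s, char ch) {
    while (!s.empty() && s.back() == ch) s.pop_back();
    return s;
}

// Cleanup before the patch: all spaces, CRs, LFs and tabs are removed.
std::string clean_old(const std::string& ref) {
    return remove_chars(ref, " \r\n\t");
}

// Cleanup after the patch: CRs, LFs and tabs are removed everywhere, but
// only trailing spaces are dropped.
std::string clean_patched(const std::string& ref) {
    return chop_trailing(remove_chars(ref, "\r\n\t"), ' ');
}
```

Under those assumptions, clean_old() turns "fa01HP2-Basic Unix Commands.htm"
into "fa01HP2-BasicUnixCommands.htm", which matches the href line in the log
above, while clean_patched() leaves the interior spaces (and any leading
space) in place.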

I applied it, ran the dig, and waited for it to finish, and waited, and
waited ... ;(  Finally I killed the process.  I humbly switch my previous
+1 vote to -1.
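
For comparison, the encoding Gilles describes (spaces written as %20) could
be sketched as below.  This is only an illustration in standard C++, not
something htdig 3.1.6 does, and encode_spaces is a hypothetical name.  The
sketch trims surrounding whitespace, drops tabs, CRs and LFs, and rewrites
each remaining space as %20; it does not touch any other character.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: trim surrounding whitespace, drop tabs/CRs/LFs, and
// encode each remaining (interior) space as %20.
std::string encode_spaces(const std::string& ref) {
    const std::string ws = " \t\r\n";
    const std::string::size_type first = ref.find_first_not_of(ws);
    if (first == std::string::npos) return "";  // nothing but whitespace
    const std::string::size_type last = ref.find_last_not_of(ws);

    std::string out;
    for (std::string::size_type i = first; i <= last; ++i) {
        const char c = ref[i];
        if (c == ' ') {
            out += "%20";
        } else if (c != '\t' && c != '\r' && c != '\n') {
            out += c;
        }
    }
    return out;
}
```

With the href from the log, encode_spaces("fa01HP2-Basic Unix Commands.htm")
gives "fa01HP2-Basic%20Unix%20Commands.htm", which is the form the link
should have had in the HTML in the first place.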

Regards,

Joe
-- 
     _/   _/_/_/       _/              ____________    __o
     _/   _/   _/      _/         ______________     _-\<,_
 _/  _/   _/_/_/   _/  _/                     ......(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ah        [EMAIL PROTECTED]


_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
