On Mon, 11 Mar 2002, Gilles Detillieux wrote:
> Date: Mon, 11 Mar 2002 17:22:15 -0600 (CST)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: Geoff Hutchison <[EMAIL PROTECTED]>,
> [EMAIL PROTECTED]
> Subject: Re: [htdig] "file name.html" -> "filename.html";(
>
> According to Joe R. Jah:
> > On Sat, 9 Mar 2002, Geoff Hutchison wrote:
> > > On Friday, March 8, 2002, at 01:51 PM, Joe R. Jah wrote:
> > > > > Unfortunately htdig removes the space and looks for "filename.html" and
> > > > reports:
> > > >
> > > > Not found: http://domain.com/some/path/filename.html Ref:
> > > > http://domain.com/some/path/file.html
> > >
> > > Joe, I think you should understand that this isn't much help as a bug
> > > report. Do you see this in 3.1.x, 3.2.0bX, both, etc.? When does the
> > > space seem to "disappear?" Is it when it first encounters the link
> > > (parser error), as it normalizes and accepts/rejects the URL (retriever
> > > or URL parser error) or as it tries to fetch it?
> > >
> > > A bit more feedback would go a long way towards debugging this.
> >
> > > OK, I run 3.1.6; rundig -vvvvv produces the following for one link in
> > > one file:
> > ----------------------------------8<-------------------------------
> > 0:0:0:http://domain.com/Path/To/: Trying local files
> > tried local file /domain.com/Path/To/index.html
> > tried local file /domain.com/Path/To/index.shtml
> > found existing file /domain.com/Path/To/index.htm
> > Read 5785 from document
> > Read a total of 5785 bytes
> > Tag: <html>, matched -1
> > Tag: <head>, matched -1
> > Tag: <title>, matched 0
> > word: Handouts@7
> > Tag: </title>, matched 1
> > title: Handouts
> > Tag: <a href="fa01HP2-Basic Unix Commands.htm">, matched 2
> > word: Basic@696
> > word: UNIX@698
> > word: Commands@700
> > Tag: </a>, matched 3
> > href: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm (Basic UNIX
> > Commands)
> > resolving 'http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm'
> > pushing http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> > ----------------------------------8<-------------------------------
> > ...
> > ----------------------------------8<-------------------------------
> > 14:14:1:http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: Trying local files
> > tried local file /domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> > Local retrieval failed, trying HTTP
> > Retrieval command for http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: GET
> > > /Path/To/fa01HP2-BasicUnixCommands.htm HTTP/1.0
> > User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
> > Referer: http://domain.com/Path/To/
> > Host: domain.com
> >
> > Header line: HTTP/1.1 404 Not Found
> > Header line: Date: Sun, 10 Mar 2002 08:03:36 GMT
> > ----------------------------------8<-------------------------------
> >
> > And it reports:
> > ----------------------------------8<-------------------------------
> > Not found: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm Ref:
> > > http://domain.com/Path/To/
> > ----------------------------------8<-------------------------------
>
> What most browsers do with unencoded spaces within URLs is a violation of
> RFC 1738 and RFC 2396. htdig does the correct thing, if not what some
> users would prefer it did. You can of course patch the URL class to leave
> the spaces in there, in violation of the standard, to conform with the
> incorrect behaviour of most browsers and, apparently, some really bad
> HTML code generators. That would save you from having to fix all the bad
> HTML code you're indexing. Spaces within URLs should always be
> encoded as %20.
>
> See http://www.geocrawler.com/archives/3/8822/2002/1/300/7455555/
> and http://www.geocrawler.com/archives/3/8822/2002/1/250/7495651/
>
> My recommendation, if you have a choice, is to avoid spaces in filenames
> altogether, because they cause all sorts of grief. Some caching proxy
> servers mess up URLs with spaces, even if the space is properly encoded
> as %20.

I am sorry I missed that thread. I believe links with unencoded spaces
in filenames are becoming more and more common. I vote +1 to tweak the
HTML parser to handle spaces in filenames.
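
For illustration only, here is a minimal sketch of what such a tweak
might do. It is written in Python rather than htdig's C++, it is not
htdig code, and the function names are hypothetical. It assumes the
parser would trim whitespace around the href value and encode interior
spaces as %20 (as Gilles says they should be) instead of dropping them.

```python
from urllib.parse import urljoin


def encode_href_spaces(href: str) -> str:
    """Hypothetical helper: keep interior spaces, but as %20."""
    # Whitespace around the attribute value is not part of the URL.
    trimmed = href.strip()
    # Encode the remaining spaces rather than removing them.
    return trimmed.replace(" ", "%20")


def resolve_href(base: str, href: str) -> str:
    """Resolve a possibly relative href against the referring page."""
    return urljoin(base, encode_href_spaces(href))


if __name__ == "__main__":
    print(resolve_href("http://domain.com/Path/To/",
                       "fa01HP2-Basic Unix Commands.htm"))
    # prints http://domain.com/Path/To/fa01HP2-Basic%20Unix%20Commands.htm
```

With the link from the log above, this would ask the server for
fa01HP2-Basic%20Unix%20Commands.htm rather than
fa01HP2-BasicUnixCommands.htm, which is the request that came back 404.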
Regards,
Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah [EMAIL PROTECTED]
_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev