I found a solution. We were seeing the same problem with
linklint, a link checker written in Perl. We use Apache,
so I played around with mod_rewrite and finally got it to
recognize the problem and return "file gone".
If anyone is interested, I can post the mod_rewrite config
and some example HTML.
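
In the meantime, here is a rough Python sketch of the kind of check
involved. It is not the mod_rewrite config itself. It assumes the
problem URLs can be recognized by a path segment ending in .html or
.htm that is followed by more path, as in
/parent/parent.html/index.html. The function name and the suffix list
are made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumption: static pages end in .html or .htm; extend the pattern for
# other suffixes used on your servers.
_TRAILING_PATH = re.compile(r"\.html?/", re.IGNORECASE)


def has_trailing_path(url: str) -> bool:
    """Return True if a .html/.htm path segment is followed by more path."""
    # Only the path is inspected, so query strings cannot trigger a match.
    path = urlsplit(url).path
    return _TRAILING_PATH.search(path) is not None
```

A link checker or indexer could call has_trailing_path() on each
resolved href and skip or report the URL instead of fetching it.
Server-parsed pages that legitimately use extra path information would
need to be excluded by hand.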
Geoff Hutchison wrote:
>
> At 6:31 AM -0600 12/28/99, Glenn Nielsen wrote:
> >PROBLEM
> >-------
> >
> >The following is a valid URL for a document...
> >
> ><a href="/parent/parent.html/index.html">Parent Page</a>
> >
> >where "/parent/parent.html" is a file on the server that is
> >returned by the webserver from the above URL.
>
> Is it valid? Yes.
> Is it a good URL? No.
>
That's true, but I'm not the one publishing the content; in fact,
I try very hard not to create content (HTML) ;-) But when you
administer 10 web servers with hundreds of different customers
publishing content, you have no control over the correctness of
the URLs or over BCC errors (BCC error => the problem is Between
the Chair and Computer).
> Now this has come up recently on the bug report list. But when I
> tried this at "home" so to speak, the server returned a 404. (IMHO,
> if parent.html is NOT server-parsed, this is the Right Thing To Do
> TM.)
>
> >A possible solution would be to compare the contents of the parent and
> >child documents when the child comes from a relative URL. If the
> >document contents for the parent and child are identical and have the
> >same last modification date stamp, ignore the child document and report
> >an error. Then continue, digging the next href in the parent.
>
> Maybe. This is a bit of a pain though since you have to "remember"
> that it came from a relative URL. The whole problem is resolved when
> you have duplicate-document detection, which has been on the plate
> for a while. Unless someone volunteers to do it, it may be some time
> before it sees light of day, though.
>
Duplicate doc detection would help me, since we index some external
sites on which we have no admin privileges and HTTP has to be used.
Sounds like a good idea.
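
For the parent/child comparison I described, something along these
lines is what I had in mind. This is only a minimal Python sketch, not
ht://Dig code: the FetchedDoc record and the function names are
hypothetical, and it assumes the indexer keeps the body and the raw
Last-Modified header for each page it fetches.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchedDoc:
    """Hypothetical record of one fetched page (not an ht://Dig structure)."""
    url: str
    body: bytes
    last_modified: Optional[str]  # raw Last-Modified header, or None if absent


def content_digest(doc: FetchedDoc) -> str:
    """Return a SHA-256 hex digest of the document body."""
    return hashlib.sha256(doc.body).hexdigest()


def is_duplicate_of_parent(parent: FetchedDoc, child: FetchedDoc) -> bool:
    """True if the child has the same Last-Modified value and body as the parent."""
    return (
        child.last_modified == parent.last_modified
        and content_digest(child) == content_digest(parent)
    )
```

When is_duplicate_of_parent() returns True for a child reached through
a relative URL, the indexer would report an error, skip the child, and
move on to the next href in the parent. Keeping the digests in a table
keyed by digest would extend the same idea to general
duplicate-document detection.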
Thanks,
Glenn
----------------------------------------------------------------------
Glenn Nielsen [EMAIL PROTECTED] | /* Spelin donut madder |
MOREnet System Programming | * if iz ina coment. |
Missouri Research and Education Network | */ |
----------------------------------------------------------------------