According to Geoff Hutchison:
> I looked through the patch thoroughly today. I have some notes below.
> 
> At 4:54 PM -0600 11/25/99, Gilles Detillieux wrote:
> >1) htcommon/DocumentDB.cc & htdig/Retriever.cc: allow file:... as well
> >as http:... URLs.  (This doesn't change anything in htlib/URL.cc, so I'm
> >not sure about how well it'll handle hrefs in documents from the file:
> 
> Yeah, this is why I think moving file:// access to 3.2 is a better 
> idea. A lot of work has been done on that tree (esp. in URL.cc) to 
> allow multiple URL schemes. I still need to commit my URL test suite, 
> but the URL class is much more robust.

OK, my feeling about this is not to include it in 3.1.4, but it'd be OK
for 3.2.0b1 (though it may be mostly in there already, so any further
changes in this direction would be bug fixes, I think).

> One thing I didn't think about when making URL revisions was the use 
> of 'localhost' in a file:// URL. I've mostly seen them of the form:
> 
> file:///home/ghutchis/www/index.html
> 
> So I'll need to add some file://localhost tests to my collection--I 
> bet the current URL parser won't like them.

Yeah, the whole non-HTTP stuff in 3.2 will need some testing and fixing.
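For the file://localhost case above, the parsing rule under discussion can be sketched like this (this is hypothetical illustration code, not the actual htlib/URL.cc class: both the empty-host form and the localhost form should yield the same path):

```cpp
#include <string>
#include <utility>

// Sketch: split a file: URL into (host, path), accepting both
// "file:///path" (empty host) and "file://localhost/path".
std::pair<std::string, std::string> parseFileUrl(const std::string& url) {
    const std::string prefix = "file://";
    if (url.compare(0, prefix.size(), prefix) != 0)
        return {"", ""};                       // not a file: URL
    std::string rest = url.substr(prefix.size());
    std::string::size_type slash = rest.find('/');
    if (slash == std::string::npos)
        return {rest, "/"};                    // degenerate case: no path
    // Everything before the first '/' is the host ("" or "localhost"),
    // the rest (including the '/') is the filesystem path.
    return {rest.substr(0, slash), rest.substr(slash)};
}
```

A test suite would want to assert that both forms produce the path /home/ghutchis/www/index.html, with hosts "" and "localhost" respectively.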

> >2) htdig/HTML.cc: add support for an ignore_noindex attribute.  This is
> >undocumented and no default is defined, but I think the behaviour is
> >pretty obvious from the code.  I'd question the desirability/need for
> >this, but it seems harmless enough.  The value should be set in a static
> 
> I actually disagree on this. I don't think the indexer should ever 
> ignore the directive of the page author. If the author intended that 
> the page should not be indexed, then ht://Dig should follow those 
> wishes. I'd have a similar opinion about something that ignored 
> robots.txt.

I'm inclined to agree with you here.  Sure, you could make a case for
allowing it for local files, but then someone could say "if it's allowed
for local files, why not for local intranet digs via HTTP?"  Where does
it stop?  How would htdig differentiate between situations where it's
allowable and those where it isn't?  Ultimately, for any spidering, you have to go
with the honour system, but as developers we don't want to make things
too easy for the dishonourable.  I'd say nay for 3.1.4 & 3.2.0b1, and
suggest further discussion for 3.2 development.
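For the record, the directive in question is the robots META tag, e.g. <meta name="robots" content="noindex">.  A rough sketch of the honour-system check (hypothetical code, not htdig/HTML.cc; a naive case-insensitive substring scan stands in for real HTML parsing):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}

// Return false when the page carries a robots META noindex directive.
// Naive sketch: only honours "noindex" appearing after name="robots".
bool shouldIndex(const std::string& html) {
    std::string h = toLower(html);
    std::string::size_type meta = h.find("name=\"robots\"");
    if (meta == std::string::npos)
        return true;                           // no robots directive
    return h.find("noindex", meta) == std::string::npos;
}
```

An ignore_noindex attribute would amount to short-circuiting this check, which is exactly the behaviour being argued against above.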

> >3) htdig/Retriever.cc & htdig/Server.cc: modified to allow local file
> >access to persist even if the HTTP server is down.  Looks good to me.
> 
> I'm still thinking about this. It looks good, but I'm wondering why 
> every server needs to have a boolean, when only 'localhost' is going 
> to allow local file access. Otherwise, this is fine. The 
> Retriever/Server classes are still in need of some work.

It's not just localhost - any host that's mapped via local_urls could
benefit from this patch.  When you consider host aliases, virtual hosts,
and NFS/SMB mounted trees from other intranet servers, the potential
list of hosts one could use in local_urls really explodes.  In all of
these situations, one could make a case for allowing digs even when the
http server is down.  I must say I like this patch, and vote to include
it in both 3.1.4 and 3.2.0b1.  Something like this has been requested
many times, so there's a real need for it.
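To illustrate the idea: local_urls maps a URL prefix to a filesystem prefix, so any matching document can be read directly off disk, HTTP server up or not.  This helper is hypothetical (not the Retriever/Server code), and the example hostnames and paths are made up:

```cpp
#include <string>

// If the URL starts with urlPrefix, rewrite it to a local filesystem
// path using localPrefix; otherwise return "" to mean "fetch via HTTP".
// Mirrors a mapping like:  http://intranet/=/home/httpd/html/
std::string toLocalPath(const std::string& url,
                        const std::string& urlPrefix,
                        const std::string& localPrefix) {
    if (url.compare(0, urlPrefix.size(), urlPrefix) != 0)
        return "";                             // no mapping applies
    return localPrefix + url.substr(urlPrefix.size());
}
```

With aliases, virtual hosts, and NFS/SMB mounts, you'd simply have several such prefix pairs, one per host that resolves to a locally reachable tree.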

> >6) htlib/cgi.cc & htsearch/htsearch.cc: add a -a option to htsearch, to
> >add name=value parameters to those in query string.  This is undocumented
> >as well.  I'm not sure how it relates to the other changes, but it seems
> >simple enough.
> 
> I have to think about this one too. It seems reasonable, but I always 
> think through changes that might allow the CGI to be cracked. Since I 
> can't think of a way to send command-line arguments to the CGI, this 
> seems OK.

<a href="/cgi-bin/htsearch?-a+words=foo">click</a>

When you use the GET method as above, a partially parsed query string is
always passed as arguments to the CGI program.  Also, you can include
arguments to a CGI program the same way in the action parameter to a
<form> tag, as long as the CGI is called with a POST method (with GET,
the query string from the form overrides the specified arguments).

There's no risk involved in the -a option, though, because all it does
is define input parameters that can already be defined as easily through
a standard CGI query string.  The more crackable command-line option
right now is the -c option, by which one can slip past the compiled-in
CONFIG_DIR setting to use config files anywhere on the file system.

In any case, I think the code already committed to 3.1.4 and 3.2.0b1,
which allows an entire query string as a single argument, is probably
more practical for this use, and more in line with what others (esp.
Torsten) have requested.  (This argument is used only if REQUEST_METHOD
is not set, to avoid conflicts with arguments passed to htsearch as a
side effect of the GET method.)
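The precedence rule described in the parenthesis can be sketched like this (a hypothetical helper, not htsearch itself: the command-line query string is consulted only when REQUEST_METHOD is unset, so it can't collide with arguments a web server passes as a side effect of the GET method):

```cpp
#include <string>

// Choose which query string to use: under a web server (REQUEST_METHOD
// set), trust the CGI environment; otherwise fall back to the one given
// on the command line.
std::string pickQueryString(const char* requestMethod,
                            const char* cgiQueryString,
                            const char* argvQueryString) {
    if (requestMethod != nullptr)              // running under a server
        return cgiQueryString ? cgiQueryString : "";
    return argvQueryString ? argvQueryString : "";
}
```

In the real program the first two values would come from getenv("REQUEST_METHOD") and getenv("QUERY_STRING"), and the third from argv.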

Again, I'd say no to the -a option, for 3.1.4 & 3.2.0b1, but maybe for
future 3.2 releases if the current fix doesn't fit the bill for everyone.

By the way, Geoff, the main reason for this is in PHP code, where you
can't set environment variables without using a wrapper script.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You'll receive a message confirming the unsubscription.