Re: Various musings about the request URL / URI / whatever

2005-12-01 Thread Graham Dumpleton

Hmmm, go away for two days and a mail storm erupts. :-(

I may never be able to catch up and digest this mail thread, but I'll  
try and add a few comments of my own.


On 01/12/2005, at 8:41 AM, Nicolas Lehuen wrote:

c) We don't have a req.base_uri (to follow Jim's naming suggestion)  
or req.script_name that would be equivalent to  
req.subprocess_env.get('SCRIPT_NAME'), but we have a  
req.path_info... Why is this missing ?


Note that SCRIPT_NAME as obtainable from Apache appears to be broken.  
See:


  http://issues.apache.org/jira/browse/MODPYTHON-68

Also note that, with respect to req.uri and req.path_info,  
req.path_info is normalised and req.uri is not. I.e., the latter  
may contain duplicated instances of '/', whereas that cannot occur  
in req.path_info. This makes it error prone to take the len() of  
req.path_info to work out how much to drop off req.uri to obtain the  
base URI or script name. You need to normalise req.uri first, but  
then the result will be missing the duplicate instances of '/'  
unless you use some more elaborate algorithm to work it out.
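
A minimal sketch of the "normalise first" approach described above, in plain Python and independent of mod_python. The function name split_base_uri is hypothetical, and "normalise" is assumed here to mean only collapsing runs of '/'. As noted, the result loses any duplicate slashes that were in the original req.uri.

```python
import re

def split_base_uri(uri, path_info):
    """Return the part of uri that precedes path_info, or None if they disagree."""
    # Collapse runs of '/' so uri is comparable with the already-normalised path_info.
    normalised = re.sub(r'/{2,}', '/', uri)
    if not path_info:
        return normalised
    if not normalised.endswith(path_info):
        return None
    # Safe to use len() now, because both strings are in the same normal form.
    return normalised[:len(normalised) - len(path_info)]
```

For comparison, slicing the raw string '/app//handler.py//extra/path' by the length of '/extra/path' leaves a stray trailing '/' on the result, which is the kind of error being warned about.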


Graham


Re: Various musings about the request URL / URI / whatever

2005-11-30 Thread Daniel J. Popowich

Gregory (Grisha) Trubetskoy writes:
 I think a properly designed site should insist on its host name, i.e. I 
 see you think I'm gobbledygook.bleh, but I'm going to redirect you to 
 http://www.modpython.org/ because that is my true name. This is very 
 common with sites that respond to both www.site.com and site.com, but 
 insist on only one of those by redirecting.

As I said in my previous email to the list, I *think* if you use
virtual hosts and your real sites are NOT the first real host, then
you are forcing clients to speak HTTP/1.1, thus forcing the Host
header to be sent.  If you then put in your first, default
virtualhost:

  RedirectPermanent / http://realserver/

then you protect yourself from gobbledygook.bleh because that will
be sent to the default virtualhost which will redirect.

Right?  If so, a bit convoluted and not accessible to the novice.
Perhaps a Tips & Tricks chapter to the manual?


Daniel Popowich
---
http://home.comcast.net/~d.popowich/mpservlets/



Re: Various musings about the request URL / URI / whatever

2005-11-30 Thread Jorey Bump

Jim Gallacher wrote:

Gregory (Grisha) Trubetskoy wrote:


I don't know what the specific issue is with parsed_uri, if this is a 
mod_python bug it should just be fixed BUT if this is an issue with 
httpd, I don't think we should cover the problem up by having 
mod_python fix it. Since we are part of the HTTP Server project, we 
should just fix it in httpd.


Either way, it should be fixed.


I think maybe it's not really broken.

In case anyone is not familiar with the issue, a request for 
http://example.com/tests/mptest?view=form currently gives a tuple that 
looks something like this:


That's not true. That's what you might see in your client browser, but 
(usually) it only asks for /tests/mptest?view=form, regardless of the 
name it used to find the server. It may use the Host: header to 
negotiate the right virtual host, but the Host: header is not part of 
the string that parsed_uri is actually parsing.



(None, None, None, None, None, None, '/tests/mptest', 'view=form', None)

which is not what we expect. This is what the mod_python docs have to say:

parsed_uri
Tuple. The URI broken down into pieces. (scheme, hostinfo, user, 
password, hostname, port, path, query, fragment). The apache module 
defines a set of URI_* constants that should be used to access elements 
of this tuple. Example:


fname = req.parsed_uri[apache.URI_PATH]

(Read-Only)


This is all correct. I think the problem is that developers are hoping 
to use parsed_uri in a use case for which it is inappropriate. Those 
values are populated *if present* in the supplied request URI, but the 
only *minimal* requirement would be a / for the path.


If you want to know what resource the client *really* requested (and 
inquiring minds do), you *must not* attempt to rewrite or repopulate this.




Re: Various musings about the request URL / URI / whatever

2005-11-29 Thread Jim Gallacher

Nicolas Lehuen wrote:

Hi,

Is it me or is it quite tiresome to get the URL that called us, or get 
the complete URL that would call another function ?


When performing an external redirect (using mod_python.util.redirect for 
example), we MUST (as per RFC) provide a full URL, not a relative one. 
Instead of util.redirect(req,'/foo/bar.py'), we should write 
util.redirect(req,'https://whatever:8443/foo/bar.py').


The problem is, writing this is always tiresome, as it means building a 
string like this :


def current_url(req):
    req.add_common_vars()
    current_url = []

    # protocol
    if req.subprocess_env.get('HTTPS') == 'on':
        current_url.append('https')
        default_port = 443
    else:
        current_url.append('http')
        default_port = 80
    current_url.append('://')

    # host
    current_url.append(req.hostname)

    # port
    port = req.connection.local_addr[1]
    if port != default_port:
        current_url.append(':')
        current_url.append(str(port))

    # URI
    current_url.append(req.uri)

    return ''.join(current_url)


So I have two questions :

First question, is there a simpler way to do this ? Ironically, when 
using mod_rewrite, you get an environment variable named SCRIPT_URI 
which is precisely what I need (SCRIPT_URL, also added by mod_rewrite, 
is equivalent to req.uri... Don't ask me why). But relying on it isn't 
safe since mod_rewrite isn't always used.


I guess you could just assemble the parts from the req.parsed_uri tuple, 
except that apache doesn't actually fill in parsed_uri. :(


Second question, if there isn't any simpler way to do this, should we 
add it to mod_python ? Either as a function like above in 
mod_python.util, or as a member of the request object (named something 
like url to match the other member named uri, but that's just teasing).


I'm not against it, but for my purposes I think it would be more useful 
for parsed_uri to just work properly.


And third question (in pure Spanish inquisition style) : why is 
req.parsed_uri returning me a tuple full of Nones except for the uri and 
path_info part ?


It comes from apache that way. I sure don't know why though. Maybe we're 
missing some magic apache call that would fill it in?


Ah, fourth question : why are we (mod_python, mod_rewrite and the CGI 
environment variables) using the terms URI and URL to distinguish 
between a full, absolute resource path and a path relative to the 
server, whereas the definition of URLs and URIs is very vague and 
nothing close to this 
(http://www.w3.org/TR/uri-clarification/#contemporary) ? Shouldn't we 
save our souls and a lot of saliva by choosing better names ?


Strangely I was reading the cited page just last week, for perhaps the 
100th time. I keep hoping I'll find enlightenment but alas no. The danger 
of choosing new names (i.e. absolute_thingy or relative_thingy) is that 
we also add another layer of confusion. I'm not saying new names are a 
bad idea, just that we need to be very careful.


OK, OK, fifth question : we made req.filename and other members 
writable. But when those attributes are changed, as Graham noted a while 
ago, the other dependent ones aren't, leading to inconsistencies (for 
example, if you change req.filename, req.canonical_filename isn't 
changed). Should we try to solve this and provide clear definition of 
the various parts of a request for mod_python 3.3 ?


That would make sense. I'm wondering how often people make use of 
req.canonical_filename (CFN*)?  Also, just how would the CFN  be 
adjusted if the user code changes req.filename, since the user is free 
to put any string in there they want? Maybe CFN just gets changed to the 
same string. Hopefully Graham will shed some light on this, since it was 
his use case.


Regards,
Jim

* Because I can't type canonical_filename the same way twice. Stupid 
fingers.


Re: Various musings about the request URL / URI / whatever

2005-11-29 Thread Gregory (Grisha) Trubetskoy


On Tue, 29 Nov 2005, Nicolas Lehuen wrote:


def current_url(req):


[snip]



   # host
   current_url.append(req.hostname)


[snip]

This part isn't going to work reliably if you are not using virtual hosts 
and just bind to an IP number. Deciphering the URL is an impossible task - 
I used to have similar code in my applications, but lately I realized that 
it does not work reliably and it is much simpler to just treat it as a 
configuration item...
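
One way to read "treat it as a configuration item" is sketched below in plain Python. The option name SiteBaseURL and both function names are hypothetical. The options argument is assumed to be a dict-like object, for instance whatever req.get_options() returns for PythonOption directives.

```python
def site_base_url(options):
    """Look up the site's canonical base URL from handler options."""
    # 'SiteBaseURL' is a made-up option name used only for this illustration.
    base = options.get('SiteBaseURL')
    if base is None:
        raise KeyError('SiteBaseURL is not configured')
    # Drop any trailing '/' so joining with a path never doubles it.
    return base.rstrip('/')

def absolute_url(options, path):
    """Build an absolute URL from the configured base and a server-relative path."""
    return site_base_url(options) + '/' + path.lstrip('/')
```

With "PythonOption SiteBaseURL https://www.example.org" in the Apache configuration, a handler could then call absolute_url(req.get_options(), '/foo/bar.py') instead of trying to reconstruct the host name from the request.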



First question, is there a simpler way to do this ? Ironically, when using
mod_rewrite, you get an environment variable named SCRIPT_URI which is
precisely what I need (SCRIPT_URL, also added by mod_rewrite, is equivalent
to req.uri... Don't ask me why). But relying on it isn't safe since
mod_rewrite isn't always used.


well - here's how it does it.

/*
 *  create the SCRIPT_URI variable for the env
 */

/* add the canonical URI of this URL */
thisserver = ap_get_server_name(r);
port = ap_get_server_port(r);
if (ap_is_default_port(port, r)) {
    thisport = "";
}
else {
    apr_snprintf(buf, sizeof(buf), ":%u", port);
    thisport = buf;
}
thisurl = apr_table_get(r->subprocess_env, ENVVAR_SCRIPT_URL);

/* set the variable */
var = apr_pstrcat(r->pool, ap_http_method(r), "://", thisserver, thisport,
                  thisurl, NULL);
apr_table_setn(r->subprocess_env, ENVVAR_SCRIPT_URI, var);

/* if filename was not initially set,
 * we start with the requested URI
 */
if (r->filename == NULL) {
    r->filename = apr_pstrdup(r->pool, r->uri);
    rewritelog(r, 2, "init rewrite engine with requested uri %s",
               r->filename);
}
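
For readers who prefer Python, the SCRIPT_URI assembly above might be sketched roughly as follows. This is an illustration, not mod_python API: the function script_uri and the DEFAULT_PORTS table are hypothetical, and the arguments stand in for the values that ap_http_method, ap_get_server_name and ap_get_server_port supply in the C code.

```python
# Assumed default ports per scheme; stands in for ap_is_default_port().
DEFAULT_PORTS = {'http': 80, 'https': 443}

def script_uri(scheme, server_name, port, script_url):
    """Assemble scheme://host[:port]/path the way the C snippet does."""
    # Omit the port when it is the default for the scheme.
    if DEFAULT_PORTS.get(scheme) == port:
        port_part = ''
    else:
        port_part = ':%d' % port
    return '%s://%s%s%s' % (scheme, server_name, port_part, script_url)
```

The interesting inputs are exactly the ones the thread is arguing about: where server_name and port come from is left to the caller.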


Second question, if there isn't any simpler way to do this, should we add it
to mod_python ? Either as a function like above in mod_python.util, or as a
member of the request object (named something like url to match the other
member named uri, but that's just teasing).


I don't know... Since the result is going to be half-baked... I think a 
more interesting and mod_python-ish thing to do would be to expose all the 
API's used in the above code (e.g. ap_get_server_name, ap_is_default_port, 
ap_http_method) FIRST, then think about this.



And third question (in pure Spanish inquisition style) : why is
req.parsed_uri returning me a tuple full of Nones except for the uri and
path_info part ?


This is an httpd question most likely...


Ah, fourth question : why are we (mod_python, mod_rewrite and the CGI
environment variables) using the terms URI and URL to distinguish
between a full, absolute resource path and a path relative to the server,
whereas the definition of URLs and URIs is very vague and nothing close to
this (http://www.w3.org/TR/uri-clarification/#contemporary) ? Shouldn't we
save our souls and a lot of saliva by choosing better names ?


No, we (mod_python) should just use the exact same name that httpd uses. 
If we come up with better names, then it's just going to make it even more 
confusing.



OK, OK, fifth question : we made req.filename and other members writable.
But when those attributes are changed, as Graham noted a while ago, the
other dependent ones aren't, leading to inconsistencies (for example, if you
change req.filename, req.canonical_filename isn't changed). Should we try to
solve this


The solution is to make req.canonical_filename writable too and document 
that if you change req.filename, you may consider changing 
canonical_filename as well and what will happen if you do not.



and provide clear definition of the various parts of a request
for mod_python 3.3 ?


Yes, that'd be good :)

Grisha


Re: Various musings about the request URL / URI / whatever

2005-11-29 Thread Jim Gallacher

Daniel J. Popowich wrote:

Jim Gallacher writes:


Nicolas Lehuen wrote:

Second question, if there isn't any simpler way to do this, should we 
add it to mod_python ? Either as a function like above in 
mod_python.util, or as a member of the request object (named something 
like url to match the other member named uri, but that's just teasing).


I'm not against it, but for my purposes I think it would be more useful 
for parsed_uri to just work properly.



Hear, hear!!  I've wanted parsed_uri to work as expected for quite
some time...I'm actually in a position where I could devote some time
to tracking this down.  If apache doesn't provide it, I think
mod_python should at least fill it in, right? 


+1


Can someone nudge me
in the right direction to start?  Haven't poked around apache source
and/or developer docs in years.


All I can say is grep is your friend. :)

I've found http://docx.webperf.org to be useful. Unfortunately you can
only drill down into the header files, not c files (unless I'm missing
something). I might even be tempted to generate my own local copy of the
apache docs using doxygen so that the c-files get included. I've been
playing with doxygen + mod_python and it's pretty cool.

Searching docx for parse_uri turns up ap_parse_uri.
http://docx.webperf.org/group__APACHE__CORE__PROTO.html#ga44

Grab the src and put grep to work. I'll dig in and help any way I can.

Jim





Re: Various musings about the request URL / URI / whatever

2005-11-29 Thread Jim Gallacher

Gregory (Grisha) Trubetskoy wrote:


On Tue, 29 Nov 2005, Jim Gallacher wrote:


Daniel J. Popowich wrote:



Hear, hear!!  I've wanted parsed_uri to work as expected for quite
some time...I'm actually in a position where I could devote some time
to tracking this down.  If apache doesn't provide it, I think
mod_python should at least fill it in, right? 



+1



I don't know what the specific issue is with parsed_uri, if this is a 
mod_python bug it should just be fixed BUT if this is an issue with 
httpd, I don't think we should cover the problem up by having mod_python 
fix it. Since we are part of the HTTP Server project, we should just 
fix it in httpd.


Either way, it should be fixed.

In case anyone is not familiar with the issue, a request for 
http://example.com/tests/mptest?view=form currently gives a tuple that 
looks something like this:


(None, None, None, None, None, None, '/tests/mptest', 'view=form', None)

which is not what we expect. This is what the mod_python docs have to say:

parsed_uri
Tuple. The URI broken down into pieces. (scheme, hostinfo, user, 
password, hostname, port, path, query, fragment). The apache module 
defines a set of URI_* constants that should be used to access elements 
of this tuple. Example:


fname = req.parsed_uri[apache.URI_PATH]

(Read-Only)

Jim




Re: Various musings about the request URL / URI / whatever

2005-11-29 Thread Daniel J. Popowich

Gregory (Grisha) Trubetskoy writes:

 On Tue, 29 Nov 2005, Nicolas Lehuen wrote:

  If I understand you correctly, req.hostname is not reliable in case where
  virtual hosting is not used. What about server.server_hostname, which seems
  to be used by the code from mod_rewrite you posted below ? Can it be used
  reliably ?

 I don't think so.

 if I do this:

 telnet some.host.com 80

 GET /index.html

 How would apache know what the hostname is?

By the Host header. I've been looking into this issue tonight and
think I have the answers (but it's really late, so I'll save the gory
details for tomorrow). In brief: typically, req.hostname is set from
the Host header and, in fact, when I telnet to apache and issue a GET
by hand, if I don't send the Host header, apache barfs with a 400, Bad
Request, response. (apache 2.0.54, debian testing)

As for the larger issue at hand: the reason req.parsed_uri is not
filled in is because browsers don't send the info in the GET, e.g.,
browsers send this:

GET /index.py?a=b&c=d HTTP/1.1

not

GET http://user:[EMAIL PROTECTED]:80/index.py?a=b&c=d#here HTTP/1.1

if they did, parsed_uri would be filled in. req.unparsed_uri is
whatever word follows GET in the HTTP protocol exchange;
req.parsed_uri is the parsing of that word.

Given the full URI spec:

SCHEME://[USER[:PASSWORD]@]HOSTNAME[:PORT]/PATH?QUERY#FRAGMENT

you can see where eight of the nine elements of the parsed_uri tuple
come from; the ninth, hostinfo, is the combination of
[USER[:PASSWORD]@]HOSTNAME[:PORT] (everything between // and /).

Unfortunately, browsers only send:

/PATH?QUERY

and that's why path and query are all we ever see in unparsed_uri
and parsed_uri.
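
The effect can be reproduced outside Apache with Python's standard library. The sketch below is only an analogy: parse_request_target is a hypothetical helper built on urllib.parse.urlsplit, not the code httpd runs, and the types it returns (for example an integer port) are not claimed to match what mod_python puts in req.parsed_uri.

```python
from urllib.parse import urlsplit

def parse_request_target(target):
    """Split a request target into the nine fields named in the parsed_uri docs."""
    parts = urlsplit(target)
    # Order: scheme, hostinfo, user, password, hostname, port, path, query, fragment.
    # Empty strings are mapped to None so absent fields look alike.
    return (
        parts.scheme or None,
        parts.netloc or None,
        parts.username,
        parts.password,
        parts.hostname,
        parts.port,
        parts.path or None,
        parts.query or None,
        parts.fragment or None,
    )
```

Feeding it '/index.py?a=b&c=d' fills only the path and query slots, while an absolute URL such as 'http://user:secret@example.com:80/index.py?a=b&c=d#here' (a made-up example) fills all nine.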



Again, lots more to share...in the morrow...



Daniel Popowich
---
http://home.comcast.net/~d.popowich/mpservlets/



Re: Various musings about the request URL / URI / whatever

2005-11-29 Thread Mike Looijmans

Daniel J. Popowich wrote:


By the Host header. I've been looking into this issue tonight and
think I have the answers (but it's really late, so I'll save the gory
details for tomorrow). In brief: typically, req.hostname is set from
the Host header and, in fact, when I telnet to apache and issue a GET
by hand, if I don't send the Host header, apache barfs with a 400, Bad
Request, response. (apache 2.0.54, debian testing)


It will only do that if you claim to be an HTTP/1.1 client. If you send 
GET / HTTP/1.0

it will not complain about the host header. Sending:
GET / HTTP/1.1
will get you a 400 response, because you MUST supply it (says RFC 2068, 
and whatever superseded that one). There is more you must do to be able 
to call yourself HTTP/1.1 by the way, such as keep-alive connections and 
chunked encoding.
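
The rule can be written down as a small check, sketched here in plain Python. It models only the Host requirement described above, not Apache's actual request parsing, and the function name is hypothetical.

```python
def must_reject_missing_host(raw_request):
    """Return True if the request claims HTTP/1.1 but carries no Host header."""
    # Only the header section matters; ignore any body after the blank line.
    head = raw_request.split('\r\n\r\n', 1)[0]
    lines = head.split('\r\n')
    request_line = lines[0].split()
    # A request line without a version (e.g. "GET /index.html") is HTTP/0.9 style.
    version = request_line[2] if len(request_line) == 3 else 'HTTP/0.9'
    has_host = any(line.lower().startswith('host:') for line in lines[1:])
    return version == 'HTTP/1.1' and not has_host
```

Under this check a bare "GET /index.html" typed into telnet is not rejected, which is consistent with the point that the server then has no Host header to take the name from.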


--
Mike Looijmans
Philips Natlab / Topic Automation